Particle Learning and Smoothing

Carlos M. Carvalho, Michael Johannes, Hedibert F. Lopes and Nicholas Polson*

First draft: December 2007. This version: February 2009.

Abstract

In this paper we develop particle learning (PL) methods for state filtering, sequential parameter learning and smoothing in a general class of nonlinear state space models. The approach extends existing particle methods by incorporating static parameters and utilizing sufficient statistics for the parameters and/or the states as particles. State smoothing with parameter uncertainty is also solved as a by-product of particle learning. In a number of applications, we show that our algorithms outperform existing particle filtering algorithms as well as MCMC.

Keywords: Particle Learning, Filtering, Smoothing, Mixture Kalman Filter, Bayesian Inference, Bayes Factor, State Space Models.

* Carvalho, Lopes and Polson are at The University of Chicago, Booth School of Business, 5807 S. Woodlawn, Chicago IL 60637; email: carlos.carvalho@chicagobooth.edu, hlopes@chicagobooth.edu and ngp@chicagobooth.edu. Johannes is at the Graduate School of Business, Columbia University, 3022 Broadway, NY, NY, 10027; email: mj335@columbia.edu.

1 Introduction

There are two general inference problems in state space models. The first is inherently sequential and involves learning about fixed parameters (θ) and latent states (x_t) using the observations up to time t, denoted by y^t = (y_1, ..., y_t). This problem is solved by computing the joint posterior distribution p(θ, x_t | y^t) sequentially over time, as new data arrive. Most of the literature focuses exclusively on state filtering conditional on parameters, computing p(x_t | θ, y^t) using particle filtering methods. Extending these algorithms to handle parameter learning is an active area of research, with important developments appearing in Liu and West (2001), Storvik (2002) and Fearnhead (2002), to cite a few. The second problem is state smoothing, computing p(x^t | y^t) while accounting for parameter uncertainty.

On the theoretical side, we provide a new particle learning (PL) approach for computing the sequence of filtering and smoothing distributions, p(θ, x_t | y^t) and p(x^t | y^t), for a wide class of models, and document its performance against common competing approaches. The central idea behind PL is the creation of a particle filter that directly samples from the joint posterior distribution of x_t and the conditional sufficient statistics for θ, denoted s_t. This is achieved by updating the particle approximation with a resample-propagate framework, as opposed to the typical propagate-resample approaches. Moreover, we explicitly treat conditional sufficient statistics as states to improve efficiency. On the empirical side, we show that the performance of particle learning can be further improved by marginalization of the states, i.e. by extending the use of sufficient statistics to states. We provide extensive simulation evidence comparing our approach to competing methods. Other concrete contributions are listed below.

Sequential Learning. For Gaussian DLMs and conditionally Gaussian DLMs (CDLM), particle filters are defined over both state and parameter sufficient statistics, such as the Kalman filter's mean and variance. Our approach extends Liu and Chen's (2000) mixture Kalman filter (MKF) method by allowing for parameter learning. By re-sampling first, it also provides a more efficient algorithm for pure filtering.
We also show that nonlinearity in the state evolution, combined with linearity in the observation equation, is straightforward to handle within the PL framework. This dramatically extends the class of models to which MKF particle methods apply. Although we are no longer able to marginalize out the state variables and use state sufficient statistics, we can still utilize parameter sufficient statistics and generate exact samples from the particle approximations. In a comprehensive simulation study, we demonstrate that PL provides a significant improvement over the current benchmarks in the literature, such as Liu and West (2001).

Smoothing. We extend Godsill, Doucet and West's (2004) smoothing results to all models considered above. In doing so, PL also extends Liu and Chen's (2000) MKF to computing the sequence of smoothed distributions, p(x^t | y^t). Posterior inference is typically solved via Markov chain Monte Carlo (MCMC) methods, as p(x^t | y^t) is a difficult-to-compute marginal of p(x^t, θ | y^t). Key references are Carlin, Polson and Stoffer (1992), Carter and Kohn (1994) and Frühwirth-Schnatter (1994). Our simulations provide strong evidence that PL dominates the standard MCMC strategies in computing time and delivers similar accuracy when computing p(x^t | y^t). Gains are most remarkable in models with nonlinearities (CGDM), where the common alternative of MCMC with single-state updates is well known to be inefficient (Papaspiliopoulos and Roberts, 2008). Another advantage of PL over MCMC in these classes of models is the direct and recursive access to approximations of predictive distributions, a key element in sequential model comparison.

The paper is outlined as follows. The rest of this section reviews the most popular sequential Monte Carlo schemes for pure filtering and parameter learning. Section 2 presents the general filtering and smoothing strategy for PL. Section 3 considers the class of conditional dynamic linear models; in this context we present, in the Appendix, theoretical results that motivate and justify the good performance of PL approximations. Section 4 discusses the implementation of our methods in nonlinear specifications. Sections 5 and 6 compare the performance of PL to MCMC and to alternative particle methods.

1.1 Particle filtering review

First, we review the three main types of algorithm available for pure filtering. Traditionally, researchers have focused on propagate-resample (or resample-move) algorithms. Our approach will instead exploit resample-propagate algorithms and marginalization (Rao-Blackwellization) over states and parameters. Moreover, we show how using conditional sufficient statistics replaces the static sequential parameter learning problem with a state filtering problem. Traditional particle filters use sequential importance sampling (SIS), which we now describe.

Sequential importance sampling with resampling (SISR). This algorithm uses the following representation of the filtering density:

$$p(x_{t+1}\mid y^{t+1}) = \frac{p(y_{t+1}\mid x_{t+1})\,p(x_{t+1}\mid y^t)}{p(y_{t+1}\mid y^t)} \quad\text{and}\quad p(x_{t+1}\mid y^t) = \int p(x_{t+1}\mid x_t)\,p(x_t\mid y^t)\,dx_t,$$

which leads to a particle approximation of the form

$$p^N(x_{t+1}\mid y^{t+1}) \propto \sum_{i=1}^N w_i(x_{t+1}, y_{t+1})\, p\big(x_{t+1}\mid x_t^{(i)}\big)$$

and to the following blind importance sampling algorithm (a minimal code sketch follows below):

1. Draw $x_{t+1}^{(i)} \sim p\big(x_{t+1}\mid x_t^{(i)}\big)$;
2. Compute weights $w_i(x_{t+1}, y_{t+1}) \propto p\big(y_{t+1}\mid x_{t+1}^{(i)}\big)$ and resample $\{x_{t+1}^{(i)}\}_{i=1}^N$.
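To fix ideas, here is a minimal sketch of this blind propagate-weight-resample recursion for a simple local-level model (y_t = x_t + σε_t, x_{t+1} = x_t + τω_t). The model, parameter values and function names are our own illustration, not code from the paper.

```python
import numpy as np

def sisr_blind(y, N=1000, sigma=1.0, tau=0.5, rng=None):
    """Blind SISR: propagate x_{t+1} ~ p(x_{t+1}|x_t), then weight
    by p(y_{t+1}|x_{t+1}) and resample."""
    rng = np.random.default_rng(rng)
    x = rng.normal(0.0, 1.0, N)                  # particles from the prior p(x_0)
    means = []
    for yt in y:
        x = x + tau * rng.normal(size=N)         # 1. blind propagation
        logw = -0.5 * ((yt - x) / sigma) ** 2    # 2. weight by the likelihood
        w = np.exp(logw - logw.max())
        w /= w.sum()
        x = rng.choice(x, size=N, p=w)           # multinomial resampling
        means.append(x.mean())                   # estimate of E(x_t | y^t)
    return np.array(means)

# usage: filter simulated data
rng = np.random.default_rng(1)
xs = np.cumsum(0.5 * rng.normal(size=100))
ys = xs + rng.normal(size=100)
print(sisr_blind(ys, sigma=1.0, tau=0.5, rng=2)[-5:])
```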
SISR with the optimal importance function. Here, the representation of the filtering density is the same, but we use a more efficient propagation distribution, which in turn changes the resampling weights:

1. Draw $x_{t+1}^{(i)} \sim p\big(x_{t+1}\mid x_t^{(i)}, y_{t+1}\big)$;
2. Compute weights

$$w_i(x_t, y_{t+1}) \propto \frac{p(y_{t+1}\mid x_{t+1})\,p(x_{t+1}\mid x_t)}{p(x_{t+1}\mid x_t, y_{t+1})} = p(y_{t+1}\mid x_t)$$

and resample $(x_{t+1}, x_t)^{(i)}$.

Using y_{t+1} in the first step allows the algorithm to "look forward" when propagating particles, leading to efficiency gains.

Auxiliary particle filter (APF). The APF uses a different representation of the filtering distribution,

$$p(x_{t+1}\mid y^{t+1}) = \int \frac{p(y_{t+1}\mid x_t)}{p(y_{t+1}\mid y^t)}\, p(x_{t+1}\mid x_t, y_{t+1})\, p(x_t\mid y^t)\, dx_t,$$

which leads to a different mixture approximation:

$$p^N(x_{t+1}\mid y^t) = \sum_{i=1}^N w_i(x_t, y_{t+1})\, p\big(x_{t+1}\mid x_t^{(i)}, y_{t+1}\big).$$

1. Compute weights $w_i(x_t, y_{t+1}) \propto p\big(y_{t+1}\mid x_t^{(i)}\big)$ and resample $\{x_t^{(i)}\}_{i=1}^N$ to get $x_t^{k(i)}$;
2. Propagate $x_{t+1}^{(i)} \sim p\big(x_{t+1}\mid x_t^{k(i)}, y_{t+1}\big)$.

This requires that

$$p(y_{t+1}\mid x_t) = \int p(y_{t+1}\mid x_{t+1})\, p(x_{t+1}\mid x_t)\, dx_{t+1}$$

is known or can be easily computed, and that p(x_{t+1}|x_t, y_{t+1}) is known. Notice that var(w(x_t, y_{t+1})) < var(w(x_{t+1}, y_{t+1})), which implies that the weights are more evenly distributed. In Section 2.3 we show how to further marginalize, using state sufficient statistics, to improve efficiency.

1.2 Sequential parameter learning review

Storvik's algorithm. Storvik assumes that the posterior distribution of θ given x^t and y^t depends on a low-dimensional set of sufficient statistics that can be easily updated recursively (although, as noted by Liu and Chen (2002), this is not necessary). The recursion for the sufficient statistics is given by s_{t+1} = S(s_t, x_{t+1}, y_{t+1}). This leads to the following SISR algorithm for parameter learning:

Storvik's algorithm
1. Sample $\theta \sim p(\theta\mid x^{t-1}, y^t)$;
2. Sample $x_t \sim p(x_t\mid x_{t-1}, \theta, y_t)$;
3. Resample $x^t$ with weights

$$w_t \propto \frac{p(\theta\mid s_{t-1})\, p(x_t\mid x_{t-1}, \theta)\, p(y_t\mid x_t, \theta)}{p(\theta\mid x^{t-1}, y^t)\, p(x_t\mid x_{t-1}, \theta, y_t)};$$

4. Update the sufficient statistics s_t.

Here we have used the optimal importance functions for the parameters and states. Storvik's algorithm filters the whole vector x^t and hence will decay very quickly.

Liu and West's algorithm. To solve the sequential state and parameter learning problem, Gordon, Salmond and Smith (1993) suggest incorporating artificial evolution noise for θ so that it can be treated as a state variable. The main drawback of their idea is simple: parameters are not states, and the introduction of this noise can degrade results. Their approach imposes a loss of information over time, as the artificial uncertainties added to the parameters eventually result in a very diffuse posterior density for θ. To overcome this, Liu and West (2001) suggest a kernel smoothing approximation to p(θ|y^t), which relies on the mixture modeling ideas of West (1993) and approximates a (possibly transformed) posterior density p(θ|y^t) by a mixture of multivariate normals. Specifically, let

$$\big\{x_t^{(l)}, \theta_t^{(l)}, w_t^{(l)}\big\}_{l=1}^M \sim \hat p(x_t, \theta\mid y^t);$$

then the parameter posterior approximation is

$$\hat p(\theta\mid y^t) = \sum_{j=1}^M w_t^{(j)}\, \mathcal N\big(\theta;\; a\theta_t^{(j)} + (1-a)\bar\theta_t,\; h^2 V_t\big)$$

with moments

$$\bar\theta_t = \sum_{j=1}^M w_t^{(j)}\theta_t^{(j)} \quad\text{and}\quad V_t = \sum_{j=1}^M w_t^{(j)}\big(\theta_t^{(j)} - \bar\theta_t\big)\big(\theta_t^{(j)} - \bar\theta_t\big)'.$$

The constants a and h measure, respectively, the extent of the shrinkage and the degree of overdispersion of the mixture. As in Liu and West (2001), the choice of a and h depends on a discount factor δ in (0, 1], typically around 0.95-0.99, so that $h^2 = 1 - \big((3\delta - 1)/2\delta\big)^2$ and $a = \sqrt{1 - h^2}$. This is a way of specifying the smoothing parameter h as a function of the amount of information preserved in each step of the filter (a small numeric sketch follows below).
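As a quick illustration of these constants, the snippet below computes a and h from the discount factor δ and forms the shrunk kernel locations for a scalar θ. This is a sketch under our own naming, not code from the paper.

```python
import numpy as np

def lw_shrinkage(theta, w, delta=0.95):
    """Kernel locations and bandwidth for the Liu-West mixture:
    h^2 = 1 - ((3*delta - 1)/(2*delta))^2 and a = sqrt(1 - h^2)."""
    h2 = 1.0 - ((3.0 * delta - 1.0) / (2.0 * delta)) ** 2
    a = np.sqrt(1.0 - h2)
    theta_bar = np.sum(w * theta)                # weighted posterior mean
    V = np.sum(w * (theta - theta_bar) ** 2)     # weighted posterior variance
    m = a * theta + (1.0 - a) * theta_bar        # shrunk kernel means
    return m, h2 * V                             # N(m_j, h^2 V_t) components

theta = np.random.default_rng(0).normal(1.0, 0.2, 5000)
w = np.full(5000, 1.0 / 5000)
m, var = lw_shrinkage(theta, w, delta=0.98)
print(m.mean(), var)   # the mixture preserves the mean (and roughly the variance)
```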
Analogously to Pitt and Shephard's (1999) algorithm, the sequential scheme proceeds as follows:

Liu and West's algorithm
1. Sample an index $k(l)$ with probability proportional to $w_t^{(j)}\, p\big(y_{t+1}\mid \mu_{t+1}^{(j)}, m_t^{(j)}\big)$, where $\mu_{t+1}^{(j)} = E\big(x_{t+1}\mid x_t^{(j)}, \theta_t^{(j)}\big)$ and $m_t^{(j)} = a\theta_t^{(j)} + (1-a)\bar\theta_t$;
2. Sample $\theta_{t+1}^{(l)} \sim \mathcal N\big(m_t^{k(l)}, h^2 V_t\big)$;
3. Sample $x_{t+1}^{(l)} \sim p\big(x_{t+1}\mid x_t^{k(l)}, \theta_{t+1}^{(l)}\big)$, which leads to weights

$$w_{t+1}^{(l)} \propto \frac{p\big(y_{t+1}\mid x_{t+1}^{(l)}, \theta_{t+1}^{(l)}\big)}{p\big(y_{t+1}\mid \mu_{t+1}^{k(l)}, m_t^{k(l)}\big)}$$

and samples from the posterior, $\big\{x_{t+1}^{(l)}, \theta_{t+1}^{(l)}, w_{t+1}^{(l)}\big\}_{l=1}^M \sim \hat p(x_{t+1}, \theta\mid y^{t+1})$.

2 Particle Learning and Smoothing

Consider a general state space model defined by the observation and evolution equations

$$y_{t+1} \sim p(y_{t+1}\mid x_{t+1}, \theta), \qquad x_{t+1} \sim p(x_{t+1}\mid x_t, \theta),$$

with initial state distribution p(x_0|θ) and prior p(θ). The sequential parameter learning and state filtering problem is characterized by the joint posterior distribution p(x_t, θ | y^t), whereas smoothing is characterized by p(x^T, θ | y^T), with y^t = (y_1, ..., y_t), x^t = (x_1, ..., x_t) and T denoting the last observation time.

Sequential sampling from the above posteriors is difficult due to their dimensionality and the complicated functional relationships between the parameters, states, and data. MCMC methods have been developed to solve the smoothing problem but are too slow for the sequential problem, which requires on-line simulation based on a recursive or iterative structure. The classic example of recursive estimation is the Kalman filter (Kalman, 1960) for linear Gaussian models with known parameters. In this paper, we use a particle approach to sequentially filter p(x_t, θ | y^t) and obtain p(x^T | y^T) by backwards propagation, effectively solving the filtering, learning and smoothing problems.

Particle methods use a discrete representation of p(x_t, θ | y^t) via

$$p^N(x_t, \theta\mid y^t) = \frac{1}{N}\sum_{i=1}^N \delta_{(x_t, \theta)^{(i)}},$$

where N is the number of particles, (x_t, θ)^{(i)} denotes a particle vector and δ(·) is the Dirac delta function, representing the distribution degenerate at the particles. Given this particle approximation, the key problem is how to jointly propagate the parameter and state particles once y_{t+1} becomes available. This step is complicated because the state propagation depends on the parameters, and vice versa. To circumvent the co-dependence in a joint draw, it is common to use importance sampling. This, however, can lead to particle degeneracies, as the importance densities may not closely match the target densities. Our approach relies on three main insights to deal with these problems. (i) Conditional sufficient statistics are used to represent the posterior of θ, which allows us to think of the sufficient statistics as additional states that are sequentially updated; whenever possible, sufficient statistics for the latent states are also introduced, increasing the efficiency of our algorithm by reducing the variance of the sampling weights in what can be called a Rao-Blackwellized filter. (ii) We use a resample-propagate framework to provide exact samples from our particle approximation when moving from p^N(x_t, θ | y^t) to p^N(x_{t+1}, θ | y^{t+1}); this avoids sampling importance resampling (SIR) (Gordon, Salmond and Smith, 1993) and the associated "decay" in the particle approximation. Finally, (iii) we extend the backward propagation smoothing algorithm of Godsill, Doucet and West (2004) to incorporate uncertainty about θ.
2.1 Parameter Learning

For all models considered, we assume that the posterior for the parameter vector θ admits a conditional sufficient statistics structure given the states and data (x^t, y^t), so that

$$p(\theta\mid x^t, y^t) = p(\theta\mid s_t),$$

where s_t is a recursively defined sufficient statistic,

$$s_t = S(s_{t-1}, x_t, y_t). \qquad (1)$$

The use of sufficient statistics is foreshadowed in Gilks and Berzuini (2001), Storvik (2002) and Fearnhead (2002). Previous approaches to parameter learning include the addition of θ to the particle set and the introduction of evolutionary noise, turning θ into an artificial state (Gordon et al., 1993). The first alternative breaks down quickly, as there is no mechanism for replenishing the particles of θ, whereas the second overestimates the uncertainty about θ by working with a different model. Liu and West (2001) propose a learning scheme adapted to the auxiliary particle filter that approximates p(θ|y^t) by a kernel mixture of normals. While this is a widely used and broadly applicable strategy, it still degenerates fairly quickly, because the propagation mechanism for x_{t+1} and θ does not take into account the current information y_{t+1}. This is highlighted in our simulation studies of Section 6. Another alternative is to sample over (θ, x^t) with a Metropolis kernel, as in Gilks and Berzuini (2001) and Polson, Stroud and Müller (2008). The complexity of this strategy grows with t and suffers from the curse of dimensionality, as discussed in Bengtsson, Bickel and Li (2008).

Our approach is of fixed dimension, as it targets the filtering distribution of (s_t, x_t). The introduction of sufficient statistics provides the mechanism to break particle decay by sequentially targeting (s_t, x_t), where s_t is automatically replenished as a function of the current samples of x_t and the observation y_t. One way to interpret this strategy is to think of s_t as an additional latent state in the model, with a deterministic evolution determined by (1). This can be seen from the following factorization of the joint conditional posterior:

$$p(x_t, s_t, \theta\mid y^t) = p(\theta\mid s_t)\, p(x_t, s_t\mid y^t).$$

Our algorithm will be based on developing a particle approximation p^N(x_t, s_t | y^t) to the joint distribution of states and sufficient statistics. Parameter learning, p(θ|y^t), is then performed "offline" by simulating particles from p(θ|s_t). (A small sketch of such a recursive sufficient-statistic update follows below.)
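As a concrete instance of the recursion s_t = S(s_{t-1}, x_t, y_t), the sketch below tracks the conjugate sufficient statistics (ν_t, d_t) for an inverse-gamma posterior on the evolution variance τ² in a local-level model; here S happens to depend only on the states. This is our own minimal illustration and the hyperparameter values are arbitrary.

```python
import numpy as np

def update_suff_stat(s, x_prev, x_new):
    """s_t = S(s_{t-1}, x_t, y_t) for tau^2 ~ IG(nu/2, d/2): one
    observation adds one degree of freedom and the squared state
    increment to the scale statistic."""
    nu, d = s
    return nu + 1.0, d + (x_new - x_prev) ** 2

def sample_tau2(s, rng):
    """Draw tau^2 ~ IG(nu/2, d/2) from the conditional posterior p(theta|s_t)."""
    nu, d = s
    return 1.0 / rng.gamma(shape=nu / 2.0, scale=2.0 / d)

rng = np.random.default_rng(0)
s = (10.0, 0.9)                               # prior hyperparameters (nu_0, d_0)
x = np.cumsum(0.3 * rng.normal(size=200))     # a state path with tau = 0.3
for t in range(1, len(x)):
    s = update_suff_stat(s, x[t - 1], x[t])
print(sample_tau2(s, rng))                    # a draw near tau^2 = 0.09
```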
2.2 Resample-Propagate

Let z_t = (x_t, s_t, θ) and assume that at time t, i.e., after observing y^t, we have a particle approximation p^N(z_t | y^t) given by $\{z_t^{(i)}\}_{i=1}^N$. Once y_{t+1} is observed, our method updates this approximation using the following resample-propagate rule:

$$p(z_t\mid y^{t+1}) \propto p(y_{t+1}\mid z_t)\, p(z_t\mid y^t) \qquad (2)$$

$$p(z_{t+1}\mid y^{t+1}) = \int p(s_{t+1}\mid x_{t+1}, s_t, y_{t+1})\, p(x_{t+1}\mid z_t, y_{t+1})\, p(z_t\mid y^{t+1})\, dx_t\, ds_t. \qquad (3)$$

From (2), we see that an updated approximation p^N(z_t | y^{t+1}) can be obtained by resampling the particles $\{z_t^{(i)}\}_{i=1}^N$ with weights proportional to the predictive, leading to a particle approximation

$$p^N(z_t\mid y^{t+1}) = \sum_{i=1}^N w\big(z_t^{(i)}\big)\, \delta_{z_t^{(i)}}$$

with weights given by

$$w\big(z_t^{(i)}\big) = \frac{p\big(y_{t+1}\mid z_t^{(i)}\big)}{\sum_{i=1}^N p\big(y_{t+1}\mid z_t^{(i)}\big)}.$$

This updated approximation is used in (3) to generate propagated samples from the posterior p(x_{t+1}|z_t, y_{t+1}), which are then used to update s_{t+1} deterministically via the recursive map s_{t+1} = S(s_t, x_{t+1}, y_{t+1}). Notice that (3) is a slight abuse of notation, as p(s_{t+1}|x_{t+1}, s_t, y_{t+1}) is either 0 or 1. However, since s_t and x_{t+1} are random variables, the conditional sufficient statistics s_{t+1} are also random and are replenished, essentially as a state, in the filtering step. This is the key insight for handling the learning of θ. The particles for s_{t+1} are sequentially updated and replenished as a function of x_{t+1}, and updated samples from p(θ|s_{t+1}) can be obtained at the end of the filtering step.

Our approach departs from the standard particle filtering literature by inverting the order in which states are propagated and updated. By resampling first, we reduce the compounding of approximation errors, as the states are propagated after being "informed" by y_{t+1}. This is discussed in detail in the Appendix. In addition, borrowing the terminology of Pitt and Shephard (1999), PL is a fully adapted filter that starts from a particle approximation p^N(z_t | y^t) and provides direct (or exact) draws from the particle approximation p^N(z_{t+1} | y^{t+1}). This reduces the accumulation of Monte Carlo error, as there is no need to re-weight or compute importance sampling weights at the end of the filter. To clarify this notion, we can re-write the problem of updating the particles $\{z_t^{(i)}\}_{i=1}^N$ to $\{z_{t+1}^{(i)}\}_{i=1}^N$ as the problem of obtaining samples from the target p(x_{t+1}, z_t | y^{t+1}) based on draws from the proposal p(z_t | y^{t+1}) p(x_{t+1}|z_t, y^{t+1}), yielding importance weights

$$w_{t+1} \propto \frac{p(x_{t+1}, z_t\mid y^{t+1})}{p(z_t\mid y^{t+1})\, p(x_{t+1}\mid z_t, y^{t+1})} = 1, \qquad (4)$$

and therefore exact draws. Sampling from the proposal is done in two steps: first, draws z_t ∼ p(z_t | y^{t+1}) are obtained by resampling the particles $\{z_t^{(i)}\}_{i=1}^N$ with weights proportional to p(y_{t+1}|z_t); we can then sample x_{t+1} ∼ p(x_{t+1}|z_t, y^{t+1}). Finally, updated samples for s_{t+1} are obtained as a function of the samples of x_{t+1}, with weights 1/N, which prevents particle degeneracies in the estimation of θ. This is a feature of the resample-propagate mechanism of PL: any propagate-resample strategy will lead to decay in the particles of x_{t+1}, with significant negative effects on p^N(θ|s_{t+1}). This strategy is only possible when both p(y_{t+1}|z_t) and p(x_{t+1}|z_t, y^{t+1}) are analytically tractable, which is the case in the classes of models considered here. Notice, however, that the above interpretation provides clear intuition for the construction of effective filters, as the goal is to come up with proposals that closely match the densities in the denominator of (4).

Particle learning
1. Resample particles z_t = (x_t, s_t, θ) with weights w_t ∝ p(y_{t+1}|z_t);
2. Propagate new states x_{t+1} from p(x_{t+1}|z_t, y_{t+1});
3. Update state and parameter sufficient statistics;
4. Sample θ from p(θ|s_{t+1}).
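The algorithm box translates directly into code. Below is a hedged sketch of one PL update for the local-level model used earlier (σ² known, τ² unknown with the inverse-gamma conjugate structure of the previous sketch); all names and hyperparameters are our own illustration.

```python
import numpy as np

def pl_step(x, nu, d, tau2, y_new, sigma2=1.0, rng=None):
    """One PL update: resample z_t by p(y_{t+1}|z_t), propagate
    x_{t+1} ~ p(x_{t+1}|z_t, y_{t+1}), update s_{t+1}, redraw theta."""
    rng = rng or np.random.default_rng()
    N = len(x)
    # 1. resample with predictive weights p(y_{t+1}|x_t, tau^2) = N(x_t, sigma^2 + tau^2)
    logw = -0.5 * (y_new - x) ** 2 / (sigma2 + tau2) - 0.5 * np.log(sigma2 + tau2)
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(N, size=N, p=w)
    x, nu, d, tau2 = x[k], nu[k], d[k], tau2[k]
    # 2. propagate exactly from p(x_{t+1}|x_t, y_{t+1}, tau^2)
    V = 1.0 / (1.0 / tau2 + 1.0 / sigma2)
    x_new = rng.normal(V * (x / tau2 + y_new / sigma2), np.sqrt(V))
    # 3. update sufficient statistics s_{t+1} = S(s_t, x_{t+1}, y_{t+1})
    nu, d = nu + 1.0, d + (x_new - x) ** 2
    # 4. replenish theta ~ p(theta|s_{t+1})
    tau2 = 1.0 / rng.gamma(nu / 2.0, 2.0 / d)
    return x_new, nu, d, tau2

# usage
rng = np.random.default_rng(3)
N = 2000
x = rng.normal(0, 1, N); nu = np.full(N, 10.0); d = np.full(N, 0.9)
tau2 = 1.0 / rng.gamma(nu / 2, 2.0 / d)
for y in np.cumsum(0.3 * rng.normal(size=100)) + rng.normal(size=100):
    x, nu, d, tau2 = pl_step(x, nu, d, tau2, y, rng=rng)
print(np.mean(tau2))
```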
2.3 State marginalization

A more efficient approach in dynamic linear models is to marginalize the states and track only conditional state sufficient statistics. Here we use the fact that

$$p(x_t\mid y^t) = \int p(x_t\mid s_t^x)\, p(s_t^x\mid y^t)\, ds_t^x = E\big[p(x_t\mid s_t^x)\mid y^t\big].$$

Thus, we are interested in the distribution p(s_t^x | y^t). The filtering recursions are given by

$$p(s_{t+1}^x\mid y^{t+1}) = \frac{1}{p(y_{t+1}\mid y^t)} \int p(s_{t+1}^x\mid s_t^x, x_{t+1}, y_{t+1})\, p(s_t^x, x_{t+1}\mid y^{t+1})\, ds_t^x\, dx_{t+1}$$

$$= \frac{1}{p(y_{t+1}\mid y^t)} \int p(s_{t+1}^x\mid s_t^x, x_{t+1}, y_{t+1})\, p(y_{t+1}\mid s_t^x)\, p(x_{t+1}\mid s_t^x, y_{t+1})\, p(s_t^x\mid y^t)\, ds_t^x\, dx_{t+1};$$

notice the double integral (extra marginalization): instead of marginalizing only x_t, we marginalize over s_t^x and x_{t+1}. Note that we need to know

$$p(x_{t+1}\mid s_t^x, y_{t+1}) = \int p(x_{t+1}\mid x_t, y_{t+1})\, p(x_t\mid s_t^x)\, dx_t.$$

The algorithm is then:

1. Resample $\{s_t^{x(i)}\}_{i=1}^N$ with weights $w(s_t^x, y_{t+1}) \propto p(y_{t+1}\mid s_t^x)$;
2. Propagate $x_{t+1}^{(i)} \sim p(x_{t+1}\mid s_t^x, y_{t+1})$.

The weights in the first stage are flatter for p(y_{t+1}|s_t^x) than for p(y_{t+1}|x_t). Because of this Rao-Blackwellization, the marginalization adds to the efficiency of the algorithm.

2.4 Smoothing

After one sequential pass through the data, our particle approximation yields samples from p^N(x_t, s_t | y^t) for all t ≤ T. In many situations, however, we are required to obtain the full smoothing distribution p(x^T | y^T), which is typically carried out by an MCMC scheme. We now show that our filtering strategy provides a direct backward sequential pass to sample from the target smoothing distribution. To compute the marginal smoothing distribution, we start by writing the joint posterior of (x^T, θ) as

$$p(x^T, \theta\mid y^T) = p(x_T, \theta\mid y^T) \prod_{t=1}^{T-1} p(x_t\mid x_{t+1:T}, \theta, y^T) = p(x_T, \theta\mid y^T) \prod_{t=1}^{T-1} p(x_t\mid x_{t+1}, \theta, y^t).$$

By Bayes' rule and conditional independence, we have

$$p(x_t\mid x_{t+1}, \theta, y^t) \propto p(x_{t+1}\mid x_t, \theta)\, p(x_t\mid \theta, y^t). \qquad (5)$$

From (5) we can now derive a recursive backward sampling algorithm to jointly sample from p(x^T, θ | y^T) by sequentially sampling from the filtered particles with weights proportional to p(x_{t+1}|x_t, θ). In detail: at time T, randomly choose $(\tilde x_T, \tilde s_T)$ from the particle approximation p^N(x_T, s_T | y^T) and sample $\tilde\theta \sim p(\theta\mid \tilde s_T)$. Then, for t = T − 1, ..., 1, choose $\tilde x_t = x_t^{(i)}$ from the filtered particles $\{x_t^{(i)}\}_{i=1}^N$ with weights

$$w_{t\mid t+1}^{(i)} \propto p\big(\tilde x_{t+1}\mid x_t^{(i)}, \tilde\theta\big).$$

This algorithm is an extension of Godsill, Doucet and West (2004) to state space models where the fixed parameters are unknown.

Particle smoothing
1. Forward filtering via particle learning, yielding (x_T, θ)^{(i)};
2. Backwards smoothing with weights $w_{t\mid t+1}^{(i)} \propto p\big(\tilde x_{t+1}\mid x_t^{(i)}, \tilde\theta\big)$.
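A minimal sketch of the backward pass follows, again for the local-level model, where $p(\tilde x_{t+1}\mid x_t, \theta)$ is $\mathcal N(\tilde x_{t+1}; x_t, \tau^2)$; `xs` holds the stored filtered particle sets from the forward PL pass, and θ̃ is reduced to a single τ² draw. The names and model choice are our own illustration.

```python
import numpy as np

def backward_sample(xs, tau2, rng):
    """One smoothed trajectory: pick x_T uniformly from the last filtered
    set, then for t = T-1,...,0 resample the filtered particles with
    weights w ∝ p(x_{t+1}|x_t, theta) = N(x_{t+1}; x_t, tau^2)."""
    T = len(xs)
    path = np.empty(T)
    path[-1] = rng.choice(xs[-1])
    for t in range(T - 2, -1, -1):
        logw = -0.5 * (path[t + 1] - xs[t]) ** 2 / tau2
        w = np.exp(logw - logw.max()); w /= w.sum()
        path[t] = rng.choice(xs[t], p=w)
    return path

# usage: xs = [array of N filtered particles at each t]; repeated calls
# (each with a fresh tau2 draw from p(theta|s_T)) build up p(x^T|y^T).
```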
2.5 Discussion

Another way to see why the algorithm performs well is as follows. Consider the pure filtering problem (ignoring parameters). Traditionally, filtering is done in two steps, prediction and updating (see Kalman, 1960):

$$p(x_{t+1}\mid y^t) = \int p(x_{t+1}\mid x_t)\, p(x_t\mid y^t)\, dx_t \qquad (6)$$

$$p(x_{t+1}\mid y^{t+1}) \propto p(y_{t+1}\mid x_{t+1})\, p(x_{t+1}\mid y^t). \qquad (7)$$

This has led to numerous particle filtering algorithms that are first based on propagation (the prediction step) and then on resampling (the updating step). These algorithms have well-known degeneracy problems, most clearly discussed in Pitt and Shephard (1999). Our strategy uses Bayes' rule and reverses the logic:

$$p(x_t\mid y^{t+1}) \propto p(y_{t+1}\mid x_t)\, p(x_t\mid y^t) \qquad (8)$$

$$p(x_{t+1}\mid y^{t+1}) = \int p(x_{t+1}\mid x_t, y_{t+1})\, p(x_t\mid y^{t+1})\, dx_t. \qquad (9)$$

Therefore, our algorithm will first resample (smooth) and then propagate. Notice that this approach first solves a smoothing problem, resampling to characterize p(x_t | y^{t+1}), and then propagates these resampled states. Since the information in y_{t+1} is used in both steps, the algorithm will be more efficient. Mathematically, the weights used for resampling are

$$p(y_{t+1}\mid x_t) = \int p(y_{t+1}\mid x_{t+1})\, p(x_{t+1}\mid x_t)\, dx_{t+1}. \qquad (10)$$

Key to the algorithm is the marginalization of x_{t+1} in the resampling step. This creates weights that are more uniformly distributed (when compared to the original approach of Storvik, 2002), since

$$\mathrm{var}\big(p(y_{t+1}\mid x_t)\big) < \mathrm{var}\big(p(y_{t+1}\mid x_{t+1})\big).$$

The marginalization step in (10) is possible for a vast class of state space models, including the entire class of conditionally dynamic linear models (see Liu and Chen, 2000; Fearnhead and Clifford, 2003; Scott, 2002) and many nonlinear models.

Remember that the introduction of conditional sufficient statistics allows us to think of the learning of θ as a filtering problem; the above discussion therefore carries over to our general approach where θ is unknown. Throughout, we note that efficiency improvements can be made by marginalizing out as many variables as possible from the predictive distribution in the resampling step. Most efficient is resampling with the predictive p(y_{t+1}|s_t^x, s_t), that is, given just the conditional sufficient statistics for the parameters (s_t) and states (s_t^x) (see Section 3). This point can be cast in terms of the effective sample size (see Kong, Liu and Wong, 1994) and its relationship to the resampling weights in the pure filtering context, conditionally on θ. The weights for standard approaches are based on blind propagation and are given by w(x_{t+1}) = p(y_{t+1}|x_{t+1}, θ). Our algorithm reverses this logic, first resampling and then propagating, with weights w(x_t) = p(y_{t+1}|x_t, θ). The most efficient approach is to marginalize, whenever possible, over the state vector and condition on s_t^x, leading to weights w(s_t^x) (this is the case for the models presented in Section 3). Our efficiency result is based on an ordering of the effective sample sizes (ESS), i.e. of the precision of the weights, which is given by

$$\mathrm{var}\big(w(s_t^x)\big) \le \mathrm{var}\big(w(x_t)\big) \le \mathrm{var}\big(w(x_{t+1})\big).$$

This is based on the identity

$$w(s_t^x) = p(y_{t+1}\mid s_t^x) = E_{x_t\mid s_t^x}\big(p(y_{t+1}\mid x_t)\big) = E\big(w(x_t)\mid s_t^x\big),$$

which, using the conditional variance identity, leads to $\mathrm{var}(w(x_t)) \ge \mathrm{var}\big(E(w(x_t)\mid s_t^x)\big)$. Similarly, the marginal satisfies $p(y_{t+1}\mid x_t) = E_{x_{t+1}\mid x_t}\big(w(x_{t+1})\big)$. In practice, the advantage is twofold: our strategy has better effective sample size, and the particles will never decay to a point, as the propagation step is performed last. (The small simulation below illustrates this ordering.)
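The ordering is easy to verify by simulation. The snippet below compares the weight variance and ESS of blind weights w(x_{t+1}) against marginalized weights w(x_t) for one step of a standard local-level model with σ = τ = 1; a toy check under our own settings, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, y_new = 100_000, 1.5
x_t = rng.normal(0.0, 1.0, N)            # draws approximating p(x_t | y^t)
x_t1 = x_t + rng.normal(size=N)          # blind draws from p(x_{t+1} | x_t)

def norm_pdf(y, m, v):
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

w_next = norm_pdf(y_new, x_t1, 1.0)      # w(x_{t+1}) = p(y_{t+1}|x_{t+1})
w_curr = norm_pdf(y_new, x_t, 2.0)       # w(x_t) = p(y_{t+1}|x_t), x_{t+1} integrated out
ess = lambda w: (w.sum() ** 2) / (w ** 2).sum()
print(f"weight variance: {w_next.var():.4f} > {w_curr.var():.4f}")
print(f"ESS: {ess(w_next):.0f} < {ess(w_curr):.0f}")
```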
3 Conditional Dynamic Linear Models

We now explicitly derive our PL algorithm for a class of conditional dynamic linear models (CDLM), an extension of the models considered in West and Harrison (1997). This is a vast class that embeds many commonly used dynamic models. MCMC via forward-filtering, backwards-sampling (FFBS) and mixture Kalman filtering (MKF) (Liu and Chen, 2000) are the current methods of choice for the estimation of these models. As an approach to filtering, PL has a number of advantages over MKF. First, our algorithm is more efficient, as it is a perfectly adapted filter. Second, we extend MKF by including learning about fixed parameters and smoothing for states. Compared to MCMC, smoothing via PL can be faster for the same level of accuracy, as discussed in Section 5.

The CDLM defined by the observation and evolution equations takes the form of a linear system conditional on an auxiliary state λ_{t+1}:

$$y_{t+1} = F_{\lambda_{t+1}} x_{t+1} + \epsilon_{t+1}, \qquad \epsilon_{t+1} \sim \mathcal N\big(0, V_{\lambda_{t+1}}\big)$$

$$x_{t+1} = G_{\lambda_{t+1}} x_t + \omega_{t+1}, \qquad \omega_{t+1} \sim \mathcal N\big(0, W_{\lambda_{t+1}}\big).$$

The marginal distributions of the observation error and the state shock can be any combination of normal, scale mixture of normals, or discrete mixture of normals, depending on the specification of the distribution of the auxiliary state variable p(λ_{t+1}|θ), so that

$$p(\epsilon_{t+1}\mid \theta) = \int \mathcal N\big(0, V_{\lambda_{t+1}}\big)\, p(\lambda_{t+1}\mid \theta)\, d\lambda_{t+1}.$$

Extensions to hidden Markov specifications, where λ_{t+1} evolves according to p(λ_{t+1}|λ_t, θ), are straightforward and are discussed in the example of Section 3.2.

3.1 Particle Learning in CDLMs

In CDLMs, the state filtering and parameter learning problem is equivalent to a filtering problem for the joint distribution of the respective sufficient statistics. This is a direct result of the factorization of the full joint as

$$p\big(x_{t+1}, \theta, \lambda_{t+1}, s_{t+1}, s_{t+1}^x\mid y^{t+1}\big) = p(\theta\mid s_{t+1})\, p\big(x_{t+1}\mid s_{t+1}^x, \lambda_{t+1}\big)\, p\big(\lambda_{t+1}, s_{t+1}, s_{t+1}^x\mid y^{t+1}\big),$$

where the conditional sufficient statistics for states (s_t^x) and parameters (s_t) satisfy the deterministic updating rules

$$s_{t+1}^x = K\big(s_t^x, \theta, \lambda_{t+1}, y_{t+1}\big) \qquad (11)$$

$$s_{t+1} = S\big(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1}\big) \qquad (12)$$

where K(·) denotes the Kalman filter recursions and S(·) the recursive update of the sufficient statistics, as in (1). More specifically, define s_t^x = (m_t, C_t) as the Kalman filter first and second moments at time t. Conditional on θ,

$$p\big(x_{t+1}\mid s_t^x, \theta, \lambda_{t+1}\big) \sim \mathcal N(a_{t+1}, R_{t+1}),$$

where $a_{t+1} = G_{\lambda_{t+1}} m_t$ and $R_{t+1} = G_{\lambda_{t+1}} C_t G'_{\lambda_{t+1}} + W_{\lambda_{t+1}}$. Updating the state sufficient statistics (m_{t+1}, C_{t+1}) is done by

$$m_{t+1} = G_{\lambda_{t+1}} m_t + A_{t+1}\big(y_{t+1} - F_{\lambda_{t+1}} G_{\lambda_{t+1}} m_t\big) \qquad (13)$$

$$C_{t+1}^{-1} = R_{t+1}^{-1} + F'_{\lambda_{t+1}} V_{\lambda_{t+1}}^{-1} F_{\lambda_{t+1}} \qquad (14)$$

where $A_{t+1} = R_{t+1} F'_{\lambda_{t+1}} Q_{t+1}^{-1}$ is the Kalman gain matrix and $Q_{t+1} = F_{\lambda_{t+1}} R_{t+1} F'_{\lambda_{t+1}} + V_{\lambda_{t+1}}$.

We are now ready to define the PL scheme for the CDLM. First, without loss of generality, assume that the auxiliary state variable is discrete with λ_{t+1} ∼ p(λ_{t+1}|λ_t, θ). We start, at time t, with a particle approximation to the joint posterior,

$$p^N\big(x_t, \theta, \lambda_t, s_t, s_t^x\mid y^t\big) = \frac{1}{N}\sum_{i=1}^N \delta_{(x_t, \theta, \lambda_t, s_t, s_t^x)^{(i)}},$$

whose particles are propagated to t + 1 by first resampling the current particles with weights proportional to the predictive p(y_{t+1}|(θ, s_t^x)). This provides a particle approximation to the smoothing distribution p(x_t, θ, λ_t, s_t, s_t^x | y^{t+1}). New states λ_{t+1} and x_{t+1} are then propagated through the conditional posterior distributions p(λ_{t+1}|λ_t, θ, y_{t+1}) and p(x_{t+1}|λ_{t+1}, x_t, θ, y_{t+1}). Finally, the conditional sufficient statistics are updated according to (11) and (12), and new samples for θ are obtained from p(θ|s_{t+1}). Notice that in the CDLM all the above densities are available for evaluation and sampling. For instance, the predictive is computed via

$$p\big(y_{t+1}\mid (\lambda_t, s_t^x, \theta)^{(i)}\big) = \sum_{\lambda_{t+1}} p\big(y_{t+1}\mid \lambda_{t+1}, (s_t^x, \theta)^{(i)}\big)\, p\big(\lambda_{t+1}\mid \lambda_t^{(i)}, \theta^{(i)}\big),$$

where the inner predictive distribution is given by

$$p\big(y_{t+1}\mid \lambda_{t+1}, s_t^x, \theta\big) = \int p(y_{t+1}\mid x_{t+1}, \lambda_{t+1}, \theta)\, p(x_{t+1}\mid s_t^x, \theta)\, dx_{t+1}.$$

Starting with the particle set $\{(x_t, \theta, \lambda_t, s_t, s_t^x)^{(i)}\}_{i=1}^N$, the above discussion can be summarized in the following PL algorithm:

Algorithm 1: CDLM
Step 1 (Resample): Generate an index $k(i) \sim \mathrm{Multi}(w^{(i)})$, where $w^{(i)} \propto p\big(y_{t+1}\mid (\lambda_t, s_t^x, \theta)^{(i)}\big)$;
Step 2 (Propagate states): $\lambda_{t+1} \sim p\big(\lambda_{t+1}\mid (\lambda_t, \theta)^{k(i)}, y_{t+1}\big)$ and $x_{t+1} \sim p\big(x_{t+1}\mid (x_t, \theta)^{k(i)}, \lambda_{t+1}, y_{t+1}\big)$;
Step 3 (Propagate sufficient statistics): $s_{t+1}^x = K(s_t^x, \theta, \lambda_{t+1}, y_{t+1})$ and $s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$.

In the case where the auxiliary state variable λ_t is continuous, it is not always possible to integrate λ_{t+1} out of the predictive in Step 1. We extend the above scheme by adding to the current particle set a propagated particle $\lambda_{t+1}^{(i)} \sim p\big(\lambda_{t+1}\mid (\lambda_t, \theta)^{(i)}\big)$ and define the following PL algorithm:

Algorithm 2: Auxiliary state CDLM
Step 1 (Resample): Generate an index $k(i) \sim \mathrm{Multi}(w^{(i)})$, where $w^{(i)} \propto p\big(y_{t+1}\mid (s_t^x, \lambda_{t+1}, \theta)^{(i)}\big)$;
Step 2 (Propagate states): $x_{t+1} \sim p\big(x_{t+1}\mid (x_t, \lambda_{t+1}, \theta)^{k(i)}, y_{t+1}\big)$;
Step 3 (Propagate sufficient statistics): $s_{t+1}^x = K(s_t^x, \theta, \lambda_{t+1}, y_{t+1})$ and $s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$.

Both of the above algorithms can be combined with the backwards propagation scheme of Section 2.4 to provide a full draw from the marginal posterior distribution of all the states given the data, namely p(x^T | y^T). (A sketch of the Kalman map K appears below.)
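For concreteness, here is a sketch of the map K in equations (13)-(14) for fixed system matrices (a single λ value); the matrices and toy values are our own illustration, not from the paper.

```python
import numpy as np

def kalman_map(m, C, F, G, V, W, y):
    """One step of s^x_{t+1} = K(s^x_t, theta, lambda_{t+1}, y_{t+1}):
    prior moments a = G m, R = G C G' + W; update via the Kalman gain."""
    a = G @ m
    R = G @ C @ G.T + W
    Q = F @ R @ F.T + V                    # predictive variance of y_{t+1}
    A = R @ F.T @ np.linalg.inv(Q)         # Kalman gain, eq. (13)
    m_new = a + A @ (y - F @ a)
    C_new = np.linalg.inv(np.linalg.inv(R) + F.T @ np.linalg.inv(V) @ F)  # eq. (14)
    return m_new, C_new

# usage on a bivariate observation of a scalar state
m, C = np.zeros(1), np.eye(1)
F = np.array([[1.0], [0.8]]); G = np.eye(1)
V, W = 0.25 * np.eye(2), 0.01 * np.eye(1)
print(kalman_map(m, C, F, G, V, W, y=np.array([0.4, 0.3])))
```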
3.2 Example: Dynamic Factor Model with Switching Loadings

Consider data y_t = (y_{t1}, y_{t2})', t = 1, ..., T, following a dynamic factor model with Markov switching in the auxiliary states λ_{t+1}. This illustrates our algorithm with a discrete λ_{t+1} in a CDLM (see also Carvalho and Lopes, 2007). Specifically, we have

$$y_{t+1} = \begin{pmatrix} 1 \\ \beta_{\lambda_{t+1}} \end{pmatrix} x_{t+1} + \sigma\,\epsilon_{t+1}$$

$$x_{t+1} = x_t + \sigma_x v_{t+1}$$

$$p(\lambda_{t+1}\mid \lambda_t, \theta) \sim \mathrm{MS}(p, q),$$

where MS denotes a Markov switching process with transition probabilities (p, q), and ε_t and v_t are independent standard normals. The initial state is assumed to be x_0 ∼ N(m_0, C_0). The parameters are θ = (β_1, β_2, σ², σ_x², p, q)'. For this model we are able to develop the algorithm marginalizing over both (x_{t+1}, λ_{t+1}) by using the state sufficient statistics s_t^x = (m_t, C_t) as particles. From the Kalman filter recursions we know that p(x_t|λ_t, θ, y^t) ∼ N(m_t, C_t). The mapping for the state sufficient statistics, (m_{t+1}, C_{t+1}) = K(m_t, C_t, λ_{t+1}, θ, y_{t+1}), is given by the one-step Kalman update as in (13) and (14). The prior distributions are conditionally conjugate:

$$p(\beta_i\mid \sigma^2) \sim \mathcal N\big(b_{i0}, \sigma^2 B_{i0}\big) \text{ for } i = 1, 2, \qquad p(\sigma^2) \sim IG\Big(\frac{\nu_{00}}{2}, \frac{d_{00}}{2}\Big), \qquad p(\sigma_x^2) \sim IG\Big(\frac{\nu_{10}}{2}, \frac{d_{10}}{2}\Big).$$

For the transition probabilities, we assume that p ∼ Beta(p_1, p_2) and q ∼ Beta(q_1, q_2).

Assume that at time t we have particles $\{(x_t, \theta, \lambda_t, s_t^x, s_t)^{(i)}\}_{i=1}^N$ approximating p(x_t, θ, λ_t, s_t^x, s_t | y^t). The PL algorithm can be described through the following steps:

Resampling. Draw an index $k(i) \sim \mathrm{Multi}(w^{(i)})$ with weights $w^{(i)} \propto p\big(y_{t+1}\mid (m_t, C_t, \lambda_t, \theta)^{(i)}\big)$, where

$$p(y_{t+1}\mid m_t, C_t, \lambda_t, \theta) = \sum_{\lambda_{t+1}=1}^{2} f_N\!\left(y_{t+1};\, \begin{pmatrix} 1 \\ \beta_{\lambda_{t+1}} \end{pmatrix} m_t,\; (C_t + \sigma_x^2)\begin{pmatrix} 1 & \beta_{\lambda_{t+1}} \\ \beta_{\lambda_{t+1}} & \beta_{\lambda_{t+1}}^2 \end{pmatrix} + \sigma^2 I_2\right) p(\lambda_{t+1}\mid \lambda_t, \theta),$$

where f_N denotes the normal density function.

Propagating λ. Draw the auxiliary state variable $\lambda_{t+1}^{(i)} \sim p\big(\lambda_{t+1}\mid (s_t^x, \lambda_t, \theta)^{k(i)}, y_{t+1}\big)$ using

$$Pr(\lambda_{t+1} = j\mid s_t^x, \lambda_t, \theta, y_{t+1}) \propto f_N\!\left(y_{t+1};\, \begin{pmatrix} 1 \\ \beta_j \end{pmatrix} m_t,\; (C_t + \sigma_x^2)\begin{pmatrix} 1 & \beta_j \\ \beta_j & \beta_j^2 \end{pmatrix} + \sigma^2 I_2\right) p(\lambda_{t+1} = j\mid \lambda_t, \theta).$$

Propagating x. Draw states $x_{t+1}^{(i)} \sim p\big(x_{t+1}\mid \lambda_{t+1}^{(i)}, (s_t^x, \theta)^{k(i)}, y_{t+1}\big)$.

Updating sufficient statistics for states. The Kalman filter recursions yield

$$m_{t+1} = m_t + A_{t+1}\big(y_{t+1} - (1,\ \beta_{\lambda_{t+1}})'\, m_t\big), \qquad C_{t+1} = C_t + \sigma_x^2 - A_{t+1} Q_{t+1} A'_{t+1},$$

where

$$Q_{t+1} = (C_t + \sigma_x^2)\begin{pmatrix} 1 & \beta_{\lambda_{t+1}} \\ \beta_{\lambda_{t+1}} & \beta_{\lambda_{t+1}}^2 \end{pmatrix} + \sigma^2 I_2 \qquad\text{and}\qquad A_{t+1} = (C_t + \sigma_x^2)\,\big(1,\ \beta_{\lambda_{t+1}}\big)\, Q_{t+1}^{-1}.$$

Updating sufficient statistics for parameters. The conditional posterior p(θ|s_{t+1}) decomposes into

$$p(\beta_i\mid \sigma^2, s_{t+1}) \sim \mathcal N(b_{i,t+1}, \sigma^2 B_{i,t+1}), \quad i = 1, 2,$$

$$p(\sigma^2\mid s_{t+1}) \sim IG\Big(\frac{\nu_{0,t+1}}{2}, \frac{d_{0,t+1}}{2}\Big), \qquad p(\sigma_x^2\mid s_{t+1}) \sim IG\Big(\frac{\nu_{1,t+1}}{2}, \frac{d_{1,t+1}}{2}\Big),$$

$$p(p\mid s_{t+1}) \sim Beta(p_{1,t+1}, p_{2,t+1}), \qquad p(q\mid s_{t+1}) \sim Beta(q_{1,t+1}, q_{2,t+1}),$$

where

$$B_{i,t+1}^{-1} = B_{i,t}^{-1} + x_{t+1}^2\, \mathbb I_{\lambda_{t+1}=i}, \qquad b_{i,t+1} = B_{i,t+1}\big(B_{i,t}^{-1} b_{i,t} + x_{t+1}\, y_{t+1,2}\, \mathbb I_{\lambda_{t+1}=i}\big),$$

$$d_{0,t+1} = d_{0,t} + (y_{t+1,1} - x_{t+1})^2 + \sum_{j=1}^{2}\big[(y_{t+1,2} - b_{j,t+1} x_{t+1})\, y_{t+1,2} + (b_{j,t} - b_{j,t+1})\, b_{j,t}\, B_{j,t}^{-1}\big]\, \mathbb I_{\lambda_{t+1}=j},$$

$$d_{1,t+1} = d_{1,t} + (x_{t+1} - x_t)^2, \qquad \nu_{i,t+1} = \nu_{i,t} + 1.$$

For p_{i,t+1} and q_{i,t+1}, i = 1, 2, we have $p_{1,t+1} = p_{1,t} + \mathbb I_{\lambda_t=1,\lambda_{t+1}=1}$ and $p_{2,t+1} = p_{2,t} + \mathbb I_{\lambda_t=1,\lambda_{t+1}=2}$, and similarly for q_{i,t+1}.

Figures 1 and 2 illustrate our PL algorithm in a simulated bivariate Markov switching dynamic factor model. The first panel of Figure 1 displays the true underlying λ process along with filtered and smoothed estimates, whereas the second panel presents the same information for the common factor x. Figure 2 provides the sequential parameter learning plots. (The resampling weights above are illustrated in the sketch that follows.)
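The resampling weight is a two-component mixture of bivariate normal densities. A sketch of its evaluation for a single particle follows; the function names and toy values are our own, λ is coded as 0/1, and P is the 2×2 transition matrix implied by (p, q).

```python
import numpy as np

def predictive_weight(y, m, C, lam, beta, sig2, sig2x, P):
    """p(y_{t+1} | m_t, C_t, lambda_t, theta): a mixture of two bivariate
    normals, one per value of lambda_{t+1}."""
    def bvn_logpdf(y, mu, S):
        r = y - mu
        return -0.5 * (np.log(np.linalg.det(S)) + r @ np.linalg.solve(S, r)
                       + 2 * np.log(2 * np.pi))
    def component(b):
        F = np.array([1.0, b])                       # loading vector (1, beta_j)'
        S = (C + sig2x) * np.outer(F, F) + sig2 * np.eye(2)
        return np.exp(bvn_logpdf(y, F * m, S))
    return sum(component(beta[j]) * P[lam, j] for j in (0, 1))

P = np.array([[0.95, 0.05], [0.10, 0.90]])           # transition matrix from (p, q)
print(predictive_weight(np.array([0.5, 0.4]), m=0.3, C=0.2, lam=0,
                        beta=[1.5, 0.2], sig2=0.04, sig2x=0.01, P=P))
```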
4 Nonlinear Filtering and Learning

We now extend our PL filter to a general class of nonlinear state space models, namely conditional Gaussian dynamic models (CGDM). This class generalizes conditional dynamic linear models by allowing nonlinear evolution equations. In this context we retain most of the efficiency gains of PL, as we are still able to follow the resample-propagate logic and to filter sufficient statistics for θ. Consider a conditional Gaussian state space model with nonlinear evolution equation,

$$y_{t+1} = F_{\lambda_{t+1}} x_{t+1} + \epsilon_{t+1}, \qquad \epsilon_{t+1} \sim \mathcal N\big(0, V_{\lambda_{t+1}}\big) \qquad (15)$$

$$x_{t+1} = G_{\lambda_{t+1}} Z(x_t) + \omega_{t+1}, \qquad \omega_{t+1} \sim \mathcal N\big(0, W_{\lambda_{t+1}}\big), \qquad (16)$$

where Z(·) is a given nonlinear function. Due to the nonlinearity in the evolution, we are no longer able to work with state sufficient statistics s_t^x, but we are still able to evaluate the predictive p(y_{t+1}|x_t, λ_t, θ). In general, take as the particle set $\{(x_t, \theta, \lambda_t, s_t)^{(i)}\}_{i=1}^N$. For discrete λ we can define the following algorithm:

Algorithm 3: CGDM
Step 1 (Resample): Generate an index $k(i) \sim \mathrm{Multi}(w^{(i)})$, where $w^{(i)} \propto p\big(y_{t+1}\mid (x_t, \lambda_t, \theta)^{(i)}\big)$;
Step 2 (Propagate states): $\lambda_{t+1} \sim p\big(\lambda_{t+1}\mid (\lambda_t, \theta)^{k(i)}, y_{t+1}\big)$ and $x_{t+1} \sim p\big(x_{t+1}\mid (x_t, \theta)^{k(i)}, \lambda_{t+1}, y_{t+1}\big)$;
Step 3 (Propagate sufficient statistics): $s_{t+1} = S(s_t, x_{t+1}, \lambda_{t+1}, y_{t+1})$.

When λ is continuous, we propagate $\lambda_{t+1} \sim p\big(\lambda_{t+1}\mid (\lambda_t, \theta)^{(i)}\big)$ and then resample the particles (x_t, λ_{t+1}, s_t)^{(i)} with the appropriate predictive $p\big(y_{t+1}\mid (x_t, \lambda_{t+1}, \theta)^{(i)}\big)$, as in Algorithm 2. Finally, it is straightforward to extend the backward smoothing strategy of Section 2.4 to obtain samples from p(x^T | y^T).

4.1 Example: CGDM

Consider the fat-tailed nonlinear state space model given by

$$y_{t+1} = x_{t+1} + \sigma\sqrt{\lambda_{t+1}}\,\epsilon_{t+1}, \qquad \lambda_{t+1} \sim IG\Big(\frac{\nu}{2}, \frac{\nu}{2}\Big)$$

$$x_{t+1} = \beta\, g(x_t) + \sigma_x u_{t+1}, \qquad g(x_t) = \frac{x_t}{1 + x_t^2},$$

where ε_{t+1} and u_{t+1} are independent standard normals and ν is known. The observation error term $\sqrt{\lambda_{t+1}}\,\epsilon_{t+1}$ is standard Student-t with ν degrees of freedom. Filtering strategies for this model without parameter learning are described in de Jong et al. (2007).

Assume that at time t we have particles $\{(x_t, \theta, \lambda_{t+1}, s_t)^{(i)}\}_{i=1}^N$ approximating p(x_t, θ, λ_{t+1}, s_t | y^t). The PL algorithm can be described through the following steps:

Resampling. Draw an index $k(i) \sim \mathrm{Multi}(w^{(i)})$ with weights $w^{(i)} \propto p\big(y_{t+1}\mid (x_t, \lambda_{t+1}, \theta)^{(i)}\big)$, where

$$p(y_{t+1}\mid x_t, \lambda_{t+1}, \theta) = f_N\big(y_{t+1};\, \beta g(x_t),\; \lambda_{t+1}\sigma^2 + \sigma_x^2\big).$$

Propagating x. Draw $x_{t+1}^{(i)} \sim p\big(x_{t+1}\mid (\lambda_{t+1}, x_t, \theta)^{k(i)}, y_{t+1}\big)$. This conditional distribution is given by

$$p(x_{t+1}\mid \lambda_{t+1}, x_t, \theta, y_{t+1}) = \mathcal N\big(x_{t+1};\, \mu(x_t, \lambda_{t+1}, \theta),\; V(x_t, \lambda_{t+1}, \theta)\big)$$

$$\mu(x_t, \lambda_{t+1}, \theta) = V(x_t, \lambda_{t+1}, \theta)\left(\frac{y_{t+1}}{\lambda_{t+1}\sigma^2} + \frac{\beta g(x_t)}{\sigma_x^2}\right), \qquad V(x_t, \lambda_{t+1}, \theta)^{-1} = \lambda_{t+1}^{-1}\sigma^{-2} + \sigma_x^{-2}.$$

Updating sufficient statistics. Posterior parameter learning for θ = (β, σ², σ_x²) follows directly from the conditional normal-inverse-gamma updates.

Figure 3 illustrates the above PL algorithm in a simulated example where β = 0.9, σ = 0.04 and σ_x² = 0.01. The algorithm uncovers the true parameters very efficiently in a sequential fashion. In Section 5 we present a simulation exercise comparing the performance of PL versus MCMC (Carlin, Polson and Stoffer, 1992) in this same example, while in Section 6 we compare PL to the filter of Liu and West (2001). (A sketch of this resample-propagate step follows below.)
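The following is a sketch of the resampling and exact propagation steps for this example, following the Algorithm 2 treatment of the continuous λ (each particle first receives a fresh λ_{t+1} ∼ IG(ν/2, ν/2)); the names and parameter values are our own illustration.

```python
import numpy as np

def cgdm_step(x, lam_next, y_new, beta, sig2, sig2x, rng):
    """Resample with p(y_{t+1}|x_t, lambda_{t+1}, theta) =
    N(beta g(x_t), lambda_{t+1} sigma^2 + sigma_x^2), then propagate
    x_{t+1} exactly from its conditional normal posterior."""
    g = x / (1.0 + x ** 2)
    var_y = lam_next * sig2 + sig2x
    logw = -0.5 * (y_new - beta * g) ** 2 / var_y - 0.5 * np.log(var_y)
    w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(len(x), size=len(x), p=w)
    g, lam_next = g[k], lam_next[k]
    V = 1.0 / (1.0 / (lam_next * sig2) + 1.0 / sig2x)
    mu = V * (y_new / (lam_next * sig2) + beta * g / sig2x)
    return rng.normal(mu, np.sqrt(V)), k

# usage
rng = np.random.default_rng(4)
N, nu = 5000, 10
x = rng.normal(0, 0.1, N)
lam = 1.0 / rng.gamma(nu / 2, 2.0 / nu, N)   # lambda_{t+1} ~ IG(nu/2, nu/2)
x, k = cgdm_step(x, lam, y_new=0.2, beta=0.9, sig2=0.04**2, sig2x=0.01, rng=rng)
```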
5 Comparing PL to MCMC

PL combined with the backward smoothing algorithm of Section 2.4 is an alternative to MCMC methods for state space models. In general, MCMC methods (see Gamerman and Lopes, 2006) use Markov chains designed to explore the posterior distribution p(x^T, θ | y^T) of states and parameters conditional on all the information available, y^T = (y_1, ..., y_T). For example, an MCMC strategy would iterate through p(θ|x^T, y^T) and p(x^T|θ, y^T). However, MCMC relies on the convergence of very high-dimensional Markov chains. In purely conditional Gaussian linear models, or when the states are discrete, p(x^T|θ, y^T) can be sampled in one block using FFBS. Even in these ideal cases, achieving convergence is far from an easy task, and the computational complexity is enormous, as each iteration requires a forward filtering and backward sampling pass for the full state vector x^T. The particle learning algorithm presented here has two advantages: (i) it requires only one forward/backward pass through the data for all N particles, and (ii) its approximation accuracy does not rely on convergence results that are virtually impossible to assess in practice (see Papaspiliopoulos and Roberts, 2008). In the presence of nonlinearities, MCMC methods suffer even further, as no FFBS scheme is available for the full state vector x^T. One then has to resort to univariate updates of p(x_t|x_{(−t)}, θ, y^T), as in Carlin, Polson and Stoffer (1992). It is well known that these methods generate very "sticky" Markov chains, increasing computational complexity and slowing down convergence. The computational complexity of our approach is only O(NT), where N is the number of particles. PL is also attractive given the simple nature of its implementation (especially compared to more novel hybrid methods).

Another important advantage of PL is that it directly provides the filtered joint posteriors p(x_t, θ | y^t), whereas MCMC would have to be repeated T times to make these available. These distributions are not only important for sequential prediction problems but are also key to the computation of Bayes factors for model assessment in state space models. More specifically, the marginal predictive can be approximated via

$$\hat p(y_{t+1}\mid y^t) = \frac{1}{N}\sum_{i=1}^N p\big(y_{t+1}\mid (x_t, \theta)^{(i)}\big).$$

This allows the computation of Bayes factors BF_t, or sequential likelihood ratios, for competing models M_0 and M_1 (see, for example, West, 1986):

$$BF_t = \frac{p(y^t\mid M_1)}{p(y^t\mid M_0)}, \qquad\text{where}\quad p(y^t\mid M_i) = \prod_{j=1}^t p(y_j\mid y^{j-1}, M_i).$$

(A sketch of this computation appears below.)
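In code, the running log marginal likelihood (and hence the log Bayes factor) accumulates the log of this particle average. The sketch below assumes the local-level model, where $p\big(y_{t+1}\mid (x_t, \theta)^{(i)}\big)$ is normal with variance σ² + τ² from particle i; the naming is our own.

```python
import numpy as np

def log_predictive(y_new, x, pred_var):
    """Particle estimate of log p_hat(y_{t+1}|y^t): the log of the average
    conditional predictive density over particles (log-sum-exp for stability)."""
    logp = -0.5 * (y_new - x) ** 2 / pred_var - 0.5 * np.log(2 * np.pi * pred_var)
    m = logp.max()
    return m + np.log(np.mean(np.exp(logp - m)))

# running log Bayes factor of M1 vs M0 from two parallel PL runs:
# log_bf += log_predictive(y, x1, v1) - log_predictive(y, x0, v0)
```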
We now present three experiments comparing the performance of PL against MCMC alternatives.

Experiment 1. First we compare PL with FFBS in a first order normal DLM, defined as

$$y_{t+1} = x_{t+1} + \epsilon_{t+1}, \qquad \epsilon_{t+1} \sim \mathcal N(0, \sigma^2)$$

$$x_{t+1} = x_t + \omega_{t+1}, \qquad \omega_{t+1} \sim \mathcal N(0, \tau^2).$$

This is a special case of the CDLM considered in Section 3. We simulate data from the above model with different specifications for (σ², τ²), effectively varying the signal-to-noise ratio of the model. Table 1 presents the results of the comparison between PL and FFBS. It is clear that PL outperforms FFBS in computational time for similar levels of accuracy in estimating the state vector x^T. It is also important to highlight that FFBS does not necessarily converge geometrically, as shown by Papaspiliopoulos and Roberts (2008), so PL can be even more advantageous in more complex model specifications. Figure 4 presents draws from the posterior distribution p(σ², τ²|y^T) for both PL and FFBS, indicating that PL is also successful in learning about the fixed parameters defining the model.

Experiment 2. We now consider the first order conditional Gaussian dynamic model with nonlinear state equation as defined in Section 4.1. Once again we generate data with different levels of the signal-to-noise ratio and compare the performance of PL with MCMC. In this situation we are no longer able to implement FFBS and need to resort to single-move updates, as in Carlin, Polson and Stoffer (1992). Table 2 presents the results of the comparisons. Again, PL outperforms MCMC by a large margin and, in most cases, provides better estimates of x^T. In addition, Figure 5 shows the effective sample size (ESS) from PL and MCMC for the samples of x_{100}. Notice that 100 separate MCMC runs are necessary to obtain the right-hand panel, whereas the left-hand panel is trivially obtained from one run of PL. Due to its high level of autocorrelation, the MCMC implementation has a very small ESS relative to PL. This result emphasizes the vast computational advantage of PL: for the same level of ESS, we would need to run the MCMC for many more iterations.

Experiment 3. Finally, we consider the following example from Papaspiliopoulos and Roberts (2008). The model is defined as

$$y = x_1 + \epsilon_1, \qquad \epsilon_1 \sim \mathcal N(0, 1)$$

$$x_1 = x_0 + \sqrt{\lambda_1}\,\omega_1, \qquad \omega_1 \sim \mathcal N(0, 1),$$

where $\lambda_1 \sim IG\big(\tfrac{1}{2}, \tfrac{1}{2}\big)$ and the prior (initial) distribution is p(x_0) ∼ N(0, 1000). Hence we have a heavy-tailed Cauchy error in the evolution equation. Only one observation is available: y = 0. The joint posterior behaves similarly to the witch's hat distribution (Polson, 1991), as it is the product of a heavy-tailed Cauchy and a normal. Papaspiliopoulos and Roberts (2008, Example 1 and Proposition 3.7) show that the Gibbs sampler does not converge well, being highly dependent on the starting point. In this case, PL works as follows: first, generate particles (x_0, λ_1)^{(i)} ∼ p(x_0)p(λ_1). Then resample these particles with weights proportional to the predictive p(y|x_0, λ_1) ∼ N(x_0, 1 + λ_1). Then propagate particles for the next state x_1 and finally backward-sample x_0 using p(x_1|(x_0, λ_1)^{k(i)}, y) and p(x_0|(x_1, λ_1)^{k(i)}, y). Figure 6 plots the joint smoothing posterior p(x_1, x_0|y) for this single data point. Compared to the results in Papaspiliopoulos and Roberts (2008), PL provides a much improved approximation of this posterior. (A sketch of these steps follows below.)
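These steps fit in a few lines. Below is a sketch (our own code) of the forward part — prior draws, predictive resampling, and exact propagation of x_1; the backward draw of x_0 follows analogously from p(x_0|x_1, λ_1, y).

```python
import numpy as np

rng = np.random.default_rng(7)
N, y = 50_000, 0.0
x0 = rng.normal(0.0, np.sqrt(1000.0), N)      # x_0 ~ N(0, 1000)
lam = 1.0 / rng.gamma(0.5, 2.0, N)            # lambda_1 ~ IG(1/2, 1/2)
# resample with predictive weights p(y | x_0, lambda_1) = N(x_0, 1 + lambda_1)
v = 1.0 + lam
logw = -0.5 * (y - x0) ** 2 / v - 0.5 * np.log(v)
w = np.exp(logw - logw.max()); w /= w.sum()
k = rng.choice(N, size=N, p=w)
x0, lam = x0[k], lam[k]
# exact propagation: combine prior N(x_0, lambda_1) with likelihood N(y; x_1, 1)
V = 1.0 / (1.0 / lam + 1.0)
x1 = rng.normal(V * (x0 / lam + y), np.sqrt(V))
print(np.mean(x1), np.mean(x0))
```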
6 Comparing PL to "Liu and West"

All the experiments presented so far provide substantial evidence of PL's ability to effectively solve the learning and smoothing problem. We now compare the performance of PL against the most commonly used filter for situations where θ is unknown, i.e., the filter proposed by Liu and West (2001) (LW). This filtering strategy adapts the auxiliary particle filter of Pitt and Shephard (1999) and handles the learning of θ by incorporating a kernel density approximation to p(θ|y^t). We also use this section to discuss the robustness of PL as the sample size T grows. In the following two experiments, our focus is to assess the performance of each filter in properly estimating the sequence of distributions p(θ|y^t) for all t.

Experiment 4. First we compare PL with LW in the same set-up as Experiment 1 of Section 5. Here we assume knowledge of τ² = 0.1 and simulate data with different values of σ², so that τ/σ = 0.5, 1 and 10. In this context we can evaluate the true posterior distribution p(σ²|y^t) analytically for all t. We perform a Monte Carlo experiment based on 1,000 replications with T = 200 or T = 1000. In each of the six cases we compute the estimation risk for 5 percentiles of p(σ²|y^T). Figure 7 shows the risk ratios of PL versus LW. It is clear that PL is uniformly superior in all cases. Figures 8 and 9 are snapshots of the experiment in one particular replication with τ/σ = 1 and T = 1000. It is important to highlight that PL is able to correctly estimate p(σ²|y^T) even for a significantly large value of T, showing the robustness of the filter to particle degeneration. This can also be seen by inspecting Figure 10, where we sequentially plot the distribution of the conditional sufficient statistics for σ², along with their true simulated values. Combined with the accurate approximation of p(σ²|y^T), this shows that filtering the conditional sufficient statistics does in fact provide a solution to the learning problem. LW, however, suffers from particle degeneracy fairly quickly, as illustrated in Figure 8.

Experiment 5. We revisit the nonlinear conditional Gaussian dynamic model of Section 4.1. In this case θ = (σ², β, τ²). No "true" benchmark for comparison exists here, so we run a very long MCMC (for T = 100) and compare the posteriors for θ with the outputs from PL and LW. Figure 11 presents both the sequential learning plots and the estimated densities for θ at time T = 100. Once again, PL outperforms LW and provides answers that are much closer to the MCMC output. It is important to remember that PL runs in a fraction of the time of MCMC while providing a richer output, i.e., the sequence of filtered distributions p(x_t, θ|y^t).

7 Final Remarks

In this paper we provide particle learning tools (PL) for a large class of state space models. Our methodology incorporates sequential parameter learning, state filtering and smoothing. It provides an alternative to the popular FFBS/MCMC (Carter and Kohn, 1994) approach for conditional dynamic linear models (DLMs), and also to MCMC approaches for nonlinear, non-Gaussian models. It also generalizes the mixture Kalman filter (MKF) approach of Liu and Chen (2000) by including parameter learning and smoothing. The key assumption is the existence of a conditional sufficient statistic structure for the parameters, which is available in many commonly used models. We provide extensive simulation evidence and theoretical Monte Carlo convergence bounds to address the efficiency of PL versus standard methods, using computational time and accuracy to assess performance. Our approach compares very favorably with these existing strategies and is robust to particle degeneracies as the sample size grows. Finally, PL has the additional advantage of being an intuitive and easy-to-implement computational scheme and should, therefore, become a default choice for posterior inference in this very general class of models.

8 Appendix A: Monte Carlo Convergence Bounds

The convergence of the particle approximation for mixture DLMs (see Section 3) can be described as follows. First, note that we have the current sufficient statistic posterior distribution p(s_t^x, s_t | y^t) and its particle approximation p^N(s_t^x, s_t | y^t). By definition, p^N(s_t^x, s_t | y^{t+1}) is given by

$$p^N(s_t^x, s_t\mid y^{t+1}) = \frac{p(y_{t+1}\mid s_t^x, s_t)}{p(y_{t+1}\mid y^t)}\, p^N(s_t^x, s_t\mid y^t),$$

from which we use the deterministic mapping to draw s_{t+1} = S(s_t, x_{t+1}, λ_{t+1}, y_{t+1}), and similarly for s_{t+1}^x. We can estimate the denominator via

$$p^N(y_{t+1}\mid y^t) = E^N\big(p(y_{t+1}\mid s_t^x, s_t)\big) = \frac{1}{N}\sum_{i=1}^N p\big(y_{t+1}\mid (s_t^x, s_t)^{(i)}\big),$$

where (s_t^x, s_t)^{(i)} ∼ p(s_t^x, s_t | y^t). Consider estimating the next filtering distribution for the conditional sufficient statistics; later we provide a bound for the sufficient statistics and parameters.
$$p(s_{t+1}^x, s_{t+1}\mid y^{t+1}) = \int p(s_{t+1}^x, s_{t+1}\mid s_t^x, s_t, \theta, y_{t+1})\, p(s_t^x, s_t, \theta\mid y^{t+1})\, d(s_t^x, s_t, \theta)$$

$$= \int p(s_{t+1}^x, s_{t+1}\mid s_t^x, s_t, \theta, y_{t+1})\, \frac{p(y_{t+1}\mid s_t^x, s_t, \theta)\, p(s_t^x, s_t, \theta\mid y^t)}{p(y_{t+1}\mid y^t)}\, d(s_t^x, s_t, \theta),$$

and the algorithm provides a density estimate of the filtering distribution of (s_{t+1}^x, s_{t+1}), namely p(s_{t+1}^x, s_{t+1}|y^{t+1}). Here p(s_{t+1}^x, s_{t+1}|s_t^x, s_t, θ, y_{t+1}) is based on the deterministic mapping given λ_{t+1}, averaged over the appropriate conditional posterior for λ_{t+1}. To complete the analysis of our algorithm we prove a convergence and Monte Carlo error bound. Define the variational norm as

$$\|p_t^N - p_t\| = \int \big|p_t^N(\phi\mid y^t) - p_t(\phi\mid y^t)\big|\, d\phi.$$

Theorem (Convergence). The particle-based estimate p^N(s_{t+1}^x, s_{t+1}|y^{t+1}) is consistent for p(s_{t+1}^x, s_{t+1}|y^{t+1}), namely

$$\big\|p^N(s_{t+1}^x, s_{t+1}\mid y^{t+1}) - p(s_{t+1}^x, s_{t+1}\mid y^{t+1})\big\| \to_P 0,$$

with the Monte Carlo convergence rate $\sqrt N$, if

$$C_{t+1} = \int \frac{p^2(s_{t+1}^x, s_{t+1}, s_t^x, s_t, \theta\mid y^{t+1})}{p(s_t^x, s_t, \theta\mid y^t)}\, d(s_t^x, s_t, \theta) < \infty.$$

Proof. First, notice that we can write

$$p^N(s_{t+1}^x, s_{t+1}\mid y^{t+1}) = \frac{\frac{1}{N}\sum_{i=1}^N p\big(s_{t+1}^x, s_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}, y_{t+1}\big)\, \dfrac{p\big(y_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}\big)}{p(y_{t+1}\mid y^t)}}{\frac{1}{N}\sum_{i=1}^N \dfrac{p\big(y_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}\big)}{p(y_{t+1}\mid y^t)}}.$$

Taking expectations over the distribution of (s_t^x, s_t, λ_{t+1}, θ), the denominator converges to

$$E\left(\frac{1}{N}\sum_{i=1}^N \frac{p\big(y_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}\big)}{p(y_{t+1}\mid y^t)}\right) = \int \frac{p(y_{t+1}\mid s_t^x, s_t, \lambda_{t+1}, \theta)\, p(s_t^x, s_t, \lambda_{t+1}, \theta\mid y^t)}{p(y_{t+1}\mid y^t)}\, d(s_t^x, s_t, \lambda_{t+1}, \theta) = 1.$$

The numerator converges to

$$E\left(\frac{1}{N}\sum_{i=1}^N p\big(s_{t+1}^x, s_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}, y_{t+1}\big)\, \frac{p\big(y_{t+1}\mid (s_t^x, s_t, \lambda_{t+1}, \theta)^{(i)}\big)}{p(y_{t+1}\mid y^t)}\right)$$

$$= \int p(s_{t+1}^x, s_{t+1}\mid s_t^x, s_t, \lambda_{t+1}, \theta, y_{t+1})\, \frac{p(y_{t+1}\mid s_t^x, s_t, \lambda_{t+1}, \theta)\, p(s_t^x, s_t, \lambda_{t+1}, \theta\mid y^t)}{p(y_{t+1}\mid y^t)}\, d(s_t^x, s_t, \lambda_{t+1}, \theta)$$

$$= \int p(s_{t+1}^x, s_{t+1}\mid s_t^x, s_t, \theta, y_{t+1})\, p(s_t^x, s_t, \theta\mid y^{t+1})\, d(s_t^x, s_t, \theta) = p(s_{t+1}^x, s_{t+1}\mid y^{t+1}).$$

Hence, in expectation,

$$E\big(p^N(s_{t+1}^x, s_{t+1}\mid y^{t+1})\big) = p(s_{t+1}^x, s_{t+1}\mid y^{t+1}).$$

For the variance bound, we use the inequality Var(φ) ≤ E(φ²), leading to the error bound

$$\mathrm{Var}\big(p^N(s_{t+1}^x, s_{t+1}\mid y^{t+1})\big) \le \frac{1}{N}\int \left(p(s_{t+1}^x, s_{t+1}\mid s_t^x, s_t, \theta, y_{t+1})\, \frac{p(y_{t+1}\mid s_t^x, s_t, \theta)}{p(y_{t+1}\mid y^t)}\right)^2 p(s_t^x, s_t, \theta\mid y^t)\, d(s_t^x, s_t, \theta)$$

$$= \frac{1}{N}\int \frac{p^2(s_{t+1}^x, s_{t+1}, s_t^x, s_t, \theta\mid y^{t+1})}{p(s_t^x, s_t, \theta\mid y^t)}\, d(s_t^x, s_t, \theta) = \frac{C_{t+1}}{N},$$

where we have used the Bayes rule identity

$$\frac{p(s_t^x, s_t, \theta\mid y^{t+1})}{p(s_t^x, s_t, \theta\mid y^t)} = \frac{p(y_{t+1}\mid s_t^x, s_t, \theta)}{p(y_{t+1}\mid y^t)}.$$

Notice that the Monte Carlo error depends on how the sufficient statistics update over time through the joint distribution p(s_{t+1}^x, s_{t+1}, s_t^x, s_t, θ | y^{t+1}); dramatic shifts in this distribution will require larger N. This can be used to find a $\sqrt N$ bound for the variational distance of p^N(s_t, s_t^x | y^t), due to the inequality E(|φ|)² ≤ E(φ²) for any functional φ. Finally, the variational distance for the particle approximation of the state and parameter distribution satisfies the bound

$$\big\|p^N(x_t, \theta\mid y^t) - p(x_t, \theta\mid y^t)\big\| \le \big\|p^N(s_t, s_t^x\mid y^t) - p(s_t, s_t^x\mid y^t)\big\|.$$

This is due to the fact that the state and parameter posterior is a Rao-Blackwellized average,

$$p(x_t, \theta\mid y^t) = \int p(x_t, \theta\mid s_t, s_t^x)\, p(s_t, s_t^x\mid y^t)\, ds_t\, ds_t^x.$$

Therefore,

$$\big\|p^N(x_t, \theta\mid y^t) - p(x_t, \theta\mid y^t)\big\| = \int \left| \int p(x_t, \theta\mid s_t, s_t^x)\big(p^N(s_t, s_t^x\mid y^t) - p(s_t, s_t^x\mid y^t)\big)\, ds_t\, ds_t^x \right| d(x_t, \theta)$$

$$\le \int p(x_t, \theta\mid s_t, s_t^x)\, \big|p^N(s_t, s_t^x\mid y^t) - p(s_t, s_t^x\mid y^t)\big|\, ds_t\, ds_t^x\, d(x_t, \theta) = \big\|p^N(s_t, s_t^x\mid y^t) - p(s_t, s_t^x\mid y^t)\big\|.$$

This can be used inductively, as in Godsill, Doucet and West (2004), since the bound clearly holds at time t = 0 by taking a random sample from p(x_0, θ).
We note that we also have the typical Monte Carlo rate of convergence for expectations of functionals, E^N(φ), as well as for the density estimate p^N.

References

Bengtsson, T., Bickel, P. and Li, B. (2008). Curse-of-dimensionality revisited: Collapse of the particle filter in very large scale systems. In Probability and Statistics: Essays in Honor of David A. Freedman. Institute of Mathematical Statistics Collections, 2, 316-334.

Carlin, B., Polson, N.G. and Stoffer, D. (1992). A Monte Carlo approach to nonnormal and nonlinear state-space modeling. Journal of the American Statistical Association, 87, 493-500.

Carter, C. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541-553.

Carvalho, C.M. and Lopes, H.F. (2007). Simulation-based sequential analysis of Markov switching stochastic volatility models. Computational Statistics and Data Analysis, 51, 4526-4542.

Doucet, A., de Freitas, N. and Gordon, N. (Eds.) (2001). Sequential Monte Carlo Methods in Practice. New York: Springer.

Fearnhead, P. (2002). Markov chain Monte Carlo, sufficient statistics, and particle filters. Journal of Computational and Graphical Statistics, 11, 848-862.

Fearnhead, P. and Clifford, P. (2003). On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society, Series B, 65, 887-899.

Frühwirth-Schnatter, S. (1994). Applied state space modelling of non-Gaussian time series using integration-based Kalman filtering. Statistics and Computing, 4, 259-269.

Gamerman, D. and Lopes, H.F. (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Chapman and Hall/CRC.

Gilks, W. and Berzuini, C. (2001). Following a moving target: Monte Carlo inference for dynamic Bayesian models. Journal of the Royal Statistical Society, Series B, 63, 127-146.

Godsill, S.J., Doucet, A. and West, M. (2004). Monte Carlo smoothing for nonlinear time series. Journal of the American Statistical Association, 99, 156-168.

Gordon, N., Salmond, D. and Smith, A.F.M. (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140, 107-113.

Johannes, M., Polson, N.G. and Yae, S.M. (2007). Nonlinear filtering and learning. Working paper, The University of Chicago Booth School of Business.

Kalman, R.E. (1960). A new approach to linear filtering and prediction problems. Transactions of the ASME, Journal of Basic Engineering, 82, 35-45.

Kitagawa, G. (1987). Non-Gaussian state-space modeling of nonstationary time series. Journal of the American Statistical Association, 82, 1032-1041.

Liu, J. and Chen, R. (2000). Mixture Kalman filters. Journal of the Royal Statistical Society, Series B, 62, 493-508.

Liu, J. and West, M. (2001). Combined parameters and state estimation in simulation-based filtering. In Sequential Monte Carlo Methods in Practice (Eds. A. Doucet, N. de Freitas and N. Gordon). New York: Springer-Verlag.

Papaspiliopoulos, O. and Roberts, G. (2008). Stability of the Gibbs sampler for Bayesian hierarchical models. Annals of Statistics, 36, 95-117.

Pitt, M. and Shephard, N. (1999). Filtering via simulation: Auxiliary particle filters. Journal of the American Statistical Association, 94, 590-599.

Polson, N.G. (1991). Comment on "Practical Markov Chain Monte Carlo" by C. Geyer. Statistical Science, 7, 490-491.

Polson, N.G., Stroud, J. and Müller, P. (2008). Practical filtering with sequential parameter learning. Journal of the Royal Statistical Society, Series B, 70, 413-428.

Scott, S. (2002). Bayesian methods for hidden Markov models: recursive computing in the 21st century. Journal of the American Statistical Association, 97, 335-351.

Storvik, G. (2002).
Particle filters in state space models with the presence of unknown static parameters. IEEE Transactions on Signal Processing, 50, 281-289.

West, M. (1986). Bayesian model monitoring. Journal of the Royal Statistical Society, Series B, 48, 70-78.

West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models (2nd edition). New York: Springer-Verlag.

Table 1: Experiment 1. First order normal dynamic linear model. Prior distributions for σ² and τ² are IG(a, b) and IG(c, d), respectively, where a = c = 10, b = (a − 1)σ²_true and d = (c − 1)τ²_true. Similarly, x_0 ∼ N(x_{0,true}, 100). The PL algorithm is based on 5,000 particles; FFBS is based on 5,000 draws after 5,000 burn-in. For the MCMC runs, the initial values of σ² and τ² are the true values. For the SMC runs, initial draws of σ² and τ² are taken from the priors, while the Kalman filter is initialized at (x_{0,true}, 100). Also, τ_true = 0.1. MSE and MAE refer to the mean squared error and mean absolute error in estimating the state vector x^T.

  T    σ/τ  | Time (sec.)     | MSE                  | MAE
            | SMC     MCMC    | SMC       MCMC       | SMC        MCMC
  50   5    | 1.7     59.2    | 0.053783  0.041281   | 0.1856861  0.159972
  50   1    | 1.4     63.4    | 0.003069  0.002965   | 0.0444318  0.044884
  50   0.1  | 1.8     73.9    | 0.000092  0.000092   | 0.0076071  0.007644
  200  5    | 7.5     253.9   | 0.031188  0.022350   | 0.1382585  0.116308
  200  1    | 6.3     219.5   | 0.004751  0.004590   | 0.0560816  0.055067
  200  0.1  | 5.9     219.3   | 0.000081  0.000081   | 0.0073801  0.007379
  500  5    | 18.7    565.7   | 0.038350  0.033724   | 0.1563062  0.146293
  500  1    | 17.8    544.8   | 0.004270  0.004192   | 0.0516373  0.050642
  500  0.1  | 17.1    540.3   | 0.000090  0.000090   | 0.0074950  0.007505

Table 2: Experiment 2. First order conditionally normal nonlinear dynamic model. The number of degrees of freedom is kept at ν = 10. True values for β and τ² are 0.95 and 0.01. Prior distributions for β|τ², σ² and τ² are N(β₀, τ²), IG(a, b) and IG(c, d), respectively, where β₀ is defined below, a = c = 10, b = (a − 1)σ²_true and d = (c − 1)τ²_true. Similarly, x₀ ∼ N(d₀, 100), with d₀ defined below. SMC is based on 5,000 particles, with initial draws of β, σ², τ² and x₀ taken from the priors. MCMC is based on 5,000 draws after 5,000 burn-in, with the initial values of β, σ² and τ² set at the true values. The standard deviation of the random-walk proposal for the states x is ξ = 0.01. The first 20 observations are used to create the moving averages ỹ_t = (y_{t−2} + y_{t−1} + y_t + y_{t+1} + y_{t+2})/5 and x̃_t = ỹ_{t−1}, for t = 3, ..., 18, and to run the regression ỹ_t ∼ N(β x̃_t, σ̃²). Then β₀ = β̂ and d₀ is the mean of the ỹ_t.

  T    σ/τ  | Time (sec.)     | MSE                  | MAE
            | SMC     MCMC    | SMC       MCMC       | SMC       MCMC
  50   5    | 3.4     30.6    | 0.021718  0.022716   | 0.119374  0.114798
  50   1    | 2.7     30.4    | 0.005046  0.005994   | 0.058300  0.061849
  50   0.1  | 2.3     30.2    | 0.000080  0.000080   | 0.007538  0.007346
  200  5    | 13.1    121.5   | 0.033697  0.024991   | 0.147363  0.120748
  200  1    | 14.0    121.1   | 0.005901  0.006270   | 0.061113  0.061762
  200  0.1  | 13.3    119.7   | 0.000135  0.000134   | 0.008730  0.008644
  500  5    | 30.3    378.5   | 0.025721  0.022716   | 0.127526  0.119456
  500  1    | 35.8    441.8   | 0.005177  0.005712   | 0.056890  0.060391
  500  0.1  | 32.1    437.8   | 0.000122  0.000123   | 0.008661  0.008705

Figure 1: Filtering and Smoothing for States. Top panel (λ): true (red), filtered expected value (black) and smoothed expected value (blue). Bottom panel (x): true (red), filtered expected value (black) and smoothed expected value (blue).

Figure 2: Parameter Learning. The black lines represent the filtered posterior mean of each parameter.
The blue lines delimit a 95% credibility interval and the red line denotes the true parameter value.

Figure 3: Nonlinear CGDM. The first three panels present the mean, the 95% credibility interval and the true value (red line) in the sequential learning of the parameters. The last panel shows a scatter plot of the true state x against its filtered posterior means.

Figure 4: Experiment 1. Contour plots of the true posterior p(σ², τ²|y^T) and draws from PL (a, c) and FFBS (b, d). The blue dots represent the true parameter values in all cases. In (a, b), T = 50; in (c, d), T = 500.

Figure 5: Experiment 2. Effective sample size from PL (left) and single-move MCMC (right). PL is based on 10,000 particles and MCMC on 10,000 iterations after 5,000 burn-in.

Figure 6: Experiment 3. PL draws (blue dots) from the posterior p(x_0, x_1|y) in Experiment 3 of Section 5. The red contours represent the true posterior.

Figure 7: Experiment 4. Risk ratios for the estimation of percentiles of the distribution p(σ²|y^t).

Figure 8: Experiment 4. The left panel shows the sequential estimate of the mean and 95% credible regions for σ². The right panel displays the true versus estimated p(σ²|y^1000). The horizontal line (left) and the black dot (right) represent the true value σ² = 0.1.

Figure 9: Experiment 4. True versus estimated p(σ²|y^T) for different values of T.

Figure 10: Experiment 4. Sequential estimates of the mean and 95% credible regions for the conditional sufficient statistics of σ². The black solid lines represent the true values.

Figure 11: Experiment 5. The left panels show the sequential estimates of the mean and 95% credible regions for all parameters. The right panels display the MCMC versus PL estimates of p(θ|y^T). The horizontal lines (left) and the black dots (right) represent the true value of each parameter.