An information-theoretic analysis of resampling in sequential Monte Carlo

by

Jonathan H. Huggins

B.A., Columbia University (2012)

Submitted to the Department of Electrical Engineering and Computer Science on May 8, 2014, in partial fulfillment of the requirements for the degree of Master of Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY, June 2014.

© Jonathan H. Huggins, MMXIV. All rights reserved. The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, May 8, 2014

Certified by: Professor Joshua B. Tenenbaum, Professor of Computational Cognitive Science, Thesis Supervisor

Accepted by: Professor Leslie A. Kolodziejski, Chairman, Department Committee on Graduate Theses

Abstract

Sequential Monte Carlo (SMC) methods form a popular class of Bayesian inference algorithms. While originally applied primarily to state-space models, SMC is increasingly being used as a general-purpose Bayesian inference tool. Traditional analyses of SMC algorithms focus on their usage for approximating expectations with respect to the posterior of a Bayesian model. However, these algorithms can also be used to obtain approximate samples from the posterior distribution of interest.
We investigate the asymptotic and non-asymptotic properties of SMC from this sampling viewpoint. Let P be a distribution of interest, such as a Bayesian posterior, and let P̂ be a random estimator of P generated by an SMC algorithm. We study P̄ ≜ E[P̂], i.e., the law of a sample drawn from P̂, as the number of particles tends to infinity. We give convergence rates of the Kullback-Leibler divergence KL(P‖P̄) as well as necessary and sufficient conditions for the resampled version of P̄ to asymptotically dominate the non-resampled version from this KL divergence perspective. Versions of these results are given for both the full joint and the filtering settings. In the filtering case we also provide time-uniform bounds under a natural mixing condition. Our results open up the possibility of extending recent analyses of adaptive SMC algorithms for expectation approximation to the sampling setting.

Thesis Supervisor: Professor Joshua B. Tenenbaum
Title: Professor of Computational Cognitive Science

Acknowledgments

During the first year and a half of my PhD at MIT, I have been fortunate enough to have worked with, and been inspired by, numerous people on many projects. First, I would like to express my deep thanks to my advisor, Josh Tenenbaum, who has indulged my varied, and sometimes aimless, interests. He has been instrumental in shaping my thinking about what constitutes the most interesting problems in the field of machine learning. Yet he has also graciously allowed me the freedom to explore and discover which problems are the most exciting to me, and in which areas I can have the most impact. It was his inquiries that instigated the research presented in this thesis. Special thanks go to Dan Roy, who was my close collaborator on this work. Without Dan’s innumerable insights and invaluable suggestions, this project would never have come to fruition.
However, even more important than any particular contributions Dan made to this project are the ways he has helped me to become a far more effective theoretician and precise mathematical thinker. I hope to continue to learn from and follow his example. Thanks also to Vikash Mansinghka and Arnaud Doucet for their critical insights, suggestions, and support while this research was still in its formative stages. In particular, I would like to thank Vikash for suggesting (repeatedly, until I finally listened!) that we should study the expected value of the random measures produced by SIS and SIR. And thanks to Arnaud for recommending we investigate asymptotic stability properties via time-uniform bounds. Thanks also to Cameron Freer, Peter Krafft, Tejas Kulkarni, and Andy Miller for reading drafts of various versions of this work. Finally, I would like to thank my family and, in particular, my wonderful and brilliant wife, Diana, for their love and support. This research was conducted with U.S. Government support under contract FA9550-11-C-0028, awarded by the DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.

Contents

1 Introduction
  1.1 Summary of Contributions
2 Sequential Monte Carlo
  2.1 Importance Sampling
    2.1.1 IS for Variance Reduction and Sampling
    2.1.2 The KL Divergence Perspective on IS
  2.2 Sequential Importance Sampling with and without Resampling
    2.2.1 SIS and SIR for Variance Reduction and Sampling
3 Main Results
  3.1 Convergence Rates
    3.1.1 Rates for the Filtering Distribution
  3.2 Time-uniform Bounds
  3.3 Comparing SIS and SIR
4 Conclusions and Future Work
  4.1 Other Convergence Rates for SMC
    4.1.1 Lp Error Bounds
    4.1.2 KL Divergence Bounds
  4.2 Adaptive Resampling and αSMC
  4.3 Global Parameter Estimation in State-space Models
A Auxiliary Results
  A.1 Technical Lemmas
  A.2 Auxiliary SMC Results
B Proofs of SIS and SIR Comparison Theorems
  B.1 Proof of Theorem 3.3.1
  B.2 Proof of Theorem 3.3.2

Chapter 1

Introduction

Sequential Monte Carlo (SMC) methods are a widely-used class of algorithms for approximate inference in Bayesian models [11, 14, 15, 16, 20, 21]. The SMC approach is attractive because it provides a flexible, efficient, and relatively simple solution to the problem of computing estimates of expectation functionals when the underlying distribution is analytically intractable, as is often the case for posterior distributions arising in Bayesian analysis. In the case of time-series models, SMC (which in the time-series context is commonly called particle filtering) also provides a method for performing fast online inference, which is critical in many real-world applications such as robotics and tracking [2, 14, 18, 27]. A direct precursor to SMC was the importance sampling algorithm. The motivation for developing importance sampling was to produce estimators of expectation functionals with smaller variance than the standard Monte Carlo estimators [14, 17].
Most analyses of SMC continue to adopt this functional approximation perspective (or what we shall refer to as the operator perspective), attempting to quantify the performance of SMC algorithms not only in terms of asymptotic variance, but more generally by bounding approximation error [cf. 4, 9, 14, 16]. Two canonical SMC algorithms are sequential importance sampling (SIS) and sampling importance resampling (SIR). Like all SMC algorithms, SIS and SIR approximate a distribution P by a discrete, random distribution P̂ formed from a collection of N weighted samples called particles. The difference between SIS and SIR can be understood as follows. In SIS, the particles are constructed incrementally and independently of each other. Information about the particles is only combined at the final step of the algorithm. SIR introduces resampling steps: during resampling, particles with large weights are likely to be duplicated, while those with extremely small weights might disappear altogether; after resampling, all particles are given equal weight. The original motivation for introducing the resampling variant of SIS was to prevent weight degeneracy [14, 24, 25]. Weight degeneracy arises because, when using the SIS algorithm, even after only a modest number of steps, a single particle may have vastly more weight than all the other particles combined. If this occurs, then the SIS approximation effectively consists of a single particle. Both theory and practice have shown that resampling provides more accurate estimates of expectations by reducing the variance of these estimates [9, 14, 16, 24, 29]. However, resampling is not always desirable. From the operator perspective, there are two reasons to avoid resampling. First, if the quality of the approximation is good, then, roughly speaking, resampling simply adds variance to the Monte Carlo estimate [4, 11]. Second, there is a computational price to pay for resampling.
The SIS algorithm is “embarrassingly parallel” because the particles evolve independently. Resampling, however, breaks this parallelism, potentially leading to a substantial increase in the effective computational requirements [22, 28, 29]. Because of the impact on computational efficiency, it is important to also understand how resampling affects inference quality when the goal is to approximate the underlying measure, not calculate an expectation. Developing such an understanding is the primary goal of this work.

1.1 Summary of Contributions

As noted above, previous theoretical investigations of SIS and SIR have primarily taken the operator viewpoint (or operator perspective), assessing the quality of P̂ when used to approximate the expectation operator E_P[·] by E_P̂[·]. In this thesis we take the measure viewpoint (or measure perspective), focusing instead on the properties of SIS and SIR when employed to produce samples approximately distributed according to P.

Figure 1-1: A cartoon depiction of the SIS and SIR algorithms over several time steps: (a) SIS; (b) SIR. The sizes of the particles indicate their relative weights at each time step. Note that in the SIS case, most of the weight becomes concentrated on a single particle.

More precisely, we investigate the mean of P̂, denoted P̄, which can also be understood as the (marginal) distribution of a single sample drawn from P̂. We use the Kullback-Leibler (KL) divergence from P to P̄ to measure how far P̄ is from P. Our motivation is twofold. First, we seek to better understand the quality of SMC algorithms when the object which we wish to approximate is the distribution P itself, as opposed to an expectation. The second goal is to understand how SIS and SIR compare to each other from the measure perspective, in a manner analogous to the way asymptotic variance allows for comparison of algorithms from the operator perspective.
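The weight concentration depicted in Figure 1-1(a) is easy to reproduce numerically. The following is a minimal sketch (not from the thesis) assuming a hypothetical product-form target that is standard normal at each step, with a deliberately overdispersed N(0, 1.5²) proposal at each step; the largest normalized weight rapidly approaches 1 as the number of steps grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def sis_max_weight(T, N, sigma_q=1.5):
    """Run plain SIS for T steps on a standard-normal-per-step target using
    a wider N(0, sigma_q^2) proposal; return the largest normalized weight,
    a standard diagnostic for weight degeneracy."""
    log_w = np.zeros(N)
    for _ in range(T):
        x = rng.normal(0.0, sigma_q, size=N)   # propose from Q_t
        # incremental log-weight: log p_t(x) - log q_t(x) (shared constants cancel)
        log_w += -0.5 * x**2 + 0.5 * (x / sigma_q)**2 + np.log(sigma_q)
    w = np.exp(log_w - log_w.max())            # stabilize before normalizing
    w /= w.sum()
    return w.max()

for T in [1, 5, 25, 100]:
    print(T, sis_max_weight(T, N=1000))        # tends toward 1 as T grows
```

Even with N = 1000 particles, after roughly a hundred steps a single particle typically carries essentially all of the normalized weight, which is exactly the degeneracy that motivates resampling.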
Our first main result is KL divergence convergence rates for both SIS and SIR as the number of particles N tends to infinity. We show that if the variances (with respect to the appropriate distributions) of the particle weights are finite, then SIS and SIR converge at a 1/N rate. The constants in these rates are shown to be asymptotically tight, leading to our second main result, which gives necessary and sufficient conditions for SIR to asymptotically dominate SIS from the KL divergence perspective. In practice, SMC methods are often applied to state-space models to approximate the marginal distribution of the hidden state at the most recent time. We give analogues to our first two main results in this filtering case as well. Finally, for the filtering case, we also give time-uniform bounds on the KL divergence for SIR under a natural mixing condition. Our results provide analogues to a number of classical asymptotic and non-asymptotic SMC analyses, which all apply in the case of deterministic resampling. It is common, however, for practitioners to use the effective sample size (ESS) criterion to adaptively determine whether to resample at a particular step of the algorithm [23, 24, 25]. The traditional arguments for employing the ESS criterion were heuristic, though recently Whiteley, Lee, and Heine [29] provided a rigorous justification from the operator viewpoint for the use of ESS. We are hopeful that the work presented here provides a framework for deriving results from the measure viewpoint analogous to those of Whiteley et al., though with an appropriately modified notion of effective sample size. The remainder of the thesis is organized as follows. Chapter 2 begins by considering the simpler case of importance sampling (Section 2.1), before formally defining SIS and SIR (Section 2.2). Our main results are presented in Chapter 3.
Chapter 4 concludes with a discussion of previous research on the convergence properties of SMC, connections to our results, and speculation on important directions for future work.

Chapter 2

Sequential Monte Carlo

2.1 Importance Sampling

To provide intuition for our main SIS and SIR results and to establish some notation, we begin by considering estimators arising from importance sampling (IS). Let P and Q be probability measures on a measurable space X. The goal is to form an estimate of the target distribution P when we are only able to sample from the proposal distribution Q. Assume that, for all measurable sets A ⊆ X, we have Q(A) = 0 =⇒ P(A) = 0, i.e., P is absolutely continuous with respect to Q, written P ≪ Q, and so there exists a Radon-Nikodym derivative of P with respect to Q, denoted by w ≜ dP/dQ, which satisfies ∫ φ dP = ∫ φ w dQ for every measurable function φ. We will refer to w as the weight function. The importance sampling algorithm is very simple:

Algorithm 1 Importance Sampling
for n = 1, . . . , N do
    sample particle Xn ∼ Q
end for
Form the importance sampling estimator

    P̂^I ≜ Σ_{n=1}^N [w(Xn) / Σ_{k=1}^N w(Xk)] δ_{Xn}

2.1.1 IS for Variance Reduction and Sampling

Importance sampling was originally designed with the operator perspective in mind as a variance reduction technique. Let Bb(X) be the set of all measurable bounded real functions on X. For a measure ν and function φ, write ν(φ) ≜ E_ν[φ] for the expected value of φ w.r.t. ν. Consider the task of approximating the expectation φ̄ = P(φ) for some φ ∈ Bb(X). Given X1, . . . , XN ∼ P i.i.d., the standard Monte Carlo (MC) estimator for φ̄ is φ̄_MC ≜ (1/N) Σ_{n=1}^N φ(Xn). Letting =⇒ denote convergence in distribution, the MC estimator satisfies the central limit theorem (CLT)

    √N (φ̄_MC − φ̄) =⇒ N(0, σ²_MC) as N → ∞,

with asymptotic variance (AV) equal to the variance of φ w.r.t. P, i.e., σ²_MC = E_P[(φ − φ̄)²].
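Algorithm 1 is only a few lines of code. The sketch below (a hypothetical worked example, not from the thesis) forms the self-normalized estimator P̂^I for a target P = N(1, 1) with proposal Q = N(0, 2²), working with log-weights for numerical stability, and uses it to estimate E_P[X]:

```python
import numpy as np

rng = np.random.default_rng(1)

def importance_sample(log_w_fn, proposal_sampler, N):
    """Algorithm 1: draw N particles from Q and return them together with
    self-normalized weights, representing P-hat^I = sum_n w_n * delta_{X_n}."""
    x = proposal_sampler(N)
    log_w = log_w_fn(x)
    w = np.exp(log_w - log_w.max())   # subtract the max for numerical stability
    return x, w / w.sum()

# Hypothetical example: target P = N(1, 1), proposal Q = N(0, 2^2).
# log w(x) = log p(x) - log q(x), dropped constants cancel under self-normalization.
log_w_fn = lambda x: -0.5 * (x - 1.0) ** 2 + 0.5 * (x / 2.0) ** 2
x, w = importance_sample(log_w_fn, lambda n: rng.normal(0.0, 2.0, n), N=10_000)
print(np.sum(w * x))   # self-normalized IS estimate of E_P[X] = 1
```

Note that the multiplicative constant of w is irrelevant here because the weights are normalized, which is the same point made in Remark 4 below.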
The IS estimate of φ̄,

    φ̄_I ≜ P̂^I(φ) = Σ_{n=1}^N w(Xn) φ(Xn) / Σ_{n=1}^N w(Xn),

on the other hand, satisfies the CLT

    √N (φ̄_I − φ̄) =⇒ N(0, σ²_I) as N → ∞,

with AV σ²_I = E_P[(φ − φ̄)² w]. For a fixed φ and an appropriate choice of Q, it is possible to have σ²_I ≪ σ²_MC, making IS a superior choice to standard Monte Carlo [17, 26]. However, importance sampling is no longer used only for the estimation of integrals. For example, Del Moral, Doucet, and Jasra [11] give a unifying framework for using SMC methods to obtain, in a variety of scenarios, samples that are approximately distributed according to a measure of interest. Also, recently developed particle Markov chain Monte Carlo methods aim to combine the best features of SMC and MCMC approaches by using SMC as a proposal mechanism for a Metropolis-Hastings or approximate Gibbs sampler [1, 19]. In light of such alternative uses for SMC, consider a sample X | P̂^I ∼ P̂^I obtained from an IS estimator. Then X has a marginal distribution, which we denote by P̄^I, that is known to approach P as N → ∞. The quantity P̄^I (along with its SMC variants) will be the key quantity of interest in our study. We will seek to characterize how well P̄^I (and its SMC variants) approximates P when a finite number of particles are used. A very useful equivalent definition for P̄^I is that it is the expected value of P̂^I: P̄^I ≜ E[P̂^I]. Formally, the measure E[P̂^I] is given by (E[P̂^I])(A) ≜ E[P̂^I(A)] for every measurable A ⊆ X. Note that since the distribution of X involves marginalizing over P̂^I, it has the same support as P. In particular, P̄^I and P are absolutely continuous with respect to each other. Fig. 2-1a gives an example of P, Q, and w in the case that X = R and P and Q have, respectively, densities p and q with respect to Lebesgue measure. Fig. 2-1b shows an example of an importance sampling estimate with N = 20 particles. Fig. 2-2 shows the density of P̄^I, along with p and q, for N = 4, 8, 16, 32 particles.
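The measure-perspective quantity P̄^I can itself be probed by simulation: each run of IS produces one realization of P̂^I, and drawing a single point from that realization yields a sample whose marginal law is exactly P̄^I. A hypothetical sketch (Gaussian target P = N(1, 1) and proposal Q = N(0, 2²), not from the thesis) shows the small-N bias toward Q dissolving as N grows:

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_from_IS(N):
    """Build P-hat^I from N particles and draw one point from it; the
    marginal law of the returned point is P-bar^I. Hypothetical choice:
    target P = N(1, 1), proposal Q = N(0, 2^2)."""
    x = rng.normal(0.0, 2.0, N)
    log_w = -0.5 * (x - 1.0) ** 2 + 0.5 * (x / 2.0) ** 2
    w = np.exp(log_w - log_w.max())
    return rng.choice(x, p=w / w.sum())

for N in [4, 32, 256]:
    draws = np.array([draw_from_IS(N) for _ in range(2000)])
    # the sample mean moves from near E_Q[X] = 0 toward E_P[X] = 1 as N grows
    print(N, draws.mean())
```

Repeating the draw many times gives an empirical picture of P̄^I of the kind shown in Fig. 2-2.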
Informally, note that for a “small” number of particles, P̄^I is strongly “biased” toward the proposal distribution Q: for N not too large, with non-trivial probability all the samples from Q will be in a region of low P-probability. Hence, all the weights will be small (≪ 1). But in order to form the probability measure P̂^I the sum of the weights is normalized to 1, creating a bias toward regions of high Q-probability.

Figure 2-1: An example in which X = R and P and Q have, respectively, densities p and q with respect to Lebesgue measure. (a) Plots of the densities p and q, and the weight function w = p/q. (b) An example of an IS estimate P̂^I with N = 20 particles. The heights of the lines indicate the weights of the particle samples from Q.

However, as N increases, the probability of producing a sample from Q in a region of high P-probability (and thus with a large weight) increases, which induces a better approximation to P.

2.1.2 The KL Divergence Perspective on IS

To measure the discrepancy between P̄^I and P we use KL divergence [6], which is a natural information-theoretic measure of closeness between two probability measures. Section 2.2 provides further discussion of our choice of KL divergence. For all measures µ ≪ ν, the KL divergence from µ to ν is given by

    KL(µ‖ν) ≜ E_µ[log(dµ/dν)]. (2.1)

Under the conditions on P and Q given above, we have the following result:

Theorem 2.1.1. For the IS algorithm,

    KL(P‖P̄^I) ≤ log(1 + Var_Q[w]/N) ≤ Var_Q[w]/N. (2.2)

Hence, KL(P‖P̄^I) = O(1/N) when the variance of w is finite.

Remark 1. The theorem applies in the case where Q is chosen adaptively [26] via, for example, the population Monte Carlo algorithm [3, 13].

Figure 2-2: The expected IS distribution P̄^I for N = 4, 8, 16, 32 particles. For small numbers of particles P̄^I is strongly biased toward the proposal distribution Q. The density of P̄^I in each plot was approximated using a kernel density estimate.

Remark 2. The theorem shows that Var_Q[w] measures how much “bias” the use of the proposal distribution Q introduces into the importance sampler and thus how many particles are required to remove “most” of the bias: once N = Var_Q[w]/C, the KL divergence from P to P̄^I is at most the constant log(1 + C).

The key to bounding the KL divergence in the IS case, as well as in the SIS and SIR cases considered later, is to upper bound the derivative term inside the log, which for IS is dP/dP̄^I. To obtain such a bound, we first derive an explicit expression for dP̄^I/dP, which can then be lower bounded. The following technical lemma will repeatedly prove useful:

Lemma 2.1.2. Let ψ be a measurable function and let µ be a probability measure on the space Ω. If

    ν ≜ E_{X∼µ}[ψ(X) δ_X], (2.3)

then ν ≪ µ and ψ is a version of dν/dµ.

Proof. Since for all measurable A ⊆ Ω

    ν(A) = ∫_Ω ψ(x) δ_x(A) µ(dx) = ∫_Ω ψ(x) 1_A(x) µ(dx) = ∫_A ψ(x) µ(dx),

ψ is a version of the Radon-Nikodym derivative dν/dµ.

With this result in hand, obtaining an expression for dP̄^I/dP is straightforward:

Lemma 2.1.3. For the IS algorithm, P̄^I ≪ P and

    (dP̄^I/dP)(x) = E[ N / Σ_{n=1}^N w(Xn) | X_N = x ]. (2.4)

Proof. Since

    P̄^I = E[ Σ_{n=1}^N (w(Xn) / Σ_{k=1}^N w(Xk)) δ_{Xn} ]
        = N E[ (w(X_N) / Σ_{k=1}^N w(Xk)) δ_{X_N} ]
        = ∫ E[ N / Σ_{k=1}^N w(Xk) | X_N = x ] δ_x dP(x),

the result follows from Lemma 2.1.2.

Proof of Theorem 2.1.1. By Lemma 2.1.3 and Jensen’s inequality

    (dP̄^I/dP)(x) = E[ N / Σ_{n=1}^N w(Xn) | X_N = x ] ≥ N / E[ Σ_{n=1}^N w(Xn) | X_N = x ] = N / (N − 1 + w(x)).

Therefore, Lemma A.2.1 implies that dP/dP̄^I = (dP̄^I/dP)^{−1}, which together with Jensen’s inequality yields

    KL(P‖P̄^I) = E_{X∼P}[ log (dP/dP̄^I)(X) ]
              ≤ E_{X∼P}[ log ((N − 1 + w(X))/N) ]
              ≤ log E_{X∼P}[ (N − 1 + w(X))/N ]
              = log E_{X∼Q}[ (N − 1 + w(X)²)/N ]
              = log(1 + Var_Q[w]/N).
Recall that the total variation distance between measures µ and ν is given by d_TV(µ, ν) = sup_{A⊆X} |µ(A) − ν(A)|. The following corollary to Theorem 2.1.1 is immediate from Pinsker’s inequality [6]:

Corollary 2.1.4.

    d_TV(P, P̄^I) ≤ √[ (1/2) log(1 + Var_Q[w]/N) ] ≤ √[ Var_Q[w]/(2N) ]. (2.5)

Remark 3. If P and Q are absolutely continuous with respect to each other, the variance of the weights Var_Q[w] is actually equal to the χ² distance d_{χ²}(P, Q). Let λ be a dominating measure for P and Q and let p = dP/dλ and q = dQ/dλ. Then

    d_{χ²}(P, Q) = ∫ (p − q)²/q dλ = ∫ p²/q dλ − 1 = ∫ (p/q)² dQ − 1 = Var_Q[w].

Taking N = 1, we have KL(P‖P̄^I) = KL(P‖Q), and so Theorem 2.1.1 leads to the classic inequality

    KL(P‖Q) ≤ log(1 + d_{χ²}(P, Q)). (2.6)

Remark 4. In the preceding discussion we have assumed that w can be computed exactly, whereas in practice it may only be possible to compute w* = cw. If w can only be calculated up to a constant, the above results still hold since

    P̂^I = Σ_{n=1}^N [w*(Xn) / Σ_{k=1}^N w*(Xk)] δ_{Xn} = Σ_{n=1}^N [w(Xn) / Σ_{k=1}^N w(Xk)] δ_{Xn}

as before. Hence, throughout we will assume without loss of generality that w can be computed exactly.

2.2 Sequential Importance Sampling with and without Resampling

In this section we present a general formulation of SIS and SIR, rather than one couched in the language of a Bayesian state-space model. Let W, Y, Z be measurable spaces and let K1(w, dy) and K2(w, y, dz) be probability kernels from W to Y and from W × Y to Z, respectively. The kernel product (K1 ⊗ K2)(w, dy × dz) is the probability kernel from W to Y × Z given by

    (K1 ⊗ K2)(w, B × C) = ∫_B K2(w, y, C) K1(w, dy)

for every measurable B ⊆ Y and C ⊆ Z. Let X1, X2, . . . , XT be a sequence of measurable spaces, let X_{s:t} ≜ X_s × X_{s+1} × · · · × X_t, let X_{(t)} ≜ X_{1:t}, and let X ≜ X_{(T)} be the full product space of interest.
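Before moving to the sequential setting, the identity in Remark 3 and inequality (2.6) are easy to check numerically. The sketch below (a hypothetical Gaussian pair, not from the thesis) compares the closed-form KL(P‖Q) against log(1 + Var_Q[w]), with Var_Q[w] = d_{χ²}(P, Q) estimated by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pair: P = N(1, 1), Q = N(0, 2^2), so w = dP/dQ.
mu, s_p, s_q = 1.0, 1.0, 2.0

# Closed-form KL(P||Q) for two Gaussians.
kl = np.log(s_q / s_p) + (s_p**2 + mu**2) / (2 * s_q**2) - 0.5

# Monte Carlo estimate of Var_Q[w] = chi^2(P, Q) (Remark 3): sample from Q,
# evaluate the exact density ratio w(x) = p(x)/q(x), and take the variance.
x = rng.normal(0.0, s_q, 1_000_000)
w = np.exp(-0.5 * (x - mu) ** 2 + 0.5 * (x / s_q) ** 2) * (s_q / s_p)
var_w = w.var()

print(kl, np.log1p(var_w))   # inequality (2.6): kl <= log(1 + Var_Q[w])
```

For this pair the exact value is Var_Q[w] = (4/√7)e^{1/7} − 1 ≈ 0.744, and indeed KL(P‖Q) ≈ 0.443 ≤ log(1.744) ≈ 0.557.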
Let P ≜ P1 ⊗ P2 ⊗ · · · ⊗ PT be the distribution of interest over X, where P1 is a probability measure on X1 and, for 1 ≤ t ≤ T − 1, P_{t+1} is a probability kernel from X_{(t)} to X_{t+1}. Define P_{s:t} ≜ P_s ⊗ · · · ⊗ P_t to be the probability kernel from X_{(s−1)} to X_{s:t} and let P_{(t)} ≜ P_{1:t}, so P = P_{(T)}.¹

Both SIS and SIR construct approximations to each distribution P_{(1)}, P_{(2)}, . . . , P_{(T)} in turn, and use earlier approximations to produce later ones. Like IS, these approaches make use of an importance distribution Q with P ≪ Q, but decompose it into stages Q ≜ Q1 ⊗ · · · ⊗ QT. This decomposition induces a corresponding sequence of (conditional) weight functions

    w1(x1 | ⟨⟩) ≜ w1(x1) ≜ (dP1/dQ1)(x1), (2.7)

    wt(xt | x_{(t−1)}) ≜ (dPt(x_{(t−1)}, ·)/dQt(x_{(t−1)}, ·))(xt). (2.8)

Letting w_{(t)} ≜ w_{1:t} ≜ ∏_{s=1}^t ws, we have w = w_{(T)}. Here and throughout, xt will denote a point in Xt. Write x_{s:t} ≜ ⟨xs, x_{s+1}, . . . , xt⟩ and x_{(t)} ≜ x_{1:t}.

¹ We have abused notation slightly here as P_{1:t} (and hence P_{(t)} ≜ P_{1:t}) is not a probability kernel, though it could be made one by introducing the one-point space X_{(0)}.

SIS is simply a reformulation of IS in the case of a product space and operates by propagating a collection of N particles X_{(t)} = {X^n_{(t)}}_{n=1}^N with corresponding nonnegative weights W_{(t)} = {W^n_{(t)}}_{n=1}^N. The distribution P_{(t)} is then approximated by

    P̂^S_{(t)} = Σ_{n=1}^N [W^n_{(t)} / Σ_{k=1}^N W^k_{(t)}] δ_{X^n_{(t)}}. (2.9)

The SIS algorithm for generating the particles and the weights is as follows:

Algorithm 2 Sequential Importance Sampling
for n = 1, . . . , N do
    Sample particle X^n_1 ∼ Q1
    X^n_{(1)} ← X^n_1
    Set weight W^n_{(1)} ← w1(X^n_1)
end for
for t = 2, . . . , T do
    for n = 1, . . . , N do
        Sample next particle X^n_t | X_{(t−1)} ∼ Qt(X^n_{(t−1)}, ·)
        X^n_{(t)} ← (X^n_{(t−1)}, X^n_t)
        Update weight W^n_{(t)} ← W^n_{(t−1)} · wt(X^n_t | X^n_{(t−1)})
    end for
end for

In practice, the SIS procedure typically suffers from degeneracy, even when there are only a few dimensions T. Specifically, for some k ∈ {1, . . .
, N}, the weight W^k_{(T)} is much larger than every other weight W^n_{(T)}, n ≠ k. A single weight therefore comes to dominate the others and the approximation takes the form P̂^S_{(T)} ≈ δ_{X^k_{(T)}} [16, 20]. The standard solution to the degeneracy problem is to include a resampling step, in which, after each iteration (or some subset of iterations), particles with large weights will tend to be duplicated and those with small weights will tend to be removed [16]. All particles are then given equal weight, so there is no longer such severe weight degeneracy. We analyze the simplest resampling scheme, called multinomial resampling. Sampling importance resampling (SIR) is identical to SIS except for a resampling step performed after each iteration. Let W_t = {W^n_t}_{n=1}^N denote weights for the particles X_{(t)} at time t. The SIR estimators P̂^R_{(t)} are defined in an analogous manner to the SIS estimators:

    P̂^R_{(t)} = Σ_{n=1}^N [W^n_t / Σ_{k=1}^N W^k_t] δ_{X^n_{(t)}}. (2.10)

The SIR algorithm for generating the particles and the weights is as follows:

Algorithm 3 Sampling Importance Resampling
for n = 1, . . . , N do
    Sample particle X^n_1 ∼ Q1
    X^n_{(1)} ← X^n_1
    Set weight W^n_1 ← w1(X^n_1)
end for
for t = 2, . . . , T do
    for n = 1, . . . , N do
        Resample particle X̃^n_{(t−1)} | W_{t−1}, X_{(t−1)} ∼ P̂^R_{(t−1)}
        Sample next particle X^n_t | X̃_{(t−1)} ∼ Qt(X̃^n_{(t−1)}, ·)
        X^n_{(t)} ← (X̃^n_{(t−1)}, X^n_t)
        Set weight W^n_t ← wt(X^n_t | X̃^n_{(t−1)})
    end for
end for

Write P̂^S ≜ P̂^S_{(T)} and P̂^R ≜ P̂^R_{(T)} for, respectively, the SIS estimator and the SIR estimator of the full distribution P = P_{(T)}.

2.2.1 SIS and SIR for Variance Reduction and Sampling

As with importance sampling, the performance of SIS and SIR can be viewed from an operator perspective or a measure perspective. The majority of previous work has focused on the former, where for a test function φ ∈ Bb(X), P̂^S(φ) or P̂^R(φ) is used as an estimate of the expectation φ̄ ≜ P(φ).
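Algorithms 2 and 3 differ by a single multinomial resampling step. The sketch below (a simplified, hypothetical instance, not from the thesis) implements both, specializing to the common case where Q_t and w_t depend only on the most recent coordinate and treating the first step as proposing from "x_0 = 0"; log-weights are used to avoid underflow:

```python
import numpy as np

rng = np.random.default_rng(4)

def sis(T, N, sample_q, log_w):
    """Algorithm 2 (SIS): propagate particles and accumulate log-weights."""
    X = np.empty((N, T))
    x_prev = np.zeros(N)                       # convention: "x_0" = 0
    logW = np.zeros(N)
    for t in range(T):
        X[:, t] = sample_q(t, x_prev)
        logW += log_w(t, X[:, t], x_prev)
        x_prev = X[:, t]
    W = np.exp(logW - logW.max())
    return X, W / W.sum()                      # normalized weights, as in (2.9)

def sir(T, N, sample_q, log_w):
    """Algorithm 3 (SIR): multinomial resampling before each propagation,
    after which all particles get equal weight."""
    X = np.empty((N, T))
    x_prev = np.zeros(N)
    logW = np.zeros(N)
    for t in range(T):
        if t > 0:
            W = np.exp(logW - logW.max()); W /= W.sum()
            idx = rng.choice(N, size=N, p=W)   # multinomial resampling
            X[:, :t] = X[idx, :t]              # duplicate / drop whole paths
            x_prev = X[:, t - 1]
            logW = np.zeros(N)                 # weights reset after resampling
        X[:, t] = sample_q(t, x_prev)
        logW += log_w(t, X[:, t], x_prev)
        x_prev = X[:, t]
    W = np.exp(logW - logW.max())
    return X, W / W.sum()

# Hypothetical Gaussian random walk: P_t = N(x_{t-1}, 1), Q_t = N(x_{t-1}, 1.5^2).
sq = 1.5
sample_q = lambda t, xp: xp + sq * rng.standard_normal(xp.shape)
log_w = lambda t, x, xp: -0.5 * (x - xp) ** 2 + 0.5 * ((x - xp) / sq) ** 2 + np.log(sq)

X_s, W_s = sis(T=20, N=500, sample_q=sample_q, log_w=log_w)
X_r, W_r = sir(T=20, N=500, sample_q=sample_q, log_w=log_w)
print(W_s.max(), W_r.max())   # SIS weights are typically far more degenerate
```

After 20 steps the final SIR weights involve only the last conditional weight w_t, so their largest normalized value stays modest, while the accumulated SIS weights concentrate heavily on a few particles.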
Since SIS is an instantiation of IS, they share the same CLT

    √N (P̂^S(φ) − φ̄) =⇒ N(0, σ²_S) as N → ∞,

where σ²_S = E_P[(φ − φ̄)² w]. A CLT also holds for SIR [4, 16]:

    √N (P̂^R(φ) − φ̄) =⇒ N(0, σ²_R) as N → ∞. (2.11)

See Chopin [4] for an explicit expression for σ²_R. AV provides one method for comparing the efficiency of SIS and SIR. If σ²_R < σ²_S, then SIR is, in the AV sense, superior to SIS: asymptotically, the expected L2 error of the SIR estimator for φ̄ will be smaller than that of the SIS estimator. As described for IS in Section 2.1, however, our concern will be with the measure perspective, when P̂^S and P̂^R are used to produce samples that are approximately distributed according to P. To determine how far the distribution of a sample from P̂^S (or P̂^R) is from P, we must therefore study the (marginal) expected estimators P̄^S ≜ E[P̂^S] and P̄^R ≜ E[P̂^R], which are directly analogous to P̄^I. We are only aware of a small amount of work investigating the properties of P̄^S or P̄^R. Del Moral [9] gives an upper bound on the total variation distance between P̄^R and P,

    d_TV(P̄^R, P) ≤ c/N, (2.12)

and a bound on the KL divergence from P̄^R to P,

    KL(P̄^R‖P) ≤ c′/N. (2.13)

To the best of our knowledge, the quantities KL(P‖P̄^S) and KL(P‖P̄^R) have not been previously analyzed. By studying the asymptotic properties of KL(P‖P̄^S) and KL(P‖P̄^R), we aim to develop a criterion similar to AV that allows us to determine under what conditions SIR is (at least asymptotically) superior to SIS from the sampling perspective. Since our interest is in approximating P, it is, in a certain information-theoretic sense, more natural to study KL(P‖P̄^S) and KL(P‖P̄^R) than KL(P̄^S‖P) and KL(P̄^R‖P). This is because, for measures µ and ν, KL(µ‖ν) is the expected number of additional bits required to encode samples from µ when using a code for ν instead [6].
In other words, it is the amount of information lost by using samples from ν instead of samples from µ. Another reason to investigate the KL divergence from P to P̄^R (and to P̄^S) is that KL(P̄^R‖P) could be small even if P̄^R gave zero probability to a region with positive P-probability, whereas KL(P‖P̄^R) would be infinite in this case. Even in the less extreme scenario in which P̄^R puts small mass on a region with high P-probability, KL(P̄^R‖P) could be small whereas KL(P‖P̄^R) would be very large.

Chapter 3

Main Results

3.1 Convergence Rates

Our first result gives upper bounds on the KL divergences for the SIS and SIR expected estimators P̄^S and P̄^R that are analogous to the upper bound given in Theorem 2.1.1 for the IS expected estimator. These convergence results motivate our necessary and sufficient condition for SIR to be superior to SIS, which is given at the end of the chapter. The key quantities in these analyses are

    V ≜ V_T ≜ Var_{Q_{(T)}}[w_{(T)}]  and  V_t ≜ Var_{P_{(t−1)} ⊗ Q_t}[w_t], (3.1)

where for notational convenience we write P_{(0)} ⊗ Q1 instead of Q1.

Theorem 3.1.1. For the SIS and SIR algorithms,

    KL(P‖P̄^S) ≤ log(1 + V/N) ≤ V/N (3.2)

and

    KL(P‖P̄^R) ≤ Σ_{t=1}^T log(1 + V_t/N) ≤ (Σ_t V_t)/N + Θ(1/N²). (3.3)

Hence, KL(P‖P̄^S) = O(1/N) when V is finite and KL(P‖P̄^R) = O(1/N) when Σ_t V_t is finite.

Remark 5. Heuristically, the sum Σ_t V_t grows linearly in T since each term is the variance of the (conditional) weight from a single time step, while V grows exponentially with T since the (conditional) weights from each time step are being multiplied together. This behavior is similar to what practitioners observe empirically.

Remark 6. Intuitively, the performance of SIR depends on a sum of variances of the individual wt because SIR resets the particle weights after each time step, so the wt’s never “interact” with each other.
The performance of SIS, on the other hand, depends on the variance of the wt’s multiplied together because they are multiplied together in the SIS algorithm to give the final particle weights. Thus, the variance of the sum of the wt’s measures the performance of SIR while the variance of the product of the wt’s measures the performance of SIS.

Remark 7. It is worth reiterating (cf. Remark 2) that V (in the case of SIS) and Σ_t V_t (in the case of SIR) measure how much “bias” the use of the proposal distribution Q introduces into SIS/SIR, and thus how many particles are required to remove “most” of the bias. It is reasonable to ask for the KL divergence to be O(log T).¹ In the case of SIS, once N = V/(CT) particles are used, the KL divergence from P to P̄^S is at most log(1 + CT). Define V*_T ≜ sup_{1≤t≤T} V_t. Then for SIR, once N = V*_T T/(C′ log T) particles are used, the KL divergence from P to P̄^R is at most C′ log T. However, we expect that V = Θ(α^T) for some constant α and we might suppose that sup_t V_t < ∞. If both assumptions hold, then to achieve O(log T) KL divergence, we should expect to choose N = Ω(α^T/T) for SIS and N = Ω(T/log T) for SIR.

¹ An analogous discussion to that which follows could be carried out if we instead ask for the KL divergence to be a constant. In this case the required scale of N would only change by logarithmic factors.

As with importance sampling, we can use Pinsker’s inequality to bound the total variation distance:

Corollary 3.1.2.

    d_TV(P, P̄^S) ≤ √[ (1/2) log(1 + V/N) ] ≤ √[ V/(2N) ]

and

    d_TV(P, P̄^R) ≤ √[ (1/2) log(1 + (Σ_t V_t)/N) + Θ(N^{−2}) ] ≤ √[ (Σ_t V_t)/(2N) ] + Θ(N^{−2}).

Remark 8. The SIR convergence rate for TV distance given in the corollary is not optimal since, as noted in Section 2.2, Del Moral [9] shows that in fact d_TV(P, P̄^R) = O(1/N).

The proof of Theorem 3.1.1 follows the same strategy as that for Theorem 2.1.1: we obtain an explicit expression for dP̄^R_{(T)}/dP_{(T)}, which can then be lower bounded.

Lemma 3.1.3.
R dP̄(T ) dP(T ) NT (x(T ) ) ≥ QT − 1 + wt (xt | x1:t−1 )) t=1 (N . (3.4) The proof of Lemma 3.1.3 requires a pair of tedious inductive arguments, so we instead convey the key ideas by giving an expression for R dP̄(2) dP(2) and proving the lemma in the T = 2 case. The full proof is given in Appendix A. R P(2) , Lemma 3.1.4. For the SIR algorithm, P̄(2) # N X 1 = x1 , (3.5) N2 . (N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 )) (3.6) " R dP̄(2) N N (x(2) ) = E E dP(2) W1 W2 N X(2) = x(2) and R dP̄(2) dP(2) Proof. Let W t , (x(2) ) ≥ PN n=1 Wtn denote the sum of the SIR weights at time t. Since P̂2R = N X Wn 2 n=1 W2 n = δX(2) N n X w2 (X2n | X̃(1) ) n=1 24 W2 n , δX(2) we have " PN n n n=1 w2 (X2 | X̃(1) ) R P̄(2) =E W2 " " n δX(2) =NE N ) w2 (X2N | X̃(1) W2 # N δX(2) ## N k N X̃ =NE E δX(2) (1) = X(1) W1 W2 "k=1 N " ## N w1 (X(1) ) ) w2 (X2N | X̃(1) N 2 N N X̃ =N E E δX(2) (1) = X(1) W1 W2 # Z " Z N N N N N N N X̃ = E E δX(2) (1) = X(1) , X2 = x2 P2 (x1 , dx2 ) X1 = x1 P1 (dx1 ) W1 W2 # Z " N N N N N X (3.7) = E E δX(2) (2) = x(2) X1 = x1 P(2) (dx(2) ). W1 W2 N k X ) w1 (X(1) " # N ) w2 (X2N | X̃(1) Hence, applying Lemma 2.1.2 to (3.7) yields (3.5) and repeated application of Jensen’s inequality yields (3.6): " # N N N N E X(2) = x(2) X1 = x1 (x(2) ) = E dP(2) W1 W2 # " N N N X 1 = x1 ≥E W 1 N − 1 + w2 (x2 | x1 ) R dP̄(2) ≥ N2 . (N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 )) Proof of Theorem 3.1.1. The SIS bound follows immediately from Theorem 2.1.1. For the SIR bound, by Lemma 3.1.3 and Jensen’s inequality, " QT t=1 (N − 1 + wt (xt | x(t−1) )) R KL(P(T ) ||P̄(T ) ) ≤ EP(T ) log NT T X N − 1 + wt (xt | x(t−1) ) = EP(T ) log N t=1 T X wt (xt | x(t−1) ) − 1 ≤ log EP(T ) 1 − N t=1 25 !# = T X t=1 Vt . log 1 + N Remark 9. It is interesting to note that in the case of Q = P case, wt ≡ 1, so S R KL(P(T ) ||P̄(T ) ) = KL(P(T ) ||P̄(T ) ) = 0. So SIS and SIR are equivalent. Indeed, from the KL perspective all that is required is a single sample from Q(= P ). 
However, when P̂S and P̂R are used as estimators, P̂S is clearly superior to P̂R. Specifically, for φ ∈ Bb(X), SIS produces N independent samples, so

Var[P̂S(φ)] = VarP[φ]/N.

For simplicity, consider a version of P̂R obtained by first generating N samples from P, then applying multinomial resampling to obtain the final N samples. In this case it is easy to show that

Var[P̂R(φ)] = (2N − 1) VarP[φ]/N² ≈ 2 Var[P̂S(φ)],

so SIS is superior to SIR.

3.1.1 Rates for the Filtering Distribution

When SMC methods are applied to state-space models, often, instead of considering the full joint distribution of the latent states, only the marginal distribution of the most recent latent state is of interest [2, 14, 16, 27]. In this context SMC algorithms are often referred to as particle filters. Sequential Monte Carlo samplers also require estimating the marginal of the most recent state [cf. 11]. It is therefore natural to consider the KL divergence between the marginal of P,

P̃T ≜ P(T)(X^(T−1) × ·),   (3.8)

and the marginals of P̄S and P̄R,

P̄S_T ≜ P̄S_(T)(X^(T−1) × ·)   and   P̄R_T ≜ P̄R_(T)(X^(T−1) × ·).   (3.9)

From the operator perspective, P̂S_T and P̂R_T generally approximate P̃T far better than P̂S_(T) and P̂R_(T) approximate P(T). It is quite natural for SIS and SIR to produce better estimates of the marginal expectation since, while both the marginal and joint estimators involve the same number of particles, the joint expectation involves an integral over a much higher-dimensional space. So it is somewhat surprising that the KL divergence bounds we obtain in the marginal case are almost identical to those in the full joint distribution case already considered. But in fact, there are intuitively good reasons to expect the KL divergence case to behave very differently from that of functional approximation.
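The variance comparison above, Var[P̂R(φ)] = ((2N − 1)/N²) VarP[φ] for the multinomially resampled estimator versus VarP[φ]/N without resampling, can be verified by simulation. A minimal sketch under the illustrative choices P = N(0, 1) and φ(x) = x (so VarP[φ] = 1):

```python
import numpy as np

rng = np.random.default_rng(1)

N, reps = 50, 200_000
x = rng.standard_normal((reps, N))         # N iid samples from P per replicate
idx = rng.integers(0, N, size=(reps, N))   # multinomial resampling (uniform weights)

phat_S = x.mean(axis=1)                                    # no resampling
phat_R = np.take_along_axis(x, idx, axis=1).mean(axis=1)   # after resampling

print(phat_S.var(), 1 / N)               # both ~ 0.02   = Var_P[phi]/N
print(phat_R.var(), (2 * N - 1) / N**2)  # both ~ 0.0396 = (2N-1)/N^2
```

Resampling roughly doubles the estimator variance here, matching the (2N − 1)/N² factor.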
Since only a single sample is being drawn from P̂ S (or P̂ R ), the quality of the full sample X1:T compared to the marginal sample XT does not suffer from the same curse of dimensionality. Theorem 3.1.5. KL(P̃T ||P̄TS ) V ≤ log 1 + N ≤ V N (3.10) and PT Vt KL(P̃T ||P̄TR ) ≤ log 1 + t=1 + Θ N P Vt 1 ≤ t +Θ . N N2 1 N2 ! To prove the theorem, we must first establish lower bounds for (3.11) (3.12) dP̄TS dP̃T and dP̄TR . dP̃T Define the reverse probability kernel P̃(T −1) (xT , dx(T −1) ) such that P(T ) (dx(T ) ) = P̃T (dxT )P̃(T −1) (xT , dx(T −1) ). 27 Proposition 3.1.6. dP̄TS N (xT ) ≥ . R dP̃T N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) ) (3.13) Proof. Using Proposition A.2.2 we have P̄TS (dxT ) Z S dP̄(T ) Z S dP̄(T ) (x(T ) )PT (dx(T ) ) = (x(T ) )P̃(T −1) (xT , dx(T −1) )P̃T (dxT ) dP(T ) dP(T ) " # Z N 1 X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) )P̃T (dxT ), = N E PN n n=1 w(X(T ) ) = so dP̄TS = dP̃T " Z N E PN 1 # N X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) ) ) n w(X(T ) N ≥ R N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) ) n=1 by Jensen’s inequality. Proposition 3.1.7. NT dP̄TR (xT ) ≥ R QT . dP̃T t=1 (N + wt (xt | x(t−1) ) − 1)P̃(T −1) (xT , dx(T −1) ) (3.14) Proof. We prove the theorem in the T = 2 case. The general case follows from a pair of inductions analogous to those used to prove Proposition A.2.4 and Lemma 3.1.3, so they are omitted. If T = 2, (3.7) implies that # N N N = x(2) X1N = x1 P(2) (dx(2) ) P̄2R = E E δX N X(2) W1 W2 2 # Z " N N N = E · δX N X(2) = x(2) P̃(1) (x2 , dx1 )P̃2 (dx2 ). W1 W2 2 Z " 28 Hence, dP̄2R (x2 ) = dP̃2 Z ≥R " N N E · W1 W2 # N X(2) = x(2) P̃(1) (x2 , dx1 ) N2 . N = x(2) ]P̃(1) (x2 , dx1 ) E[W 1 · W 2 | X(2) It remains to simplify the denominator: Z N E W 1 · W 2 | X(2) = x(2) P̃(1) (x2 , dx1 ) Z N = x(2) P̃(1) (x2 , dx1 ) = E W 1 · (N − 1 + w2 (x2 | x1 )) | X(2) Z = (N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 ))P̃(1) (x2 , dx1 ), concluding the proof. Proof of Theorem 3.1.5. 
For SIS, by Proposition 3.1.6, KL(P̃T ||P̄TS ) N −1+ R w(x(T ) )P̃(T −1) (xT , dx(T −1) ) N R N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) ) ≤ log EP̃2 N ! R w(x(T ) )P̃(T −1) (xT , dx(T −1) )P̃T (dxT ) − 1 = log 1 + N EP [w] − 1 V = log 1 + = log 1 + . N N ≤ EP̃T log For SIR, to upper bound the KL divergence, we can use the fact that T Y (N + wt (xt | x(t−1) ) − 1) t=1 T =N +N T −1 T T X T X X T −2 (wxt − 1) + N (wxt − 1)(wxs − 1) t=1 + N T −3 T X T X T X t=1 s<t (wxt − 1)(wxs − 1)(wxr − 1) + · · · + t=1 s<t r<s T Y (wxt − 1), t=1 29 (3.15) where wxt , wt (xt | x(t−1) ). So, by Proposition 3.1.7 and an application of (3.15), KL(P̃T ||P̄TR ) 3.2 R QT t=1 (N + wt (xt | x(t−1) ) − 1)P̃(T −1) (xT , dx(T −1) ) ≤ EP̃T log NT i hQ T EP t=1 (N + wt (xt | x(t−1) ) − 1) ≤ log NT PT ! E [w − 1] 1 P xt = log 1 + t=1 +Θ N N2 PT ! V 1 t = log 1 + t=1 + Θ N N2 Time-uniform Bounds In this section we give uniform convergence results over time in the marginal distribution case for SIR. For the time-uniform results we assume that for all t ≥ 1, we have probability kernels Pt and Qt , so we can consider the joint distribution P(t) and marginal distribution P̃t for unbounded t. We will assume that the wt are uniformly bounded from above and uniformly bounded away from zero, which is a standard assumption in asymptotic analyses of SMC methods [see, e.g., 9, 29]: Assumption (A). For all t ≥ 1, 0 < w ≤ wt (xt | x(t−1) ) ≤ w < ∞. (3.16) Define the modified proposal distribution Q∗(t,T ) , Q(t) ⊗ Pt+1:T which uses the standard proposal distribution for the first t time steps and then proposes from the R∗ true conditional distribution at times t + 1 through T . Let P̂t,T be the SIR estimator R∗ R∗ for P̃T when the proposal Q∗(t,T ) is used and let P̄t,T , E[P̂t,T ]. Time-uniform results require an asymptotic stability assumption on Pt and Qt . 30 The weakest such assumption we consider controls only the limiting behavior of the system: Assumption (B). R∗ lim sup KL(P̃T +t ||P̄t,T +t ) = 0. 
T →∞ t≥0 (3.17) We will also consider the following stronger condition in order to obtain time-uniform convergence rates: Assumption (C). There exists T0 ≥ 1 and γ > 0 such that for all T ≥ T0 R∗ −γT sup KL(P̃T +t ||P̄t,T . +t ) ≤ e (3.18) t≥0 Both assumptions can be understood as requiring the stochastic process defined by the conditional distributions {Pt }t≥1 to have a sufficiently strong mixing property with respect to the SIR algorithm: no matter how long the (typically incorrect) proposals from {Qt }t≥1 are used, once the true conditionals are used as proposals, the estimated SIR marginals converge to the truth. Assumption (B) only requires that mixing occur in the infinite-time limit while Assumption (C) requires an asymptotically exponential mixing rate. We can now state and prove our time-uniform bounds, which are analogous to those given by Del Moral and Guionnet [12] in the total variation setting. Theorem 3.2.1. If Assumptions (A) and (B) hold, then lim sup KL(P̃t ||P̄tR ) = 0. N →∞ t≥1 31 (3.19) If Assumptions (A) and (C) hold, then sup KL(P̃t ||P̄tR ) t≥1 w ≤ N log(N/w) 1+ γ (3.20) for any N ≥ 1 such that log(N/w) T = T (N ) , ≥ T0 . γ (3.21) Proof. The proof is similar in spirit to that of Theorem 3.1 in Del Moral and Guionnet [12]. First, note that by Assumption (A) and the proof of Theorem 3.1.5 Qt (N + w (x | x ) − 1) (N + w − 1)t s s (s−1) s=1 KL(P̃t ||P̄tR ) ≤ log ≤ log Nt Nt t t(w − 1) w−1 ≤ = log 1 + . N N EP Hence, sup KL(P̃t ||P̄tR ) ≤ t=1,...,T Tw . N (3.22) We also have that KL(P̃t ||P̄tR ) = EP(t) log S∗ dP̄t−T,t dP̃t dP̃t = E log + EP(t) log P(t) R R R∗ dP̄t dP̄t dP̄t−T,t R∗ dP̄t−T,t R∗ + KL(P̃t ||P̄t−T,t ) dP̄tR R∗ dP̄t−T,t = EP(t) log + εT , dP̄tR = EP(t) log where R∗ εT , sup KL(P̃T +s ||P̄T,T +s ). s≥0 32 (3.23) Reasoning analogously to the proof of Proposition 3.1.7 we see that R∗ dP̄t,T dP̃T Z = " Nt E Qt s=1 W s # N X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) ). 
(3.24) R∗ Since dP̄tR = dP̄t,t , by (3.24), Jensen’s inequality, and Assumption (A) " # N X(t) = x(t) P̃(t−1) (xt , dx(t−1) ) " # Z N t−T N NT E Qt−T ≥ Qt X(t) = x(t) P̃(t−1) (xt , dx(t−1) ) s=t−T +1 (N − 1 + wxs ) s=1 W s # Z " t−T −1 N N NT X(t) = x(t) P̃(t−1) (xt , dx(t−1) ) E Qt−T ≥ (N − 1 + w)T W s s=1 dP̄tR = dP̃t = Z Nt E Qt s=1 W s R∗ dP̄t−T,t NT . (N − 1 + w)T dP̃t (3.25) Combining (3.23) and (3.25) yields that for all t > T , KL(P̃t ||P̄tR ) ≤ T log Tw N −1+w + εT ≤ + εT , N N which together with (3.22) implies sup KL(P̃t ||P̄tR ) ≤ t≥1 Tw + εT . N First letting N → ∞ and then taking T → ∞ proves (3.19). If Assumption (C) holds, then by the same reasoning as before, for all T ≥ T0 sup KL(P̃t ||P̄tR ) ≤ t≥1 Tw + e−γT . N So choosing log(N/w) T = T (N ) , γ 33 yields sup KL(P̃t ||P̄tR ) t≥1 w ≤ N log(N/w) 1+ γ . as long as T (N ) ≥ T0 , proving (3.20). Theorem 7.4.4 of Del Moral [9] states that under Assumption (A) and an assumption similar in spirit to Assumption (C), for any φ ∈ Bb (P) with sup |φ| ≤ 1, √ sup E[P̃t (φ) − P̂tR (φ)] = O(1/ N ). t≥1 The rate sup KL(P̃t ||P̄tR ) = O(log N/N ) t≥1 thus introduces an additional log N factor not present in Del Moral’s result, which may be possible to remove. If we measure the distance between P̃t and P̄tR using total variation, Pinsker’s inequality gives p sup dT V (P̃t , P̄tR ) = O( log N/N ). (3.26) t≥1 It is not clear that the rate given in (3.26) is optimal, since dT V (P̃t , P̄tR ) = O(1/N ), so we suspect that supt≥1 dT V (P̃t , P̄tR ) in fact converges at a 1/N or log N/N rate. 34 3.3 Comparing SIS and SIR Based on Theorems 3.1.1 and 3.1.5, one might conjecture that, in both the full and the marginal distribution settings, SIR dominates SIS when X Vt V , t and indeed this is the case under some additional hypotheses. In the joint distribution case, the proof of Theorem 2.1.1 establishes the lower bound dP̄ S N (x(T ) ) ≥ . 
N/(N − 1 + w(x(T))). When V is finite, a matching upper bound of the form

dP̄S/dP (x(T)) ≤ N/(N − 1 + w(x(T))) + o(1)

can also be established, where for large N the o(1) term can essentially be ignored. Note that the finiteness of V is already a necessary condition for Theorem 3.1.1 to be non-trivial. Analogous statements hold for SIR in the joint case and for SIS and SIR in the marginal case. To prove the conjecture we will require that Assumption (A) holds. Clearly Assumption (A) implies that V and Σt Vt are finite, so Theorems 3.1.1 and 3.1.5 are non-trivial in this setting.

Theorem 3.3.1. If Assumption (A) holds and Σt Vt < V, then for N sufficiently large,

KL(P||P̄R) < KL(P||P̄S)   and   KL(P̃T||P̄R_T) < KL(P̃T||P̄S_T).

Remark 10. It is instructive to consider the T = 2 case and to assume that P1, Q1, P2(x1, ·), and Q2(x1, ·) share a common dominating measure λ. Write pt = dPt/dλ and qt = dQt/dλ. We can then write out the three variance terms slightly more explicitly as

V1 = Var_{Q1}[w1] = ∫ p1(x1)²/q1(x1) λ(dx1) − 1,

V2 = Var_{P(1) ⊗ Q2}[w2] = ∫ p1(x1) p2(x2 | x1)²/q2(x2 | x1) λ(dx(2)) − 1,

V = Var_{Q(2)}[w(2)] = ∫ (p1(x1)²/q1(x1)) (p2(x2 | x1)²/q2(x2 | x1)) λ(dx(2)) − 1.

Remark 11. Still considering the T = 2 case, note the similarity between V2 and V: the only difference is that the latter has an additional w1(x1) = p1(x1)/q1(x1) term in the integrand. Say Q1 is of low quality but Q2 is of higher quality for choices of x with high P1-probability than for x of low P1-probability. In this case, V will be very large compared to V2, because in the V integral the w1(x1) term upweights the w2(x2 | x1) term in exactly the places where the latter has high variance, whereas V2 upweights the w2(x2 | x1) term in exactly the places where it has low variance. V1 may have reasonably large magnitude, but V will still be much larger.
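Remark 10's integrals can be evaluated numerically for a concrete example. The Gaussian densities below are illustrative assumptions, not taken from the thesis; for unit-variance Gaussians whose means differ by Δ, ∫ p²/q dλ = e^{Δ²}, so in this example Σt Vt < V and Theorem 3.3.1 favors SIR:

```python
import numpy as np

# Illustrative densities (assumed for this example): p1 = N(0,1), q1 = N(1,1),
# p2(.|x1) = N(x1,1), q2(.|x1) = N(x1 + 0.5, 1); lambda is Lebesgue measure.
def npdf(x, mu):
    return np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2.0 * np.pi)

x = np.linspace(-12.0, 12.0, 48001)
dx = x[1] - x[0]

# V1 = int p1^2/q1 dlambda - 1  (= e^1 - 1 for a unit mean shift)
V1 = np.sum(npdf(x, 0.0) ** 2 / npdf(x, 1.0)) * dx - 1.0

# For these location shifts, int p2(.|x1)^2 / q2(.|x1) dx2 = exp(0.25) for every
# x1, so the x1-integral in V2 contributes int p1 dlambda = 1.
inner = np.sum(npdf(x, 0.0) ** 2 / npdf(x, 0.5)) * dx   # = exp(0.25)
V2 = inner - 1.0
V = (V1 + 1.0) * inner - 1.0                            # = exp(1.25) - 1

print(V1 + V2, V, V1 + V2 < V)   # sum_t V_t < V: SIR beats SIS asymptotically
```

Here Σt Vt ≈ 2.00 while V ≈ 2.49; making q1 worse (a larger mean shift) widens the gap, in line with Remark 11.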
Thus SIR will be superior to SIS in cases where Q1 is of low quality, but Q2 is of better quality in regions of greatest importance. A converse of Theorem 3.3.1 also holds:

Theorem 3.3.2. If Assumption (A) holds and Σt Vt > V, then for N sufficiently large,

KL(P||P̄R) > KL(P||P̄S)   and   KL(P̃T||P̄R_T) > KL(P̃T||P̄S_T).

Hence, we have an asymptotically necessary and sufficient condition for SIR to be superior to SIS.

Corollary 3.3.3. If Assumption (A) holds, then for N sufficiently large:

KL(P||P̄R) < KL(P||P̄S) if and only if Σt Vt < V,   (3.27)

and

KL(P̃T||P̄R_T) < KL(P̃T||P̄S_T) if and only if Σt Vt < V.   (3.28)

Proofs of Theorems 3.3.1 and 3.3.2 are given in Appendix B.

Remark 12. Examining the proofs of Theorems 3.3.1 and 3.3.2, one can see that the key quantities for determining when N is “sufficiently large” are Δ ≜ |Σt Vt − V|, w̲, and w̄. That is, the greater the difference between Σt Vt and V, the larger w̲ is, and the smaller w̄ is, the smaller N needs to be to reach the asymptotic regime. The dependence on Δ here is quite natural. Indeed, the regime of large Δ is exactly the one of greatest interest, since that is when the choice of SIR or SIS will have the greatest impact. As for the dependence on w̲ (or w̄), if w̲ is zero or extremely small (resp. w̄ is infinite or extremely large), but the weights are only small (resp. large) with low probability, then versions of Theorems 3.3.1 and 3.3.2 and Corollary 3.3.3 that hold with high probability can easily be formulated.

Chapter 4
Conclusions and Future Work

In this thesis we have investigated the quality of two SMC estimators, SIS and SIR, from what we have called the measure perspective. As discussed in Section 2.2, from what we call the operator perspective, the asymptotic variance (AV) of SIS and SIR estimators can be used to judge their relative performance when used to approximate an expectation P(φ), φ ∈ Bb(X).
Here, our analysis has instead centered on the KL divergence from P to P̄S (and to P̄R). In addition to proving convergence rates for the KL divergences of both expected estimators, we obtained necessary and sufficient conditions for the SIR estimator to be superior to the SIS estimator in terms of KL divergence, providing an alternative to AV which is applicable when taking the measure viewpoint. In particular, we showed that the “measure AV”, for both the joint distribution and the filtering distribution, is V for SIS and Σt Vt for SIR. In the remainder of this chapter, we conclude by discussing some related results, drawing some connections between that work and our results, and speculating on promising directions for future research.

4.1 Other Convergence Rates for SMC

4.1.1 Lp Error Bounds

In addition to the CLT results already discussed, numerous other asymptotic and non-asymptotic analyses of P̂S and P̂R have been carried out. We mention just a few of them here. Throughout this section we write constants in functional form to indicate which quantities they depend on. The values of the constants in this and subsequent sections may change from line to line. For interacting particle systems, which include SIR as a special case, Lp error bounds and Glivenko–Cantelli-type theorems have been established [cf. 9]. One such Lp bound states that, for any p ≥ 1 and any φ ∈ Bb(X),

E[|P̂R_(T)(φ) − P(φ)|^p]^{1/p} ≤ a(p) b(φ) c(P, Q)/√N.   (4.1)

A time-uniform Lp bound (cf. Section 3.2),

sup_{t≥1} E[|P̂R_(t)(φ) − P(t)(φ)|^p]^{1/p} ≤ a(p) b(φ) c(P, Q)/√N,   (4.2)

has also been established, though under stronger conditions than the fixed-time result. The Glivenko–Cantelli-type theorem states that for any p ≥ 1 and any countable collection of uniformly bounded functions F ⊆ Bb(X),

E[sup_{φ∈F} |P̂R(φ) − P(φ)|^p]^{1/p} ≤ a(p) c(P, Q) C(F)/√N,   (4.3)

where C(F) measures the complexity of the function class F.

4.1.2 KL Divergence Bounds

A KL divergence bound in the reverse direction to the one we consider (cf. Section 2.2),

KL(P̄R||P) ≤ c(P, Q)/N,   (4.4)

can be extracted as a special case of a more general propagation-of-chaos result [9, Theorem 8.3.2]. Propagation of chaos concerns the relationship between the expected joint distribution over k particles and the distribution P^⊗k of k independent samples from P. In other words, propagation-of-chaos results measure how close the k particles are to being independent samples from P. The special case of interest to us is when k = 1. Propagation-of-chaos results require controlling the strength of the interactions between the particles and thus rely on a mixing condition, which is unnecessary in the k = 1 case. Thus, (4.4) is not directly comparable to our bound on KL(P||P̄R). An interesting open question concerns the fact that, under appropriate hypotheses, KL divergence becomes symmetric for “infinitesimal” divergences. It is not clear (to us) whether, in the SMC setting, KL divergence in one direction (asymptotically) bounds KL divergence in the other. An answer in either the affirmative or the negative would provide insight into the behavior of P̄R as well as into how our results relate to those of Del Moral.

4.2 Adaptive Resampling and αSMC

This thesis has examined the behavior of SMC estimators in the presence of deterministic resampling. However, it is common for practitioners to use adaptive resampling techniques to choose when to resample based on the realized particle weights. The most popular adaptive scheme is based on the effective sample size (ESS) criterion [14, 16, 23, 24]. The ESS for normalized weights w = (w1, . . . , wN) is defined as

ESS(w) ≜ (Σ_{n=1}^N wn²)^{−1}.   (4.5)

The function ESS(w) ranges from 1 to N and is interpreted as the effective number of particles the importance sampler is using if the particles have weights w. If the ESS is below some fixed threshold (e.g.
N/2), then a resampling step is performed. There are myriad heuristic arguments for using ESS [23, 24, 25] and some theoretical analyses of the behavior of adaptive resampling algorithms under a variety of technical assumptions [8, 10, 29]. Recently Whiteley, Lee, and Heine [29] provided a rigorous justification for the use of ESS from the operator viewpoint. They showed that if the ESS does not fall below γN , γ ∈ (0, 1] a fixed parameter, then the SMC algorithm does in fact behave as if there are γN particles. So in this technical sense ESS is in fact a valid measure of the effective sample size. We now briefly describe their set-up and one relevant result. Consider a state-space model where Xt = Z is a fixed measurable space, Pt (x(t−1) , dxt ) ∝ K(xt−1 , dxt )g(xt , yt ), and Qt (x(t−1) , dxt ) = K(xt−1 , dxt ), with yt the observation at time t. In this state-space model, wt ∝ g(·, yt ). Critically, the goal in [29] was to perform one-step-ahead prediction. That is, to approximate the predictive distribution Z Pt|t−1 (dxt ) ∝ K(xt−1 , dxt ) t−1 Y K(xs−1 , dxs )g(xs , ys )f0 (dx0 ), (4.6) s=1 where f0 is the density of the initial state x0 . Whiteley et al. give an algorithm they call αSMC, which generalizes SIS, SIR, and numerous other SMC variants. The algorithm provides a flexible resampling mechanism in which at each time t, a stochastic matrix αt−1 is chosen from a set of 41 N × N matrices, denoted AN . We denote the value in the n-th row and k-th column nk . of αt−1 by αt−1 Algorithm 4 αSMC for n = 1, . . . , N do Sample particle X1n ∼ Q1 n Set weights W1n ← w1 (X(1) ) end for for t = 2, . . . , T do Select αt−1 from AN according to some functional of X (t−1) for n = 1, . . . , N do P k k nk Wtn ← N k=1 αt−1 Wt−1 wt (Xt ) n Resample particle X̃t−1 | W t−1 , X (t−1) ∼ PN k=1 k k αnk t−1 Wt−1 wt (Xt ) δX(t−1) k Wtn n Sample next particle Xtn | X̃t−1 ∼ K(X̃t−1 , ·) end for end for The αSMC predictive estimators take the form α P̂t|t−1 = N X n=1 Wtn δXtn . 
That is,

P̂α_{t|t−1} = Σ_{n=1}^N (Wtn / Σ_{k=1}^N Wtk) δ_{Xtn}.   (4.7)

Define the (generalized) ESS for αSMC at time t to be

EtN ≜ (N^{−1} Σ_{n=1}^N Wtn)² / (N^{−1} Σ_{n=1}^N (Wtn)²).   (4.8)

Since EtN has been normalized to lie between 1/N and 1, the standard ESS at time t is given by ESSt ≜ N EtN. The SMC algorithms already discussed in this thesis can be obtained as special cases of αSMC as follows. SIS is recovered by always setting αt−1 to be IN, the N × N identity matrix. For SIR, set αt−1 = 11ᵀ/N, the N × N matrix with all entries equal to 1/N. The standard ESS-based adaptive SMC algorithm is obtained by setting αt−1 to 11ᵀ/N if EtN < γ and to IN otherwise.

Whiteley et al. give a time-uniform Lp error bound when the version of αSMC that is employed guarantees a lower bound on EtN. Let φ ∈ Bb(X) with sup |φ| ≤ 1 and let p ≥ 1. Then, under appropriate regularity conditions,

inf_{t≥1} EtN ≥ γ  ⟹  sup_{t≥1} E[|P̂αt(φ) − P̃t(φ)|^p]^{1/p} ≤ a(p) c(P, Q)/√(γN).   (4.9)

Comparing (4.9) to [9, Theorem 7.4.4], which states that

sup_{t≥1} E[|P̂Rt(φ) − P̃t(φ)|^p]^{1/p} ≤ a(p) c(P, Q)/√N,   (4.10)

we see that the condition inf_{t≥1} EtN ≥ γ (that is, the ESS never falls below γN) ensures that the effective number of particles in the time-uniform Lp error bound is γN, compared to N particles if SIR is used. We conjecture that a similar generalization from (4.10) to (4.9) may exist for our time-uniform result for KL divergence given in Theorem 3.2.1. However, the ESS condition is likely to be different from that used in the Lp context. The form of this KL ESS condition could prove to be of practical interest when employing SMC methods for sampling.

4.3 Global Parameter Estimation in State-space Models

Typically, in addition to the latent state at each time, state-space models have a global parameter θ which, in the Bayesian setting, has a posterior distribution that must also be estimated. Standard SMC algorithms only handle the case of fixed θ.
An active area of research is developing extensions to SMC that allow for estimation of the joint state and global parameter distribution in an online manner [5, 7, 14]. Applying techniques developed in this thesis to understand these algorithms from the measure perspective is an exciting direction for future work. For example, the nested particle filter algorithm [7] provides a scalable approach to parameter estimation problems, 43 with constant cost at each time (unlike, e.g., the SMC2 algorithm [5], which has O(t) cost at time t). Hence, an analysis of the algorithm from the measure perspective would be particularly worthwhile and would complement the existing analyses done from the operator perspective. 44 Appendix A Auxiliary Results A.1 Technical Lemmas Proposition A.1.1. Let ZN = PN n=1 Xn , where the Xn are i.i.d. nonnegative random variables with mean µ > 0 and 0 ≤ a ≤ Xn ≤ b < ∞. Let µN , E[ZN ] = N µ and c > 0. Then 1 E c + ZN ≤ 1 + Θ(N −4/3 ) c + µN (A.1) and if a > 0, then the Θ(N −4/3 ) term is independent of c. Proof. We have 1 E c + ZN 1 1 =E 1(µN − ZN ≥ tN ) + E 1(µN − ZN < tN ) c + ZN c + ZN 1 1 ≤ E [1(µN − ZN ≥ tN )] + E [1(µN − ZN < tN )] c + Na c + N (µ − t) 1 1 −2t2 N 2 ≤ exp + , 2 c + Na N (b − a) c + N (µ − t) where the final step follows from Hoeffding’s inequality. Choosing t = N −1/3 yields 1 E c + ZN 1 ≤ exp c + Na −2N 1/3 (b − a)2 + 1 c + N (µ − N −1/3 ) 45 1 1 −2N 1/3 1 1 = + exp + − 2 −1/3 c + Nµ c + Na (b − a) c + N (µ − N ) c + Nµ 1/3 2/3 1 −2N N 1 + exp + = 2 c + Nµ c + Na (b − a) (c + N (µ − N −1/3 ))(c + N µ) 1 1 = +Θ . c + Nµ N 4/3 1 Note that if a > 0, then the Θ N 4/3 term can be made independent of c by replacing c with zero. Lemma A.1.2. For all > 0 there is some N > 0 such that log(1+x) > P2N k+1 k x k=1 (−1) for all x > −1 + . Proof. 
We have log(1 + x) − 2N X (−1) k+1 k x = 2N X k+1 x (−1) k=1 k=1 k k − 2N X (−1)k+1 xk + R2N (x) k=1 = G2N (x) + R2N (x), where G2N (x) = P2N k=1 (k−1)(−x)k k and R2N (x) is the remainder term in the 2N -degree Taylor series for log(1 + x) centered at 0. For x ∈ (−1, 0], 2N 1X 1 G2N (x) ≥ (−x)k ≥ 2 k=1 2 = Z 2N +1 (−x)t dt 2 1 (−x)2 1 1 (−x)2N +1 − (−x)2 ≥− ≥ (−x)3 , 2 log(−x) 2 log(−x) 2 where the last inequality follows from the fact that log x ≥ (x − 1)/x, so −1/ log(x) ≥ −x/(x − 1) ≥ x for x ∈ [0, 1). For x ∈ (−1, x], the magnitude of the remainder term can be bounded as Z |R2N (x)| = x 0 (x − t)2N ≤ |x| max fN (x, t), dt x≤t≤0 (1 + t)2N +1 46 where fN (x, t) = (x−t)2N . (1+t)2N +1 Since ∂fN (x − t)2N −1 =− (−t + (2N + 1)x + 2N ), ∂t (1 + t)2N +2 for fixed x, fN is increasing in t on the interval [x, (2N + 1)x + 2N ], so for all x ∈ [−2N/(2N + 1), 0], fN is increasing in t on the interval [x, 0]. Hence, for all x ∈ [−2N/(2N + 1), 0], |R2N (x)| ≤ |x|f (x, 0) = (−x)2N +1 . Note that (−x)2N +1 ≤ (−x)3 for x ∈ [−1/21/(2N −2) , 0]. Letting b(N ) = max{−1/21/(2N −2) , −2N/(2N + 1)}, (A.2) we have for all x ∈ [b(N ), 0], that |R2N (x)| ≤ G2N (x), and thus that log(1 + x) − 2N X (−1)k+1 xk ≥ 0. (A.3) k=1 For x > 0, note that log(1 + x) ≥ x/(1 + x) and that x/(1 + x) = ∞ 2N X X (−1)k+1 xk . (−1)k+1 xk + k=1 k=2N +1 For x ∈ [0, 1], ∞ X k+1 k (−1) x = k=2N +1 ∞ X (x2k+1 − x2k+2 ) ≥ 0, k=N while for x ≥ 1, ∞ X k+1 k (−1) x =x 2N +1 + k=2N +1 ∞ X (x2k+1 − x2k ) ≥ 0. k=N +1 Thus, for N > 0, log(1 + x) ≥ x/(1 + x) ≥ 47 P2N k=1 (−1) k+1 k x . So for fixed , choosing N such that b(N ) < −1 + completes the proof. A.2 Auxiliary SMC Results Lemma A.2.1. P and P̄ I are absolutely continuous with respect to each other. Proof. The fact that P̄ I P follows immediately from Lemma 2.1.3. To see that P P̄ I , note that for measurable A ⊆ X, P̄ I (A) = 0 =⇒ there is some B ⊂ A such that Q(B) = 0 and ∀x ∈ A\B, w(x) = 0. 
But since P Q, Q(B) = 0 =⇒ P(B) = 0 and since for x ∈ A \ B, w(x) = 0, P(A \ B) = 0 as well. So P(A) = 0. S Proposition A.2.2. For the SIS algorithm, P̄(T ) P(T ) and S dP̄(T ) dP(T ) " (x(T ) ) = N E PN n=1 1 n w(X(T ) # N X(T ) = x(T ) . ) (A.4) Proof. The result is an immediate corollary of Lemma 2.1.3. It also follows from Lemma A.2.1 that: Lemma A.2.3. P and P̄ S are absolutely continuous with respect to each other. Let W t , W t (X (t) ) , (T ) (T ) PN n=1 (T ) the functions f1 , f2 , . . . , fT (T ) fT (X (T ) , hi) , (T ) ft (X (t) , xt+1:T ) , Wtn be the sum of the SIR weights at time t. Define recursively by N W T (X (T ) ) h i (T ) N N N N E ft+1 (X (t+1) , xt+2:T ) | X̃(t) = X(t) , Xt+1 = xt+1 W t (X (t) ) (A.5) (A.6) for 1 ≤ t ≤ T − 1. R Proposition A.2.4. For the SIR algorithm, P̄(T ) P(T ) and R dP̄(T ) dP(T ) (T ) N (x(T ) ) = E[f1 (X (1) , x2:T ) | X(1) = x1 ] 48 (A.7) Proof. First define the measure-valued functions (T ) gT (X (T ) , hi) , (T ) gt (X (t) , xt+1:T ) , N δX N W T (X (T ) ) (T ) i h (T ) N N N = xt+1 , Xt+1 = X(t) N E gt+1 (X (t+1) , xt+2:T ) | X̃(t) W t (X (t) ) (A.8) (A.9) (T ) R N N and note that P̄(T ) = E[wT (XT | X̃(T −1) )gT (X (T ) , hi)]. The inductive hypothesis is that for 1 ≤ t ≤ T , R P̄(T ) Z wt (XtN =E (T ) N N , dxt+1:T ) | X̃(t−1) )gt (X (t) , xt+1:T )Pt+1:T (X(t) . 
(A.10) N = xt Pt:T (X̃(t−1) , dxt:T ) (A.11) Writing (T ) g̃t (X̃ (t−1) ) Z , E (T ) gt (X (t) , xt+1:T ) XtN and assuming the inductive hypothesis holds for some fixed t, R P̄(T ) Z =E wt (XtN Z =E E Z =E E (T ) N N | X̃(t−1) )gt (X (t) , xt+1:T )Pt+1:T (X(t) , dxt+1:T ) (T ) gt (X (t) , xt+1:T ) XtN (T ) gt (X (t) , xt+1:T ) XtN = xt = xt N N Pt (X̃(t−1) , dxt )Pt+1:T (hX̃(t−1) , xt i, dxt+1:T ) N Pt:T (X̃(t−1) , dxt:T ) # N k k X | X̃(t−2) ) wt−1 (Xt−1 N (T ) k E g̃t (X̃ (t−1) ) X̃(t−1) = X(t−1) =E W t−1 " k=1 # N N N wt−1 (Xt−1 | X̃(t−2) ) N (T ) N =E E g̃t (X̃ (t−1) ) X̃(t−1) = X(t−1) W t−1 Z (T ) N N N =E wt−1 (Xt−1 | X̃(t−2) )gt−1 (X (t−1) , xt:T )Pt:T (X(t−1) , dxt:T ) . " The induction yields R P̄(T ) Z =E (T ) N w1 (X1N )g1 (X (1) , x2:T )P2:T (X(1) , dx2:T ) 49 Z (T ) N E[g1 (X (1) , x2:T ) | X(1) = x1 ]P1 (dx1 )P2:T (x1 , dx2:T ) =E Z (T ) N = E[g1 (X (1) , x2:T ) | X(1) = x1 ]P(T ) (dx(T ) ), so applying Lemma 2.1.2 proves the proposition. Lemma A.2.5. P and P̄ R are absolutely continuous with respect to each other. Proof. The reasoning is analogous to proof of Lemma A.2.1. The following is stated as Lemma 3.1.3, but not proven there. Lemma A.2.6. R dP̄(T ) dP(T ) NT (x(T ) ) ≥ QT t=1 (N − 1 + wt (xt | x1:t−1 )) . Proof. Define (T ) ht (x(T ) ) N T −t+1 , QT s=t (N − 1 + ws (xs | x(s−1) )) (T ) ht,t+1 (X (t−1) , x(T ) , ψ) , ψ(x(T ) , X (t−1) ) " # (T ) N ht,s+1 (X (s) , x(T ) , ψ) N (T ) X(s) = x(s) ht,s (X (s−1) , x(T ) , ψ) , E PN n | X̃ n w (X ) s n=1 s (s−1) for 1 ≤ s ≤ t. First note that " (T ) ht,t (X (t−1) , x(T ) , ψ) # (T ) N ht,t+1 (X (t) , x(T ) , ψ) N X(t) = x(t) = E PN n n w (X | X̃ ) t n=1 t (t−1) " # N ψ(X (t) , x(T ) ) N = E PN X(t) = x(t) n n n=1 wt (Xt | X̃(t−1) ) (T ) = ht−1,t X (t−1) , x(T ) , πt (ψ) , where " N ψ(X (t) , x(T ) ) πt (ψ)(X (t) , x(T ) ) , E PN n=1 n wt (Xtn | X̃(t−1) 50 # N X(t) = x(t) . 
) (A.12) (T ) (T ) Also, if ht,s (X (s−1) , x(T ) , ψ) = ht−1,s (X (s−1) , x(t) , ψ 0 ), then # (T ) N ht,s (X (s−1) , x(T ) , ψ) N = E PN X(s−1) = x(s−1) n n n=1 ws−1 (Xs−1 | X̃(s−2) ) " # (T ) N ht−1,s (X (s−1) , x(t) , ψ 0 ) N X(s−1) = x(s−1) = E PN n n ) | X̃ w (X s−1 n=1 s−1 (s−2) " (T ) ht,s−1 (X (s−2) , x(T ) , ψ) (T ) = ht−1,s−1 (X (s−2) , x(T ) , ψ 0 ). So by induction on s ht,1 (hi, x(T ) , ψ) = ht−1,1 (hi, x(T ) , πt (ψ)). (A.13) (T ) Observe that if ψ(X (t) , x(T ) ) = ht+1 (x(T ) ), then by Jensen’s inequality " (T ) (T ) πt (ht+1 )(X (t) , x(T ) ) = E PN N ht+1 (x(T ) ) n=1 n wt (Xtn | X̃(t−1) # N X(t) = x(t) ) (T ) N ht+1 (x(T ) ) ≥ E hP N N n n n=1 wt (Xt | X̃(t−1) ) | X(t) = x(t) i (T ) = N ht+1 (x(T ) ) N − 1 + wt (xt | x(t−1) ) (T ) = ht (x(T ) ). (A.14) Next, applying Proposition A.2.4 and (A.13) yields R dP̄(T ) dP(T ) (T ) (T ) N (x(T ) ) = E[f1 (X (1) , x2:T ) | X(1) = x1 ] = hT,1 (hi, x(T ) , 1) (T ) (T ) = hT,1 (hi, x(T ) , hT +1 ). Assume that for fixed 1 ≤ t ≤ T , R dP̄(T ) dP(T ) (T ) (T ) (x(T ) ) ≥ ht,1 (hi, x(T ) , ht+1 ). Then applying (A.13) and (A.14) yields R dP̄(T ) dP(T ) (T ) (T ) (T ) (T ) (x(T ) ) ≥ ht,1 (hi, x(T ) , ht+1 ) = ht−1,1 (hi, x(T ) , πt (ht+1 )) 51 (T ) (T ) ≥ ht−1,1 (hi, x(T ) , ht ). So by induction on t, R dP̄(T ) dP(T ) (T ) (T ) (x(T ) ) ≥ h1,1 (hi, x(T ) , ht ), proving the lemma. Lemma A.2.7. R dP̄(T ) dP(T ) (x(T ) ) ≤ QT NT t=1 (N − 1 + wt (xt | x1:t−1 )) + Θ(N −1/3 ). (A.15) Proof. The results follows from Proposition A.1.1 and an argument analogous to that of Lemma 3.1.3. 52 Appendix B Proofs of SIS and SIR Comparison Theorems We give proofs of Theorems 3.3.1 and 3.3.2 in the case when the full joint distribution P is targeted. The proofs for the marginal distribution cases of the theorems are P essentially identical. In both proofs, let ν , t Vt . B.1 Proof of Theorem 3.3.1 Let ε , min{wT /2, V − ν} > 0, so V ≤ ν + ε. 
By Theorem 3.1.1,
\[
\mathrm{KL}\big(P \,\|\, \bar P^R\big) \le \frac{\nu}{N} + \Theta\!\left(\frac{1}{N^2}\right), \tag{B.1}
\]
so for sufficiently large $N$,
\[
\mathrm{KL}\big(P \,\|\, \bar P^R\big) \le \frac{\nu + \varepsilon/3}{N}. \tag{B.2}
\]
By Propositions A.1.1 and A.2.2,
\[
\frac{d\bar P^S}{dP}\big(x_{(T)}\big) = \mathbb{E}\left[ \frac{N}{\sum_{n=1}^N w\big(X^n_{(T)}\big)} \,\middle|\, X^N_{(T)} = x_{(T)} \right] \le \frac{N}{N - 1 + w\big(x_{(T)}\big)} + \Theta\big(N^{-1/3}\big).
\]
Since $w\big(x_{(T)}\big)$ is bounded, for large enough $N$,
\[
\frac{d\bar P^S}{dP}\big(x_{(T)}\big) \le \frac{N}{N - 1 + w\big(x_{(T)}\big) - \varepsilon/6}.
\]
Therefore, by Lemma A.1.2, there is an integer $K > 0$ such that
\begin{align*}
\mathrm{KL}\big(P \,\|\, \bar P^S\big) &\ge \mathbb{E}_{X_{(T)} \sim P}\left[ \log \frac{N - 1 + w\big(X_{(T)}\big) - \varepsilon/6}{N} \right] \\
&\ge \mathbb{E}_{X_{(T)} \sim P}\left[ \frac{w\big(X_{(T)}\big) - 1 - \varepsilon/6}{N} - \sum_{k=2}^{2K} \frac{(-1)^k \big(w(X_{(T)}) - 1 - \varepsilon/6\big)^k}{N^k} \right] \\
&\ge \frac{\nu + \varepsilon - \varepsilon/6}{N} - \sum_{k=2}^{2K} \frac{(-1)^k\, \mathbb{E}_{X_{(T)} \sim P}\big[\big(w(X_{(T)}) - 1 - \varepsilon/6\big)^k\big]}{N^k}.
\end{align*}
So, for $N$ sufficiently large,
\[
\mathrm{KL}\big(P \,\|\, \bar P^S\big) \ge \frac{\nu + \varepsilon - \varepsilon/3}{N}, \tag{B.3}
\]
and for $N$ sufficiently large,
\[
\mathrm{KL}\big(P \,\|\, \bar P^R\big) \le \frac{\nu + \varepsilon/3}{N} < \frac{\nu + 2\varepsilon/3}{N} \le \mathrm{KL}\big(P \,\|\, \bar P^S\big), \tag{B.4}
\]
as was to be shown.

B.2 Proof of Theorem 3.3.2

Let $\delta \triangleq \min\{w/2,\, \nu - V\} > 0$, so $V \le \nu - \delta$, and by Theorem 3.1.1,
\[
\mathrm{KL}\big(P \,\|\, \bar P^S\big) \le \frac{\nu - \delta}{N}. \tag{B.5}
\]
On the other hand, Lemma A.2.7 states that
\[
\frac{d\bar P^R_{(T)}}{dP_{(T)}}\big(x_{(T)}\big) \le \frac{N^T}{\prod_{t=1}^T \big(N - 1 + w_t(x_t \mid x_{1:t-1})\big)} + O\big(N^{-1/3}\big),
\]
so, since the $w_t$ are a.s.-bounded, for large enough $N$,
\[
\frac{d\bar P^R_{(T)}}{dP_{(T)}}\big(x_{(T)}\big) \le \frac{N^T}{\prod_{t=1}^T \big(N - 1 + w_t(x_t \mid x_{1:t-1}) - \delta/(3T)\big)},
\]
and by Lemma A.1.2, there is an integer $K > 0$ such that
\begin{align*}
\mathrm{KL}\big(P \,\|\, \bar P^R\big) &\ge \mathbb{E}_{X_{(T)} \sim P}\left[ \log \frac{\prod_{t=1}^T \big(N - 1 + w_t(X_t \mid X_{1:t-1}) - \delta/(3T)\big)}{N^T} \right] \\
&= \sum_{t=1}^T \mathbb{E}_{X_{(T)} \sim P}\left[ \log\!\left( 1 + \frac{w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)}{N} \right) \right] \\
&\ge \sum_{t=1}^T \mathbb{E}_{X_{(T)} \sim P}\left[ \frac{w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)}{N} - \sum_{k=2}^{2K} \frac{(-1)^k \big(w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)\big)^k}{N^k} \right] \\
&= \sum_{t=1}^T \left[ \frac{V_t - \delta/(3T)}{N} - \sum_{k=2}^{2K} \frac{(-1)^k\, \mathbb{E}_{X_{(T)} \sim P}\big[\big(w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)\big)^k\big]}{N^k} \right].
\end{align*}
Thus, for $N$ sufficiently large,
\[
\mathrm{KL}\big(P \,\|\, \bar P^R\big) \ge \sum_{t=1}^T \frac{V_t - 2\delta/(3T)}{N} = \frac{\nu - 2\delta/3}{N}, \tag{B.6}
\]
and for $N$ sufficiently large,
\[
\mathrm{KL}\big(P \,\|\, \bar P^S\big) \le \frac{\nu - \delta}{N} < \frac{\nu - 2\delta/3}{N} \le \mathrm{KL}\big(P \,\|\, \bar P^R\big), \tag{B.7}
\]
as was to be shown.
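The comparison in Theorems 3.3.1 and 3.3.2 turns on whether the joint weight variance $V$ exceeds the sum $\nu = \sum_t V_t$ of the per-step weight variances. As a quick numerical illustration (not from the thesis), the following sketch assumes $T$ incremental weights that are independent under the proposal, each with unit mean and a hypothetical variance $V_t$; in that case $V = \prod_t (1 + V_t) - 1 \ge \nu$, so such models fall in the $V > \nu$ regime of Theorem 3.3.1, where resampling asymptotically dominates.

```python
import math
import random

# Hypothetical per-step incremental weight variances V_t (illustrative only).
Vt = [0.5, 0.3, 0.2]

# nu = sum_t V_t, the quantity appearing in the KL bounds (B.1)-(B.7).
nu = sum(Vt)

# For independent unit-mean incremental weights, the joint weight
# w = prod_t w_t has variance V = prod_t (1 + V_t) - 1 >= nu.
V = math.prod(1.0 + v for v in Vt) - 1.0
print(f"nu = {nu:.4f}, V = {V:.4f}  (V >= nu: {V >= nu})")

# Monte Carlo check using unit-mean log-normal incremental weights:
# w_t = exp(Z_t) with Z_t ~ N(-s_t^2/2, s_t^2) and s_t^2 = log(1 + V_t)
# gives E[w_t] = 1 and Var(w_t) = V_t.
random.seed(0)
n_samples = 200_000
total = total_sq = 0.0
for _ in range(n_samples):
    w = 1.0
    for v in Vt:
        s2 = math.log(1.0 + v)
        w *= math.exp(random.gauss(-s2 / 2.0, math.sqrt(s2)))
    total += w
    total_sq += w * w
mean = total / n_samples
V_mc = total_sq / n_samples - mean * mean
print(f"Monte Carlo estimate of V: {V_mc:.4f}")
```

Since $\prod_t (1 + V_t) - 1 > \sum_t V_t$ whenever at least two of the $V_t$ are positive, independence of the incremental weights forces $V > \nu$; dependent weights can instead satisfy $\nu > V$, which is the regime of Theorem 3.3.2.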