An information-theoretic analysis of resampling in sequential Monte Carlo Jonathan H. Huggins

An information-theoretic analysis of
resampling in sequential Monte Carlo
by
Jonathan H. Huggins
B.A., Columbia University (2012)
Submitted to the Department of Electrical Engineering and Computer
Science
in partial fulfillment of the requirements for the degree of
Master of Science
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2014
c Jonathan H. Huggins, MMXIV. All rights reserved.
The author hereby grants to MIT permission to reproduce and to
distribute publicly paper and electronic copies of this thesis document
in whole or in part in any medium now known or hereafter created.
Author . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Department of Electrical Engineering and Computer Science
May 8, 2014
Certified by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Professor Joshua B. Tenenbaum
Professor of Computational Cognitive Science
Thesis Supervisor
Accepted by . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Professor Leslie A. Kolodziejski
Chairman, Department Committee on Graduate Theses
An information-theoretic analysis of
resampling in sequential Monte Carlo
by
Jonathan H. Huggins
Submitted to the Department of Electrical Engineering and Computer Science
on May 8, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science
Abstract
Sequential Monte Carlo (SMC) methods form a popular class of Bayesian inference
algorithms. While originally applied primarily to state-space models, SMC is increasingly being used as a general-purpose Bayesian inference tool. Traditional analyses of
SMC algorithms focus on their usage for approximating expectations with respect to
the posterior of a Bayesian model. However, these algorithms can also be used to obtain approximate samples from the posterior distribution of interest. We investigate
the asymptotic and non-asymptotic properties of SMC from this sampling viewpoint.
Let P be a distribution of interest, such as a Bayesian posterior, and let P̂ be a
random estimator of P generated by an SMC algorithm. We study P̄ , E[P̂ ], i.e., the
law of a sample drawn from P̂ , as the number of particles tends to infinity. We give
convergence rates of the Kullback-Leibler divergence KL(P ||P̄ ) as well as necessary
and sufficient conditions for the resampled version of P̄ to asymptotically dominate
the non-resampled version from this KL divergence perspective. Versions of these
results are given for both the full joint and the filtering settings. In the filtering case
we also provide time-uniform bounds under a natural mixing condition. Our results
open up the possibility of extending recent analyses of adaptive SMC algorithms for
expectation approximation to the sampling setting.
Thesis Supervisor: Professor Joshua B. Tenenbaum
Title: Professor of Computational Cognitive Science
2
Acknowledgments
During the first year and a half of my PhD at MIT, I have been fortunate enough to
have worked with, and been inspired by, numerous people on many projects.
First, I would like to express my deep thanks to my advisor, Josh Tenenbaum, who
has indulged my varied, and sometimes aimless, interests. He has been instrumental
in shaping my thinking about what constitute the most interesting problems in the
field of machine learning. Yet he has also graciously allowed me the freedom to explore
and discover which problems are the most exciting to me, and in which areas I can
have the most impact. It was his inquiries that instigated the research presented in
this thesis.
Special thanks go to Dan Roy, who was my close collaborator on this work. Without Dan’s innumerable insights and invaluable suggestions, this project would never
have come to fruition. However, even more important than any particular contributions Dan made to this project are the ways he has helped me to become a far more
effective theoretician and precise mathematical thinker. I hope to continue to learn
from and follow his example.
Thanks also to Vikash Mansinghka and Arnaud Doucet for their critical insights,
suggestions, and support while this research was still in its formative stages. In
particular, I would like to thank Vikash for suggesting (repeatedly, until I finally
listened!) that we should study the expected value of the random measures produced
by SIS and SIR. And thanks to Arnaud for recommending we investigate asymptotic
stability properties via time-uniform bounds. Thanks also to Cameron Freer, Peter
Krafft, Tejas Kulkarni, and Andy Miller for reading drafts of various versions this
work.
Finally, I would like to thank my family and, in particular, my wonderful and
brilliant wife, Diana, for their love and support.
This research was conducted with U.S. Government support under FA9550-11C-0028 and awarded by the DoD, Air Force Office of Scientific Research, National
Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a.
3
Contents
1 Introduction
1.1
6
Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . .
2 Sequential Monte Carlo
2.1
2.2
10
Importance Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . .
10
2.1.1
IS for Variance Reduction and Sampling . . . . . . . . . . . .
11
2.1.2
The KL Divergence Perspective on IS . . . . . . . . . . . . . .
13
Sequential Importance Sampling with and without Resampling . . . .
17
2.2.1
20
SIS and SIR for Variance Reduction and Sampling . . . . . . .
3 Main Results
3.1
7
22
Convergence Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . .
22
3.1.1
Rates for the Filtering Distribution . . . . . . . . . . . . . . .
26
3.2
Time-uniform Bounds . . . . . . . . . . . . . . . . . . . . . . . . . .
30
3.3
Comparing SIS and SIR . . . . . . . . . . . . . . . . . . . . . . . . .
35
4 Conclusions and Future Work
4.1
38
Other Convergence Rates for SMC . . . . . . . . . . . . . . . . . . .
39
4.1.1
Lp Error Bounds . . . . . . . . . . . . . . . . . . . . . . . . .
39
4.1.2
KL Divergence Bounds . . . . . . . . . . . . . . . . . . . . . .
40
4.2
Adaptive Resampling and αSMC . . . . . . . . . . . . . . . . . . . .
40
4.3
Global Parameter Estimation in State-space Models . . . . . . . . . .
43
4
A Auxiliary Results
45
A.1 Technical Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . .
45
A.2 Auxiliary SMC Results . . . . . . . . . . . . . . . . . . . . . . . . . .
48
B Proofs of SIS and SIR Comparison Theorems
53
B.1 Proof of Theorem 3.3.1 . . . . . . . . . . . . . . . . . . . . . . . . . .
53
B.2 Proof of Theorem 3.3.2 . . . . . . . . . . . . . . . . . . . . . . . . . .
54
5
Chapter 1
Introduction
Sequential Monte Carlo (SMC) methods are a widely-used class of algorithms for
approximate inference in Bayesian models [11, 14, 15, 16, 20, 21]. The SMC approach
is attractive because it provides a flexible, efficient, and relatively simple solution to
the problem of computing estimates of expectation functionals when the underlying
distribution is analytically intractable, as is often the case for posterior distributions
arising in Bayesian analysis. In the case of time-series models, SMC (which in the
time series context is commonly called particle filtering) also provides a method for
performing fast online inference, which is critical in many real-world applications such
as robotics and tracking [2, 14, 18, 27].
A direct precursor to SMC was the importance sampling algorithm. The motivation for developing importance sampling was to produce estimators of expectation
functionals with smaller variance than the standard Monte Carlo estimators [14, 17].
Most analyses of SMC continue to adopt this functional approximation perspective
(or what we shall refer to as the operator perspective), attempting to quantify the
performance of SMC algorithms not only in terms of asymptotic variance, but more
generally by bounding approximation error [cf. 4, 9, 14, 16].
Two canonical SMC algorithms are sequential importance sampling (SIS) and sampling importance resampling (SIR). Like all SMC algorithms, SIS and SIR approximate a distribution P by a discrete, random distribution P̂ formed from a collection
of N weighted samples called particles. The difference between SIS and SIR can be
6
understood as follows. In SIS, the particles are constructed incrementally and independently of each other. Information about the particles is only combined at the
final step of the algorithm. SIR introduces resampling steps: during resampling, particles with large weights are likely to be duplicated, while those with extremely small
weights might disappear altogether; after resampling, all particles are given equal
weight. The original motivation for introducing the resampling variant of SIS was to
prevent weight degeneracy [14, 24, 25]. Weight degeneracy arises because, when using
the SIS algorithm, even after only a modest number of steps, a single particle may
have vastly more weight than all the other particles combined. If this occurs, then
the SIS approximation effectively consists of a single particle.
Both theory and practice have shown that resampling provides more accurate estimates of expectations by reducing the variance of these estimates [9, 14, 16, 24, 29].
However, resampling is not always desirable. From the operator perspective, there are
two reasons to avoid resampling. First, if the quality of the approximation is good,
then, roughly speaking, resampling simply adds variance to the Monte Carlo estimate [4, 11]. Second, there is a computational price to pay for resampling. The SIS
algorithm is “embarrassingly parallel” because the particles evolves independently.
Resampling, however, breaks this parallelism, potentially leading to a substantial increase in the effective computational requirements [22, 28, 29]. Because of the impact
on computational efficiency, it is important to also understand how resampling affects inference quality when the goal is to approximate the underlying measure, not
calculate an expectation. Developing such an understanding is the primary goal of
this work.
1.1
Summary of Contributions
As noted above, previous theoretical investigations of SIS and SIR have primarily
taken the operator viewpoint (or operator perspective), assessing the quality of P̂ when
used to approximate the expectation operator EP [·] by EP̂ [·]. In this thesis we take
the measure viewpoint (or measure perspective), focusing instead on the properties of
7
t=1
t=1
t=2
t=2
t=3
t=3
t=4
t=5
t=4
(a) SIS
(b) SIR
Figure 1-1: A cartoon depiction of the SIS and SIR algorithms. The size of the
particles indicate their relative weights at each time step. Note that in the SIS case,
most of the weight becomes concentrated on a single particle.
SIS and SIR when employed to produce samples approximately distributed according
to P. More precisely, we investigate the mean of P̂ , denoted P̄ , which can also be
understood as the (marginal) distribution of a single sample drawn from P̂ . We use
the Kullback-Leibler (KL) divergence from P to P̄ to measure how far P̄ is from
P. Our motivation is twofold. First, we seek to better understand the quality of
SMC algorithms when the object which we wish to approximate is the distribution P
itself, as opposed to an expectation. The second goal is to understand how SIS and
SIR compare to each other from the measure perspective, in a manner analogous to
the way asymptotic variance allows for comparison of algorithms from the operator
perspective.
Our first main result is KL divergence convergence rates for both SIS and SIR
as the number of particles N tends to infinity. We show that if the variances (with
respect to the appropriate distributions) of the particle weights are finite, then SIS
and SIR converge at a 1/N rate. The constants in these rates are shown to be
asymptotically tight, leading to our second main result, which gives necessary and
sufficient conditions for SIR to asymptotically dominate SIS from the KL divergence
perspective.
8
In practice, SMC methods are often applied to state-space models to approximate
the marginal distribution of the hidden state at the most recent time. We give
analogues to our first two main results in this filtering case as well. Finally, for the
filtering case, we also give time-uniform bounds on the KL divergence for SIR under
a natural mixing condition.
Our results provide analogues to a number of classical asymptotic and non-asymptotic
SMC analyses, which all apply in the case of deterministic resampling. It is common,
however, for practitioners to use the effective sample size (ESS) criterion to adaptively determine whether to resample at a particular step of the algorithm [23, 24,
25]. The traditional arguments for employing the ESS criterion were heuristic, though
recently Whiteley, Lee, and Heine [29] provided a rigorous justification from the operator viewpoint for the use of ESS. We are hopeful that the work presented here
provides a framework for deriving analogous results from measure viewpoint to those
of Whiteley et al., though with an appropriately modified notion of effective sample
size.
The remainder of the thesis is organized as follows. Chapter 2 begins by considering the simpler case of importance sampling (Section 2.1), before formally defining
SIS and SIR (Section 2.2). Our main results are presented in Chapter 3. Chapter 4
concludes with a discussion of previous research on the convergence properties of
SMC, connections to our results, and speculation on important directions for future
work.
9
Chapter 2
Sequential Monte Carlo
2.1
Importance Sampling
To provide intuition for our main SIS and SIR results and to establish some notation,
we begin by considering estimators arising from importance sampling (IS). Let P
and Q be probability measures on a measurable space X. The goal is to form an
estimate of the target distribution P when we are only able to sample from the proposal
distribution Q. Assume that, for all measurable sets A ⊆ X, we have Q(A) = 0 =⇒
P(A) = 0, i.e., P is absolutely continuous with respect to Q, written P Q, and so
dP
there exists a Radon-Nikodym derivative of P with respect to Q, denoted by w , dQ
,
R
R
which satisfies φ dP = φ w dQ for every measurable function φ. We will refer to
w as the weight function.
The importance sampling algorithm is very simple:
10
Algorithm 1 Importance Sampling
for n = 1, . . . , N do
sample particle Xn ∼ Q
end for
Form the importance sampling estimator
I
P̂ ,
N
X
n=1
2.1.1
w(Xn )
δXn
PN
k=1 w(Xk )
IS for Variance Reduction and Sampling
Importance sampling was originally designed with the operator perspective in mind as
a variance reduction technique. Let Bb (X) be the set of all measurable bounded real
functions on X. For measure ν and function φ, write ν(φ) , Eν [φ] for the expected
value of φ w.r.t. ν. Consider the task of approximating the expectation φ̄ = P(φ) for
i.i.d.
some φ ∈ Bb (X). Given X1 , . . . , XN ∼ P , the standard Monte Carlo (MC) estimator
P
D
for φ̄ is φ̄M C , N1 N
n=1 φ(Xn ). Letting =⇒ denote convergence in distribution, the
MC estimator satisfies the central limit theorem (CLT)
√
D
2
N (φ̄M C − φ̄) =⇒ N(0, σM
C ),
N →∞
2
with asymptotic variance (AV) equal to the variance of φ w.r.t. P, i.e., σM
C =
EP [(φ − φ̄)2 ]. The IS estimate of φ̄,
I
φ̄I , P̂ (φ) =
PN
n=1 w(Xn )φ(Xn )
,
PN
n=1 w(Xn )
on the other hand, satisfies the CLT
√
D
N (φ̄I − φ̄) =⇒ N(0, σI2 ),
N →∞
11
with AV σI2 = EP [(φ − φ̄)2 w]. For a fixed φ and an appropriate choice of Q, it is
2
possible to have σI2 σM
C , making IS a superior choice to standard Monte Carlo
[17, 26].
However, importance sampling is no longer used only for the estimation of integrals. For example, Del Moral, Doucet, and Jasra [11] give a unifying framework for
using SMC methods to obtain, in a variety of scenarios, samples that are approximately distributed according to a measure of interest. Also, recently developed particle Markov chain Monte Carlo methods aim to combine the best features of SMC and
MCMC approaches by using SMC as a proposal mechanism for a Metropolis-Hastings
or approximate Gibbs sampler [1, 19].
In light of such alternative uses for SMC, consider a sample X | P̂ I ∼ P̂ I obtained
from an IS estimator. Then X has a marginal distribution, which we denote by
P̄ I , that is known to approach P as N → ∞. The quantity P̄ I (along with its SMC
variants) will be the key quantity of interest in our study. We will seek to characterize
how well P̄ I (and its SMC variants) approximates P when a finite number of particles
are used.
A very useful equivalent definition for P̄ I is that it is the expected value of P̂ I :
P̄ I , EP̂ I . Formally, the measure EP̂ I is given by (EP̂ I )(A) , E[P̂ I (A)], for every
measurable A ⊆ X. Note that since the distribution of X involves marginalizing over
P̂ I , it has the same support as P. In particular, P̄ I and P are absolutely continuous
with respect to each other.
Fig. 2-1a gives an example of P , Q, and w in the case that X = R and P and
Q have, respectively, densities p and q with respect to Lebesgue measure. Fig. 21b shows an example of an importance sampling estimate with N = 20 particles.
Fig. 2-2 shows the density of P̄ I , along with p and q, for N = 4, 8, 16, 32 particles.
Informally, note that for a “small” number of particles, P̄ I is strongly “biased” toward
the proposal distribution Q: for N not too large, with non-trivial probability all the
samples from Q will be in a region of low P -probability. Hence, all the weights
will be small ( 1). But in order to form the probability measure P̂ I the sum of
the weights is normalized to 1, creating a bias toward regions of high Q-probability.
12
p(x)
q(x)
p(x)
q(x)
w(x)
(a)
(b)
Figure 2-1: An example in which X = R and P and Q have, respectively, densities
p and q with respect to Lebesgue measure. (a) Plots of the densities p and q, and
the weight function w = f /g. (b) An example of an IS estimate P̂ I with N = 20
particles. The heights of the lines indicate the weights of the particles samples from
Q.
However, as N increases, the probability of producing a sample from Q in a region of
high P -probability (and thus with a large weight) increases, which induces a better
approximation to P .
2.1.2
The KL Divergence Perspective on IS
To measure the discrepancy between P̄ I and P we use KL divergence [6], which is
a natural information-theoretic measure of closeness between two probability measures. Section 2.2 provides further discussion of our choice of KL divergence. For all
measures µ ν, the KL divergence from µ to ν is given by
KL(µ||ν) , Eµ log(dµ/dν).
(2.1)
Under the conditions on P and Q given above, we have the following result:
Theorem 2.1.1. For the IS algorithm,
VarQ [w]
KL(P||P̄ ) ≤ log 1 +
N
I
≤
VarQ [w]
.
N
(2.2)
Hence, KL(P||P̄ I ) = O(1/N ) when the variance of w is finite.
Remark 1. The theorem applies in the case where Q is chosen adaptively [26] via, for
13
number of particles = 4
number of particles = 8
p(x)
q(x)
P̄ I
p(x)
q(x)
P̄ I
(a)
(b)
number of particles = 16
number of particles = 32
p(x)
q(x)
P̄ I
p(x)
q(x)
P̄ I
(c)
(d)
Figure 2-2: The expected IS distribution P̄ I for N = 4, 8, 16, 32 particles. For small
numbers of particles P̄ I is strongly biased toward the proposal distribution Q. The
density of P̄ I in each plot was approximated using a kernel density estimate.
example, the population Monte Carlo algorithm [3, 13].
Remark 2. The theorem shows that VarQ [w] measures how much “bias” the use of
the proposal distribution Q introduces into the importance sampler and thus how
many particles are required to remove “most” of the bias: once N = VarQ [w]/C, the
KL divergence from P to P̄ I is at most the constant log(1 + C).
The key to bounding the KL divergence in the IS case, as well as in the SIS and
SIR cases considered later, is to upper bound the derivative term inside the log, which
for IS is
dP
.
dP̄ I
To obtain such a bound, we first derive an explicit expression for
dP̄ I
,
dP
which can then be lower bounded. The following technical lemma will repeatedly
prove useful:
Lemma 2.1.2. Let ψ be a measurable function and let µ be a probability measure on
the space Ω. If
ν , EX∼µ [ψ(X)δX ] ,
then ν µ and ψ is a version of dν/dµ.
14
(2.3)
Proof. Since for all measurable A ⊆ Ω
Z
Z
ν(A) =
Z
ψ(x)δx (A)µ(dx) =
ψ(x)1A (x)µ(dx) =
Ω
ψ(x)µ(dx),
Ω
A
ψ is a version of the Radon-Nikodym derivative dν/dµ.
With this result in hand, obtaining an expression for
dP̄ I
dP
is straightforward:
Lemma 2.1.3. For the IS algorithm, P̄ I P and
#
"
dP̄ I
N
XN = x .
(x) = E PN
dP
w(X
)
n
n=1
(2.4)
Proof. Since
"
P̄ I = E
N
X
PN
w(Xk )
"
N
k=1
n=1
Z
=
#
w(Xn )
δXn
"
N w(XN )
= E PN
δXN
k=1 w(Xk )
#
#
XN = x dP(x),
δx E PN
k=1 w(Xk )
the result follows from Lemma 2.1.2.
Proof of Theorem 2.1.1. By Lemma 2.1.3 and Jensen’s inequality
#
"
N
N
dP̄ I
XN = x ≥ h
i
(x) = E PN
P
N
dP
w(X
)
n
E
n=1
n=1 w(Xn ) | XN = x
=
N
.
N − 1 + w(x)
Therefore, Lemma A.2.1 implies that
dP
dP̄ I
I
P̄ −1
= ( ddP
) , which together with Jensen’s
inequality yields
dP
N − 1 + w(X)
KL(P||P̄ ) = EX∼P log
(X) ≤ EX∼P log
N
dP̄ I
N − 1 + w(X)2
N − 1 + w(X)
≤ log EX∼P
= log EX∼Q
N
N
VarQ [w]
= log 1 +
.
N
I
15
Recall that the total variation distance between measures µ and ν is given by
dT V (µ, ν) = supA⊆X |µ(A) − ν(A)|. The following corollary to Theorem 2.1.1 is immediate from Pinsker’s inequality [6]:
Corollary 2.1.4.
s
dT V (P, P̄ I ) ≤
r
VarQ [w]
1
VarQ [w]
log 1 +
≤
.
2
N
2N
(2.5)
Remark 3. If P and Q are absolutely continuous with respect to each other, the
variance of the weights VarQ [w] is actually equal to the χ2 distance dχ2 (P, Q). Let λ
be a dominating measure for P and Q and let p = dP/dλ and q = dQ/dλ. Then
Z 2
p
(p − q)2
dχ2 (P, Q) =
dλ =
dλ − 1
p
q
Z 2
p
=
dQ − 1 = VarQ [w].
q
Z
Taking N = 1, we have KL(P||P̄ I ) = KL(P||Q), and so Theorem 2.1.1 leads to the
classic inequality
KL(P||Q) ≤ log(1 + dχ2 (P, Q)).
(2.6)
Remark 4. In the preceding discussion we have assumed that w can be computed
exactly, whereas in practice it may only be possible to compute w∗ = cw. If w can
only be calculated up to a constant, the above results still hold since
I
P̂ =
N
X
w∗ (Xn )
PN
n=1
k=1
w∗ (Xk )
δXn =
N
X
n=1
w(Xn )
δXn
PN
k=1 w(Xk )
as before. Hence, throughout we will assume without loss of generality that w can be
computed exactly.
16
2.2
Sequential Importance Sampling with and without Resampling
In this section we present a general formulation of SIS and SIR, rather than one
couched in the language of a Bayesian state-space model.
Let W, Y, Z be measurable spaces and let K1 (w, dy) and K2 (w, y, dz) be probability kernels from W to Y and W × Y to Z, respectively. The kernel product
(K1 ⊗ K2 )(w, dy × dz) is the probability kernel from W to Y × Z given by
Z
(K1 ⊗ K2 )(w, B × C) =
K2 (w, y, C)K1 (w, dy)
B
for every measurable B ⊆ Y and C ⊆ Z.
Let X1 , X2 , . . . , XT be a sequence of measurable spaces, let Xs:t , Xs × Xs+1 ×
· · · × Xt , let X(t) , X1:t , and let X , X(T ) be the full product space of interest. Let
P , P1 ⊗ P2 ⊗ · · · ⊗ PT be the distribution of interest over X, where P1 is a probability
measure on X1 and for 1 ≤ t ≤ T − 1, Pt+1 is a probability kernel from X(t) to Xt+1 .
Define Ps:t , Ps ⊗ · · · ⊗ Pt to be the probability kernel from X(s−1) to Xs:t and let
P(t) , P1:t , so P = P(T ) .1
Both SIS and SIR construct approximations to each distribution P(1) , P(2) , . . . , P(T )
in turn, and use earlier approximations to produce later ones. Like IS, these approaches make use of an importance distribution Q P, but decompose it into
stages Q , Q1 ⊗ · · · ⊗ QT . This decomposition induces a corresponding sequence of
(conditional) weight functions
dP1
(x1 )
dQ1
dPt (x(t−1) , ·)
wt (xt | x(t−1) ) ,
(xt ).
dQt (x(t−1) , ·)
w1 (x1 | hi) , w1 (x1 ) ,
Letting w(t) , w1:t ,
Qt
s=1
(2.7)
(2.8)
wt , we have w = w(T ) . Here and throughout, xt will
1
We have abused notation slightly here as P1:t (and hence P(t) , P1:t ) is not a probability kernel,
though it could be made one by introducing the one-point space X(0) .
17
denote a point in Xt . Write xs:t , hxs , xs+1 , . . . , xt i and x(t) , x1:t .
SIS is simply a reformulation of IS in the case of a product space and operates by
n N
}n=1 with corresponding nonnegpropagating a collection of N particles X (t) = {X(t)
n N
ative weights W (t) = {W(t)
}n=1 . The distribution P(t) is then approximated by
S
=
P̂(t)
N
X
n=1
n
W(t)
n .
δX(t)
PN
k
W
k=1
(t)
(2.9)
The SIS algorithm for generating the particles and the weights is as follows:
Algorithm 2 Sequential Importance Sampling
for n = 1, . . . , N do
Sample particle X1n ∼ Q1
n
X(1)
← X1n
n
← w1 (X1n )
Set weight W(1)
end for
for t = 2, . . . , N do
for n = 1, . . . , N do
n
, ·)
Sample next particle Xtn | X (t−1) ∼ Qt (X(t−1)
n
n
, Xtn )
← (X(t−1)
X(t)
n
n
n
Update weight W(t)
← W(t−1)
· wt (Xtn | X(t−1)
)
end for
end for
In practice, the SIS procedure typically suffers from degeneracy, even when there
k
are only a few dimensions T . Specifically, for some k ∈ {1, . . . , N }, the weight W(T
)
n
is much larger than every other weight W(T
) , n 6= k. A single weight therefore comes
S
to dominate the others and the approximation takes the form P̂(T
[16, 20].
k
) ≈ δX(T
)
The standard solution to the degeneracy problem is to include a resampling step, in
which, after each iteration (or some subset of iterations), particles with large weights
will tend to be duplicated and those with small weights will tend to be removed [16].
All particles are then given equal weight, so there is no longer such severe weight
18
degeneracy.
We analyze the simplest resampling scheme, called multinomial resampling. Sampling importance resample (SIR) is identical to SIS except for a resampling step performed after each iteration. Let W t = {Wtn }N
n=1 denote weights for the particles X (t)
R
at time t. The SIR estimators P̂(t)
are defined in an analogous manner to the SIS
estimators:
R
P̂(t)
=
N
X
Wtn
PN
k
k=1 Wt
n=1
n .
δX(t)
(2.10)
The SIS algorithm for generating the particles and the weights is as follows:
Algorithm 3 Sampling Importance Resampling
for n = 1, . . . , N do
Sample particle X1n ∼ Q1
n
X(1)
← X1n
Set weight W1n ← w1 (X1n )
end for
for t = 2, . . . , N do
for n = 1, . . . , N do
n
R
Resample particle X̃(t−1)
| W t−1 , X (t−1) ∼ P̂(t−1)
n
, ·)
Sample next particle Xtn | X̃ (t−1) ∼ Qt (X̃(t−1)
n
n
, Xtn )
← (X(t−1)
X(t)
n
Set weight Wtn ← wt (Xtn | X̃(t−1)
)
end for
end for
S
R
R
Write P̂ S , P̂(T
, P̂(T
) and P̂
) for, respectively, the SIS estimator and the SIR
estimator of the full distribution P = P(T ) .
19
2.2.1
SIS and SIR for Variance Reduction and Sampling
As with importance sampling, the performance of SIS and SIR can be viewed from
an operator perspective or a measure perspective. The majority of previous work has
focused on the former, where for a test function φ ∈ Bb (P), P̂ S (φ) or P̂ R (φ) is used
as an estimate of the expectation φ̄ , P (φ). Since SIS is an instantiation of IS, they
share the same CLT
√
D
N (P̂ S (φ) − φ̄) =⇒ N(0, σS2 ),
N →∞
where σS2 = EP [(φ − µ)2 w]. A CLT also holds for SIR [4, 16]:
√
D
N (P̂ R (φ) − φ̄) =⇒ N(0, σR2 ).
N →∞
(2.11)
See Chopin [4] for an explicit expression for σR2 . AV provides one method for comparing the efficiency of SIS and SIR. If σR2 < σS2 , then SIR is, in the AV sense, superior to
SIS: asymptotically, the expected L2 error of the SIR estimator for φ̄ will be smaller
than that of the SIS estimator.
As described for IS in Section 2.1, however, our concern will be with the measure
perspective, when P̂ S and P̂ R are used to produce samples that are approximately
distributed according to P. To determine how far the distribution of a sample from
P̂ S (or P̂ R ) is from P, we must therefore study the (marginal) expected estimators
P̄ S , E[P̂ S ] and P̄ R , E[P̂ R ], which are directly analogous to P̄ I .
We are only aware of a small amount of work investigating the properties of P̄ S
or P̄ R . Del Moral [9] gives an upper bound on the total variation distance between
P̄ R and P,
dT V (P̄ R , P) ≤
c
,
N
(2.12)
and a bound on the KL divergence from P̄ R to P,
KL(P̄ R ||P) ≤
20
c0
.
N
(2.13)
To the best of our knowledge, the quantities KL(P||P̄ S ) and KL(P||P̄ R ) have not
been previously analyzed. By studying the asymptotic properties of KL(P||P̄ S ) and
KL(P||P̄ R ), we aim to develop a criterion similar to AV that allows us to determine under what conditions SIR is (at least asymptotically) superior to SIS from the
sampling perspective.
Since our interest is in approximating P, it is, in a certain information-theoretic
sense, more natural to study KL(P||P̄ S ) and KL(P||P̄ R ) than KL(P̄ S ||P) and KL(P̄ R ||P).
This is because, for measures µ and ν, KL(µ||ν) is the expected number of additional
bits required to encode samples from µ when using a code for ν instead [6]. In other
words, it is the amount of information lost by using samples from ν instead of samples
from µ. Another reason to investigate the KL divergence from P to P̄ R (and to P̄ S ), is
that KL(P̄ R ||P) could be small even if P̄ R gave zero probability to a region with positive P-probability, whereas KL(P||P̄ R ) would be infinite in this case. Even in the less
extreme scenario in which P̄ R puts small mass on a region with high P-probability,
KL(P̄ R ||P) could be small whereas KL(P||P̄ R ) would be very large.
21
Chapter 3
Main Results
3.1
Convergence Rates
Our first result gives upper bounds on the KL divergences for the SIS and SIR expected estimators P̄ S and P̄ R that are analogous to the upper bound given in Theorem 2.1.1 for the IS expected estimator. These convergence results motivate our
necessary and sufficient condition for SIR to be superior to SIS, which is given at the
end of the chapter. The key quantities in these analyses are
V , V T , VarQ(T ) [w(T ) ] and Vt , VarP(t−1) ⊗Qt [wt ],
(3.1)
where for notational convenience we write P(0) ⊗ Q1 instead of Q1 .
Theorem 3.1.1. For the SIS and SIR algorithms,
V
V
KL(P||P̄ ) ≤ log 1 +
≤
N
N
S
(3.2)
and
R
KL(P||P̄ ) ≤
T
X
t=1
P
Vt
1
t Vt
≤
+Θ
.
log 1 +
N
N
N2
Hence, KL(P||P̄ S ) = O(1/N ) when V is finite and KL(P||P̄ R ) = O(1/N ) when
22
(3.3)
P
t
Vt
is finite.
Remark 5. Heuristically, the sum
P
t
Vt grows linearly in T since each term is the
variance of the (conditional) weight from a single time step, while V grows exponentially with T since the (conditional) weights from each time step are being multiplied
together. This behavior is similar to what practitioners observe empirically.
Remark 6. Intuitively, the performance of SIR depends on a sum of variances of the
individual wt because SIR resets the particle weights after each time step, so the wt ’s
never “interact” with each other. The performance of SIS, on the other hand, depends
on the variance of the wt ’s multiplied together because they are multiplied together
in the SIS algorithm to give the final particle weights. Thus, the variance of the sum
of the wt ’s measures the performance of SIR while the variance of the product of the
wt ’s measures the performance of SIS.
Remark 7. It is worth reiterating (cf. Remark 2) that V (in the case of SIS) and
P
t
Vt
(in the case of SIR) measure how much “bias” the use of the proposal distribution Q
introduces into SIS/SIR, and thus how many particle are required to remove “most”
of the bias. It is reasonable to ask for the KL divergence to be O(log T ).1 In the case
of SIS, once N = V /(CT ) particles are used, the KL divergence from P to P̄ S is at
most log(1 + CT ). Define VT∗ , sup1≤t≤T Vt . Then for SIR, once N = VT∗ T /(C 0 log T )
particles are used, the KL divergence from P to P̄ R is at most C 0 log T . However, we
expect that V = Θ(αT ) for some constant α and we might suppose that supt Vt < ∞.
If both assumptions hold, then to achieve O(log T ) KL divergence, we should expect
to choose N = Ω(αT /T ) for SIS and N = Ω(T / log T ) for SIR.
As with importance sampling, we can use Pinsker’s inequality to bound the total
variation distance:
Corollary 3.1.2.
s
dT V (P, P̄ S ) ≤
1
V
log 1 +
≤
2
N
1
s
V
.
2N
An analogous discussion to that which follows could be carried if we instead ask for the KL
divergence to be a constant. In this case the required scale of N would only change by logarithmic
factors.
23
and
s
P
1
t Vt
−2
log 1 +
+ Θ(N )
dT V (P, P̄ ) ≤
2
N
rP
t Vt
≤
+ Θ(N −2 ) .
2N
R
Remark 8. The SIR convergence rate for TV distance given in the corollary is not
optimal since, as noted in Section 2.2, Del Moral [9] shows that in fact dT V (P, P̄ R ) =
O(1/N ).
The proof of Theorem 3.1.1 follows the same strategy as that for Theorem 2.1.1:
we obtain an explicit expression for
R
dP̄(T
)
dP(T )
, which can then be lower bounded.
Lemma 3.1.3.
R
dP̄(T
)
dP(T )
NT
(x(T ) ) ≥ QT
− 1 + wt (xt | x1:t−1 ))
t=1 (N
.
(3.4)
The proof of Lemma 3.1.3 requires a pair of tedious inductive arguments, so we
instead convey the key ideas by giving an expression for
R
dP̄(2)
dP(2)
and proving the lemma
in the T = 2 case. The full proof is given in Appendix A.
R
P(2) ,
Lemma 3.1.4. For the SIR algorithm, P̄(2)
#
N
X 1 = x1 ,
(3.5)
N2
.
(N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 ))
(3.6)
"
R
dP̄(2)
N
N
(x(2) ) = E
E
dP(2)
W1
W2
N
X(2) = x(2)
and
R
dP̄(2)
dP(2)
Proof. Let W t ,
(x(2) ) ≥
PN
n=1
Wtn denote the sum of the SIR weights at time t. Since
P̂2R =
N
X
Wn
2
n=1
W2
n =
δX(2)
N
n
X
w2 (X2n | X̃(1)
)
n=1
24
W2
n ,
δX(2)
we have
" PN
n
n
n=1 w2 (X2 | X̃(1) )
R
P̄(2)
=E
W2
"
"
n
δX(2)
=NE
N
)
w2 (X2N | X̃(1)
W2
#
N
δX(2)
##
N
k
N X̃
=NE
E
δX(2)
(1) = X(1)
W1
W2
"k=1 N
"
##
N
w1 (X(1) )
)
w2 (X2N | X̃(1)
N
2
N
N X̃
=N E
E
δX(2)
(1) = X(1)
W1
W2
#
Z "
Z N
N
N
N
N
N
N X̃
= E
E
δX(2)
(1) = X(1) , X2 = x2 P2 (x1 , dx2 ) X1 = x1 P1 (dx1 )
W1
W2
#
Z "
N
N
N
N
N X
(3.7)
= E
E
δX(2)
(2) = x(2) X1 = x1 P(2) (dx(2) ).
W1
W2
N
k
X
)
w1 (X(1)
"
#
N
)
w2 (X2N | X̃(1)
Hence, applying Lemma 2.1.2 to (3.7) yields (3.5) and repeated application of Jensen’s
inequality yields (3.6):
"
#
N
N N
N
E
X(2) = x(2) X1 = x1
(x(2) ) = E
dP(2)
W1
W2 #
"
N
N
N
X 1 = x1
≥E
W 1 N − 1 + w2 (x2 | x1 ) R
dP̄(2)
≥
N2
.
(N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 ))
Proof of Theorem 3.1.1. The SIS bound follows immediately from Theorem 2.1.1.
For the SIR bound, by Lemma 3.1.3 and Jensen’s inequality,
"
QT
t=1 (N − 1 + wt (xt | x(t−1) ))
R
KL(P(T ) ||P̄(T
) ) ≤ EP(T ) log
NT
T
X
N − 1 + wt (xt | x(t−1) )
=
EP(T ) log
N
t=1
T
X
wt (xt | x(t−1) ) − 1
≤
log EP(T ) 1 −
N
t=1
25
!#
=
T
X
t=1
Vt
.
log 1 +
N
Remark 9. It is interesting to note that in the case of Q = P case, wt ≡ 1, so
S
R
KL(P(T ) ||P̄(T
) ) = KL(P(T ) ||P̄(T ) ) = 0. So SIS and SIR are equivalent. Indeed, from
the KL perspective all that is required is a single sample from Q(= P ). However,
when P̂ S and P̂ R are used as estimators, P̂ S is clearly superior to P̂ R . Specifically,
for φ ∈ Bb (X), SIS produces N independent samples, so
Var[P̂ S (φ)] =
VarP [φ]
.
N
For simplicity, consider a version of P̂ R obtained by first generating N samples from
P , then applying multinomial resampling to obtain the final N samples. In this case
it is easy to show that
Var[P̂ R (φ)] =
(2N − 1) VarP [φ]
≈ 2 Var[P̂ S (φ)],
N2
so SIS is superior to SIR.
3.1.1
Rates for the Filtering Distribution
When SMC methods are applied to state-space models, often, instead of considering
the full joint distribution of the latent states, only the marginal distribution of the
most recent latent state is of interest [2, 14, 16, 27]. In this context SMC algorithms
are often referred to as particle filters. Sequential Monte Carlo samplers also require
estimating the marginal of the recent state [cf. 11]. It is therefore natural to consider
the KL divergence between the marginal of P,
P̃T , P(T ) (X(T −1) × ·),
26
(3.8)
and the marginals of P̄ S and P̄ R ,
S
R
R
P̄TS , P̄(T
) (X(T −1) × ·) and P̄T , P̄(T ) (X(T −1) × ·).
(3.9)
From the operator perspective, P̂TS and P̂TR generally approximate P̃T far better
S
R
than P̂(T
) and P̂(T ) approximate P(T ) . It is quite natural for SIS and SIR to produce
better estimates of the marginal expectation since, while both the marginal and joint
estimators involve the same number of particles, the joint expectation involves an
integral over a much higher dimensional space. So it is somewhat surprising that the
KL divergence bounds we obtain in the marginal case are almost identical to those in
the full joint distribution case already considered. But in fact, there are intuitively
good reasons to expect the KL divergence case will behave very differently from that
of functional approximation. Since only a single sample is being drawn from P̂ S (or
P̂ R ), the quality of the full sample X1:T compared to the marginal sample XT does
not suffer from the same curse of dimensionality.
Theorem 3.1.5.
KL(P̃T ||P̄TS )
V
≤ log 1 +
N
≤
V
N
(3.10)
and
PT
Vt
KL(P̃T ||P̄TR ) ≤ log 1 + t=1 + Θ
N
P
Vt
1
≤ t +Θ
.
N
N2
1
N2
!
To prove the theorem, we must first establish lower bounds for
(3.11)
(3.12)
dP̄TS
dP̃T
and
dP̄TR
.
dP̃T
Define the reverse probability kernel P̃(T −1) (xT , dx(T −1) ) such that P(T ) (dx(T ) ) =
P̃T (dxT )P̃(T −1) (xT , dx(T −1) ).
27
Proposition 3.1.6.
dP̄TS
N
(xT ) ≥
.
R
dP̃T
N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) )
(3.13)
Proof. Using Proposition A.2.2 we have
P̄TS (dxT )
Z
S
dP̄(T
)
Z
S
dP̄(T
)
(x(T ) )PT (dx(T ) ) =
(x(T ) )P̃(T −1) (xT , dx(T −1) )P̃T (dxT )
dP(T )
dP(T )
"
#
Z
N
1
X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) )P̃T (dxT ),
= N E PN
n
n=1 w(X(T ) )
=
so
dP̄TS
=
dP̃T
"
Z
N E PN
1
#
N
X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) )
)
n
w(X(T
)
N
≥
R
N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) )
n=1
by Jensen’s inequality.
Proposition 3.1.7.
NT
dP̄TR
(xT ) ≥ R QT
.
dP̃T
t=1 (N + wt (xt | x(t−1) ) − 1)P̃(T −1) (xT , dx(T −1) )
(3.14)
Proof. We prove the theorem in the T = 2 case. The general case follows from a pair
of inductions analogous to those used to prove Proposition A.2.4 and Lemma 3.1.3,
so they are omitted.
If T = 2, (3.7) implies that
#
N
N
N
= x(2) X1N = x1 P(2) (dx(2) )
P̄2R = E
E
δX N X(2)
W1
W2 2 #
Z "
N
N
N
= E
·
δX N X(2) = x(2) P̃(1) (x2 , dx1 )P̃2 (dx2 ).
W1 W2 2 Z
"
28
Hence,
dP̄2R
(x2 ) =
dP̃2
Z
≥R
"
N
N
E
·
W1 W2
#
N
X(2) = x(2) P̃(1) (x2 , dx1 )
N2
.
N
= x(2) ]P̃(1) (x2 , dx1 )
E[W 1 · W 2 | X(2)
It remains to simplify the denominator:
Z
N
E W 1 · W 2 | X(2)
= x(2) P̃(1) (x2 , dx1 )
Z
N
= x(2) P̃(1) (x2 , dx1 )
= E W 1 · (N − 1 + w2 (x2 | x1 )) | X(2)
Z
= (N − 1 + w1 (x1 ))(N − 1 + w2 (x2 | x1 ))P̃(1) (x2 , dx1 ),
concluding the proof.
Proof of Theorem 3.1.5. For SIS, by Proposition 3.1.6,
KL(P̃T ||P̄TS )
N −1+
R
w(x(T ) )P̃(T −1) (xT , dx(T −1) )
N
R
N − 1 + w(x(T ) )P̃(T −1) (xT , dx(T −1) )
≤ log EP̃2
N
!
R
w(x(T ) )P̃(T −1) (xT , dx(T −1) )P̃T (dxT ) − 1
= log 1 +
N
EP [w] − 1
V
= log 1 +
= log 1 +
.
N
N
≤ EP̃T log
For SIR, to upper bound the KL divergence, we can use the fact that
T
Y
(N + wt (xt | x(t−1) ) − 1)
t=1
T
=N +N
T −1
T
T X
T
X
X
T −2
(wxt − 1) + N
(wxt − 1)(wxs − 1)
t=1
+ N T −3
T X
T X
T
X
t=1 s<t
(wxt − 1)(wxs − 1)(wxr − 1) + · · · +
t=1 s<t r<s
T
Y
(wxt − 1),
t=1
29
(3.15)
where wxt , wt (xt | x(t−1) ). So, by Proposition 3.1.7 and an application of (3.15),
KL(P̃T ||P̄TR )
3.2
R QT
t=1 (N + wt (xt | x(t−1) ) − 1)P̃(T −1) (xT , dx(T −1) )
≤ EP̃T log
NT i
hQ
T
EP
t=1 (N + wt (xt | x(t−1) ) − 1)
≤ log
NT
PT
!
E
[w
−
1]
1
P
xt
= log 1 + t=1
+Θ
N
N2
PT
!
V
1
t
= log 1 + t=1 + Θ
N
N2
Time-uniform Bounds
In this section we give uniform convergence results over time in the marginal distribution case for SIR. For the time-uniform results we assume that for all t ≥ 1, we
have probability kernels Pt and Qt , so we can consider the joint distribution P(t) and
marginal distribution P̃t for unbounded t.
We will assume that the wt are uniformly bounded from above and uniformly
bounded away from zero, which is a standard assumption in asymptotic analyses of
SMC methods [see, e.g., 9, 29]:
Assumption (A). For all t ≥ 1,
0 < w ≤ wt (xt | x(t−1) ) ≤ w < ∞.
(3.16)
Define the modified proposal distribution Q∗(t,T ) , Q(t) ⊗ Pt+1:T which uses the
standard proposal distribution for the first t time steps and then proposes from the
R∗
true conditional distribution at times t + 1 through T . Let P̂t,T
be the SIR estimator
R∗
R∗
for P̃T when the proposal Q∗(t,T ) is used and let P̄t,T
, E[P̂t,T
].
Time-uniform results require an asymptotic stability assumption on Pt and Qt .
30
The weakest such assumption we consider controls only the limiting behavior of the
system:
Assumption (B).
R∗
lim sup KL(P̃T +t ||P̄t,T
+t ) = 0.
T →∞ t≥0
(3.17)
We will also consider the following stronger condition in order to obtain time-uniform
convergence rates:
Assumption (C). There exists T0 ≥ 1 and γ > 0 such that for all T ≥ T0
R∗
−γT
sup KL(P̃T +t ||P̄t,T
.
+t ) ≤ e
(3.18)
t≥0
Both assumptions can be understood as requiring the stochastic process defined by
the conditional distributions {Pt }t≥1 to have a sufficiently strong mixing property with
respect to the SIR algorithm: no matter how long the (typically incorrect) proposals
from {Qt }t≥1 are used, once the true conditionals are used as proposals, the estimated
SIR marginals converge to the truth. Assumption (B) only requires that mixing occur
in the infinite-time limit while Assumption (C) requires an asymptotically exponential
mixing rate.
We can now state and prove our time-uniform bounds, which are analogous to
those given by Del Moral and Guionnet [12] in the total variation setting.
Theorem 3.2.1. If Assumptions (A) and (B) hold, then
lim sup KL(P̃t ||P̄tR ) = 0.
N →∞ t≥1
31
(3.19)
If Assumptions (A) and (C) hold, then
sup KL(P̃t ||P̄tR )
t≥1
w
≤
N
log(N/w)
1+
γ
(3.20)
for any N ≥ 1 such that
log(N/w)
T = T (N ) ,
≥ T0 .
γ
(3.21)
Proof. The proof is similar in spirit to that of Theorem 3.1 in Del Moral and Guionnet
[12]. First, note that by Assumption (A) and the proof of Theorem 3.1.5
Qt
(N
+
w
(x
|
x
)
−
1)
(N + w − 1)t
s
s
(s−1)
s=1
KL(P̃t ||P̄tR ) ≤ log
≤
log
Nt
Nt
t
t(w − 1)
w−1
≤
= log 1 +
.
N
N
EP
Hence,
sup KL(P̃t ||P̄tR ) ≤
t=1,...,T
Tw
.
N
(3.22)
We also have that
KL(P̃t ||P̄tR ) = EP(t) log
S∗
dP̄t−T,t
dP̃t
dP̃t
=
E
log
+ EP(t) log
P(t)
R
R
R∗
dP̄t
dP̄t
dP̄t−T,t
R∗
dP̄t−T,t
R∗
+ KL(P̃t ||P̄t−T,t
)
dP̄tR
R∗
dP̄t−T,t
= EP(t) log
+ εT ,
dP̄tR
= EP(t) log
where
R∗
εT , sup KL(P̃T +s ||P̄T,T
+s ).
s≥0
32
(3.23)
Reasoning analogously to the proof of Proposition 3.1.7 we see that
R∗
dP̄t,T
dP̃T
Z
=
"
Nt
E Qt
s=1 W s
#
N
X(T ) = x(T ) P̃(T −1) (xT , dx(T −1) ).
(3.24)
R∗
Since dP̄tR = dP̄t,t
, by (3.24), Jensen’s inequality, and Assumption (A)
"
#
N
X(t) = x(t) P̃(t−1) (xt , dx(t−1) )
"
#
Z
N t−T N
NT
E Qt−T
≥
Qt
X(t) = x(t) P̃(t−1) (xt , dx(t−1) )
s=t−T +1 (N − 1 + wxs )
s=1 W s
#
Z " t−T −1 N
N
NT
X(t) = x(t) P̃(t−1) (xt , dx(t−1) )
E Qt−T
≥
(N − 1 + w)T
W
s
s=1
dP̄tR
=
dP̃t
=
Z
Nt
E Qt
s=1 W s
R∗
dP̄t−T,t
NT
.
(N − 1 + w)T dP̃t
(3.25)
Combining (3.23) and (3.25) yields that for all t > T ,
KL(P̃t ||P̄tR ) ≤ T log
Tw
N −1+w
+ εT ≤
+ εT ,
N
N
which together with (3.22) implies
sup KL(P̃t ||P̄tR ) ≤
t≥1
Tw
+ εT .
N
First letting N → ∞ and then taking T → ∞ proves (3.19).
If Assumption (C) holds, then by the same reasoning as before, for all T ≥ T0
sup KL(P̃t ||P̄tR ) ≤
t≥1
Tw
+ e−γT .
N
So choosing
log(N/w)
T = T (N ) ,
γ
33
yields
sup KL(P̃t ||P̄tR )
t≥1
w
≤
N
log(N/w)
1+
γ
.
as long as T (N ) ≥ T0 , proving (3.20).
Theorem 7.4.4 of Del Moral [9] states that under Assumption (A) and an assumption similar in spirit to Assumption (C), for any φ ∈ Bb (P) with sup |φ| ≤ 1,
√
sup E[P̃t (φ) − P̂tR (φ)] = O(1/ N ).
t≥1
The rate
sup KL(P̃t ||P̄tR ) = O(log N/N )
t≥1
thus introduces an additional log N factor not present in Del Moral’s result, which
may be possible to remove. If we measure the distance between P̃t and P̄tR using total
variation, Pinsker’s inequality gives
p
sup dT V (P̃t , P̄tR ) = O( log N/N ).
(3.26)
t≥1
It is not clear that the rate given in (3.26) is optimal, since
dT V (P̃t , P̄tR ) = O(1/N ),
so we suspect that supt≥1 dT V (P̃t , P̄tR ) in fact converges at a 1/N or log N/N rate.
34
3.3
Comparing SIS and SIR
Based on Theorems 3.1.1 and 3.1.5, one might conjecture that, in both the full and
the marginal distribution settings, SIR dominates SIS when
X
Vt V ,
t
and indeed this is the case under some additional hypotheses. In the joint distribution
case, the proof of Theorem 2.1.1 establishes the lower bound
dP̄ S
N
(x(T ) ) ≥
.
dP
N − 1 + w(x(T ) )
When V is finite, a matching upper bound of the form
dP̄ S
N
(x(T ) ) ≤
+ o(1)
dP
N − 1 + w(x(T ) )
can also be established, where for large N , the o(1)-term can essentially be ignored.
Note that the finiteness of V is already a necessary condition for Theorem 3.1.1 to
be non-trivial. Analogous statements hold for SIR in the joint case and SIS and SIR
in the marginal case. To prove the conjecture we will require that Assumption (A)
P
holds. Clearly Assumption (A) implies that V and t Vt are finite, so Theorems 3.1.1
and 3.1.5 are non-trivial in this setting.
Theorem 3.3.1. If Assumption (A) holds and
P
t
Vt < V , then for N sufficiently
large, KL(P||P̄ R ) < KL(P||P̄ S ) and KL(P̃T ||P̄TR ) < KL(P̃T ||P̄TS ).
Remark 10. It is instructive to consider the T = 2 case and assume that P1 , Q1 , P2 (x1 , ·),
and Q2 (x1 , ·) share a common dominating measure λ. Write pt =
dPt
dλ
and qt =
We can then write out the three variance terms slightly more explicitly as
p1 (x1 )2
λ(dx1 ) − 1
q1 (x1 )
Z
p2 (x2 | x1 )2
V2 = VarP(1) ⊗Q2 [w2 ] = p1 (x1 )
λ(dx(2) ) − 1
q2 (x2 | x1 )
Z
V1 = VarQ1 [w1 ] =
35
dQt
.
dλ
Z
V = VarQ(2) [w(2) ] =
p1 (x1 )2 p2 (x2 | x1 )2
λ(dx(2) ) − 1.
q1 (x1 ) q2 (x2 | x1 )
Remark 11. Still considering the T = 2 case, note the similarity between V2 and V ,
with the only difference being that the latter has an additional w1 (x1 ) =
p1 (x1 )
q1 (x1 )
term
in the integral. Say Q1 is of low quality but Q2 is of higher quality for choices of
x with high P1 -probability than for x of low P1 -probability. In this case, V will be
very large compared to V2 because in the V integral, the w1 (x1 ) term will overweight
the w2 (x2 | x1 ) term in exactly the places where it has high variance, whereas V2 will
overweight the w2 (x2 | x1 ) term in exactly the places where it has low variance. The
V1 may have reasonably large magnitude, but V will still be much larger. Thus SIR
will be superior to SIS in cases where Q1 is of low quality, but Q2 is of better quality
in regions of greatest importance.
A converse of Theorem 3.3.1 also holds:
Theorem 3.3.2. If Assumption (A) holds and
P
t
Vt > V , then for N sufficiently
large, KL(P||P̄ R ) > KL(P||P̄ S ) and KL(P̃T ||P̄TR ) > KL(P̃T ||P̄TS ).
Hence, we have an asymptotically necessary and sufficient condition for SIR to be
superior to SIS.
Corollary 3.3.3. If Assumption (A) holds, then for N sufficiently large:
KL(P||P̄ R ) < KL(P||P̄ S )
if and only if
X
Vt < V .
(3.27)
t
and
KL(P̃T ||P̄TR ) < KL(P̃T ||P̄TS )
if and only if
X
Vt < V .
(3.28)
t
Proofs of Theorems 3.3.1 and 3.3.2 are given in Appendix B.
Remark 12. Examining the proofs of Theorems 3.3.1 and 3.3.2, one can see that the
P
key quantities for determining when N is “sufficiently large” are ∆ , | t Vt − V |,
P
w, and w. That is, the greater the difference between t Vt and V , the larger w
36
is, and the smaller w is, the smaller N needs to be to reach the asymptotic regime.
The dependence on ∆ here is quite natural. Indeed, the regime of ∆ large is exactly
the one of greatest interest, since that is when the choice of SIR or SIS will have the
greatest impact. As for the dependence on w (or w), if w is zero or extremely small
(resp. w is infinite or extremely large), but the weights are only small (resp. large)
with low probability, then versions of Theorems 3.3.1 and 3.3.2 and Corollary 3.3.3
that hold with high probability can easily be formulated.
37
Chapter 4
Conclusions and Future Work
In this thesis we have investigated the quality of two SMC estimators — SIS and SIR
— from what we have called the measure perspective. As discussed in Section 2.2,
from we call the operator perspective, the asymptotic variance (AV) of SIS and SIR
estimators can be used to judge their relative performance when used to approximate
an expectation P (φ), φ ∈ Bb (X). Here, our analysis has instead centered on the KL
divergence from P to P̄ S (and to P̄ R ). In addition to proving convergence rates for
the KL divergences of both expected estimators, we obtained necessary and sufficient
conditions for the SIR estimator to be superior to the SIS estimator in terms of
KL divergence, providing an alternative to AV which is applicable when taking the
measure viewpoint. In particular, we showed the “measure AV” — for both the joint
P
distribution and the filtering distribution — is V for SIS and t Vt for SIR.
In the remainder of this chapter, we conclude by discussing some related results,
drawing some connections between that work and our results, and speculating on
promising directions for future research.
38
4.1
4.1.1
Other Convergence Rates for SMC
Lp Error Bounds
In addition to the CLT results already discussed, numerous other asymptotic and nonasymptotic analyses of P̂ S and P̂ R have been carried out. We mention just few of
them here. Throughout this section we write constants in functional form to indicate
which quantities they depend on. The values of the constants in this and subsequent
sections may change from line to line.
For interacting particle systems, which include SIR as a special case, Lp error
bounds and Glivenko-Cantelli-type theorems have been established [cf. 9]. One such
Lp bound states that, for any p ≥ 1 and any φ ∈ Bb (X),
h
i1/p a(p)b(φ)c(P, Q)
R
p
√
E |P̂(T
(φ)
−
P
(φ)|
≤
(T )
)
N
(4.1)
A time-uniform Lp bound (cf. Section 3.2),
h
i1/p a(p)b(φ)c(P, Q)
R
√
sup E |P̂(t)
(φ) − P(t) (φ)|p
≤
.
t≥1
N
(4.2)
has also been established, though under stronger conditions than the fixed time result.
The Glivenko-Cantelli-type theorem states that for any p ≥ 1 and any countable
collection of uniformly bounded functions F ⊆ Bb (X),
1/p
a(p)c(P, Q)C(F)
R
p
√
,
E sup |P̂ (φ) − P(φ)|
≤
N
φ∈F
where C(F) measures the complexity of the function class F.
39
(4.3)
4.1.2
KL Divergence Bounds
A KL divergence bound in the reverse direction to that which we consider (cf. Section 2.2),
KL(P̄ R ||P) ≤
c(P, Q)
,
N
(4.4)
can be extracted as a special case of a more general propagation-of-chaos result [9,
Theorem 8.3.2]. Propagation of chaos concerns the relationship between the expected
joint distribution over k particles and the distribution P ⊗k of k independent samples
from P. In order words, propagation-of-chaos results measure how close the k particles
are to being independent samples from P. The special case of interest to us is when
k = 1. Propagation-of-chaos results require controlling the strength of the interactions
between the particles and thus rely on a mixing condition, which is unnecessary in
the k = 1 case. Thus, (4.4) is not directly comparable to our bound on KL(P||P̄ R ).
An interesting open question concerns the fact that, under appropriate hypotheses,
KL divergence becomes symmetric for “infinitesimal” divergences. It is not clear (to
us) whether, in the SMC setting, KL divergence in one direction (asymptotically)
bounds KL divergence in the other. An answer in either the affirmative or the negative
would provide insight into the behavior of P̄ R as well as into how our results relate
to those of Del Moral.
4.2
Adaptive Resampling and αSMC
This thesis has examined the behavior of SMC estimators in the presence of deterministic resampling. However, it is common for practitioners to use adaptive resampling
techniques to choose when to resampling based on the realized particle weights. The
most popular adaptive scheme is based on the effective sample size (ESS) criterion
40
[14, 16, 23, 24]. The ESS for normalized weights w = (w1 , . . . , wN ) is defined as
ESS(w) ,
N
X
!−1
wn2
.
(4.5)
n=1
The function ESS(w) ranges from 1 to N and is interpreted as the effective number
of particles the importance sampler is using if the particles have weights w. If the
ESS is below some fixed threshold (e.g. N/2), then a resampling step is performed.
There are myriad heuristic arguments for using ESS [23, 24, 25] and some theoretical analyses of the behavior of adaptive resampling algorithms under a variety of
technical assumptions [8, 10, 29]. Recently Whiteley, Lee, and Heine [29] provided a
rigorous justification for the use of ESS from the operator viewpoint. They showed
that if the ESS does not fall below γN , γ ∈ (0, 1] a fixed parameter, then the SMC
algorithm does in fact behave as if there are γN particles. So in this technical sense
ESS is in fact a valid measure of the effective sample size. We now briefly describe
their set-up and one relevant result. Consider a state-space model where
Xt = Z is a fixed measurable space,
Pt (x(t−1) , dxt ) ∝ K(xt−1 , dxt )g(xt , yt ), and
Qt (x(t−1) , dxt ) = K(xt−1 , dxt ),
with yt the observation at time t. In this state-space model, wt ∝ g(·, yt ). Critically,
the goal in [29] was to perform one-step-ahead prediction. That is, to approximate
the predictive distribution
Z
Pt|t−1 (dxt ) ∝
K(xt−1 , dxt )
t−1
Y
K(xs−1 , dxs )g(xs , ys )f0 (dx0 ),
(4.6)
s=1
where f0 is the density of the initial state x0 .
Whiteley et al. give an algorithm they call αSMC, which generalizes SIS, SIR,
and numerous other SMC variants. The algorithm provides a flexible resampling
mechanism in which at each time t, a stochastic matrix αt−1 is chosen from a set of
41
N × N matrices, denoted AN . We denote the value in the n-th row and k-th column
nk
.
of αt−1 by αt−1
Algorithm 4 αSMC
for n = 1, . . . , N do
Sample particle X1n ∼ Q1
n
Set weights W1n ← w1 (X(1)
)
end for
for t = 2, . . . , T do
Select αt−1 from AN according to some functional of X (t−1)
for n = 1, . . . , N do
P
k
k
nk
Wtn ← N
k=1 αt−1 Wt−1 wt (Xt )
n
Resample particle X̃t−1
| W t−1 , X (t−1) ∼
PN
k=1
k
k
αnk
t−1 Wt−1 wt (Xt )
δX(t−1)
k
Wtn
n
Sample next particle Xtn | X̃t−1 ∼ K(X̃t−1
, ·)
end for
end for
The αSMC predictive estimators take the form
α
P̂t|t−1
=
N
X
n=1
Wtn
δXtn .
PN
k
k=1 Wt
(4.7)
Define the (generalized) ESS for αSMC at time t to be
EtN
P
n 2
(N −1 N
n=1 Wt )
,
.
P
n 2
N −1 N
n=1 (Wt )
(4.8)
Since EtN has been normalized to lie between 1/N and 1, the standard ESS at time t
is given by ESSt , N EtN .
The SMC algorithms already discussed in this thesis can be obtained as special
cases of αSMC as follows. SIS is recovered by always setting αt−1 to be IN , the N ×N
identity matrix. For SIR, set αt−1 = 11/N , the N × N matrix with all entries equal to
1/N . The standard ESS-based adaptive SMC algorithm is obtained by setting αt−1
42
to 11/N if EtN < γ and IN otherwise.
Whiteley et al. give a time-uniform Lp error bound when the version of αSMC
that is employed guarantees a lower bound on EtN . Let φ ∈ Bb (X) with sup |φ| ≤ 1
and let p ≥ 1. Then, under appropriate regularity conditions,
sup EtN
t≥1
≥ γ =⇒ sup E
t≥1
h
|P̂tα (φ)
p
− P̃t (φ)|
i1/p
≤
a(p)c(P, Q)
√
.
γN
(4.9)
Comparing (4.9) to [9, Theorem 7.4.4], which states that
i1/p a(p)c(P, Q)
h
√
,
sup E |P̂tR (φ) − P̃t (φ)|p
≤
t≥1
N
(4.10)
we see that the condition supt≥1 EtN ≥ γ ensures that the effective number of particles
in the time-uniform Lp error bound is γN compared to N particles if SIR is used. We
conjecture that a similar generalization from (4.10) to (4.9) may exist for our timeuniform result for KL-divergence given in Theorem 3.2.1. However, the ESS condition
is likely to be different from that used in the Lp context. The form of this KL ESS
condition could prove to be of practical interest when employing SMC methods for
sampling.
4.3
Global Parameter Estimation in State-space
Models
Typically, in addition to the latent state at each time, state-space models have a global
parameter θ which, in the Bayesian setting, has posterior distribution that must also
be estimated. Standard SMC algorithms only handle the case of fixed θ. An active
area of research is developing extensions to SMC that allow for estimation of the
joint state and global parameter distribution in an online manner [5, 7, 14]. Applying
techniques developed in this thesis to understand these algorithms from the measure
perspective is an exciting direction for future work. For example, the nested particle
filter algorithm [7] provides a scalable approach to parameter estimation problems,
43
with constant cost at each time (unlike, e.g., the SMC2 algorithm [5], which has O(t)
cost at time t). Hence, an analysis of the algorithm from the measure perspective
would be particularly worthwhile and would complement the existing analyses done
from the operator perspective.
44
Appendix A
Auxiliary Results
A.1
Technical Lemmas
Proposition A.1.1. Let $Z_N = \sum_{n=1}^{N} X_n$, where the $X_n$ are i.i.d. nonnegative random variables with mean $\mu > 0$ and $0 \le a \le X_n \le b < \infty$. Let $\mu_N \triangleq \mathbb{E}[Z_N] = N\mu$ and $c > 0$. Then
$$\mathbb{E}\left[\frac{1}{c + Z_N}\right] \le \frac{1}{c + \mu_N} + \Theta(N^{-4/3}) \tag{A.1}$$
and if $a > 0$, then the $\Theta(N^{-4/3})$ term is independent of $c$.
Proof. We have
\begin{align*}
\mathbb{E}\left[\frac{1}{c + Z_N}\right]
&= \mathbb{E}\left[\frac{\mathbb{1}(\mu_N - Z_N \ge tN)}{c + Z_N}\right] + \mathbb{E}\left[\frac{\mathbb{1}(\mu_N - Z_N < tN)}{c + Z_N}\right] \\
&\le \frac{1}{c + Na}\,\mathbb{E}[\mathbb{1}(\mu_N - Z_N \ge tN)] + \frac{1}{c + N(\mu - t)}\,\mathbb{E}[\mathbb{1}(\mu_N - Z_N < tN)] \\
&\le \frac{1}{c + Na}\exp\left(\frac{-2t^2 N^2}{N(b - a)^2}\right) + \frac{1}{c + N(\mu - t)},
\end{align*}
where the final step follows from Hoeffding's inequality. Choosing $t = N^{-1/3}$ yields
\begin{align*}
\mathbb{E}\left[\frac{1}{c + Z_N}\right]
&\le \frac{1}{c + Na}\exp\left(\frac{-2N^{1/3}}{(b - a)^2}\right) + \frac{1}{c + N(\mu - N^{-1/3})} \\
&= \frac{1}{c + N\mu} + \frac{1}{c + Na}\exp\left(\frac{-2N^{1/3}}{(b - a)^2}\right) + \frac{1}{c + N(\mu - N^{-1/3})} - \frac{1}{c + N\mu} \\
&= \frac{1}{c + N\mu} + \frac{1}{c + Na}\exp\left(\frac{-2N^{1/3}}{(b - a)^2}\right) + \frac{N^{2/3}}{(c + N(\mu - N^{-1/3}))(c + N\mu)} \\
&= \frac{1}{c + N\mu} + \Theta\left(\frac{1}{N^{4/3}}\right).
\end{align*}
Note that if $a > 0$, then the $\Theta\left(\frac{1}{N^{4/3}}\right)$ term can be made independent of $c$ by replacing $c$ with zero.
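As a sanity check of Proposition A.1.1, the following Python sketch (entirely our own; the Uniform$[a, b]$ summands and constants are arbitrary choices satisfying the hypotheses) estimates $\mathbb{E}[1/(c + Z_N)]$ by Monte Carlo and compares it to $1/(c + \mu_N)$; by Jensen's inequality the difference is nonnegative, and (A.1) says it decays at least as fast as $N^{-4/3}$:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b, c = 0.5, 1.5, 1.0   # arbitrary bounds, with mu = (a + b) / 2 = 1
    mu = (a + b) / 2

    for N in (10, 100, 1000):
        # Z_N is a sum of N i.i.d. Uniform[a, b] random variables.
        z = rng.uniform(a, b, size=(20_000, N)).sum(axis=1)
        lhs = np.mean(1.0 / (c + z))   # Monte Carlo estimate of E[1/(c + Z_N)]
        rhs = 1.0 / (c + N * mu)
        print(N, lhs - rhs)            # nonnegative and shrinking rapidly in N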
Lemma A.1.2. For all $\epsilon > 0$ there is some $N > 0$ such that $\log(1 + x) \ge \sum_{k=1}^{2N} (-1)^{k+1} x^k$ for all $x > -1 + \epsilon$.
Proof. We have
\begin{align*}
\log(1 + x) - \sum_{k=1}^{2N} (-1)^{k+1} x^k
&= \sum_{k=1}^{2N} (-1)^{k+1} \frac{x^k}{k} - \sum_{k=1}^{2N} (-1)^{k+1} x^k + R_{2N}(x) \\
&= G_{2N}(x) + R_{2N}(x),
\end{align*}
where $G_{2N}(x) = \sum_{k=1}^{2N} \frac{(k-1)(-x)^k}{k}$ and $R_{2N}(x)$ is the remainder term in the $2N$-degree Taylor series for $\log(1 + x)$ centered at 0. For $x \in (-1, 0]$,
$$G_{2N}(x) \ge \frac{1}{2}\sum_{k=2}^{2N} (-x)^k \ge \frac{1}{2}\int_2^{2N+1} (-x)^t\, dt = \frac{1}{2}\,\frac{(-x)^{2N+1} - (-x)^2}{\log(-x)} \ge -\frac{1}{2}\,\frac{(-x)^2}{\log(-x)} \ge \frac{1}{2}(-x)^3,$$
where the last inequality follows from the fact that $\log x \ge (x - 1)/x$, so $-1/\log(x) \ge -x/(x - 1) \ge x$ for $x \in [0, 1)$. For $x \in (-1, 0]$, the magnitude of the remainder term can be bounded as
$$|R_{2N}(x)| = \left|\int_0^x \frac{(x - t)^{2N}}{(1 + t)^{2N+1}}\, dt\right| \le |x| \max_{x \le t \le 0} f_N(x, t),$$
where $f_N(x, t) = \frac{(x - t)^{2N}}{(1 + t)^{2N+1}}$. Since
$$\frac{\partial f_N}{\partial t} = -\frac{(x - t)^{2N-1}}{(1 + t)^{2N+2}}\,(-t + (2N + 1)x + 2N),$$
for fixed $x$, $f_N$ is increasing in $t$ on the interval $[x, (2N + 1)x + 2N]$, so for all $x \in [-2N/(2N + 1), 0]$, $f_N$ is increasing in $t$ on the interval $[x, 0]$. Hence, for all $x \in [-2N/(2N + 1), 0]$,
$$|R_{2N}(x)| \le |x|\, f_N(x, 0) = (-x)^{2N+1}.$$
Note that $(-x)^{2N+1} \le \frac{1}{2}(-x)^3$ for $x \in [-1/2^{1/(2N-2)}, 0]$. Letting
$$b(N) = \max\{-1/2^{1/(2N-2)},\ -2N/(2N + 1)\}, \tag{A.2}$$
we have for all $x \in [b(N), 0]$ that $|R_{2N}(x)| \le G_{2N}(x)$, and thus that
$$\log(1 + x) - \sum_{k=1}^{2N} (-1)^{k+1} x^k \ge 0. \tag{A.3}$$
For $x > 0$, note that $\log(1 + x) \ge x/(1 + x)$ and that
$$\frac{x}{1 + x} = \sum_{k=1}^{2N} (-1)^{k+1} x^k + \sum_{k=2N+1}^{\infty} (-1)^{k+1} x^k.$$
For $x \in [0, 1]$,
$$\sum_{k=2N+1}^{\infty} (-1)^{k+1} x^k = \sum_{k=N}^{\infty} (x^{2k+1} - x^{2k+2}) \ge 0,$$
while for $x \ge 1$,
$$\sum_{k=2N+1}^{\infty} (-1)^{k+1} x^k = x^{2N+1} + \sum_{k=N+1}^{\infty} (x^{2k+1} - x^{2k}) \ge 0.$$
Thus, for $N > 0$, $\log(1 + x) \ge x/(1 + x) \ge \sum_{k=1}^{2N} (-1)^{k+1} x^k$. So for fixed $\epsilon$, choosing $N$ such that $b(N) < -1 + \epsilon$ completes the proof.
A.2 Auxiliary SMC Results
Lemma A.2.1. $P$ and $\bar{P}^I$ are absolutely continuous with respect to each other.
Proof. The fact that $\bar{P}^I \ll P$ follows immediately from Lemma 2.1.3. To see that $P \ll \bar{P}^I$, note that for measurable $A \subseteq X$, $\bar{P}^I(A) = 0$ implies there is some $B \subset A$ such that $Q(B) = 0$ and $w(x) = 0$ for all $x \in A \setminus B$. But since $P \ll Q$, $Q(B) = 0$ implies $P(B) = 0$, and since $w(x) = 0$ for $x \in A \setminus B$, $P(A \setminus B) = 0$ as well. So $P(A) = 0$.
Proposition A.2.2. For the SIS algorithm, $\bar{P}_{(T)}^S \ll P_{(T)}$ and
$$\frac{d\bar{P}_{(T)}^S}{dP_{(T)}}(x_{(T)}) = N\, \mathbb{E}\left[\frac{1}{\sum_{n=1}^{N} w(X_{(T)}^n)} \,\middle|\, X_{(T)}^N = x_{(T)}\right]. \tag{A.4}$$
Proof. The result is an immediate corollary of Lemma 2.1.3.
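To illustrate (A.4), the following sketch (our own toy example, with $T = 1$, proposal $Q = \mathcal{N}(0, 1)$, and target $P = \mathcal{N}(\mu, 1)$; all tuning constants are arbitrary) estimates the right-hand side by Monte Carlo and checks that, as a density ratio must, it integrates to one against $P$:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, N = 1.0, 50   # hypothetical toy setup: Q = N(0,1), P = N(mu,1)

    def w(x):
        # Importance weight w = dP/dQ for this Gaussian pair.
        return np.exp(mu * x - 0.5 * mu ** 2)

    def rn_derivative(x, inner=2_000):
        # Monte Carlo estimate of (A.4) with T = 1:
        # N * E[ 1 / sum_n w(X^n) | X^N = x ], conditioning on the N-th
        # particle equaling x, so the other N - 1 particles are i.i.d. from Q.
        others = w(rng.normal(size=(inner, N - 1))).sum(axis=1)
        return N * np.mean(1.0 / (w(x) + others))

    # Averaging the ratio over draws from P estimates its integral against P,
    # which should be close to 1.
    xs = rng.normal(loc=mu, size=500)
    print(np.mean([rn_derivative(x) for x in xs]))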
It also follows from Lemma A.2.1 that:
Lemma A.2.3. $P$ and $\bar{P}^S$ are absolutely continuous with respect to each other.
Let $\overline{W}_t \triangleq \overline{W}_t(X^{(t)}) \triangleq \sum_{n=1}^{N} W_t^n$ be the sum of the SIR weights at time $t$. Define the functions $f_1^{(T)}, f_2^{(T)}, \dots, f_T^{(T)}$ recursively by
$$f_T^{(T)}(X^{(T)}, \langle\rangle) \triangleq \frac{N}{\overline{W}_T(X^{(T)})} \tag{A.5}$$
$$f_t^{(T)}(X^{(t)}, x_{t+1:T}) \triangleq \frac{N\, \mathbb{E}\big[f_{t+1}^{(T)}(X^{(t+1)}, x_{t+2:T}) \,\big|\, \tilde{X}_{(t)}^N = X_{(t)}^N,\ X_{t+1}^N = x_{t+1}\big]}{\overline{W}_t(X^{(t)})} \tag{A.6}$$
for $1 \le t \le T - 1$.
Proposition A.2.4. For the SIR algorithm, $\bar{P}_{(T)}^R \ll P_{(T)}$ and
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) = \mathbb{E}\big[f_1^{(T)}(X^{(1)}, x_{2:T}) \,\big|\, X_{(1)}^N = x_1\big]. \tag{A.7}$$
Proof. First define the measure-valued functions
$$g_T^{(T)}(X^{(T)}, \langle\rangle) \triangleq \frac{N}{\overline{W}_T(X^{(T)})}\,\delta_{X_{(T)}^N} \tag{A.8}$$
$$g_t^{(T)}(X^{(t)}, x_{t+1:T}) \triangleq \frac{N\, \mathbb{E}\big[g_{t+1}^{(T)}(X^{(t+1)}, x_{t+2:T}) \,\big|\, \tilde{X}_{(t)}^N = X_{(t)}^N,\ X_{t+1}^N = x_{t+1}\big]}{\overline{W}_t(X^{(t)})} \tag{A.9}$$
and note that $\bar{P}_{(T)}^R = \mathbb{E}[w_T(X_T^N \mid \tilde{X}_{(T-1)}^N)\, g_T^{(T)}(X^{(T)}, \langle\rangle)]$. The inductive hypothesis is that for $1 \le t \le T$,
$$\bar{P}_{(T)}^R = \mathbb{E}\left[\int w_t(X_t^N \mid \tilde{X}_{(t-1)}^N)\, g_t^{(T)}(X^{(t)}, x_{t+1:T})\, P_{t+1:T}(X_{(t)}^N, dx_{t+1:T})\right]. \tag{A.10}$$
Writing
$$\tilde{g}_t^{(T)}(\tilde{X}^{(t-1)}) \triangleq \int \mathbb{E}\big[g_t^{(T)}(X^{(t)}, x_{t+1:T}) \,\big|\, X_t^N = x_t\big]\, P_{t:T}(\tilde{X}_{(t-1)}^N, dx_{t:T}) \tag{A.11}$$
and assuming the inductive hypothesis holds for some fixed $t$,
\begin{align*}
\bar{P}_{(T)}^R
&= \mathbb{E}\left[\int w_t(X_t^N \mid \tilde{X}_{(t-1)}^N)\, g_t^{(T)}(X^{(t)}, x_{t+1:T})\, P_{t+1:T}(X_{(t)}^N, dx_{t+1:T})\right] \\
&= \mathbb{E}\left[\int \mathbb{E}\big[g_t^{(T)}(X^{(t)}, x_{t+1:T}) \,\big|\, X_t^N = x_t\big]\, P_t(\tilde{X}_{(t-1)}^N, dx_t)\, P_{t+1:T}(\langle \tilde{X}_{(t-1)}^N, x_t \rangle, dx_{t+1:T})\right] \\
&= \mathbb{E}\left[\int \mathbb{E}\big[g_t^{(T)}(X^{(t)}, x_{t+1:T}) \,\big|\, X_t^N = x_t\big]\, P_{t:T}(\tilde{X}_{(t-1)}^N, dx_{t:T})\right] \\
&= \mathbb{E}\left[\sum_{k=1}^{N} \frac{w_{t-1}(X_{t-1}^k \mid \tilde{X}_{(t-2)}^k)}{\overline{W}_{t-1}}\, \mathbb{E}\big[\tilde{g}_t^{(T)}(\tilde{X}^{(t-1)}) \,\big|\, \tilde{X}_{(t-1)}^N = X_{(t-1)}^k\big]\right] \\
&= \mathbb{E}\left[\frac{N\, w_{t-1}(X_{t-1}^N \mid \tilde{X}_{(t-2)}^N)}{\overline{W}_{t-1}}\, \mathbb{E}\big[\tilde{g}_t^{(T)}(\tilde{X}^{(t-1)}) \,\big|\, \tilde{X}_{(t-1)}^N = X_{(t-1)}^N\big]\right] \\
&= \mathbb{E}\left[\int w_{t-1}(X_{t-1}^N \mid \tilde{X}_{(t-2)}^N)\, g_{t-1}^{(T)}(X^{(t-1)}, x_{t:T})\, P_{t:T}(X_{(t-1)}^N, dx_{t:T})\right].
\end{align*}
The induction yields
\begin{align*}
\bar{P}_{(T)}^R
&= \mathbb{E}\left[\int w_1(X_1^N)\, g_1^{(T)}(X^{(1)}, x_{2:T})\, P_{2:T}(X_{(1)}^N, dx_{2:T})\right] \\
&= \mathbb{E}\left[\int \mathbb{E}\big[g_1^{(T)}(X^{(1)}, x_{2:T}) \,\big|\, X_{(1)}^N = x_1\big]\, P_1(dx_1)\, P_{2:T}(x_1, dx_{2:T})\right] \\
&= \int \mathbb{E}\big[g_1^{(T)}(X^{(1)}, x_{2:T}) \,\big|\, X_{(1)}^N = x_1\big]\, P_{(T)}(dx_{(T)}),
\end{align*}
so applying Lemma 2.1.2 proves the proposition.
Lemma A.2.5. $P$ and $\bar{P}^R$ are absolutely continuous with respect to each other.
Proof. The reasoning is analogous to the proof of Lemma A.2.1.
The following is stated as Lemma 3.1.3, but not proven there.
Lemma A.2.6.
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \ge \frac{N^T}{\prod_{t=1}^{T}(N - 1 + w_t(x_t \mid x_{1:t-1}))}.$$
Proof. Define
\begin{align*}
h_t^{(T)}(x_{(T)}) &\triangleq \frac{N^{T-t+1}}{\prod_{s=t}^{T}(N - 1 + w_s(x_s \mid x_{(s-1)}))} \\
h_{t,t+1}^{(T)}(X^{(t)}, x_{(T)}, \psi) &\triangleq \psi(X^{(t)}, x_{(T)}) \\
h_{t,s}^{(T)}(X^{(s-1)}, x_{(T)}, \psi) &\triangleq \mathbb{E}\left[\frac{N\, h_{t,s+1}^{(T)}(X^{(s)}, x_{(T)}, \psi)}{\sum_{n=1}^{N} w_s(X_s^n \mid \tilde{X}_{(s-1)}^n)} \,\middle|\, X_{(s)}^N = x_{(s)}\right]
\end{align*}
for $1 \le s \le t$. First note that
\begin{align*}
h_{t,t}^{(T)}(X^{(t-1)}, x_{(T)}, \psi)
&= \mathbb{E}\left[\frac{N\, h_{t,t+1}^{(T)}(X^{(t)}, x_{(T)}, \psi)}{\sum_{n=1}^{N} w_t(X_t^n \mid \tilde{X}_{(t-1)}^n)} \,\middle|\, X_{(t)}^N = x_{(t)}\right] \\
&= \mathbb{E}\left[\frac{N\, \psi(X^{(t)}, x_{(T)})}{\sum_{n=1}^{N} w_t(X_t^n \mid \tilde{X}_{(t-1)}^n)} \,\middle|\, X_{(t)}^N = x_{(t)}\right] \\
&= h_{t-1,t}^{(T)}\big(X^{(t-1)}, x_{(T)}, \pi_t(\psi)\big),
\end{align*}
where
$$\pi_t(\psi)(X^{(t)}, x_{(T)}) \triangleq \mathbb{E}\left[\frac{N\, \psi(X^{(t)}, x_{(T)})}{\sum_{n=1}^{N} w_t(X_t^n \mid \tilde{X}_{(t-1)}^n)} \,\middle|\, X_{(t)}^N = x_{(t)}\right]. \tag{A.12}$$
Also, if $h_{t,s}^{(T)}(X^{(s-1)}, x_{(T)}, \psi) = h_{t-1,s}^{(T)}(X^{(s-1)}, x_{(T)}, \psi')$, then
\begin{align*}
h_{t,s-1}^{(T)}(X^{(s-2)}, x_{(T)}, \psi)
&= \mathbb{E}\left[\frac{N\, h_{t,s}^{(T)}(X^{(s-1)}, x_{(T)}, \psi)}{\sum_{n=1}^{N} w_{s-1}(X_{s-1}^n \mid \tilde{X}_{(s-2)}^n)} \,\middle|\, X_{(s-1)}^N = x_{(s-1)}\right] \\
&= \mathbb{E}\left[\frac{N\, h_{t-1,s}^{(T)}(X^{(s-1)}, x_{(T)}, \psi')}{\sum_{n=1}^{N} w_{s-1}(X_{s-1}^n \mid \tilde{X}_{(s-2)}^n)} \,\middle|\, X_{(s-1)}^N = x_{(s-1)}\right] \\
&= h_{t-1,s-1}^{(T)}(X^{(s-2)}, x_{(T)}, \psi').
\end{align*}
So by induction on $s$,
$$h_{t,1}(\langle\rangle, x_{(T)}, \psi) = h_{t-1,1}(\langle\rangle, x_{(T)}, \pi_t(\psi)). \tag{A.13}$$
Observe that if $\psi(X^{(t)}, x_{(T)}) = h_{t+1}^{(T)}(x_{(T)})$, then by Jensen's inequality
\begin{align*}
\pi_t(h_{t+1}^{(T)})(X^{(t)}, x_{(T)})
&= \mathbb{E}\left[\frac{N\, h_{t+1}^{(T)}(x_{(T)})}{\sum_{n=1}^{N} w_t(X_t^n \mid \tilde{X}_{(t-1)}^n)} \,\middle|\, X_{(t)}^N = x_{(t)}\right] \\
&\ge \frac{N\, h_{t+1}^{(T)}(x_{(T)})}{\mathbb{E}\big[\sum_{n=1}^{N} w_t(X_t^n \mid \tilde{X}_{(t-1)}^n) \,\big|\, X_{(t)}^N = x_{(t)}\big]} \\
&= \frac{N\, h_{t+1}^{(T)}(x_{(T)})}{N - 1 + w_t(x_t \mid x_{(t-1)})} = h_t^{(T)}(x_{(T)}). \tag{A.14}
\end{align*}
Next, applying Proposition A.2.4 and (A.13) yields
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) = \mathbb{E}\big[f_1^{(T)}(X^{(1)}, x_{2:T}) \,\big|\, X_{(1)}^N = x_1\big] = h_{T,1}(\langle\rangle, x_{(T)}, 1) = h_{T,1}(\langle\rangle, x_{(T)}, h_{T+1}^{(T)}).$$
Assume that for fixed $1 \le t \le T$, $\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \ge h_{t,1}(\langle\rangle, x_{(T)}, h_{t+1}^{(T)})$. Then applying (A.13) and (A.14) yields
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \ge h_{t,1}(\langle\rangle, x_{(T)}, h_{t+1}^{(T)}) = h_{t-1,1}(\langle\rangle, x_{(T)}, \pi_t(h_{t+1}^{(T)})) \ge h_{t-1,1}(\langle\rangle, x_{(T)}, h_t^{(T)}).$$
So by induction on $t$, $\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \ge h_{1,1}(\langle\rangle, x_{(T)}, h_2^{(T)})$, and one further application of (A.14) gives the bound $h_1^{(T)}(x_{(T)})$, proving the lemma.
Lemma A.2.7.
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \le \frac{N^T}{\prod_{t=1}^{T}(N - 1 + w_t(x_t \mid x_{1:t-1}))} + \Theta(N^{-1/3}). \tag{A.15}$$
Proof. The result follows from Proposition A.1.1 and an argument analogous to that of Lemma 3.1.3.
Appendix B
Proofs of SIS and SIR Comparison Theorems
We give proofs of Theorems 3.3.1 and 3.3.2 in the case when the full joint distribution $P$ is targeted. The proofs for the marginal distribution cases of the theorems are essentially identical. In both proofs, let $\nu \triangleq \sum_t V_t$.
B.1 Proof of Theorem 3.3.1
Let $\varepsilon \triangleq \min\{w_T/2,\ V - \nu\} > 0$, so $V \ge \nu + \varepsilon$. By Theorem 3.1.1,
$$\mathrm{KL}(P \,\|\, \bar{P}^R) \le \frac{\nu}{N} + \Theta\left(\frac{1}{N^2}\right), \tag{B.1}$$
so for sufficiently large $N$,
$$\mathrm{KL}(P \,\|\, \bar{P}^R) \le \frac{\nu + \varepsilon/3}{N}. \tag{B.2}$$
By Propositions A.1.1 and A.2.2,
$$\frac{d\bar{P}^S}{dP}(x_{(T)}) = \mathbb{E}\left[\frac{N}{\sum_{n=1}^{N} w(X_{(T)}^n)} \,\middle|\, X_{(T)}^N = x_{(T)}\right] \le \frac{N}{N - 1 + w(x_{(T)})} + \Theta(N^{-1/3}).$$
Since $w(x_{(T)})$ is bounded, for large enough $N$,
$$\frac{d\bar{P}^S}{dP}(x_{(T)}) \le \frac{N}{N - 1 + w(x_{(T)}) - \varepsilon/6}.$$
Therefore, by Lemma A.1.2, there is an integer $K > 0$ such that
\begin{align*}
\mathrm{KL}(P \,\|\, \bar{P}^S)
&\ge \mathbb{E}_{X_{(T)} \sim P}\left[\log \frac{N - 1 + w(X_{(T)}) - \varepsilon/6}{N}\right] \\
&\ge \mathbb{E}_{X_{(T)} \sim P}\left[\frac{w(X_{(T)}) - 1 - \varepsilon/6}{N} - \sum_{k=2}^{2K} \frac{(-1)^k (w(X_{(T)}) - 1 - \varepsilon/6)^k}{N^k}\right] \\
&\ge \frac{\nu + \varepsilon - \varepsilon/6}{N} - \sum_{k=2}^{2K} \frac{(-1)^k\, \mathbb{E}_{X_{(T)} \sim P}\big[(w(X_{(T)}) - 1 - \varepsilon/6)^k\big]}{N^k}.
\end{align*}
So, for $N$ sufficiently large,
$$\mathrm{KL}(P \,\|\, \bar{P}^S) \ge \frac{\nu + \varepsilon - \varepsilon/3}{N} = \frac{\nu + 2\varepsilon/3}{N}, \tag{B.3}$$
and hence for $N$ sufficiently large,
$$\mathrm{KL}(P \,\|\, \bar{P}^R) \le \frac{\nu + \varepsilon/3}{N} < \frac{\nu + 2\varepsilon/3}{N} \le \mathrm{KL}(P \,\|\, \bar{P}^S), \tag{B.4}$$
as was to be shown.
B.2 Proof of Theorem 3.3.2
Let $\delta \triangleq \min\{w/2,\ \nu - V\} > 0$, so $V \le \nu - \delta$ and, by Theorem 3.1.1,
$$\mathrm{KL}(P \,\|\, \bar{P}^S) \le \frac{\nu - \delta}{N}. \tag{B.5}$$
On the other hand, Lemma A.2.7 states that
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \le \frac{N^T}{\prod_{t=1}^{T}(N - 1 + w_t(x_t \mid x_{1:t-1}))} + \Theta(N^{-1/3}),$$
so, since the $w_t$ are a.s. bounded, for large enough $N$,
$$\frac{d\bar{P}_{(T)}^R}{dP_{(T)}}(x_{(T)}) \le \frac{N^T}{\prod_{t=1}^{T}(N - 1 + w_t(x_t \mid x_{1:t-1}) - \delta/(3T))},$$
and by Lemma A.1.2, there is an integer $K > 0$ such that
\begin{align*}
\mathrm{KL}(P \,\|\, \bar{P}^R)
&\ge \mathbb{E}_{X_{(T)} \sim P}\left[\log \frac{\prod_{t=1}^{T}(N - 1 + w_t(X_t \mid X_{1:t-1}) - \delta/(3T))}{N^T}\right] \\
&= \sum_{t=1}^{T} \mathbb{E}_{X_{(T)} \sim P}\left[\log\left(1 + \frac{w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)}{N}\right)\right] \\
&\ge \sum_{t=1}^{T} \mathbb{E}_{X_{(T)} \sim P}\left[\frac{w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T)}{N} - \sum_{k=2}^{2K} \frac{(-1)^k (w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T))^k}{N^k}\right] \\
&= \sum_{t=1}^{T} \left[\frac{V_t - \delta/(3T)}{N} - \sum_{k=2}^{2K} \frac{(-1)^k\, \mathbb{E}_{X_{(T)} \sim P}\big[(w_t(X_t \mid X_{1:t-1}) - 1 - \delta/(3T))^k\big]}{N^k}\right].
\end{align*}
Thus, for $N$ sufficiently large,
$$\mathrm{KL}(P \,\|\, \bar{P}^R) \ge \sum_{t=1}^{T} \frac{V_t - 2\delta/(3T)}{N} = \frac{\nu - 2\delta/3}{N}, \tag{B.6}$$
and for $N$ sufficiently large,
$$\mathrm{KL}(P \,\|\, \bar{P}^S) \le \frac{\nu - \delta}{N} < \frac{\nu - 2\delta/3}{N} \le \mathrm{KL}(P \,\|\, \bar{P}^R), \tag{B.7}$$
as was to be shown.
Bibliography
[1] C. Andrieu, A. Doucet, and R. Holenstein. “Particle Markov chain Monte Carlo
methods”. In: Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 72.3 (2010), pp. 269–342.
[2] M. S. Arulampalam et al. “A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking”. In: IEEE Transactions on Signal Processing 50.2 (2002), pp. 174–188.
[3] O. Cappé et al. “Population Monte Carlo”. In: Journal of Computational and
Graphical Statistics 13.4 (2004), pp. 907–929.
[4] N. Chopin. “Central limit theorem for sequential Monte Carlo methods and its
application to Bayesian inference”. In: The Annals of Statistics 32.6 (Dec. 2004),
pp. 2385–2411.
[5] N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. “SMC$^2$: an efficient algorithm
for sequential analysis of state space models”. In: Journal of the Royal Statistical
Society: Series B (Statistical Methodology) 75.3 (2013), pp. 397–426.
[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley
& Sons, 1991.
[7] D. Crisan and J. Miguez. “Nested particle filters for online parameter estimation in discrete-time state-space Markov models”. In: arXiv.org (Aug. 2013).
arXiv:1308.1883v2 [stat.CO].
[8] D. Crisan and O. Obanubi. “Particle filters with random resampling times”. In:
Stochastic Processes and their Applications 122.4 (2012), pp. 1332–1368.
[9] P. Del Moral. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications. New York: Springer, 2004.
[10] P. Del Moral, A. Doucet, and A. Jasra. “On adaptive resampling strategies for
sequential Monte Carlo methods”. In: Bernoulli 18.1 (Feb. 2012), pp. 252–278.
[11] P. Del Moral, A. Doucet, and A. Jasra. “Sequential Monte Carlo samplers”. In:
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68.3
(2006), pp. 411–436.
[12] P. Del Moral and A. Guionnet. “On the stability of interacting processes with
applications to filtering and genetic algorithms”. In: Annales de l’Institut Henri
Poincaré, Probabilités et Statistiques 37.2 (2001), pp. 155–194.
[13] R. Douc et al. “Convergence of adaptive mixtures of importance sampling schemes”. In: The Annals of Statistics 35.1 (Feb. 2007), pp. 420–448.
[14] A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo Methods in Practice.
New York: Springer, 2001.
[15] A. Doucet, S. J. Godsill, and C. Andrieu. “On sequential Monte Carlo sampling methods for Bayesian filtering”. In: Statistics and Computing 10.3 (2000),
pp. 197–208.
[16] A. Doucet and A. M. Johansen. “A tutorial on particle filtering and smoothing:
fifteen years later”. In: Handbook of Nonlinear Filtering. Ed. by D. Crisan and B. Rozovsky. Cambridge: Cambridge University Press, 2010.
[17] J. Geweke. “Bayesian inference in econometric models using Monte Carlo integration”. In: Econometrica: Journal of the Econometric Society (1989), pp. 1317–1339.
[18] F. Gustafsson et al. “Particle filters for positioning, navigation, and tracking”.
In: IEEE Transactions on Signal Processing 50.2 (2002), pp. 425–437.
[19] R. Holenstein. “Particle Markov Chain Monte Carlo”. PhD thesis. Vancouver:
University of British Columbia, 2009.
[20] N. Kantas et al. “An overview of sequential Monte Carlo methods for parameter
estimation in general state-space models”. In: 15th IFAC Symposium on System
Identification. 2009, pp. 774–785.
[21] H. R. Künsch. “Particle filters”. In: Bernoulli 19.4 (Sept. 2013), pp. 1391–1403.
[22] A. Lee et al. “On the utility of graphics cards to perform massively parallel
simulation of advanced Monte Carlo methods”. In: Journal of Computational
and Graphical Statistics 19.4 (2010), pp. 769–789.
[23] J. S. Liu. “Metropolized independent sampling with comparisons to rejection
sampling and importance sampling”. In: Statistics and Computing 6.2 (1996),
pp. 113–119.
[24] J. S. Liu and R. Chen. “Blind deconvolution via sequential imputations”. In:
Journal of the American Statistical Association 90.430 (1995), pp. 567–576.
[25] R. M. Neal. “Annealed importance sampling”. In: Statistics and Computing
11.2 (2001), pp. 125–139.
[26] M.-S. Oh and J. O. Berger. “Adaptive Importance Sampling in Monte Carlo
Integration”. In: Journal of Statistical Computation and Simulation 41 (1992),
pp. 143–168.
[27] S. Thrun. “Particle Filters in Robotics”. In: UAI. 2002.
[28] C. Vergé et al. “On parallel implementation of Sequential Monte Carlo methods:
the island particle model”. In: arXiv.org (2013). arXiv:1306.3911v1 [math.PR].
[29] N. Whiteley, A. Lee, and K. Heine. “On the role of interaction in sequential Monte Carlo algorithms”. In: arXiv.org (Sept. 2013). arXiv:1309.2918v1
[stat.CO].