A discrete kernel sampling algorithm for DBNs

Theodore Charitos
Department of Information and Computing Sciences, Utrecht University
email: theodore@cs.uu.nl
Summary. Particle filtering (PF) is a powerful sampling-based inference algorithm
for dynamic Bayesian networks (DBNs) with discrete-state spaces. In its operation,
the main principle is a recursive generation of samples (particles) which approximate
the distributions of the unknowns. This generation of samples includes a resampling
step that concentrates samples according to their relative weight in regions of interest
of the state-space. We propose a more systematic approach than resampling based on
regularisation (smoothing) of the empirical distribution associated with the samples,
using the kernel method. We show in our experiments that our algorithm leads to
more accurate estimates than the standard PF.
Key words: particle filtering, discrete kernel, dynamic Bayesian networks
1 Introduction
A DBN is a graphical model that encodes a joint probability distribution on a
set of stochastic variables, explicitly capturing the temporal relationships between
them [Kja95], [Mur02]. We use capital letters to denote random variables and lower
case to denote values. Boldface capital letters denote sets and lower case their values
respectively. Let Vn = (Vn1 , . . . , Vnm ), m ≥ 2, denote the set of variables at time step
n. Then, a DBN is a tuple (B1 , B2 ), where B1 is a Bayesian network [CDL99] that
represents the prior distribution for the variables at the first time step V1 , and B2
defines the transition model for the variables in two consecutive time steps, so that
for every n ≥ 2
p(V_n | V_{n−1}) = ∏_{j=1}^{m} p(V_n^j | π(V_n^j))
where π(V_n^j) denotes the set of parents of V_n^j, for j = 1, . . . , m. In most dynamical
systems, we assume that the set V_n can be split into two mutually exclusive and
collectively exhaustive sets X_n = (X_n^1, . . . , X_n^s) and Y_n = (Y_n^1, . . . , Y_n^{m−s}), where X_n
and Y_n represent the hidden and observable variables per time step respectively. We
use the term observation model to denote the probability to observe an instantiation
of values yn for Yn given an instantiation of values xn for Xn . We also denote by
y_{1:k} ≜ {y_1, y_2, . . . , y_k} the observations up to and including time step k.
DBNs are usually assumed to be time invariant, which means that the topology
and the parameters of the network per time step and across time steps do not change.
Monitoring a DBN is the task of computing the probability distribution of the hidden state at time step n given the observations, that is, p(xn | y1:n ). To compute
this probability distribution, Murphy [Mur02] introduced the interface algorithm,
which is an extension of the junction-tree algorithm [CDL99]. More specifically, the
interface algorithm exploits efficiently the forward interface FIn , which is the set
of variables at time step n that affect some variables at time step n + 1 directly.
However, the computational complexity of the interface algorithm is exponential
in the number of hidden variables and hence exact monitoring is prohibitive for
large DBNs [KL01], [Mur02]. A way to handle these problems is to use sequential Monte Carlo methods, which are easy to implement, work on almost any kind
of DBN, and with a large number of samples are guaranteed to provide the exact
answer [Dou98], [GSS93], [Kit96], [LC98].
2 Particle filtering
Let us assume that we are able to sample N independent and identically distributed
random samples {x_n^(i); i = 1, . . . , N} according to p(X_n | y_{1:n}). Then, an empirical
estimate of this distribution is given by

p(x_n | y_{1:n}) ≈ (1/N) ∑_{i=1}^{N} δ_{x_n^(i)}(dx_n)
where δ(d·) denotes the Dirac delta function. This estimate is unbiased, and from
the strong law of large numbers converges almost surely to the exact probability
distribution as N → ∞ [Dou98]. Typically, we cannot sample efficiently from the
posterior distribution p(Xn | y1:n ), so instead we sample from a proposal or importance distribution q(x) and weight the samples according to
ω_n^(i) ∝ p(x_n^(i) | x_{n−1}^(i), y_n) / q(x_n^(i) | x_{n−1}^(i), y_n)
to obtain the following mass approximation of p(x_n | y_{1:n})

p(x_n | y_{1:n}) ≈ ∑_{i=1}^{N} ω_n^(i) δ_{x_n^(i)}(dx_n)    (1)

where ω_n^(i) is the normalised weight. The most common proposal is to sample from
the prior probability distribution p(Xn | Xn−1 ). Although such a proposal results
in higher Monte Carlo variation than the optimal proposal as a result of it not
incorporating the most recent observations, it is usually easier to implement [GSS93],
[Kit96], [LC98]. The weights now simplify to
ω_n^(i) ∝ p(y_n | x_n^(i))    (2)
For DBNs the generation of a new sample according to the previous analysis is as
follows. Initially we construct a Bayesian network on the variables FI_{n−1} ∪ V_n,
called 2-TBN in [KL01], that represents the transition model B_2. Subsequently, we
order the variables in V_n in a topological manner consistent with the edges in the
2-TBN, so that if j′ < j then V_n^{j′} ∉ π(V_n^j). A value for each hidden variable
in X_n is now sampled based on the values of its parents. For an observable variable, we
do not need to sample it but instantiate it to its observed value. The computation
of the weights is then straightforward based on the probability distributions of the
sampled values. A pseudocode of this scheme is shown in Figure 1. In the simplest
case where the observations y_n concern leaf variables, the above scheme computes
the weights according to equation (2). In general, the observations can concern
variables in arbitrary locations within the DBN. In that case, the weights computed
are proportional to ∏_j p(y_n^j | π(y_n^j)), where π(y_n^j) may contain observed values.
Therefore, the above scheme for sampling in DBNs takes into account part of the
observations y_n in the proposal distribution to compute the weights ω_n^(i), and is
hence more efficient than using just the prior distribution as the proposal for the
PF.

    ω_n^(i) = 1
    x_n^(i) is empty
    for each variable j in a topological order
        let u be the value of π(V_n^j) in (x_{n−1}^(i), x_n^(i))
        if V_n^j ∈ X_n
            sample v_n^j ∼ p(V_n^j | π(V_n^j))
            set x_n^(i) = {x_n^(i), v_n^j}
        else
            set v_n^j to be the value of V_n^j ∈ Y_n
            ω_n^(i) = ω_n^(i) × p(v_n^j | u)
    return (x_n^(i), ω_n^(i))

Fig. 1. Pseudocode for sampling in DBNs.
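As a concrete illustration, the sampling pass of Figure 1 for a single particle might be sketched in Python as follows. The network representation (a topological order of variables, parent tuples, and conditional probability tables keyed by parent values) is a hypothetical data structure chosen for this sketch, not part of the original algorithm.

```python
import random

def sample_step(prev_state, topo_order, parents, cpt, hidden, evidence):
    """One sampling pass over the 2-TBN for a single particle.

    prev_state: dict of forward-interface variables at step n-1 to values
    topo_order: variables at step n in a topological order
    parents:    dict mapping each variable to a tuple of its parents
    cpt:        dict mapping (variable, parent_values) to {value: prob}
    hidden:     set of hidden variables X_n
    evidence:   dict mapping observed variables Y_n to observed values
    """
    weight = 1.0
    state = dict(prev_state)  # parents may live in the previous time slice
    for v in topo_order:
        u = tuple(state[p] for p in parents[v])
        dist = cpt[(v, u)]
        if v in hidden:
            # sample v ~ p(V | parents)
            values, probs = zip(*dist.items())
            state[v] = random.choices(values, weights=probs)[0]
        else:
            # instantiate to the observed value and update the weight
            state[v] = evidence[v]
            weight *= dist[evidence[v]]
    return state, weight
```

On a tiny 2-TBN with one hidden variable X (parent: its previous-slice copy) and one observed leaf Y (parent: X), the returned weight is exactly p(y_n | x_n), matching equation (2).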
A serious drawback of the PF as described is that the
variance of the weights increases stochastically over time [Dou98], [LC98]. A way to
avoid this problem is to include a resampling step in order to eliminate samples with
low weights and multiply samples with high weights [GSS93], [Kit96], [LC98]. After
resampling, the future samples are more concentrated on domains of higher posterior
probability, which entails improved estimates. A resampling scheme associates to
each sample x_n^(i) a number of offspring, say N_i ∈ ℕ, such that ∑_{i=1}^{N} N_i = N. More
formally, resampling involves mapping the Dirac random measure {x_n^(i), ω_n^(i)} into
an equally weighted random measure {x_n^(i′), N^{−1}}, where the index i′ denotes the
position of the sample x_n^(i) in the new (resampled) set of samples. Several resampling
algorithms have been proposed in the literature that satisfy E(N_i) = N ω_n^(i), but
their performance depends on the variance of the samples, var(N_i). Multinomial
resampling [GSS93], residual resampling [LC98] and stratified resampling [Kit96]
are the most common resampling algorithms, whose computational complexity is
O(N).
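The mapping from the weighted measure onto N equally weighted copies can be illustrated with two of these schemes; this is a minimal sketch with function names of our choosing, not the exact implementations of the cited papers.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def multinomial_resample(particles, weights):
    """Draw N iid indices from the categorical distribution of the weights."""
    return random.choices(particles, weights=weights, k=len(particles))

def stratified_resample(particles, weights):
    """Draw one index per stratum [i/N, (i+1)/N), which reduces var(N_i)."""
    n = len(particles)
    total = sum(weights)
    cum = list(accumulate(w / total for w in weights))
    out = []
    for i in range(n):
        u = (i + random.random()) / n           # one uniform per stratum
        out.append(particles[min(bisect_left(cum, u), n - 1)])
    return out
```

Both run in O(N); stratified resampling satisfies the same E(N_i) = N ω_n^(i) condition while giving a lower variance of the offspring counts than multinomial resampling.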
Under these considerations, the PF for monitoring in DBNs consists of two consecutive steps at each time step: sampling and resampling. Schematically, for i = 1, . . . , N, the PF works according to

{x_n^(i), ω_n^(i)} −→ {x_n^(i′), N^{−1}} −→ {x_{n+1}^(i), ω_{n+1}^(i)}
The success of the PF depends on whether the Dirac-point mass approximation
provides an adequate representation of the posterior distribution. In the resampling
step, any particular sample with a high weight will be duplicated many times. As a
result, the cloud of samples may eventually collapse to a single sample. This problem
is more evident if there is no system noise or the observation noise has very small
variance [GSS93]. More refined approaches such as kernel smoothing [MOL01] can
help surmount this problem. We develop such an approach for DBNs in the next
section.
3 Smoothed particle filtering
The main idea in kernel smoothing is to replace the resampling step in the PF at time
step n with sampling from the smoothed probability distribution of the hidden state
that is represented in {x_n^(i), ω_n^(i); i = 1, . . . , N}. The reason underlying this approach
is the following. The samples {x_n^(i); i = 1, . . . , N} are necessarily very sparse, which
implies that in DBNs many entries in the joint posterior distribution representing the
hidden state will be estimated to have probability zero, even if their probability in the
exact posterior distribution is positive. If the transition model is near-deterministic,
that is, there are parts of the state space that only transition to other parts with
very low probability, parts of the space that are not represented in the samples will
not be explored. This can occur if the PF has missed these parts earlier, or because
misleading observations at previous time steps have rendered specific trajectories of
samples unlikely.
To address this concern, Koller and Lerner [KL01] propose to smooth the probability of the hidden state for each value x_n = x as

p(x | y_{1:n}) = (1/Z) ( ∑_{i: x_n^(i) = x} ω_n^(i) + α_o )
where α_o is a smoothing parameter and Z is a normalising constant. Hence for states
that have probability zero the smoothing gives them mass α_o. The normalising
constant equals Z = ∑_{i=1}^{N} ω_n^(i) + α_o M, where M is the total number of states
consistent with y_n. As pointed out by the authors and stated also in [Mur02, pp.
89], the computation of M is in the worst case #P-hard, rendering the scheme
computationally expensive in practice. For this reason, it was already mentioned
in [KL01] that alternative smoothing schemes can be used, but neither the authors
nor anyone else developed further this suggestion.
We propose a smoothing scheme for DBNs where instead of the joint probability
distribution of the hidden state we focus on the marginal probability distribution of
the variables that represent the hidden state. More precisely, from equation (1) the
marginal probability of the hidden variable X_n^j for each value x_h is

p_h = p(X_n^j = x_h | y_{1:n}) = ∑_{i: X_n^{j(i)} = x_h} ω_n^(i)    (3)
As we already argued, the PF can erroneously estimate p_h to be zero or very small.
An additional reason for this arises when X_n^j is a multi-valued variable, since in this
case it is possible that the sampling algorithm in Figure 1 may miss certain values of
X_n^j. To avoid this problem, we apply discrete kernel methods to smooth p_h. Discrete
kernel methods have been widely used in statistical analysis of categorical data for
estimating the probability distribution defined in a multivariate space [AMD79],
[Sim95]. The principle of such methods is to smooth the probability of a categorical
variable for a specific value by "borrowing" information from neighbouring values.
Suppose that the hidden variable X_n^j has K values, where the probability of
each value h = 1, . . . , K, is given from equation (3). Then, the function

z = (N / (K − 1)) ∑_{h=1}^{K} (p_h − 1/K)² / (1/K)

denotes the χ² Pearson test statistic for the hypothesis that all the categories are equiprobable, standardised by the degrees of freedom K − 1. A method of smoothing p_h is
via the kernel method, which gives

p̃_h = ∑_{ℓ=1}^{K} p_ℓ W_ℓ(h, λ)    (4)

where

W_ℓ(h, λ) = λ if ℓ = h, and (1 − λ)/(K − 1) if ℓ ≠ h,

and λ = (N + α)/(N + αK) with α = z^{−1} if z ≥ 1, and α = 1 if z < 1. The smoothing parameter
α plays the role of placing some mass in value h that may have probability zero.
An alternative formulation of equation (4) is a convex combination of p_h and the
uniform estimate 1/K, that is

p̃_h = (1 − ε) p_h + ε/K

where ε = αK/(N + αK). The magnitude of α determines a trade-off between bias
and variance, since a smaller value of α leads to a less biased, but with higher variance, smoothed estimator, while a larger value of α leads to a smaller variance,
but biased, smoothed estimator [Sim95]. There exist several alternative definitions
for the smoothing parameter α that are all functions of z, and we refer the interested reader to [AMD79] for more details. Furthermore, there exist more aggressive
smoothing strategies if there is a natural ordering to the categories of the variable
Xnj . In this case, the kernel framework may require that the weights W (h, λ) decrease smoothly as |h − | increases [Sim95]. There is thus a variety of choices for
the kernel function W and the smoothing parameter α that can be used in a given
application and for a given variable.
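The smoothing rule of equation (4), in its convex-combination form, can be sketched as follows; the choice α = z⁻¹ for z ≥ 1 and α = 1 otherwise follows the definitions above, while the function and variable names are ours.

```python
def smooth_marginal(p, n):
    """Kernel-smooth a marginal distribution p (length K) from n samples.

    Computes the standardised Pearson statistic z for equiprobability,
    sets alpha = 1/z if z >= 1 else alpha = 1, and returns the convex
    combination (1 - eps) * p_h + eps / K with eps = alpha*K/(n + alpha*K).
    """
    k = len(p)
    # standardised Pearson chi-square statistic for equiprobability;
    # (p_h - 1/K)^2 / (1/K) == K * (p_h - 1/K)^2
    z = (n / (k - 1)) * sum(k * (ph - 1 / k) ** 2 for ph in p)
    alpha = 1 / z if z >= 1 else 1.0
    eps = alpha * k / (n + alpha * k)
    return [(1 - eps) * ph + eps / k for ph in p]
```

A far-from-uniform marginal yields a large z, hence a small α and only light smoothing, while a near-uniform marginal gets α = 1; in both cases every zero entry receives mass ε/K and the result still sums to one.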
Fig. 2. The Mildew model for forecasting the extension of the mildew fungus and
the gross yield for three time steps; clear variables are hidden, shaded variables are
observable.

To create a sample that will be propagated to the next time step, we need
to focus only on the hidden variables at time step n that belong to the forward
interface FI_n. That is because every variable X_n^j ∈ FI_n belongs to at least one set
of parents π(V_{n+1}^{j′}) of a variable V_{n+1}^{j′}, and hence a value x_n^j needs to be assigned to
it in the sampling algorithm in Figure 1. This can be done easily by generating a
value x_n^{j(i′)} for sample i′ from the established smoothed distribution of X_n^j, denoted
as p̃(X_n^j | y_{1:n}). As a result, the smoothed particle filtering (SPF) algorithm for
monitoring in DBNs consists of two consecutive steps at each time step: sampling
and smoothing. Schematically, for i = 1, . . . , N, the SPF works according to

{x_n^(i), ω_n^(i)} −→ {x_n^{j(i′)} : X_n^j ∈ FI_n, p̃(X_n^j | y_{1:n})} −→ {x_{n+1}^(i), ω_{n+1}^(i)}
This version of the SPF performs O(|FIn |) smoothing operations per time step,
where |FIn | denotes the size of the forward interface. For DBNs with many variables
in the forward interface, speeding up the SPF can be done using a threshold criterion,
such as the size K, to determine for which variables their probability distribution
needs to be smoothed. The advantage of the SPF over the PF is that it serves to
spread out some of the probability mass over unobserved states, increasing thus the
amount of exploration done for unfamiliar regions of the space.
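A single SPF iteration over the forward-interface variables could be sketched as follows; the kernel-smoothing helper is repeated here so the sketch is self-contained, and all names and the particle representation (dicts from variable to value) are ours. As described above, each forward-interface variable is drawn independently from its own smoothed marginal.

```python
import random

def kernel_smooth(p, n):
    """Discrete kernel smoothing of a marginal p estimated from n samples."""
    k = len(p)
    z = (n / (k - 1)) * sum(k * (ph - 1 / k) ** 2 for ph in p)
    alpha = 1 / z if z >= 1 else 1.0
    eps = alpha * k / (n + alpha * k)
    return [(1 - eps) * ph + eps / k for ph in p]

def spf_step(particles, weights, fi_vars, domains):
    """Replace resampling: smooth each forward-interface marginal
    (equation (3)), then draw new forward-interface values from the
    smoothed distributions, one draw per particle and variable."""
    n = len(particles)
    total = sum(weights)
    new_particles = [dict() for _ in range(n)]
    for v in fi_vars:
        # weighted marginal of variable v over its domain
        marginal = [sum(w for x, w in zip(particles, weights)
                        if x[v] == val) / total
                    for val in domains[v]]
        smoothed = kernel_smooth(marginal, n)
        draws = random.choices(domains[v], weights=smoothed, k=n)
        for i in range(n):
            new_particles[i][v] = draws[i]
    return new_particles
```

This performs one smoothing operation per forward-interface variable, i.e. O(|FI_n|) smoothings per time step, and because every smoothed marginal is strictly positive, values that had weight zero in the sample set can reappear in the new particles.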
4 Experimental results
To study the performance of the SPF we performed experiments on the Mildew model
(Figure 2) [Kja95]. The Mildew model is designed for forecasting the extension of
the mildew fungus and the gross yield from a field of wheat. It has nine variables per
time step, where we assumed that six of them are hidden (Fungicide, Mildew, Micro
climate, Solar energy, Leaf Area Index and Dry matter), and three are observable
(Precipitation, Temperature and Photo-synthesis).
We randomly created a transition and an observation model for the network
where we assume that every variable could take 4 values, and subsequently generated
an observation sequence. Our goal was to compare the results at each time step
given by the PF and the SPF with the correct distributions computed using exact
inference, which is feasible for a model of this size. We used the L1 -norm to compute
the average error on the marginal probability distributions of all the hidden variables.
Figure 3a shows the error as a function of the number of samples, where we report the
average error over the entire run. We observe that the error drops immediately at first
and then the improvement becomes smaller and smaller as we increase the number
of samples. Note that the drop-off occurs at around 500 samples, which is much less
than the total number of hidden states (4^6). We can conclude from this figure that
for a particular computational cost, the SPF does better than the PF. This gain can
be much higher if the transition and/or the observation model of the network are
near-deterministic. Figure 3b shows the behaviour of the error over the entire run
with 350 samples per time step. We observe that the error changes considerably over
the sequence. The spikes in the error correspond to unlikely evidence, in which case
the samples become less reliable. We notice that the error in the SPF is significantly
smaller than in the PF. Even when there is unlikely evidence, the effect of smoothing
the probability distributions of the hidden variables does lead to improved estimates.
Fig. 3. Comparison between PF and SPF: (a) error as a function of samples; (b) average L1-error over a sequence. [Two plots of the average L1 error on marginals for PF and SPF.]
5 Conclusions and extensions
We have proposed a sequential importance sampling algorithm, called SPF, to perform inference in large DBNs with discrete state spaces. Our algorithm constitutes
an extension of the standard PF in the sense that it replaces the resampling step with
a smoothing step. In other words, the SPF smoothes the probability distribution
of the hidden variables as estimated by the samples, and then generates a new set
of samples from these smoothed probability distributions that is propagated to the
next time step [MOL01]. We showed that the smoothing step can be done efficiently
using discrete kernel methods [AMD79], [Sim95], that have the effect of placing some
mass in hidden states that have erroneously been estimated to have probability zero.
We can also combine the SPF with the Rao-Blackwellised PF (RBPF) algorithm [Dou00]. The basic idea of the RBPF is to exploit the structure of the DBN
to increase the efficiency of the PF. That is, the RBPF allows us to marginalise out
some variables by applying exact inference algorithms, and hence sample only from a
subset X̄_n ⊆ X_n of the hidden variables. The main advantage of this strategy is that it can
drastically reduce the size of the state space over which we have to sample, leading
thus to better performance. For a given accuracy, we would need fewer samples using
the RBPF rather than using the PF, since we sample from a lower-dimensional distribution. We can thus smooth the probability distributions of the hidden variables
in X̄_n that belong to the forward interface FI_n. We believe that a combination of
the SPF with the RBPF provides a powerful and efficient algorithm for monitoring
large DBNs.
Acknowledgements
This research was supported by the Netherlands Organisation for Scientific Research
(NWO).
References
[AMD79] C. Aitken and D.G. MacDonald (1979). An application of discrete kernel methods to forensic odontology. Applied Statistics, 28(1): 55–61.
[CDL99] R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York.
[Dou98] A. Doucet (1998). On sequential simulation-based methods for Bayesian filtering. Technical report CUED/F-INFENG/TR 310, Department of Engineering, Cambridge University.
[Dou00] A. Doucet, N. de Freitas, K. Murphy and S. Russell (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 176–183.
[GSS93] N.J. Gordon, D.J. Salmond and A.F.M. Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2): 107–113.
[Kit96] G. Kitagawa (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5: 1–25.
[Kja95] U. Kjaerulff (1995). dHugin: a computational system for dynamic time-sliced Bayesian networks. International Journal of Forecasting, 11: 89–111.
[KL01] D. Koller and U. Lerner (2001). Sampling in factored dynamic systems. In A. Doucet, N. de Freitas and N. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York.
[LC98] J.S. Liu and R. Chen (1998). Sequential Monte Carlo methods for dynamical systems. Journal of the American Statistical Association, 93: 1032–1044.
[Mur02] K.P. Murphy (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, University of California, Berkeley.
[MOL01] C. Musso, N. Oudjane and F. Le Gland (2001). Improving regularised particle filters. In A. Doucet, N. de Freitas and N. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York.
[Sim95] J.S. Simonoff (1995). Smoothing categorical data. Journal of Statistical Planning and Inference, 47: 41–69.