A discrete kernel sampling algorithm for DBNs

Theodore Charitos
Department of Information and Computing Sciences, Utrecht University
email: theodore@cs.uu.nl

Summary. Particle filtering (PF) is a powerful sampling-based inference algorithm for dynamic Bayesian networks (DBNs) with discrete state spaces. Its main principle is the recursive generation of samples (particles) that approximate the distributions of the unknowns. This generation includes a resampling step that concentrates samples, according to their relative weights, in regions of interest of the state space. We propose a more systematic approach than resampling, based on regularisation (smoothing) of the empirical distribution associated with the samples using the kernel method. Our experiments show that this algorithm leads to more accurate estimates than the standard PF.

Key words: particle filtering, discrete kernel, dynamic Bayesian networks

1 Introduction

A DBN is a graphical model that encodes a joint probability distribution on a set of stochastic variables, explicitly capturing the temporal relationships between them [Kja95], [Mur02]. We use capital letters to denote random variables and lower-case letters to denote their values; boldface capital letters denote sets of variables and boldface lower-case letters their values. Let $\mathbf{V}_n = (V_n^1, \ldots, V_n^m)$, $m \ge 2$, denote the set of variables at time step $n$. A DBN is then a tuple $(B_1, B_2)$, where $B_1$ is a Bayesian network [CDL99] that represents the prior distribution for the variables $\mathbf{V}_1$ at the first time step, and $B_2$ defines the transition model for the variables in two consecutive time steps, so that for every $n \ge 2$

$$p(\mathbf{V}_n \mid \mathbf{V}_{n-1}) = \prod_{j=1}^{m} p(V_n^j \mid \pi(V_n^j))$$

where $\pi(V_n^j)$ denotes the set of parents of $V_n^j$, for $j = 1, \ldots, m$. In most dynamical systems, we assume that the set $\mathbf{V}_n$ can be split into two mutually exclusive and collectively exhaustive sets $\mathbf{X}_n = (X_n^1, \ldots, X_n^s)$ and $\mathbf{Y}_n = (Y_n^1, \ldots, Y_n^{m-s})$, where $\mathbf{X}_n$ and $\mathbf{Y}_n$ represent the hidden and observable variables per time step, respectively. We use the term observation model to denote the probability of observing an instantiation of values $\mathbf{y}_n$ for $\mathbf{Y}_n$ given an instantiation of values $\mathbf{x}_n$ for $\mathbf{X}_n$. We also denote by $\mathbf{y}_{1:k} \triangleq \{\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_k\}$ the observations up to and including time step $k$. DBNs are usually assumed to be time invariant, which means that neither the topology nor the parameters of the network, per time step and across time steps, change over time.

Monitoring a DBN is the task of computing the probability distribution of the hidden state at time step $n$ given the observations, that is, $p(\mathbf{x}_n \mid \mathbf{y}_{1:n})$. To compute this probability distribution, Murphy [Mur02] introduced the interface algorithm, an extension of the junction-tree algorithm [CDL99]. More specifically, the interface algorithm efficiently exploits the forward interface $\mathbf{FI}_n$, which is the set of variables at time step $n$ that directly affect some variables at time step $n+1$. However, the computational complexity of the interface algorithm is exponential in the number of hidden variables, and hence exact monitoring is prohibitive for large DBNs [KL01], [Mur02]. A way to handle this problem is to use sequential Monte Carlo methods, which are easy to implement, work on almost any kind of DBN and, with a large number of samples, are guaranteed to converge to the exact answer [Dou98], [GSS93], [Kit96], [LC98].

2 Particle filtering

Assume that we are able to sample $N$ independent and identically distributed random samples $\{\mathbf{x}_n^{(i)}; i = 1, \ldots, N\}$ according to $p(\mathbf{X}_n \mid \mathbf{y}_{1:n})$. Then, an empirical estimate of this distribution is given by

$$p(\mathbf{x}_n \mid \mathbf{y}_{1:n}) \approx \frac{1}{N} \sum_{i=1}^{N} \delta_{\mathbf{x}_n^{(i)}}(d\mathbf{x}_n)$$

where $\delta(d\cdot)$ denotes the Dirac delta function. This estimate is unbiased and, by the strong law of large numbers, converges almost surely to the exact probability distribution as $N \to \infty$ [Dou98].
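As a minimal illustration of this point-mass estimate, the following sketch samples from a known toy discrete distribution (standing in for the posterior, which in practice we cannot sample from directly) and recovers it from relative frequencies; the state names and probabilities are hypothetical:

```python
import random
from collections import Counter

def empirical_estimate(sample, states):
    """Point-mass (empirical) estimate: the relative frequency of each state."""
    counts = Counter(sample)
    return {s: counts[s] / len(sample) for s in states}

# A toy discrete "posterior" that we pretend we can sample from directly.
states = ["a", "b", "c"]
probs = [0.5, 0.3, 0.2]

random.seed(0)
N = 100_000
sample = random.choices(states, weights=probs, k=N)
est = empirical_estimate(sample, states)
# For large N, est[s] lies close to the true probability of s.
```

With $N = 10^5$ samples the estimated frequencies match the true probabilities to within a couple of standard errors, illustrating the almost-sure convergence stated above.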
Typically, we cannot sample efficiently from the posterior distribution $p(\mathbf{X}_n \mid \mathbf{y}_{1:n})$, so instead we sample from a proposal or importance distribution $q(\cdot)$ and weight the samples according to

$$\omega_n^{(i)} \propto \frac{p(\mathbf{x}_n^{(i)} \mid \mathbf{x}_{n-1}^{(i)}, \mathbf{y}_n)}{q(\mathbf{x}_n^{(i)} \mid \mathbf{x}_{n-1}^{(i)}, \mathbf{y}_n)}$$

to obtain the following mass approximation of $p(\mathbf{x}_n \mid \mathbf{y}_{1:n})$:

$$p(\mathbf{x}_n \mid \mathbf{y}_{1:n}) \approx \sum_{i=1}^{N} \omega_n^{(i)} \delta_{\mathbf{x}_n^{(i)}}(d\mathbf{x}_n) \quad (1)$$

where $\omega_n^{(i)}$ is the normalised weight. The most common proposal is to sample from the prior probability distribution $p(\mathbf{X}_n \mid \mathbf{X}_{n-1})$. Although such a proposal results in higher Monte Carlo variation than the optimal proposal, because it does not incorporate the most recent observations, it is usually easier to implement [GSS93], [Kit96], [LC98]. The weights then simplify to

$$\omega_n^{(i)} \propto p(\mathbf{y}_n \mid \mathbf{x}_n^{(i)}) \quad (2)$$

For DBNs, the generation of a new sample according to the previous analysis is as follows. Initially we construct a Bayesian network on the variables $\mathbf{FI}_{n-1} \cup \mathbf{V}_n$, called the 2-TBN in [KL01], that represents the transition model $B_2$. Subsequently, we order the variables in $\mathbf{V}_n$ in a topological manner consistent with the edges in the 2-TBN, so that if $j' < j$ then $V_n^j \notin \pi(V_n^{j'})$. A value for each hidden variable in $\mathbf{X}_n$ is then sampled based on the values of its parents. An observable variable does not need to be sampled, but is instantiated to its observed value. The computation of the weights is now straightforward, based on the probability distributions of the sampled values. A pseudocode of this scheme is shown in Figure 1.

    $\omega_n^{(i)} = 1$
    $\mathbf{x}_n^{(i)}$ is empty
    for each variable $V_n^j$ in a topological order
        let $u$ be the value of $\pi(V_n^j)$ in $(\mathbf{x}_{n-1}^{(i)}, \mathbf{x}_n^{(i)})$
        if $V_n^j \in \mathbf{X}_n$
            sample $v_n^j \sim p(V_n^j \mid \pi(V_n^j) = u)$
            set $\mathbf{x}_n^{(i)} = \{\mathbf{x}_n^{(i)}, v_n^j\}$
        else
            set $v_n^j$ to the observed value of $V_n^j \in \mathbf{Y}_n$
            $\omega_n^{(i)} = \omega_n^{(i)} \times p(v_n^j \mid u)$
    return $(\mathbf{x}_n^{(i)}, \omega_n^{(i)})$

Fig. 1. Pseudocode for sampling in DBNs.
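The scheme of Figure 1 can be sketched in Python as follows, under assumed data structures: each variable maps to its parent list and a conditional probability table indexed by parent instantiations. The variable names, the topological order and the toy two-variable slice are all hypothetical:

```python
import random

def sample_step(cpts, order, hidden, prev_values, obs):
    """One sampling step per Figure 1: sample hidden variables in
    topological order, instantiate observed ones, accumulate the weight."""
    weight = 1.0
    values = dict(prev_values)  # parents may live in the previous time slice
    for var in order:
        parents, table = cpts[var]
        u = tuple(values[p] for p in parents)   # parent instantiation
        dist = table[u]                         # {value: probability}
        if var in hidden:
            vals, probs = zip(*dist.items())
            values[var] = random.choices(vals, weights=probs, k=1)[0]
        else:
            values[var] = obs[var]              # instantiate to the evidence
            weight *= dist[obs[var]]            # multiply in p(v | u)
    return {v: values[v] for v in order}, weight

# Hypothetical slice: hidden X with parent X_prev, observed Y with parent X.
cpts = {
    "X": (("X_prev",), {("a",): {"a": 0.8, "b": 0.2},
                        ("b",): {"a": 0.3, "b": 0.7}}),
    "Y": (("X",),      {("a",): {"0": 0.9, "1": 0.1},
                        ("b",): {"0": 0.2, "1": 0.8}}),
}
random.seed(1)
state, w = sample_step(cpts, order=["X", "Y"], hidden={"X"},
                       prev_values={"X_prev": "a"}, obs={"Y": "1"})
```

Note that, as in Figure 1, only the observed variables contribute factors to the weight; hidden variables are sampled from their own conditional distributions and contribute nothing.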
In the simplest case, where the observations $\mathbf{y}_n$ concern leaf variables, the above scheme computes the weights according to equation (2). In general, the observations can concern variables in arbitrary locations within the DBN. In that case, the computed weights are proportional to $\prod_j p(y_n^j \mid \pi(Y_n^j))$, where $\pi(Y_n^j)$ may contain observed values. The above scheme for sampling in DBNs therefore takes part of the observations $\mathbf{y}_n$ into account in the proposal distribution used to compute the weights $\omega_n^{(i)}$, and is hence more efficient than using just the prior distribution as the proposal for the PF.

A serious drawback of the PF as described is that the variance of the weights increases stochastically over time [Dou98], [LC98]. A way to avoid this problem is to include a resampling step that eliminates samples with low weights and multiplies samples with high weights [GSS93], [Kit96], [LC98]. After resampling, the future samples are more concentrated on domains of higher posterior probability, which entails improved estimates. A resampling scheme associates with each sample $\mathbf{x}_n^{(i)}$ a number of offspring, say $N_i \in \mathbb{N}$, such that $\sum_{i=1}^{N} N_i = N$. More formally, resampling maps the Dirac random measure $\{\mathbf{x}_n^{(i)}, \omega_n^{(i)}\}$ into an equally weighted random measure $\{\mathbf{x}_n^{(i')}, N^{-1}\}$, where the index $i'$ denotes the position of the sample $\mathbf{x}_n^{(i)}$ in the new (resampled) set of samples. Several resampling algorithms have been proposed in the literature that satisfy $E(N_i) = N \omega_n^{(i)}$, but their performance depends on the variance of the samples, $\operatorname{var}(N_i)$. Multinomial resampling [GSS93], residual resampling [LC98] and stratified resampling [Kit96] are the most common resampling algorithms; their computational complexity is $O(N)$. Under these considerations, the PF for monitoring in DBNs consists of two consecutive steps at each time step: sampling and resampling. Schematically, for $i = 1, \ldots, N$, the PF works according to

$$\{\mathbf{x}_n^{(i)}, \omega_n^{(i)}\} \longrightarrow \{\mathbf{x}_n^{(i')}, N^{-1}\} \longrightarrow \{\mathbf{x}_{n+1}^{(i)}, \omega_{n+1}^{(i)}\}$$

The success of the PF depends on whether the Dirac point-mass approximation provides an adequate representation of the posterior distribution. In the resampling step, any particular sample with a high weight will be duplicated many times, so the cloud of samples may eventually collapse to a single sample. This problem is more evident if there is no system noise, or if the observation noise has very small variance [GSS93]. More refined approaches, such as kernel smoothing [MOL01], can help surmount this problem. We develop such an approach for DBNs in the next section.

3 Smoothed particle filtering

The main idea in kernel smoothing is to replace the resampling step of the PF at time step $n$ with sampling from the smoothed probability distribution of the hidden state that is represented by $\{\mathbf{x}_n^{(i)}, \omega_n^{(i)}, i = 1, \ldots, N\}$. The reason underlying this approach is the following. The samples $\{\mathbf{x}_n^{(i)}; i = 1, \ldots, N\}$ are necessarily very sparse, which implies that in DBNs many entries in the joint posterior distribution representing the hidden state will be estimated to have probability zero, even if their probability in the exact posterior distribution is positive. If the transition model is near-deterministic, that is, if there are parts of the state space that transition to other parts only with very low probability, then parts of the space that are not represented in the samples will not be explored. This can occur if the PF has missed these parts earlier, or because misleading observations at previous time steps have rendered specific trajectories of samples unlikely. To address this concern, Koller and Lerner [KL01] propose to smooth the probability of the hidden state for each value $\mathbf{x}_n = \mathbf{x}$ as

$$p(\mathbf{x} \mid \mathbf{y}_{1:n}) = \frac{1}{Z}\Big(\sum_{i:\, \mathbf{x}_n^{(i)} = \mathbf{x}} \omega_n^{(i)} + \alpha_o\Big)$$

where $\alpha_o$ is a smoothing parameter and $Z$ is a normalising constant.
Hence, states that have probability zero receive mass $\alpha_o$ from the smoothing. The normalising constant equals $Z = \sum_{i=1}^{N} \omega_n^{(i)} + \alpha_o M$, where $M$ is the total number of states consistent with $\mathbf{y}_n$. As pointed out by the authors, and stated also in [Mur02, p. 89], the computation of $M$ is in the worst case #P-hard, which renders the scheme computationally expensive in practice. For this reason, it was already mentioned in [KL01] that alternative smoothing schemes can be used, but neither the authors nor anyone else developed this suggestion further.

We propose a smoothing scheme for DBNs in which, instead of the joint probability distribution of the hidden state, we focus on the marginal probability distributions of the variables that represent the hidden state. More precisely, from equation (1) the marginal probability of the hidden variable $X_n^j$ for each value $x_h$ is

$$p_h = p(X_n^j = x_h \mid \mathbf{y}_{1:n}) = \sum_{i:\, X_n^{j(i)} = x_h} \omega_n^{(i)} \quad (3)$$

As we already argued, the PF can erroneously estimate $p_h$ to be zero or very small. An additional reason for this arises when $X_n^j$ is a multi-valued variable, since in this case the sampling algorithm in Figure 1 may miss certain values of $X_n^j$. To avoid this problem, we apply discrete kernel methods to smooth $p_h$. Discrete kernel methods have been widely used in the statistical analysis of categorical data for estimating probability distributions defined on a multivariate space [AMD79], [Sim95]. The principle of such methods is to smooth the probability of a categorical variable for a specific value by "borrowing" information from neighbouring values. Suppose that the hidden variable $X_n^j$ has $K$ values, where the probability of each value $h = 1, \ldots, K$ is given by equation (3). Then, the function

$$z = \frac{N}{K-1} \sum_{h=1}^{K} \frac{(p_h - 1/K)^2}{1/K}$$

denotes Pearson's $\chi^2$ test statistic for the hypothesis that all categories are equiprobable, standardised by the degrees of freedom $K - 1$.
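A short sketch of the marginal estimate (3) and the standardised Pearson statistic $z$ for a single hidden variable; the particle values and weights below are hypothetical:

```python
def marginal(values, weights, K):
    """Weighted marginal p_h of one hidden variable, per equation (3).
    values[i] in {0, ..., K-1} is the variable's value in particle i;
    weights are the normalised particle weights."""
    p = [0.0] * K
    for v, w in zip(values, weights):
        p[v] += w
    return p

def pearson_z(p, N):
    """Pearson chi-square statistic against the uniform distribution,
    standardised by the K - 1 degrees of freedom."""
    K = len(p)
    return (N / (K - 1)) * sum((ph - 1 / K) ** 2 / (1 / K) for ph in p)

# Toy example: N = 4 particles over a K = 3 valued variable; value 2 unseen,
# so its marginal is (erroneously) estimated to be zero.
values = [0, 0, 1, 0]
weights = [0.4, 0.3, 0.2, 0.1]
p = marginal(values, weights, K=3)
z = pearson_z(p, N=4)
```

Here the marginal comes out as $[0.8,\,0.2,\,0.0]$: the third value receives no mass, which is exactly the situation the smoothing of the next paragraph addresses.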
A method of smoothing $p_h$ is the kernel method, which gives

$$\tilde{p}_h = \sum_{\ell=1}^{K} p_\ell \, W(h, \ell; \lambda) \quad (4)$$

where

$$W(h, \ell; \lambda) = \begin{cases} \lambda & \text{if } \ell = h \\ (1-\lambda)/(K-1) & \text{if } \ell \neq h \end{cases}$$

and $\lambda = (N + \alpha)/(N + \alpha K)$ with

$$\alpha = \begin{cases} z^{-1} & \text{if } z \ge 1 \\ 1 & \text{if } z < 1 \end{cases}$$

The smoothing parameter $\alpha$ plays the role of placing some mass on a value $h$ that may have probability zero. An alternative formulation of equation (4) is as a convex combination of $p_h$ and the uniform estimate $1/K$, that is,

$$\tilde{p}_h = (1 - \epsilon)\, p_h + \epsilon/K$$

where $\epsilon = \alpha K/(N + \alpha K)$. The magnitude of $\alpha$ determines a trade-off between bias and variance: a smaller value of $\alpha$ leads to a less biased, but higher-variance, smoothed estimator, while a larger value of $\alpha$ leads to a smaller-variance, but more biased, smoothed estimator [Sim95]. Several alternative definitions exist for the smoothing parameter $\alpha$, all functions of $z$; we refer the interested reader to [AMD79] for more details. Furthermore, more aggressive smoothing strategies exist if there is a natural ordering to the categories of the variable $X_n^j$. In this case, the kernel framework may require that the weights $W(h, \ell; \lambda)$ decrease smoothly as $|h - \ell|$ increases [Sim95]. There is thus a variety of choices for the kernel function $W$ and the smoothing parameter $\alpha$ that can be used in a given application and for a given variable.

To create a sample that will be propagated to the next time step, we need to focus only on the hidden variables at time step $n$ that belong to the forward interface $\mathbf{FI}_n$. This is because every variable $X_n^j \in \mathbf{FI}_n$ belongs to at least one set of parents $\pi(V_{n+1}^{j'})$ of a variable $V_{n+1}^{j'}$, and hence a value $x_n^j$ needs to be assigned to it in the sampling algorithm in Figure 1. This can be done easily by generating a value $x_n^{j(i')}$ for sample $i'$ from the established smoothed distribution of $X_n^j$, denoted $\tilde{p}(X_n^j \mid \mathbf{y}_{1:n})$. As a result, the smoothed particle filtering (SPF) algorithm for monitoring in DBNs consists of two consecutive steps at each time step: sampling and smoothing. Schematically, for $i = 1, \ldots, N$, the SPF works according to

$$\{\mathbf{x}_n^{(i)}, \omega_n^{(i)}\} \longrightarrow \{x_n^{j(i')} : X_n^j \in \mathbf{FI}_n,\ \tilde{p}(X_n^j \mid \mathbf{y}_{1:n})\} \longrightarrow \{\mathbf{x}_{n+1}^{(i)}, \omega_{n+1}^{(i)}\}$$

This version of the SPF performs $O(|\mathbf{FI}_n|)$ smoothing operations per time step, where $|\mathbf{FI}_n|$ denotes the size of the forward interface. For DBNs with many variables in the forward interface, the SPF can be sped up by using a threshold criterion, such as the size $K$, to determine which variables' probability distributions need to be smoothed. The advantage of the SPF over the PF is that it spreads some of the probability mass over unobserved states, thereby increasing the amount of exploration of unfamiliar regions of the space.

4 Experimental results

To study the performance of the SPF, we performed experiments on the Mildew model (Figure 2) [Kja95].

Fig. 2. The Mildew model for forecasting the extension of the mildew fungus and the gross yield for three time steps; clear variables are hidden, shaded variables are observable.

The Mildew model is designed for forecasting the extension of the mildew fungus and the gross yield from a field of wheat. It has nine variables per time step, of which we assumed six to be hidden (Fungicide, Mildew, Micro climate, Solar energy, Leaf Area Index and Dry matter) and three to be observable (Precipitation, Temperature and Photo-synthesis). We randomly created a transition and an observation model for the network, assuming that every variable can take 4 values, and subsequently generated an observation sequence. Our goal was to compare the results at each time step given by the PF and the SPF with the correct distributions computed using exact inference, which is feasible for a model of this size. We used the $L_1$-norm to compute the average error on the marginal probability distributions of all the hidden variables.
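Putting equations (3) and (4) of Section 3 together, the SPF smoothing step for one forward-interface variable can be sketched as follows, using the $\alpha = z^{-1}$ rule and the convex-combination form; the marginal below is hypothetical:

```python
import random

def smooth_marginal(p, N):
    """Discrete kernel smoothing of a marginal, written as the convex
    combination of p_h and the uniform estimate 1/K (equation (4))."""
    K = len(p)
    # Standardised Pearson statistic against the uniform distribution.
    z = (N / (K - 1)) * sum((ph - 1 / K) ** 2 / (1 / K) for ph in p)
    alpha = 1 / z if z >= 1 else 1.0
    eps = alpha * K / (N + alpha * K)
    return [(1 - eps) * ph + eps / K for ph in p]

# Hypothetical weighted marginal over K = 3 values; value 2 got no samples.
p = [0.8, 0.2, 0.0]
sp = smooth_marginal(p, N=4)
# Every value now has positive mass, and the result still sums to one.

# New particle values for the next time step are drawn from the
# smoothed distribution:
random.seed(2)
new_vals = random.choices(range(3), weights=sp, k=4)
```

Because the smoothing is a convex combination with the uniform distribution, the previously zero-probability value receives mass $\epsilon/K > 0$, so trajectories through it can again be explored at the next time step.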
Figure 3a shows the error as a function of the number of samples, where we report the average error over the entire run. We observe that the error drops sharply at first, and that the improvement becomes smaller and smaller as the number of samples increases. Note that the drop-off occurs at around 500 samples, which is much less than the total number of hidden states ($4^6$). We can conclude from this figure that, for a given computational cost, the SPF does better than the PF. The gain can be much higher if the transition and/or the observation model of the network is near-deterministic. Figure 3b shows the behaviour of the error over the entire run with 350 samples per time step. We observe that the error changes considerably over the sequence. The spikes in the error correspond to unlikely evidence, in which case the samples become less reliable. The error of the SPF is significantly smaller than that of the PF: even in the presence of unlikely evidence, smoothing the probability distributions of the hidden variables leads to improved estimates.

Fig. 3. Comparison between PF and SPF: (a) average $L_1$ error on the marginals as a function of the number of samples; (b) average $L_1$ error on the marginals over the sequence (time steps 1-25).

5 Conclusions and extensions

We have proposed a sequential importance sampling algorithm, called SPF, to perform inference in large DBNs with a discrete state space. Our algorithm extends the standard PF in the sense that it replaces the resampling step with a smoothing step. In other words, the SPF smooths the probability distributions of the hidden variables as estimated by the samples, and then generates from these smoothed probability distributions a new set of samples that is propagated to the next time step [MOL01].
We showed that the smoothing step can be done efficiently using discrete kernel methods [AMD79], [Sim95], which have the effect of placing some mass on hidden states that have erroneously been estimated to have probability zero.

We can also combine the SPF with the Rao-Blackwellised PF (RBPF) algorithm [Dou00]. The basic idea of the RBPF is to exploit the structure of the DBN to increase the efficiency of the PF: the RBPF marginalises out some variables by applying exact inference algorithms, and hence samples only from a subset $\mathbf{X}'_n$ of the hidden variables. The main advantage of this strategy is that it can drastically reduce the size of the state space over which we have to sample, leading to better performance. For a given accuracy, we would need fewer samples using the RBPF than using the PF, since we sample from a lower-dimensional distribution. We can thus smooth the probability distributions of the hidden variables in $\mathbf{X}'_n$ that belong to the forward interface $\mathbf{FI}_n$. We believe that a combination of the SPF with the RBPF provides a powerful and efficient algorithm for monitoring large DBNs.

Acknowledgements

This research was supported by the Netherlands Organisation for Scientific Research (NWO).

References

[AMD79] C. Aitken and D.G. MacDonald (1979). An application of discrete kernel methods to forensic odontology. Applied Statistics, 28(1): 55-61.
[CDL99] R.G. Cowell, A.P. Dawid, S.L. Lauritzen and D.J. Spiegelhalter (1999). Probabilistic Networks and Expert Systems. Springer-Verlag, New York.
[Dou98] A. Doucet (1998). On sequential simulation-based methods for Bayesian filtering. Technical report CUED/F-INFENG/TR 310, Department of Engineering, Cambridge University.
[Dou00] A. Doucet, N. de Freitas, K. Murphy and S. Russell (2000). Rao-Blackwellised particle filtering for dynamic Bayesian networks. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, pp. 176-183.
[GSS93] N.J. Gordon, D.J. Salmond and A.F.M. Smith (1993). Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proceedings F, 140(2): 107-113.
[Kit96] G. Kitagawa (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5: 1-25.
[Kja95] U. Kjaerulff (1995). dHugin: a computational system for dynamic time-sliced Bayesian networks. International Journal of Forecasting, 11: 89-111.
[KL01] D. Koller and U. Lerner (2001). Sampling in factored dynamic systems. In A. Doucet, N. de Freitas and N. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York.
[LC98] J.S. Liu and R. Chen (1998). Sequential Monte Carlo methods for dynamic systems. Journal of the American Statistical Association, 93: 1032-1044.
[Mur02] K.P. Murphy (2002). Dynamic Bayesian Networks: Representation, Inference and Learning. Ph.D. dissertation, University of California, Berkeley.
[MOL01] C. Musso, N. Oudjane and F. Le Gland (2001). Improving regularised particle filters. In A. Doucet, N. de Freitas and N. Gordon, editors, Sequential Monte Carlo Methods in Practice, Springer-Verlag, New York.
[Sim95] J.S. Simonoff (1995). Smoothing categorical data. Journal of Statistical Planning and Inference, 47: 41-69.