Probabilistic Graphical Models
Chapter 12: Particle-Based Approximate Inference
Uri Meir, Dafna Sadeh

Particle-Based Approximate Inference
Particle-based methods approximate the joint distribution by a set of instantiations of all (or some of) the variables in the network. These instantiations, often called particles, are designed to provide a good representation of the overall probability distribution.

The general framework for most of this lecture: consider some distribution P(X), and assume we want to estimate the probability of some event Y = y relative to P, for some Y ⊆ X and y ∈ Val(Y). More generally, we might want to estimate the expectation of some function f relative to P. We approximate this expectation by generating a set of M particles, evaluating the function on each particle, and aggregating the results.

For example, let x[1], ..., x[M] be sampled IID from P. If P(x = 1) = p, the estimator for p is

\hat{p}_D = \frac{1}{M} \sum_{m=1}^{M} x[m].

More generally, for any distribution P, event y and function f:

\hat{P}_D(y) = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{x[m] = y\}, \qquad \hat{E}_D(f) = \frac{1}{M} \sum_{m=1}^{M} f(x[m]).

Forward Sampling
Input: B, a Bayesian network over X. Output: ξ = (x_1, ..., x_n), a sample of X from B. The variables are sampled one at a time in a topological order, each X_i from P(X_i | pa_i) using the already-sampled values of its parents.

Forward Sampling – Example
In the student network: i = 1: sample D; assume D = d^1. i = 2: sample I; assume I = i^0. i = 3: sample G from P(G | i^0, d^1). And continue until all variables are assigned.

Forward Sampling – The Estimates
Given D = {ξ[1], ..., ξ[M]}, a set of particles (samples), and f, a function over X:

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(\xi[m]) is an estimate for E_P[f], and

\hat{P}_D(y) = \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{\xi[m]\langle Y\rangle = y\} is an estimate for P(y).

Forward Sampling – Complexity
Let M be the total number of particles generated and n = |X|. Each particle requires one pass over the n variables in topological order, so the overall cost is O(M · n) sampling steps (each step also pays for looking up the relevant CPD entry).

Forward Sampling – Absolute Error
From the Hoeffding bound:

P_D\big(\hat{P}_D(y) \notin [P(y) - \epsilon, P(y) + \epsilon]\big) \le 2 e^{-2 M \epsilon^2}.

Thus, to achieve an estimate whose absolute error is bounded by ε with probability at least 1 − δ, we require

M \ge \frac{\ln(2/\delta)}{2\epsilon^2}.

Forward Sampling – Relative Error
From the Chernoff bound:

P_D\big(\hat{P}_D(y) \notin [P(y)(1-\epsilon), P(y)(1+\epsilon)]\big) \le 2 e^{-M P(y) \epsilon^2 / 3}.

Thus, to achieve an estimate whose relative error is bounded by ε with probability at least 1 − δ, we require

M \ge \frac{3\ln(2/\delta)}{P(y)\,\epsilon^2}.

Two difficulties follow. First, if P(y) is very small, it is likely that we will not generate any samples at all in which the event holds; an estimate of 0 is then not within any relative error of the truth. Second, we do not know P(y), so we cannot even tell in advance how many samples are needed.

Conditional Probability Queries
We are often interested in conditional probabilities of the form P(y | E = e). Unfortunately, this estimation task turns out to be significantly harder.

- Rejection sampling: generate samples ξ from P(X) with forward sampling and reject any sample that is not compatible with e. The surviving samples are distributed according to P(X | e). The problem is that the expected number of particles that survive, out of an original sample set of size M, is only M · P(e).

- Ratio estimate: estimate P(y, e) and P(e) separately and compute the ratio. If \hat{P}(e) \in [(1-\epsilon)P(e), (1+\epsilon)P(e)] and \hat{P}(y,e) \in [(1-\epsilon)P(y,e), (1+\epsilon)P(y,e)], then

\left(1 - \frac{2\epsilon}{1+\epsilon}\right)\frac{P(y,e)}{P(e)} \;\le\; \frac{\hat{P}(y,e)}{\hat{P}(e)} \;\le\; \left(1 + \frac{2\epsilon}{1-\epsilon}\right)\frac{P(y,e)}{P(e)}.

However, the number of samples required to get a low relative error on P(e) again grows linearly with 1/P(e).
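To make forward sampling and the rejection-based conditional estimate concrete, here is a minimal Python sketch. The two-variable network D → G and its CPD numbers are made up for illustration only (they are not the lecture's student-network CPDs):

```python
import random

# Hypothetical two-variable network D -> G with made-up CPDs, standing in for
# the larger student network used in the lecture.
P_D = {"d0": 0.6, "d1": 0.4}                      # P(D)
P_G_given_D = {                                   # P(G | D)
    "d0": {"g1": 0.7, "g2": 0.3},
    "d1": {"g1": 0.2, "g2": 0.8},
}

def sample_categorical(dist):
    """Draw one value from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    """One forward-sampling pass: parents before children (topological order)."""
    d = sample_categorical(P_D)
    g = sample_categorical(P_G_given_D[d])
    return {"D": d, "G": g}

def estimate(M=10_000):
    particles = [forward_sample() for _ in range(M)]
    # Estimate P(G = g1) as the fraction of particles in which the event holds.
    p_g1 = sum(x["G"] == "g1" for x in particles) / M
    # Rejection-sampling estimate of P(D = d1 | G = g1): keep only particles
    # compatible with the evidence; on average only M * P(G = g1) survive.
    kept = [x for x in particles if x["G"] == "g1"]
    p_d1_given_g1 = (sum(x["D"] == "d1" for x in kept) / len(kept)) if kept else float("nan")
    return p_g1, p_d1_given_g1

if __name__ == "__main__":
    print(estimate())
```

If the evidence event is rare, `kept` is small (or empty), which is exactly the wastefulness of rejection sampling discussed above.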
Note also that an absolute-error guarantee on P(e) is easy to obtain, but it does not suffice to give any kind of bound on the ratio P(y,e)/P(e).

Likelihood Weighting
The rejection-sampling process seems very wasteful in the way it handles evidence. It seems much more sensible to simply force the samples to take the observed values at the observed nodes. However, this naive approach can produce incorrect results.

Assume our evidence is S = s^1. Under the naive process, the expected fraction of samples that have I = i^1 is still 30 percent, since I is sampled from its prior. This approach therefore fails to reflect the fact that the posterior probability of i^1 is higher once we observe s^1 (0.41 in the lecture's example). The fix: a sample in which we drew I = i^1 and then forced S = s^1 should be worth 0.8 of a sample (since P(s^1 | i^1) = 0.8), whereas one in which we drew I = i^0 and forced S = s^1 should be worth only 0.05 of a sample (since P(s^1 | i^0) = 0.05).

Likelihood Weighting – Example
Evidence: e = {L = l^0, S = s^1}. Start with w = 1.
- Sample D = d^1.
- Sample I = i^0.
- S is observed: set S = s^1 and update w ← w · P(s^1 | i^0) = w · 0.05, so w = 0.05.
- Sample G = g^2.
- L is observed: set L = l^0 and update w ← w · P(l^0 | g^2) = w · 0.4, so w = 0.02.
The result is the weighted particle ξ = <D = d^1, I = i^0, S = s^1, G = g^2, L = l^0> with w = 0.02.

Likelihood Weighting – The Estimates
Given D, a set of weighted particles (ξ[m], w[m]), the estimate for P(y | e) is

\hat{P}_D(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{\xi[m]\langle Y\rangle = y\}}{\sum_{m=1}^{M} w[m]}.

The same set of particles can be used to estimate the probability of any event y.

Importance Sampling
Importance sampling is a general approach for estimating the expectation of a function f(x) relative to some target distribution P(X). As we have seen, we can estimate this expectation by generating samples x[1], ..., x[M] from P and computing

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m]).

Sometimes, however, it is impossible or computationally very expensive to generate samples from P. For example, P might be a posterior distribution of a Bayesian network or a prior distribution of a Markov network. We might therefore prefer to generate samples from a different distribution, the proposal distribution Q.

Unnormalized Importance Sampling
If we generate samples from Q instead of P, we need to adjust the estimator to compensate for the incorrect sampling distribution. We define the unnormalized importance sampling estimator

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m]) \frac{P(x[m])}{Q(x[m])},

where the set of samples D = {x[1], ..., x[M]} is generated from Q. The estimator is based on the observation that

E_{Q}\!\left[f(X)\frac{P(X)}{Q(X)}\right] = \sum_{x} Q(x)\, f(x)\, \frac{P(x)}{Q(x)} = \sum_{x} f(x)\, P(x) = E_P[f(X)].

The factor P(x[m])/Q(x[m]) can be viewed as a correction weight applied to the term f(x[m]); we define w(x) = P(x)/Q(x).

This analysis immediately implies that the estimator is unbiased, that is, its mean over data sets is precisely the desired value:

E_D[\hat{E}_D(f)] = E_Q[f(X)\, w(X)] = E_P[f(X)].

From the Central Limit Theorem, as M → ∞,

\hat{E}_D(f) \sim \mathcal{N}\!\left(E_P[f(X)],\ \frac{\sigma_Q^2}{M}\right), \qquad \sigma_Q^2 = \mathrm{Var}_{Q}[f(X)\, w(X)],

so the variance decreases linearly with the number of samples.
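As a small illustration of the unnormalized importance sampling estimator, here is a sketch over a three-value discrete space; the target P, proposal Q, and function f are made-up stand-ins:

```python
import random

# Target distribution P and proposal Q over a small discrete space (made-up numbers).
P = {0: 0.1, 1: 0.2, 2: 0.7}
Q = {0: 1/3, 1: 1/3, 2: 1/3}    # uniform proposal; must be positive wherever P is

f = lambda x: x * x              # the function whose expectation E_P[f] we want

def sample_from(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value

def unnormalized_is_estimate(M=100_000):
    total = 0.0
    for _ in range(M):
        x = sample_from(Q)
        w = P[x] / Q[x]          # correction weight w(x) = P(x) / Q(x)
        total += f(x) * w
    return total / M

exact = sum(P[x] * f(x) for x in P)   # 0.1*0 + 0.2*1 + 0.7*4 = 3.0
print(unnormalized_is_estimate(), "vs exact", exact)
```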
Normalized Importance Sampling
One problem with the unnormalized importance sampling estimator is that it assumes P is known. A frequent situation is that P is known only up to a normalizing constant Z; what we actually have access to is a function \tilde{P} such that \tilde{P}(X) = Z \cdot P(X). For example, in a Bayesian network B we might have P(X) = P_B(X | e), \tilde{P}(X) = P_B(X, e) and Z = P_B(e). In this context we define the weight

w(X) = \frac{\tilde{P}(X)}{Q(X)}.

We define the normalized importance sampling estimator

\hat{E}_D(f) = \frac{\sum_{m=1}^{M} f(x[m])\, w(x[m])}{\sum_{m=1}^{M} w(x[m])}.

The estimator is based on the observations that

E_Q[w(X)] = \sum_x Q(x)\,\frac{\tilde{P}(x)}{Q(x)} = \sum_x \tilde{P}(x) = Z,

and

E_P[f(X)] = \sum_x P(x)\, f(x) = \frac{1}{Z}\sum_x Q(x)\, f(x)\, \frac{\tilde{P}(x)}{Q(x)} = \frac{E_Q[f(X)\, w(X)]}{E_Q[w(X)]}.

The normalized estimator involves a quotient, and it is therefore much more difficult to analyze theoretically. Unlike the unnormalized estimator, it is not unbiased. This is immediate in the case M = 1: the estimator reduces to f(x[1]), whose mean is E_Q[f(X)] rather than E_P[f(X)]. Conversely, as M goes to infinity, the numerator and the denominator converge to their expected values; in general, the bias goes down as 1/M.

One can show that the variance of the normalized importance sampling estimator with M data instances is approximately

\mathrm{Var}[\hat{E}_D(f)] \approx \frac{1}{M}\,\mathrm{Var}_P[f(X)]\,\big(1 + \mathrm{Var}_Q[w(X)]\big)

(with the weights scaled so that E_Q[w(X)] = 1). This can be used to estimate the quality of a set of samples generated by normalized importance sampling. If we were instead to estimate E_P[f] using the standard sampling method, generating M IID samples from P(X), the resulting variance would be \mathrm{Var}_P[f(X)]/M. The ratio between these two variances is 1 + \mathrm{Var}_Q[w(X)], so we would expect M weighted samples generated by importance sampling to be "equivalent" to

M_{\mathrm{eff}} = \frac{M}{1 + \mathrm{Var}_Q[w(X)]}

IID samples generated from P.

The Mutilated Network Proposal Distribution
Assume that we are interested in a particular event Z = z (for example G = g^2), either because we wish to estimate its probability or because we have observed it as evidence. We wish to focus the sampling process on the parts of the joint distribution that are consistent with this event. It is easy to take this event into account when sampling L, but it is more difficult to account for G's influence on D, I and S. We therefore define a simple proposal distribution that "sets" the value of every Z_i ∈ Z to its prespecified value: the mutilated network B_{Z=z}, in which each Z_i ∈ Z has no parents and a CPD that assigns probability 1 to Z_i = z_i, while every other node keeps its original parents and CPD.

Importance sampling with the proposal distribution Q induced by the mutilated network B_{Z=z}, and the target \tilde{P}(X) = P_B(X, z), is precisely equivalent to the Likelihood Weighting algorithm with Z = z.

Proposition: Let ξ be a sample generated by LW(B, Z = z) and let w be its weight. Then the distribution over ξ is the one defined by the network B_{Z=z}, and

w(\xi) = \frac{P_B(\xi)}{P_{B_{Z=z}}(\xi)} = \frac{\tilde{P}(\xi)}{Q(\xi)}.

Proof (distribution over ξ): Let ξ' be some assignment to X. Then

P(\xi = \xi') = 0 \text{ if } \xi'\langle Z\rangle \ne z, \text{ and otherwise } P(\xi = \xi') = \prod_{X_i \notin Z} P_B\big(\xi'\langle X_i\rangle \mid \xi'\langle \mathrm{Pa}_{X_i}\rangle\big).

Let ξ'' be a sample generated by forward sampling with B_{Z=z}; then

P(\xi'' = \xi') = 0 \text{ if } \xi'\langle Z\rangle \ne z, \text{ and otherwise } P(\xi'' = \xi') = P_{B_{Z=z}}(\xi') = \prod_{X_i \notin Z} P_B\big(\xi'\langle X_i\rangle \mid \xi'\langle \mathrm{Pa}_{X_i}\rangle\big),

so Q(ξ), the distribution from which LW effectively draws its samples, is indeed P_{B_{Z=z}}(ξ).

Proof (weight):

w(\xi) = \prod_{Z_i \in Z} P_B\big(z_i \mid \xi\langle \mathrm{Pa}_{Z_i}\rangle\big) = \frac{\prod_{X_i} P_B\big(\xi\langle X_i\rangle \mid \xi\langle \mathrm{Pa}_{X_i}\rangle\big)}{\prod_{X_i \notin Z} P_B\big(\xi\langle X_i\rangle \mid \xi\langle \mathrm{Pa}_{X_i}\rangle\big)} = \frac{P_B(\xi)}{P_{B_{Z=z}}(\xi)} = \frac{\tilde{P}(\xi)}{Q(\xi)}.
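The equivalence above is easy to see in code. The sketch below runs likelihood weighting on a reduced two-node fragment I → S of the student network, using only the CPD entries quoted in the lecture (P(i^1) = 0.3, P(s^1 | i^1) = 0.8, P(s^1 | i^0) = 0.05); clamping S is exactly forward sampling in the mutilated network, and the weight is just the clamped CPD entry:

```python
import random

# CPDs for the two-node fragment I -> S, with the numbers quoted in the lecture.
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.2,  "s1": 0.8}}

def sample_categorical(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value

def lw_particle(evidence):
    """One likelihood-weighting particle: observed nodes are clamped (this is
    exactly forward sampling in the mutilated network), and the weight collects
    the CPD entries of the clamped nodes."""
    sample, weight = {}, 1.0
    # I is unobserved: sample it from its CPD.
    sample["I"] = sample_categorical(P_I)
    # S is observed: set it and multiply the weight by P(S = s | I).
    s = evidence["S"]
    sample["S"] = s
    weight *= P_S_given_I[sample["I"]][s]
    return sample, weight

def estimate_p_i1_given_s1(M=100_000):
    num = den = 0.0
    for _ in range(M):
        x, w = lw_particle({"S": "s1"})
        den += w
        if x["I"] == "i1":
            num += w
    return num / den   # normalized estimator: weight where I = i1 over total weight

print(estimate_p_i1_given_s1())   # for this fragment, close to 0.24 / 0.275 ~ 0.873
```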
Markov Chain Monte Carlo Methods
We now present an alternative sampling approach that generates a sequence of samples. The sequence is constructed so that, although the first sample may be generated from the prior, successive samples are generated from distributions that provably get closer and closer to the desired posterior. We define

P_\Phi(X) = P(X \mid e).

Unlike forward sampling methods (including likelihood weighting), Markov chain methods apply equally well to directed and to undirected models. Indeed, the algorithm is easier to present in the context of a distribution P_Φ defined in terms of a general set of factors Φ.

General Outline
We will:
• See the algorithm for Gibbs sampling (plus an example).
• Define and explain what Markov chains are.
• Connect the two and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains and the Metropolis–Hastings algorithm.
• Touch on the definition of mixing time and on methods for checking whether our chain has mixed.

Gibbs Sampling
Input: X, a set of variables; Φ, a set of factors; P^(0)(X), an initial distribution; T, the number of steps (and full samples). Output: x^(0), ..., x^(T), a set of samples, each of which is a full assignment of values to all X_i ∈ X. Starting from x^(0) ~ P^(0), each step resamples the variables one at a time: x_i is drawn from P_Φ(X_i | x_{-i}), where x_{-i} holds the current values of all the other variables.

Gibbs Sampling – Example (with s^0 and l^1)
We look for samples of D, I, G given S = s^0 and L = l^1. First we obtain one full sample (say, by forward sampling); assume we got D^(0) = d^1, I^(0) = i^0, G^(0) = g^2. Now we start creating new samples of G, I, D, in some fixed order over them.

- Sample G^(1) from P_Φ(G | d^1, i^0, s^0, l^1) ∝ P(G | i^0, d^1) P(l^1 | G); assume we get g^3.
- Sample I^(1) from P_Φ(I | d^1, g^3, s^0, l^1):

P_\Phi(I \mid d^1, g^3, s^0, l^1) = \frac{P(I)\, P(s^0 \mid I)\, P(g^3 \mid I, d^1)}{\sum_{i} P(i)\, P(s^0 \mid i)\, P(g^3 \mid i, d^1)}.

Assume we get i^1.
- Sample D^(1) from P_Φ(D | i^1, g^3, s^0, l^1):

P_\Phi(D \mid i^1, g^3, s^0, l^1) = \frac{P(D)\, P(g^3 \mid i^1, D)}{\sum_{d} P(d)\, P(g^3 \mid i^1, d)}.

We end up with a new sample: [s = s^0, l = l^1, G^(1) = g^3, I^(1) = i^1, D^(1) = d^1].

Markov Chains
The formal definition: a Markov chain is defined by a state space Val(X) and a transition model T(x → x'), where we also demand that for every state x,

\sum_{x'} T(x \to x') = 1.

Simply put, a Markov chain is made of:
• A set of states (in our case, each state will represent an instantiation of our probability space).
• A transition model that holds, for each state, the distribution over "which state can we visit next".
Note: in general the transition model could depend on the number of steps already taken (i.e., on the time the chain has been running). We only consider homogeneous Markov chains, where the transition model does not change over time.
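Here is a tiny sketch of a homogeneous Markov chain written exactly as in the definition above: a set of states plus, for each state, a distribution over the next state. The three-state chain and its numbers are invented purely for illustration:

```python
import random

# A tiny Markov chain written as an explicit transition model: for every state,
# a distribution over "which state can we visit next" (made-up numbers).
transition = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.25, "b": 0.5, "c": 0.25},
    "c": {"b": 0.5, "c": 0.5},
}

def step(state):
    """Take one step of the chain from `state` using the transition model."""
    r, acc = random.random(), 0.0
    for nxt, p in transition[state].items():
        acc += p
        if r <= acc:
            return nxt
    return nxt

def walk(start, T):
    """A T-step stroll through the chain; because the transition model does not
    change with t, the chain is homogeneous."""
    states = [start]
    for _ in range(T):
        states.append(step(states[-1]))
    return states

print(walk("a", 10))
```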
Example: The Drunken Grasshopper
We define our states to be the integers from −4 to 4. The drunken grasshopper's transition model hops one spot to the left or right with probability 0.25 each, and stays in the same place with probability 0.5. Formally, for i between −3 and 3:

T(i \to i) = 0.5, \qquad T(i \to i-1) = T(i \to i+1) = 0.25.

At the edges, where we cannot go any farther, the self-loop probability grows accordingly:

T(4 \to 4) = T(-4 \to -4) = 0.75, \qquad T(4 \to 3) = T(-4 \to -3) = 0.25.

Markov Chain as Samples of a Distribution
We can view each step as a distribution over "where can we be at that step". Each such distribution is defined by the previous one, summing over the chance of being at a specific state multiplied by the chance of moving from it to the next state:

P^{(t+1)}(X^{(t+1)} = x') = \sum_{x} P^{(t)}(X^{(t)} = x)\, T(x \to x').

Each step therefore represents a distribution over the states of our chain, and hence a distribution over the probability space X.

Asymptotic Behavior
For our purposes, the most important aspect of a Markov chain is its long-term behavior. Drunken grasshopper revisited: the location at time T is a random variable, and there is a hunch about its asymptotic behavior. After the first couple of steps the distribution is still concentrated near 0; by T = 10 the probabilities are already about 0.05 for +4 and −4, with only about 0.17 left on the value 0; and at T = 50 all nine states have probabilities between 0.1107 and 0.1116 — essentially uniform! (A small numerical sketch of this convergence appears below, after the regularity conditions.)

Markov Chain Monte Carlo (MCMC) Sampling
Input: P^(0)(X), an initial distribution; τ, the transition model; T, the number of steps (and full samples). Output: x^(0), ..., x^(T), a set of samples collected while strolling over the chain, each taken from the step-t distribution P^(t)(X). (In our case, each sample is a full assignment of values to all X_i ∈ X.)

Stationary Distributions
Just as in numerical analysis, we expect the following to hold in the limit:

P^{(t+1)}(x') \approx P^{(t)}(x') = \sum_{x} P^{(t)}(x)\, T(x \to x').

Thus, given the transition model, we get a set of |Val(X)| unknowns and |Val(X)| equalities. We must not forget to normalize the probability, adding the equation

\sum_{x \in Val(X)} \pi(x) = 1,

and we finally obtain a distribution out of our chain. Key questions: does the process really converge? And if so, is there only one such target distribution?

Stationary Distributions – cont.
Formally, a distribution π is a stationary distribution of the chain if, for every x',

\pi(X = x') = \sum_{x} \pi(X = x)\, T(x \to x').

That is, π(x) satisfies exactly the equation we desired for P^(t)(x): if for some t, P^(t)(x) gets close enough to π(x), the process will converge.

Stationary Distributions – Bad Examples
In general, a Markov chain may have more than one stationary distribution. For example, in reducible Markov chains the chain contains several areas (sets of states) that are unreachable from one another; the starting state then determines the area, and therefore the stationary distribution to which the process converges. A chain may also fail to converge to a stationary distribution at all, as in the periodic example shown in the lecture, where the distribution simply keeps flipping back and forth (equivalently, p_{t+1} = 1 − p_t).

Regular Markov Chains
The formal definition: a Markov chain is regular if there exists an integer k such that, for every pair of states x, x', the probability of getting from x to x' in exactly k steps is greater than 0 (meaning that no transition along the path has probability 0, so each step is legitimate).

Simpler demands: we will only demand the following two conditions, which together ensure that our chain is regular:
1) every two states are connected by a path of positive-probability transitions (τ(x → x') > 0 along the path);
2) every state has a positive-probability self-loop (τ(x → x) > 0).
These two demands often hold in practice, and they guarantee regularity.
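Here is the promised sketch: it propagates the step-t distribution of the grasshopper chain forward using P^(t+1)(x') = Σ_x P^(t)(x) T(x → x'), so one can watch it approach the near-uniform values quoted above:

```python
# Propagating the step-t distribution P(t) of the drunken-grasshopper chain:
# states are the integers -4..4; from an interior state the grasshopper stays
# with probability 0.5 and hops one spot left/right with probability 0.25 each;
# at the two edges the self-loop probability grows to 0.75.
states = list(range(-4, 5))

def transition_prob(i, j):
    if j == i:
        return 0.75 if abs(i) == 4 else 0.5
    if abs(j - i) == 1 and -4 <= j <= 4:
        return 0.25
    return 0.0

def propagate(p, steps):
    """P(t+1)(x') = sum_x P(t)(x) * T(x -> x')."""
    for _ in range(steps):
        p = {j: sum(p[i] * transition_prob(i, j) for i in states) for j in states}
    return p

p0 = {s: (1.0 if s == 0 else 0.0) for s in states}   # start at 0 with certainty
for t in (1, 2, 10, 50):
    pt = propagate(p0, t)
    print(t, [round(pt[s], 4) for s in states])
# By t = 50 all nine states have probability close to 1/9 ~ 0.111 (near uniform).
```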
Regular Markov Chains – cont.
What is regularity good for? Why would we even demand regularity (or anything that ensures regularity) of our Markov chain? Theorem 12.3, as stated in the lecture, is the reason: a finite-state Markov chain that is regular has a unique stationary distribution π, and the process converges to it from any starting distribution. (So, that's a good enough reason...)

MCMC Sampling – Revisited
What does this mean in practice? When we have a transition model for switching between states of the chain, we simply use it to travel through the chain. After reaching step T, the chain will presumably have (approximately) reached the stationary distribution. Having traveled T steps, we simply keep on traveling through the chain to collect samples from that distribution. We discuss the issue of choosing T briefly later; in practice we simply keep running the algorithm until we have enough evidence that we have converged to the stationary distribution.

Gibbs Sampling Revisited
• We would like to define a Markov chain that represents samples from our distribution.
• But the distribution is given by a graphical model, so we will have to use that.
• The idea: use the Gibbs sampling process as a way to travel through a Markov chain whose state space is simply the whole probability space (each state is an instantiation of all the X_i's that appear in the graphical model).
• Problem: Gibbs sampling takes only one variable at a time and samples it given the others. How can we handle that?

Multiple Transition Models
We could define one cumbersome transition model for the whole chain, but it is easier to consider one transition model τ_i for each variable X_i. We now have a set of transition models τ_i; each such τ_i is called a kernel of the chain. How do we combine several kernels into the single transition model that defines our travel through the chain? Two options:
• Randomly choose which kernel to use at each step: for k variables, τ := (1/k) Σ_{i=1}^{k} τ_i. Note that this combined transition model is legal (the sum of outgoing probabilities is still 1).
• Sequentially cycle through the kernels. This creates a bit of a pickle, since some of the properties we demanded of the transition model (homogeneity, for example) no longer necessarily hold. But if we regard the whole round of k steps (one per kernel) as a single step of the chain, the chain stays homogeneous, and the rest of our demands can be shown to hold whenever each kernel satisfies some weaker demands.

Gibbs Chain
We use the following kernels, taken directly from the Gibbs algorithm:

\tau_i\big((x_{-i}, x_i) \to (x_{-i}, x_i')\big) = P_\Phi(x_i' \mid x_{-i}).

Just as in the algorithm, when sampling a new value for the variable X_i we disregard its current value, and we use the most up-to-date values of the other variables at that point. Note: to obtain the conditional distribution we simply reduce the evidence in the graphical model to get an appropriate set of factors and compute with them; as shown next, each such evaluation is cheap.

Gibbs Chain – cont.
Evaluating each value in a kernel: we have

P_\Phi(x_i' \mid x_{-i}) = \frac{P_\Phi(x_i', x_{-i})}{P_\Phi(x_{-i})} = \frac{\prod_{\phi \in \Phi} \phi(x_i', x_{-i})}{\sum_{x_i''} \prod_{\phi \in \Phi} \phi(x_i'', x_{-i})},

and every factor whose scope does not contain X_i is identical in the numerator and the denominator and cancels out. We therefore only need the factors that mention X_i (the set of relevant variables is also called the Markov blanket of X_i).
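To make the kernel evaluation concrete, here is a sketch of computing the Gibbs conditional from a factor set (only the factors mentioning the resampled variable enter) and of a full Gibbs sweep. The three binary variables and the factor tables are made up for illustration:

```python
import random

# Factors over binary variables, each given as (scope, table), where the table maps
# a tuple of values (in scope order) to a nonnegative number.  The factors below
# are invented, just to exercise the kernel computation.
factors = [
    (("A",),     {(0,): 1.0, (1,): 3.0}),
    (("A", "B"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}),
    (("B", "C"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}),
]
domain = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}

def gibbs_conditional(var, state):
    """P_Phi(var | everything else): only the factors whose scope mentions `var`
    (its Markov blanket) contribute; all other factors cancel in the ratio."""
    scores = []
    for value in domain[var]:
        trial = dict(state, **{var: value})
        score = 1.0
        for scope, table in factors:
            if var in scope:
                score *= table[tuple(trial[v] for v in scope)]
        scores.append(score)
    z = sum(scores)
    return [s / z for s in scores]

def gibbs_sweep(state):
    """One Gibbs step: resample each variable in turn from its full conditional."""
    for var in domain:
        probs = gibbs_conditional(var, state)
        state[var] = random.choices(domain[var], weights=probs)[0]
    return dict(state)

state = {"A": 0, "B": 0, "C": 0}
samples = [gibbs_sweep(state) for _ in range(5)]
print(samples)
```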
Gibbs Chain – Bad Example
We note that deriving a Gibbs chain from a graphical model is not enough to ensure that the derived Markov chain is regular (and therefore we do not know that it will converge!). Consider the following example: two binary variables X_1, X_2 and their deterministic XOR Y, where we observe P(X_1, X_2 | Y = 1). Our probability space then holds only two viable options: the assignments (0, 1, 1) and (1, 0, 1).
- If we start off at (0, 1, 1) and resample X_1 given X_2 = 1 and the evidence Y = 1, we can only get its current value 0. Next we resample X_2 given X_1 = 0 and Y = 1, and can only get 1. Thus we stay at (0, 1, 1) forever.
- If we start off at (1, 0, 1), we get the mirror case.
So we obtained a reducible Gibbs chain (with two stationary distributions)!

Gibbs Chain – cont.
• So, we must be careful when constructing a Gibbs chain.
• Luckily, there is another important theorem (stated in the lecture) that ensures a unique stationary distribution: if all the factors in Φ are strictly positive, then the Gibbs chain is regular, and hence converges to its unique stationary distribution.
• Note that the last example happened because we had a transition with a strict probability of zero (e.g., the transition (1,0,1) → (1,1,1)).
• If we "correct" such models by inserting a small ε in place of each 0, the theorem applies.
• However, that ε will cause us other problems, in the form of bad mixing time (discussed ahead).

Gibbs Chains – Summary
Gibbs chains convert the hard problem of inference into a sequence of "easy" sampling steps, exploiting much of what is generally known about Markov chains.
• Pros: the simplest way one can think of to generate a Markov chain for a probability distribution; computationally efficient.
• Cons: often slow to mix (converge), especially when we have transition values close to 0 or 1; and it only applies if we can sample from a product of factors. That may be fine in our case, but in general it is very limiting (for example, the approach may not work for continuous distributions).

Broader Class of Markov Chains
We have seen Gibbs chains, and some conditions that make sure they converge. We now show a more general way to create a Markov chain for a given distribution, one that focuses on ensuring convergence. First, we need a definition.

Reversible Markov Chains
Definition: a Markov chain τ is reversible if there exists a distribution π such that, for all states x, x',

\pi(x)\, \tau(x \to x') = \pi(x')\, \tau(x' \to x)

(the detailed balance equation). Using this definition, we can easily achieve the following: if detailed balance holds for π, then π is a stationary distribution of τ.

Proof: we have π(x) τ(x → x') = π(x') τ(x' → x), and summing these equations over all x gives

\sum_x \pi(x)\,\tau(x \to x') = \sum_x \pi(x')\,\tau(x' \to x) = \pi(x') \sum_x \tau(x' \to x) = \pi(x') \cdot 1 = \pi(x'),

which is exactly the definition of a stationary distribution π for the transition model τ.

Metropolis–Hastings Algorithm
• We use the idea of a proposal distribution (already shown by Dafna in the case of importance sampling).
• We define a transition model T^Q based on that proposal distribution.
• We then use an "approver" A to decide whether a proposed move should actually occur. The choice of this approver is where we get our freedom. Following this description, the real transition model τ is:

\tau(x \to x') = T^Q(x \to x')\, A(x \to x') \ \text{ for } x' \ne x, \qquad
\tau(x \to x) = T^Q(x \to x) + \sum_{x' \ne x} T^Q(x \to x')\,\big(1 - A(x \to x')\big).

Metropolis–Hastings Algorithm – cont.
How do we choose the approver? We want the detailed balance equation (12.24) to hold:

\pi(x)\, T^Q(x \to x')\, A(x \to x') = \pi(x')\, T^Q(x' \to x)\, A(x' \to x),

so we need

\frac{A(x \to x')}{A(x' \to x)} = \frac{\pi(x')\, T^Q(x' \to x)}{\pi(x)\, T^Q(x \to x')} =: \mu.

If μ > 1, we swap the roles of the numerator and the denominator and get 1/μ < 1. To maximize the acceptance rate (so that the chain moves more freely and mixes faster), we simply let the approver on one side take the value 1 and the approver on the other side take the fraction. Finally, we get the generic formula

A(x \to x') = \min\left\{1,\ \frac{\pi(x')\, T^Q(x' \to x)}{\pi(x)\, T^Q(x \to x')}\right\}.
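A minimal sketch of a single Metropolis–Hastings move using the acceptance rule just derived; the target weights and the left/right proposal on four states are invented toy choices, and the long run at the end only illustrates convergence of the empirical frequencies:

```python
import random

def metropolis_hastings_step(x, pi, propose, q):
    """One Metropolis-Hastings move.
    pi(x)       -- target probability (only needed up to a constant)
    propose(x)  -- draws a candidate x' from the proposal T^Q(x -> .)
    q(x, x')    -- proposal probability T^Q(x -> x')
    Acceptance: min(1, [pi(x') q(x', x)] / [pi(x) q(x, x')])."""
    x_new = propose(x)
    accept = min(1.0, (pi(x_new) * q(x_new, x)) / (pi(x) * q(x, x_new)))
    return x_new if random.random() < accept else x

# Toy example: target proportional to (1, 2, 3, 4) on states 0..3, with a
# "move one step left or right" proposal (restricted at the two ends).
weights = [1.0, 2.0, 3.0, 4.0]
pi = lambda x: weights[x]

def propose(x):
    candidates = [c for c in (x - 1, x + 1) if 0 <= c <= 3]
    return random.choice(candidates)

def q(x, x_new):
    candidates = [c for c in (x - 1, x + 1) if 0 <= c <= 3]
    return 1.0 / len(candidates) if x_new in candidates else 0.0

x, counts = 0, [0, 0, 0, 0]
for _ in range(200_000):
    x = metropolis_hastings_step(x, pi, propose, q)
    counts[x] += 1
print([c / sum(counts) for c in counts])   # should approach (0.1, 0.2, 0.3, 0.4)
```

Note how the proposal ratio q(x', x)/q(x, x') matters here because the end states propose differently from the interior ones; with a symmetric proposal it would cancel out.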
From these demands we conclude that the chain we described does indeed converge to π(x).

Mixing Time
• So, we have constructed a Markov chain and guaranteed that it converges to our target distribution.
• Using Gibbs, we even gained the ability to sample from a distribution that would otherwise be hard to sample from.
• But one important question is left unanswered: how much time will it take our chain to reach the desired distribution?
• The general answer: we cannot know! There is very little theory behind this issue, and not many bounds on the required number of steps. We show a few definitions and then jump on to practice.

Mixing Time – Definitions
We define the mixing time τ_mix(ε) as the smallest t such that, for every starting distribution, the distribution at step t is within ε of the stationary distribution π, where distance is measured by the total variation distance

D_{var}\big(P^{(t)}; \pi\big) = \max_{S \subseteq Val(X)} \big|P^{(t)}(S) - \pi(S)\big|,

or, alternatively, D_{var}(P^{(t)}; \pi) = \frac{1}{2}\sum_{x} \big|P^{(t)}(x) - \pi(x)\big|.

Conductance
We define the conductance of a Markov chain as the minimum, over all sets of states S with π(S) ≤ 1/2, of the ratio between the probability of leaving S in a single step (starting from π restricted to S) and π(S). This characteristic of the chain gives a clue about the chances of managing to visit all around the chain, instead of getting "stuck" in a specific area.

Example
Here is an example where low conductance shows that we can expect a long mixing time. Assume the state space splits into two areas, where the only way to transition between the two is through two unique states, and that this transition has a very low probability. The conductance is then very low (taking S = {x_1, x_2, x_3}), and we can also expect the mixing time of the chain to be quite high.

In Practice
• In practice, self-examination is essentially our only tool for checking whether the chain has mixed.
• We do this in a few different ways. For example, we run a number of chains in parallel and compare the results: we know that all of them converge (at some point) to the same distribution, so if the runs are too different from one another we can say for sure that we have NOT yet reached the mixing time.
• Repeating this test (or running a larger number of chains in parallel), we can accumulate many observations that are "not bad", and conclude that overall our chains have mixed well enough.

[The lecture then shows several "Some statistics of self-observation" plots comparing such parallel runs, with verdicts ranging from "no" to "maybe".]

Another Little Issue
Another issue worth discussing: having taken T steps, we can now start collecting samples from our travel over the chain. But we notice that these samples are definitely dependent on one another. A common remedy is to take some number of steps d between every two collected samples. It is easy to see that by doing this we "lie to ourselves" and only throw away information that was present in the discarded samples; nonetheless, when processing each sample is expensive, this thinning lets us spend the processing only on samples that are "slightly more independent".

Summary
To sum up:
• We defined Markov chains and learned how to use them.
• We saw a general algorithm for constructing a chain for a target distribution.
• We saw a more particular way to construct such chains for a graphical model (based on factors, but easily adapted to CPDs).
• We discussed the difficulty of establishing convergence time in practice, and some statistical tools for coping with it: running the chain for a long time and relying on self-observation.
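As a closing illustration of the parallel-run self-observation check described above (placed here, after the summary, for reference), the sketch below compares the empirical state frequencies of two independently seeded runs of the grasshopper chain; a large discrepancy is evidence that the runs have NOT mixed:

```python
import random

def chain(T, seed):
    """A stand-in MCMC run: the drunken-grasshopper walk on -4..4.  In practice
    this would be whatever Gibbs / Metropolis-Hastings chain we constructed."""
    rng = random.Random(seed)
    x, visits = 0, {s: 0 for s in range(-4, 5)}
    for _ in range(T):
        r = rng.random()
        if abs(x) == 4:
            x += (-1 if x > 0 else 1) if r < 0.25 else 0
        else:
            x += -1 if r < 0.25 else (1 if r < 0.5 else 0)
        visits[x] += 1
    return {s: v / T for s, v in visits.items()}

def disagreement(p, q):
    """Total-variation-style distance between the two empirical distributions."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

for T in (100, 1_000, 100_000):
    d = disagreement(chain(T, seed=1), chain(T, seed=2))
    print(T, round(d, 3))
# If the two runs still disagree a lot, we know for sure we have NOT mixed;
# small disagreement over many repetitions is (only) evidence of mixing.
```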