Probabilistic Graphical Models
Chapter 12: Particle-Based Approximate Inference
Uri Meir, Dafna Sadeh

Particle-Based Approximate Inference
Particle-based methods approximate the joint distribution by a set of instantiations of all (or some of) the variables in the network. These instantiations, often called particles, are designed to provide a good representation of the overall probability distribution.

The general framework for most of this lecture: consider some distribution P(X), and assume we want to estimate the probability of some event Y = y relative to P, for some Y ⊆ X and y ∈ Val(Y). More generally, we might want to estimate the expectation of some function f relative to P. We approximate this expectation by generating a set of M particles, evaluating the function on each particle, and aggregating the results.

For example, let x[1], ..., x[M] be sampled IID from P. If P(x = 1) = p, the estimator for p is

\hat{p}_D = \frac{1}{M} \sum_{m=1}^{M} x[m].

More generally, for any distribution P, event y and function f:

\hat{P}_D(y) = \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\{x[m] = y\}, \qquad \hat{E}_D(f) = \frac{1}{M} \sum_{m=1}^{M} f(x[m]).

Forward Sampling
Input: B, a Bayesian network over X. Output: ξ = (x_1, ..., x_n), a sample of X from B. The variables are sampled one at a time in a topological order, each X_i from P(X_i | pa_i) using the already-sampled values of its parents.

Forward Sampling – Example
In the student network: i = 1: sample D; assume D = d^1. i = 2: sample I; assume I = i^0. i = 3: sample G from P(G | i^0, d^1). And continue until all variables are assigned.

Forward Sampling – The Estimates
Given D = {ξ[1], ..., ξ[M]}, a set of particles (samples), and f, a function over X:

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(\xi[m]) is an estimate for E_P[f], and

\hat{P}_D(y) = \frac{1}{M}\sum_{m=1}^{M} \mathbf{1}\{\xi[m]\langle Y\rangle = y\} is an estimate for P(y).

Forward Sampling – Complexity
Let M be the total number of particles generated and n = |X|. Each particle requires one pass over the n variables in topological order, so the overall cost is O(M · n) sampling steps (each step also pays for looking up the relevant CPD entry).

Forward Sampling – Absolute Error
From the Hoeffding bound:

P_D\big(\hat{P}_D(y) \notin [P(y) - \epsilon, P(y) + \epsilon]\big) \le 2 e^{-2 M \epsilon^2}.

Thus, to achieve an estimate whose absolute error is bounded by ε with probability at least 1 − δ, we require

M \ge \frac{\ln(2/\delta)}{2\epsilon^2}.

Forward Sampling – Relative Error
From the Chernoff bound:

P_D\big(\hat{P}_D(y) \notin [P(y)(1-\epsilon), P(y)(1+\epsilon)]\big) \le 2 e^{-M P(y) \epsilon^2 / 3}.

Thus, to achieve an estimate whose relative error is bounded by ε with probability at least 1 − δ, we require

M \ge \frac{3\ln(2/\delta)}{P(y)\,\epsilon^2}.

Two difficulties follow. First, if P(y) is very small, it is likely that we will not generate any samples at all in which the event holds; an estimate of 0 is then not within any relative error of the truth. Second, we do not know P(y), so we cannot even tell in advance how many samples are needed.

Conditional Probability Queries
We are often interested in conditional probabilities of the form P(y | E = e). Unfortunately, this estimation task turns out to be significantly harder.

- Rejection sampling: generate samples ξ from P(X) with forward sampling and reject any sample that is not compatible with e. The surviving samples are distributed according to P(X | e). The problem is that the expected number of particles that survive, out of an original sample set of size M, is only M · P(e).

- Ratio estimate: estimate P(y, e) and P(e) separately and compute the ratio. If \hat{P}(e) \in [(1-\epsilon)P(e), (1+\epsilon)P(e)] and \hat{P}(y,e) \in [(1-\epsilon)P(y,e), (1+\epsilon)P(y,e)], then

\left(1 - \frac{2\epsilon}{1+\epsilon}\right)\frac{P(y,e)}{P(e)} \;\le\; \frac{\hat{P}(y,e)}{\hat{P}(e)} \;\le\; \left(1 + \frac{2\epsilon}{1-\epsilon}\right)\frac{P(y,e)}{P(e)}.

However, the number of samples required to get a low relative error on P(e) again grows linearly with 1/P(e).
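To make forward sampling and the rejection-based conditional estimate concrete, here is a minimal Python sketch. The two-variable network D → G and its CPD numbers are made up for illustration only (they are not the lecture's student-network CPDs):

```python
import random

# Hypothetical two-variable network D -> G with made-up CPDs, standing in for
# the larger student network used in the lecture.
P_D = {"d0": 0.6, "d1": 0.4}                      # P(D)
P_G_given_D = {                                   # P(G | D)
    "d0": {"g1": 0.7, "g2": 0.3},
    "d1": {"g1": 0.2, "g2": 0.8},
}

def sample_categorical(dist):
    """Draw one value from a {value: probability} dictionary."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point round-off

def forward_sample():
    """One forward-sampling pass: parents before children (topological order)."""
    d = sample_categorical(P_D)
    g = sample_categorical(P_G_given_D[d])
    return {"D": d, "G": g}

def estimate(M=10_000):
    particles = [forward_sample() for _ in range(M)]
    # Estimate P(G = g1) as the fraction of particles in which the event holds.
    p_g1 = sum(x["G"] == "g1" for x in particles) / M
    # Rejection-sampling estimate of P(D = d1 | G = g1): keep only particles
    # compatible with the evidence; on average only M * P(G = g1) survive.
    kept = [x for x in particles if x["G"] == "g1"]
    p_d1_given_g1 = (sum(x["D"] == "d1" for x in kept) / len(kept)) if kept else float("nan")
    return p_g1, p_d1_given_g1

if __name__ == "__main__":
    print(estimate())
```

If the evidence event is rare, `kept` is small (or empty), which is exactly the wastefulness of rejection sampling discussed above.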
Note also that an absolute-error guarantee on P(e) is easy to obtain, but it does not suffice to give any kind of bound on the ratio P(y,e)/P(e).

Likelihood Weighting
The rejection-sampling process seems very wasteful in the way it handles evidence. It seems much more sensible to simply force the samples to take the observed values at the observed nodes. However, this naive approach can produce incorrect results.

Assume our evidence is S = s^1. Under the naive process, the expected fraction of samples that have I = i^1 is still 30 percent, since I is sampled from its prior. This approach therefore fails to reflect the fact that the posterior probability of i^1 is higher once we observe s^1 (0.41 in the lecture's example). The fix: a sample in which we drew I = i^1 and then forced S = s^1 should be worth 0.8 of a sample (since P(s^1 | i^1) = 0.8), whereas one in which we drew I = i^0 and forced S = s^1 should be worth only 0.05 of a sample (since P(s^1 | i^0) = 0.05).

Likelihood Weighting – Example
Evidence: e = {L = l^0, S = s^1}. Start with w = 1.
- Sample D = d^1.
- Sample I = i^0.
- S is observed: set S = s^1 and update w ← w · P(s^1 | i^0) = w · 0.05, so w = 0.05.
- Sample G = g^2.
- L is observed: set L = l^0 and update w ← w · P(l^0 | g^2) = w · 0.4, so w = 0.02.
The result is the weighted particle ξ = <D = d^1, I = i^0, S = s^1, G = g^2, L = l^0> with w = 0.02.

Likelihood Weighting – The Estimates
Given D, a set of weighted particles (ξ[m], w[m]), the estimate for P(y | e) is

\hat{P}_D(y \mid e) = \frac{\sum_{m=1}^{M} w[m]\,\mathbf{1}\{\xi[m]\langle Y\rangle = y\}}{\sum_{m=1}^{M} w[m]}.

The same set of particles can be used to estimate the probability of any event y.

Importance Sampling
Importance sampling is a general approach for estimating the expectation of a function f(x) relative to some target distribution P(X). As we have seen, we can estimate this expectation by generating samples x[1], ..., x[M] from P and computing

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m]).

Sometimes, however, it is impossible or computationally very expensive to generate samples from P. For example, P might be a posterior distribution of a Bayesian network or a prior distribution of a Markov network. We might therefore prefer to generate samples from a different distribution, the proposal distribution Q.

Unnormalized Importance Sampling
If we generate samples from Q instead of P, we need to adjust the estimator to compensate for the incorrect sampling distribution. We define the unnormalized importance sampling estimator

\hat{E}_D(f) = \frac{1}{M}\sum_{m=1}^{M} f(x[m]) \frac{P(x[m])}{Q(x[m])},

where the set of samples D = {x[1], ..., x[M]} is generated from Q. The estimator is based on the observation that

E_{Q}\!\left[f(X)\frac{P(X)}{Q(X)}\right] = \sum_{x} Q(x)\, f(x)\, \frac{P(x)}{Q(x)} = \sum_{x} f(x)\, P(x) = E_P[f(X)].

The factor P(x[m])/Q(x[m]) can be viewed as a correction weight applied to the term f(x[m]); we define w(x) = P(x)/Q(x).

This analysis immediately implies that the estimator is unbiased, that is, its mean over data sets is precisely the desired value:

E_D[\hat{E}_D(f)] = E_Q[f(X)\, w(X)] = E_P[f(X)].

From the Central Limit Theorem, as M → ∞,

\hat{E}_D(f) \sim \mathcal{N}\!\left(E_P[f(X)],\ \frac{\sigma_Q^2}{M}\right), \qquad \sigma_Q^2 = \mathrm{Var}_{Q}[f(X)\, w(X)],

so the variance decreases linearly with the number of samples.
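As a small illustration of the unnormalized importance sampling estimator, here is a sketch over a three-value discrete space; the target P, proposal Q, and function f are made-up stand-ins:

```python
import random

# Target distribution P and proposal Q over a small discrete space (made-up numbers).
P = {0: 0.1, 1: 0.2, 2: 0.7}
Q = {0: 1/3, 1: 1/3, 2: 1/3}    # uniform proposal; must be positive wherever P is

f = lambda x: x * x              # the function whose expectation E_P[f] we want

def sample_from(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value

def unnormalized_is_estimate(M=100_000):
    total = 0.0
    for _ in range(M):
        x = sample_from(Q)
        w = P[x] / Q[x]          # correction weight w(x) = P(x) / Q(x)
        total += f(x) * w
    return total / M

exact = sum(P[x] * f(x) for x in P)   # 0.1*0 + 0.2*1 + 0.7*4 = 3.0
print(unnormalized_is_estimate(), "vs exact", exact)
```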
Normalized Importance Sampling
One problem with the unnormalized importance sampling estimator is that it assumes P is known. A frequent situation is that P is known only up to a normalizing constant Z; what we actually have access to is a function \tilde{P} such that \tilde{P}(X) = Z \cdot P(X). For example, in a Bayesian network B we might have P(X) = P_B(X | e), \tilde{P}(X) = P_B(X, e) and Z = P_B(e). In this context we define the weight

w(X) = \frac{\tilde{P}(X)}{Q(X)}.

We define the normalized importance sampling estimator

\hat{E}_D(f) = \frac{\sum_{m=1}^{M} f(x[m])\, w(x[m])}{\sum_{m=1}^{M} w(x[m])}.

The estimator is based on the observations that

E_Q[w(X)] = \sum_x Q(x)\,\frac{\tilde{P}(x)}{Q(x)} = \sum_x \tilde{P}(x) = Z,

and

E_P[f(X)] = \sum_x P(x)\, f(x) = \frac{1}{Z}\sum_x Q(x)\, f(x)\, \frac{\tilde{P}(x)}{Q(x)} = \frac{E_Q[f(X)\, w(X)]}{E_Q[w(X)]}.

The normalized estimator involves a quotient, and it is therefore much more difficult to analyze theoretically. Unlike the unnormalized estimator, it is not unbiased. This is immediate in the case M = 1: the estimator reduces to f(x[1]), whose mean is E_Q[f(X)] rather than E_P[f(X)]. Conversely, as M goes to infinity, the numerator and the denominator converge to their expected values; in general, the bias goes down as 1/M.

One can show that the variance of the normalized importance sampling estimator with M data instances is approximately

\mathrm{Var}[\hat{E}_D(f)] \approx \frac{1}{M}\,\mathrm{Var}_P[f(X)]\,\big(1 + \mathrm{Var}_Q[w(X)]\big)

(with the weights scaled so that E_Q[w(X)] = 1). This can be used to estimate the quality of a set of samples generated by normalized importance sampling. If we were instead to estimate E_P[f] using the standard sampling method, generating M IID samples from P(X), the resulting variance would be \mathrm{Var}_P[f(X)]/M. The ratio between these two variances is 1 + \mathrm{Var}_Q[w(X)], so we would expect M weighted samples generated by importance sampling to be "equivalent" to

M_{\mathrm{eff}} = \frac{M}{1 + \mathrm{Var}_Q[w(X)]}

IID samples generated from P.

The Mutilated Network Proposal Distribution
Assume that we are interested in a particular event Z = z (for example G = g^2), either because we wish to estimate its probability or because we have observed it as evidence. We wish to focus the sampling process on the parts of the joint distribution that are consistent with this event. It is easy to take this event into account when sampling L, but it is more difficult to account for G's influence on D, I and S. We therefore define a simple proposal distribution that "sets" the value of every Z_i ∈ Z to its prespecified value: the mutilated network B_{Z=z}, in which each Z_i ∈ Z has no parents and a CPD that assigns probability 1 to Z_i = z_i, while every other node keeps its original parents and CPD.

Importance sampling with the proposal distribution Q induced by the mutilated network B_{Z=z}, and the target \tilde{P}(X) = P_B(X, z), is precisely equivalent to the Likelihood Weighting algorithm with Z = z.

Proposition: Let ξ be a sample generated by LW(B, Z = z) and let w be its weight. Then the distribution over ξ is the one defined by the network B_{Z=z}, and

w(\xi) = \frac{P_B(\xi)}{P_{B_{Z=z}}(\xi)} = \frac{\tilde{P}(\xi)}{Q(\xi)}.

Proof (distribution over ξ): Let ξ' be some assignment to X. Then

P(\xi = \xi') = 0 \text{ if } \xi'\langle Z\rangle \ne z, \text{ and otherwise } P(\xi = \xi') = \prod_{X_i \notin Z} P_B\big(\xi'\langle X_i\rangle \mid \xi'\langle \mathrm{Pa}_{X_i}\rangle\big).

Let ξ'' be a sample generated by forward sampling with B_{Z=z}; then

P(\xi'' = \xi') = 0 \text{ if } \xi'\langle Z\rangle \ne z, \text{ and otherwise } P(\xi'' = \xi') = P_{B_{Z=z}}(\xi') = \prod_{X_i \notin Z} P_B\big(\xi'\langle X_i\rangle \mid \xi'\langle \mathrm{Pa}_{X_i}\rangle\big),

so Q(ξ), the distribution from which LW effectively draws its samples, is indeed P_{B_{Z=z}}(ξ).

Proof (weight):

w(\xi) = \prod_{Z_i \in Z} P_B\big(z_i \mid \xi\langle \mathrm{Pa}_{Z_i}\rangle\big) = \frac{\prod_{X_i} P_B\big(\xi\langle X_i\rangle \mid \xi\langle \mathrm{Pa}_{X_i}\rangle\big)}{\prod_{X_i \notin Z} P_B\big(\xi\langle X_i\rangle \mid \xi\langle \mathrm{Pa}_{X_i}\rangle\big)} = \frac{P_B(\xi)}{P_{B_{Z=z}}(\xi)} = \frac{\tilde{P}(\xi)}{Q(\xi)}.
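The equivalence above is easy to see in code. The sketch below runs likelihood weighting on a reduced two-node fragment I → S of the student network, using only the CPD entries quoted in the lecture (P(i^1) = 0.3, P(s^1 | i^1) = 0.8, P(s^1 | i^0) = 0.05); clamping S is exactly forward sampling in the mutilated network, and the weight is just the clamped CPD entry:

```python
import random

# CPDs for the two-node fragment I -> S, with the numbers quoted in the lecture.
P_I = {"i0": 0.7, "i1": 0.3}
P_S_given_I = {"i0": {"s0": 0.95, "s1": 0.05},
               "i1": {"s0": 0.2,  "s1": 0.8}}

def sample_categorical(dist):
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value

def lw_particle(evidence):
    """One likelihood-weighting particle: observed nodes are clamped (this is
    exactly forward sampling in the mutilated network), and the weight collects
    the CPD entries of the clamped nodes."""
    sample, weight = {}, 1.0
    # I is unobserved: sample it from its CPD.
    sample["I"] = sample_categorical(P_I)
    # S is observed: set it and multiply the weight by P(S = s | I).
    s = evidence["S"]
    sample["S"] = s
    weight *= P_S_given_I[sample["I"]][s]
    return sample, weight

def estimate_p_i1_given_s1(M=100_000):
    num = den = 0.0
    for _ in range(M):
        x, w = lw_particle({"S": "s1"})
        den += w
        if x["I"] == "i1":
            num += w
    return num / den   # normalized estimator: weight where I = i1 over total weight

print(estimate_p_i1_given_s1())   # for this fragment, close to 0.24 / 0.275 ~ 0.873
```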
Markov Chain Monte Carlo Methods
We now present an alternative sampling approach that generates a sequence of samples. The sequence is constructed so that, although the first sample may be generated from the prior, successive samples are generated from distributions that provably get closer and closer to the desired posterior. We define

P_\Phi(X) = P(X \mid e).

Unlike forward sampling methods (including likelihood weighting), Markov chain methods apply equally well to directed and to undirected models. Indeed, the algorithm is easier to present in the context of a distribution P_Φ defined in terms of a general set of factors Φ.

General Outline
We will:
• See the algorithm for Gibbs sampling (plus an example).
• Define and explain what Markov chains are.
• Connect the two and define the Gibbs chain (the Markov chain created by the Gibbs algorithm).
• Discuss a larger class of Markov chains and the Metropolis–Hastings algorithm.
• Touch on the definition of mixing time and on methods for checking whether our chain has mixed.

Gibbs Sampling
Input: X, a set of variables; Φ, a set of factors; P^(0)(X), an initial distribution; T, the number of steps (and full samples). Output: x^(0), ..., x^(T), a set of samples, each of which is a full assignment of values to all X_i ∈ X. Starting from x^(0) ~ P^(0), each step resamples the variables one at a time: x_i is drawn from P_Φ(X_i | x_{-i}), where x_{-i} holds the current values of all the other variables.

Gibbs Sampling – Example (with s^0 and l^1)
We look for samples of D, I, G given S = s^0 and L = l^1. First we obtain one full sample (say, by forward sampling); assume we got D^(0) = d^1, I^(0) = i^0, G^(0) = g^2. Now we start creating new samples of G, I, D, in some fixed order over them.

- Sample G^(1) from P_Φ(G | d^1, i^0, s^0, l^1) ∝ P(G | i^0, d^1) P(l^1 | G); assume we get g^3.
- Sample I^(1) from P_Φ(I | d^1, g^3, s^0, l^1):

P_\Phi(I \mid d^1, g^3, s^0, l^1) = \frac{P(I)\, P(s^0 \mid I)\, P(g^3 \mid I, d^1)}{\sum_{i} P(i)\, P(s^0 \mid i)\, P(g^3 \mid i, d^1)}.

Assume we get i^1.
- Sample D^(1) from P_Φ(D | i^1, g^3, s^0, l^1):

P_\Phi(D \mid i^1, g^3, s^0, l^1) = \frac{P(D)\, P(g^3 \mid i^1, D)}{\sum_{d} P(d)\, P(g^3 \mid i^1, d)}.

We end up with a new sample: [s = s^0, l = l^1, G^(1) = g^3, I^(1) = i^1, D^(1) = d^1].

Markov Chains
The formal definition: a Markov chain is defined by a state space Val(X) and a transition model T(x → x'), where we also demand that for every state x,

\sum_{x'} T(x \to x') = 1.

Simply put, a Markov chain is made of:
• A set of states (in our case, each state will represent an instantiation of our probability space).
• A transition model that holds, for each state, the distribution over "which state can we visit next".
Note: in general the transition model could depend on the number of steps already taken (i.e., on the time the chain has been running). We only consider homogeneous Markov chains, where the transition model does not change over time.
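Here is a tiny sketch of a homogeneous Markov chain written exactly as in the definition above: a set of states plus, for each state, a distribution over the next state. The three-state chain and its numbers are invented purely for illustration:

```python
import random

# A tiny Markov chain written as an explicit transition model: for every state,
# a distribution over "which state can we visit next" (made-up numbers).
transition = {
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 0.25, "b": 0.5, "c": 0.25},
    "c": {"b": 0.5, "c": 0.5},
}

def step(state):
    """Take one step of the chain from `state` using the transition model."""
    r, acc = random.random(), 0.0
    for nxt, p in transition[state].items():
        acc += p
        if r <= acc:
            return nxt
    return nxt

def walk(start, T):
    """A T-step stroll through the chain; because the transition model does not
    change with t, the chain is homogeneous."""
    states = [start]
    for _ in range(T):
        states.append(step(states[-1]))
    return states

print(walk("a", 10))
```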
Example: The Drunken Grasshopper
We define our states to be the integers from −4 to 4. The drunken grasshopper's transition model hops one spot to the left or right with probability 0.25 each, and stays in the same place with probability 0.5. Formally, for i between −3 and 3:

T(i \to i) = 0.5, \qquad T(i \to i-1) = T(i \to i+1) = 0.25.

At the edges, where we cannot go any farther, the self-loop probability grows accordingly:

T(4 \to 4) = T(-4 \to -4) = 0.75, \qquad T(4 \to 3) = T(-4 \to -3) = 0.25.

Markov Chain as Samples of a Distribution
We can view each step as a distribution over "where can we be at that step". Each such distribution is defined by the previous one, summing over the chance of being at a specific state multiplied by the chance of moving from it to the next state:

P^{(t+1)}(X^{(t+1)} = x') = \sum_{x} P^{(t)}(X^{(t)} = x)\, T(x \to x').

Each step therefore represents a distribution over the states of our chain, and hence a distribution over the probability space X.

Asymptotic Behavior
For our purposes, the most important aspect of a Markov chain is its long-term behavior. Drunken grasshopper revisited: the location at time T is a random variable, and there is a hunch about its asymptotic behavior. After the first couple of steps the distribution is still concentrated near 0; by T = 10 the probabilities are already about 0.05 for +4 and −4, with only about 0.17 left on the value 0; and at T = 50 all nine states have probabilities between 0.1107 and 0.1116 — essentially uniform! (A small numerical sketch of this convergence appears below, after the regularity conditions.)

Markov Chain Monte Carlo (MCMC) Sampling
Input: P^(0)(X), an initial distribution; τ, the transition model; T, the number of steps (and full samples). Output: x^(0), ..., x^(T), a set of samples collected while strolling over the chain, each taken from the step-t distribution P^(t)(X). (In our case, each sample is a full assignment of values to all X_i ∈ X.)

Stationary Distributions
Just as in numerical analysis, we expect the following to hold in the limit:

P^{(t+1)}(x') \approx P^{(t)}(x') = \sum_{x} P^{(t)}(x)\, T(x \to x').

Thus, given the transition model, we get a set of |Val(X)| unknowns and |Val(X)| equalities. We must not forget to normalize the probability, adding the equation

\sum_{x \in Val(X)} \pi(x) = 1,

and we finally obtain a distribution out of our chain. Key questions: does the process really converge? And if so, is there only one such target distribution?

Stationary Distributions – cont.
Formally, a distribution π is a stationary distribution of the chain if, for every x',

\pi(X = x') = \sum_{x} \pi(X = x)\, T(x \to x').

That is, π(x) satisfies exactly the equation we desired for P^(t)(x): if for some t, P^(t)(x) gets close enough to π(x), the process will converge.

Stationary Distributions – Bad Examples
In general, a Markov chain may have more than one stationary distribution. For example, in reducible Markov chains the chain contains several areas (sets of states) that are unreachable from one another; the starting state then determines the area, and therefore the stationary distribution to which the process converges. A chain may also fail to converge to a stationary distribution at all, as in the periodic example shown in the lecture, where the distribution simply keeps flipping back and forth (equivalently, p_{t+1} = 1 − p_t).

Regular Markov Chains
The formal definition: a Markov chain is regular if there exists an integer k such that, for every pair of states x, x', the probability of getting from x to x' in exactly k steps is greater than 0 (meaning that no transition along the path has probability 0, so each step is legitimate).

Simpler demands: we will only demand the following two conditions, which together ensure that our chain is regular:
1) every two states are connected by a path of positive-probability transitions (τ(x → x') > 0 along the path);
2) every state has a positive-probability self-loop (τ(x → x) > 0).
These two demands often hold in practice, and they guarantee regularity.
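Here is the promised sketch: it propagates the step-t distribution of the grasshopper chain forward using P^(t+1)(x') = Σ_x P^(t)(x) T(x → x'), so one can watch it approach the near-uniform values quoted above:

```python
# Propagating the step-t distribution P(t) of the drunken-grasshopper chain:
# states are the integers -4..4; from an interior state the grasshopper stays
# with probability 0.5 and hops one spot left/right with probability 0.25 each;
# at the two edges the self-loop probability grows to 0.75.
states = list(range(-4, 5))

def transition_prob(i, j):
    if j == i:
        return 0.75 if abs(i) == 4 else 0.5
    if abs(j - i) == 1 and -4 <= j <= 4:
        return 0.25
    return 0.0

def propagate(p, steps):
    """P(t+1)(x') = sum_x P(t)(x) * T(x -> x')."""
    for _ in range(steps):
        p = {j: sum(p[i] * transition_prob(i, j) for i in states) for j in states}
    return p

p0 = {s: (1.0 if s == 0 else 0.0) for s in states}   # start at 0 with certainty
for t in (1, 2, 10, 50):
    pt = propagate(p0, t)
    print(t, [round(pt[s], 4) for s in states])
# By t = 50 all nine states have probability close to 1/9 ~ 0.111 (near uniform).
```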
Regular Markov Chains – cont.
What is regularity good for? Why would we even demand regularity (or anything that ensures regularity) of our Markov chain? Theorem 12.3, as stated in the lecture, is the reason: a finite-state Markov chain that is regular has a unique stationary distribution π, and the process converges to it from any starting distribution. (So, that's a good enough reason...)

MCMC Sampling – Revisited
What does this mean in practice? When we have a transition model for switching between states of the chain, we simply use it to travel through the chain. After reaching step T, the chain will presumably have (approximately) reached the stationary distribution. Having traveled T steps, we simply keep on traveling through the chain to collect samples from that distribution. We discuss the issue of choosing T briefly later; in practice we simply keep running the algorithm until we have enough evidence that we have converged to the stationary distribution.

Gibbs Sampling Revisited
• We would like to define a Markov chain that represents samples from our distribution.
• But the distribution is given by a graphical model, so we will have to use that.
• The idea: use the Gibbs sampling process as a way to travel through a Markov chain whose state space is simply the whole probability space (each state is an instantiation of all the X_i's that appear in the graphical model).
• Problem: Gibbs sampling takes only one variable at a time and samples it given the others. How can we handle that?

Multiple Transition Models
We could define one cumbersome transition model for the whole chain, but it is easier to consider one transition model τ_i for each variable X_i. We now have a set of transition models τ_i; each such τ_i is called a kernel of the chain. How do we combine several kernels into the single transition model that defines our travel through the chain? Two options:
• Randomly choose which kernel to use at each step: for k variables, τ := (1/k) Σ_{i=1}^{k} τ_i. Note that this combined transition model is legal (the sum of outgoing probabilities is still 1).
• Sequentially cycle through the kernels. This creates a bit of a pickle, since some of the properties we demanded of the transition model (homogeneity, for example) no longer necessarily hold. But if we regard the whole round of k steps (one per kernel) as a single step of the chain, the chain stays homogeneous, and the rest of our demands can be shown to hold whenever each kernel satisfies some weaker demands.

Gibbs Chain
We use the following kernels, taken directly from the Gibbs algorithm:

\tau_i\big((x_{-i}, x_i) \to (x_{-i}, x_i')\big) = P_\Phi(x_i' \mid x_{-i}).

Just as in the algorithm, when sampling a new value for the variable X_i we disregard its current value, and we use the most up-to-date values of the other variables at that point. Note: to obtain the conditional distribution we simply reduce the evidence in the graphical model to get an appropriate set of factors and compute with them; as shown next, each such evaluation is cheap.

Gibbs Chain – cont.
Evaluating each value in a kernel: we have

P_\Phi(x_i' \mid x_{-i}) = \frac{P_\Phi(x_i', x_{-i})}{P_\Phi(x_{-i})} = \frac{\prod_{\phi \in \Phi} \phi(x_i', x_{-i})}{\sum_{x_i''} \prod_{\phi \in \Phi} \phi(x_i'', x_{-i})},

and every factor whose scope does not contain X_i is identical in the numerator and the denominator and cancels out. We therefore only need the factors that mention X_i (the set of relevant variables is also called the Markov blanket of X_i).
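To make the kernel evaluation concrete, here is a sketch of computing the Gibbs conditional from a factor set (only the factors mentioning the resampled variable enter) and of a full Gibbs sweep. The three binary variables and the factor tables are made up for illustration:

```python
import random

# Factors over binary variables, each given as (scope, table), where the table maps
# a tuple of values (in scope order) to a nonnegative number.  The factors below
# are invented, just to exercise the kernel computation.
factors = [
    (("A",),     {(0,): 1.0, (1,): 3.0}),
    (("A", "B"), {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}),
    (("B", "C"), {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 1.0}),
]
domain = {"A": (0, 1), "B": (0, 1), "C": (0, 1)}

def gibbs_conditional(var, state):
    """P_Phi(var | everything else): only the factors whose scope mentions `var`
    (its Markov blanket) contribute; all other factors cancel in the ratio."""
    scores = []
    for value in domain[var]:
        trial = dict(state, **{var: value})
        score = 1.0
        for scope, table in factors:
            if var in scope:
                score *= table[tuple(trial[v] for v in scope)]
        scores.append(score)
    z = sum(scores)
    return [s / z for s in scores]

def gibbs_sweep(state):
    """One Gibbs step: resample each variable in turn from its full conditional."""
    for var in domain:
        probs = gibbs_conditional(var, state)
        state[var] = random.choices(domain[var], weights=probs)[0]
    return dict(state)

state = {"A": 0, "B": 0, "C": 0}
samples = [gibbs_sweep(state) for _ in range(5)]
print(samples)
```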
Gibbs Chain – Bad Example
We note that deriving a Gibbs chain from a graphical model is not enough to ensure that the derived Markov chain is regular (and therefore we do not know that it will converge!). Consider the following example: two binary variables X_1, X_2 and their deterministic XOR Y, where we observe P(X_1, X_2 | Y = 1). Our probability space then holds only two viable options: the assignments (0, 1, 1) and (1, 0, 1).
- If we start off at (0, 1, 1) and resample X_1 given X_2 = 1 and the evidence Y = 1, we can only get its current value 0. Next we resample X_2 given X_1 = 0 and Y = 1, and can only get 1. Thus we stay at (0, 1, 1) forever.
- If we start off at (1, 0, 1), we get the mirror case.
So we obtained a reducible Gibbs chain (with two stationary distributions)!

Gibbs Chain – cont.
• So, we must be careful when constructing a Gibbs chain.
• Luckily, there is another important theorem (stated in the lecture) that ensures a unique stationary distribution: if all the factors in Φ are strictly positive, then the Gibbs chain is regular, and hence converges to its unique stationary distribution.
• Note that the last example happened because we had a transition with a strict probability of zero (e.g., the transition (1,0,1) → (1,1,1)).
• If we "correct" such models by inserting a small ε in place of each 0, the theorem applies.
• However, that ε will cause us other problems, in the form of bad mixing time (discussed ahead).

Gibbs Chains – Summary
Gibbs chains convert the hard problem of inference into a sequence of "easy" sampling steps, exploiting much of what is generally known about Markov chains.
• Pros: the simplest way one can think of to generate a Markov chain for a probability distribution; computationally efficient.
• Cons: often slow to mix (converge), especially when we have transition values close to 0 or 1; and it only applies if we can sample from a product of factors. That may be fine in our case, but in general it is very limiting (for example, the approach may not work for continuous distributions).

Broader Class of Markov Chains
We have seen Gibbs chains, and some conditions that make sure they converge. We now show a more general way to create a Markov chain for a given distribution, one that focuses on ensuring convergence. First, we need a definition.

Reversible Markov Chains
Definition: a Markov chain τ is reversible if there exists a distribution π such that, for all states x, x',

\pi(x)\, \tau(x \to x') = \pi(x')\, \tau(x' \to x)

(the detailed balance equation). Using this definition, we can easily achieve the following: if detailed balance holds for π, then π is a stationary distribution of τ.

Proof: we have π(x) τ(x → x') = π(x') τ(x' → x), and summing these equations over all x gives

\sum_x \pi(x)\,\tau(x \to x') = \sum_x \pi(x')\,\tau(x' \to x) = \pi(x') \sum_x \tau(x' \to x) = \pi(x') \cdot 1 = \pi(x'),

which is exactly the definition of a stationary distribution π for the transition model τ.

Metropolis–Hastings Algorithm
• We use the idea of a proposal distribution (already shown by Dafna in the case of importance sampling).
• We define a transition model T^Q based on that proposal distribution.
• We then use an "approver" A to decide whether a proposed move should actually occur. The choice of this approver is where we get our freedom. Following this description, the real transition model τ is:

\tau(x \to x') = T^Q(x \to x')\, A(x \to x') \ \text{ for } x' \ne x, \qquad
\tau(x \to x) = T^Q(x \to x) + \sum_{x' \ne x} T^Q(x \to x')\,\big(1 - A(x \to x')\big).

Metropolis–Hastings Algorithm – cont.
How do we choose the approver? We want the detailed balance equation (12.24) to hold:

\pi(x)\, T^Q(x \to x')\, A(x \to x') = \pi(x')\, T^Q(x' \to x)\, A(x' \to x),

so we need

\frac{A(x \to x')}{A(x' \to x)} = \frac{\pi(x')\, T^Q(x' \to x)}{\pi(x)\, T^Q(x \to x')} =: \mu.

If μ > 1, we swap the roles of the numerator and the denominator and get 1/μ < 1. To maximize the acceptance rate (so that the chain moves more freely and mixes faster), we simply let the approver on one side take the value 1 and the approver on the other side take the fraction. Finally, we get the generic formula

A(x \to x') = \min\left\{1,\ \frac{\pi(x')\, T^Q(x' \to x)}{\pi(x)\, T^Q(x \to x')}\right\}.
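A minimal sketch of a single Metropolis–Hastings move using the acceptance rule just derived; the target weights and the left/right proposal on four states are invented toy choices, and the long run at the end only illustrates convergence of the empirical frequencies:

```python
import random

def metropolis_hastings_step(x, pi, propose, q):
    """One Metropolis-Hastings move.
    pi(x)       -- target probability (only needed up to a constant)
    propose(x)  -- draws a candidate x' from the proposal T^Q(x -> .)
    q(x, x')    -- proposal probability T^Q(x -> x')
    Acceptance: min(1, [pi(x') q(x', x)] / [pi(x) q(x, x')])."""
    x_new = propose(x)
    accept = min(1.0, (pi(x_new) * q(x_new, x)) / (pi(x) * q(x, x_new)))
    return x_new if random.random() < accept else x

# Toy example: target proportional to (1, 2, 3, 4) on states 0..3, with a
# "move one step left or right" proposal (restricted at the two ends).
weights = [1.0, 2.0, 3.0, 4.0]
pi = lambda x: weights[x]

def propose(x):
    candidates = [c for c in (x - 1, x + 1) if 0 <= c <= 3]
    return random.choice(candidates)

def q(x, x_new):
    candidates = [c for c in (x - 1, x + 1) if 0 <= c <= 3]
    return 1.0 / len(candidates) if x_new in candidates else 0.0

x, counts = 0, [0, 0, 0, 0]
for _ in range(200_000):
    x = metropolis_hastings_step(x, pi, propose, q)
    counts[x] += 1
print([c / sum(counts) for c in counts])   # should approach (0.1, 0.2, 0.3, 0.4)
```

Note how the proposal ratio q(x', x)/q(x, x') matters here because the end states propose differently from the interior ones; with a symmetric proposal it would cancel out.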
From these demands we conclude that the chain we described does indeed converge to π(x).

Mixing Time
• So, we have constructed a Markov chain and guaranteed that it converges to our target distribution.
• Using Gibbs, we even gained the ability to sample from a distribution that would otherwise be hard to sample from.
• But one important question is left unanswered: how much time will it take our chain to reach the desired distribution?
• The general answer: we cannot know! There is very little theory behind this issue, and not many bounds on the required number of steps. We show a few definitions and then jump on to practice.

Mixing Time – Definitions
We define the mixing time τ_mix(ε) as the smallest t such that, for every starting distribution, the distribution at step t is within ε of the stationary distribution π, where distance is measured by the total variation distance

D_{var}\big(P^{(t)}; \pi\big) = \max_{S \subseteq Val(X)} \big|P^{(t)}(S) - \pi(S)\big|,

or, alternatively, D_{var}(P^{(t)}; \pi) = \frac{1}{2}\sum_{x} \big|P^{(t)}(x) - \pi(x)\big|.

Conductance
We define the conductance of a Markov chain as the minimum, over all sets of states S with π(S) ≤ 1/2, of the ratio between the probability of leaving S in a single step (starting from π restricted to S) and π(S). This characteristic of the chain gives a clue about the chances of managing to visit all around the chain, instead of getting "stuck" in a specific area.

Example
Here is an example where low conductance shows that we can expect a long mixing time. Assume the state space splits into two areas, where the only way to transition between the two is through two unique states, and that this transition has a very low probability. The conductance is then very low (taking S = {x_1, x_2, x_3}), and we can also expect the mixing time of the chain to be quite high.

In Practice
• In practice, self-examination is essentially our only tool for checking whether the chain has mixed.
• We do this in a few different ways. For example, we run a number of chains in parallel and compare the results: we know that all of them converge (at some point) to the same distribution, so if the runs are too different from one another we can say for sure that we have NOT yet reached the mixing time.
• Repeating this test (or running a larger number of chains in parallel), we can accumulate many observations that are "not bad", and conclude that overall our chains have mixed well enough.

[The lecture then shows several "Some statistics of self-observation" plots comparing such parallel runs, with verdicts ranging from "no" to "maybe".]

Another Little Issue
Another issue worth discussing: having taken T steps, we can now start collecting samples from our travel over the chain. But we notice that these samples are definitely dependent on one another. A common remedy is to take some number of steps d between every two collected samples. It is easy to see that by doing this we "lie to ourselves" and only throw away information that was present in the discarded samples; nonetheless, when processing each sample is expensive, this thinning lets us spend the processing only on samples that are "slightly more independent".

Summary
To sum up:
• We defined Markov chains and learned how to use them.
• We saw a general algorithm for constructing a chain for a target distribution.
• We saw a more particular way to construct such chains for a graphical model (based on factors, but easily adapted to CPDs).
• We discussed the difficulty of establishing convergence time in practice, and some statistical tools for coping with it: running the chain for a long time and relying on self-observation.
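As a closing illustration of the parallel-run self-observation check described above (placed here, after the summary, for reference), the sketch below compares the empirical state frequencies of two independently seeded runs of the grasshopper chain; a large discrepancy is evidence that the runs have NOT mixed:

```python
import random

def chain(T, seed):
    """A stand-in MCMC run: the drunken-grasshopper walk on -4..4.  In practice
    this would be whatever Gibbs / Metropolis-Hastings chain we constructed."""
    rng = random.Random(seed)
    x, visits = 0, {s: 0 for s in range(-4, 5)}
    for _ in range(T):
        r = rng.random()
        if abs(x) == 4:
            x += (-1 if x > 0 else 1) if r < 0.25 else 0
        else:
            x += -1 if r < 0.25 else (1 if r < 0.5 else 0)
        visits[x] += 1
    return {s: v / T for s, v in visits.items()}

def disagreement(p, q):
    """Total-variation-style distance between the two empirical distributions."""
    return 0.5 * sum(abs(p[s] - q[s]) for s in p)

for T in (100, 1_000, 100_000):
    d = disagreement(chain(T, seed=1), chain(T, seed=2))
    print(T, round(d, 3))
# If the two runs still disagree a lot, we know for sure we have NOT mixed;
# small disagreement over many repetitions is (only) evidence of mixing.
```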