Unbiased Big Bayes Following Paths of Partial Posteriors
Heiko Strathmann, Dino Sejdinovic, Mark Girolami
Gatsby Computational Neuroscience Unit, University College London
Department of Statistics, University of Oxford
Department of Statistics, University of Warwick

Being Bayesian: Averaging beliefs of the unknown

p(x^*) = \int \underbrace{p(x^* \mid \theta)}_{\text{likelihood}} \, \underbrace{p(\theta \mid D)}_{\text{posterior}} \, d\theta,
where the posterior combines \underbrace{p(D \mid \theta)}_{\text{likelihood of data } D} and \underbrace{p(\theta)}_{\text{prior}} via p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta).

In practice the integral is approximated by Monte Carlo,
p(x^*) \approx \frac{1}{m} \sum_{j=1}^{m} p(x^* \mid \theta^{(j)}), \qquad \theta^{(j)} \sim p(\theta \mid D) \text{ i.i.d.},
which becomes exact as m \to \infty.

Metropolis-Hastings Transition Kernel

Target \pi(\theta) \propto p(\theta \mid D). At iteration j+1, from state \theta^{(j)}:
- Propose \theta' \sim q(\theta \mid \theta^{(j)}).
- Accept \theta^{(j+1)} \leftarrow \theta' with probability \min\left\{ \frac{\pi(\theta')}{\pi(\theta^{(j)})} \times \frac{q(\theta^{(j)} \mid \theta')}{q(\theta' \mid \theta^{(j)})}, \, 1 \right\}.
- Reject otherwise: \theta^{(j+1)} \leftarrow \theta^{(j)}.

Big D & MCMC

- Need to evaluate \pi(\theta) \propto p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta) in every iteration.
- Usually D = \{x_1, \ldots, x_N\} and the likelihood factorises, p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta).
- Lots of current research: can we use subsets of D?

Existing methodology...

... is based on the construction of an alternative transition kernel.
Problem: most methods
- have no convergence guarantees,
- are biased,
- mix badly,
- are data intensive.
Right approach? Expectations!
E_{\pi(\theta)} \varphi(\theta), \qquad \varphi : \Theta \to \mathbb{R}

Partial Posterior Paths

[Figure: contours of partial posteriors in the (\mu_1, \mu_2) plane, conditioned on growing fractions of the data (prior, 1/100, 2/100, 4/100, 8/100, 16/100, 32/100, 64/100, 100/100), contracting towards the full posterior.]

Debiasing Lemma (Rhee & Glynn 2012, 2014)

- Take a converging sequence \{\varphi_t\}_{t \geq 1} with \lim_{t \to \infty} E|\varphi_t - \varphi| = 0.
- Randomly truncate at a time T with P[T \geq t] > 0 for all t.
- Then (with the convention \varphi_0 := 0)
  \varphi^* = \sum_{t=1}^{T} \frac{\varphi_t - \varphi_{t-1}}{P[T \geq t]}
  is an unbiased estimator of the limit E\{\varphi\}.
Sub-linear average computational costs!

Algorithm Cartoon

[Figure, built up over several slides: the path of partial posterior means in the (\mu_1, \mu_2) plane starting at the prior mean, individual debiasing estimates \varphi_r^*, their running average \frac{1}{R} \sum_{r=1}^{R} \varphi_r^*, and the true posterior mean.]
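In code, the construction above is only a few lines. The sketch below is not the authors' implementation: the function name debiased_estimate, the geometric truncation tail, and the placeholder my_phi are illustrative assumptions, and the choice of subset sizes n_t lives inside partial_estimate. Here partial_estimate(t) stands for an estimate \varphi_t of E_{\pi_t}\{\varphi\} computed from the partial posterior that uses only the first n_t data points (via MCMC or, where available, in closed form).

```python
import numpy as np

def debiased_estimate(partial_estimate, t_max, tail, rng=None):
    """One debiasing replication phi*_r (Rhee & Glynn estimator), 0-indexed in t.

    partial_estimate(t) -> phi_t, an estimate using only the first n_t data points
    tail(t)             -> P[T >= t], decreasing with tail(0) = 1; its decay trades
                           estimator variance against expected computational cost
    """
    rng = rng or np.random.default_rng()
    # Truncation time T on {0, ..., t_max - 1} with survival function tail(t).
    pmf = [tail(t) - tail(t + 1) for t in range(t_max - 1)] + [tail(t_max - 1)]
    T = rng.choice(t_max, p=np.asarray(pmf))
    est, prev = 0.0, 0.0
    for t in range(T + 1):
        phi_t = partial_estimate(t)          # e.g. posterior mean from MCMC on a subset
        est += (phi_t - prev) / tail(t)      # telescoping difference, reweighted
        prev = phi_t
    # With T capped at t_max - 1, E[phi*] equals the estimate on the largest subset;
    # choosing n_{t_max - 1} = N makes that the full-posterior quantity.
    return est

# Illustrative usage: average R independent replications. Each replication only
# touches data up to its own random truncation time, which is what the talk's
# sub-linear average cost refers to (for suitable schedules and tails).
# tail = lambda t: 0.5 ** t
# estimates = [debiased_estimate(my_phi, t_max=12, tail=tail) for _ in range(200)]
# print(np.mean(estimates))
```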
Log-normal (Bardenet, Doucet, Holmes 2014)

Model \log\mathcal{N}(\mu, \sigma^2); quantity of interest: posterior mean of \sigma.
[Figure: partial posterior estimates \hat{E}_{\pi_t}\{\sigma\} (roughly 0.5 to 3.0) against the number of data n_t, from 10^1 to 10^5.]

Log-normal (Bardenet, Doucet, Holmes 2014), N = 10^8

[Figure, left: running average \frac{1}{R} \sum_{r=1}^{R} \varphi_r^* of the debiasing estimates against the replication index r (up to about 300). Right: data used per replication, \sum_{t=1}^{T_r} n_t, against r.]
Less than 25% of the likelihood evaluations of a single MCMC step.

Logistic regression (Welling & Teh 2011), N = 32561

Logistic regression on a9a; quantity of interest: posterior mean of each \beta_i.
[Figure: partial posterior estimates of the posterior means of the \beta_i against the number of data n_t, from 10^1 to 10^4.]
Same number of likelihood evaluations as 48 MCMC iterations.

Extensions

- Non-factored likelihoods.
- Not restricted to MCMC; closed-form estimates work as well.
- Example: Gaussian process regression.
  - Vanilla: O(N^3)
  - Finite rank approximation: O(m^2 N)
  - Finite rank approximation + debiasing: O(m^2 N^{1-\alpha})
(A closed-form usage sketch is given after the summary.)

Approximate GP Regression, N = 10^4

[Figure: MSE of the GP regression predictive mean against the number of data n_t, from 10^1 to 10^4.]

Summary

- Unbiased estimation of posterior expectations.
- Partial posterior paths.
- No need to simulate from the full p(\theta \mid D).
- Sub-linear average computational costs.
- Not restricted to factored likelihoods.

Thank you! Questions?
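Backup: a closed-form example for the Extensions slide. The sketch below feeds Gaussian process regression predictive means, computed on nested data subsets, into the debiased_estimate helper sketched earlier, so no MCMC is involved. For brevity it uses an exact O(n^3) GP solve rather than the finite rank approximation mentioned above; the squared-exponential kernel, noise level, doubling schedule n_t = 2^{t+1}, and the synthetic data in the usage comments are all illustrative assumptions, not the authors' setup.

```python
import numpy as np

def gp_predictive_mean(X, y, X_test, lengthscale=1.0, noise=0.1):
    """Exact GP regression predictive mean, squared-exponential kernel, O(n^3)."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-0.5 * d2 / lengthscale**2)
    K = k(X, X) + noise**2 * np.eye(len(X))
    return k(X_test, X) @ np.linalg.solve(K, y)

def make_phi(X, y, x_test):
    """phi_t = predictive mean at x_test from the first n_t = 2^(t+1) training points."""
    def phi(t):
        n_t = min(2 ** (t + 1), len(X))
        return gp_predictive_mean(X[:n_t], y[:n_t], x_test)[0]
    return phi

# Illustrative usage on synthetic data (reusing debiased_estimate from above):
# rng = np.random.default_rng(0)
# X = rng.uniform(-3, 3, size=(1000, 1))
# y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)
# phi = make_phi(X, y, np.array([[0.5]]))
# tail = lambda t: 0.5 ** t
# estimates = [debiased_estimate(phi, t_max=10, tail=tail) for _ in range(100)]
# print(np.mean(estimates))   # debiased estimate of the full-data predictive mean
```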