Unbiased Big Bayes: Following Paths of Partial Posteriors

Heiko Strathmann, Dino Sejdinovic, Mark Girolami
Gatsby Computational Neuroscience Unit, University College London
Department of Statistics, University of Oxford
Department of Statistics, University of Warwick
Being Bayesian: Averaging beliefs of the unknown
p(x^*) = \int d\theta \; \underbrace{p(x^* \mid \theta)}_{\text{likelihood}} \; \underbrace{p(\theta \mid D)}_{\text{posterior}}

where

p(\theta \mid D) \propto \underbrace{p(D \mid \theta)}_{\text{likelihood of data}} \; \underbrace{p(\theta)}_{\text{prior}}
Being Bayesian: Averaging beliefs of the unknown
p(x^*) \approx \frac{1}{m} \sum_{j=1}^{m} p(x^* \mid \theta^{(j)}),
\qquad \text{where } \theta^{(j)} \overset{\text{iid}}{\sim} p(\theta \mid D)
Being Bayesian: Averaging beliefs of the unknown
p(x^*) \approx \frac{1}{m} \sum_{j=1}^{m} p(x^* \mid \theta^{(j)}),
\qquad \text{where } \theta^{(j)} \sim p(\theta \mid D) \text{ only in the limit } j \to \infty
(e.g. samples from a Markov chain targeting the posterior)
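A minimal sketch of this Monte Carlo average; predictive_lik is an assumed function evaluating p(x^* \mid \theta), and samples holds draws \theta^{(j)} from (or converging to) the posterior:

    import numpy as np

    def predictive_density(x_star, samples, predictive_lik):
        # p(x*) ~= (1/m) * sum_j p(x* | theta^(j))
        return np.mean([predictive_lik(x_star, theta) for theta in samples])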
Metropolis-Hastings Transition Kernel
Target: \pi(\theta) \propto p(\theta \mid D)
At iteration j+1, with current state \theta^{(j)}:
- Propose \theta' \sim q(\theta \mid \theta^{(j)})
- Accept \theta^{(j+1)} \leftarrow \theta' with probability
  \min\left\{ \frac{\pi(\theta')}{\pi(\theta^{(j)})} \times \frac{q(\theta^{(j)} \mid \theta')}{q(\theta' \mid \theta^{(j)})},\; 1 \right\}
- Reject otherwise: \theta^{(j+1)} \leftarrow \theta^{(j)}
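A minimal Metropolis-Hastings sketch with a symmetric random-walk proposal, so the q-ratio cancels; log_post is an assumed user-supplied function returning \log \pi(\theta) up to an additive constant, and all names and defaults are illustrative:

    import numpy as np

    def metropolis_hastings(log_post, theta0, n_iter=10000, step=0.1, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        theta = np.atleast_1d(np.asarray(theta0, dtype=float))
        samples = np.empty((n_iter, theta.size))
        lp = log_post(theta)
        for j in range(n_iter):
            prop = theta + step * rng.standard_normal(theta.size)  # symmetric proposal
            lp_prop = log_post(prop)
            # accept with probability min(1, pi(prop) / pi(theta)); q-ratio is 1 here
            if np.log(rng.uniform()) < lp_prop - lp:
                theta, lp = prop, lp_prop
            samples[j] = theta  # on rejection the current state is repeated
        return samples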
Big D & MCMC
- Need to evaluate
  \pi(\theta) \propto p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta)
  in every iteration.
- Usually D = \{x_1, \dots, x_N\} and
  p(D \mid \theta) = \prod_{i=1}^{N} p(x_i \mid \theta)
- Lots of current research: can we use subsets of D?
Existing methodology...
... is based on the construction of an alternative transition kernel.
Problem: most methods
- have no convergence guarantees
- are biased
- mix badly
- are data intensive
Right approach?
Expectations!
E_{\pi(\theta)}[\varphi(\theta)], \qquad \varphi : \Theta \to \mathbb{R}
Partial Posterior Paths
[Figure: path of partial posteriors in the (\mu_1, \mu_2) plane, from the prior through posteriors based on 1/100, 2/100, 4/100, 8/100, 16/100, 32/100, 64/100, and finally 100/100 of the data.]
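In symbols, a sketch of the partial posterior path (assuming nested data subsets of geometrically increasing size n_t with n_T = N, as the 1/100, 2/100, 4/100, ... labels suggest):

    \pi_t(\theta) := p(\theta \mid x_{1:n_t}) \propto p(\theta) \prod_{i=1}^{n_t} p(x_i \mid \theta),
    \qquad n_1 < n_2 < \dots < n_T = N

    \varphi_t := E_{\pi_t(\theta)}[\varphi(\theta)] \;\longrightarrow\; E_{\pi(\theta)}[\varphi(\theta)]
    \quad \text{as } n_t \to N.

The figure traces such a path in the (\mu_1, \mu_2) plane.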
Debiasing Lemma (Rhee & Glynn 2012, 2014)
- Take a converging sequence \{\varphi_t\}_{t \ge 1} with \lim_{t \to \infty} E|\varphi_t - \varphi|^2 = 0
- Randomly truncate at a time T with P[T \ge t] > 0 for all t
- Unbiased estimator for the limit E\{\varphi\}:
  \varphi^*_T = \sum_{t=1}^{T} \frac{\varphi_t - \varphi_{t-1}}{P[T \ge t]} \qquad (\text{convention: } \varphi_0 := 0)
Sub-linear average computational costs!
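A minimal sketch of the debiasing estimator; the interface is assumed: phi(t) returns the t-th element \varphi_t of the converging sequence, and the truncation time is drawn from a geometric distribution with P[T \ge t] = (1 - p)^{t-1}:

    import numpy as np

    def debiased_estimate(phi, p=0.3, rng=None):
        """One Rhee & Glynn-style unbiased estimate of lim_t phi(t)."""
        rng = np.random.default_rng() if rng is None else rng
        T = int(rng.geometric(p))            # P[T >= t] = (1 - p)**(t - 1), t = 1, 2, ...
        est, prev = 0.0, 0.0                 # convention: phi_0 = 0
        for t in range(1, T + 1):
            cur = phi(t)
            est += (cur - prev) / (1.0 - p) ** (t - 1)
            prev = cur
        return est

In practice one averages many independent such estimates to reduce variance, as in the replications below.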
Algorithm Cartoon
[Figure: sequence of frames in the (\mu_1, \mu_2) plane. Starting from the prior mean, the estimate moves along the partial posterior path as more data are included; the final frame shows the individual debiasing estimates \varphi^*_r, their running average \frac{1}{R}\sum_{r=1}^{R} \varphi^*_r, and the true posterior mean.]
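Putting the pieces together, a hedged sketch of the procedure shown in the cartoon. The interface is assumed, not taken from the paper: partial_posterior_mean(n) returns an estimate of E_{\pi_t}[\varphi(\theta)] from the first n data points (e.g. via the MCMC sketch above run on the partial posterior); batch sizes and the truncation distribution are illustrative choices:

    import numpy as np

    def path_estimate(partial_posterior_mean, N, n0=100, p=0.3, R=300, rng=None):
        """Average of R debiasing estimates along the partial posterior path."""
        rng = np.random.default_rng() if rng is None else rng
        t_max = int(np.ceil(np.log2(N / n0))) + 1      # index at which n_t first reaches N

        estimates = np.empty(R)
        for r in range(R):
            T = min(int(rng.geometric(p)), t_max)      # increments beyond t_max are zero anyway
            est, prev = 0.0, 0.0                       # convention: phi_0 = 0
            for t in range(1, T + 1):
                n_t = min(N, n0 * 2 ** (t - 1))        # geometric batch sizes n_1 < n_2 < ... <= N
                cur = partial_posterior_mean(n_t)      # estimate of E[phi] under pi_t
                est += (cur - prev) / (1.0 - p) ** (t - 1)
                prev = cur
            estimates[r] = est
        return estimates.mean(), estimates

Each replication only evaluates partial posteriors up to size n_T, so most replications are cheap; averaging over r gives the running estimate plotted in the final frame.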
Log normal (Bardenet, Doucet, Holmes 2014)
[Figure: \log\mathcal{N}(\mu, \sigma^2) model; partial posterior estimates \hat{E}_{\pi_t}\{\sigma\} of the posterior mean of \sigma versus the number of data n_t (10^1 to 10^5).]
Log normal (Bardenet, Doucet, Holmes 2014), N = 10^8
[Figure, left panel: \log\mathcal{N}(\mu, \sigma^2) model, posterior mean of \sigma; running average \frac{1}{R}\sum_{r=1}^{R} \varphi^*_r of the debiasing estimates over replications r = 1, \dots, 300. Right panel: data used per replication, \sum_{t=1}^{T_r} n_t, versus replication r.]
Less than 25% of the likelihood evaluations of a single MCMC step.
Log. regression (Welling & Teh 2011), N = 32561
[Figure: posterior means of the coefficients \beta_i for logistic regression on the a9a data set, versus the number of data n_t (10^1 to 10^4).]
Log. regression (Welling & Teh 2011), N = 32561
Same number of likelihood evaluations as 48 MCMC iterations.
Extensions
- Non-factored likelihoods
- Not restricted to MCMC; closed-form estimators also work
- Example: Gaussian process regression
  - Vanilla: O(N^3)
  - Finite rank approximation: O(m^2 N)
  - Finite rank approximation + debiasing: O(m^2 N^{1-\alpha}) (cost sketch below)
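A sketch of where the sub-linear exponent can come from; the assumptions are illustrative and not stated on the slide: geometric batch sizes n_t = n_0 2^t, truncation probabilities P[T \ge t] \propto 2^{-\alpha t} with \alpha \in (0, 1), and per-batch cost c(n_t) = O(m^2 n_t) for finite-rank GP regression:

    E[\text{cost}] = \sum_{t} P[T \ge t]\, c(n_t)
    \;\propto\; m^2 n_0 \sum_{t \le \log_2(N / n_0)} 2^{(1 - \alpha) t}
    \;=\; O\!\left(m^2\, n_0^{\alpha}\, N^{1 - \alpha}\right),

since the geometric sum is dominated by its largest term, 2^{(1-\alpha)\log_2(N/n_0)} = (N/n_0)^{1-\alpha}.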
Approximate GP Regression, N = 10^4
[Figure: mean squared error (MSE) of the GP regression predictive mean versus the number of data n_t (10^1 to 10^4).]
Summary
- Unbiased estimation of posterior expectations
- Partial posterior paths
- No need to simulate from the full posterior p(\theta \mid D)
- Sub-linear average computational costs
- Not restricted to factored likelihoods
Thank you!
Questions?