Cutting-edge issues in objective Bayesian model comparison
Subjective views of a Bayesian barber
Luca La Rocca luca.larocca@unimore.it
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Università di Modena e Reggio Emilia (UNIMORE)
Big data in biomedicine. Big models?
Centre for Research in Statistical Methodology (CRiSM)
University of Warwick, 27 February 2014
Outline
1. Ockham's razor
2. The whetstone
3. The alum block
4. Discussion
Ockham’s razor
Bayes factors
Two sampling models for a sequence of discrete observations $z^n = (z_1, \ldots, z_n)$,
\[
M^n_0 = \{ f^n_0(\cdot \mid \gamma_0),\ \gamma_0 \in \Gamma_0 \}, \qquad
M^n_1 = \{ f^n_1(\cdot \mid \gamma_1),\ \gamma_1 \in \Gamma_1 \},
\]
compared by means of the Bayes factor for $M^n_1$ against $M^n_0$,
\[
BF_{10}(z^n) = \frac{\int_{\Gamma_1} f^n_1(z^n \mid \gamma_1)\, p_1(\gamma_1)\, d\gamma_1}
                    {\int_{\Gamma_0} f^n_0(z^n \mid \gamma_0)\, p_0(\gamma_0)\, d\gamma_0},
\]
where $p_i(\gamma_i)$ is a parameter prior under $M^n_i$ ($i = 0, 1$); then
\[
\Pr(M^n_1 \mid z^n) = \frac{BF_{10}(z^n)}{1 + BF_{10}(z^n)},
\]
assuming $\Pr(M^n_0) = \Pr(M^n_1) = 1/2$.
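As a small illustration (not on the slide, just a transcription of the formula above), a Python sketch turning a Bayes factor into a posterior model probability under equal prior odds:

```python
# Posterior probability of M1 from the Bayes factor, under Pr(M0) = Pr(M1) = 1/2.
def posterior_prob_m1(bf10: float) -> float:
    return bf10 / (1.0 + bf10)

# e.g. a Bayes factor of 9 in favour of M1 gives Pr(M1 | z^n) = 0.9
print(posterior_prob_m1(9.0))
```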
Ockham’s razor
Asymptotic learning rate
Let $M^n_0$ be nested in $M^n_1$ ($\Gamma_0 \equiv \tilde{\Gamma}_0 \subset \Gamma_1$) with dimensions $d_0 < d_1$.
Assume $p_0(\cdot)$ is a local prior (continuous and strictly positive on $\Gamma_0$).
Typically $p_1(\cdot)$ is also a local prior, so that (under regularity conditions)
\[
BF_{10}(z^n) = n^{-\frac{d_1 - d_0}{2}}\, e^{O_P(1)}, \quad \text{as } n \to \infty,
\]
if the sampling distribution of $z^n$ belongs to $M^n_0$, while
\[
BF_{10}(z^n) = e^{Kn + O_P(n^{1/2})}, \quad \text{for some } K > 0,
\]
if the sampling distribution of $z^n$ belongs to $M^n_1 \setminus M^n_0$.
This imbalance in the asymptotic learning rate motivated the introduction of non-local priors¹ . . .
¹ Johnson, V. E. and Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 143–170.
The whetstone
Generalized moment priors
. . . such as generalized moment priors² of order $h$:
\[
p^M_1(\gamma_1 \mid h) \propto g_h(\gamma_1)\, p_1(\gamma_1), \qquad \gamma_1 \in \Gamma_1,
\]
where $g_h(\cdot)$ is a smooth function from $\Gamma_1$ to $\mathbb{R}^+$, vanishing on $\tilde{\Gamma}_0$ together with its first $2h - 1$ derivatives, while $g_h^{(2h)}(\gamma_1) > 0$ for all $\gamma_1 \in \tilde{\Gamma}_0$; let $g_0(\gamma_1) \equiv 1$.

The asymptotic learning rate is changed to
\[
BF_{10}(z^n) = n^{-h - \frac{d_1 - d_0}{2}}\, e^{O_P(1)}, \quad \text{as } n \to \infty,
\]
if the sampling distribution of $z^n$ belongs to $M^n_0$; it is unchanged if the sampling distribution of $z^n$ belongs to $M^n_1 \setminus M^n_0$.
² Consonni, G., Forster, J. J. and La Rocca, L. (2013). The whetstone and the alum block: Balanced objective Bayesian comparison of nested models for discrete data. Statist. Sci. 28, 398–423.
The whetstone
Comparing two proportions
Let the larger model be the product of two binomial models,
\[
f^{n_1 + n_2}_1(y_1, y_2 \mid \theta_1, \theta_2) = \mathrm{Bin}(y_1 \mid n_1, \theta_1)\, \mathrm{Bin}(y_2 \mid n_2, \theta_2), \qquad (\theta_1, \theta_2) \in\ ]0,1[^2,
\]
and let the null model assume $\theta_1 = \theta_2 = \theta$,
\[
f^{n_1 + n_2}_0(y_1, y_2 \mid \theta) = \mathrm{Bin}(y_1 \mid n_1, \theta)\, \mathrm{Bin}(y_2 \mid n_2, \theta), \qquad \theta \in\ ]0,1[.
\]
Starting from the conjugate local prior
\[
p_1(\theta_1, \theta_2 \mid a) = \mathrm{Beta}(\theta_1 \mid a_{11}, a_{12})\, \mathrm{Beta}(\theta_2 \mid a_{21}, a_{22}), \quad \text{under } M_1,
\]
where $a$ is a $2 \times 2$ matrix of strictly positive real numbers, define the conjugate moment prior of order $h$ as
\[
p^M_1(\theta_1, \theta_2 \mid a, h) \propto (\theta_1 - \theta_2)^{2h}\, \mathrm{Beta}(\theta_1 \mid a_{11}, a_{12})\, \mathrm{Beta}(\theta_2 \mid a_{21}, a_{22});
\]
assume $a_{11} = a_{12} = b_1$ and $a_{21} = a_{22} = b_2$.
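With these choices the $h = 0$ Bayes factor has a closed form via beta-binomial marginal likelihoods, since the binomial coefficients cancel between the two models. A minimal Python sketch (illustrative, not the authors' code; it also uses the $\mathrm{Beta}(\theta \mid b_0, b_0)$ prior under $M_0$ that appears later in the talk):

```python
# h = 0 Bayes factor for two proportions: Beta(b1, b1) x Beta(b2, b2) under M1,
# Beta(b0, b0) on the common theta under M0; beta-binomial marginal likelihoods.
import numpy as np
from scipy.special import betaln

def log_bf10(y1, n1, y2, n2, b1=1.0, b2=1.0, b0=2.0):
    log_m1 = (betaln(y1 + b1, n1 - y1 + b1) - betaln(b1, b1)
              + betaln(y2 + b2, n2 - y2 + b2) - betaln(b2, b2))
    log_m0 = betaln(y1 + y2 + b0, n1 + n2 - y1 - y2 + b0) - betaln(b0, b0)
    return log_m1 - log_m0

print(np.exp(log_bf10(10, 40, 10, 40)))  # equal proportions: BF10 < 1, favours M0
print(np.exp(log_bf10(5, 40, 25, 40)))   # clearly different: BF10 >> 1, favours M1
```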
The whetstone
Going non-local from a default prior

[Figure: prior densities on $(\theta_1, \theta_2)$ for $h = 0$ (left) and $h = 1$ (right), with $b_1 = b_2 = 1$.]
The whetstone
Increasing the order of a default moment prior

[Figure: moment prior densities on $(\theta_1, \theta_2)$ for $h = 1$ and $h = 2$, with $b_1 = b_2 = 1$.]
The whetstone
Jeffreys-Lindley-Bartlett paradox
Related to the limiting argument of the JLB paradox, the idea that prior probability mass should not be “wasted” on regions of the parameter space too remote from the null is both old and new:
“If a rare event for $H_0$ occurs that also is rare for typical $H_1$ values, it provides little evidence for rejecting $H_0$ in favor of $H_1$.”³
“A vague prior distribution assigns much of its probability on values that are never going to be plausible, and this disturbs the posterior probabilities more than we tend to expect—something that we probably do not think about enough in our routine applications of standard statistical methods.”⁴
³ Morris, C. N. (1987). Comments on “Testing a point null hypothesis: The irreconcilability of P values and evidence.” J. Amer. Statist. Assoc. 82, 131–133.
⁴ Gelman, A. (2013). P values and statistical practice. Epidemiology 24, 69–72.
The whetstone
Were you wearing a red tie, Sir?
[Figure: learning-rate curves for $h = 0$ and $h = 1$ as $n_1 = n_2$ grows to 500, under $\theta_1 = \theta_2 = 0.25$ (null true) and under $\theta_1 = 0.25$, $\theta_2 = 0.40$ (alternative true).]

Using a default local prior for $\theta$ under $M_0$: $p_0(\theta \mid b_0) = \mathrm{Beta}(\theta \mid b_0, b_0)$. Vertical line at $n = 50$.
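A minimal simulation sketch of the $h = 0$ curves in this figure (illustrative, not the authors' code; the $h = 1$ curve would additionally need the generalized moments discussed under Computational burden below):

```python
# Track the h = 0 Bayes factor as n grows, under the two scenarios of the figure.
import numpy as np
from scipy.special import betaln

def log_bf10(y1, n1, y2, n2, b1=1.0, b2=1.0, b0=2.0):
    log_m1 = (betaln(y1 + b1, n1 - y1 + b1) - betaln(b1, b1)
              + betaln(y2 + b2, n2 - y2 + b2) - betaln(b2, b2))
    log_m0 = betaln(y1 + y2 + b0, n1 + n2 - y1 - y2 + b0) - betaln(b0, b0)
    return log_m1 - log_m0

rng = np.random.default_rng(0)
for theta1, theta2 in [(0.25, 0.25), (0.25, 0.40)]:
    for n in [50, 100, 200, 500]:
        y1, y2 = rng.binomial(n, theta1), rng.binomial(n, theta2)
        print(theta1, theta2, n, log_bf10(y1, n, y2, n))
# log BF10 typically drifts slowly below zero in the first scenario (null true)
# and grows roughly linearly in n in the second (alternative true).
```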
The alum block
Intrinsic moment priors
Mixing⁵ over all possible training samples $x^t = (x_1, \ldots, x_t)$ of size $t$, the intrinsic moment prior on $\gamma_1$ is given by
\[
p^{IM}_1(\gamma_1 \mid h, t) = \sum_{x^t} p^M_1(\gamma_1 \mid x^t, h)\, m_0(x^t), \qquad \gamma_1 \in \Gamma_1,
\]
where $p^M_1(\cdot \mid x^t, h)$ is the posterior of $\gamma_1$ under $M_1$, given $x^t$, and $m_0(x^t) = \int_{\Gamma_0} f^t_0(x^t \mid \gamma_0)\, p_0(\gamma_0)\, d\gamma_0$ is the marginal of $x^t$ under $M_0$; let $p^{IM}_1(\cdot \mid h, 0) = p^M_1(\cdot \mid h)$.

As the training sample size $t$ grows, the intrinsic moment prior increases its concentration on regions around the subspace $\tilde{\Gamma}_0$, while the non-local nature of $p^M_1(\cdot \mid h)$ is preserved.
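In the two-proportion example the weights $m_0(x^t)$ are beta-binomial probabilities and the mixture is a finite sum; a short self-contained sketch (illustrative; a training sample reduces to success counts $x_1$ out of $t_1$ and $x_2$ out of $t_2$):

```python
# Mixture weights m0(x^t) under M0 with a Beta(b0, b0) prior on the common theta.
from math import comb, exp, lgamma

def betaln(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def m0(x1, t1, x2, t2, b0=2.0):
    # beta-binomial marginal of the training sample under the null model
    return (comb(t1, x1) * comb(t2, x2)
            * exp(betaln(x1 + x2 + b0, t1 + t2 - x1 - x2 + b0) - betaln(b0, b0)))

t1 = t2 = 9
weights = {(x1, x2): m0(x1, t1, x2, t2)
           for x1 in range(t1 + 1) for x2 in range(t2 + 1)}
print(sum(weights.values()))  # the weights sum to one, up to rounding
```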
⁵ Pérez, J. M. and Berger, J. O. (2002). Expected-posterior prior distributions for model selection. Biometrika 89, 491–511.
The alum block
Pulling the mass back toward the null

[Figure: intrinsic moment prior densities on $(\theta_1, \theta_2)$ for $h = 1$ with $t = (0,0)$ (left) and $t = (9,9)$ (right), with $b_0 = 2$, $b_1 = b_2 = 1$.]
The alum block
Bleeding stopped

[Figure: learning-rate curves as $n_1 = n_2$ grows to 500, for $h = 0$, $t = (0,0)$; $h = 1$, $t = (9,9)$; and $h = 2$, $t = (17,17)$, under $\theta_1 = \theta_2 = 0.25$ and under $\theta_1 = 0.25$, $\theta_2 = 0.40$.]

Total training sample size for the comparison of two proportions: $t_+ = t_1 + t_2$. Vertical line at $n = 50$.
The alum block
How was $t$ chosen?

[Figure: Minimal Dataset Evidence as a function of $t_+/2$ from 0 to 20, for $h = 0$, $h = 1$ and $h = 2$, with $b_0 = 2$, $b_1 = b_2 = 1$.]

Total Weight Of Evidence:
\[
\sum_{z^m} \log BF^{IM}_{10}(z^m \mid b, h, t),
\]
summing over all possible observations $z^m$ with minimal sample size $m$ such that data can discriminate between $M_0$ and $M_1$. Here $m = (1, 1)$.
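A self-contained sketch of this criterion for $h = 0$ (illustrative, not the authors' code): for a given training sample size it sums $\log BF^{IM}_{10}$ over the four minimal datasets $(y_1, y_2) \in \{0, 1\}^2$, using closed-form beta-binomial marginals.

```python
# Total weight of evidence over minimal datasets, h = 0, b0 = 2, b1 = b2 = 1.
from math import comb, exp, lgamma, log

def betaln(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_bb(y, n, a, b):
    # log marginal of y successes in n trials when theta ~ Beta(a, b)
    return log(comb(n, y)) + betaln(y + a, n - y + b) - betaln(a, b)

def m0(x1, t1, x2, t2, b0):
    # null marginal (mixture weight) of the training sample
    return comb(t1, x1) * comb(t2, x2) * exp(
        betaln(x1 + x2 + b0, t1 + t2 - x1 - x2 + b0) - betaln(b0, b0))

def log_bf_im10(y1, y2, t1, t2, b1=1.0, b2=1.0, b0=2.0):
    # h = 0 intrinsic Bayes factor for one observation per group (m = (1, 1))
    num = sum(m0(x1, t1, x2, t2, b0)
              * exp(log_bb(y1, 1, x1 + b1, t1 - x1 + b1)
                    + log_bb(y2, 1, x2 + b2, t2 - x2 + b2))
              for x1 in range(t1 + 1) for x2 in range(t2 + 1))
    log_den = betaln(y1 + y2 + b0, 2 - y1 - y2 + b0) - betaln(b0, b0)
    return log(num) - log_den

def twoe(t1, t2):
    return sum(log_bf_im10(y1, y2, t1, t2) for y1 in (0, 1) for y2 in (0, 1))

for t in (0, 3, 6, 9):
    print(t, t, twoe(t, t))
```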
The alum block
How about the choice of h?
The choice $h = 1$ is recommended, based on the following considerations: switching from $h = 0$ to $h = 1$ changes the asymptotic learning rate from sublinear to superlinear (making a big difference); switching from $h = 1$ to $h = 2$ results in a less remarkable difference (while aggravating the problem with small samples).
Inverse moment priors (Johnson and Rossell, 2010, JRSS-B) achieve an exponential learning rate also when the sampling distribution belongs to the smaller model; do you really want to drop a model that fast when the sampling distribution lies on its boundary?
The alum block
Predictive performance
[Figure: cross-validation study comparing predictive performance for $h = 1$ and $h = 2$ across the 41 tables of Efron (1996).]
Discussion
Computational burden
Using a generalized moment prior under $M^n_1$, the Bayes factor against $M^n_0$ can be written as
\[
BF^M_{10}(z^n \mid h) = \frac{\int_{\Gamma_1} g_h(\gamma_1)\, p_1(\gamma_1 \mid z^n)\, d\gamma_1}
                             {\int_{\Gamma_1} g_h(\gamma_1)\, p_1(\gamma_1)\, d\gamma_1}\; BF_{10}(z^n),
\]
so that the extra effort required amounts to computing some generalized moments of the local prior and posterior.

Using an intrinsic moment prior under $M^n_1$, the Bayes factor against $M^n_0$ can be written as a mixture of conditional Bayes factors:
\[
BF^{IM}_{10}(z^n \mid h, t) = \sum_{x^t} BF^M_{10}(z^n \mid x^t, h)\, m_0(x^t),
\]
where $BF^M_{10}(z^n \mid x^t, h)$ is the Bayes factor using $p^M_1(\cdot \mid x^t, h)$ as prior under $M^n_1$; recall that $m_0(x^t) = \int_{\Gamma_0} f^t_0(x^t \mid \gamma_0)\, p_0(\gamma_0)\, d\gamma_0$.
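For the two-proportion example both formulas reduce to finite computations: generalized moments of products of Beta densities expand, by the binomial theorem, into ordinary Beta moments. A minimal sketch (illustrative, not the authors' code; the prior moment in the denominator is also the normalizing constant of $p^M_1$):

```python
# Moment-corrected Bayes factor BF^M_10 for two proportions, via Beta moments.
from math import comb
import numpy as np
from scipy.special import betaln

def beta_moment(r, a, b):
    # E[theta^r] for theta ~ Beta(a, b)
    return np.prod([(a + s) / (a + b + s) for s in range(r)])

def g_moment(h, a1, b1, a2, b2):
    # E[(theta1 - theta2)^(2h)] for independent Beta(a1, b1) and Beta(a2, b2)
    return sum(comb(2 * h, j) * (-1.0) ** j
               * beta_moment(2 * h - j, a1, b1) * beta_moment(j, a2, b2)
               for j in range(2 * h + 1))

def log_bf10(y1, n1, y2, n2, b1, b2, b0):
    # h = 0 Bayes factor with conjugate local priors, as in the earlier sketch
    return (betaln(y1 + b1, n1 - y1 + b1) - betaln(b1, b1)
            + betaln(y2 + b2, n2 - y2 + b2) - betaln(b2, b2)
            - betaln(y1 + y2 + b0, n1 + n2 - y1 - y2 + b0) + betaln(b0, b0))

def log_bf_m10(y1, n1, y2, n2, h=1, b1=1.0, b2=1.0, b0=2.0):
    post = g_moment(h, y1 + b1, n1 - y1 + b1, y2 + b2, n2 - y2 + b2)
    prior = g_moment(h, b1, b1, b2, b2)
    return np.log(post / prior) + log_bf10(y1, n1, y2, n2, b1, b2, b0)

# with equal observed proportions the moment prior strengthens the case for M0
print(np.exp(log_bf_m10(10, 40, 10, 40, h=1)))
```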
Discussion
Why not just increase prior sample size?
[Figure: prior densities on $(\theta_1, \theta_2)$ for $h = 1$, $t = (9,9)$ with $b_0 = 2$, $b_1 = b_2 = 1$ (left) and for $h = 1$, $t = (0,0)$ with $b_0 = 2$, $b_1 = b_2 = 9$ (right).]
Discussion
Logistic regression models
Suppose we observe $y = (y_1, \ldots, y_N)$ with $f(y_i \mid \theta_i) = \mathrm{Bin}(y_i \mid n_i, \theta_i)$, $i = 1, \ldots, N$, and we let
\[
\log \frac{\theta_i}{1 - \theta_i} = \beta_0 + \sum_{j=1}^{k} w_{ij} \beta_j, \qquad i = 1, \ldots, N,
\]
where $w_{ij}$, $j = 1, \ldots, k$, are the values of $k$ explanatory variables observed with $y_i$; the likelihood is
\[
f^{n_+}_k(y \mid \beta) = \left\{ \prod_{i=1}^{N} \binom{n_i}{y_i} \right\} L_k(\beta \mid y, n),
\]
where $\beta = (\beta_0, \beta_1, \ldots, \beta_k)$, $n = (n_1, \ldots, n_N)$, and
\[
L_k(\beta \mid y, n) = \prod_{i=1}^{N} e^{\, y_i \left(\beta_0 + \sum_{j=1}^{k} w_{ij} \beta_j\right) - n_i \log\left(1 + \exp\left\{\beta_0 + \sum_{j=1}^{k} w_{ij} \beta_j\right\}\right)}.
\]
Special cases: $N = 2$, $k = 1$, $w_{ij} = (i - 1)$ and $N = 2$, $k = 0$.
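A direct numpy transcription of $\log L_k(\beta \mid y, n)$ (a minimal sketch; the data values are made up for illustration):

```python
# Log-likelihood of the logistic regression model, as written on this slide.
import numpy as np

def log_lik(beta, y, n, w):
    """beta: (k+1,) with intercept first; w: (N, k) matrix of covariate values."""
    eta = beta[0] + w @ beta[1:]                 # linear predictor, shape (N,)
    return np.sum(y * eta - n * np.log1p(np.exp(eta)))

# special case N = 2, k = 1, w_ij = (i - 1): comparing two proportions
y = np.array([10.0, 16.0])
n = np.array([40.0, 40.0])
w = np.array([[0.0], [1.0]])
print(log_lik(np.array([-1.1, 0.7]), y, n, w))
```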
Discussion
Logistic regression priors
Conjugate local prior⁶ given by $p^C_k(\beta \mid u, v) \propto L_k(\beta \mid u, v)$, where $u = (u_1, \ldots, u_N)$ and $v = (v_1, \ldots, v_N)$; default specification of these hyperparameters:
\[
v_i = v_+ \frac{n_i}{n_+}, \qquad u_i = \frac{v_i}{2}, \qquad i = 1, \ldots, N,
\]
for some $v_+ > 0$ representing a prior sample size; the condition $u_i = v_i / 2$ ensures that the prior mode is at $\beta = 0$.

In the special cases corresponding to comparing two proportions, the induced prior on the common proportion is $\theta \sim \mathrm{Beta}(v_+/2, v_+/2)$, while $(\theta_1, \theta_2) \sim \mathrm{Beta}(v_1/2, v_1/2) \otimes \mathrm{Beta}(v_2/2, v_2/2)$ . . .
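A short sketch of the default rule (illustrative): spread the prior sample size $v_+$ across observations in proportion to the $n_i$ and centre the prior at $\beta = 0$.

```python
# Default hyperparameters for the conjugate local prior p^C_k(beta | u, v).
import numpy as np

def default_hyperparameters(n, v_plus=2.0):
    n = np.asarray(n, dtype=float)
    v = v_plus * n / n.sum()   # v_i = v_+ * n_i / n_+
    u = v / 2.0                # u_i = v_i / 2 puts the prior mode at beta = 0
    return u, v

u, v = default_hyperparameters([40, 40], v_plus=2.0)
print(u, v)  # equal n_i with v_+ = 2: v = (1, 1), u = (0.5, 0.5)
```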
⁶ Bedrick, E. J., Christensen, R. and Johnson, W. (1996). A new perspective on priors for generalized linear models. J. Amer. Statist. Assoc. 91, 1450–1460.
Discussion
Going non-local in a different parameterization
. . . whereas the product moment prior⁷ on $\beta$,
\[
p^M_k(\beta \mid u, v, h) \propto \prod_{j=1}^{k} \beta_j^{2h}\; p^C_k(\beta \mid u, v),
\]
in the special case $N = 2$, $k = 1$, $w_{ij} = (i - 1)$, induces on $(\theta_1, \theta_2)$ a prior not in the family of conjugate moment priors considered before; how much can results differ?
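A short check of why the two families differ (not on the slide, but immediate from the reparameterization): in this special case $\theta_1 = e^{\beta_0}/(1 + e^{\beta_0})$ and $\theta_2 = e^{\beta_0 + \beta_1}/(1 + e^{\beta_0 + \beta_1})$, so $\beta_1 = \operatorname{logit} \theta_2 - \operatorname{logit} \theta_1$ and the moment factor transforms as
\[
\beta_1^{2h} = (\operatorname{logit} \theta_2 - \operatorname{logit} \theta_1)^{2h},
\]
which vanishes on $\theta_1 = \theta_2$ at the same order as $(\theta_1 - \theta_2)^{2h}$ but is not proportional to it away from the diagonal; hence the induced prior on $(\theta_1, \theta_2)$ is non-local but lies outside the conjugate moment family.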
The intrinsic procedure was successfully applied to $p^M_k(\cdot \mid u, v, h)$, but maybe increasing $v_+$ is an interesting alternative?
⁷ Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. J. Amer. Statist. Assoc. 107, 649–660.
Thank you!
http://xianblog.wordpress.com/2013/08/01/whetstone-and-alum-and-occams-razor/