Cutting-edge issues in objective Bayesian model comparison

Subjective views of a Bayesian barber

Luca La Rocca (luca.larocca@unimore.it)
Dipartimento di Scienze Fisiche, Informatiche e Matematiche
Università di Modena e Reggio Emilia (UNIMORE)

Big data in biomedicine. Big models?

Centre for Research in Statistical Methodology (CRiSM)

University of Warwick, 27 February 2014

L. La Rocca (UNIMORE)

Objective Bayesian Razors

CRiSM Workshop 27 Feb 2014 1 / 22

Outline

1. Ockham's razor
2. The whetstone
3. The alum block
4. Discussion


Ockham’s razor

Bayes factors

Two sampling models for a sequence of discrete observations z^n = (z_1, ..., z_n),

M^n_0 = { f^n_0(· | γ_0), γ_0 ∈ Γ_0 },    M^n_1 = { f^n_1(· | γ_1), γ_1 ∈ Γ_1 },

are compared by means of the Bayes factor for M^n_1 against M^n_0,

BF_10(z^n) = [ ∫_{Γ_1} f^n_1(z^n | γ_1) p_1(γ_1) dγ_1 ] / [ ∫_{Γ_0} f^n_0(z^n | γ_0) p_0(γ_0) dγ_0 ],

where p_i(γ_i) is a parameter prior under M^n_i (i = 0, 1); then

Pr(M^n_1 | z^n) = BF_10(z^n) / (1 + BF_10(z^n)),

assuming Pr(M^n_0) = Pr(M^n_1) = 1/2.
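Both marginals in BF_10 are available in closed form for the two-binomials example used later in the talk, so the Bayes factor and the posterior model probability can be computed exactly; a minimal Python sketch (not from the talk; function names are mine), assuming conjugate Beta priors under both models:

```python
from math import lgamma, exp

def betaln(a, c):
    # log Beta function, log B(a, c)
    return lgamma(a) + lgamma(c) - lgamma(a + c)

def log_bf10(y1, n1, y2, n2, b1=1.0, b2=1.0, b0=1.0):
    # log BF10 for M1 (two free proportions, Beta(b1,b1) x Beta(b2,b2) prior)
    # against M0 (one common proportion, Beta(b0,b0) prior); the binomial
    # coefficients of the data cancel between the two marginal likelihoods
    log_m1 = (betaln(y1 + b1, n1 - y1 + b1) - betaln(b1, b1)
              + betaln(y2 + b2, n2 - y2 + b2) - betaln(b2, b2))
    log_m0 = betaln(y1 + y2 + b0, n1 + n2 - y1 - y2 + b0) - betaln(b0, b0)
    return log_m1 - log_m0

def post_prob_m1(y1, n1, y2, n2, **kw):
    # Pr(M1 | z^n) = BF10 / (1 + BF10), with Pr(M0) = Pr(M1) = 1/2
    bf = exp(log_bf10(y1, n1, y2, n2, **kw))
    return bf / (1.0 + bf)
```

For instance, observing 5/20 successes in one group and 15/20 in the other gives a posterior probability for M_1 above 0.95, while equal counts in the two groups push it below 1/2.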


Ockham’s razor

Asymptotic learning rate

Let M^n_0 be nested in M^n_1 (Γ_0 ≡ Γ̃_0 ⊂ Γ_1), with dimensions d_0 < d_1.

Assume p_0(·) is a local prior (continuous and strictly positive on Γ_0). Typically p_1(·) is also a local prior, so that (under regularity conditions)

BF_10(z^n) = n^{−(d_1 − d_0)/2} e^{O_P(1)}, as n → ∞, if the sampling distribution of z^n belongs to M^n_0;

BF_10(z^n) = e^{Kn + O_P(n^{1/2})}, for some K > 0, if the sampling distribution of z^n belongs to M^n_1 \ M^n_0.

This imbalance in the asymptotic learning rate motivated the introduction of non-local priors^1 ...

1. Johnson, V. E. and Rossell, D. (2010). On the use of non-local prior densities in Bayesian hypothesis tests. J. R. Stat. Soc. Ser. B Stat. Methodol. 72, 143–170.
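The polynomial null rate can be checked numerically in the two-proportions setting of the following slides (where d_1 − d_0 = 1): holding the observed proportions equal in both groups, log BF_10 decreases by roughly (1/2) log 10 per tenfold increase of n. An illustrative Python sketch (my own, not from the talk), with uniform local priors:

```python
from math import lgamma

def betaln(a, c):
    # log Beta function
    return lgamma(a) + lgamma(c) - lgamma(a + c)

def log_bf10(y1, n1, y2, n2, b=1.0):
    # log BF for two binomial proportions vs a common one, Beta(b, b) priors;
    # binomial coefficients cancel between the two marginal likelihoods
    log_m1 = (betaln(y1 + b, n1 - y1 + b) + betaln(y2 + b, n2 - y2 + b)
              - 2 * betaln(b, b))
    log_m0 = betaln(y1 + y2 + b, n1 + n2 - y1 - y2 + b) - betaln(b, b)
    return log_m1 - log_m0

# keep the data on the null: both groups observed at the same proportion 0.3;
# log BF10 then drifts down like -(1/2) log n, the (d1 - d0)/2 polynomial rate
for n in (100, 1000, 10000):
    y = int(0.3 * n)
    print(n, log_bf10(y, n, y, n))
```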


The whetstone

Generalized moment priors

... such as generalized moment priors^2 of order h:

p^M_1(γ_1 | h) ∝ g_h(γ_1) p_1(γ_1),    γ_1 ∈ Γ_1,

where g_h(·) is a smooth function from Γ_1 to ℝ^+, vanishing on Γ̃_0 together with its first 2h − 1 derivatives, while g_h^{(2h)}(γ_1) > 0 for all γ_1 ∈ Γ̃_0; let g_0(γ_1) ≡ 1.

The asymptotic learning rate is changed to

BF_10(z^n) = n^{−h − (d_1 − d_0)/2} e^{O_P(1)}, as n → ∞, if the sampling distribution of z^n belongs to M^n_0;

it is unchanged if the sampling distribution of z^n belongs to M^n_1 \ M^n_0.

2. Consonni, G., Forster, J. J. and La Rocca, L. (2013). The whetstone and the alum block: Balanced objective Bayesian comparison of nested models for discrete data. Statist. Sci. 28, 398–423.


The whetstone

Comparing two proportions

Let the larger model be the product of two binomial models,

f^{n_1+n_2}_1(y_1, y_2 | θ_1, θ_2) = Bin(y_1 | n_1, θ_1) Bin(y_2 | n_2, θ_2),    (θ_1, θ_2) ∈ ]0, 1[²,

and let the null model assume θ_1 = θ_2 = θ,

f^{n_1+n_2}_0(y_1, y_2 | θ) = Bin(y_1 | n_1, θ) Bin(y_2 | n_2, θ),    θ ∈ ]0, 1[.

Starting from the conjugate local prior under M_1,

p_1(θ_1, θ_2 | a) = Beta(θ_1 | a_11, a_12) Beta(θ_2 | a_21, a_22),

where a is a 2 × 2 matrix of strictly positive real numbers, define the conjugate moment prior of order h as

p^M_1(θ_1, θ_2 | a, h) ∝ (θ_1 − θ_2)^{2h} Beta(θ_1 | a_11, a_12) Beta(θ_2 | a_21, a_22);

assume a_11 = a_12 = b_1 and a_21 = a_22 = b_2.
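The proportionality constant implicit in p^M_1 is E[(θ_1 − θ_2)^{2h}] under the local prior, which follows from Beta moments and the binomial theorem; a small Python sketch (illustrative; function names are mine):

```python
from math import lgamma, exp, comb

def beta_moment(a, c, m):
    # E[theta^m] for theta ~ Beta(a, c)
    return exp(lgamma(a + m) + lgamma(a + c) - lgamma(a) - lgamma(a + c + m))

def moment_norm(h, b1=1.0, b2=1.0):
    # E[(theta1 - theta2)^(2h)] under Beta(b1, b1) x Beta(b2, b2),
    # expanding the even power with the binomial theorem
    return sum(comb(2 * h, j) * (-1) ** j
               * beta_moment(b1, b1, 2 * h - j) * beta_moment(b2, b2, j)
               for j in range(2 * h + 1))
```

For b_1 = b_2 = 1 and h = 1 the constant is Var(θ_1) + Var(θ_2) + (Eθ_1 − Eθ_2)² = 1/12 + 1/12 + 0 = 1/6.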


The whetstone

Going non-local from a default prior

[Figure: perspective plots of the default local prior (h = 0) and the conjugate moment prior (h = 1) on (θ_1, θ_2), both with b_1 = b_2 = 1.]

The whetstone

Increasing the order of a default moment prior

[Figure: perspective plots of the conjugate moment priors of orders h = 1 and h = 2 on (θ_1, θ_2), both with b_1 = b_2 = 1.]

The whetstone

Jeffreys-Lindley-Bartlett paradox

Related to the limiting argument of the JLB paradox, the idea that probability mass should not be “wasted” on parameter regions too remote from the null is both old and new:

“If a rare event for H_0 occurs that also is rare for typical H_1 values, it provides little evidence for rejecting H_0 in favor of H_1.”^3

“A vague prior distribution assigns much of its probability on values that are never going to be plausible, and this disturbs the posterior probabilities more than we tend to expect—something that we probably do not think about enough in our routine applications of standard statistical methods.”^4

3. Morris, C. N. (1987). Comments on “Testing a point null hypothesis: The irreconcilability of P values and evidence.” J. Amer. Statist. Assoc. 82, 131–133.

4. Gelman, A. (2013). P values and statistical practice. Epidemiology 24, 69–72.


The whetstone

Were you wearing a red tie, Sir?

[Figure: learning rate as a function of n_1 = n_2 (up to 500) for h = 0 and h = 1, under θ_1 = θ_2 = 0.25 and under θ_1 = 0.25, θ_2 = 0.40; vertical line at n = 50.]

Using a default local prior for θ under M_0: p_0(θ | b_0) = Beta(θ | b_0, b_0).

The alum block

Intrinsic moment priors

Mixing^5 over all possible training samples x^t = (x_1, ..., x_t) of size t, the intrinsic moment prior on γ_1 is given by

p^IM_1(γ_1 | h, t) = Σ_{x^t} p^M_1(γ_1 | x^t, h) m_0(x^t),    γ_1 ∈ Γ_1,

where p^M_1(· | x^t, h) is the posterior of γ_1 under M_1, given x^t, and m_0(x^t) = ∫_{Γ_0} f^t_0(x^t | γ_0) p_0(γ_0) dγ_0 is the marginal of x^t under M_0; let p^IM_1(· | h, 0) = p^M_1(· | h).

As the training sample size t grows, the intrinsic moment prior increases its concentration on regions around the subspace Γ̃_0, while the non-local nature of p^M_1(· | h) is preserved.

5. Pérez, J. M. and Berger, J. O. (2002). Expected-posterior prior distributions for model selection. Biometrika 89, 491–511.
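In the two-proportions example, the sum runs over training counts x^t = (x_1, x_2), with x_i successes out of t_i trials, and every ingredient is in closed form; a Python sketch of the resulting density (my own illustration; function names are mine, with b_0, b_1, b_2 as on the earlier slides):

```python
from math import lgamma, exp, log, comb

def betaln(a, c):
    # log Beta function
    return lgamma(a) + lgamma(c) - lgamma(a + c)

def beta_pdf(x, a, c):
    # density of Beta(a, c) at x in ]0, 1[
    return exp((a - 1) * log(x) + (c - 1) * log(1 - x) - betaln(a, c))

def e_diff_2h(a1, c1, a2, c2, h):
    # E[(t1 - t2)^(2h)] for independent t1 ~ Beta(a1, c1), t2 ~ Beta(a2, c2)
    def mom(a, c, m):
        return exp(lgamma(a + m) + lgamma(a + c) - lgamma(a) - lgamma(a + c + m))
    return sum(comb(2 * h, j) * (-1) ** j * mom(a1, c1, 2 * h - j) * mom(a2, c2, j)
               for j in range(2 * h + 1))

def m0_train(x1, t1, x2, t2, b0=2.0):
    # marginal m0(x^t) of the training sample under M0 (common theta, beta-binomial)
    return (comb(t1, x1) * comb(t2, x2)
            * exp(betaln(b0 + x1 + x2, b0 + t1 + t2 - x1 - x2) - betaln(b0, b0)))

def intrinsic_moment_pdf(th1, th2, h, t1, t2, b0=2.0, b1=1.0, b2=1.0):
    # p1^IM(th1, th2 | h, t): mixture over x^t of moment-prior posteriors,
    # weighted by the marginal of x^t under M0
    dens = 0.0
    for x1 in range(t1 + 1):
        for x2 in range(t2 + 1):
            a1, c1 = b1 + x1, b1 + t1 - x1
            a2, c2 = b2 + x2, b2 + t2 - x2
            norm = e_diff_2h(a1, c1, a2, c2, h)  # renormalizes the 2h-power factor
            cond = ((th1 - th2) ** (2 * h) / norm
                    * beta_pdf(th1, a1, c1) * beta_pdf(th2, a2, c2))
            dens += cond * m0_train(x1, t1, x2, t2, b0)
    return dens
```

The mixture weights m_0(x^t) sum to one, so the density integrates to one; it still vanishes on the diagonal θ_1 = θ_2, as the non-local construction requires.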


The alum block

Pulling the mass back toward the null

[Figure: perspective plots of the intrinsic moment prior with h = 1, for t = (0, 0) and t = (9, 9), both with b_0 = 2 and b_1 = b_2 = 1.]

The alum block

Bleeding stopped

[Figure: learning rate as a function of n_1 = n_2 (up to 500) for (h = 0, t = (0,0)), (h = 1, t = (9,9)) and (h = 2, t = (17,17)), under θ_1 = θ_2 = 0.25 and under θ_1 = 0.25, θ_2 = 0.40; vertical line at n = 50.]

Total training sample size for the comparison of two proportions: t_+ = t_1 + t_2.

The alum block

How was t chosen?

[Figure: minimal dataset evidence as a function of t_+ / 2, from 0 to 20, for h = 0, 1, 2, with b_0 = 2 and b_1 = b_2 = 1.]

Total Weight Of Evidence: Σ_{z^m} log BF^IM_10(z^m | b, h, t), summing over all possible observations z^m with minimal sample size m such that the data can discriminate between M_0 and M_1. Here m = (1, 1).


The alum block

How about the choice of h ?

The choice h = 1 is recommended, based on the following considerations: switching from h = 0 to h = 1 changes the asymptotic learning rate from sublinear to superlinear (making a big difference); switching from h = 1 to h = 2 results in a less remarkable difference (while aggravating the problem with small samples).

Inverse moment priors (Johnson and Rossell, 2010, JRSS-B) achieve an exponential learning rate even when the sampling distribution belongs to the smaller model; do you really want to drop a model with the sampling distribution on its boundary that fast?


The alum block

Predictive performance

[Figure: cross-validation study of predictive performance for h = 1 and h = 2 on the tables of Efron (1996); vertical scale from −0.16 to 0.16.]


Discussion

Computational burden

The Bayes factor against M^n_0 using a generalized moment prior under M^n_1 can be written as

BF^M_10(z^n | h) = [ ∫_{Γ_1} g_h(γ_1) p_1(γ_1 | z^n) dγ_1 / ∫_{Γ_1} g_h(γ_1) p_1(γ_1) dγ_1 ] BF_10(z^n),

so that the extra effort required amounts to computing some generalized moments of the local prior and posterior.
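For the comparison of two proportions, g_h(θ_1, θ_2) = (θ_1 − θ_2)^{2h} and both generalized moments reduce to finite sums of Beta moments, so BF^M_10 costs little more than BF_10; a Python sketch (my own illustration; function names are mine):

```python
from math import lgamma, exp, log, comb

def betaln(a, c):
    # log Beta function
    return lgamma(a) + lgamma(c) - lgamma(a + c)

def e_diff_2h(a1, c1, a2, c2, h):
    # E[(t1 - t2)^(2h)] for independent t1 ~ Beta(a1, c1), t2 ~ Beta(a2, c2)
    def mom(a, c, m):
        return exp(lgamma(a + m) + lgamma(a + c) - lgamma(a) - lgamma(a + c + m))
    return sum(comb(2 * h, j) * (-1) ** j * mom(a1, c1, 2 * h - j) * mom(a2, c2, j)
               for j in range(2 * h + 1))

def log_bf10_moment(y1, n1, y2, n2, h, b1=1.0, b2=1.0, b0=1.0):
    # local-prior Bayes factor (conjugate Beta priors, closed form)
    log_bf = (betaln(y1 + b1, n1 - y1 + b1) + betaln(y2 + b2, n2 - y2 + b2)
              - betaln(b1, b1) - betaln(b2, b2)
              - betaln(y1 + y2 + b0, n1 + n2 - y1 - y2 + b0) + betaln(b0, b0))
    # correction: ratio of posterior to prior generalized moments of g_h
    post_mom = e_diff_2h(y1 + b1, n1 - y1 + b1, y2 + b2, n2 - y2 + b2, h)
    prior_mom = e_diff_2h(b1, b1, b2, b2, h)
    return log_bf + log(post_mom) - log(prior_mom)
```

Data far from the diagonal θ_1 = θ_2 inflate the posterior generalized moment, and hence the Bayes factor, relative to h = 0; data near the diagonal deflate it.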

The Bayes factor against M^n_0 using an intrinsic moment prior under M^n_1 can be written as a mixture of conditional Bayes factors:

BF^IM_10(z^n | h, t) = Σ_{x^t} BF^M_10(z^n | x^t, h) m_0(x^t),

where BF^M_10(z^n | x^t, h) is the Bayes factor using p^M_1(· | x^t, h) as prior under M^n_1; recall that m_0(x^t) = ∫_{Γ_0} f^t_0(x^t | γ_0) p_0(γ_0) dγ_0.
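For two proportions the mixture is a finite double sum over training outcomes x^t = (x_1, x_2), and each conditional Bayes factor is again a ratio of generalized moments times a local-prior Bayes factor, now with the moment prior's hyperparameters updated by x^t; a Python sketch under these assumptions (names are mine):

```python
from math import lgamma, exp, log, comb

def betaln(a, c):
    # log Beta function
    return lgamma(a) + lgamma(c) - lgamma(a + c)

def e_diff_2h(a1, c1, a2, c2, h):
    # E[(t1 - t2)^(2h)] for independent t1 ~ Beta(a1, c1), t2 ~ Beta(a2, c2)
    def mom(a, c, m):
        return exp(lgamma(a + m) + lgamma(a + c) - lgamma(a) - lgamma(a + c + m))
    return sum(comb(2 * h, j) * (-1) ** j * mom(a1, c1, 2 * h - j) * mom(a2, c2, j)
               for j in range(2 * h + 1))

def bf10_intrinsic(y1, n1, y2, n2, h, t1, t2, b0=2.0, b1=1.0, b2=1.0):
    # BF10^IM(z^n | h, t) = sum_x BF10^M(z^n | x^t, h) m0(x^t), two proportions
    log_m0_z = betaln(b0 + y1 + y2, b0 + n1 + n2 - y1 - y2) - betaln(b0, b0)
    total = 0.0
    for x1 in range(t1 + 1):
        for x2 in range(t2 + 1):
            a1, c1 = b1 + x1, b1 + t1 - x1   # M1 prior updated by training data
            a2, c2 = b2 + x2, b2 + t2 - x2
            # marginal of z under the x-trained local prior (binomial coeffs cancel)
            log_m1_z = (betaln(a1 + y1, c1 + n1 - y1) - betaln(a1, c1)
                        + betaln(a2 + y2, c2 + n2 - y2) - betaln(a2, c2))
            # moment correction: posterior over prior generalized moments
            post_mom = e_diff_2h(a1 + y1, c1 + n1 - y1, a2 + y2, c2 + n2 - y2, h)
            prior_mom = e_diff_2h(a1, c1, a2, c2, h)
            bf_m = exp(log_m1_z + log(post_mom) - log(prior_mom) - log_m0_z)
            # weight: beta-binomial marginal of the training sample under M0
            w = (comb(t1, x1) * comb(t2, x2)
                 * exp(betaln(b0 + x1 + x2, b0 + t1 + t2 - x1 - x2) - betaln(b0, b0)))
            total += bf_m * w
    return total
```

With t = (0, 0) the sum has a single term with unit weight, so the intrinsic Bayes factor reduces to the moment-prior Bayes factor, as on the earlier slides.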


Discussion

Why not just increase prior sample size?

[Figure: perspective plots of the intrinsic moment prior with h = 1, t = (9, 9), b_0 = 2, b_1 = b_2 = 1, and of the moment prior with h = 1, t = (0, 0), b_0 = 2, b_1 = b_2 = 9.]

Discussion

Logistic regression models

Suppose we observe y = (y_1, ..., y_N) with f(y_i | θ_i) = Bin(y_i | n_i, θ_i), i = 1, ..., N, and we let

log( θ_i / (1 − θ_i) ) = β_0 + Σ_{j=1}^k w_ij β_j,    i = 1, ..., N,

where w_ij, j = 1, ..., k, are the values of k explanatory variables observed with y_i; the likelihood is

f^{n_+}_k(y | β) = { Π_{i=1}^N (n_i choose y_i) } L_k(β | y, n),

where β = (β_0, β_1, ..., β_k), n = (n_1, ..., n_N), and

L_k(β | y, n) = Π_{i=1}^N exp{ y_i (β_0 + Σ_{j=1}^k w_ij β_j) − n_i log(1 + exp{β_0 + Σ_{j=1}^k w_ij β_j}) }.

Special cases: N = 2, k = 1, w_ij = (i − 1), and N = 2, k = 0.
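The log of L_k(β | y, n) is a one-liner per observation; a Python sketch (my own illustration), which in the special case N = 2, k = 1, w_i1 = i − 1 reproduces the two-proportions log-likelihood with θ_1 = logit⁻¹(β_0) and θ_2 = logit⁻¹(β_0 + β_1):

```python
from math import log, exp

def log_lik(beta, y, n, w):
    # log L_k(beta | y, n) for the binomial logistic model;
    # beta = (beta0, beta1, ..., betak), w[i] = (w_i1, ..., w_ik)
    total = 0.0
    for yi, ni, wi in zip(y, n, w):
        eta = beta[0] + sum(wij * bj for wij, bj in zip(wi, beta[1:]))
        total += yi * eta - ni * log(1.0 + exp(eta))
    return total
```

The identity y_i η_i − n_i log(1 + e^{η_i}) = y_i log θ_i + (n_i − y_i) log(1 − θ_i), with θ_i = 1/(1 + e^{−η_i}), connects the two forms.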


Discussion

Logistic regression priors

The conjugate local prior^6 is given by p^C_k(β | u, v) ∝ L_k(β | u, v), where u = (u_1, ..., u_N) and v = (v_1, ..., v_N); a default specification of these hyperparameters is

v_i = v_+ n_i / n_+,    u_i = v_i / 2,    i = 1, ..., N,

for some v_+ > 0 representing a prior sample size; the condition u_i = v_i / 2 ensures that the prior mode is at β = 0.

In the special cases corresponding to the comparison of two proportions, the induced prior on the common proportion is θ ~ Beta(v_+ / 2, v_+ / 2), while (θ_1, θ_2) ~ Beta(v_1 / 2, v_1 / 2) ⊗ Beta(v_2 / 2, v_2 / 2) ...

6. Bedrick, E. J., Christensen, R. and Johnson, W. (1996). A new perspective on priors for generalized linear models. J. Amer. Statist. Assoc. 91, 1450–1460.


Discussion

Going non-local in a different parameterization

... whereas the product moment prior^7 on β,

p^M_k(β | u, v, h) ∝ Π_{j=1}^k β_j^{2h} p^C_k(β | u, v),

in the special case N = 2, k = 1, w_ij = (i − 1), induces on (θ_1, θ_2) a prior not in the family of conjugate moment priors considered before; how much can results differ?

The intrinsic procedure was successfully applied to p^M_k(· | u, v, h), but maybe increasing v_+ is an interesting alternative?

7. Johnson, V. E. and Rossell, D. (2012). Bayesian model selection in high-dimensional settings. J. Amer. Statist. Assoc. 107, 649–660.


Thank you!

http://xianblog.wordpress.com/2013/08/01/whetstone- and- alum- and- occams- razor/
