Bayesian Interval Mapping

advertisement
Bayesian Interval Mapping
1. Bayesian strategy
3-19
2. Markov chain sampling
20-27
3. sampling genetic architectures
28-35
4. criteria for model selection
36-44
QTL 2: Bayes
Seattle SISG: Yandell © 2008
1
QTL model selection: key players
•
observed measurements
– y = phenotypic trait
– m = markers & linkage map
– i = individual index (1,…,n)
•
observed
m
X
missing data
– missing marker data
– q = QT genotypes
q
Q
missing
• alleles QQ, Qq, or qq at locus
•
•
unknown quantities
–  = QT locus (or loci)
–  = phenotype model parameters
–  = QTL model/genetic architecture
unknown


pr(q|m,,) genotype model
– grounded by linkage map, experimental cross
– recombination yields multinomial for q given m
•
Yy
pr(y|q,,) phenotype model
– distribution shape (assumed normal here)
– unknown parameters  (could be non-parametric)
QTL 2: Bayes
Seattle SISG: Yandell © 2008

after
Sen Churchill (2001)
2
1. Bayesian strategy for QTL study
• augment data (y,m) with missing genotypes q
• study unknowns (,,) given augmented data (y,m,q)
– find better genetic architectures 
– find most likely genomic regions = QTL = 
– estimate phenotype parameters = genotype means = 
• sample from posterior in some clever way
– multiple imputation (Sen Churchill 2002)
– Markov chain Monte Carlo (MCMC)
• (Satagopan et al. 1996; Yi et al. 2005, 2007)
posterior 
posterior for q,  ,  ,  
pr ( q,  ,  ,  | y , m) 
QTL 2: Bayes
likelihood * prior
constant
phenotype likelihood * [prior for q,  ,  ,  ]
constant
pr ( y | q,  ,  ) * [pr ( q | m,  ,  ) pr (  |  ) pr ( | m,  ) pr ( )]
pr ( y | m)
Seattle SISG: Yandell © 2008
3
6
8
10
prior mean
actual mean
n small prior
n large
n large
prior mean
n small
prior
actual mean
Bayes posterior for normal data
12
14
16
6
8
y = phenotype values
small prior variance
QTL 2: Bayes
10
12
14
16
y = phenotype values
large prior variance
Seattle SISG: Yandell © 2008
4
Bayes posterior for normal data
model
environment
likelihood
prior
yi =  + ei
e ~ N( 0, 2 ), 2 known
y ~ N( , 2 )
 ~ N( 0, 2 ),  known
posterior:
single individual
mean tends to sample mean
 ~ N( 0 + b1(y1 – 0), b12)
sample of n individuals
 ~ N bn y  (1  bn ) 0 , bn 2 / n 
with y  sum yi / n
{i 1,..., n }
n
shrinkage factor
(shrinks to 1)
bn 
QTL 2: Bayes
Seattle SISG: Yandell © 2008
n  1
1
5
what values are the genotypic means?
phenotype model pr(y|q,)
prior mean
data mean
n small prior
data means
n large
posterior means
6
qq
QTL 2: Bayes
8
10
Qq
12
y = phenotype values
Seattle SISG: Yandell © 2008
14
16
QQ
6
Bayes posterior QTL means
posterior centered on sample genotypic mean
but shrunken slightly toward overall mean
phenotype mean:
E ( y | q)

q
V ( y | q)   2
genotypic prior:
E ( q )

y
V ( q )   2
posterior:
E ( q | y )  bq yq  (1  bq ) y V ( q | y )  bq 2 / nq
nq
shrinkage:
QTL 2: Bayes
bq


count {qi  q}
nq
nq  1
yq  sum yi / nq
{qi  q}
1
Seattle SISG: Yandell © 2008
7
partition genotypic effects
on phenotype
• phenotype depends on genotype
• genotypic value partitioned into
– main effects of single QTL
– epistasis (interaction) between pairs of QTL
q   0   q  E (Y ; q)
 q   ( q2 )   ( q2 )   ( q1 , q2 )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
8
partitition genotypic variance
• consider same 2 QTL + epistasis
• centering variance
V (  0 )   0  s
2
V (  q )  1        
2
• genotypic variance
• heritability
QTL 2: Bayes
2
h 
2
q

2
q
 q2   2
Seattle SISG: Yandell © 2008
2
q
2
1
2
2
 h12  h22  h122
9
2
12
posterior mean ≈ LS estimate
 q | y ~ N (bq ˆq , bqCq 2 )
2
ˆ
 N (  q , Cq )
LS estimate ˆq  sum i [sum j ˆ ( qij )]  sum i wqi yi
variance
V ( ˆq )  sum i wqi2  2  Cq 2
shrinkage
bq  1 /(1  Cq )  1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
10
pr(q|m,) recombination model
pr(q|m,) = pr(geno | map, locus) 
pr(geno | flanking markers, locus)
m1 m2

QTL 2: Bayes
q?
m3
m4
markers
m5
m6
distance along chromosome
Seattle SISG: Yandell © 2008
11
QTL 2: Bayes
Seattle SISG: Yandell © 2008
12
what are likely QTL genotypes q?
how does phenotype y improve guess?
D4Mit41
D4Mit214
what are probabilities
for genotype q
between markers?
120
bp
110
recombinants AA:AB
100
all 1:1 if ignore y
and if we use y?
90
AA
AA
AB
AA
AA
AB
AB
AB
Genotype
QTL 2: Bayes
Seattle SISG: Yandell © 2008
13
posterior on QTL genotypes q
• full conditional of q given data, parameters
– proportional to prior pr(q | m, )
• weight toward q that agrees with flanking markers
– proportional to likelihood pr(y | q, )
• weight toward q with similar phenotype values
– posterior recombination model balances these two
• this is the E-step of EM computations
pr ( y | q,  ) * pr ( q | m,  )
pr ( q | y, m,  ,  ) 
pr ( y | m,  ,  )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
14
Where are the loci  on the genome?
• prior over genome for QTL positions
– flat prior = no prior idea of loci
– or use prior studies to give more weight to some regions
• posterior depends on QTL genotypes q
pr( | m,q) = pr() pr(q | m,) / constant
– constant determined by averaging
• over all possible genotypes q
• over all possible loci  on entire map
• no easy way to write down posterior
QTL 2: Bayes
Seattle SISG: Yandell © 2008
15
what is the genetic architecture ?
• which positions correspond to QTLs?
– priors on loci (previous slide)
• which QTL have main effects?
– priors for presence/absence of main effects
• same prior for all QTL
• can put prior on each d.f. (1 for BC, 2 for F2)
• which pairs of QTL have epistatic interactions?
– prior for presence/absence of epistatic pairs
• depends on whether 0,1,2 QTL have main effects
• epistatic effects less probable than main effects
QTL 2: Bayes
Seattle SISG: Yandell © 2008
16
 = genetic architecture:
loci:
main QTL
epistatic pairs
effects:
add, dom
aa, ad, dd
QTL 2: Bayes
Seattle SISG: Yandell © 2008
17
Bayesian priors & posteriors
• augmenting with missing genotypes q
– prior is recombination model
– posterior is (formally) E step of EM algorithm
• sampling phenotype model parameters 
– prior is “flat” normal at grand mean (no information)
– posterior shrinks genotypic means toward grand mean
– (details for unexplained variance omitted here)
• sampling QTL loci 
– prior is flat across genome (all loci equally likely)
• sampling QTL genetic architecture model 
– number of QTL
• prior is Poisson with mean from previous IM study
– genetic architecture of main effects and epistatic interactions
• priors on epistasis depend on presence/absence of main effects
QTL 2: Bayes
Seattle SISG: Yandell © 2008
18
2. Markov chain sampling
• construct Markov chain around posterior
– want posterior as stable distribution of Markov chain
– in practice, the chain tends toward stable distribution
• initial values may have low posterior probability
• burn-in period to get chain mixing well
• sample QTL model components from full conditionals
–
–
–
–
sample locus  given q, (using Metropolis-Hastings step)
sample genotypes q given ,,y, (using Gibbs sampler)
sample effects  given q,y, (using Gibbs sampler)
sample QTL model  given ,,y,q (using Gibbs or M-H)
( , q,  ,  ) ~ pr ( , q,  ,  | y, m)
( , q,  ,  )1  ( , q,  ,  )2    ( , q,  ,  ) N
QTL 2: Bayes
Seattle SISG: Yandell © 2008
19
MCMC sampling of unknowns (q,µ,)
for given genetic architecture 
• Gibbs sampler
– genotypes q
– effects µ
– not loci 
q ~ pr ( q | yi , mi ,  ,  )
pr ( y | q,  ) pr (  )
~
pr ( y | q )
pr ( q | m,  ) pr ( | m )
~
pr ( q | m )
• Metropolis-Hastings sampler
– extension of Gibbs sampler
– does not require normalization
• pr( q | m ) = sum pr( q | m,  ) pr( )
QTL 2: Bayes
Seattle SISG: Yandell © 2008
20
Gibbs sampler
for two genotypic means
• want to study two correlated effects
– could sample directly from their bivariate distribution
– assume correlation  is known
• instead use Gibbs sampler:
– sample each effect from its full conditional given the other
– pick order of sampling at random
– repeat many times
  0  1
 1 
  ~ N   , 
 2 
  0  
 
 
1 
1 ~ N 2 ,1   2 
2 ~ N 1 ,1   2 
QTL 2: Bayes
Seattle SISG: Yandell © 2008
21
Gibbs sampler samples:  = 0.6
N = 200 samples
3
-2
1
0
-2
-1
Gibbs: mean 2
2
1
0
-1
Gibbs: mean 1
2
3
2
1
0
Gibbs: mean 2
-1
1
0
-1
-2
-2
Gibbs: mean 1
2
N = 50 samples
2
0
100
150
200
-2
Gibbs: mean 2
-1
0
1
2
3
Gibbs: mean 1
2
3
2
1
0
-2
-2
Gibbs: mean 2
50
Markov chain index
2
1
1
0
-1
3
2
1
0
-1
-2
Gibbs: mean 2
-1
Gibbs: mean 1
0
-2
-1
50
Gibbs: mean 2
40
-2
30
1
20
0
10
Markov chain index
-1
0
0
10
20
30
40
Markov chain index
QTL 2: Bayes
50
-2
-1
0
1
Gibbs: mean 1
2
0
50
100
150
Markov chain index
Seattle SISG: Yandell © 2008
200
-2
-1
0
1
2
Gibbs: mean 1
22
3
full conditional for locus
• cannot easily sample from locus full conditional
pr( |y,m,µ,q) = pr(  | m,q)
= pr( q | m,  ) pr( ) / constant
• constant is very difficult to compute explicitly
– must average over all possible loci  over genome
– must do this for every possible genotype q
• Gibbs sampler will not work in general
– but can use method based on ratios of probabilities
– Metropolis-Hastings is extension of Gibbs sampler
QTL 2: Bayes
Seattle SISG: Yandell © 2008
23
Metropolis-Hastings idea
f()
0.4
• want to study distribution f()
• unless too complicated
0.2
– take Monte Carlo samples
– propose new value *
• near (?) current value 
• from some distribution g
– accept new value with prob a
0
2
4
6
0.4
• Metropolis-Hastings samples:
0.0
– take samples using ratios of f
0.2
0.0
-4
QTL 2: Bayes
10
g(–*)
• Gibbs sampler: a = 1 always
 f (* ) g (*   ) 

a  min 1,
* 
 f ( ) g (   ) 
8
Seattle SISG: Yandell © 2008
-2
0
2
4
24
0
0.0
0.1
2000
0.2
pr( |Y)
0.3
0.4
mcmc sequence
4000
6000
0.5
8000
0.6
10000
Metropolis-Hastings for locus 
0
2
4

6
8
10
2
3
4
5

6
7
8
added twist: occasionally propose from entire genome
QTL 2: Bayes
Seattle SISG: Yandell © 2008
25
800
400
0
mcmc sequence
800
400
0
histogram
pr( |Y)
1.0
0.0
0 2 4 6 8
Seattle SISG: Yandell © 2008
0.0 0.2 0.4 0.6

2.0

0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
histogram
pr( |Y)
0.0 0.4 0.8 1.2
pr( |Y)
histogram
4
2
0
pr( |Y)
histogram
6

QTL 2: Bayes
N = 1000 samples
narrow g
wide g
0 2 4 6 8
0 2 4 6 8
0 2 4 6 8
mcmc sequence
150
0 50
150
mcmc sequence
N = 200 samples
narrow g
wide g
0 50
mcmc sequence
Metropolis-Hastings samples

0 2 4 6 8
26
3. sampling genetic architectures
• search across genetic architectures A of various sizes
– allow change in number of QTL
– allow change in types of epistatic interactions
• methods for search
– reversible jump MCMC
– Gibbs sampler with loci indicators
• complexity of epistasis
– Fisher-Cockerham effects model
– general multi-QTL interaction & limits of inference
QTL 2: Bayes
Seattle SISG: Yandell © 2008
27
reversible jump MCMC
• consider known genotypes q at 2 known loci 
– models with 1 or 2 QTL
• M-H step between 1-QTL and 2-QTL models
– model changes dimension (via careful bookkeeping)
– consider mixture over QTL models H
  1 QTL : Y   0   ( q1 )  e
  2 QTL : Y   0   ( q1 )   ( q 2 )  e
QTL 2: Bayes
Seattle SISG: Yandell © 2008
28
geometry of reversible jump
0.6
0.6
0.8
Reversible Jump Sequence
0.8
Move Between Models
b2
0.2 0.4
b2
0.2 0.4
c21 = 0.7
0.0
0.0
m=2
m=1
0.0
QTL 2: Bayes
0.2
0.4
1b1
0.6
0.8
0.0
0.2
0.4
b1
0.6
0.8
1
Seattle SISG: Yandell © 2008
29
geometry allowing q and  to change
first 1000 with m<3
0.0
0.0
0.05
b2
0.1 0.2
b2
0.10
0.3
0.15
0.4
a short sequence
0.05
QTL 2: Bayes
0.10
 b1
1
0.15
-0.3 -0.2 -0.1 0.0 0.1 0.2
b1

Seattle SISG: Yandell © 2008
1
30
collinear QTL = correlated effects
8-week
additive
2
-0.2
-0.1
cor = -0.7
-0.6
-0.3
-0.4
-0.2
additive 2
cor = -0.81
0.0
0.0
4-week
-0.6
-0.4
-0.2
0.0
0.2
-0.2
additive 1
-0.1
0.0
0.1
0.2
additive 1
effect 1
effect 1
• linked QTL = collinear genotypes
 correlated estimates of effects (negative if in coupling phase)
 sum of linked effects usually fairly constant
QTL 2: Bayes
Seattle SISG: Yandell © 2008
31
sampling across QTL models 
0
1
m+1 2 … m
L
action steps: draw one of three choices
• update QTL model  with probability 1-b()-d()
– update current model using full conditionals
– sample QTL loci, effects, and genotypes
• add a locus with probability b()
– propose a new locus along genome
– innovate new genotypes at locus and phenotype effect
– decide whether to accept the “birth” of new locus
• drop a locus with probability d()
– propose dropping one of existing loci
– decide whether to accept the “death” of locus
QTL 2: Bayes
Seattle SISG: Yandell © 2008
32
Gibbs sampler with loci indicators
• consider only QTL at pseudomarkers
– every 1-2 cM
– modest approximation with little bias
• use loci indicators in each pseudomarker
–  = 1 if QTL present
–  = 0 if no QTL present
• Gibbs sampler on loci indicators 
– relatively easy to incorporate epistasis
– Yi, Yandell, Churchill, Allison, Eisen, Pomp (2005 Genetics)
• (see earlier work of Nengjun Yi and Ina Hoeschele)
 q     1  ( q1 )   2  ( q2 ) ,  k  0,1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
33
Bayesian shrinkage estimation
• soft loci indicators
– strength of evidence for j depends on 
– 0    1 (grey scale)
– shrink most s to zero
• Wang et al. (2005 Genetics)
– Shizhong Xu group at U CA Riverside
 q   0   1 1 ( q1 )   2  2 ( q1 ), 0   k  1
QTL 2: Bayes
Seattle SISG: Yandell © 2008
34
4. criteria for model selection
balance fit against complexity
• classical information criteria
– penalize likelihood L by model size ||
– IC = – 2 log L( | y) + penalty()
– maximize over unknowns
• Bayes factors
– marginal posteriors pr(y | )
– average over unknowns
QTL 2: Bayes
Seattle SISG: Yandell © 2008
35
classical information criteria
• start with likelihood L( | y, m)
– measures fit of architecture () to phenotype (y)
• given marker data (m)
– genetic architecture () depends on parameters
• have to estimate loci (µ) and effects ()
• complexity related to number of parameters
– | | = size of genetic architecture
• BC:
| | = 1 + n.qtl + n.qtl(n.qtl - 1) = 1 + 4 + 12 = 17
• F2:
| | = 1 + 2n.qtl +4n.qtl(n.qtl - 1) = 1 + 8 + 48 = 57
QTL 2: Bayes
Seattle SISG: Yandell © 2008
36
classical information criteria
• construct information criteria
– balance fit to complexity
– Akaike
AIC = –2 log(L) + 2 ||
– Bayes/Schwartz BIC = –2 log(L) + || log(n)
– Broman
BIC = –2 log(L) +  || log(n)
– general form: IC = –2 log(L) + || D(n)
• compare models
– hypothesis testing: designed for one comparison
• 2 log[LR(1, 2)] = L(y|m, 2) – L(y|m, 1)
– model selection: penalize complexity
• IC(1, 2) = 2 log[LR(1, 2)] + (|2| – |1|) D(n)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
37
information criteria vs. model size
WinQTL 2.0
SCD data on F2
A=AIC
1=BIC(1)
2=BIC(2)
d=BIC()
models
d
d
information criteria
300
320
340
•
•
•
•
•
•
•
360
d
d
d
1
A
1
1
3
1
1
1A
A
2
2
2
• 2+5+9+2
• 2:2 AD
2
2
d2
d
2
– 1,2,3,4 QTL
– epistasis
2
d
2
A
A
A
4
5
6
7
model parameters p
1
1
A
A
8
9
epistasis
QTL 2: Bayes
Seattle SISG: Yandell © 2008
38
Bayes factors
• ratio of model likelihoods
– ratio of posterior to prior odds for architectures
– averaged over unknowns
pr ( 1 | y , m) / pr ( 2 | y , m) pr ( y | m,  1 )
B12 

pr ( 1 ) / pr ( 2 )
pr ( y | m,  2 )
• roughly equivalent to BIC
– BIC maximizes over unknowns
– BF averages over unknowns
 2 log( B12 )  2 log( LR)  (|  2 |  |  1 |) log( n)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
39
scan of marginal Bayes factor & effect
QTL 2: Bayes
Seattle SISG: Yandell © 2008
40
issues in computing Bayes factors
• BF insensitive to shape of prior on 
– geometric, Poisson, uniform
– precision improves when prior mimics posterior
• BF sensitivity to prior variance on effects 
– prior variance should reflect data variability
– resolved by using hyper-priors
• automatic algorithm; no need for user tuning
• easy to compute Bayes factors from samples
– sample posterior using MCMC
– posterior pr( | y, m) is marginal histogram
QTL 2: Bayes
Seattle SISG: Yandell © 2008
41
• sampled marginal histogram
• shape affected by prior pr(A)
BF 1 , 2
prior probability
0.10
0.20
– prior pr() chosen by user
– posterior pr( |y,m)
e
e
pr ( 1|y , m) /pr ( 1 )

pr ( 2|y , m) /pr ( 2 )
e
p
u
exponential
Poisson
uniform
p p
e u u
p u u
u u u
p
e
p
e
e p
p
e
0.00
• | | = number of QTL
0.30
Bayes factors & genetic architecture 
0
e e
p
e e
p
u p
u p
u u
2
4
6
8
m = number of QTL
10
• pattern of QTL across genome
• gene action and epistasis
QTL 2: Bayes
Seattle SISG: Yandell © 2008
42
3 4
BF sensitivity to fixed prior for effects
Bayes factors
0.5
1
2
4
3
2
3
4
2
4
3
2
4
3
2
4
3
2
4
3
2
4
3
2
3
4
2
4
3
2
1
1
1
1
1
0.2
3
4
2
4
3
2
1
1
1
1
0.05
1
0.20
1
0.50
2.00 5.00 2
hyper-prior heritability h
4
3
2
1
B45
B34
B23
B12
20.00 50.00
2
qj ~ N0, G2 / m, G2  h2 total
, h2 fixed
QTL 2: Bayes
Seattle SISG: Yandell © 2008
43
BF insensitivity to random effects prior
insensitivity to hyper-prior
1.0
3.0
hyper-prior density 2*Beta(a,b)
0.0
0.5
1.0
1.5
2
hyper-parameter heritability h
3
2
3
2
3
2
3
2
Bayes factors
0.2
0.4 0.6
0.0
density
1.0
2.0
0.25,9.75
0.5,9.5
1,9
2,10
1,3
1,1
2.0
3
2
1
1
1
1
0.05
0.10
0.20
2
Eh
1
3
3
2
2
B34
B23
B12
1
1
0.50
1.00
2
qj ~ N0, G2 / m, G2  h2 total
, 12 h2 ~ Beta (a, b)
QTL 2: Bayes
Seattle SISG: Yandell © 2008
44
Download