Model Selection

QTL Model Selection
1. Bayesian strategy
2. Markov chain sampling
3. sampling genetic architectures
4. criteria for model selection
Seattle SISG: Yandell © 2012
QTL model selection: key players
• observed measurements
  – y = phenotypic trait
  – m = markers & linkage map
  – i = individual index (1,…,n)
• missing data
  – missing marker data
  – q = QT genotypes
    • alleles QQ, Qq, or qq at locus
• unknown quantities
  – λ = QT locus (or loci)
  – μ = phenotype model parameters
  – γ = QTL model/genetic architecture
• pr(q|m,λ,γ) genotype model
  – grounded by linkage map, experimental cross
  – recombination yields multinomial for q given m
• pr(y|q,μ,γ) phenotype model
  – distribution shape (assumed normal here)
  – unknown parameters μ (could be non-parametric)
after Sen Churchill (2001)
QTL mapping (from ZB Zeng)
[diagram: markers m → QTL genotypes q via pr(q|m,λ,γ) → phenotype y via pr(y|q,μ,γ)]
classical likelihood approach
• genotype model pr(q|m,λ,γ)
  – missing genotypes q depend on observed markers m across genome
• phenotype model pr(y|q,μ,γ)
  – link phenotypes y to genotypes q

LOD(λ) = log10 { max_μ pr(y | m, μ, λ) } − c

likelihood mixes over missing QTL genotypes:
pr(y | m, μ, λ) = Σ_q pr(y | q, μ) pr(q | m, λ)
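Not from the slides, but a minimal Python sketch of this mixture likelihood, assuming a normal phenotype model with known genotypic means; the genotype probabilities stand in for pr(q | m, λ), and all names and numbers are illustrative.

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def log10_mixture_likelihood(y, geno_probs, means, sigma):
    """y: phenotypes; geno_probs: one dict per individual, standing in for pr(q | m, lambda)."""
    total = 0.0
    for y_i, gp in zip(y, geno_probs):
        mix = sum(gp[q] * normal_pdf(y_i, means[q], sigma) for q in gp)  # sum over genotypes q
        total += np.log10(mix)
    return total

# toy usage: two individuals with different flanking-marker information
y = [10.2, 12.1]
geno_probs = [{"AA": 0.9, "Aa": 0.1}, {"AA": 0.3, "Aa": 0.7}]
print(log10_mixture_likelihood(y, geno_probs, means={"AA": 10.0, "Aa": 12.0}, sigma=1.0))
```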
EM approach
• Iterate E and M steps
– expectation (E): genotype probabilities pr(q|m,λ,γ)
– maximization (M): phenotype model parameters μ
• mean, effects, variance
– careful attention when many QTL present
• Multiple papers by Zhao-Bang Zeng and others
– Start with simple initial model
• Add QTL, epistatic effects sequentially
classic model search
• initial model from single QTL analysis
• search for additional QTL
• search for epistasis between pairs of QTL
– Both in model? One in model? Neither?
• Refine model
– Update QTL positions
– Check if existing QTL can be dropped
• Analogous to stepwise regression
comparing models (details later)
• balance model fit against model complexity
  – want to fit data well (maximum likelihood)
  – without getting too complicated a model

                     smaller model         bigger model
fit model            miss key features     fits better
estimate phenotype   may be biased         no bias
predict new data     may be biased         no bias
interpret model      easier                more complicated
estimate effects     low variance          high variance
1. Bayesian strategy for QTL study
• augment data (y,m) with missing genotypes q
• study unknowns (λ,μ,γ) given augmented data (y,m,q)
  – find better genetic architectures γ
  – find most likely genomic regions = QTL = λ
  – estimate phenotype parameters = genotype means = μ
• sample from posterior in some clever way
  – multiple imputation (Sen Churchill 2002)
  – Markov chain Monte Carlo (MCMC)
    • (Satagopan et al. 1996; Yi et al. 2005, 2007)

posterior for q, λ, μ, γ:
pr(q, λ, μ, γ | y, m) = likelihood × prior / constant
                      = phenotype likelihood × [prior for q, λ, μ, γ] / constant
                      = pr(y | q, μ, γ) [pr(q | m, λ, γ) pr(μ | γ) pr(λ | m, γ) pr(γ)] / pr(y | m)
Bayes posterior for normal data
[figure: posterior for a normal mean under small and large prior variance, plotted over phenotype values y; the posterior sits near the prior mean when n is small and near the actual mean when n is large]
Posterior on genotypic means?
phenotype model pr(y|q,μ)
[figure: phenotype values y by genotype qq, Qq, QQ, with prior mean, data means, and posterior means; posterior means shrink toward the prior mean when n is small and track the data means when n is large]
Bayes posterior QTL means
posterior centered on sample genotypic mean
but shrunken slightly toward overall mean

phenotype mean:   E(y | q) = μ_q,  V(y | q) = σ²
genotypic prior:  E(μ_q) = ȳ,      V(μ_q) = σ²
posterior:        E(μ_q | y) = b_q ȳ_q + (1 − b_q) ȳ,  V(μ_q | y) = b_q σ² / n_q
shrinkage:        b_q = n_q / (n_q + 1) → 1,
                  n_q = count{q_i = q},  ȳ_q = Σ_{q_i = q} y_i / n_q
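A minimal sketch of these shrinkage formulas, assuming (as on the slide) a prior variance equal to the residual variance so that b_q = n_q/(n_q+1); the function name and toy data are made up.

```python
import numpy as np

def posterior_genotypic_means(y, q):
    """Posterior means b_q*ybar_q + (1-b_q)*ybar with b_q = n_q/(n_q+1)."""
    y, q = np.asarray(y, float), np.asarray(q)
    ybar = y.mean()                          # overall mean = prior mean
    post = {}
    for g in np.unique(q):
        yg = y[q == g]
        n_q = len(yg)                        # n_q = count{q_i = q}
        b_q = n_q / (n_q + 1.0)              # shrinkage factor, -> 1 as n_q grows
        post[g] = b_q * yg.mean() + (1.0 - b_q) * ybar
    return post

# toy F2 phenotypes: three genotype groups with different true means
rng = np.random.default_rng(1)
q = np.repeat(["qq", "Qq", "QQ"], [5, 10, 5])
y = np.concatenate([rng.normal(8, 1, 5), rng.normal(10, 1, 10), rng.normal(12, 1, 5)])
print(posterior_genotypic_means(y, q))
```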
pr(q|m,λ) recombination model
pr(q|m,λ) = pr(geno | map, locus) ≈ pr(geno | flanking markers, locus)
[diagram: markers m1–m6 along a chromosome, with QTL genotype q at locus λ between flanking markers]
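As one concrete illustration of pr(geno | flanking markers, locus), here is a minimal sketch for a backcross, assuming Haldane's map function and no interference; the function names and distances are illustrative, not code from the slides.

```python
import math

def haldane(d_cm):
    """Recombination fraction from map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cm / 100.0))

def bc_qtl_genotype_probs(g_left, g_right, d_left_cm, d_right_cm):
    """pr(QTL genotype | flanking marker genotypes) for a backcross ('AA' or 'Aa'),
    assuming no interference; distances run from the putative locus to each marker."""
    r1, r2 = haldane(d_left_cm), haldane(d_right_cm)
    probs = {}
    for q in ("AA", "Aa"):
        p_left = (1.0 - r1) if q == g_left else r1    # left marker to QTL
        p_right = (1.0 - r2) if q == g_right else r2  # QTL to right marker
        probs[q] = p_left * p_right
    total = sum(probs.values())                       # normalize over the two genotypes
    return {q: p / total for q, p in probs.items()}

# QTL midway between flanking markers 10 cM apart that disagree: about 50:50
print(bc_qtl_genotype_probs("AA", "Aa", 5.0, 5.0))
```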
what are likely QTL genotypes q?
how does phenotype y improve guess?
[figure: phenotype (bp, 90–120) plotted by genotype at markers D4Mit41 and D4Mit214; what are the probabilities for genotype q between the markers? recombinants AA:AB are all 1:1 if we ignore y; and if we use y?]
posterior on QTL genotypes q
• full conditional of q given data, parameters
  – proportional to prior pr(q | m, λ)
    • weight toward q that agrees with flanking markers
  – proportional to likelihood pr(y | q, μ)
    • weight toward q with similar phenotype values
  – posterior recombination model balances these two
    • this is the E-step of EM computations

pr(q | y, m, μ, λ) = pr(y | q, μ) pr(q | m, λ) / pr(y | m, μ, λ)
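A minimal sketch of this full conditional (the E-step weighting), assuming a normal phenotype model; prior_q stands in for the recombination model pr(q | m, λ), and the numbers are toy values.

```python
import numpy as np

def genotype_posterior(y_i, prior_q, means, sigma):
    """pr(q | y_i, m, mu, lambda), proportional to pr(y_i | q, mu) * pr(q | m, lambda)."""
    like = {q: np.exp(-0.5 * ((y_i - means[q]) / sigma) ** 2) for q in prior_q}
    weight = {q: prior_q[q] * like[q] for q in prior_q}   # likelihood x prior
    total = sum(weight.values())                          # normalizing constant
    return {q: w / total for q, w in weight.items()}

# toy usage: flanking markers favor AA, but the phenotype looks more like AB
print(genotype_posterior(y_i=11.8,
                         prior_q={"AA": 0.9, "AB": 0.1},
                         means={"AA": 10.0, "AB": 12.0},
                         sigma=1.0))
```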
Where are the loci λ on the genome?
• prior over genome for QTL positions
  – flat prior = no prior idea of loci
  – or use prior studies to give more weight to some regions
• posterior depends on QTL genotypes q
  pr(λ | m, q) = pr(λ) pr(q | m, λ) / constant
  – constant determined by averaging
    • over all possible genotypes q
    • over all possible loci λ on entire map
  – no easy way to write down posterior
what is the genetic architecture γ?
• which positions correspond to QTLs?
– priors on loci (previous slide)
• which QTL have main effects?
– priors for presence/absence of main effects
• same prior for all QTL
• can put prior on each d.f. (1 for BC, 2 for F2)
• which pairs of QTL have epistatic interactions?
– prior for presence/absence of epistatic pairs
• depends on whether 0,1,2 QTL have main effects
• epistatic effects less probable than main effects
γ = genetic architecture:
  loci: main QTL; epistatic pairs
  effects: add, dom (main); aa, ad, dd (epistatic pairs)
Bayesian priors & posteriors
• augmenting with missing genotypes q
– prior is recombination model
– posterior is (formally) E step of EM algorithm
• sampling phenotype model parameters μ
– prior is “flat” normal at grand mean (no information)
– posterior shrinks genotypic means toward grand mean
– (details for unexplained variance omitted here)
• sampling QTL loci λ
– prior is flat across genome (all loci equally likely)
• sampling QTL genetic architecture model γ
– number of QTL
• prior is Poisson with mean from previous IM study
– genetic architecture of main effects and epistatic interactions
• priors on epistasis depend on presence/absence of main effects
2. Markov chain sampling
• construct Markov chain around posterior
– want posterior as stable distribution of Markov chain
– in practice, the chain tends toward stable distribution
• initial values may have low posterior probability
• burn-in period to get chain mixing well
• sample QTL model components from full conditionals
  – sample locus λ given q,γ (using Metropolis-Hastings step)
  – sample genotypes q given λ,μ,y,γ (using Gibbs sampler)
  – sample effects μ given q,y,γ (using Gibbs sampler)
  – sample QTL model γ given λ,μ,y,q (using Gibbs or M-H)

(λ, q, μ, γ) ~ pr(λ, q, μ, γ | y, m)
(λ, q, μ, γ)_1 → (λ, q, μ, γ)_2 → … → (λ, q, μ, γ)_N
MCMC sampling of unknowns (q,μ,λ)
for given genetic architecture γ
• Gibbs sampler
  – genotypes q:  q ~ pr(q | y_i, m_i, μ, λ)
  – effects μ:    μ ~ pr(y | q, μ) pr(μ) / pr(y | q)
  – not loci λ:   λ ~ pr(q | m, λ) pr(λ | m) / pr(q | m)
• Metropolis-Hastings sampler
  – extension of Gibbs sampler
  – does not require normalization
    • pr(q | m) = Σ_λ pr(q | m, λ) pr(λ)
Gibbs sampler for two genotypic means
• want to study two correlated effects
  – could sample directly from their bivariate distribution
  – assume correlation ρ is known
• instead use Gibbs sampler:
  – sample each effect from its full conditional given the other
  – pick order of sampling at random
  – repeat many times

(μ1, μ2) ~ N( (0, 0), [[1, ρ], [ρ, 1]] )
μ1 | μ2 ~ N(ρ μ2, 1 − ρ²)
μ2 | μ1 ~ N(ρ μ1, 1 − ρ²)
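A minimal sketch of this two-mean Gibbs sampler, assuming ρ is known; written in Python for illustration only.

```python
import numpy as np

def gibbs_two_means(rho=0.6, n_samples=200, seed=0):
    """Alternately draw each mean from its full conditional given the other."""
    rng = np.random.default_rng(seed)
    mu1, mu2 = 0.0, 0.0                          # arbitrary starting values
    sd = np.sqrt(1.0 - rho ** 2)                 # conditional s.d., same for both
    draws = np.empty((n_samples, 2))
    for t in range(n_samples):
        if rng.random() < 0.5:                   # pick order of sampling at random
            mu1 = rng.normal(rho * mu2, sd)      # mu1 | mu2 ~ N(rho mu2, 1 - rho^2)
            mu2 = rng.normal(rho * mu1, sd)      # mu2 | mu1 ~ N(rho mu1, 1 - rho^2)
        else:
            mu2 = rng.normal(rho * mu1, sd)
            mu1 = rng.normal(rho * mu2, sd)
        draws[t] = (mu1, mu2)
    return draws

samples = gibbs_two_means(rho=0.6, n_samples=200)
print(np.corrcoef(samples.T)[0, 1])              # empirical correlation, roughly 0.6
```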
Gibbs sampler samples: ρ = 0.6
[figure: trace plots of Gibbs draws for mean 1 and mean 2 against Markov chain index, and scatter plots of mean 2 vs. mean 1, for N = 50 and N = 200 samples]
full conditional for locus
• cannot easily sample from locus full conditional
  pr(λ | y, m, μ, q) = pr(λ | m, q)
                     = pr(q | m, λ) pr(λ) / constant
• constant is very difficult to compute explicitly
  – must average over all possible loci λ over genome
  – must do this for every possible genotype q
• Gibbs sampler will not work in general
  – but can use method based on ratios of probabilities
  – Metropolis-Hastings is extension of Gibbs sampler
Metropolis-Hastings idea
• want to study distribution f(θ)
  – take Monte Carlo samples
    • unless too complicated
  – propose new value θ*
    • near (?) current value θ
    • from some distribution g
  – accept new value with prob a
    • Gibbs sampler: a = 1 always
• Metropolis-Hastings samples:
  – take samples using ratios of f

a = min{ 1, [f(θ*) g(θ − θ*)] / [f(θ) g(θ* − θ)] }

[figure: target density f(θ) and proposal density g(θ − θ*)]
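A minimal Metropolis-Hastings sketch in the spirit of this slide, assuming a symmetric proposal g so the g terms cancel (plain Metropolis); the target f is an arbitrary bimodal stand-in, not a QTL posterior.

```python
import numpy as np

def f(theta):
    """Unnormalized target density; stands in for pr(lambda | y, m)."""
    return np.exp(-0.5 * (theta - 2.0) ** 2) + 0.5 * np.exp(-0.5 * ((theta - 6.0) / 0.8) ** 2)

def metropolis(n_samples=10000, width=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = 4.0                                    # starting value
    out = np.empty(n_samples)
    for t in range(n_samples):
        proposal = theta + rng.normal(0.0, width)  # symmetric g, so the g ratio cancels
        a = min(1.0, f(proposal) / f(theta))       # accept new value with probability a
        if rng.random() < a:
            theta = proposal
        out[t] = theta
    return out

draws = metropolis()
print(draws.mean(), draws.std())
```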
Metropolis-Hastings for locus λ
[figure: MCMC sequence of λ over 10000 sweeps and histogram of pr(λ|Y) along the chromosome]
added twist: occasionally propose from entire genome
Metropolis-Hastings samples
[figure: MCMC sequences and histograms of pr(λ|Y) for N = 200 and N = 1000 samples, with narrow and wide proposal densities g]
3. sampling genetic architectures
• search across genetic architectures γ of various sizes
– allow change in number of QTL
– allow change in types of epistatic interactions
• methods for search
– reversible jump MCMC
– Gibbs sampler with loci indicators
• complexity of epistasis
– Fisher-Cockerham effects model
– general multi-QTL interaction & limits of inference
reversible jump MCMC
• consider known genotypes q at 2 known loci λ
  – models with 1 or 2 QTL
• M-H step between 1-QTL and 2-QTL models
  – model changes dimension (via careful bookkeeping)
  – consider mixture over QTL models H

H1: 1 QTL: Y = β0 + β(q1) + e
H2: 2 QTL: Y = β0 + β1(q1) + β2(q2) + e
geometry of reversible jump
[figure: two panels, "Move Between Models" and "Reversible Jump Sequence", plotting effect b2 against b1; jumps between the m=1 and m=2 models, with c21 = 0.7]
geometry allowing q and λ to change
[figure: two panels of effect b2 against b1, "a short sequence" and "first 1000 with m<3"]
collinear QTL = correlated effects
[figure: scatter plots of additive effect 2 vs. additive effect 1 for 4-week (cor = -0.81) and 8-week (cor = -0.7) phenotypes]
• linked QTL = collinear genotypes
  – correlated estimates of effects (negative if in coupling phase)
  – sum of linked effects usually fairly constant
sampling across QTL models γ
[diagram: loci λ1 … λm and a proposed λm+1 along the genome from 0 to L]
action steps: draw one of three choices
• update QTL model γ with probability 1-b(γ)-d(γ)
  – update current model using full conditionals
  – sample QTL loci, effects, and genotypes
• add a locus with probability b(γ)
  – propose a new locus along genome
  – innovate new genotypes at locus and phenotype effect
  – decide whether to accept the "birth" of new locus
• drop a locus with probability d(γ)
  – propose dropping one of existing loci
  – decide whether to accept the "death" of locus
Gibbs sampler with loci indicators
• consider only QTL at pseudomarkers
  – every 1-2 cM
  – modest approximation with little bias
• use loci indicators in each pseudomarker
  – γ = 1 if QTL present
  – γ = 0 if no QTL present
• Gibbs sampler on loci indicators γ
  – relatively easy to incorporate epistasis
  – Yi, Yandell, Churchill, Allison, Eisen, Pomp (2005 Genetics)
    • (see earlier work of Nengjun Yi and Ina Hoeschele)

μ_q = μ + γ1 β1(q1) + γ2 β2(q2),  γk = 0, 1
Bayesian shrinkage estimation
• soft loci indicators
  – strength of evidence for λj depends on γ
  – 0 ≤ γ ≤ 1 (grey scale)
  – shrink most γs to zero
• Wang et al. (2005 Genetics)
  – Shizhong Xu group at U CA Riverside

μ_q = μ0 + γ1 β1(q1) + γ2 β2(q2),  0 ≤ γk ≤ 1
other model selection approaches
• include all potential loci in model
• assume “true” model is “sparse” in some sense
• Sparse partial least squares
– Chun, Keles (2009 Genetics; 2010 JRSSB)
• LASSO model selection
– Foster (2006); Foster Verbyla Pitchford (2007 JABES)
– Xu (2007 Biometrics); Yi Xu (2007 Genetics)
– Shi Wahba Wright Klein Klein (2008 Stat & Infer)
4. criteria for model selection
balance fit against complexity
• classical information criteria
  – penalize likelihood L by model size |γ|
  – IC = −2 log L(γ | y) + penalty(γ)
  – maximize over unknowns
• Bayes factors
  – marginal posteriors pr(y | γ)
  – average over unknowns
classical information criteria
• start with likelihood L(γ | y, m)
  – measures fit of architecture (γ) to phenotype (y)
    • given marker data (m)
  – genetic architecture (γ) depends on parameters
    • have to estimate loci (λ) and effects (μ)
• complexity related to number of parameters
  – |γ| = size of genetic architecture
    • BC: |γ| = 1 + n.qtl + n.qtl(n.qtl - 1) = 1 + 4 + 12 = 17
    • F2: |γ| = 1 + 2 n.qtl + 4 n.qtl(n.qtl - 1) = 1 + 8 + 48 = 57
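A small sketch of this architecture-size bookkeeping, mirroring the slide's count of main-effect and pairwise-epistasis parameters; the cross type and number of QTL are the only inputs, and the function name is illustrative.

```python
def architecture_size(n_qtl, cross="BC"):
    """|gamma| = 1 (grand mean) + main-effect d.f. + pairwise-epistasis d.f.;
    per-QTL d.f. is 1 for a backcross (BC), 2 for an F2 intercross."""
    df = 1 if cross == "BC" else 2
    main = df * n_qtl
    epistasis = df ** 2 * n_qtl * (n_qtl - 1)  # n.qtl(n.qtl - 1) pairs, as counted on the slide
    return 1 + main + epistasis

print(architecture_size(4, "BC"))  # 1 + 4 + 12 = 17
print(architecture_size(4, "F2"))  # 1 + 8 + 48 = 57
```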
classical information criteria
• construct information criteria
  – balance fit to complexity
  – Akaike:        AIC = −2 log(L) + 2 |γ|
  – Bayes/Schwarz: BIC = −2 log(L) + |γ| log(n)
  – Broman:        BIC_δ = −2 log(L) + δ |γ| log(n)
  – general form:  IC = −2 log(L) + |γ| D(n)
• compare models
  – hypothesis testing: designed for one comparison
    • 2 log[LR(γ1, γ2)] = 2 [log L(y | m, γ2) − log L(y | m, γ1)]
  – model selection: penalize complexity
    • IC(γ1, γ2) = 2 log[LR(γ1, γ2)] + (|γ2| − |γ1|) D(n)
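A minimal sketch of the generic criterion IC = −2 log(L) + |γ| D(n), comparing a few architectures; the log-likelihoods, model sizes, and sample size below are made-up numbers for illustration only.

```python
import math

def ic(loglik, size, n, penalty="BIC", delta=2.0):
    """Generic information criterion: -2 log(L) + |gamma| * D(n)."""
    D = {"AIC": 2.0, "BIC": math.log(n), "BIC_delta": delta * math.log(n)}[penalty]
    return -2.0 * loglik + size * D

n = 200  # made-up sample size
models = {"2 QTL": (-310.0, 5), "3 QTL": (-305.0, 7), "3 QTL + epistasis": (-303.0, 9)}
for name, (loglik, size) in models.items():
    print(name, "AIC:", round(ic(loglik, size, n, "AIC"), 1),
          "BIC:", round(ic(loglik, size, n, "BIC"), 1))
```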
information criteria vs. model size
WinQTL 2.0, SCD data on F2
[figure: information criteria (300–360) plotted against number of model parameters p;
A = AIC, 1 = BIC(1), 2 = BIC(2), d = BIC(δ);
models include 1, 2, 3, 4 QTL, epistasis, 2+5+9+2, and 2:2 AD]
Bayes factors
• ratio of model likelihoods
  – ratio of posterior to prior odds for architectures
  – averaged over unknowns

B12 = [pr(γ1 | y, m) / pr(γ2 | y, m)] / [pr(γ1) / pr(γ2)] = pr(y | m, γ1) / pr(y | m, γ2)

• roughly equivalent to BIC
  – BIC maximizes over unknowns
  – BF averages over unknowns

2 log(B12) ≈ 2 log(LR) + (|γ2| − |γ1|) log(n)
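A minimal sketch of this BIC-style approximation; the maximized log-likelihoods, architecture sizes, and sample size are made-up values.

```python
import math

def two_log_bf_approx(loglik1, size1, loglik2, size2, n):
    """2 log(B12) ~ 2 log(LR) + (|gamma2| - |gamma1|) log(n), with LR = L1/L2."""
    return 2.0 * (loglik1 - loglik2) + (size2 - size1) * math.log(n)

# e.g. 3-QTL architecture (|gamma| = 7) vs 2-QTL (|gamma| = 5), n = 200 individuals;
# positive values favor architecture 1
print(two_log_bf_approx(-305.0, 7, -310.0, 5, 200))
```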
scan of marginal Bayes factor & effect
issues in computing Bayes factors
• BF insensitive to shape of prior on γ
  – geometric, Poisson, uniform
  – precision improves when prior mimics posterior
• BF sensitivity to prior variance on effects
  – prior variance should reflect data variability
  – resolved by using hyper-priors
    • automatic algorithm; no need for user tuning
• easy to compute Bayes factors from samples
  – sample posterior using MCMC
  – posterior pr(γ | y, m) is marginal histogram
Bayes factors & genetic architecture γ
• |γ| = number of QTL
  – sampled marginal histogram
  – shape affected by prior pr(γ)
    • prior pr(γ) chosen by user
    • posterior pr(γ | y, m)
• pattern of QTL across genome
• gene action and epistasis

BF_{γ1,γ2} = [pr(γ1 | y, m) / pr(γ1)] / [pr(γ2 | y, m) / pr(γ2)]

[figure: prior probability (0.0–0.3) against m = number of QTL (0–10) for exponential (e), Poisson (p), and uniform (u) priors]
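A minimal sketch of reading a Bayes factor off the sampled marginal histogram for the number of QTL, as posterior odds divided by prior odds; the MCMC draws and the Poisson prior mean are made up.

```python
import math
import numpy as np

def poisson_pmf(m, mean=3.0):
    """Prior pr(m QTL) under a Poisson prior on the number of QTL."""
    return math.exp(-mean) * mean ** m / math.factorial(m)

def bf_from_samples(m_draws, m1, m2, prior_pmf=poisson_pmf):
    """Bayes factor for m1 vs m2 QTL = (posterior odds) / (prior odds)."""
    m_draws = np.asarray(m_draws)
    post1, post2 = np.mean(m_draws == m1), np.mean(m_draws == m2)
    return (post1 / prior_pmf(m1)) / (post2 / prior_pmf(m2))

rng = np.random.default_rng(0)
fake_draws = rng.choice([1, 2, 3, 4], size=5000, p=[0.1, 0.5, 0.3, 0.1])  # stand-in MCMC output
print(bf_from_samples(fake_draws, 2, 3))
```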
BF sensitivity to fixed prior for effects
[figure: Bayes factors B12, B23, B34, B45 against fixed hyper-prior heritability h² (0.05–50), showing strong sensitivity to the fixed prior]
β_qj ~ N(0, σ_G²/m),  σ_G² = h² σ²_total,  h² fixed
BF insensitivity to random effects prior
[figure: left panel, hyper-prior densities 2*Beta(a,b) for heritability h² with (a,b) = (0.25,9.75), (0.5,9.5), (1,9), (2,10), (1,3), (1,1); right panel, Bayes factors B12, B23, B34 against E(h²) (0.05–1.00), showing insensitivity to the hyper-prior]
β_qj ~ N(0, σ_G²/m),  σ_G² = h² σ²_total,  ½ h² ~ Beta(a, b)