QTL Model Selection
1. Bayesian strategy
2. Markov chain sampling
3. sampling genetic architectures
4. criteria for model selection
Model Selection
Seattle SISG: Yandell © 2012
QTL model selection: key players
• observed measurements
– y = phenotypic trait
– m = markers & linkage map
– i = individual index (1,…,n)
• missing data
– missing marker data
– q = QT genotypes
• alleles QQ, Qq, or qq at locus
• unknown quantities
– λ = QT locus (or loci)
– µ = phenotype model parameters
– γ = QTL model/genetic architecture
• pr(q|m,λ,γ) genotype model
– grounded by linkage map, experimental cross
– recombination yields multinomial for q given m
• pr(y|q,µ,γ) phenotype model
– distribution shape (assumed normal here)
– unknown parameters (could be non-parametric)
[Diagram: observed markers m (X) and phenotype y (Y), missing genotypes q (Q), and unknown quantities λ, µ, γ; after Sen & Churchill (2001)]
QTL mapping (from ZB Zeng)
[Diagram: markers M are linked to genotypes Q by the genotype model pr(q|m,λ,γ); genotypes Q are linked to the phenotype by the phenotype model pr(y|q,µ,γ)]
classical likelihood approach
• genotype model pr(q|m,λ,γ)
– missing genotypes q depend on observed markers m across genome
• phenotype model pr(y|q,µ,γ)
– link phenotypes y to genotypes q
$$\mathrm{LOD}(\lambda) = \log_{10}\left\{\max_{\mu} \mathrm{pr}(y \mid m, \mu, \lambda)\right\} + c$$
likelihood mixes over missing QTL genotypes:
$$\mathrm{pr}(y \mid m, \mu, \lambda) = \sum_{q} \mathrm{pr}(y \mid q, \mu)\, \mathrm{pr}(q \mid m, \lambda)$$
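A minimal sketch of the mixture likelihood, assuming a normal phenotype model with a common standard deviation and genotype probabilities pr(q | m_i, λ) already computed for each individual. The data, the genotype probabilities and the parameter values are hypothetical, and the parameters are plugged in rather than maximized, so the difference below is only a LOD-like comparison.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Normal density, used here as the phenotype model pr(y | q, mu)."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def log10_mixture_likelihood(y, geno_prob, means, sd):
    """log10 of prod_i sum_q pr(y_i | q, mu) pr(q | m_i, lambda)."""
    dens = normal_pdf(y[:, None], means[None, :], sd)   # dens[i, q]
    per_individual = np.sum(dens * geno_prob, axis=1)   # mix over genotypes q
    return np.sum(np.log10(per_individual))

# Hypothetical backcross data: two genotype classes per individual.
y = np.array([9.8, 10.1, 12.2, 11.9, 10.0, 12.1])
geno_prob = np.array([[0.90, 0.10], [0.80, 0.20], [0.10, 0.90],
                      [0.20, 0.80], [0.95, 0.05], [0.05, 0.95]])

# Plug-in parameter values (assumed, not maximized).
ll_qtl = log10_mixture_likelihood(y, geno_prob, np.array([10.0, 12.0]), 0.3)
ll_null = log10_mixture_likelihood(y, geno_prob, np.array([11.0, 11.0]), 1.1)
lod_like = ll_qtl - ll_null
```

When both genotype means are equal, the mixture collapses to a single normal likelihood, which is why the null model can reuse the same function.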
EM approach
• Iterate E and M steps
– expectation (E): genotype probabilities pr(q|y,m,µ,λ)
– maximization (M): pheno model parameters
• mean, effects, variance
– careful attention when many QTL present
• Multiple papers by Zhao-Bang Zeng and others
– Start with simple initial model
• Add QTL, epistatic effects sequentially
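The E and M steps can be sketched for one putative locus as follows. This is an illustration under stated assumptions, not the implementation used in any QTL package: the phenotype is normal with a common variance, `prior_prob[i, q]` stands for pr(q | m_i, λ), and the simulated data and the function name `em_single_qtl` are hypothetical.

```python
import numpy as np

def em_single_qtl(y, prior_prob, n_iter=50):
    """EM for genotype means and common variance at one putative locus."""
    n, k = prior_prob.shape
    means = np.quantile(y, np.linspace(0.25, 0.75, k))  # crude starting values
    var = np.var(y)
    for _ in range(n_iter):
        # E-step: posterior genotype probabilities pr(q | y, m, mu, lambda);
        # the common normalizing constant of the normal density cancels.
        dens = np.exp(-0.5 * (y[:, None] - means[None, :]) ** 2 / var)
        post = prior_prob * dens
        post /= post.sum(axis=1, keepdims=True)
        # M-step: weighted means and pooled variance.
        weight = post.sum(axis=0)
        means = (post * y[:, None]).sum(axis=0) / weight
        var = (post * (y[:, None] - means[None, :]) ** 2).sum() / n
    return means, var, post

# Simulated backcross with informative (hypothetical) marker probabilities.
rng = np.random.default_rng(1)
true_q = rng.integers(0, 2, size=200)
prior_prob = np.where(true_q[:, None] == np.arange(2)[None, :], 0.9, 0.1)
y = 10.0 + 2.0 * true_q + rng.normal(0.0, 0.5, size=200)

means, var, post = em_single_qtl(y, prior_prob)
```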
classic model search
• initial model from single QTL analysis
• search for additional QTL
• search for epistasis between pairs of QTL
– Both in model? One in model? Neither?
• Refine model
– Update QTL positions
– Check if existing QTL can be dropped
• Analogous to stepwise regression
comparing models (details later)
• balance model fit against model complexity
– want to fit data well (maximum likelihood)
– without getting too complicated a model
                      smaller model        bigger model
fit model             miss key features    fits better
estimate phenotype    may be biased        no bias
predict new data      may be biased        no bias
interpret model       easier               more complicated
estimate effects      low variance         high variance
1. Bayesian strategy for QTL study
• augment data (y,m) with missing genotypes q
• study unknowns (µ,λ,γ) given augmented data (y,m,q)
– find better genetic architectures γ
– find most likely genomic regions = QTL = λ
– estimate phenotype parameters = genotype means = µ
• sample from posterior in some clever way
– multiple imputation (Sen Churchill 2001)
– Markov chain Monte Carlo (MCMC)
• (Satagopan et al. 1996; Yi et al. 2005, 2007)
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{constant}}$$
$$\text{posterior for } q, \mu, \lambda, \gamma = \frac{\text{phenotype likelihood} \times [\text{prior for } q, \mu, \lambda, \gamma]}{\text{constant}}$$
$$\mathrm{pr}(q, \mu, \lambda, \gamma \mid y, m) = \frac{\mathrm{pr}(y \mid q, \mu, \gamma)\,[\mathrm{pr}(q \mid m, \lambda, \gamma)\,\mathrm{pr}(\mu \mid \gamma)\,\mathrm{pr}(\lambda \mid m, \gamma)\,\mathrm{pr}(\gamma)]}{\mathrm{pr}(y \mid m)}$$
Bayes posterior for normal data
[Figure: two panels with x-axis y = phenotype values (6 to 16), one for small prior variance and one for large prior variance; each marks the prior mean and the actual mean and shows the prior together with the posteriors for n small and n large]
Posterior on genotypic means?
phenotype model pr(y|q,µ)
[Figure: x-axis y = phenotype values (6 to 16) with genotypes qq, Qq, QQ; marks the prior mean, data mean, data means and posterior means, and shows the prior with posteriors for n small and n large]
Bayes posterior QTL means
posterior centered on sample genotypic mean
but shrunken slightly toward overall mean
phenotype mean:
$$E(y \mid q) = \mu_q, \qquad V(y \mid q) = \sigma^2$$
genotypic prior:
$$E(\mu_q) = \bar{y}, \qquad V(\mu_q) = \kappa \sigma^2$$
posterior:
$$E(\mu_q \mid y) = b_q \bar{y}_q + (1 - b_q)\,\bar{y}, \qquad V(\mu_q \mid y) = b_q \sigma^2 / n_q$$
$$n_q = \mathrm{count}\{q_i = q\}, \qquad \bar{y}_q = \sum_{\{q_i = q\}} y_i / n_q$$
shrinkage:
$$b_q = \frac{\kappa n_q}{\kappa n_q + 1} \to 1$$
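The shrinkage of genotypic means can be worked through numerically. The sketch below assumes known genotypes, a known residual variance σ² (`sigma2`) and a prior variance of κσ² (`kappa`) centered at the grand mean; the function name and the toy data are hypothetical.

```python
import numpy as np

def posterior_genotype_means(y, q, sigma2, kappa):
    """Posterior mean and variance of each genotypic mean mu_q."""
    grand_mean = y.mean()
    result = {}
    for g in np.unique(q):
        y_g = y[q == g]
        n_g = y_g.size
        b_g = kappa * n_g / (kappa * n_g + 1.0)          # shrinkage weight
        post_mean = b_g * y_g.mean() + (1.0 - b_g) * grand_mean
        post_var = b_g * sigma2 / n_g
        result[int(g)] = (post_mean, post_var)
    return result

# Toy data: genotypes coded 0 (qq), 1 (Qq), 2 (QQ).
y = np.array([8.0, 9.0, 10.0, 11.0, 12.0, 13.0])
q = np.array([0, 0, 1, 1, 2, 2])
post = posterior_genotype_means(y, q, sigma2=1.0, kappa=1.0)
```

With two individuals per genotype and κ = 1, the weight is b_q = 2/3, so each sample genotypic mean moves one third of the way toward the grand mean.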
pr(q|m,λ) recombination model
pr(q|m,λ) = pr(geno | map, locus)
≈ pr(geno | flanking markers, locus)
[Diagram: markers m1, m2, m3, m4, m5, m6 at increasing distance along the chromosome, with the unknown genotype q? at a locus between two of the markers]
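As an illustration of the flanking-marker approximation, the sketch below computes the genotype probability at a locus in a backcross. It assumes Haldane's map function and no crossover interference, which the slides do not specify; the function names and the 0/1 genotype coding are hypothetical.

```python
import math

def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def transition(a, b, r):
    """Probability of genotype b at one position given genotype a at another."""
    return 1.0 - r if a == b else r

def prob_q1_given_flanking(left, right, d_left, d_right):
    """pr(q = 1 | flanking marker genotypes) in a backcross, genotypes coded 0/1.

    d_left and d_right are distances (cM) from the locus to each flanking marker.
    """
    r1, r2 = haldane_r(d_left), haldane_r(d_right)
    num = transition(left, 1, r1) * transition(1, right, r2)
    den = num + transition(left, 0, r1) * transition(0, right, r2)
    return num / den

p_same = prob_q1_given_flanking(1, 1, 5.0, 5.0)    # both markers agree
p_recomb = prob_q1_given_flanking(0, 1, 5.0, 5.0)  # recombinant, locus midway
```

For a recombinant interval with the locus at the midpoint, the two genotypes are equally likely, which matches the 1:1 ratio obtained when the phenotype is ignored.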
what are likely QTL genotypes q?
how does phenotype y improve guess?
what are probabilities for genotype q between markers?
recombinants AA:AB
all 1:1 if ignore y
and if we use y?
[Figure: phenotype bp (y-axis, 90 to 120) against genotype (AA or AB) at flanking markers D4Mit41 and D4Mit214]
posterior on QTL genotypes q
• full conditional of q given data, parameters
– proportional to prior pr(q | m, λ)
• weight toward q that agrees with flanking markers
– proportional to likelihood pr(y | q, µ)
• weight toward q with similar phenotype values
– posterior recombination model balances these two
• this is the E-step of EM computations
$$\mathrm{pr}(q \mid y, m, \mu, \lambda) = \frac{\mathrm{pr}(y \mid q, \mu)\,\mathrm{pr}(q \mid m, \lambda)}{\mathrm{pr}(y \mid m, \mu, \lambda)}$$
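A small numerical sketch of this full conditional for one individual, assuming a normal phenotype model with a common standard deviation. The genotype means, the standard deviation and the phenotype value are hypothetical.

```python
import numpy as np

def genotype_posterior(y_i, prior, means, sd):
    """pr(q | y_i, m_i, mu, lambda), proportional to likelihood times prior."""
    lik = np.exp(-0.5 * ((y_i - means) / sd) ** 2)   # pr(y_i | q, mu) up to a constant
    post = lik * prior                               # times pr(q | m_i, lambda)
    return post / post.sum()                         # divide by pr(y_i | m_i, mu, lambda)

# Recombinant between flanking markers: prior is 1:1 for AA versus AB.
prior = np.array([0.5, 0.5])
means = np.array([100.0, 110.0])   # assumed genotype means for AA, AB
post = genotype_posterior(108.0, prior, means, sd=5.0)
```

The phenotype value is closer to the AB mean, so the posterior moves away from 1:1 toward AB.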
Where are the loci on the genome?
• prior over genome for QTL positions
– flat prior = no prior idea of loci
– or use prior studies to give more weight to some regions
• posterior depends on QTL genotypes q
pr(λ | m,q) = pr(λ) pr(q | m,λ) / constant
– constant determined by averaging
• over all possible genotypes q
• over all possible loci λ on entire map
• no easy way to write down posterior
what is the genetic architecture γ?
• which positions correspond to QTLs?
– priors on loci (previous slide)
• which QTL have main effects?
– priors for presence/absence of main effects
• same prior for all QTL
• can put prior on each d.f. (1 for BC, 2 for F2)
• which pairs of QTL have epistatic interactions?
– prior for presence/absence of epistatic pairs
• depends on whether 0,1,2 QTL have main effects
• epistatic effects less probable than main effects
γ = genetic architecture:
loci:
– main QTL
– epistatic pairs
effects:
– add, dom
– aa, ad, dd
Bayesian priors & posteriors
• augmenting with missing genotypes q
– prior is recombination model
– posterior is (formally) E step of EM algorithm
• sampling phenotype model parameters
– prior is “flat” normal at grand mean (no information)
– posterior shrinks genotypic means toward grand mean
– (details for unexplained variance omitted here)
• sampling QTL loci
– prior is flat across genome (all loci equally likely)
• sampling QTL genetic architecture model
– number of QTL
• prior is Poisson with mean from previous IM study
– genetic architecture of main effects and epistatic interactions
• priors on epistasis depend on presence/absence of main effects
2. Markov chain sampling
• construct Markov chain around posterior
– want posterior as stable distribution of Markov chain
– in practice, the chain tends toward stable distribution
• initial values may have low posterior probability
• burn-in period to get chain mixing well
• sample QTL model components from full conditionals
– sample locus λ given q,γ (using Metropolis-Hastings step)
– sample genotypes q given λ,µ,y,γ (using Gibbs sampler)
– sample effects µ given q,y,γ (using Gibbs sampler)
– sample QTL model γ given λ,µ,y,q (using Gibbs or M-H)
$$(\lambda, q, \mu, \gamma) \sim \mathrm{pr}(\lambda, q, \mu, \gamma \mid y, m)$$
$$(\lambda, q, \mu, \gamma)_1 \to (\lambda, q, \mu, \gamma)_2 \to \cdots \to (\lambda, q, \mu, \gamma)_N$$
MCMC sampling of unknowns (q,µ,λ)
for given genetic architecture γ
• Gibbs sampler
– genotypes q
– effects µ
– not loci λ
$$q \sim \mathrm{pr}(q \mid y_i, m_i, \mu, \lambda)$$
$$\mu \sim \frac{\mathrm{pr}(y \mid q, \mu)\,\mathrm{pr}(\mu)}{\mathrm{pr}(y \mid q)}$$
$$\lambda \sim \frac{\mathrm{pr}(q \mid m, \lambda)\,\mathrm{pr}(\lambda \mid m)}{\mathrm{pr}(q \mid m)}$$
• Metropolis-Hastings sampler
– extension of Gibbs sampler
– does not require normalization
• pr( q | m ) = sum over λ of pr( q | m, λ ) pr( λ )
Gibbs sampler
for two genotypic means
• want to study two correlated effects
– could sample directly from their bivariate distribution
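– assume correlation ρ is known
• instead use Gibbs sampler:
– sample each effect from its full conditional given the other
– pick order of sampling at random
– repeat many times
$$\mu = \begin{pmatrix}\mu_1 \\ \mu_2\end{pmatrix} \sim N\left(\begin{pmatrix}0 \\ 0\end{pmatrix}, \begin{pmatrix}1 & \rho \\ \rho & 1\end{pmatrix}\right)$$
$$\mu_1 \mid \mu_2 \sim N(\rho \mu_2,\ 1 - \rho^2)$$
$$\mu_2 \mid \mu_1 \sim N(\rho \mu_1,\ 1 - \rho^2)$$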
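The two-mean Gibbs sampler can be sketched directly from the full conditionals: each mean is drawn from a normal with mean ρ times the other and variance 1 − ρ². The chain length, seed and function name below are arbitrary choices for illustration.

```python
import numpy as np

def gibbs_bivariate_normal(n_samples, rho, seed=0):
    """Gibbs sampler for two standard normal means with known correlation rho."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(1.0 - rho ** 2)      # conditional standard deviation
    mu1, mu2 = 0.0, 0.0
    samples = np.empty((n_samples, 2))
    for t in range(n_samples):
        if rng.random() < 0.5:        # pick order of sampling at random
            mu1 = rng.normal(rho * mu2, sd)
            mu2 = rng.normal(rho * mu1, sd)
        else:
            mu2 = rng.normal(rho * mu1, sd)
            mu1 = rng.normal(rho * mu2, sd)
        samples[t] = (mu1, mu2)
    return samples

samples = gibbs_bivariate_normal(20000, rho=0.6)
sample_cor = np.corrcoef(samples[:, 0], samples[:, 1])[0, 1]
```

With a long chain the sample correlation should settle near ρ, while short chains (such as N = 50) can look quite different from the target.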
Gibbs sampler samples: ρ = 0.6
[Figure: for N = 50 samples and N = 200 samples, trace plots of Gibbs: mean 1 and Gibbs: mean 2 against Markov chain index, and scatter plots of Gibbs: mean 2 against Gibbs: mean 1]
full conditional for locus
• cannot easily sample from locus full conditional
pr(λ | y,m,µ,q) = pr(λ | m,q)
= pr(q | m,λ) pr(λ) / constant
• constant is very difficult to compute explicitly
– must average over all possible loci λ over genome
– must do this for every possible genotype q
• Gibbs sampler will not work in general
– but can use method based on ratios of probabilities
– Metropolis-Hastings is extension of Gibbs sampler
Metropolis-Hastings idea
• want to study distribution f(λ)
– take Monte Carlo samples
• unless too complicated
– take samples using ratios of f
• Metropolis-Hastings samples:
– propose new value λ*
• near (?) current value λ
• from some distribution g
– accept new value with prob a
• Gibbs sampler: a = 1 always
$$a = \min\left(1, \frac{f(\lambda^*)\, g(\lambda - \lambda^*)}{f(\lambda)\, g(\lambda^* - \lambda)}\right)$$
where g(λ* – λ) is the density for proposing λ* from the current value λ
[Figure: target density f(λ) for λ from 0 to 10, and proposal density g(λ–λ*) centered at zero]
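A minimal random-walk Metropolis-Hastings sketch. It assumes a symmetric normal proposal g, so the g terms cancel and only the ratio of f values is needed; the two-bump target f on (0, 10) is purely illustrative and is not the posterior from the slides.

```python
import numpy as np

def metropolis_hastings(f, n_samples, start, step, seed=0):
    """Random-walk M-H for an unnormalized density f, symmetric normal proposal."""
    rng = np.random.default_rng(seed)
    current = start
    f_current = f(current)            # must be positive at the start
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = current + rng.normal(0.0, step)     # propose near current value
        f_proposal = f(proposal)
        accept_prob = min(1.0, f_proposal / f_current) # ratio of f only
        if rng.random() < accept_prob:
            current, f_current = proposal, f_proposal
        samples[t] = current
    return samples

def f(lam):
    """Illustrative unnormalized target with two bumps on (0, 10)."""
    if lam <= 0.0 or lam >= 10.0:
        return 0.0
    return np.exp(-0.5 * (lam - 3.0) ** 2) + 0.5 * np.exp(-0.5 * (lam - 7.0) ** 2 / 0.25)

samples = metropolis_hastings(f, n_samples=50000, start=5.0, step=1.0)
```

The normalizing constant of f never appears, which is the property that makes the method usable for the locus full conditional. The `step` argument plays the role of a narrow or wide g.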
Metropolis-Hastings for locus λ
[Figure: left panel, mcmc sequence (iterations 0 to 10000) against λ (0 to 10); right panel, histogram pr(λ|Y) for λ between 2 and 8]
added twist: occasionally propose from entire genome
Metropolis-Hastings samples
[Figure: for N = 200 samples and N = 1000 samples, with narrow g and wide g, trace plots of mcmc sequence against λ (0 to 8) and histograms pr(λ|Y)]
3. sampling genetic architectures
• search across genetic architectures of various sizes
– allow change in number of QTL
– allow change in types of epistatic interactions
• methods for search
– reversible jump MCMC
– Gibbs sampler with loci indicators
• complexity of epistasis
– Fisher-Cockerham effects model
– general multi-QTL interaction & limits of inference
reversible jump MCMC
• consider known genotypes q at 2 known loci
– models with 1 or 2 QTL
• M-H step between 1-QTL and 2-QTL models
– model changes dimension (via careful bookkeeping)
– consider mixture over QTL models H
$$\text{1 QTL}: \quad Y = \beta_0 + \beta(q_1) + e$$
$$\text{2 QTL}: \quad Y = \beta_0 + \beta_1(q_1) + \beta_2(q_2) + e$$
geometry of reversible jump
[Figure: two panels with axes b1 and b2 (0.0 to 0.8); "Move Between Models" shows the m=1 and m=2 models with c21 = 0.7; "Reversible Jump Sequence" shows a sampled path]
geometry allowing q and λ to change
[Figure: two panels with axes b1 and b2, titled "a short sequence" and "first 1000 with m<3"]
collinear QTL = correlated effects
[Figure: scatter plots of additive 2 against additive 1 (effect 1) for the 4-week and 8-week traits, with annotated correlations cor = -0.81 and cor = -0.7]
• linked QTL = collinear genotypes
– correlated estimates of effects (negative if in coupling phase)
– sum of linked effects usually fairly constant
sampling across QTL models
[Diagram: genome from 0 to L with current loci λ1, λ2, …, λm and a proposed new locus λm+1]
action steps: draw one of three choices
• update QTL model γ with probability 1-b(γ)-d(γ)
– update current model using full conditionals
– sample QTL loci, effects, and genotypes
• add a locus with probability b(γ)
– propose a new locus along genome
– innovate new genotypes at locus and phenotype effect
– decide whether to accept the “birth” of new locus
• drop a locus with probability d(γ)
– propose dropping one of existing loci
– decide whether to accept the “death” of locus
Gibbs sampler with loci indicators
• consider only QTL at pseudomarkers
– every 1-2 cM
– modest approximation with little bias
• use loci indicators in each pseudomarker
– γ = 1 if QTL present
– γ = 0 if no QTL present
• Gibbs sampler on loci indicators
– relatively easy to incorporate epistasis
– Yi, Yandell, Churchill, Allison, Eisen, Pomp (2005 Genetics)
• (see earlier work of Nengjun Yi and Ina Hoeschele)
$$\mu_q = \mu + \gamma_1 \beta_1(q_1) + \gamma_2 \beta_2(q_2), \qquad \gamma_k = 0, 1$$
Bayesian shrinkage estimation
• soft loci indicators
– strength of evidence for λj depends on γ
– 0 ≤ γ ≤ 1 (grey scale)
– shrink most γs to zero
• Wang et al. (2005 Genetics)
– Shizhong Xu group at U CA Riverside
$$\mu_q = \beta_0 + \gamma_1 \beta_1(q_1) + \gamma_2 \beta_2(q_2), \qquad 0 \le \gamma_k \le 1$$
other model selection approaches
• include all potential loci in model
• assume “true” model is “sparse” in some sense
• Sparse partial least squares
– Chun, Keles (2009 Genetics; 2010 JRSSB)
• LASSO model selection
– Foster (2006); Foster Verbyla Pitchford (2007 JABES)
– Xu (2007 Biometrics); Yi Xu (2007 Genetics)
– Shi Wahba Wright Klein Klein (2008 Stat & Infer)
4. criteria for model selection
balance fit against complexity
• classical information criteria
– penalize likelihood L by model size |γ|
– IC = – 2 log L(γ | y) + penalty(γ)
– maximize over unknowns
• Bayes factors
– marginal posteriors pr(y | γ)
– average over unknowns
classical information criteria
• start with likelihood L(γ | y, m)
– measures fit of architecture (γ) to phenotype (y)
• given marker data (m)
– genetic architecture (γ) depends on parameters
• have to estimate loci (λ) and effects (µ)
• complexity related to number of parameters
– |γ| = size of genetic architecture
• BC: |γ| = 1 + n.qtl + n.qtl(n.qtl - 1) = 1 + 4 + 12 = 17 (for n.qtl = 4)
• F2: |γ| = 1 + 2n.qtl + 4n.qtl(n.qtl - 1) = 1 + 8 + 48 = 57 (for n.qtl = 4)
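The size formulas can be checked with a few lines of code. The sketch below encodes the two formulas exactly as written on the slide (intercept, main effects, and all pairwise epistatic terms); the function name `architecture_size` is hypothetical.

```python
def architecture_size(n_qtl, cross):
    """Model size for a full architecture, following the slide's formulas."""
    if cross == "BC":
        # intercept + 1 main effect per QTL + pairwise epistasis term
        return 1 + n_qtl + n_qtl * (n_qtl - 1)
    if cross == "F2":
        # intercept + 2 main effects per QTL + 4 epistasis terms per pair term
        return 1 + 2 * n_qtl + 4 * n_qtl * (n_qtl - 1)
    raise ValueError("cross must be 'BC' or 'F2'")

size_bc = architecture_size(4, "BC")
size_f2 = architecture_size(4, "F2")
```

For four QTL this reproduces the sizes 17 and 57 given above, and shows how quickly the F2 model grows once epistasis is allowed.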
classical information criteria
• construct information criteria
– balance fit to complexity
– Akaike AIC = –2 log(L) + 2 |γ|
– Bayes/Schwarz BIC = –2 log(L) + |γ| log(n)
– Broman BICδ = –2 log(L) + δ |γ| log(n)
– general form: IC = –2 log(L) + |γ| D(n)
• compare models
– hypothesis testing: designed for one comparison
• 2 log[LR(γ1, γ2)] = 2 log L(γ1 | y, m) – 2 log L(γ2 | y, m)
– model selection: penalize complexity
• IC(γ1, γ2) = 2 log[LR(γ1, γ2)] + (|γ2| – |γ1|) D(n)
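A short sketch of the general form IC = –2 log(L) + |γ| D(n), with D(n) = 2 for AIC and D(n) = δ log(n) for the BIC family. The log-likelihood values, model sizes and sample size are hypothetical numbers chosen only to show the arithmetic.

```python
import math

def information_criterion(log_lik, size, n, kind="BIC", delta=1.0):
    """IC = -2 log(L) + size * D(n), using natural logarithms."""
    if kind == "AIC":
        penalty_per_param = 2.0
    elif kind == "BIC":
        penalty_per_param = delta * math.log(n)   # delta = 1 is the usual BIC
    else:
        raise ValueError("kind must be 'AIC' or 'BIC'")
    return -2.0 * log_lik + size * penalty_per_param

# Hypothetical fits: a smaller and a bigger architecture on n = 100 individuals.
n = 100
aic_small = information_criterion(-150.0, 5, n, kind="AIC")
bic_small = information_criterion(-150.0, 5, n, kind="BIC")
bic_big = information_criterion(-147.0, 9, n, kind="BIC")
prefer_small = bic_small < bic_big
```

Here the bigger model gains 3 units of log-likelihood but pays for 4 extra parameters, so BIC prefers the smaller model.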
information criteria vs. model size
• WinQTL 2.0
• SCD data on F2
• A=AIC
• 1=BIC(1)
• 2=BIC(2)
• d=BIC(δ)
• models
– 1,2,3,4 QTL
• 2+5+9+2
– epistasis
• 2:2 AD
[Figure: information criteria (y-axis, 300 to 360) against model parameters p (x-axis), with plotting symbols A, 1, 2, d; the models with epistasis are marked]
Bayes factors
• ratio of model likelihoods
– ratio of posterior to prior odds for architectures
– averaged over unknowns
$$B_{12} = \frac{\mathrm{pr}(\gamma_1 \mid y, m) / \mathrm{pr}(\gamma_2 \mid y, m)}{\mathrm{pr}(\gamma_1) / \mathrm{pr}(\gamma_2)} = \frac{\mathrm{pr}(y \mid m, \gamma_1)}{\mathrm{pr}(y \mid m, \gamma_2)}$$
• roughly equivalent to BIC
– BIC maximizes over unknowns
– BF averages over unknowns
$$-2 \log(B_{12}) \approx -2 \log(LR) - (|\gamma_2| - |\gamma_1|) \log(n)$$
scan of marginal Bayes factor & effect
issues in computing Bayes factors
• BF insensitive to shape of prior on γ
– geometric, Poisson, uniform
– precision improves when prior mimics posterior
• BF sensitivity to prior variance on effects
– prior variance should reflect data variability
– resolved by using hyper-priors
• automatic algorithm; no need for user tuning
• easy to compute Bayes factors from samples
– sample posterior using MCMC
– posterior pr(γ | y, m) is marginal histogram
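A sketch of computing a Bayes factor from MCMC output on the number of QTL: the posterior is estimated by the sampled marginal histogram and divided by the prior. The sample counts below are made up for illustration, and a Poisson prior with mean 2 is an assumed choice.

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(m, mean):
    """Poisson prior probability for m QTL."""
    return exp(-mean) * mean ** m / factorial(m)

def bayes_factor(n_qtl_samples, m1, m2, prior):
    """BF = [post(m1) / prior(m1)] / [post(m2) / prior(m2)] from MCMC samples."""
    samples = np.asarray(n_qtl_samples)
    post1 = np.mean(samples == m1)    # sampled marginal histogram
    post2 = np.mean(samples == m2)
    return (post1 / prior(m1)) / (post2 / prior(m2))

# Hypothetical MCMC draws of the number of QTL.
n_qtl_samples = [1] * 200 + [2] * 600 + [3] * 200

def prior(m):
    return poisson_pmf(m, 2.0)

bf_21 = bayes_factor(n_qtl_samples, 2, 1, prior)
bf_32 = bayes_factor(n_qtl_samples, 3, 2, prior)
```

A model that is never visited has an estimated posterior of zero, so ratios involving it are unreliable; this is one reason precision improves when the prior mimics the posterior.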
Bayes factors & genetic architecture γ
• |γ| = number of QTL
– prior pr(γ) chosen by user
– posterior pr(γ | y,m)
• sampled marginal histogram
• shape affected by prior pr(γ)
$$BF_{\gamma_1, \gamma_2} = \frac{\mathrm{pr}(\gamma_1 \mid y, m) / \mathrm{pr}(\gamma_1)}{\mathrm{pr}(\gamma_2 \mid y, m) / \mathrm{pr}(\gamma_2)}$$
• pattern of QTL across genome
• gene action and epistasis
[Figure: prior probability (0.00 to 0.30) against m = number of QTL (0 to 10) for exponential (e), Poisson (p) and uniform (u) priors]
BF sensitivity to fixed prior for effects
[Figure: Bayes factors B12, B23, B34, B45 (log scale) against hyper-prior heritability h² from 0.05 to 50]
$$\beta_{qj} \sim N(0, \sigma_G^2 / m), \qquad \sigma_G^2 = h^2 \sigma_{\text{total}}^2, \qquad h^2 \text{ fixed}$$
BF insensitivity to random effects prior
[Figure: left panel, hyper-prior density 2*Beta(a,b) against hyper-parameter heritability h² (0.0 to 2.0) for (a,b) = (0.25,9.75), (0.5,9.5), (1,9), (2,10), (1,3), (1,1); right panel, "insensitivity to hyper-prior": Bayes factors B12, B23, B34 against E(h²) from 0.05 to 1.00]
$$\beta_{qj} \sim N(0, \sigma_G^2 / m), \qquad \sigma_G^2 = h^2 \sigma_{\text{total}}^2, \qquad \tfrac{1}{2} h^2 \sim \mathrm{Beta}(a, b)$$