QTL Model Selection 1. Bayesian strategy 2. Markov chain sampling 3. sampling genetic architectures 4. criteria for model selection Model Selection Seattle SISG: Yandell © 2012 1 QTL model selection: key players • observed measurements – y = phenotypic trait – m = markers & linkage map – i = individual index (1,…,n) • observed m X missing data – missing marker data – q = QT genotypes q Q missing • alleles QQ, Qq, or qq at locus • • unknown quantities – = QT locus (or loci) – = phenotype model parameters – = QTL model/genetic architecture unknown pr(q|m,,) genotype model – grounded by linkage map, experimental cross – recombination yields multinomial for q given m • Yy pr(y|q,,) phenotype model – distribution shape (assumed normal here) – unknown parameters (could be non-parametric) Model Selection Seattle SISG: Yandell © 2012 after Sen Churchill (2001) 2 QTL mapping (from ZB Zeng) phenotype model pr(y|q,,) genotypes Q pr(q|m,,) markers M Model Selection Seattle SISG: Yandell © 2012 3 classical likelihood approach • genotype model pr(q|m,,) – missing genotypes q depend on observed markers m across genome • phenotype model pr(y|q,,) – link phenotypes y to genotypes q LOD( ) log10 {max pr( y | m, , )} c likelihoodmixesover missing QT Lgenotypes: pr( y | m, , ) q pr( y | q, )pr(q | m, ) Model Selection Seattle SISG: Yandell © 2012 4 EM approach • Iterate E and M steps – expectation (E): geno prob’s pr(q|m,,) – maximization (M): pheno model parameters • mean, effects, variance – careful attention when many QTL present • Multiple papers by Zhao-Bang Zeng and others – Start with simple initial model • Add QTL, epistatic effects sequentially Model Selection Seattle SISG: Yandell © 2012 5 classic model search • initial model from single QTL analysis • search for additional QTL • search for epistasis between pairs of QTL – Both in model? One in model? Neither? • Refine model – Update QTL positions – Check if existing QTL can be dropped • Analogous to stepwise regression Model Selection Seattle SISG: Yandell © 2012 6 comparing models (details later) • balance model fit against model complexity – want to fit data well (maximum likelihood) – without getting too complicated a model smaller model fit model miss key features estimate phenotype may be biased predict new data may be biased interpret model easier estimate effects low variance SysGen: Overview Seattle SISG: Yandell © 2012 bigger model fits better no bias no bias more complicated high variance 7 1. Bayesian strategy for QTL study • augment data (y,m) with missing genotypes q • study unknowns (,,) given augmented data (y,m,q) – find better genetic architectures – find most likely genomic regions = QTL = – estimate phenotype parameters = genotype means = • sample from posterior in some clever way – multiple imputation (Sen Churchill 2002) – Markov chain Monte Carlo (MCMC) • (Satagopan et al. 1996; Yi et al. 2005, 2007) posterior posteriorfor q, , , pr( q, , , | y , m) Model Selection likelihood* prior constant phenotypelikelihood* [priorfor q, , , ] constant pr( y | q, , ) * [pr( q | m, , )pr( | )pr( | m, )pr( )] pr( y | m) Seattle SISG: Yandell © 2012 8 6 8 10 prior mean actual mean n small prior n large n large prior mean n small prior actual mean Bayes posterior for normal data 12 14 16 6 8 y = phenotype values small prior variance Model Selection 10 12 14 16 y = phenotype values large prior variance Seattle SISG: Yandell © 2012 9 Posterior on genotypic means? phenotype model pr(y|q,) prior mean data mean n small prior data means n large posterior means 6 qq Model Selection 8 10 Qq 12 y = phenotype values Seattle SISG: Yandell © 2012 14 16 QQ 10 Bayes posterior QTL means posterior centered on sample genotypic mean but shrunken slightly toward overall mean phenotype mean: E ( y | q) q V ( y | q) 2 genotypic prior: E ( q ) y V ( q ) 2 posterior: E ( q | y ) bq yq (1 bq ) y V ( q | y ) bq 2 / nq nq shrinkage: QTL 2: Bayes bq count{qi q} nq nq 1 yq sum yi / nq {qi q} 1 Seattle SISG: Yandell © 2010 11 pr(q|m,) recombination model pr(q|m,) = pr(geno | map, locus) pr(geno | flanking markers, locus) m1 m2 Model Selection q? m3 m4 markers m5 m6 distance along chromosome Seattle SISG: Yandell © 2012 12 Model Selection Seattle SISG: Yandell © 2012 13 what are likely QTL genotypes q? how does phenotype y improve guess? D4Mit41 D4Mit214 what are probabilities for genotype q between markers? 120 bp 110 recombinants AA:AB 100 all 1:1 if ignore y and if we use y? 90 AA AA AB AA AA AB AB AB Genotype Model Selection Seattle SISG: Yandell © 2012 14 posterior on QTL genotypes q • full conditional of q given data, parameters – proportional to prior pr(q | m, ) • weight toward q that agrees with flanking markers – proportional to likelihood pr(y | q, ) • weight toward q with similar phenotype values – posterior recombination model balances these two • this is the E-step of EM computations pr( y | q, ) * pr( q | m, ) pr( q | y, m, , ) pr( y | m, , ) Model Selection Seattle SISG: Yandell © 2012 15 Where are the loci on the genome? • prior over genome for QTL positions – flat prior = no prior idea of loci – or use prior studies to give more weight to some regions • posterior depends on QTL genotypes q pr( | m,q) = pr() pr(q | m,) / constant – constant determined by averaging • over all possible genotypes q • over all possible loci on entire map • no easy way to write down posterior Model Selection Seattle SISG: Yandell © 2012 16 what is the genetic architecture ? • which positions correspond to QTLs? – priors on loci (previous slide) • which QTL have main effects? – priors for presence/absence of main effects • same prior for all QTL • can put prior on each d.f. (1 for BC, 2 for F2) • which pairs of QTL have epistatic interactions? – prior for presence/absence of epistatic pairs • depends on whether 0,1,2 QTL have main effects • epistatic effects less probable than main effects Model Selection Seattle SISG: Yandell © 2012 17 = genetic architecture: loci: main QTL epistatic pairs effects: add, dom aa, ad, dd Model Selection Seattle SISG: Yandell © 2012 18 Bayesian priors & posteriors • augmenting with missing genotypes q – prior is recombination model – posterior is (formally) E step of EM algorithm • sampling phenotype model parameters – prior is “flat” normal at grand mean (no information) – posterior shrinks genotypic means toward grand mean – (details for unexplained variance omitted here) • sampling QTL loci – prior is flat across genome (all loci equally likely) • sampling QTL genetic architecture model – number of QTL • prior is Poisson with mean from previous IM study – genetic architecture of main effects and epistatic interactions • priors on epistasis depend on presence/absence of main effects Model Selection Seattle SISG: Yandell © 2012 19 2. Markov chain sampling • construct Markov chain around posterior – want posterior as stable distribution of Markov chain – in practice, the chain tends toward stable distribution • initial values may have low posterior probability • burn-in period to get chain mixing well • sample QTL model components from full conditionals – – – – sample locus given q, (using Metropolis-Hastings step) sample genotypes q given ,,y, (using Gibbs sampler) sample effects given q,y, (using Gibbs sampler) sample QTL model given ,,y,q (using Gibbs or M-H) ( , q, , ) ~ pr( , q, , | y, m) ( , q, , )1 ( , q, , )2 ( , q, , ) N Model Selection Seattle SISG: Yandell © 2012 20 MCMC sampling of unknowns (q,µ,) for given genetic architecture • Gibbs sampler – genotypes q – effects µ – not loci q ~ pr( q | yi , mi , , ) pr( y | q, ) pr( ) ~ pr( y | q) pr( q | m, ) pr( | m) ~ pr( q | m) • Metropolis-Hastings sampler – extension of Gibbs sampler – does not require normalization • pr( q | m ) = sum pr( q | m, ) pr( ) Model Selection Seattle SISG: Yandell © 2012 21 Gibbs sampler for two genotypic means • want to study two correlated effects – could sample directly from their bivariate distribution – assume correlation is known • instead use Gibbs sampler: – sample each effect from its full conditional given the other – pick order of sampling at random – repeat many times 0 1 1 ~ N , 2 0 1 1 ~ N 2 ,1 2 2 ~ N 1 ,1 2 Model Selection Seattle SISG: Yandell © 2012 22 Gibbs sampler samples: = 0.6 N = 200 samples 3 -2 1 0 -2 -1 Gibbs: mean 2 2 1 0 -1 Gibbs: mean 1 2 3 2 1 0 Gibbs: mean 2 -1 1 0 -1 -2 -2 Gibbs: mean 1 2 N = 50 samples 2 0 100 150 200 -2 Gibbs: mean 2 -1 0 1 2 3 Gibbs: mean 1 2 3 2 1 0 -2 -2 Gibbs: mean 2 50 Markov chain index 2 1 1 0 -1 3 2 1 0 -1 -2 Gibbs: mean 2 -1 Gibbs: mean 1 0 -2 -1 50 Gibbs: mean 2 40 -2 30 1 20 0 10 Markov chain index -1 0 0 10 20 30 40 Markov chain index Model Selection 50 -2 -1 0 1 Gibbs: mean 1 2 0 50 100 150 Markov chain index Seattle SISG: Yandell © 2012 200 -2 -1 0 1 2 Gibbs: mean 1 23 3 full conditional for locus • cannot easily sample from locus full conditional pr( |y,m,µ,q) = pr( | m,q) = pr( q | m, ) pr( ) / constant • constant is very difficult to compute explicitly – must average over all possible loci over genome – must do this for every possible genotype q • Gibbs sampler will not work in general – but can use method based on ratios of probabilities – Metropolis-Hastings is extension of Gibbs sampler Model Selection Seattle SISG: Yandell © 2012 24 Metropolis-Hastings idea f() 0.4 • want to study distribution f() • unless too complicated 0.2 – take Monte Carlo samples – propose new value * • near (?) current value • from some distribution g – accept new value with prob a 0 2 4 6 0.4 • Metropolis-Hastings samples: 0.0 – take samples using ratios of f 0.2 0.0 -4 Model Selection 10 g(–*) • Gibbs sampler: a = 1 always f (* ) g (* ) a min1, * f ( ) g ( ) 8 Seattle SISG: Yandell © 2012 -2 0 2 4 25 0 0.0 0.1 2000 0.2 pr( |Y) 0.3 0.4 mcmc sequence 4000 6000 0.5 8000 0.6 10000 Metropolis-Hastings for locus 0 2 4 6 8 10 2 3 4 5 6 7 8 added twist: occasionally propose from entire genome Model Selection Seattle SISG: Yandell © 2012 26 800 400 0 mcmc sequence 800 400 0 histogram pr( |Y) 1.0 0.0 0 2 4 6 8 Seattle SISG: Yandell © 2012 0.0 0.2 0.4 0.6 2.0 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 histogram pr( |Y) 0.0 0.4 0.8 1.2 pr( |Y) histogram 4 2 0 pr( |Y) histogram 6 Model Selection N = 1000 samples narrow g wide g 0 2 4 6 8 0 2 4 6 8 0 2 4 6 8 mcmc sequence 150 0 50 150 mcmc sequence N = 200 samples narrow g wide g 0 50 mcmc sequence Metropolis-Hastings samples 0 2 4 6 8 27 3. sampling genetic architectures • search across genetic architectures of various sizes – allow change in number of QTL – allow change in types of epistatic interactions • methods for search – reversible jump MCMC – Gibbs sampler with loci indicators • complexity of epistasis – Fisher-Cockerham effects model – general multi-QTL interaction & limits of inference Model Selection Seattle SISG: Yandell © 2012 28 reversible jump MCMC • consider known genotypes q at 2 known loci – models with 1 or 2 QTL • M-H step between 1-QTL and 2-QTL models – model changes dimension (via careful bookkeeping) – consider mixture over QTL models H 1 QTL : Y 0 ( q1 ) e 2 QTL : Y 0 1 ( q1 ) 2 ( q 2 ) e Model Selection Seattle SISG: Yandell © 2012 29 geometry of reversible jump 0.6 0.6 0.8 Reversible Jump Sequence 0.8 Move Between Models b2 0.2 0.4 b2 0.2 0.4 c21 = 0.7 0.0 0.0 m=2 m=1 0.0 0.2 Model Selection 0.4 1b1 0.6 0.8 0.0 0.2 0.4 b1 0.6 0.8 1 Seattle SISG: Yandell © 2012 30 geometry allowing q and to change first 1000 with m<3 0.0 0.0 0.05 b2 0.1 0.2 b2 0.10 0.3 0.15 0.4 a short sequence 0.05 Model Selection 0.10 b1 1 0.15 -0.3 -0.2 -0.1 0.0 0.1 0.2 b1 Seattle SISG: Yandell © 2012 1 31 collinear QTL = correlated effects 8-week additive 2 -0.2 -0.1 cor = -0.7 -0.6 -0.3 additive 2 -0.4 -0.2 cor = -0.81 0.0 0.0 4-week -0.6 -0.4 -0.2 0.0 0.2 -0.2 additive 1 -0.1 0.0 0.1 0.2 additive 1 effect 1 effect 1 • linked QTL = collinear genotypes correlated estimates of effects (negative if in coupling phase) sum of linked effects usually fairly constant Model Selection Seattle SISG: Yandell © 2012 32 sampling across QTL models 0 1 m+1 2 … m L action steps: draw one of three choices • update QTL model with probability 1-b()-d() – update current model using full conditionals – sample QTL loci, effects, and genotypes • add a locus with probability b() – propose a new locus along genome – innovate new genotypes at locus and phenotype effect – decide whether to accept the “birth” of new locus • drop a locus with probability d() – propose dropping one of existing loci – decide whether to accept the “death” of locus Model Selection Seattle SISG: Yandell © 2012 33 Gibbs sampler with loci indicators • consider only QTL at pseudomarkers – every 1-2 cM – modest approximation with little bias • use loci indicators in each pseudomarker – = 1 if QTL present – = 0 if no QTL present • Gibbs sampler on loci indicators – relatively easy to incorporate epistasis – Yi, Yandell, Churchill, Allison, Eisen, Pomp (2005 Genetics) • (see earlier work of Nengjun Yi and Ina Hoeschele) q 1 1 ( q1 ) 2 2 ( q 2 ), k 0,1 Model Selection Seattle SISG: Yandell © 2012 34 Bayesian shrinkage estimation • soft loci indicators – strength of evidence for j depends on – 0 1 (grey scale) – shrink most s to zero • Wang et al. (2005 Genetics) – Shizhong Xu group at U CA Riverside q 0 11 ( q1 ) 2 2 ( q1 ), 0 k 1 Model Selection Seattle SISG: Yandell © 2012 35 other model selection approaches • include all potential loci in model • assume “true” model is “sparse” in some sense • Sparse partial least squares – Chun, Keles (2009 Genetics; 2010 JRSSB) • LASSO model selection – Foster (2006); Foster Verbyla Pitchford (2007 JABES) – Xu (2007 Biometrics); Yi Xu (2007 Genetics) – Shi Wahba Wright Klein Klein (2008 Stat & Infer) Model Selection Seattle SISG: Yandell © 2012 36 4. criteria for model selection balance fit against complexity • classical information criteria – penalize likelihood L by model size || – IC = – 2 log L( | y) + penalty() – maximize over unknowns • Bayes factors – marginal posteriors pr(y | ) – average over unknowns Model Selection Seattle SISG: Yandell © 2012 37 classical information criteria • start with likelihood L( | y, m) – measures fit of architecture () to phenotype (y) • given marker data (m) – genetic architecture () depends on parameters • have to estimate loci (µ) and effects () • complexity related to number of parameters – | | = size of genetic architecture • BC: | | = 1 + n.qtl + n.qtl(n.qtl - 1) = 1 + 4 + 12 = 17 • F2: | | = 1 + 2n.qtl +4n.qtl(n.qtl - 1) = 1 + 8 + 48 = 57 Model Selection Seattle SISG: Yandell © 2012 38 classical information criteria • construct information criteria – balance fit to complexity – Akaike AIC = –2 log(L) + 2 || – Bayes/Schwartz BIC = –2 log(L) + || log(n) – Broman BIC = –2 log(L) + || log(n) – general form: IC = –2 log(L) + || D(n) • compare models – hypothesis testing: designed for one comparison • 2 log[LR(1, 2)] = L(y|m, 2) – L(y|m, 1) – model selection: penalize complexity • IC(1, 2) = 2 log[LR(1, 2)] + (|2| – |1|) D(n) Model Selection Seattle SISG: Yandell © 2012 39 information criteria vs. model size WinQTL 2.0 SCD data on F2 A=AIC 1=BIC(1) 2=BIC(2) d=BIC() models d d information criteria 300 320 340 • • • • • • • 360 d d d 1 A 1 1 3 1 1 1A A 2 2 2 • 2+5+9+2 • 2:2 AD 2 2 d2 d 2 – 1,2,3,4 QTL – epistasis 2 d 2 A A A 4 5 6 7 model parameters p 1 1 A A 8 9 epistasis Model Selection Seattle SISG: Yandell © 2012 40 Bayes factors • ratio of model likelihoods – ratio of posterior to prior odds for architectures – averaged over unknowns pr( 1 | y, m) / pr( 2 | y, m) pr( y | m, 1 ) B12 pr( 1 ) / pr( 2 ) pr( y | m, 2 ) • roughly equivalent to BIC – BIC maximizes over unknowns – BF averages over unknowns 2 log(B12 ) 2 log(LR) (| 2 | | 1 |) log(n) Model Selection Seattle SISG: Yandell © 2012 41 scan of marginal Bayes factor & effect Model Selection Seattle SISG: Yandell © 2012 42 issues in computing Bayes factors • BF insensitive to shape of prior on – geometric, Poisson, uniform – precision improves when prior mimics posterior • BF sensitivity to prior variance on effects – prior variance should reflect data variability – resolved by using hyper-priors • automatic algorithm; no need for user tuning • easy to compute Bayes factors from samples – sample posterior using MCMC – posterior pr( | y, m) is marginal histogram Model Selection Seattle SISG: Yandell © 2012 43 • sampled marginal histogram • shape affected by prior pr(A) BF 1 , 2 prior probability 0.10 0.20 – prior pr() chosen by user – posterior pr( |y,m) e e pr( 1|y, m) /pr( 1 ) pr( 2|y, m) /pr( 2 ) e p u exponential Poisson uniform p p e u u p u u u u u p e p e e p p e 0.00 • | | = number of QTL 0.30 Bayes factors & genetic architecture 0 e e p e e p u p u p u u 2 4 6 8 m = number of QTL 10 • pattern of QTL across genome • gene action and epistasis Model Selection Seattle SISG: Yandell © 2012 44 3 4 BF sensitivity to fixed prior for effects Bayes factors 0.5 1 2 4 3 2 3 4 2 4 3 2 4 3 2 4 3 2 4 3 2 4 3 2 3 4 2 4 3 2 1 1 1 1 1 0.2 3 4 2 4 3 2 1 1 1 1 0.05 1 0.20 1 0.50 2.00 5.00 2 hyper-prior heritability h 4 3 2 1 B45 B34 B23 B12 20.00 50.00 2 qj ~ N0, G2 / m, G2 h2 total , h2 fixed Model Selection Seattle SISG: Yandell © 2012 45 BF insensitivity to random effects prior insensitivity to hyper-prior 1.0 3.0 hyper-prior density 2*Beta(a,b) 0.0 0.5 1.0 1.5 2 hyper-parameter heritability h 3 2 3 2 3 2 3 2 Bayes factors 0.2 0.4 0.6 0.0 density 1.0 2.0 0.25,9.75 0.5,9.5 1,9 2,10 1,3 1,1 2.0 3 2 1 1 1 1 0.05 0.10 0.20 2 Eh 1 3 3 2 2 B34 B23 B12 1 1 0.50 1.00 2 qj ~ N0, G2 / m, G2 h2 total , 12 h2 ~ Beta(a, b) Model Selection Seattle SISG: Yandell © 2012 46