Bayesian Inference for QTLs in Inbred Lines I Brian S Yandell University of Wisconsin-Madison www.stat.wisc.edu/~yandell NCSU Statistical Genetics June 1999 June 1999 NCSU QTL Workshop © Brian S. Yandell 1 Note on Outbred Studies • Interval Mapping – – – – Haley, Knott & Elsen (1994) Genetics Thomas & Cortessis (1992) Hum. Hered. Hoeschele & vanRanden (1993ab) Theor. Appl. Genet. (etc.) Guo & Thompson (1994) Biometrics • Nuances -- faking it – experimental outbred crosses • collapse markers from 4 to 2 alleles – pedigrees • polygenic effects not modeled here • related individuals are correlated (via coancestry) June 1999 NCSU QTL Workshop © Brian S. Yandell 2 Many Thanks Michael Newton Daniel Sorensen Daniel Gianola Jaya Satagopan Patrick Gaffney Fei Zou Liang Li Tom Osborn David Butruille Marcio Ferrera Josh Udahl Pablo Quijada Alan Attie Jonathan Stoehr USDA Hatch Grants June 1999 NCSU QTL Workshop © Brian S. Yandell 3 What is the Goal Today? • show MCMC ideas • Bayesian perspective – Gibbs sampler – Metropolis-Hastings – common in animal model – use “prior” information • handle hard problems • previous experiments • related genomes – image analysis – genetics – large dependent data • resampling our data – permutation tests – MCMC – other (bootstrap,…) June 1999 • inbred lines “easy” – can check against *IM – ready extension • multiple experiments • pedigrees • non-normal data NCSU QTL Workshop © Brian S. Yandell 4 Overview • Part I: Single QTL – – – – – Modeling a trait with a QTL Likelihoods, Frequentists & Bayesians Profile LODs & Bayesian Posteriors Graphical Summaries Testing & Estimation • Part II: Multiple QTLs – – – – June 1999 Sampling from a Markov Chain Issues for Multiple QTLs Bayes Factors & Model Selection Brassica data analysis NCSU QTL Workshop © Brian S. Yandell 5 Part I: Single QTL • Modelling a trait with a QTL – linear model for trait given genotype – recombination near loci for genotype • Bayesian Posterior • Likelihoods – EM & MCMC – Frequentists & Bayesians • Review Interval Maps & Profile LODs • Case Study: Simulated Single QTL June 1999 NCSU QTL Workshop © Brian S. Yandell 6 QTL Components • observed data on individual – trait: field or lab measurement • log( days to flowering ) , yield, … – markers: from wet lab (RFLPs, etc.) • linkage map of markers assumed known • unobserved data on individual – geno: genotype (QQ=1/Qq=0/qq=-1) • unknown model parameters – effects: mean, difference, variance – locus: quantitative trait locus (QTL) June 1999 NCSU QTL Workshop © Brian S. Yandell 7 Single QTL trait Model • trait = mean + additive + error • trait = effect_of_geno + error • prob( trait | geno, effects ) y j b* x*j e j x=1 x=-1 x=0 ( y j | x*j ; , b* , 2 ) y j b* x*j June 1999 10 NCSU QTL Workshop © Brian S. Yandell 12 14 8 6 6 8 8 10 trait 10 12 12 14 14 Simulated Data with 1 QTL x=-1 June 1999 x=1 0 2 NCSU QTL Workshop © Brian S. Yandell 4 6 frequency 8 9 Recombination and Distance • no interference--easy approximation – Haldane map function – no interference with recombination • all computations consistent in approximation – rely on given map • marker loci assumed known – 1-to-1 relation of distance to recombination – all map functions are approximate • assume marker positions along map are known r June 1999 1 2 1 e 2 NCSU QTL Workshop © Brian S. Yandell 10 markers, QTL & recombination rates r3 r2 r1 r5 r4 * x ? M1 M 2 M3 M4 M5 M6 ? June 1999 distance along chromosome NCSU QTL Workshop © Brian S. Yandell 11 Interval Mapping of QT genotype • can express probabilities in terms of distance – locus is distance along linkage map – flanking markers sufficient if no missing data – could consider more complicated relationship prob( geno | locus, map ) = prob( geno | locus, flanking markers ) ( x*j | ) ( x*j | , M j ,k , M j ,k 1 ) June 1999 NCSU QTL Workshop © Brian S. Yandell 12 Building trait Likelihood • likelihood is mixture across possible genotypes • sum over all possible genotypes at locus like( effects, locus | trait ) = sum of prob( trait, genos | effects, locus ) L ( , b* , 2 , | y j ) ( y j | , b* , 2 ; ) * 2 ( y | x ; , b , ) ( x | ) j x 1, 0 ,1 June 1999 NCSU QTL Workshop © Brian S. Yandell 13 Likelihood over Individuals • product of trait probabilities across individuals – product of sum across possible genotypes like( effects, locus | traits, map ) = product of prob( trait | effects, locus, map ) n L( , b* , 2 ; | y ) ( y j | , b* , 2 ; ) j 1 n * 2 ( y | x ; , b , ) ( x | ) j j 1 x 1, 0 ,1 June 1999 NCSU QTL Workshop © Brian S. Yandell 14 25 Profile LOD for 1 QTL 0 5 10 15 LOD 20 QTL IM 0 10 20 30 40 50 60 70 80 90 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 15 Interval Mapping for Quantitative Trait Loci • profile likelihood (LOD) across QTL – scan whole genome locus by locus • use flanking markers for interval mapping – maximize likelihood ratio (LOD) at locus • best estimates of effects for each locus • EM method (Lander & Botstein 1989) n LOD ( ) (log 10 e) ln j 1 June 1999 (y | x; ˆ , bˆ* , ˆ 2 ) ( x | ) x 1, 0 ,1 * 2 ˆ ( y j | ˆ , b 0, ˆ ) j NCSU QTL Workshop © Brian S. Yandell 16 Interval Mapping Tests • profile LOD across possible loci in genome – maximum likelihood estimates of effects at locus – LOD is rescaling of L(effects, locus|y) • test for evidence of QTL at each locus – LOD score (LR test) – adjust (?) for multiple comparisons June 1999 NCSU QTL Workshop © Brian S. Yandell 17 Basic Idea of Likelihood Use • build likelihood in steps – build from trait & genotypes at locus – likelihood for individual i – log likelihood over individuals • maximize likelihood (interval mapping) – EM method (Lander & Botstein 1989) – MCMC method (Guo & Thompson 1994) • study whole likelihood as posterior (Bayesian) – analytical methods (e.g. Carlin & Louis 1998) – MCMC method (Satagopan et al 1996) June 1999 NCSU QTL Workshop © Brian S. Yandell 18 EM-MCMC duality • EM approach can be redone with MCMC – EM estimates & maximizes – MCMC draws random samples – both can address same problem • sometimes EM is hard (impossible) to use • MCMC is tool of “last resort” – – – – June 1999 use exact methods if you can try other approximate methods be clever! very handy for hard problems in genetics NCSU QTL Workshop © Brian S. Yandell 19 0.2 0.0 0.2 parameter parameter = 4 012345678910 parameter = 6 012345678910 data : y 1,3,8 posterior 0.3 0 0.0 0.1 0.0002 0.2 likelihood 0 2 4 6 8 June 1999 0.0 parameter = 2 012345678910 0.0004 0.0 0.2 Likelihood & Posterior Example parameter : t ? t k e t prob{ y k | t} k! NCSU QTL Workshop © Brian S. Yandell 20 Studying the Likelihood • maximize (*IM) – find the peak – avoid local maxima – profile LOD • across locus • max for effects • sample (Bayes) – get whole curve – summarize later – posterior • EM method – always go up – steepest ascent • MCMC method – – – – jump around go up if you can sometimes go down cool down to find peak • simulated annealing • simulated tempering • locus & effects together June 1999 NCSU QTL Workshop © Brian S. Yandell 21 Bayes Theorem • posteriors and priors – prior: – posterior: prob( parameters ) prob( parameters | data ) • posterior = likelihood * prior / constant • posterior distribution is proportional to – likelihood of parameters given data – prior distribution of parameters ( param , data) (data | param) ( param) ( param | data) (data) (data) ( param | data) (data | param) ( param) June 1999 NCSU QTL Workshop © Brian S. Yandell 22 Bayesian Prior • “prior” belief used to infer “posterior” estimates – higher weight for more probable parameter values • based on prior knowledge – use previous study to inform current study • weather prediction: tomorrow is much like today • previous QTL studies on related organisms – historical criticism: can get “religious” about priors • often want negligible effect of prior on posterior – pick non-informative priors • all parameter values equally likely • large variance on priors – always check sensitivity to prior June 1999 NCSU QTL Workshop © Brian S. Yandell 23 How to Study Posterior? • exact methods • Monte Carlo methods – exact if possible – can be difficult or impossible to analyze • approximate methods – importance sampling – numerical integration – Monte Carlo & other June 1999 – easy to implement – independent samples • MCMC methods – handle hard problems – art to efficient use – correlated samples NCSU QTL Workshop © Brian S. Yandell 24 QTL Bayesian Inference • study posterior distribution of locus & effects – sample joint distribution • locus, effects & genotypes – study marginal distribution of • locus • effects – overall mean, genotype difference, variance • locus & effects together • estimates & confidence regions – histograms, boxplots & scatter plots – HPD regions June 1999 NCSU QTL Workshop © Brian S. Yandell 25 38 40 2.0 2.2 36 1.8 1.8 2.0 additive 2.2 Posterior for locus & effect 42 QTL 1 0.0 0.1 0.2 0.3 distance (cM) 36 38 40 42 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 26 QTL Marginal Posterior • posterior = likelihood * prior / constant • posterior distribution is proportional to – likelihood of traits & genos given effects & locus – prior distribution of effects & locus (b* | y) is proportional to n (b ) ( y j | x*j ; , b* , 2 ) * j 1 June 1999 NCSU QTL Workshop © Brian S. Yandell 27 Marginal Posterior Summary • marginal posterior for locus & effects • highest probability density (HPD) region – smallest region with highest probability – credible region for locus & effects • HPD with 50,80,90,95% – range of credible levels can be useful – marginal bars and bounding boxes – joint regions (harder to draw) June 1999 NCSU QTL Workshop © Brian S. Yandell 28 1.8 2.0 additive 2.2 HPD Region for locus & effect 36 38 40 42 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 29 QTL Bayesian Posterior • posterior = likelihood * prior / constant • posterior( paramaters | data ) prob( genos, effects, loci | trait, map ) ( x * ; , b* , 2 ; | y ) is proportional to n * * 2 ( y | x ; , b , ) j j j 1 n ( ) (b* ) ( 2 ) ( ) ( x*j | ) j 1 June 1999 NCSU QTL Workshop © Brian S. Yandell 30 Frequentist or Bayesian? • Frequentist approach – fixed parameters • range of values – maximize likelihood • ML estimates • find the peak – confidence regions • random region • invert a test – hypothesis testing • 2 nested models June 1999 • Bayesian approach – random parameters • distribution – posterior distribution • posterior mean • sample from dist – credible sets • fixed region given data • HPD regions – model selection/critique • Bayes factors NCSU QTL Workshop © Brian S. Yandell 31 Frequentist or Bayesian? • Frequentist approach – maximize over mixture of QT genotypes – locus profile likelihood • max over effects – HPD region for locus • natural for locus – 1-2 LOD drop • work to get effects • Bayesian approach – joint distribution over QT genotypes – sample distribution • joint effects & loci – HPD regions for • joint locus & effects • use density estimator – approximate shape of likelihood peak June 1999 NCSU QTL Workshop © Brian S. Yandell 32 Interval Mapping for Quantitative Trait Loci • profile likelihood (LOD) across QTL – scan whole genome locus by locus • use flanking markers for interval mapping – maximize likelihood ratio (LOD) at locus • best estimates of effects for each locus • EM method (Lander & Botstein 1989) n LOD ( ) (log 10 e) ln j 1 June 1999 (y | x; ˆ , bˆ* , ˆ 2 ) ( x | ) x 1, 0 ,1 * 2 ˆ ( y j | ˆ , b 0, ˆ ) j NCSU QTL Workshop © Brian S. Yandell 33 25 Profile LOD for 1 QTL 0 5 10 15 LOD 20 QTL IM 0 10 20 30 40 50 60 70 80 90 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 34 Interval Mapping Tests • profile LOD across possible loci in genome – maximum likelihood estimates of effects at locus – LOD is rescaling of L(effects, locus|y) • test for evidence of QTL at each locus – LOD score (LR test) – adjust (?) for multiple comparisons June 1999 NCSU QTL Workshop © Brian S. Yandell 35 Interval Mapping Estimates • confidence region for locus – based on inverting test of no QTL – 2 LODs down gives approximate CI for locus – based on chi-square approximation to LR • confidence region for effects – approximate CI for effect based on normal – point estimate from profile LOD locus CI | LOD (ˆ ) LOD ( ) 2 effect CI bˆ* 1.96se(bˆ* ) June 1999 NCSU QTL Workshop © Brian S. Yandell 36 1.6 1.8 2.0 additive 2.2 2.4 IM Confidence Region 36 38 40 42 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 37 How to Proceed? • figure out how to construct posterior – Markov chain Monte Carlo • run Markov chain • summarize results • diagnostics June 1999 NCSU QTL Workshop © Brian S. Yandell 38 MCMC Run for 1 locus Data distance (cM) 40 38 36 42 40 38 36 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 11 1 11 11 11 1 1 1 1 1 11 1 1 1 111 1 11 1 1 1 1 1 1 1 1 11 11 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1111 1 11 111 1 111 11 1 11 1 11 1 11 11 1 111 11 1 1 1 11111 1 1 1111 1 111 1 1 1 1 1 1 1 1 1 1 1 1 11 11 1 1 11 1 1 11 1 1 11 1 11111111 1 1111 1 11 1 111 1 11 111 11 11 1 11111111 1 1 1 1 11 1 1 11 111 11 1 1 11 1 11 1 1 1 11 111 111 11111 11 1 1 1 1 1111 1 11 11111 111 1111 1 1 1 1 1 11111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 11111111 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1 11 1 1111 1 1 11 1 11 111 1 111 111 1 1 111 11111 11111 1 1 1 1 11 11111 111 111111 11 1 11111 111 1 1 1 1 1 1 1 1 1 1 1 11 111111 11 1 1 1 1 11 11 1 111 1 1 1 1 1 111111 1 1 1 1111111 11 1 11 11 11 1 11 1 111111111 111111111111 1 1 11 1111 1 11111 111 11 111 11 1 1 1 111 1 1 11 111111 1111 11 1 11111 1 1 1 11 1 1 1 1 1 1 1 1 1 1 11 1 1 1 1111 1 1111 11 1 1 1 1 1 11 1 111 111 11 1 11 111 11111 1 111 11 1111 11 11 11 1111111 1 1 1 111 11 1 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1111 11 11 1 1 1 1 1 11 1 1111 1 111111 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111 111 1 1 1 1 1 111 111 11 11 11 111 1 1 1 11111111111 11 111 1 1 11 11 1 11 1 11 11 1 1 1 111 1 1111 111 1 1 1 11 111 1 11 1 1 11 11 1 1 1 1 1 1 1 1 1 1 11 1 11 1 11 1 1111 1 1 1 1 1 1 11 111 1 1 1 1 1 1 11 11 11 1 1 11 1 1 1 1 11 11 11 1 1 11 1 1 1 11 11 111 1 1 11 1 1 1 1 1 1 1 1 1 1 1 1 1 11 11 1 111 1 1 111 1 1 1 11 1 1 1 1 11 1 11 1 1 1 111 1 11 1 1 1 1 1 11 11 1 1 1 11 1 1 1 1 1 111 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 1 1 1 1 42 1 1 1 1 1 0 200 400 600 800 1000 0 MCMC run/100 June 1999 NCSU QTL Workshop © Brian S. Yandell 20 40 60 frequency 39 Part II: Multiple QTL • Sampling from the Posterior – Markov chain Monte Carlo • Gibbs sampler for effects • Metropolis-Hastings for loci • Issues for 2 QTL • Bayes factors & Model Selection • Simulated data for 0,1,2 QTL • Brassica data on days to flowering June 1999 NCSU QTL Workshop © Brian S. Yandell 40 Posterior for effects •start with joint distribution of effects – does not depend on locus given genotypes – standard linear model problem ( , b , | y, x ) * 2 * is proportional to n ( ) (b ) ( ) ( y j | x ; , b , ) * 2 * j * 2 j 1 June 1999 NCSU QTL Workshop © Brian S. Yandell 41 How to Study Posterior? • exact methods • Monte Carlo methods – exact if possible – can be difficult or impossible to analyze • approximate methods – importance sampling – numerical integration – Monte Carlo & other June 1999 – easy to implement – independent samples • MCMC methods – handle hard problems – art to efficient use – correlated samples NCSU QTL Workshop © Brian S. Yandell 42 Ordinary Monte Carlo • independent samples of joint distribution • chaining (or peeling) of effects • requires numerical integration – possible analytically here – very messy in general ( , b* , 2 | y, x * ) ( 2 | y, x * ; , b* ) (b* | y, x * ; ) ( | y, x * ) ( | y, x * ) E(b , ) ( , b* , 2 | y, x * ) * June 1999 2 NCSU QTL Workshop © Brian S. Yandell 43 Gibbs Sampler for effects • set up Markov chain around posterior for effects • sample from posterior by sampling from full conditionals – conditional posterior of each parameter given the other – update parameter by sampling full conditional update mean ( | y, x * ; b* , 2 ) ( ) (y | x * ; , b* , 2 ) update additive (b* | y, x * ; , 2 ) (b* ) (y | x * ; , b* , 2 ) update variance ( 2 | y, x * ; , b* ) ( 2 ) (y | x * ; , b* , 2 ) June 1999 NCSU QTL Workshop © Brian S. Yandell 44 Markov chain updates genos mean additive variance traits June 1999 NCSU QTL Workshop © Brian S. Yandell 45 Markov chain idea • future events given the present state are independent of past events • update chain based on current value • can make chain arbitrarily complicated • chain converges to stable pattern (1) p /( p q) p 1-p 0 1 1-q q June 1999 NCSU QTL Workshop © Brian S. Yandell 46 Markov chain idea p 1-p Mitch’s 0 (1) p /( p q) q 1 1-q 1 Series1 Series2 Other 0 1 June 1999 3 5 7 9 11 13 15 17 19 21 NCSU QTL Workshop © Brian S. Yandell 47 Markov chain Monte Carlo • can study arbitrarily complex models – need only specify how parameters affect each other – can reduce to specifying full conditionals • construct Markov chain with “right” model – update some parameters given data and others – can fudge on “right” (importance sampling) – next step depends only on current estimates • nice Markov chains have nice properties – sample summaries make sense – consider almost as random sample from distribution June 1999 NCSU QTL Workshop © Brian S. Yandell 48 0 200 400 600 800 10.4 mean 10.0 10.2 1 1 1 1 1 1 11 11 1 1 1 11 11 1 1 1 111 11 11 1111 1111 1 1111 11 1111 1 1 1 1 1 1 1 1 1 1 1 1 111 111 11 1 11 1 1 1 111 1 111 1 1 1 11 1 11 1 1 1 1111111111 1 111 111 11111 11111111111111111111 1111 11 1 1111 11111111 11111111 111 11 1111 11 111111 111 11 1 11 111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111 1 1 1 1 1 1 1 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1111111 1 1111111 111111 1111111 1 1 1 1 1 1 1 1111 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11111111111111111111 11111111 1111111 111 11111 1111111 11 1111111 111 11 111 1111111 111 1 1111 11 111111111 11 111 11111 11 1111 1111 11 1111111111 11 111111 11111111 111111111 1111 111111 11 111 11 11 1111111111 1111 11111 111111 1 1 1 1 1 111111 1 1 11 1 1 1 111111 1 11111 1111111 1 1 1 1 1 1 1 1 1 111111111 1 1 1 1 1 1 1 1111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 11 1111 1111 111111 111111 1 1111111 1111 11 1111111 1111111 1 11 1111111 1 111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 1 1 11 1 1 1111111 111 111 11 11 111 11 1 1 1 11 1 11 11 1 11 1 11 11 11 1 11 111 1 1 1 1 111 1 1 1 1 1 1 1 11 1 1 1 1 9.8 1 9.6 9.6 9.8 10.0 10.2 10.4 MCMC run of mean & effect 1000 0 20 MCMC run/100 40 60 frequency 11 1 1 1 1 1 1 11 111 1 1 1 11 1 1 1 11 11 1 1 1111 1 11 111 1 11 1 1 1 1 1 1 1 1 1 1 11 1111 11111 1 1 1 11 1 1 1 111 11 1 11 11 111 11111111 111 1 1111 1 11 11 111111111111111 1 11 111 111111111 1 1 1 1 1 1 1 1 1 11 11111 1 1 1111111111 11 11111111 1111 111 11111 1111111 11111111111 1 1111111 11 1111 11 111111111 1111 1111111 11111111111111111 11 111 1111111 1 11 111 1111111 111 11111 1111 111111 1111 1111 1 111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111 1111 1 111 11111 1111 111111111 11 1 1111 11111111111111111111 1 1 1111111 111111111 111 11111 11111 1111 11111 11 11 1 111111 111 11111 111 111111 1 1 11 111 1111111 1 111 11111 1 11 1 1 1 1 1 1 1 1 1 1 11111 1111 11 111111 11111 11 1 111111 1111111 11111 111 11 11111 1111111 11 11111 1111 1111111 11111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1111111111 111 1111 1 1 1 1111 11 11 1 1 1 111 1 111 1 1111111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111 11 1 1 1111111111 11111 1 11 11 1 1 11 1 11 11 11 111111 11 11 11 11 1 11 1 11111111 111 11 1 11 1 1111111111 11 1 1 1 11 11111111 111 1 1 111 111 11111 1 1 1 11 1 11 1 1 111 1 1 11 1 1 1 1 1 1 1 0 200 400 600 800 1000 additive 2.0 2.2 1.8 1.8 2.0 2.2 1 0 MCMC run/100 June 1999 NCSU QTL Workshop © Brian S. Yandell 20 40 60 frequency 49 MCMC Idea for QTLs • construct Markov chain around posterior – want posterior as stable distribution of Markov chain – in practice, the chain tends toward stable distribution • initial values may have low posterior probability • burn-in period to get chain mixing well • update one (or several) components at a time – update effects given genotypes & traits – update locus given genotypes & traits – update genotypes give locus & effects ( x * ; , b* , 2 ; ) ~ ( x * ; , b* , 2 ; | y ) 1 2 N June 1999 NCSU QTL Workshop © Brian S. Yandell 50 Markov chain updates genos locus effects traits June 1999 NCSU QTL Workshop © Brian S. Yandell 51 Care & Use of MCMC • sample chain for long run (100,000-1,000,000) – longer for more complicated likelihoods – use diagnostic plots to assess “mixing” • standard error of estimates – use histogram of posterior – compute variance of posterior--just another summary • studying the Markov chain – Monte Carlo error of series (Geyer 1992) • time series estimate based on lagged auto-covariances – convergence diagnostics for “proper mixing” June 1999 NCSU QTL Workshop © Brian S. Yandell 52 Full Conditional for genos • full conditional for genotype depends on – effects via trait model – locus via recombination model • can explicitly decompose by individual j – binomial (or trinomial) probability x*j 1, 0, or 1 * * 2 * ( y | x ; , b , ) ( x j j j | ) * * 2 Pj ( x j | y j ; , b , ; ) * 2 ( y | x ; , b , ) ( x | ) j x 1, 0 ,1 June 1999 NCSU QTL Workshop © Brian S. Yandell 53 Prior for locus • prior information from other studies – uniform prior over genome – use framework map •choose interval proportional to length •then pick uniform position within interval 0.0 0.2 •concentrate on credible regions •use posterior of previous study as new prior • no prior information on locus 0 June 1999 20 40 60 distance (cM) NCSU QTL Workshop © Brian S. Yandell 80 54 Full Conditional for locus • cannot easily sample from locus full conditional • cannot explicitly determine full conditional – difficult to normalize – need to consider all possible genotypes over entire map • Gibbs sampler will not work – but can get something proportional ... ( | y, x * ; , b * , 2 ) ( | x * ) n ( ) ( x * j | ) j 1 June 1999 NCSU QTL Workshop © Brian S. Yandell 55 Metropolis-Hastings Step • pick new locus based upon current locus – propose new locus from distribution q( ) • pick value near current one? • pick uniformly across genome? – accept new locus with probability a() • Gibbs sampler is special case of M-H – always accept new proposal • acceptance insures right stable distribution (new | x * )q(new , old ) a(old , new ) min 1, * (old | x )q(old , new ) June 1999 NCSU QTL Workshop © Brian S. Yandell 56 MCMC run: 2 loci assuming only 1 1 1 1 1 1 11 1 11 1 1 1 1 11 1 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 1 11 1 1 1 11 1 1 1 1 1 111 1 1 1 1 1 1 1 1 1 111 11 1 11 1 11 1111 11 1 11 11 1 1 1 1 1 1 11 1 11111 1 11 1 1 1 1 11 111 111 11 1 111 1111111 111 11111111 1 111 111 111 111 11 111111 11 111111 11111111111111111111 11111 111111111111111111 1111 1 1 11 111111111 1 1 111 1 11 1 1 1111111 11111 1 1 11 1 111111111111 11 1111111 1 1 1 1 111 1111 111111111 1 1111111 1111111 11 1111 11 1111111 111 11 111 11111111111111111111 11 11111111 1111 111 111111 1111 1 1 1 1 1 111 1 1 1 1 1 1 1 111 1 1 1 1 1 11 111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11111111 11111 11111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11111111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 11 11111 1111 1 111 11111111111 11111 11111 111111111 1111111111111 1 1 1111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1111 11 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11 11 1 1 111 11 1 1 1 1 1 1111 1 1 11111 1 1 11 11111 111 1 1 1 111 1 1 111111111 1 1 1 1 1 11 111 11 1 111 1 1 1 11 11 1 1 111 1 1 1 1 11 1 1 1 11 11 1111111 1 11 11111 1 1 1 1 1 1 1 1 11 1 1 1 11 1 1 1 11 1 1 1 1 1 1 1 1 111 1 1 1 1 1 1 1 70 1 50 distance60(cM) 1 50 60 70 1 1 1 1 11 1 40 40 1 1 11 11 1 0 200 400 600 800 1000 0 50 MCMC run/1000 June 1999 NCSU QTL Workshop © Brian S. Yandell 100 150 200 250 frequency 57 Multiple QTL model • trait = mean + add1 + add2 + error • trait = effect_of_genos + error • prob( trait | genos, effects ) y j b x b x ej * * 1 j1 * * 2 j2 m y j b x ej r 1 June 1999 * * r jr NCSU QTL Workshop © Brian S. Yandell 58 4 4 6 6 8 8 10 trait 10 12 12 14 14 16 16 Simulated Data with 2 QTL -1,-1 -1,1 1,-1 1,1 recombinants June 1999 0 1 NCSU QTL Workshop © Brian S. Yandell 2 3 4 frequency 5 6 59 Issues for Multiple QTL • how many QTL influence a trait? – 1, several (oligogenic) or many (polygenic)? – how many are supported by the data? • searching for 2 or more QTL – conditional search (IM, CIM) – simultaneous search (MIM) • epistasis (inter-loci interaction) – many more parameters to estimate – effects of ignored QTL June 1999 NCSU QTL Workshop © Brian S. Yandell 60 30 LOD for 2 QTL 0 5 10 15 LOD 20 25 QTL IM CIM 0 10 20 30 40 50 60 70 80 90 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 61 Interval Mapping Approach • interval mapping (IM) – scan genome for 1 QTL • composite interval mapping (CIM) – scan for 1 QTL while adjusting for others – use markers as surrogates for other QTL • multiple interval mapping (MIM) – search for multiple QTL June 1999 NCSU QTL Workshop © Brian S. Yandell 62 70 distance (cM) 60 60 1 1 111111 11 11111111 1 1111111 1 11 1 1111 11 1111111111 111 11 1 1111 1 111 11 11 1 1111 1111111111111 11111 1111 11111111111 1 11111111 11 11111111 111 1111 11 111111 111111111 1111 11 1111 11 1 111 11 111 1 111 111111 11 11111111 1 11 1 111 11 11 11 11 1111 111 111111 11 11111 1111111 1 11 11111 11 111 111111 111 1111 11 11 11 11 1111 1 111111 111 11 111 111 11 111 1 1 11111111 11 1111 1 11111 11 11 11111 1 11 11 1 11111 1 1 111111 1 11111 111 11 11 11 111 1 1 11 1 1 1 111 11 1 111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111111 11 1 1 1 1 1 1 1 1 1 1 11111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 11111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111111111111111111111 11111111111 1 11111 1 111 1111111 111111 11111111111 11111111 1 1 1111111111111 11111111111 111 1 11 111 1 1 1 1 1 11111111 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 111 11 111 11 111 1 1 111 1111111 111 1 11 11111 1 11111 111 11 11 1 1 1 11111 1 1 1 1 1 1 1 1 1 1 1 11 11 1 11 111 1 1 1 0 200 400 600 800 1000 40 50 50 40 80 2 22 22 2 2 2 2 22 2 2 2 2 2 2 2222 2 2 22 2 222 22 222 222 2 2 2 2 222 22222222 22222222 22 22 22222 2 22222 2 2 222 2 22 222222222 2 2 2 2 2 2 2 2 2 2 2 2 2 222 2 2 2 2 2 2 2 222 2 2222 2222 222 2 2222 22 222 22 2 222222 222 22222222222222 2 222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22222222222222222 2 22222222222 2222222222222 222222 2 22 22 2 222222222 22 222 22222 22 2 222 222222222222222 2222 2222 2 2222222 222 2 22 2222 22222 222 22222222222 22222 2222222 22222 222222222 222222222 2222 2 222222 22222 22 22222222 2222 222222 222 222 222 2 2 2222 222222222 22222222222 22 2222222222 22222222222222 2 2 2 222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 2222222 22222222222222 2222222 22222222 2 22 22 222 22 2 2222222222 2222222222 2 2 222 222222 2222 2222222 2222 2222222 22222222 22222222 22222222 22 2 22 2222 2 222 22 22 22 22222 2222 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 22 2 22 2 22 22 22 2 2 2 2 2 70 80 MCMC run with 2 loci 0 50 MCMC run/100 June 1999 NCSU QTL Workshop © Brian S. Yandell 100 150 200 250 300 frequency 63 Bayesian Approach • simultaneous search for multiple QTL • use Bayesian paradigm – easy to consider joint distributions – easy to modify later for other types of data • counts, proportions, etc. – employ MCMC to estimate posterior dist • study estimates of loci & effects • use Bayes factors for model selection – number of QTL June 1999 NCSU QTL Workshop © Brian S. Yandell 64 2.6 2.4 2.2 1.8 2.0 2.2 2.0 1.8 effect 2.4 2.6 Effects for 2 Simulated QTL 40 50 60 70 80 effect 1 effect 2 0.0 0.05 0.10 0.15 distance (cM) 40 50 60 70 80 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 65 Brassica napus Data • 4-week & 8-week vernalization effect – log(days to flower) • genetic cross of – Stellar (annual canola) – Major (biennial rapeseed) • 105 F1-derived double haploid (DH) lines – homozygous at every locus (QQ or qq) • 10 molecular markers (RFLPs) on LG9 – two QTLs inferred on LG9 (now chromosome N2) – corroborated by Butruille (1998) – exploiting synteny with Arabidopsis thaliana June 1999 NCSU QTL Workshop © Brian S. Yandell 66 2.5 3.0 2.5 8-week 3.5 3.5 Brassica 4- & 8-week Data 2.5 3.0 3.5 4.0 0 2 4 6 8 10 8-week vernalization 0 2 4 6 8 4-week 2.5 3.0 3.5 4.0 4-week vernalization June 1999 NCSU QTL Workshop © Brian S. Yandell 67 8 Brassica Data LOD Maps 8-week 0 2 4 LOD 6 QTL IM CIM 0 10 20 30 40 50 60 70 80 90 60 70 80 90 15 distance (cM) 0 5 10 LOD 4-week QTL IM CIM 0 10 20 30 40 50 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 68 4-week vs 8-week vernalization 4-week vernalization • longer time to flower • larger LOD at 40cM • modest LOD at 80cM • loci well determined cM 40 80 June 1999 add .30 .16 • • • • 8-week vernalization shorter time to flower larger LOD at 80cM modest LOD at 40cM loci poorly determined cM 40 80 NCSU QTL Workshop © Brian S. Yandell add .06 .13 69 Brassica Credible Regions 8-week -0.3 -0.6 -0.2 -0.4 -0.2 additive additive -0.1 0.0 0.0 0.1 0.2 0.2 4-week 20 40 60 80 20 distance (cM) June 1999 NCSU QTL Workshop © Brian S. Yandell 40 60 80 distance (cM) 70 Collinearity of QTLs • multiple QT genotypes are correlated – QTL linked on same chromosome – difficult to distinguish if close • estimates of QT effects are correlated – poor identifiability of effects parameters – correlations give clue of how much to trust • which QTL to go after in breeding? – largest effect? – may be biased by nearby QTL June 1999 NCSU QTL Workshop © Brian S. Yandell 71 Brassica effect Correlations 8-week additive 2 -0.2 -0.1 cor = -0.7 -0.6 -0.3 -0.4 -0.2 additive 2 cor = -0.81 0.0 0.0 4-week -0.6 -0.4 -0.2 0.0 0.2 -0.2 additive 1 June 1999 -0.1 0.0 0.1 0.2 additive 1 NCSU QTL Workshop © Brian S. Yandell 72 Bayes Factors Which model (1 or 2 or 3 QTLs?) has higher probability of supporting the data? – ratio of posterior odds to prior odds – ratio of model likelihoods ( model1 | y) / ( model2 | y) (y | model1 ) B12 ( model1 ) / ( model2 ) (y | model2 ) June 1999 BF(1:2) 2log(BF) evidence for 1st <1 <0 negative 1 to 3 0 to 2 negligible 3 to 12 2 to 5 positive 12 to 150 5 to 10 strong > 150 > 10 very strong NCSU QTL Workshop © Brian S. Yandell 73 Bayes Factors & LR • equivalent to LR statistic when – comparing two nested models – simple hypotheses (e.g. 1 vs 2 QTL) • Bayes Information Criteria (BIC) in general – Schwartz introduced for model selection – penalty for different number of parameters p 2 log( B12 ) 2 log( LR) ( p2 p1 ) log( n) June 1999 NCSU QTL Workshop © Brian S. Yandell 74 Model Determination using Bayes Factors • pick most plausible model – histogram for range of models – posterior distribution of models – use Bayes theorem – often assume flat prior across models • posterior distribution of number of QTLs June 1999 NCSU QTL Workshop © Brian S. Yandell 75 Brassica Bayes Factors •compare models for 1, 2, 3 QTL •Bayes factor and -2log(LR) •large value favors first model •8-week vernalization only here June 1999 i vs. j Bayes Factor -log(LR) 2 vs. 1 2.49 7.82 3 vs. 1 .005 7.41 3 vs. 2 .002 4.17 NCSU QTL Workshop © Brian S. Yandell 76 Computing Bayes Factors • using samples from prior – mean across Monte Carlo or MCMC runs – can be inefficient if prior differs from posterior (y | modelk ) (y | k ; modelk ) ( k | modelk )d k • using samples from posterior – weighted harmonic mean more efficient but unstable – care in choosing weight h() close to posterior – increase stability: absorb variance (normal to t dist) h( k ) ˆ (y | modelk ) G g 1 (y | k ; modelk ) ( k | modelk ) G June 1999 NCSU QTL Workshop © Brian S. Yandell 1 77 MCMC Software • General MCMC software – U Bristol links • http://www.stats.bris.ac.uk/MCMC/pages/links.html – BUGS (Bayesian inference Using Gibbs Sampling) • http://www.mrc-bsu.cam.ac.uk/bugs/ • Our MCMC software for QTLs – C code using LAPACK • ftp://ftp.stat.wisc.edu/pub/yandell/revjump.tar.gz – coming soon: • perl preprocessing (to/from QtlCart format) • Splus post processing • Bayes factor computation June 1999 NCSU QTL Workshop © Brian S. Yandell 78 Bayes & MCMC References • CJ Geyer (1992) “Practical Markov chain Monte Carlo”, Statistical Science 7: 473-511 • L Tierney (1994) “Markov Chains for exploring posterior distributions”, The Annals of Statistics 22: 1701-1728 (with disc:1728-1762). • A Gelman, J Carlin, H Stern & D Rubin (1995) Bayesian Data Analysis, CRC/Chapman & Hall. • BP Carlin & TA Louis (1996) Bayes and Empirical Bayes Methods for Data Analysis, CRC/Chapman & Hall. • WR Gilks, S Richardson, & DJ Spiegelhalter (Ed 1996) Markov Chain Monte Carlo in Practice, CRC/Chapman & Hall. June 1999 NCSU QTL Workshop © Brian S. Yandell 79 QTL References • D Thomas & V Cortessis (1992) “A Gibbs sampling approach to linkage analysis”, Hum. Hered. 42: 63-76. • I Hoeschele & P vanRanden (1993) “Bayesian analysis of linkage between genetic markers and quantitative trait loci. I. Prior knowledge”, Theor. Appl. Genet. 85:953-960. • I Hoeschele & P vanRanden (1993) “Bayesian analysis of linkage between genetic markers and quantitative trait loci. II. Combining prior knowledge with experimental evidence”, Theor. Appl. Genet. 85:946-952. • SW Guo & EA Thompson (1994) “Monte Carlo estimation of mixed models for large complex pedigrees”, Biometrics 50: 417-432. • JM Satagopan, BS Yandell, MA Newton & TC Osborn (1996) “A Bayesian approach to detect quantitative trait loci using Markov chain Monte Carlo”, Genetics 144: 805-816. June 1999 NCSU QTL Workshop © Brian S. Yandell 80