Model Selection for Multiple QTL 1. reality of multiple QTL 2. selecting a class of QTL models 3. comparing QTL models • • QTL model selection criteria issues of detecting epistasis 4. simulations and data studies • • • Model 3-8 9-15 16-24 25-40 simulation with 8 QTL plant BC, animal F2 studies searching through QTL models NCSU QTL II: Yandell © 2004 1 what is the goal of QTL study? • uncover underlying biochemistry – – – – identify how networks function, break down find useful candidates for (medical) intervention epistasis may play key role statistical goal: maximize number of correctly identified QTL • basic science/evolution – – – – how is the genome organized? identify units of natural selection additive effects may be most important (Wright/Fisher debate) statistical goal: maximize number of correctly identified QTL • select “elite” individuals – predict phenotype (breeding value) using suite of characteristics (phenotypes) translated into a few QTL – statistical goal: mimimize prediction error Model NCSU QTL II: Yandell © 2004 2 1. reality of multiple QTL • evaluate some objective for model given data – classical likelihood – Bayesian posterior • search over possible genetic architectures (models) – number and positions of loci – gene action: additive, dominance, epistasis • estimate “features” of model – means, variances & covariances, confidence regions – marginal or conditional distributions • art of model selection – how select “best” or “better” model(s)? – how to search over useful subset of possible models? Model NCSU QTL II: Yandell © 2004 3 advantages of multiple QTL approach • improve statistical power, precision – increase number of QTL detected – better estimates of loci: less bias, smaller intervals • improve inference of complex genetic architecture – patterns and individual elements of epistasis – appropriate estimates of means, variances, covariances • asymptotically unbiased, efficient – assess relative contributions of different QTL • improve estimates of genotypic values – less bias (more accurate) and smaller variance (more precise) – mean squared error = MSE = (bias)2 + variance Model NCSU QTL II: Yandell © 2004 4 Pareto diagram of QTL effects 3 (modifiers) minor QTL polygenes 1 2 major QTL 0 3 additive effect major QTL on linkage map 2 1 Model 0 4 5 5 10 15 20 25 30 rank order of QTL NCSU QTL II: Yandell © 2004 5 limits of multiple QTL? • limits of statistical inference – power depends on sample size, heritability, environmental variation – “best” model balances fit to data and complexity (model size) – genetic linkage = correlated estimates of gene effects • limits of biological utility – sampling: only see some patterns with many QTL – marker assisted selection (Bernardo 2001 Crop Sci) • 10 QTL ok, 50 QTL are too many • phenotype better predictor than genotype when too many QTL • increasing sample size may not give multiple QTL any advantage – hard to select many QTL simultaneously • 3m possible genotypes to choose from Model NCSU QTL II: Yandell © 2004 6 QTL below detection level? • problem of selection bias – QTL of modest effect only detected sometimes – their effects are biased upwards when detected • probability that QTL detected – avoids sharp in/out dichotomy – avoid pitfalls of one “best” model – examine “better” models with more probable QTL • build m = number of QTL detected into QTL model – directly allow uncertainty in genetic architecture – model selection over genetic architecture Model NCSU QTL II: Yandell © 2004 7 2. selecting a class of QTL models • phenotype distribution – normal (usual), binomial, Poisson, … – exponential family, semi-parametric, nonparametric • = gene action – additive (A) or general (A+D) effects – epistatic interactions (AA, AD, ..., or other types?) • = location of QTL – known locations? – widely spaced (no 2 in marker interval) or arbitrarily close? • m = number of QTL – single QTL? – multiple QTL: known or unknown number? Model NCSU QTL II: Yandell © 2004 8 normal phenotype • trait = mean + genetic + environment • genetic effect uncorrelated with environment • pr( trait Y | genotype Q, effects ) Y GQ e Y var( GQ ) G2 , var( e) 2 Qq effects (GQ , 2 ) qq heritabili ty h 2 G 2 2 8 Model QQ 2 G 9 GQQ 10 11 12 13 NCSU QTL II: Yandell © 2004 14 15 16 17 18 19 20 9 two QTL with epistasis • same phenotype model overview Y GQ e, var( e) 2 • partition of genotypic value with epistasis GQ 1 (Q) 2 (Q) 12 (Q) • partition of genetic variance 2 2 2 2 var( GQ ) G 1 2 12 Model NCSU QTL II: Yandell © 2004 10 epistasis examples (Doebley Stec Gustus 1995; Zeng pers. comm.) 3 1 effect +/- 2 se 0 4 bb Bb BB a1 d1 a2 d2 iaa iad ida idd aa Aa AA Aa AA bb Bb BB -2 0 AA a1 d1 a2 d2 iaa iad ida idd 0.5 4.0 traits 1,4,9 Fisher-Cockerham effects Aa Model AA 0.0 -1.0 -0.5 3.0 2.5 aa 1.5 2.0 aa Aa AA 1.0 2.0 1.5 1.0 bb Bb BB effect +/- 2 se 3.5 genotypic value genotypic value 3.5 2.5 3.0 4.0 4.5 Aa 4.5 aa aa 0 bb BB Bb -40 20 20 -1 2 2 -20 genotypic value 2 6 6 20 0 60 aa 40 40 Bb bb genotypic4value 80 Aa effect +/- 2 se 100 AA genotypic value genotypic value 60 80 100 BB bb Bb BB a1 d1 a2 d2 iaa iad ida idd NCSU QTL II: Yandell © 2004 11 multiple QTL with epistasis • same overview model Y GQ e, var( e) 2 • sum over multiple QTL in model M = {1,2,12,…} GQ sum { jM } j (Q ) • partition genetic variance in same manner var( GQ ) G2 sum { jM } 2j • could restrict attention to 2-QTL interactions Model NCSU QTL II: Yandell © 2004 12 model selection with epistasis • additive by additive 2-QTL interaction – – • full epistasis adds many model df – – – • 2 QTL in BC: 1 df 2 QTL in F2: 4 df 3 QTL in F2: 20 df (one interaction) (AA, AD, DA, DD) (34 d.f. 2-QTL, 8 d.f. 3-QTL) data-driven interactions (tree-structured) – – – • adds only 1 model degree of freedom (df) per pair but could miss important kinds of interaction contrasts comparing subsets of genotypes double recessive or double dominant vs other genotypes discriminant analysis based contrasts (Gilbert and Le Roy 2003, 2004) some issues in model search – epistasis between significant QTL • • – epistasis with non-significant QTL • • – Model check all possible pairs when QTL included? allow higher order epistasis? whole genome paired with each significant QTL? pairs of non-significant QTL? Yi Xu (2000) Genetics; Yi, Xu, Allison (2003) Genetics; Yi (2004) NCSU QTL II: Yandell © 2004 13 3. comparing QTL models • balance model fit with model "complexity“ – want maximum likelihood – without too complicated a model • information criteria quantifies the balance – Bayes information criteria (BIC) for likelihood – Bayes factors for Bayesian approach Model NCSU QTL II: Yandell © 2004 14 QTL likelihoods and parameters • LOD or likelihood ratio compares model – L(p) = log likelihood for a particular model with p parameters – log(LR) = L(p2) – L(p1) – LOD = log10 (LR) = log(LR)/log(10) • p = number of model degrees of freedom – consider models with m QTL and all 2-QTL epistasis terms – BC: p = 1 + m + m(m-1) – F2: p = 1 + 2m +4m(m-1) • Bayesian information criterion balances complexity – BIC() = –2 log[L(p)] + p log(n) – n = number of individuals in study – = Broman’s BIC adjustment Model NCSU QTL II: Yandell © 2004 15 information criteria: likelihoods • L(p) = likelihood for model with p parameters • common information criteria: – Akaike AIC – Bayes/Schwartz BIC – BIC-delta BIC – general form: IC = –2 log[L(p)] + 2 p = –2 log[L(p)] + p log(n) = –2 log[L(p)] + p log(n) = –2 log[L(p)] + p D(n) • comparison of models – hypothesis testing: designed for one comparison • 2 log[LR(p1,p2)] = L(p2) – L(p1) – model selection: penalize complexity • IC(p1,p2) = 2 log[LR(p1,p2)] + (p2 – p1) D(n) Model NCSU QTL II: Yandell © 2004 16 information criteria vs. model size WinQTL 2.0 SCD data on F2 A=AIC 1=BIC(1) 2=BIC(2) d=BIC() models d d information criteria 300 320 340 • • • • • • • 360 d d d 1 A 1 1 3 1 1 1A A 2 2 2 • 2+5+9+2 • 2:2 AD 2 2 d2 d 2 – 1,2,3,4 QTL – epistasis 2 d 2 A A A 4 5 6 7 model parameters p 1 1 A A 8 9 epistasis Model NCSU QTL II: Yandell © 2004 17 Bayes factors & BIC B12 pr ( model 1 | Y ) / pr ( model 2 | Y ) pr (Y | model 1 ) pr ( model 1 ) / pr ( model 2 ) pr (Y | model 2 ) • what is a Bayes factor? – – ratio of posterior odds to prior odds ratio of model likelihoods • BF is equivalent to LR statistic when – comparing two nested models – simple hypotheses (e.g. 1 vs 2 QTL) • BF is equivalent to Bayes Information Criteria (BIC) – for general comparison of any models – want Bayes factor to be substantially larger than 1 (say 10 or more) 2 log( B12 ) 2 log( LR) ( p2 p1 ) log( n) Model NCSU QTL II: Yandell © 2004 18 QTL Bayes factors • pattern of QTL across genome – more complicated prior – posterior easily sampled prior probability 0.10 0.20 • sampled marginal histogram • shape affected by prior pr(m) e e e p u exponential Poisson uniform p p u u e u u p u u u p e p e e p p e 0.00 – prior pr(m) chosen by user – posterior pr(m|Y,X) 0.30 • m = number of QTL 0 e p e e p u p u p u e u 2 4 6 8 m = number of QTL pr (m|Y , X ) /pr (m) BFm,m1 pr (m 1|Y , X ) /pr (m 1) Model NCSU QTL II: Yandell © 2004 19 10 issues in computing Bayes factors • BF insensitive to shape of prior on m – geometric, Poisson, uniform – precision improves when prior mimics posterior • BF sensitivity to prior variance on effects – prior variance should reflect data variability – resolved by using hyper-priors • automatic algorithm; no need for user tuning • easy to compute Bayes factors from samples – sample posterior using MCMC – posterior pr(m|Y,X) is marginal histogram Model NCSU QTL II: Yandell © 2004 20 multiple QTL priors • phenotype influenced by genotype & environment pr(Y|Q,) ~ N(GQ , 2), or Y = GQ + environment • partition genotype-specific mean into QTL effects GQ = mean + main effects + epistatic interactions GQ = + (Q) = + sumj in M j(Q) • priors on mean and effects (Q) j(Q) • ~ N(0, 02) grand mean ~ N(0, 12) model-independent genotypic effect ~ N(0, 12/|M|) effects down-weighted by size of M determine hyper-parameters via Empirical Bayes h2 G2 0 Y and 1 2 2 1 h Model NCSU QTL II: Yandell © 2004 21 multiple QTL posteriors • phenotype influenced by genotype & environment pr(Y|Q,) ~ N(GQ , 2), or Y = + GQ + environment • relation of posterior mean to LS estimate GQ | Y , m ~ N ( BQ Gˆ Q , BQ CQ 2 ) N (Gˆ Q , CQ 2 ) LS estimate Gˆ Q sum i [sum jM ˆ j (Qi )] sum i wiQY variance 2 V (Gˆ Q ) sum i wiQ 2 CQ 2 shrinkage BQ /( CQ ) 1 Model NCSU QTL II: Yandell © 2004 22 4. simulations and data studies 40 •simulated F2 intercross, 8 QTL •increase to detect all 8 loci effect 1 2 3 4 5 6 7 8 11 50 62 107 152 32 54 195 –3 –5 +2 –3 +3 –4 +1 +2 1 1 3 6 6 8 8 9 30 20 8 9 10 11 12 13 Genetic map ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10 0 Model 7 number of QTL Chromosome QTL chr 0 – n=500, heritability to 97% 10 frequency in % – (Stephens, Fisch 1998) – n=200, heritability = 50% – detected 3 QTL posterior 50 100 NCSU QTL II: Yandell © 2004 150 200 23 loci pattern across genome • notice which chromosomes have persistent loci • best pattern found 42% of the time m 8 9 7 9 9 9 Chromosome 1 2 3 4 2 0 1 0 3 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 2 0 1 0 Model 5 0 0 0 0 0 0 6 2 2 2 2 3 2 7 0 0 0 0 0 0 8 2 2 1 2 2 2 9 1 1 1 1 1 2 10 0 0 0 0 0 0 NCSU QTL II: Yandell © 2004 Count of 8000 3371 751 377 218 218 198 24 B. napus 8-week vernalization whole genome study • 108 plants from double haploid – similar genetics to backcross: follow 1 gamete – parents are Major (biennial) and Stellar (annual) • 300 markers across genome – 19 chromosomes – average 6cM between markers • median 3.8cM, max 34cM – 83% markers genotyped • phenotype is days to flowering – after 8 weeks of vernalization (cooling) – Stellar parent requires vernalization to flower • Ferreira et al. (1994); Kole et al. (2001); Schranz et al. (2002) Model NCSU QTL II: Yandell © 2004 25 Bayesian model assessment Bayes factor ratios posterior / prior 0.3 0.2 50 QTL posterior moderate weak 1 0.0 3 number of QTL Bayes factor ratios 1 Model 3 5 7 9 model index 11 13 NCSU QTL II: Yandell © 2004 4 3 moderate 1 2 weak 2 2 strong 2 2 3 1 e+01 3 posterior / prior 5 e-01 0.0 0.1 2*2 2:2,12 3:2*2,12 2:2,13 2:2,3 2:2,16 2:2,11 3:2*2,3 2:2,15 4:2*2,3,16 3:2*2,13 2:2,14 0.2 model posterior 0.3 2 5 e+02 pattern posterior evidence suggests 4-5 QTL N2(2-3),N3,N16 11 9 7 5 3 1 11 9 7 5 number of QTL 2 1 2 2 col 1: posterior col 2: Bayes factor note error bars on bf strong 5 0.1 row 1: # QTL row 2: pattern 500 QTL posterior 1 3 5 7 9 model index 11 13 26 Bayesian estimates of loci & effects 4 0.00 histogram of loci blue line is density red lines at estimates loci histogram 0.02 0.04 0.06 napus8 summaries with pattern 1,1,2,3 and m Model 0 50 N2 100 150 N3 200 250 100 150 N3 200 250 N16 additive -0.02 0.02 50 -0.08 estimate additive effects (red circles) grey points sampled from posterior blue line is cubic spline dashed line for 2 SD 0 N2 NCSU QTL II: Yandell © 2004 N16 27 0.008 envvar 200 100 density 0.006 50 0 pattern: N2(2),N3,N16 col 1: density col 2: boxplots by m 0.010 Bayesian model diagnostics 0.004 0.006 0.008 0.010 0.012 4 4 5 6 7 8 9 11 12 envvar conditional on number of QTL 0.5 0.6 0.7 4 4 5 6 7 8 9 11 12 LOD 14 16 18 20 0.12 heritability conditional on number of QTL 0.08 density 0.50 0.30 0.4 0.04 5 10 15 marginal LOD, m Model 0.40 heritability 4 3 2 density 1 0 0.3 marginal heritability, m 10 12 but note change with m 0.2 0.00 environmental variance 2 = .008, = .09 heritability h2 = 52% LOD = 16 (highly significant) 0.60 5 marginal envvar, m 20 25 4 NCSU QTL II: Yandell © 2004 4 5 6 7 8 9 11 12 LOD conditional on number of QTL 28 studying diabetes in an F2 • segregating cross of inbred lines – B6.ob x BTBR.ob F1 F2 – selected mice with ob/ob alleles at leptin gene (chr 6) – measured and mapped body weight, insulin, glucose at various ages (Stoehr et al. 2000 Diabetes) – sacrificed at 14 weeks, tissues preserved • gene expression data – Affymetrix microarrays on parental strains, F1 • key tissues: adipose, liver, muscle, -cells • novel discoveries of differential expression (Nadler et al. 2000 PNAS; Lan et al. 2002 in review; Ntambi et al. 2002 PNAS) – RT-PCR on 108 F2 mice liver tissues • 15 genes, selected as important in diabetes pathways • SCD1, PEPCK, ACO, FAS, GPAT, PPARgamma, PPARalpha, G6Pase, PDI,… Model NCSU QTL II: Yandell © 2004 29 effect (add=blue, dom=red) -0.5 0.0 0.5 1.0 0 LOD 2 4 6 8 Multiple Interval Mapping SCD1: multiple QTL plus epistasis! 0 0 Model 50 chr2 100 50 chr2 100 150 200 250 chr9 300 200 250 chr9 300 chr5 150 chr5 NCSU QTL II: Yandell © 2004 30 Bayesian model assessment: number of QTL for SCD1 Bayes factor ratios 0.05 posterior / prior 5 10 50 QTL posterior 0.10 0.15 0.20 500 0.25 QTL posterior moderate 1 0.00 weak 1 2 3 4 5 6 7 8 9 11 number of QTL Model strong 13 1 2 3 4 5 6 7 8 9 number of QTL NCSU QTL II: Yandell © 2004 11 13 31 15 10 0.00 5 LOD 0.10 density 20 Bayesian LOD and h2 for SCD1 0 5 10 15 1 1 2 3 4 5 6 7 8 9 10 12 14 LOD conditional on number of QTL 0.5 0.1 0.3 heritability 3 2 1 0 density 4 marginal LOD, m 20 0.0 0.1 0.2 0.3 0.4 0.5 marginal heritability, m Model 0.6 1 0.7 1 2 3 4 5 6 7 8 9 10 12 14 heritability conditional on number of QTL NCSU QTL II: Yandell © 2004 32 Model 1 3 5 7 9 11 model index 13 15 1 NCSU QTL II: Yandell © 2004 3 5 7 9 model index 11 2 3 5 6 moderate 6 6 5 5 4 4 6 5 3 4 3:1,2,3 0.15 posterior / prior 0.2 0.4 0.6 0.8 4:2*1,2,3 4:1,2,2*3 4:1,2*2,3 5:3*1,2,3 5:2*1,2,2*3 5:2*1,2*2,3 6:3*1,2,2*3 6:3*1,2*2,3 5:1,2*2,2*3 6:4*1,2,3 6:2*1,2*2,2*3 2:1,3 3:2*1,2 2:1,2 model posterior 0.05 0.10 pattern posterior 2 0.00 Bayesian model assessment: chromosome QTL pattern for SCD1 Bayes factor ratios weak 13 15 33 trans-acting QTL for SCD1 (no epistasis yet: see Yi, Xu, Allison 2003) dominance? Model NCSU QTL II: Yandell © 2004 34 2-D scan: assumes only 2 QTL! epistasis LOD peaks Model joint LOD peaks NCSU QTL II: Yandell © 2004 35 sub-peaks can be easily overlooked! Model NCSU QTL II: Yandell © 2004 36 1-D and 2-D marginals pr(QTL at | Y,X, m) unlinked loci Model linked loci NCSU QTL II: Yandell © 2004 37 false detection rates and thresholds • multiple comparisons: test QTL across genome – size = pr( LOD() > threshold | no QTL at ) – threshold guards against a single false detection • very conservative on genome-wide basis – difficult to extend to multiple QTL • positive false discovery rate (Storey 2001) – pFDR = pr( no QTL at | LOD() > threshold ) – Bayesian posterior HPD region based on threshold • ={ | LOD() > threshold } { | pr( | Y,X,m ) large } – extends naturally to multiple QTL Model NCSU QTL II: Yandell © 2004 38 pFDR and QTL posterior • positive false detection rate – pFDR = pr( no QTL at | Y,X, in ) – pFDR = pr(H=0)*size pr(m=0)*size+pr(m>0)*power – power = posterior = pr(QTL in | Y,X, m>0 ) – size = (length of ) / (length of genome) • extends to other model comparisons – m = 1 vs. m = 2 or more QTL – pattern = ch1,ch2,ch3 vs. pattern > 2*ch1,ch2,ch3 Model NCSU QTL II: Yandell © 2004 39 0.3 1.0 0.0 Model 0.2 0.4 0.6 0.8 relative size of HPD region 1.0 0.2 Storey pFDR(-) 0.1 0 0.0 0.0 0.2 pr( H=0 | p>size ) 0.4 0.6 0.8 prior probability fraction of posterior found in tails BH pFDR(-) and size(.) 0.2 0.4 0.6 0.8 1.0 pFDR for SCD1 analysis 0.0 NCSU QTL II: Yandell © 2004 0.2 0.4 0.6 0.8 pr( locus in HPD | m>0 ) 1.0 40