Population Structure, Association Studies, and QTLs
Stat 115/215

Structure Algorithm
• One of the most widely used programs in population genetics (original paper cited >9,000 times since 2000)
  – Pritchard, Stephens and Donnelly (2000). Inference of Population Structure Using Multilocus Genotype Data. Genetics 155:945-959.
• Very flexible model can determine:
  – The most likely number of uniform groups (populations, K)
  – The genomic composition of each individual (admixture coefficients)
  – Possible population of origin

A simple model of population structure
• Individuals in our sample represent a mixture of K (unknown) ancestral populations.
• Each population is characterized by (unknown) allele frequencies at each locus.
• Within populations, markers are in Hardy-Weinberg and linkage equilibrium.

The model
• Let $A_1, A_2, \dots, A_K$ represent the (unknown) allele frequencies in each subpopulation
• Let $Z_1, Z_2, \dots, Z_m$ represent the (unknown) subpopulation of origin of the sampled individuals (they are indicators)
• Assuming HWE and LE within subpopulations, the likelihood of an individual's genotypes at various loci in subpopulation k is given by the product of the relevant allele frequencies:

  $$P(G_i \mid Z_i = k, A) = \prod_{l} P_l$$

  where $P_l$ is the probability of the genotype at locus l, computed from the allele frequencies $A_k$ of subpopulation k.

More details
• The probability of observing a genotype at locus l by chance in a population is a function of the allele frequencies:
  – $P_l = p_i^2$ for homozygous loci
  – $P_l = 2 p_i p_j$ for heterozygous loci
• Assuming no linkage among the markers, we have the product form as on the previous slide.

Heuristics
• If we knew the population allele frequencies in advance, then it would be easy to assign individuals (using Bayes' rule):

  $$\Pr(Z_i = k \mid G_i, A_1, \dots, A_K) = \frac{P(G_i \mid Z_i = k, A)\, P(Z_i = k \mid A)}{\sum_{j} P(G_i \mid Z_i = j, A)\, P(Z_i = j \mid A)}$$

• If we knew the individual assignments, it would be easy to estimate frequencies.
• In practice, we don't know either of these, but we have the Gibbs sampler!
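
The Bayes-rule assignment step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Structure program's code: the function names, the allele labels, and the two-population frequencies are all hypothetical, the prior over populations is taken to be uniform unless supplied, and every allele frequency is assumed to be strictly positive.

```python
import math

def genotype_log_prob(genotype, freqs):
    """Log-probability of a multilocus genotype under HWE and LE.

    genotype: list of (allele, allele) pairs, one per locus.
    freqs: list of dicts, one per locus, mapping allele -> frequency
           (assumed > 0 so that the logarithm is defined).
    """
    total = 0.0
    for (a, b), p in zip(genotype, freqs):
        # HWE: p_i^2 for homozygous loci, 2 p_i p_j for heterozygous loci
        prob = p[a] ** 2 if a == b else 2.0 * p[a] * p[b]
        # LE: loci are independent, so log-probabilities add up
        total += math.log(prob)
    return total

def assignment_posterior(genotype, pop_freqs, prior=None):
    """Posterior Pr(Z_i = k | G_i, A) for each population k via Bayes' rule."""
    k = len(pop_freqs)
    if prior is None:
        prior = [1.0 / k] * k  # assumption: uniform prior over populations
    log_post = [math.log(prior[j]) + genotype_log_prob(genotype, pop_freqs[j])
                for j in range(k)]
    # Subtract the maximum before exponentiating for numerical stability
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]

# Hypothetical allele frequencies: two populations, two loci
pop_freqs = [
    [{"A": 0.9, "a": 0.1}, {"B": 0.8, "b": 0.2}],  # population 1
    [{"A": 0.2, "a": 0.8}, {"B": 0.3, "b": 0.7}],  # population 2
]
genotype = [("A", "A"), ("B", "b")]  # homozygous at locus 1, heterozygous at locus 2
posterior = assignment_posterior(genotype, pop_freqs)
print(posterior)
```

In a Gibbs sampler the individual would be assigned by sampling from this posterior rather than by picking the most probable population, and the allele frequencies would then be re-estimated from the new assignments.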

MCMC algorithm (for fixed K)
• Start with random assignment of individuals to populations
  – Step 1: Gene frequencies in each population are estimated based on the individuals that are assigned to it.
  – Step 2: Individuals are assigned to populations based on gene frequencies in each population.
• And this is repeated...
• Estimation of K is performed separately

Admixed individuals are mosaics of ancestral populations [figure]

Two basic models [figure]

Inferred from human populations [figure]

More details [figures]

Alternative approach
• Structure is very computationally intensive
• Often there is no clear best-supported value of K
• An alternative is to use traditional multivariate statistics to find uniform groups
• Principal Components Analysis is the most commonly used algorithm
• EIGENSOFT (PCA; Patterson et al., 2006, PLoS Genetics 2:e190)

Principal Component Analysis
• Efficient way to summarize multivariate data like genotypes
• Each axis passes through the direction of maximum variation in the data and explains a component of the variation

Human population assignment with SNPs
• Assayed 500,000 SNP genotypes for 3,192 Europeans
• Used Principal Components Analysis to ordinate samples in space
• High correspondence between sample ordination and geographic origin of samples
• Individuals assigned to populations of origin with high accuracy

Genetic Association Tests
• Review of typical approach: chi-square test
  – 2x3 table (or 2x2 table)

              AA     Aa     aa     Total
  Cases       n11    n12    n13    n1.
  Controls    n21    n22    n23    n2.
  Total       n.1    n.2    n.3    n..

              A      a      Total
  Cases       n11    n12    n1.
  Controls    n21    n22    n2.
  Total       n.1    n.2    n..

  – Alternatively, we can do a logistic regression:

  $$\log \frac{P(Y = 1)}{P(Y = 0)} = a + bX$$

Genetic Models and Underlying Hypotheses: Genotypic Model

  Genotype           AA      Aa      aa
  Genotypic value    μAA     μAa     μaa

Genotypic value is the expected phenotypic value of a particular genotype.
Hypothesis: all 3 different genotypes have different effects (AA vs. Aa vs. aa)

Genetic Models and Underlying Hypotheses: Dominant Model

  Genotype           AA      Aa      aa
  Genotypic value    μA-     μA-     μaa

Hypothesis: the genetic effects of AA and Aa are the same, assuming A is the minor allele (AA and Aa vs. aa)

Genetic Models and Underlying Hypotheses: Recessive Model

  Genotype           AA      Aa      aa
  Genotypic value    μAA     μa-     μa-

Hypothesis: the genetic effects of Aa and aa are the same, where A is the minor allele (AA vs. Aa and aa)

Genetic Models and Underlying Hypotheses: Allelic Model

  Genotype           AA      Aa         aa
  Genotypic value    2μA     μA + μa    2μa

Hypothesis: the genetic effects of allele A and allele a are different (A vs. a)

Pearson's Chi-squared Test: Genotypic Model
Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

          cases    controls
  AA      nAA      mAA
  Aa      nAa      mAa
  aa      naa      maa
  df = 2

Pearson's Chi-squared Test: Dominant Model
Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

             cases        controls
  AA + Aa    nAA + nAa    mAA + mAa
  aa         naa          maa
  df = 1

Pearson's Chi-squared Test: Recessive Model
Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

             cases        controls
  AA         nAA          mAA
  Aa + aa    nAa + naa    mAa + maa
  df = 1

Pearson's Chi-squared Test: Allelic Model
Null hypothesis: independence, $H_0: \pi_{ij} = \pi_{i\cdot}\,\pi_{\cdot j}$

          cases         controls
  A       2nAA + nAa    2mAA + mAa
  a       nAa + 2naa    mAa + 2maa
  df = 1

Test Statistic
Chi-squared test statistic:

  $$\chi^2 = \sum_{\text{all cells}} \frac{(O - E)^2}{E}$$

• O is the observed cell count
• E is the expected cell count under the null hypothesis of independence:

  $$E = \frac{\text{row total} \times \text{column total}}{N}$$

Other Options
Fisher's Exact Test: When the sample size is small, the asymptotic approximation of the null distribution is no longer valid. By performing Fisher's exact test, the exact significance of the deviation from the null hypothesis can be calculated. For a 2 by 2 table with cell counts a, b (first row) and c, d (second row) and n = a + b + c + d, the probability of the observed table with the margins held fixed is:

  $$p = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{a!\,b!\,c!\,d!\,n!}$$

The exact p-value sums this probability over all tables at least as extreme as the one observed.

Association Tool
PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/
Case-control, TDT, quantitative traits.

Mapping Quantitative Traits
• Examples: weight, height, blood pressure, BMI, mRNA expression of a gene, etc.
• Example: F2 intercross mice

Quantitative traits (phenotypes) [figure]
133 females from our earlier (NOD × B6) × (NOD × B6) cross. Trait 4 is the log count of a particular white blood cell type.

Another representation of a trait distribution [figure]
Note the equivalent of dominance in our trait distributions.

A second example [figure]
Note the approximate additivity in our trait distributions here.

Trait distributions: a classical view
In general we look for a difference in the phenotype distributions of the parental strains before we consider it worthwhile to seek genes associated with a trait. But even if there is little difference, there may be many such genes. Our trait 4 is a case like this.

Data and goals
Data
• Phenotypes: $y_i$ = trait value for mouse i
• Genotypes: $x_{ij}$ = 1/0 if mouse i is A/H at marker j (backcross); two dummy variables are needed for an intercross
• Genetic map: locations of markers
Goals
• Identify the (or at least one) genomic region, called a quantitative trait locus (QTL), that contributes to variation in the trait
• Form confidence intervals for the QTL location
• Estimate QTL effects

Models: Genotype → Phenotype
• Let y = phenotype, g = whole-genome genotype
• Imagine a small number of QTLs with genotypes $g_1, \dots, g_p$ ($2^p$ or $3^p$ distinct genotypes for BC, IC resp.)
• We assume $E(y \mid g) = \mu(g_1, \dots, g_p)$, $\mathrm{var}(y \mid g) = \sigma^2(g_1, \dots, g_p)$

Models: Genotype → Phenotype, ctd
• Homoscedasticity (constant variance): $\sigma^2(g_1, \dots, g_p) = \sigma^2$ (constant)
• Normality of residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$
• Additivity: $\mu(g_1, \dots, g_p) = \mu + \sum_j \Delta_j g_j$ ($g_j$ = 0/1 for BC)
• Epistasis: any deviations from additivity.

Additivity, or non-additivity (BC) [figure]

Additivity or non-additivity: F2 [figure]

The simplest method: ANOVA
• Split mice into groups according to genotype at a marker
• Do a t-test/ANOVA
• Repeat for each marker
• Adjust for multiplicity
LOD score = log10 likelihood ratio, comparing the single-QTL model to the "no QTL anywhere" model.
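
The marker-by-marker ANOVA scan can be sketched as follows. This is a minimal illustration, not the code of any particular QTL package: the function name and the toy backcross data are hypothetical. It assumes normal residuals with a common variance, in which case the log10 likelihood ratio reduces to (n/2) log10(RSS0/RSS1), where RSS0 is the residual sum of squares around the overall mean and RSS1 is the residual sum of squares around the genotype-group means. It does not address the multiplicity adjustment.

```python
import math

def marker_lod(phenotypes, genotypes):
    """LOD score at one marker from a one-way ANOVA with normal errors.

    phenotypes: list of trait values, one per mouse.
    genotypes: list of genotype labels at this marker, one per mouse.
    Assumes the within-group residual sum of squares is non-zero.
    """
    n = len(phenotypes)
    # Null model ("no QTL"): one common mean for all mice
    grand_mean = sum(phenotypes) / n
    rss0 = sum((y - grand_mean) ** 2 for y in phenotypes)
    # Single-QTL model: a separate mean for each genotype group
    groups = {}
    for y, g in zip(phenotypes, genotypes):
        groups.setdefault(g, []).append(y)
    rss1 = 0.0
    for ys in groups.values():
        group_mean = sum(ys) / len(ys)
        rss1 += sum((y - group_mean) ** 2 for y in ys)
    return (n / 2.0) * math.log10(rss0 / rss1)

# Hypothetical backcross data: six mice, two markers, genotypes A or H
phenotypes = [1.0, 1.2, 0.8, 2.0, 2.2, 1.8]
markers = [
    ["A", "A", "A", "H", "H", "H"],  # genotype groups separate the trait well
    ["A", "H", "A", "H", "A", "H"],  # genotype groups barely differ in mean
]
# Repeat the test for each marker
lods = [marker_lod(phenotypes, m) for m in markers]
print(lods)
```

The first marker gets a much larger LOD score than the second, which is the pattern a genome scan looks for when plotting LOD against marker position.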

Interval mapping (IM)
• Lander & Botstein (1989)
• Takes account of missing genotype data (uses the HMM)
• Interpolates between markers
• Maximum likelihood under a mixture model

Interval mapping, cont
• Imagine that there is a single QTL, at position z between two (flanking) markers
• Let $q_i$ = genotype of mouse i at the QTL, and assume $y_i \mid q_i \sim \mathrm{Normal}(\mu_{q_i}, \sigma^2)$
• We won't know $q_i$, but we can calculate $p_{ig} = \Pr(q_i = g \mid \text{marker data})$
• Then $y_i$, given the marker data, follows a mixture of normal distributions, with known mixing proportions (the $p_{ig}$).
• Use an EM algorithm to get MLEs of $\theta = (\mu_A, \mu_H, \mu_B, \sigma)$.
• Measure the evidence for a QTL via the LOD score, which is the log10 likelihood ratio comparing the hypothesis of a single QTL at position z to the hypothesis of no QTL anywhere.

Epistasis, interactions, etc
• How to find interactions?
  – Stepwise regression
  – BEAM (Zhang and Liu 2007)

Naïve Bayes model
[diagram: graphical model with nodes Y, X1, X2, X3, ..., Xm]

Augmented Naïve Bayes
[diagram: graphical model with node Y and the X variables partitioned into Group 0, Group 1, Group 21 and Group 22]

Variable Selection with Interaction
Let $Y \in \mathbb{R}$ be a univariate response variable and $X \in \mathbb{R}^p$ be a vector of p continuous predictor variables:

  $$Y = X_1 \times X_2 + \epsilon, \quad \epsilon \sim N(0, \sigma^2), \quad X \sim \mathrm{MVN}(0, I_p)$$

Suppose p = 1000. How to find $X_1$ and $X_2$?
• One-step forward selection: ~500,000 interaction terms
• Is there any marginal relationship between Y and $X_1$?

[figures: plots with axes y, x1 and x2, annotated with $\hat\sigma^2_{(1)} = 2.24$, $\hat\sigma^2_{(2)} = 0.97$, $\hat\sigma^2_{(3)} = 0.42$]

Acknowledgment
• Terry Speed (some of the slides)
• Karl Broman (U of Wisconsin)
• Steven P. DiFazio (West Virginia U)