Population Stratification
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
Mar. 24, 2011
https://dsgweb.wustl.edu/qunyuan/presentations/PopStrat2011.pptx

What is Population Stratification (PS)?
In the narrow sense, PS is the presence of a systematic difference in allele frequencies between subpopulations in a population, possibly due to different ancestry or origins, especially in the context of genetic association studies. Population stratification is also referred to as population structure.
In the broad sense, PS can be regarded as the presence of differences in relatedness between individuals in a population, due to different subpopulations, family/pedigree structure and/or cryptic relatedness.

PS & False Positives
False positives (inflation): an association can arise from the underlying structure of the population, even when there is no disease-locus association.

An Example of PS-caused False Positive
(risk = cases / controls within each row)

Sub-population 1
         case  control  total   risk
A          72        8     80    9/1
a          18        2     20    9/1
total      90       10    100    9/1

Sub-population 2
         case  control  total   risk
A           3       27     30    1/9
a           7       63     70    1/9
total      10       90    100    1/9

Mixed population
         case  control  total   risk
A          75       35    110   2.14
a          25       65     90   0.38
total     100      100    200   1.00

• No disease-locus association within either sub-population.
• Risk differs between sub-populations.
• Allele frequency differs between sub-populations.
• False disease-locus association in the mixed population.
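The arithmetic behind this example can be checked with a short script. This is a minimal sketch, not part of the original slides: the counts are copied from the tables above, while the function names (`odds_ratio`, `pool`) and the use of the allelic odds ratio as the association measure are assumptions made for illustration.

```python
# Allele counts from the example: for each allele, (cases, controls).
strata = {
    "sub-population 1": {"A": (72, 8), "a": (18, 2)},
    "sub-population 2": {"A": (3, 27), "a": (7, 63)},
}


def odds_ratio(table):
    """Allelic odds ratio of A versus a in a 2x2 table (hypothetical helper)."""
    (a_case, a_ctrl), (b_case, b_ctrl) = table["A"], table["a"]
    return (a_case * b_ctrl) / (a_ctrl * b_case)


def pool(tables):
    """Add the strata cell by cell, ignoring sub-population membership."""
    return {
        allele: (
            sum(t[allele][0] for t in tables),
            sum(t[allele][1] for t in tables),
        )
        for allele in ("A", "a")
    }


mixed = pool(list(strata.values()))
stratum_ors = {name: odds_ratio(t) for name, t in strata.items()}
mixed_or = odds_ratio(mixed)
# Case/control ratio per allele in the mixed table (the "risk" column).
mixed_ratios = {allele: case / ctrl for allele, (case, ctrl) in mixed.items()}

for name, value in stratum_ors.items():
    print(f"{name}: odds ratio = {value:.2f}")
print(f"mixed population: odds ratio = {mixed_or:.2f}")
for allele, ratio in mixed_ratios.items():
    print(f"mixed population, allele {allele}: case/control = {ratio:.2f}")
```

Within each sub-population the odds ratio is exactly 1, while pooling the two tables gives an odds ratio of about 5.57 and reproduces the 2.14 and 0.38 ratios in the mixed table.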
(Any allele with a higher frequency in the higher-risk sub-population appears to be a risk allele.)

Mantel-Haenszel Test for Stratification
• Adjusted RR, with an example
• Standard error
• Chi-square test
[Equations (1) to (3): not recovered from the slide]

Linear Model
• Marker data
• Population structure variable Q, also called: genetic background variable, membership variable, subgroup/sub-population variable, ancestry/admixture proportion variable
• Usually Q is unknown and needs to be estimated.

Estimating Q by Eigen-analysis

X (genotypes, individuals by SNPs):
        snp1  snp2  snp3  snp4  snp5
idv1       0     1     0     0     2
idv2       2     2     0     1     0
idv3       1     2     1     0     0

$$X = U S V^{T}$$

$$
U = \begin{pmatrix} -0.28 & -0.95 & 0.11 \\ -0.75 & 0.29 & 0.59 \\ -0.60 & 0.08 & -0.80 \end{pmatrix}, \quad
S = \begin{pmatrix} 3.81 & 0.00 & 0.00 \\ 0.00 & 2.05 & 0.00 \\ 0.00 & 0.00 & 1.13 \end{pmatrix}, \quad
V^{T} = \begin{pmatrix} -0.55 & -0.78 & -0.16 & -0.20 & -0.15 \\ 0.33 & -0.10 & 0.04 & 0.14 & -0.93 \\ 0.34 & -0.27 & -0.71 & 0.52 & 0.20 \end{pmatrix}
$$

$$
S^{2} = \begin{pmatrix} 14.51 & 0.00 & 0.00 \\ 0.00 & 4.21 & 0.00 \\ 0.00 & 0.00 & 1.28 \end{pmatrix}
$$

• S holds the singular values; S^2 holds the eigenvalues.
• The columns of U are Q1, Q2, Q3 (rows idv1, idv2, idv3): the eigenvectors of COV(X).
References: Patterson et al. 2006, Price et al. 2006 (software EIGENSTRAT); or SAS Proc PRINCOMP; R svd() and eigen()

Eigen-analysis of HapMap Populations
[Figure: HapMap populations plotted on axes Q1 and Q2]

Estimating Q by MLE (for admixed population)
G: observed genotypes of admixed [and parental] populations
Q: allele frequencies in parental populations
P: individual membership, to be estimated
Goal: obtain the P that maximizes Pr(G|P,Q)
1. Assign initial values for Q (randomly, or estimated from parental population genotype data) and P (randomly).
2. Compute P(i) by solving $\partial L(G \mid Q, P) / \partial P = 0$
3. Compute Q(i) by solving $\partial L(G \mid Q, P) / \partial Q = 0$
4. Iterate Steps 2 and 3 until convergence.
Tang et al. Genetic Epidemiology, 2005(28): 289–301

Estimating Q by MCMC (for admixed population)
Observed G: genotypes of admixed [and parental] populations
Unknown Z: admixed individuals' membership from ancestral populations
Problem: how to estimate Z?
Bayesian and Markov Chain Monte Carlo (MCMC) methods
1. Assume the number of ancestral populations, K (see next slide).
2. Define the prior distribution Pr(Z) under K.
3. Use MCMC to sample from the posterior distribution Pr(Z|G) ∝ Pr(Z)∙Pr(G|Z).
4.
Average over a large number of MCMC samples to obtain an estimate of Z.
Falush et al. Genetics, 2003(164):1567–1587
Software: STRUCTURE

Infer Population Number (K)

Linear Model (an example including m Q-variables)
$$y = a + bx + b_1 Q_1 + b_2 Q_2 + \dots + b_m Q_m + e$$
$$y = a + bx + \sum_{i=1}^{m} b_i Q_i + e$$
Software: SAS Proc REG, Proc GENMOD; R lm(), glm()
(Proc GENMOD and glm() are generalized and can fit a binary/categorical y.)

Unified Mixed Model (more general)
Terms labelled in the model:
• Inferred population membership
• SNP(s)
• Covariate(s)
• ID matrix
Modeling the resemblance among individuals: V = ZGZ' + R

Multi-Variate Normal Distribution (MVN) & Likelihood of Mixed Model
Based on MVN, the likelihood of the trait (y) in matrix form is:
$$L = (2\pi)^{-n/2} \, |V|^{-1/2} \exp\left[ -\tfrac{1}{2} (y - \mu)' V^{-1} (y - \mu) \right]$$
• n: number of individuals (in a pedigree)
• V: n×n variance-covariance matrix
• y: phenotype vector
• μ: mean phenotype vector
$$V = ZGZ' + R$$
$$V = 2\Phi\sigma_a^2 + I\sigma_e^2$$
• Φ: n×n kinship (IBD) matrix

Kinship
Inbreeding coefficient: the inbreeding coefficient of an individual is the probability that the pair of alleles carried by the gametes that produced it are Identical By Descent (IBD).
Identical By Descent (IBD): two alleles come from the same ancestor.
Kinship/coancestry: the inbreeding coefficient of an individual is equal to the coancestry between its parents. For example, if parents X and Y have a child Z, then the inbreeding coefficient of Z = the coancestry between X and Y.
Software: SAS (PROC INBREED), MERLIN, SPAGeDi, R (kinship, emma), etc.
(Pedigree and/or marker data are needed.)

Kinship Matrix (expected probability of allele sharing among relatives)

Resources for Mixed Model with Kinship Matrix

Software     Kinship          Mixed model      Data
SAS          Proc INBREED     Proc MIXED       Quantitative trait; pedigree data
SAS          Proc INBREED     Proc GLIMMIX     Quantitative/qualitative trait; pedigree data
R: kinship   makekinship()    lmekin()         Quantitative trait; pedigree data
R: emma      emma.kinship()   emma.REML.t()    Quantitative trait; kinship calculated from marker data
EMMAX        emmax-kin        emmax            Quantitative trait; kinship calculated from marker data

Diagnosis of Inflation of False Positives
• Inflation: more false positives than expected under the null.
• In GWAS, inflation is usually due to PS.
• It can also be caused by inappropriate statistical methods, even with no PS.
• Inflation may, but does not necessarily, indicate PS.

Theoretical Basis of Diagnosis
Under the null, p-values follow a uniform distribution on [0,1].
[Figure: histograms of p-values and Q-Q plots of -log10(p), with and without inflation]

Inflation Rate (IR)
Devlin et al. 2004
For binary trait
For continuous trait
Amin, Duijn, Aulchenko, 2007

Genomic Control (by IR)
For binary trait: $Y_i^2 = \chi_i^2$
For continuous trait: $Y_i^2 = (t_i)^2$
Or based on p-value: $Y_i^2 = \chi^2_{(1 - p_i,\ df=1)}$
Adjusted statistic: $\tilde{Y}_i^2 = Y_i^2 / \hat{\lambda} \sim \chi^2_{df=1}$, where $\hat{\lambda}$ is the estimated inflation rate
Adjusted p-value: $\tilde{p}_i = \mathrm{Prob}(\chi^2_{df=1} > \tilde{Y}_i^2)$

Practice
• Download and unzip the data from dsgweb.wustl.edu/qunyuan/data/popstra2011hw.zip
• Ignore pedigree.csv; test each SNP in snp.csv for association with the trait in trait.csv.
• Investigate the p-values to see if there is any inflation.
• Try to explain why.
• List some possible methods to reduce or control the inflation.
• Choose one method and apply it to the data.
• Does it work?
• Try to explain why.
• Clearly document each step of your analysis.
There is no standard answer; feel free to try anything you like!
Report back to linusan@wustl.edu and qunyuan@wustl.edu in one week.
Thanks!
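The p-value based genomic control adjustment described in the slides can be sketched as follows. This is an illustration under stated assumptions, not the course's reference solution: the inflation rate is estimated here as the median of the observed 1-df chi-square statistics divided by the median of the null chi-square distribution (a commonly used estimator; the slide's own IR formulas were not recoverable), and the function names are hypothetical. Only the Python standard library is used, relying on the fact that a 1-df chi-square variable is the square of a standard normal.

```python
import math
from statistics import NormalDist, median

_STD_NORMAL = NormalDist()


def chi2_1df_from_p(p):
    """Y_i^2: the 1-df chi-square value whose upper-tail probability is p.

    Requires 0 < p <= 1. Uses chi2_1 = z^2 with z the two-sided normal quantile.
    """
    z = -_STD_NORMAL.inv_cdf(p / 2.0)
    return z * z


def p_from_chi2_1df(y2):
    """Upper-tail probability Prob(chi2 with df=1 > y2)."""
    return math.erfc(math.sqrt(y2 / 2.0))


def genomic_control(p_values):
    """Return (estimated inflation rate, adjusted p-values).

    Assumption: inflation rate = median(Y_i^2) / median of the null chi2_1.
    """
    y2 = [chi2_1df_from_p(p) for p in p_values]
    null_median = chi2_1df_from_p(0.5)  # median of chi2 with df=1, about 0.455
    lam = median(y2) / null_median
    adjusted = [p_from_chi2_1df(v / lam) for v in y2]
    return lam, adjusted


# Toy input: p-values that are systematically too small (made-up numbers).
raw = [0.001, 0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49, 0.64, 0.81]
lam, adjusted = genomic_control(raw)
print(f"estimated inflation rate: {lam:.3f}")
for p, q in zip(raw, adjusted):
    print(f"p = {p:.3f} -> adjusted p = {q:.3f}")
```

Dividing every statistic by one constant rescales the whole distribution, so the ranking of SNPs is unchanged; only the calibration of the p-values moves. Whether that is an adequate correction for the practice data is part of what the exercise asks you to judge.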