Admixture Mapping
Qunyuan Zhang
Division of Statistical Genomics
GEMS Course M21-621 Computational Statistical Genetics
March 25, 2010

Three Mapping Strategies
- Linkage Analysis (linkage): genotype and phenotype data from a family (or families)
- Association Scan (LD): genotype and phenotype data from population(s) or families
- Admixture Mapping (LD): genotype data from admixed and ancestral populations, phenotype data from admixed populations
  (1) Ancestry-phenotype association mapping
  (2) Ancestry information for population structure control

Genetic Admixture
[Diagram: Ancestral Population 1 (Africans) and Ancestral Population 2 (Caucasians) mix to form the Admixed Population (African Americans). Admixture information from ancestry analysis feeds into admixture mapping.]

Rationale of Admixture Mapping
Suppose a disease has genetic factors and the disease allele is more frequent in pop 2 than in pop 1. After pop 1 and pop 2 admix, diseased individuals in the admixed generations will carry disease alleles that have more ancestry from pop 2 than from pop 1.
If a marker is linked with a disease gene, then because of linkage disequilibrium the diseased individuals will also carry marker copies that have more ancestry from pop 2 than from pop 1.
Conversely, if we find a marker/locus whose ancestry from pop 2 differs significantly between the diseased and non-diseased groups, we consider this marker/locus to be linked with (or part of) a disease gene.

Illustration of Admixture
[Figure]

Advantages of Admixture Mapping
- An admixed population has more genetic variation and polymorphism than relatively pure ancestral populations.
- Admixture produces new LD in the admixed population.
- Compared with the ancestral populations, the admixed population's shorter genetic history preserves more LD (a long genetic history breaks LD down), so in an admixed population LD can be detected even for relatively loose linkage.
- Ancestry information can be used to control population stratification caused by genetic admixture.
- According to simulation, admixture mapping has higher power than regular methods and needs a smaller sample size.
- Flexible design: case-control or case-only, qualitative or quantitative traits, no need for pedigree information.

Ancestry
Proportion of genetic material descending from each founding population
- Population level: population admixture proportion
- Individual level: individual admixture proportion
- Individual-locus level: locus-specific ancestry

Two Ways of Using Ancestral Info.
- Individual Ancestry (IA) can be used as a genetic background covariate for population structure control:
  Phenotype = a + b * Genotype + c * IA + Error
- Locus-specific Ancestry (LSA) can be used directly to detect association (admixture mapping):
  Phenotype = a + b * LSA + Error

Individual Ancestry (IA) Estimation using MLE
G: observed genotypes of admixed and ancestral populations
Q: allelic frequencies in ancestral populations
P: individual ancestry to be estimated
Goal: obtain the P that maximizes Pr(G | P, Q)
1. Assign starting values for Q (randomly, or estimated from ancestral population genotype data) and P (randomly).
2. Compute P(i) by solving $\partial L(G \mid Q, P) / \partial P = 0$.
3. Compute Q(i) by solving $\partial L(G \mid Q, P) / \partial Q = 0$.
4. Iterate Steps 2 and 3 until convergence.
Here $L(G \mid Q, P)$ denotes the likelihood Pr(G | P, Q), and P(i), Q(i) are the values at iteration i.
Tang et al. Genetic Epidemiology, 2005(28): 289–301

Locus-specific Ancestry Estimation using MCMC
Observed G: genotypes of admixed and ancestral populations
Unknown Z: admixed individuals' locus-specific ancestries from ancestral populations
Problem: how to estimate Z?
Maximum Likelihood Estimate (MLE): how to obtain a Z that maximizes Pr(G | Z)? Z spans a huge parameter space, which makes a likelihood search difficult.
Bayesian and Markov Chain Monte Carlo (MCMC) methods:
1. Assume the number of ancestral populations, K.
2. Define the prior distribution Pr(Z) under K.
3. Use MCMC to sample from the posterior distribution $\Pr(Z \mid G) \propto \Pr(Z) \cdot \Pr(G \mid Z)$.
4. Average over a large number of MCMC samples to obtain an estimate of Z.
Falush et al. Genetics, 2003(164):1567–1587

Software
- STRUCTURE: Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567–1587.
- ADMIXMAP: Hoggart CJ, Parra EJ, Shriver MD, Bonilla C, Kittles RA, Clayton DG, McKeigue PM (2003) Control of confounding of genetic associations in stratified populations. Am J Hum Genet 72:1492–1504.
- ANCESTRYMAP: Patterson N, Hattangadi N, Lane B, Lohmueller KE, Hafler DA, Oksenberg JR, Hauser SL, Smith MW, O’Brien SJ, Altshuler D, Daly MJ, Reich D (2004) Methods for high-density admixture mapping of disease genes. Am J Hum Genet 74:979–1000.

References
- D.C. Rife. Populations of hybrid origin as source material for the detection of linkage. Am. J. Hum. Genet. 1954, 6:26-33.
- R. Chakraborty et al. Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. 1988, 85:9119-9123.
- N. Risch. Mapping genes for complex disease using association studies with recently admixed populations. Am. J. Hum. Genet. Suppl. 1992, 51:13.
- …
- P.M. McKeigue. Prospects for admixture mapping of complex traits. Am. J. Hum. Genet. 2005, 76:1-7.
- X. Zhu et al. Admixture mapping for hypertension loci with genome-scan markers. Nature Genetics. 2005, 37(2):177-181.
- Q. Zhang et al. Genome-wide admixture mapping for coronary artery calcification in African Americans: the NHLBI Family Heart Study. Genet Epidemiol. 2008 Apr; 32(3):264-72.
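
The individual-ancestry estimation outlined earlier (find the P that maximizes Pr(G | P, Q)) can be illustrated with a small sketch. This is not the implementation of Tang et al.; it is a simplified EM update for a single individual and two ancestral populations, and it assumes biallelic markers and ancestral allele frequencies Q that are known and held fixed. The function names and the toy data are hypothetical.

```python
# Hypothetical sketch: EM estimate of one individual's ancestry proportion,
# with the ancestral allele frequencies Q held fixed (an assumption of this example).

def allele_prob(allele, freq):
    """Probability of an allele copy (1 or 0) given the frequency of allele 1."""
    return freq if allele == 1 else 1.0 - freq


def estimate_individual_ancestry(alleles, q1, q2, n_iter=500, tol=1e-12):
    """Return the estimated fraction p of allele copies inherited from population 1.

    alleles: list of (locus_index, allele) pairs, one entry per allele copy
    q1, q2:  per-locus frequency of allele 1 in ancestral populations 1 and 2,
             assumed to lie strictly between 0 and 1
    """
    p = 0.5  # starting value
    for _ in range(n_iter):
        # E-step: posterior probability that each copy came from population 1
        posteriors = []
        for locus, allele in alleles:
            w1 = p * allele_prob(allele, q1[locus])
            w2 = (1.0 - p) * allele_prob(allele, q2[locus])
            posteriors.append(w1 / (w1 + w2))
        # M-step: the updated p is the average posterior
        p_new = sum(posteriors) / len(posteriors)
        converged = abs(p_new - p) < tol
        p = p_new
        if converged:
            break
    return p


# Toy data (made up): 50 loci, two allele copies each; 74 of the 100 copies are allele 1.
n_loci = 50
q1 = [0.9] * n_loci  # allele-1 frequency in population 1
q2 = [0.1] * n_loci  # allele-1 frequency in population 2
alleles = [(i % n_loci, 1 if i < 74 else 0) for i in range(100)]

p_hat = estimate_individual_ancestry(alleles, q1, q2)
# With these frequencies the likelihood is maximized where 0.1 + 0.8 * p = 0.74.
print(round(p_hat, 4))
```

In the full procedure on the slide, Q would also be updated in alternation with P; holding Q fixed keeps the sketch short and corresponds to the case where Q is estimated directly from ancestral population genotype data.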

Marker Information Content (MIC) Distribution Used for Simulation (300 Loci)
Mean = 0.22, Std Dev = 0.1003
[Figure: distribution of MIC across the 300 loci.]
$\mathrm{MIC}_i$ is a sum over $k = 1, \dots, n$ of a term in $f_{ikW}$ and $f_{ikB}$ (the formula also contains a constant 2), where
- $f_{ikW}$: frequency of allele k at locus i in Caucasians
- $f_{ikB}$: frequency of allele k at locus i in Africans
- $n$: number of alleles at locus i

[Diagram: African Americans (622 subjects from 211 families; 400 microsatellite markers, average distance 10 cM) and coronary and aortic artery calcium (CAC; calcified plaque, quantified by CT) are combined through admixture mapping to find CAC loci.]

Data
Samples: 1672 subjects from 3 populations
- 622 African Americans (211 families) from FHS-SCAN
- 893 Caucasians (320 families) from FHS-SCAN
- 157 Africans (unrelated) from Marshfield Center
Genotypes: 302 microsatellite loci for all subjects, average marker distance 11.9 cM
Phenotype: coronary and aortic artery calcium (CAC) of the 622 African Americans, Blom transformation

Statistical Procedure
Step 1: Randomly draw one subject from each family to create a sample of 688 unrelated subjects, comprising:
- 211 African Americans from 211 families (FHS-SCAN)
- 320 Caucasians from 320 families (FHS-SCAN)
- 157 unrelated Africans (Marshfield Center)
Step 2: Ancestry estimation with STRUCTURE 2.1.
Step 3: Ancestry-CAC association analysis: regress the 211 African Americans' CAC scores on their locus-specific ancestries from Africans.
Step 4: Repeat Steps 1 to 3 100 times and obtain the average p-value for each locus.
Step 5: For each locus, run a permutation test on the average p-value (10,000 random permutations).

RESULTS: Sources of Variation of Ancestry-from-Africans

| Sources of variation | Variance components | Percent (%) |
|---|---|---|
| Families | 0.01054 | 48.19 |
| Subjects within family | 0.00492 | 22.50 |
| Loci within subject | 0.00599 | 27.39 |
| Replications within locus | 0.00042 | 1.92 |

[Pie chart: Var(families) 48%, Var(subjects/family) 23%, Var(loci/subject) 27%, Var(replications/locus) 2%.]

RESULTS: Ancestry Analysis at Population Level
Population Admixture Proportions in African Americans

| Founding population | Ancestry (%) |
|---|---|
| From Caucasians | 22.04 |
| From Africans | 77.96 |

RESULTS: Ancestry Analysis at Individual Level
Individual Ancestry Distribution of 622 African Americans
Ancestry-from-Africans: average 77.96% (range 3.1% to 96.9%)
[Figure: distribution of individual ancestry.]

RESULTS: Ancestry Analysis at Individual-locus Level
Distribution of Locus-specific Ancestries from Africans
[Figure: locus-specific ancestry from Africans for an example African American. Y-axis: ancestry from Africans. X-axis: 302 microsatellite loci ordered by chromosome and position, from Chrom. 1 (4.22 cM) to Chrom. 23 (104.83 cM).]

RESULTS: Locus-specific Ancestry-CAC association analysis

| No. | Loci | Chr# | Pos. (cM) | Permu. p | Reg. coeff. | R2 |
|---|---|---|---|---|---|---|
| 1 | AFM063XF4 | 10 | 19.0 (10p14) | 0.0021 | -1.2442 | 0.0310 |
| 2 | GATA64D02 | 6 | 80.45 (6q12) | 0.0024 | -2.2112 | 0.0205 |
| 3 | GATA42H02 | 4 | 181.93 (4q32) | 0.0083 | 2.7996 | 0.0198 |
| 4 | AFMB337ZH9 | 22 | 60.61 | 0.0120 | 1.1594 | 0.0194 |
| 5 | GGAA20G10 | 2 | 27.6 | 0.0133 | 0.7271 | 0.0166 |
| 6 | GATA73H09 | 12 | 78.14 | 0.0170 | -1.4403 | 0.0150 |
| 7 | GGAA3F06 | 7 | 41.69 | 0.0173 | 1.6652 | 0.0163 |
| 8 | UT1307 | 20 | 69.5 | 0.0178 | -1.1565 | 0.0175 |
| 9 | UT7136 | 22 | 52.61 | 0.0194 | 1.9457 | 0.0162 |
| 10 | GATA163B10 | 6 | 42.27 | 0.0267 | -1.3473 | 0.0165 |
| 11 | GATA88F09 | 10 | 4.32 | 0.0315 | -2.1781 | 0.0153 |
| 12 | GATA26D02 | 12 | 83.19 | 0.0319 | -2.0880 | 0.0130 |
| 13 | ATA1B07 | 11 | 54.09 | 0.0339 | 1.1540 | 0.0143 |
| 14 | ATA4E02 | 1 | 192.05 | 0.0394 | 0.7829 | 0.0122 |
| 15 | GATA137H02 | 7 | 29.28 | 0.0418 | 1.3168 | 0.0121 |
| 16 | GATA4D07 | 2 | 145.08 | 0.0455 | -1.0065 | 0.0125 |
| 17 | ATA31G11 | 10 | 28.31 | 0.0461 | -1.2933 | 0.0134 |

-log(p value) of Markers on Chromosome 4
[Plot: Chromosome 4 (20 markers); x-axis Distance (cM), y-axis -log(p); labeled peak at GATA42H02.]

-log(p value) of Markers on Chromosome 6
[Plot: Chromosome 6 (16 markers); x-axis Distance (cM), y-axis -log(p); labeled peak at GATA64D02.]

-log(p value) of Markers on Chromosome 10
[Plot: Chromosome 10 (14 markers); x-axis Distance (cM), y-axis -log(p); labeled peak at AFM063XF4.]
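
The regression and permutation steps of the statistical procedure can be illustrated for a single locus. The sketch below is a simplification, not the analysis behind the results above: it regresses a phenotype on locus-specific ancestry once and permutes the phenotype to get an empirical p-value for R2, whereas the procedure on the slides permutes on a p-value averaged over 100 resampled replicates. The function names, the effect size, and the simulated data are all hypothetical.

```python
import random


def slope_and_r2(x, y):
    """Least-squares slope and R^2 for the simple regression y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


def permutation_p_value(x, y, n_perm=999, seed=0):
    """Empirical p-value: share of phenotype shuffles with R^2 >= the observed R^2."""
    rng = random.Random(seed)
    _, observed_r2 = slope_and_r2(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any ancestry-phenotype link
        _, perm_r2 = slope_and_r2(x, shuffled)
        if perm_r2 >= observed_r2:
            hits += 1
    # Count the observed data as one arrangement so the p-value is never zero
    return (hits + 1) / (n_perm + 1)


# Simulated toy data (made up): 211 unrelated subjects, one locus.
rng = random.Random(1)
lsa = [rng.random() for _ in range(211)]           # locus-specific ancestry in [0, 1]
cac = [-3.0 * a + rng.gauss(0.0, 1.0) for a in lsa]  # phenotype with an assumed effect

slope, r2 = slope_and_r2(lsa, cac)
p_value = permutation_p_value(lsa, cac)
print(slope, r2, p_value)
```

Permuting the phenotype rather than relying on the regression's parametric p-value avoids distributional assumptions, which is one plausible reason for using a permutation test here.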