Supplemental Data for

Distribution of Ancestral Chromosomal Segments in Admixed Genomes and Its Implications for Inferring Population History and Admixture Mapping

Wenfei Jin,1,§ Ran Li,1,§ Ying Zhou,1 Shuhua Xu1,*

1 Max Planck Independent Research Group on Population Genomics, Chinese Academy of Sciences and Max Planck Society (CAS-MPG) Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China.
§ These authors contributed equally to this work.
* To whom correspondence should be addressed. E-mail: xushua@picb.ac.cn (S.X.)

Features of LACS distribution in HI and GA models

Since the ancestral chromosomal segments in the HI model follow an exponential distribution, the expectation of LACS from pop1 at generation T is

$$E(x;T) = \frac{1}{(1-m)T} \tag{3}$$

and the variance is

$$D(x;T) = \frac{1}{(1-m)^2 T^2}. \tag{4}$$

In the T-generation GA model, the expected mean and variance of LACS can be calculated from the definitions

$$E(x;T) = \int_0^{\infty} x f(x;T)\,dx \quad \text{and} \quad D(x;T) = \int_0^{\infty} \left(x - E(x;T)\right)^2 f(x;T)\,dx.$$

To our knowledge, these integrals cannot be evaluated directly to obtain an analytic expression. We therefore used an alternative approach to estimate the mean and variance of LACS in the GA model. From the method used to calculate the density function, the mean LACS from pop1 can be obtained by averaging the mean LACS over the different time points. Since ancestral chromosomal segments from different generations carry different weights, and the weight $P(x_t)$ is proportional to $w_t = (1-m)t$, the weighted average of the means $x_t$ over time is

$$\hat{E}(x;T) = \int_0^T x_t P(x_t)\,dt = \frac{\int_0^T x_t w_t\,dt}{\int_0^T w_t\,dt} = \frac{\int_0^T \frac{1}{(1-m)t}\,(1-m)t\,dt}{\int_0^T (1-m)t\,dt} = \frac{T}{\tfrac{1}{2}(1-m)T^2} = \frac{2}{(1-m)T}. \tag{5}$$

A similar strategy can be applied to estimate the variance of LACS in the GA model.
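The variance needs one ingredient beyond Eq. (5): the second moment of the segment length at a single time point. As a brief worked step, assuming (as in Eq. (5)) that segments contributed at time t are exponentially distributed with rate $(1-m)t$, the standard exponential moments give:

```latex
E(x_t) = \frac{1}{(1-m)t}, \qquad
D(x_t) = \frac{1}{(1-m)^2 t^2}, \qquad
E(x_t^2) = D(x_t) + E^2(x_t) = \frac{2}{(1-m)^2 t^2}.
```

Weighting $E(x_t^2)$ by the same $w_t = (1-m)t$ yields $E(x^2;T)$, from which $E^2(x;T)$ is subtracted to obtain the variance.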
$$
\begin{aligned}
\hat{D}(x;T) &= E(x^2;T) - E^2(x;T) \\
&= \sum_{t=1}^{T} p_t\,E(x_t^2) - \left(\frac{2}{(1-m)T}\right)^2 \\
&= \frac{\sum_{t=1}^{T} (1-m)t\,E(x_t^2)}{\sum_{t=1}^{T} (1-m)t} - \left(\frac{2}{(1-m)T}\right)^2 \\
&= \frac{\sum_{t=1}^{T} (1-m)t\,\left(D(x_t) + E^2(x_t)\right)}{\sum_{t=1}^{T} (1-m)t} - \left(\frac{2}{(1-m)T}\right)^2 \\
&\approx \frac{4\left(\sum_{t=1}^{T} \frac{1}{t} - 1\right)}{(1-m)^2 T^2}
\end{aligned} \tag{6}
$$

The last step uses $\sum_{t=1}^{T} (1-m)t \approx \tfrac{1}{2}(1-m)T^2$, the same normalization as in Eq. (5).

Data simulation and comparison with theoretical LACS distribution

Both the admixed population and the parental populations were simulated using the forward-time simulation program we developed previously [1,2]. In brief, the haploid chromosomes of YRI (Yoruba in Ibadan, Nigeria) and CEU (Utah residents with northern and western European ancestry from the CEPH collection) from HapMap were treated as the initial state of the two parental populations [3]. Haploid chromosomes from YRI and CEU were then sampled according to their genetic contributions. A pair of haploid chromosomes sampled from the parental populations formed each diploid admixed individual. Recombination was then introduced into the admixed population based on the genetic map from the HapMap data [3]. Mutation was ignored because of the short population history. The effective population size (Ne) of each population was set at 5,000. The ancestral origin of each haplotype was labeled so that ancestral chromosomal segments could be tracked in the simulated admixed population. This allowed us to compare the simulated LACS distribution directly with the theoretical LACS distribution calculated using the formula we derived in this study.

Inferring ancestral chromosomal segments and population admixture history

We used HAPMIX [4], a software package that integrates population genetic models, to identify ancestral chromosomal segments in admixed populations. Since HAPMIX can infer local ancestry directly only under a two-way admixture model (using only two reference populations), when the admixed population was formed by multi-way admixture we chose one parental population as one reference population and combined all the other parental populations as the other reference population.
This allowed us to infer the ancestral chromosomal segments of one parental population at a time. For example, we first combined the African and European parental populations and treated them as a single parental population (African-European population). We then used the African-European population and the Amerindian population as the two reference populations. In this way we could infer the Amerindian ancestral chromosomal segments in African-Americans and analyze the admixture dynamics of the Amerindian ancestral component.

Ancestral segments shorter than a certain threshold cannot be accurately inferred because of the limited density of genetic variants and statistical error [1,5,6]. Therefore, we considered only ancestral chromosomal segments longer than a threshold C, which is a constant. For example, the expected proportion of LACS > C ($p_c$) in the HI model is

$$E[p_c \mid T] = 1 - \int_0^{C} (1-m)T\,e^{-(1-m)Tx}\,dx = e^{-(1-m)TC}$$

(see Results), which is a constant once the threshold C is set. The distribution of ancestral chromosomal segments longer than the threshold can therefore be used to infer the population history. Since the mean and standard deviation (SD) of LACS under each admixture model are straightforward to obtain from the theoretical distribution, we inferred the population admixture history by comparing the empirical mean and SD with those from the theoretical models.

Simulation of case-control and admixture mapping

To elucidate the influence of the LACS distribution on admixture mapping under the two admixture models, we simulated a data set for a systematic comparative analysis. Using the simulation methods for the admixed population described above, we randomly sampled simulated admixed individuals as controls. Cases were simulated by random sampling of haploid chromosomes from admixed individuals, with the genetic contribution from the given parental population restricted specifically at the susceptibility locus.
More specifically, the genetic contribution of the given parental population to the admixed population was set at 20%, which is similar to the European genetic contribution to African-Americans. The number of generations since the initial population admixture was set at 20, which approximates the number of generations since population admixture began in the New World. Finally, we set the sample size of cases and controls to 2,000 and assumed an increased ancestry relative risk of 2 in both the HI and GA models, relative to alleles that did not come from the given parental population. We compared the signatures of association in the HI and GA models using case-only and case-control approaches. We also performed extensive simulations to investigate other possible scenarios.

Figure S1. Q-Q plot of the simulated LACS distribution under the 100-generation HI model versus the theoretical distribution. The red line shows the null hypothesis that the simulated distribution is the same as the theoretical distribution.

Figure S2. Q-Q plot of the simulated LACS distribution under the 100-generation GA model versus the theoretical distribution. The red line shows the null hypothesis that the simulated distribution is the same as the theoretical distribution.

Figure S3. Empirical LACS distribution of the African ancestral component in African-Americans and its corresponding theoretical distributions. The mean LACS in the theoretical models is the same as the empirical value.

Figure S4. Empirical LACS distribution of the European ancestral component in African-Americans and its corresponding theoretical distributions. The mean LACS in the theoretical models is the same as the empirical value.

Figure S5. Empirical LACS distribution of the European ancestral component in Mexicans and its corresponding theoretical distributions. The mean LACS in the theoretical models is the same as the empirical value.

Figure S6. Empirical LACS distribution of the Amerindian ancestral component in Mexicans and its corresponding theoretical distributions. The mean LACS in the theoretical models is the same as the empirical value.

References

1. Jin W, Wang S, Wang H, Jin L, Xu S: Exploring population admixture dynamics via empirical and simulated genome-wide distribution of ancestral chromosomal segments. Am J Hum Genet 2012; 91: 849-862.
2. Jin W, Xu S, Wang H et al: Genome-wide detection of natural selection in African Americans pre- and post-admixture. Genome Res 2012; 22: 519-527.
3. Altshuler DM, Gibbs RA, Peltonen L et al: Integrating common and rare genetic variation in diverse human populations. Nature 2010; 467: 52-58.
4. Price AL, Tandon A, Patterson N et al: Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS Genet 2009; 5: e1000519.
5. Pool JE, Nielsen R: Inference of historical changes in migration rate from the lengths of migrant tracts. Genetics 2009; 181: 711-719.
6. Johnson NA, Coram MA, Shriver MD et al: Ancestral components of admixed genomes in a Mexican cohort. PLoS Genet 2011; 7: e1002410.
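
Numerical check of the LACS moments

The closed-form moments in the section on features of the LACS distribution can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and it assumes that segments contributed at time t are exponentially distributed with rate (1 - m)t and weighted by w_t = (1 - m)t, as in the derivation of the GA mean and variance. It compares the closed forms for the GA model with explicit weighted sums over t = 1..T.

```python
def hi_moments(m, T):
    """HI model: mean and variance of LACS, Eqs. (3) and (4)."""
    rate = (1.0 - m) * T
    return 1.0 / rate, 1.0 / rate ** 2


def ga_moments_closed_form(m, T):
    """GA model: approximate mean and variance, Eqs. (5) and (6)."""
    harmonic = sum(1.0 / t for t in range(1, T + 1))
    mean = 2.0 / ((1.0 - m) * T)
    variance = 4.0 * (harmonic - 1.0) / ((1.0 - m) ** 2 * T ** 2)
    return mean, variance


def ga_moments_weighted(m, T):
    """GA model: the same moments from explicit weighted sums over t = 1..T.

    Assumption (labeled, not from the source code): the exponential rate at
    time t and the weight w_t are both (1 - m) * t, so one list serves both.
    """
    weights = [(1.0 - m) * t for t in range(1, T + 1)]
    total = sum(weights)
    # E(x_t) = 1 / rate_t and E(x_t^2) = 2 / rate_t^2 for an exponential.
    mean = sum(w * (1.0 / w) for w in weights) / total
    second_moment = sum(w * (2.0 / w ** 2) for w in weights) / total
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    m, T = 0.2, 20  # values used in the admixture-mapping simulation
    print("HI (mean, var):            ", hi_moments(m, T))
    print("GA closed form (mean, var):", ga_moments_closed_form(m, T))
    print("GA weighted sum (mean, var):", ga_moments_weighted(m, T))
```

For the same m and T, the GA closed-form mean is exactly twice the HI mean. The discrete weighted sum gives a mean of 2/((1 - m)(T + 1)) rather than 2/((1 - m)T), so the two GA versions differ slightly for small T and converge as T grows.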