Theoretical Distribution of Lengths of Ancestral

Supplemental Data for
Distribution of Ancestral Chromosomal Segments in
Admixed Genomes and Its Implications for Inferring
Population History and Admixture Mapping
Wenfei Jin,1,§ Ran Li,1,§ Ying Zhou,1 Shuhua Xu1,*
1 Max Planck Independent Research Group on Population Genomics, Chinese
Academy of Sciences and Max Planck Society (CAS-MPG) Partner Institute for
Computational Biology, Shanghai Institutes for Biological Sciences, Chinese
Academy of Sciences, Shanghai 200031, China.
§ These authors contributed equally to this work.
* To whom correspondence should be addressed. E-mail: xushua@picb.ac.cn (S.X.)
Features of LACS distribution in HI and GA models
Since the ancestral chromosomal segments in the HI model followed an exponential
distribution, the expectation of LACS from pop1 at generation T was

$$E(x;T) = \frac{1}{(1-m)T} \qquad (3)$$

and the variance was

$$D(x;T) = \frac{1}{(1-m)^2 T^2} \qquad (4).$$
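As a quick numerical check of Eqs. (3) and (4), the sketch below draws segment lengths from an exponential distribution with rate (1 - m)T and compares the sample mean and variance with the closed forms. The function names, the values m = 0.2 and T = 20, and the reading of lengths as genetic distance in Morgans are assumptions made for illustration only.

```python
import random
import statistics


def hi_mean(m, T):
    """Expected LACS from pop1 under the HI model, Eq. (3)."""
    return 1.0 / ((1.0 - m) * T)


def hi_variance(m, T):
    """Variance of LACS from pop1 under the HI model, Eq. (4)."""
    return 1.0 / ((1.0 - m) ** 2 * T ** 2)


def simulate_hi_lacs(m, T, n, seed=0):
    """Draw n segment lengths from Exp(rate = (1 - m) * T).

    Lengths are assumed to be in Morgans (an illustrative assumption).
    """
    rng = random.Random(seed)
    rate = (1.0 - m) * T
    return [rng.expovariate(rate) for _ in range(n)]


if __name__ == "__main__":
    m, T = 0.2, 20  # illustrative values
    sample = simulate_hi_lacs(m, T, 200_000)
    print("mean:    ", hi_mean(m, T), statistics.fmean(sample))
    print("variance:", hi_variance(m, T), statistics.pvariance(sample))
```

With a large sample, both simulated moments should fall close to the theoretical values.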
In the T-generation GA model, the expected mean and variance of LACS could be
calculated based on the definition:
$E(x;T) = \int_0^{\infty} x f(x;T)\,dx$ and $D(x;T) = \int_0^{\infty} (x - E(x;T))^2 f(x;T)\,dx$. To our knowledge, it
is impossible to integrate the formula directly to obtain an analytic expression.
Therefore, we used an alternative approach to estimate the mean and variance of LACS
in the GA model. Based on the method used to calculate the density function, the
mean LACS from pop1 could be calculated by averaging the mean LACS from
different times. Since ancestral chromosomal segments from different
generations have different weights, and the weight $P(x_t)$ is proportional
to $w_t = (1-m)t$, the weighted mean of $x_t$ over the different times could be
calculated as
$$\hat{E}(x;T) = \int_0^T x_t P(x_t)\,dt = \frac{\int_0^T x_t w_t\,dt}{\int_0^T w_t\,dt} = \frac{\int_0^T \frac{1}{(1-m)t}\,(1-m)t\,dt}{\int_0^T (1-m)t\,dt} = \frac{T}{(1-m)T^2/2} = \frac{2}{(1-m)T} \qquad (5)$$
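The weighted average in Eq. (5) can be checked numerically. The sketch below replaces the integrals with sums over generations t = 1..T, which is an assumption made here for illustration. The discrete version gives 2/((1 - m)(T + 1)), which approaches the closed form 2/((1 - m)T) as T grows. Function names are hypothetical.

```python
def ga_mean_closed_form(m, T):
    """Mean LACS under the GA model, Eq. (5)."""
    return 2.0 / ((1.0 - m) * T)


def ga_mean_weighted(m, T):
    """Discrete analogue of Eq. (5): weighted average of per-generation means.

    Weights are w_t = (1 - m) * t and per-generation means are
    x_t = 1 / ((1 - m) * t), summed over t = 1..T instead of integrated.
    """
    gens = range(1, T + 1)
    weights = [(1.0 - m) * t for t in gens]
    means = [1.0 / ((1.0 - m) * t) for t in gens]
    return sum(w * x for w, x in zip(weights, means)) / sum(weights)


if __name__ == "__main__":
    m, T = 0.2, 100  # illustrative values
    print(ga_mean_closed_form(m, T), ga_mean_weighted(m, T))
```

The ratio of the discrete to the continuous result is T/(T + 1), so the two agree to within about 1% at T = 100.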
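Similar strategies could be applied to estimate the variance of LACS in the GA
model.

$$\begin{aligned}
\hat{D}(x;T) &= E(x^2;T) - E^2(x;T) = \sum_{t=1}^{T} p_t E(x_t^2) - \left(\frac{2}{(1-m)T}\right)^2 \\
&= \frac{\sum_{t=1}^{T} (1-m)t\,E(x_t^2)}{\sum_{t=1}^{T} (1-m)t} - \left(\frac{2}{(1-m)T}\right)^2 \\
&= \frac{\sum_{t=1}^{T} (1-m)t\,\left(D(x_t) + E^2(x_t)\right)}{\sum_{t=1}^{T} (1-m)t} - \left(\frac{2}{(1-m)T}\right)^2 \\
&\approx \frac{4\left(\sum_{t=1}^{T} \frac{1}{t} - 1\right)}{(1-m)^2 T^2} \qquad (6)
\end{aligned}$$

The last step substitutes $D(x_t) + E^2(x_t) = 2/((1-m)^2 t^2)$ from Eqs. (3) and (4)
and uses $\sum_{t=1}^{T} t = T(T+1)/2 \approx T^2/2$.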
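To see how close the final expression in Eq. (6) is to the weighted sum it comes from, the sketch below evaluates both. It assumes each generation's LACS is exponential with rate (1 - m)t, so that E(x_t^2) = 2/((1 - m)^2 t^2). Function names and parameter values are illustrative.

```python
def ga_variance_closed_form(m, T):
    """Variance of LACS under the GA model, final expression of Eq. (6)."""
    harmonic = sum(1.0 / t for t in range(1, T + 1))
    return 4.0 * (harmonic - 1.0) / ((1.0 - m) ** 2 * T ** 2)


def ga_variance_weighted(m, T):
    """Eq. (6) before simplification: weighted second moment minus squared mean."""
    gens = range(1, T + 1)
    weights = [(1.0 - m) * t for t in gens]
    # E(x_t^2) = D(x_t) + E^2(x_t) = 2 / ((1 - m)^2 t^2) for an exponential
    second_moments = [2.0 / ((1.0 - m) ** 2 * t ** 2) for t in gens]
    ex2 = sum(w * s for w, s in zip(weights, second_moments)) / sum(weights)
    return ex2 - (2.0 / ((1.0 - m) * T)) ** 2


if __name__ == "__main__":
    m, T = 0.2, 100  # illustrative values
    print(ga_variance_closed_form(m, T), ga_variance_weighted(m, T))
```

The two values differ only through the T(T + 1) versus T^2 term in the denominator, so the gap shrinks as T increases.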
Data simulation and comparison with theoretical LACS distribution
Both the admixed population and the parental populations were simulated using the
forward-time simulation program we developed previously [1,2]. In brief, the haploid
chromosomes of YRI (Yoruba in Ibadan, Nigeria) and CEU (Utah residents with
northern and western European ancestry from the CEPH collection) from HapMap
were treated as the initial state of the two parental populations [3]. Haploid
chromosomes from YRI and CEU were then sampled based on their genetic
contributions. A pair of haploid chromosomes sampled from the two parental
populations formed each diploid admixed individual. Recombination was then
introduced into the admixed population based on the genetic map from the HapMap
data [3]. Mutation was ignored given the short population history. The effective
population size (Ne) of each population was set at 5,000. Ancestral origin of the
haplotype was labeled to track ancestral chromosomal segments in the simulated
admixed population. This allowed us to directly compare the simulated LACS
distribution with the theoretical LACS distribution that was calculated using the
formula we deduced in this study.
Inferring ancestral chromosomal segments and population admixture history
We used HAPMIX [4], a program that integrates population genetic models, to identify
ancestral chromosomal segments in admixed populations. Since HAPMIX can only
infer local ancestries directly based on a two-way admixture model (using only two
reference populations), we chose one parental population as a reference population
and combined all the other parental populations as the other reference population
when the admixed population was formed by multiple-way admixture. This allowed
us to infer the ancestral chromosomal segments of one parental population each time.
For example, we first combined the African and European parental populations and
treated them as a single parental population (African-European population). Then we
used the African-European population and the Amerindian population as the two
reference populations. Thus we could infer the ancestral chromosomal segments of
Amerindian origin in African-Americans and analyze the admixture dynamics of
the Amerindian ancestral component.
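Once per-marker local ancestry has been called against two reference groups, consecutive markers with the same call can be merged into segments. The sketch below is a minimal illustration of that step. It does not reproduce HAPMIX's output format: the input lists, the labels "AMR" and "AFR-EUR", and the function names are all hypothetical.

```python
from itertools import groupby


def ancestral_segments(positions, labels):
    """Merge runs of identical ancestry calls into (label, start, end) tuples.

    positions: marker positions in genetic-map units, sorted ascending.
    labels: one ancestry call per marker (hypothetical two-way labels).
    """
    if len(positions) != len(labels):
        raise ValueError("positions and labels must have the same length")
    segments = []
    idx = 0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, positions[idx], positions[idx + n - 1]))
        idx += n
    return segments


def segment_lengths(segments, target):
    """Lengths (end - start) of all segments carrying the target label."""
    return [end - start for label, start, end in segments if label == target]


if __name__ == "__main__":
    pos = [0.0, 0.5, 1.2, 2.0, 2.4, 3.1]
    calls = ["AFR-EUR", "AFR-EUR", "AMR", "AMR", "AMR", "AFR-EUR"]
    segs = ancestral_segments(pos, calls)
    print(segs)
    print(segment_lengths(segs, "AMR"))
```

Measuring a segment from its first to its last marker slightly underestimates its true length. A fuller treatment might place the boundary between adjacent markers with different calls.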
The ancestral segments shorter than a certain threshold could not be accurately
inferred due to the limited density of genetic variants and statistical error [1,5,6].
Therefore, we were interested only in the long ancestral chromosomal segments over
a certain threshold, C, which is a constant value. For example, the expected
proportion of LACS > C ($p_C$) in the HI model is

$$E[p_C \mid T] = 1 - \int_0^C (1-m)T\,e^{-(1-m)Tx}\,dx = e^{-(1-m)TC}$$

(see Results), which is a constant value
when the threshold C was set. Therefore, the distribution of ancestral chromosomal
segments longer than a threshold can be used to infer the population history. Since it
was straightforward to obtain the mean and standard deviation (SD) of LACS under
each admixture model from its theoretical distribution, we inferred the population
admixture history by comparing the empirical mean and SD with those of the
theoretical models.
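The expected proportion of segments longer than C under the HI model can be verified by integrating the exponential density numerically. The sketch below compares the closed form with a midpoint-rule integral. The parameter values (m = 0.2, T = 20, C = 0.01) and function names are illustrative assumptions, with C taken to be in the same genetic-map units as x.

```python
import math


def prop_longer_than(m, T, C):
    """Closed form: E[p_C | T] = exp(-(1 - m) * T * C) under the HI model."""
    return math.exp(-(1.0 - m) * T * C)


def prop_longer_than_numeric(m, T, C, steps=10_000):
    """1 - integral_0^C rate * exp(-rate * x) dx, using the midpoint rule."""
    rate = (1.0 - m) * T
    h = C / steps
    integral = h * sum(rate * math.exp(-rate * (i + 0.5) * h) for i in range(steps))
    return 1.0 - integral


if __name__ == "__main__":
    m, T, C = 0.2, 20, 0.01  # illustrative values
    print(prop_longer_than(m, T, C), prop_longer_than_numeric(m, T, C))
```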
Simulation of case-control and admixture mapping
To elucidate the influences of LACS distribution on admixture mapping in the two
different admixture models, we simulated a data set for a systematic comparative
analysis. Based on the aforementioned methods for the simulation of admixed
population, we randomly sampled the simulated admixed individuals as controls.
Cases were simulated by random sampling of haploid chromosomes from admixed
individuals, but we restricted genetic contribution from the given parental population
particularly in the susceptibility locus.
More specifically, the genetic contribution of the given parental population to the
admixed population was set at 20%, which is similar to the genetic contribution
of Europeans to African-Americans. The number of generations since the initial
population admixture was set at 20, which approximates the time
of population admixture in the New World. Finally, we set the sample size of cases
and controls to be 2000 and assumed an increased ancestry relative risk of 2 in
both HI and GA models, relative to the alleles that did not come from the given
parental population. We compared the signatures of association in HI and GA models
based on case-only and case-control approaches. We also performed extensive
simulations to investigate other possible scenarios.
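One way to picture the enrichment of the risk ancestry in cases is a single-locus sketch under a multiplicative per-chromosome risk model. This is an assumption made here for illustration, not necessarily the sampling scheme used in the simulations described above. Under that assumption, a case chromosome carries the given ancestry at the susceptibility locus with probability m·r / (m·r + 1 - m). The function names, the reading of the sample size as 2000 cases and 2000 controls, and the omission of linked markers are also simplifications.

```python
import random


def case_ancestry_prob(m, rr):
    """Probability that a case chromosome carries the risk ancestry at the locus.

    Assumes a multiplicative per-chromosome ancestry relative risk rr
    (an illustrative assumption).
    """
    return m * rr / (m * rr + (1.0 - m))


def sample_locus_ancestry(n_chromosomes, prob, seed=0):
    """Draw 0/1 ancestry indicators at the locus for n_chromosomes."""
    rng = random.Random(seed)
    return [1 if rng.random() < prob else 0 for _ in range(n_chromosomes)]


def case_control_excess(m, rr, n_cases, n_controls, seed=0):
    """Difference in risk-ancestry frequency at the locus, cases minus controls."""
    cases = sample_locus_ancestry(2 * n_cases, case_ancestry_prob(m, rr), seed)
    controls = sample_locus_ancestry(2 * n_controls, m, seed + 1)
    return sum(cases) / len(cases) - sum(controls) / len(controls)


if __name__ == "__main__":
    # m = 0.2 and relative risk 2 follow the text; 2000 per group is assumed
    print(case_ancestry_prob(0.2, 2.0))
    print(case_control_excess(0.2, 2.0, 2000, 2000))
```

With m = 0.2 and a relative risk of 2, the expected risk-ancestry frequency in cases is 1/3 against 0.2 in controls. How far that signal extends along the chromosome depends on the LACS distribution, which this single-locus sketch does not model.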
Figure S1. Q-Q plot of simulated LACS distribution under 100-generation HI
model versus theoretical distribution. Red line shows null hypothesis that simulated
distribution is the same as theoretical distribution.
Figure S2. Q-Q plot of simulated LACS distribution under 100-generation GA
model versus theoretical distribution. Red line shows null hypothesis that simulated
distribution is the same as theoretical distribution.
Figure S3. Empirical LACS distribution of the African ancestral component in
African-Americans and its corresponding theoretical distributions. The means of
LACS in the theoretical models are the same as the empirical value.
Figure S4. Empirical LACS distribution of the European ancestral component
in African-Americans and its corresponding theoretical distributions. The means of
LACS in the theoretical models are the same as the empirical value.
Figure S5. Empirical LACS distribution of the European ancestral component
in Mexicans and its corresponding theoretical distributions. The means of LACS in
the theoretical models are the same as the empirical value.
Figure S6. Empirical LACS distribution of the Amerindian ancestral
component in Mexicans and its corresponding theoretical distributions. The means
of LACS in the theoretical models are the same as the empirical value.
References
1. Jin W, Wang S, Wang H, Jin L, Xu S: Exploring population admixture dynamics via
empirical and simulated genome-wide distribution of ancestral chromosomal segments.
Am J Hum Genet 2012; 91: 849-862.
2. Jin W, Xu S, Wang H et al: Genome-wide detection of natural selection in African Americans
pre- and post-admixture. Genome Res 2012; 22: 519-527.
3. Altshuler DM, Gibbs RA, Peltonen L et al: Integrating common and rare genetic variation in
diverse human populations. Nature 2010; 467: 52-58.
4. Price AL, Tandon A, Patterson N et al: Sensitive detection of chromosomal segments of
distinct ancestry in admixed populations. PLoS Genet 2009; 5: e1000519.
5. Pool JE, Nielsen R: Inference of historical changes in migration rate from the lengths of
migrant tracts. Genetics 2009; 181: 711-719.
6. Johnson NA, Coram MA, Shriver MD et al: Ancestral components of admixed genomes in a
Mexican cohort. PLoS Genet 2011; 7: e1002410.