Tina – population admixture - University of South Carolina

Drake 1
Kristina Drake
Dr. Bert Ely
BIOL 303, Section 501
5 November 2010
Population Genetics in Costa Rica
Technological advancements in the last few centuries have dramatically increased the
mobility of populations throughout the world. Thus, despite the physical barriers that separate
the different continental populations, gene flow is common even over long distances.
Immigration and interbreeding introduce new genetic variants into a population, and these
additions to the gene pool create genetic admixture. One example of an admixed
population is Costa Rica [1], a country with about four million people of diverse ancestry.
Historical information about Costa Rica indicates that the nation’s population consists of
genomic contributions from three continental populations, namely Africans, Europeans, and
Native Americans [1]. This mixed pool of genetic information provides a unique opportunity to
study the genes associated with vulnerability to common diseases, and as a foundation for this
ultimate purpose, researchers have conducted analyses of the genetic composition and
substructure of the Costa Rican population.
In one study, scientists from the United States, France, and Costa Rica itself cooperated
on a comprehensive population structure analysis of 1,301 women from the northern Pacific
region of Costa Rica (CR), known as Guanacaste [1]. They used admixture mapping to
estimate ancestry by measuring the degree of genetic admixture with single nucleotide
polymorphisms (SNPs). The women were genotyped on a custom Illumina InfiniumII iSelect
panel containing 27,904 tag SNPs. These tag SNPs are a subclass of SNPs that capture the
majority of the information represented by the other SNPs in the surrounding region, based on
the linkage disequilibrium (LD) structure [2]. Linkage disequilibrium is the degree of non-random
association of alleles at multiple loci. These SNPs were selected according to a multiethnic
tagging strategy which targeted candidate genes in European, African, and Asian HapMap
reference populations. After standard quality-control filtering of the SNPs, 9,148 were identified
as common among the CR samples and the four reference populations: the three HapMap
samples plus Illumina 550K Native American samples, all confirmed to be unrelated
individuals [1]. Further filtering narrowed the results to 2,663 population structure inference SNPs
with low background LD, because background LD could otherwise interfere with the results.
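The idea of non-random association between alleles can be made concrete with a small calculation. The sketch below is illustrative only: it assumes phased haplotypes at two biallelic loci with alleles coded 0 or 1, and it uses the standard textbook definitions D = p_AB - p_A * p_B and r² = D² / (p_A(1 - p_A) p_B(1 - p_B)). The function name `ld_stats` and the toy haplotypes are hypothetical and are not taken from the Costa Rican study.

```python
def ld_stats(haplotypes):
    """Return (D, r_squared) for two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, one per chromosome,
    with each allele coded as 0 or 1 (an assumption of this sketch).
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    # Frequency of chromosomes that carry allele 1 at both loci.
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # r^2 is undefined when either locus is monomorphic; report 0.0 in that case.
    r2 = d * d / denom if denom > 0 else 0.0
    return d, r2


# Toy data: alleles that always travel together, and alleles that are independent.
linked = [(1, 1)] * 5 + [(0, 0)] * 5
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)] * 3

print(ld_stats(linked))
print(ld_stats(unlinked))
```

In the first toy sample the two alleles are perfectly associated, so r² reaches its maximum of 1; in the second they are independent, so both D and r² are 0. SNPs with low pairwise r² are the kind that would survive a "low background LD" filter.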
Five different programs were employed to analyze the structure and admixture of the CR
population samples. The model-based STRUCTURE program was used to estimate the
admixture proportions of the CR samples in comparison to the four continental reference
samples. The CLUMPP program was used to compute the average admixture coefficients, and
the DISTRUCT program plotted the range of resulting coefficients for all the samples. Principal
components analysis was applied to examine the population structure in more depth, and the
Haploview program performed haplotype analyses to estimate the average linkage disequilibrium
block size for all the samples for comparison.
The most important of these programs is STRUCTURE, a model-based method for inferring
population structure from multilocus genotype data [3]. The number of
ancestral populations in the data sample is assumed to be an unknown number K, and individuals
in the sample are allocated to one of the K populations based on allele frequencies at
different loci [3]. If the genotype of an individual appears to be admixed, it is assigned jointly to multiple
populations. Whichever value of K has the highest value of the ∆K metric is taken as the probabilistically
optimal number of populations in the input data [3]. A subsequent study three years later modified
this method to allow for the linkage between loci that develops in admixed populations, known as
admixture linkage disequilibrium. This makes the program useful for admixture mapping, since
it can detect subtler population subdivisions [4]. The improved algorithms were tested
on admixed populations of African-Americans, and were used to study recombination in bacteria
and genetic drift in fruit flies.
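The choice of K by ∆K can be sketched numerically. The example below is an assumption-laden illustration, not the procedure documented in the cited papers: it uses the form of ∆K that is commonly implemented in practice (usually attributed to Evanno et al.), namely the absolute second-order difference of the mean log-likelihood divided by the standard deviation of the log-likelihood across replicate runs at that K. The function name `delta_k` and all log-likelihood values are made up for illustration.

```python
from statistics import mean, stdev


def delta_k(log_liks):
    """Compute Delta-K for every K that has both neighbours K-1 and K+1.

    `log_liks` maps K -> list of log-likelihoods from replicate runs.
    Assumes at least two replicates per K with non-identical values,
    so that the standard deviation is defined and non-zero.
    """
    result = {}
    for k in sorted(log_liks):
        if k - 1 not in log_liks or k + 1 not in log_liks:
            continue  # the second-order difference needs both neighbours
        second_diff = abs(
            mean(log_liks[k + 1]) - 2 * mean(log_liks[k]) + mean(log_liks[k - 1])
        )
        result[k] = second_diff / stdev(log_liks[k])
    return result


# Hypothetical replicate log-likelihoods (three runs per K); not real data.
runs = {
    2: [-5200, -5210, -5190],
    3: [-4800, -4805, -4795],
    4: [-4500, -4502, -4498],
    5: [-4490, -4520, -4470],
    6: [-4485, -4530, -4460],
}

scores = delta_k(runs)
best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest Delta-K:", best_k)
```

With these invented numbers the likelihood rises steeply up to K = 4 and then plateaus with noisier runs, so ∆K peaks at 4. ∆K cannot be computed for the smallest and largest K tested, which is why a range of K values on both sides of the expected answer has to be run.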
The STRUCTURE program from these previous studies was applied to analyze the
population structure in the CR sample. First, the program was run with various possible assumed
K values to find the unknown value, and the optimum result was 4, meaning that there were four
inferred ancestral populations for the pooled data of CR women and reference samples [1]. Since
four distinct continental reference samples were entered into the program along with the CR
samples, it was clear that the CR samples were not in themselves a separate ancestral population
but rather an admixture. Then the STRUCTURE analysis was run again, using only that K value
of 4, and the overall admixture proportions for the CR samples were calculated to be 42.5%
European, 38.3% Native American, and 15.2% African [1]. There appeared to be a 4% Asian influence,
which was most likely the result of a few outlier individuals with strong Asian origins. These
findings were consistent with the history of Costa Rican settlement, which was originally Native
American with an influx of colonizing Europeans and West Africans [1].
To explore the population structure of Guanacaste in more detail, the researchers used a
principal components analysis (PCA). PCA detects the primary axes of variation in the samples
and projects the samples onto these axes graphically [5]. Three pairwise eigenvector comparisons were
generated, and they supported the conclusion that the CR population is an admixture [1]. The CR
samples spread between the reference population samples, primarily between the European and
Native American data points, but skew noticeably towards the African samples as well. The
second comparison clearly shows that the Asian population contributed very little to the CR
genome [1]. The Asian data points lie in a clump entirely separate from the CR data points in the
second graph of the eigenvector comparisons, and unlike the African population in the first
comparison graph, there is little to no skewing in the direction of the Asian data points. The third
comparison shows a separation in the Native American samples between the Pima Indians and the
more closely related Colombians and Mayans. The latter two Native American groups more closely
resemble the genetic composition of the CR samples. This association corresponds with
historical knowledge about the Native American tribes; the Pima Indians typically resided much
farther geographically from Costa Rican territory than the other tribes, and, unlike the Europeans,
they did not migrate into the area for colonization purposes [1].
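How an admixed sample ends up "between" reference populations on a principal axis can be shown with a toy calculation. The sketch below is a simplified stand-in for the software used in the study: it assumes genotypes coded as allele counts (0, 1, 2), centers each SNP, and finds only the leading principal axis by power iteration in pure Python. The function name `first_principal_component` and the five toy samples are hypothetical.

```python
def first_principal_component(genotypes, iterations=200):
    """Project samples onto the leading principal axis of the genotype matrix.

    `genotypes` is a list of samples, each a list of allele counts per SNP.
    Uses power iteration on X^T X, where X is the column-centered matrix.
    The sign of the returned scores is arbitrary, as in any PCA.
    """
    n, m = len(genotypes), len(genotypes[0])
    col_means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - col_means[j] for j in range(m)] for row in genotypes]

    # Arbitrary non-uniform starting direction in SNP space.
    v = [1.0 / (j + 1) for j in range(m)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]   # X v
        v = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(m)]  # X^T (X v)
        norm = sum(w * w for w in v) ** 0.5
        v = [w / norm for w in v]

    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Toy data: two samples from reference population A, two from reference
# population B, and one sample that carries alleles typical of both.
samples = [
    [2, 2, 0, 0],  # A
    [2, 1, 0, 0],  # A
    [0, 0, 2, 2],  # B
    [0, 0, 2, 1],  # B
    [1, 1, 1, 1],  # admixed
]

pc1 = first_principal_component(samples)
for label, score in zip(["A", "A", "B", "B", "admixed"], pc1):
    print(label, round(score, 3))
```

In this toy case the two reference groups fall on opposite sides of the first axis and the admixed sample lands near the middle, which is the same qualitative pattern described for the CR samples relative to the European, Native American, and African reference points.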
Since Asians and Pima Indians appeared to have contributed very little to the CR gene
pool, the researchers refined their analysis by removing these samples from the input data. They
also removed the few CR samples that contained Asian admixture greater than 10%, as these
individuals were most likely outliers in the general composition of the CR population. The
resulting PCA showed strong correlations between the eigenvector (EV) projections and the admixture
coefficients estimated by the STRUCTURE analysis. When the admixture coefficients for
Europeans and Africans were compared in a histogram, a strong bimodality could be seen in the
graphs. This double peak indicates that there are substructures in the CR samples [1]. The
variability in the admixture coefficients, measured by the standard deviation, was 0.07 for
Africans and 0.13 for Europeans. The low variability in the former suggests that the majority of
the African admixing occurred during a relatively short period of time, early in the history of the
Costa Ricans. The higher variability in the latter suggests that the European admixing occurred
over a longer period of time, with a constant influx of European immigrants [1].
The African admixing fits with a relatively extreme model of admixture known as the
Hybrid-Isolation model [6], an idea proposed by Pfaff et al. in which the admixture takes place in a
single generation. The European admixing fits with another extreme model from that study,
known as the Continuous Gene Flow model [6], in which the admixture takes place at a constant
rate in every generation. These models were proposed as the result of research on linkage
disequilibrium patterns in African American population samples from coastal South Carolina and
Jackson, Mississippi, but can be applied to the Costa Rican population as well.
In exploring possibilities for these substructures in Costa Rica, the researchers separated
the samples into the different geographic regions of Guanacaste from which they originated. The
three regions that seemed to be correlated on the EV comparisons were the coastal cantons Santa
Cruz and Nicoya and the inland canton Tilaran [1]. When compared to the projections of the
reference populations, the Santa Cruz samples tended towards a greater proportion of African
admixture. Some of the Nicoya samples mixed in with those from Santa Cruz, but a large
portion of them tended towards a greater proportion of Native American admixture. The
Tilaran samples tended towards a greater proportion of European admixture. These findings
are consistent with historical information that indicates that Nicoya was inhabited by several
native tribes in the past, and that Tilaran had more recent European immigrants [1]. Thus,
the substructures within the Guanacaste population appear to mirror the geographic
location of the samples.
Once the Asian and Pima samples were removed, the STRUCTURE analysis, like the refined
PCA, revealed more subtle nuances in the CR population. In the graph of the
admixture proportions, there are two center points around which the CR samples are focused.
Most samples concentrate around the point at 40% European, 40% Native American, and
20% African admixture. The second, smaller group of samples concentrates around the point at
60% European, 30% Native American, and 10% African admixture [1]. This suggests that the
Guanacaste population consists of two subpopulations with approximately these proportions of
admixture, one with a high European percentage and the other with a lower but still substantial
European percentage [1].
Before studying these subpopulations further, the researchers investigated the potential
effect of population stratification on false positive linkage signals in the admixture mapping.
They sampled 200 subjects from each subpopulation and identified SNPs that had different
genotype distributions between the two. The most differentiated locus, the rs1426654 SNP, was
coincidentally in the gene neighborhood of an ancestry-informative marker associated with skin
pigmentation, SLC24A5 [1]. It was chosen and used in a simulated scenario. In this hypothetical
situation, the sample individuals with AA genotypes at that locus were chosen as cases and
individuals with the other two genotypes were used as controls. An association test based on a
logistic regression model with the additive genotype effect showed an inflation factor of 1.76 in
the false positive rate, but after adjustment for the top two eigenvectors, the inflation factor
dropped to 1.104. This simulated scenario represented an extreme case, but the strong population
stratification effect could still be adequately corrected by the eigenvector adjustment [1].
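An inflation factor of this kind is often computed as the genomic-control statistic: the median of the observed association test statistics divided by the median expected under the null hypothesis. The sketch below assumes that definition and assumes 1-degree-of-freedom chi-square tests; whether the study computed its inflation factor in exactly this way is not stated here. The function name `inflation_factor` and the test statistics are hypothetical.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Ratio of the observed median test statistic to the null median.

    A value near 1.0 suggests little inflation; values well above 1.0
    suggest confounding such as population stratification.
    """
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Hypothetical 1-df chi-square statistics from SNP association tests.
well_behaved = [0.05, 0.20, 0.4549364, 1.10, 3.20]
stratified = [2 * x for x in well_behaved]  # every statistic doubled

print(inflation_factor(well_behaved))
print(inflation_factor(stratified))
```

Because the statistic is a ratio of medians, a handful of genuinely associated SNPs barely moves it, while a systematic shift across all SNPs (the signature of stratification) raises it directly. An eigenvector adjustment that works as intended should bring the value back towards 1.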
The effects of population stratification on association studies and the importance of
control selection were explored in a previous study [7], and similar confounding effects were
evident in the inflation factors. Data from the Cancer Genetic Markers of Susceptibility
(CGEMS) project were used to determine the impact of population stratification on genome-wide
association studies (GWASs) of
breast cancer and prostate cancer in European Americans using various control selection
strategies. These researchers determined that, despite the ancestral differences between controls
and case subjects in a study with suboptimal control selection, the error can be reduced to an
acceptable level with corrections for the inflation factor [7]. These are the corrections that were
employed in the Costa Rican study to control the population stratification error in admixture
frequencies.
The researchers then performed a haplotype analysis to compare the average LD block
size for each chromosome between the two subpopulations. Overall, the high-European
admixture population had average block sizes that were 20% larger than those of the low-European
population [1]. Because a greater proportion of admixture leads to a greater amount of LD, these
data supported the hypothesis that the Guanacaste population as a whole developed through
a two-stage progression of admixture, in which the low-European group was formed first and
then functioned as one of the ancestral populations for the high-European group to interbreed
further with the European immigrants [1].
Then the LD magnitude and pattern in the CR samples were compared with those of the
four reference population samples based on the original set of 9,148 SNPs. The TagZilla
program was used to find the number of tagging SNPs required to cover the entire set of 9,148
SNPs with respect to different LD threshold values (r² = 0.3, 0.5, 0.8). The fewer tag SNPs a
population requires, the greater the magnitude of the LD in that population [1]. Among the
continental reference populations, the Africans required the greatest number of tag SNPs and the
Native Americans required the fewest to encompass the same genomic area. In the CR population,
fewer tag SNPs were necessary than in the Africans but more were needed than in the Native
American, European, and Asian populations. Because of the multiethnic tagging strategy used in
targeting these continental populations originally, however, the magnitude of LD detected for the
reference samples may have been underestimated [1].
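The link between the r² threshold and the number of tag SNPs can be illustrated with a simple greedy selection. The sketch below is not TagZilla's actual algorithm; it is a minimal pairwise-tagging heuristic that repeatedly picks the untagged SNP covering the most remaining SNPs at or above the threshold. The function name `greedy_tag_snps` and the 4-SNP r² matrix are hypothetical.

```python
def greedy_tag_snps(r2, threshold):
    """Pick tag SNPs so every SNP has r^2 >= threshold with some tag.

    `r2` is a symmetric matrix (list of lists) of pairwise r^2 values.
    Simplification of this sketch: candidates are drawn only from SNPs
    that are still untagged, and ties go to the lowest index.
    """
    untagged = set(range(len(r2)))
    tags = []
    while untagged:
        def coverage(i):
            # SNPs still untagged that SNP i would capture, including itself.
            return {j for j in untagged if j == i or r2[i][j] >= threshold}

        best = max(sorted(untagged), key=lambda i: len(coverage(i)))
        tags.append(best)
        untagged -= coverage(best)
    return tags


# Hypothetical pairwise r^2 values for four neighbouring SNPs.
r2_matrix = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.6, 0.1],
    [0.1, 0.6, 1.0, 0.4],
    [0.1, 0.1, 0.4, 1.0],
]

for threshold in (0.3, 0.5, 0.8):
    tags = greedy_tag_snps(r2_matrix, threshold)
    print(threshold, "->", len(tags), "tag SNPs:", tags)
```

A stricter threshold lets each tag stand in for fewer neighbours, so more tags are needed. Likewise, at a fixed threshold, a population with stronger LD (larger r² values between neighbours) needs fewer tags, which is the reasoning behind comparing tag SNP counts across populations.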
Overall, the results of this study on the population of Costa Rica indicate that the
Guanacaste region is strongly admixed. The gene pool for this population is approximately
42.5% European, 38.3% Native American, and 15.2% African, with a nearly negligible 4%
Asian influence. The variability in the admixture coefficients suggests that the integration of
African genetic components into the Costa Rican population occurred early and for a brief
period, whereas the integration of Europeans occurred steadily over a longer period of time. The
admixture coefficients differ markedly among three geographic regions in
Guanacaste, indicating a finer substructure to the population as a whole. The region can be split
into two main subpopulations, one with a higher European admixture proportion (60%) and
another with a lower proportion (40% European). The average linkage disequilibrium block size
for the high-European subpopulation is 20% larger than that for the low-European
subpopulation, suggesting that the low-European group was established first and then functioned
as an additional ancestral population for the formation of the high-European group [1].
Works Cited
[1] Wang Z, et al. Genetic admixture and population substructure in Guanacaste Costa
Rica. PLoS One. 2010;5(10):e13336.
[2] "QuickSNP FAQs." Intute - Home. Johns Hopkins Medicine. Web. 04 Nov. 2010.
<http://bioinformoodics.jhmi.edu/QuickSNP/FAQ.htm>.
[3] Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus
genotype data. Genetics. 2000;155:945–959.
[4] Falush D, Stephens M, Pritchard JK. Inference of population structure using multilocus
genotype data: linked loci and correlated allele frequencies. Genetics. 2003;164:1567–1587.
[5] McVean G. A genealogical interpretation of principal components analysis. PLoS
Genet. 2009;5(10):e1000686.
[6] Pfaff CL, Parra EJ, Bonilla C, Hiester K, McKeigue PM, et al. Population structure in
admixed populations: effect of admixture dynamics on the pattern of linkage disequilibrium. Am
J Hum Genet. 2001;68:198–207.
[7] Yu K, Wang Z, Li Q, Wacholder S, Hunter DJ, et al. Population substructure and control
selection in genome-wide association studies. PLoS One. 2008;3(7):e2551.