Supplementary material 3: DAPC

Supplementary material 2: Assigning to admixed original populations with DAPC and
Bayesian clustering analysis
Classical genetic assignment methods are based on the multilocus genotype of an
individual and the expected probability of that genotype occurring in each of the potential
sources. These “model-based” methods mostly rely on restrictive explicit assumptions
(populations at Hardy-Weinberg equilibrium and linkage equilibrium) and were essentially
developed for diploid (and haploid) genotype data (Manel et al., 2005). Polyploid
microsatellite datasets may be scored as binary data (presence-absence), but it remains unclear
how relevant these model-based methods are for such data.
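As a minimal sketch of the principle behind these model-based methods (not the implementation of any particular software), the example below computes the log-probability of a diploid multilocus genotype in each candidate source under Hardy-Weinberg and linkage equilibrium, and assigns the individual to the source with the highest value. The population names, loci and allele frequencies are hypothetical.

```python
import math

def genotype_log_likelihood(genotype, allele_freqs):
    """Log-probability of a diploid multilocus genotype in one population,
    assuming Hardy-Weinberg equilibrium and linkage equilibrium."""
    log_lik = 0.0
    for locus, (a, b) in genotype.items():
        p = allele_freqs[locus][a]
        q = allele_freqs[locus][b]
        # HWE expectation: p^2 for a homozygote, 2pq for a heterozygote.
        prob = p * p if a == b else 2.0 * p * q
        # Under linkage equilibrium, loci are independent, so logs add up.
        # Note: a frequency of zero would need special handling (log of 0).
        log_lik += math.log(prob)
    return log_lik

def assign(genotype, populations):
    """Return the most likely source and the log-likelihood in every source."""
    scores = {name: genotype_log_likelihood(genotype, freqs)
              for name, freqs in populations.items()}
    return max(scores, key=scores.get), scores

# Hypothetical allele frequencies in two candidate sources.
populations = {
    "north": {"loc1": {"A": 0.8, "B": 0.2}, "loc2": {"C": 0.6, "D": 0.4}},
    "south": {"loc1": {"A": 0.3, "B": 0.7}, "loc2": {"C": 0.1, "D": 0.9}},
}
individual = {"loc1": ("A", "A"), "loc2": ("C", "D")}

best, scores = assign(individual, populations)
print(best, scores)
```

The heterozygote term 2pq presupposes that allele copy number is known, which is exactly what is lost when polyploid genotypes are scored as presence-absence data.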
The software STRUCTURE (Pritchard et al., 2000), a Bayesian model-based clustering
algorithm, has recently been modified to handle polyploid data and allele copy number
ambiguity, under the assumption of full autopolyploid inheritance (Falush et al., 2007).
Commonly used to identify genetic clusters in a dataset, STRUCTURE may also be used to
assign additional individuals of unknown origin to their source populations. When source
populations are pre-specified, the algorithm estimates ancestry for the additional individuals
while updating allele frequencies using only individuals from the source populations.
The Discriminant Analysis of Principal Components (DAPC), a non-model-based method
recently developed and implemented in the adegenet R package (Jombart, 2008), provides an
efficient description of genetic clusters using a few synthetic variables (called the
discriminant functions). Contrary to traditional methods such as PCA or PCoA, which
describe the entire genetic variation, DAPC seeks linear combinations of the original
variables (alleles) that maximize differences between groups while minimizing variation
within clusters. Based on the retained discriminant functions, the analysis derives, for each
individual, probabilities of membership in each of the different groups. These probabilities
can be interpreted as the “genetic proximity” of individuals to the different clusters. It is
possible to construct the linear model and obtain the synthetic variables on a given dataset
(the source populations), then add supplementary individuals (that were not used in
constructing the model) and derive for each one a membership probability for each of the
original source populations. These probabilities might provide an “assignment measure” of
individuals to predefined groups, comparable with the ancestry values derived by the
STRUCTURE analysis.
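The way a membership probability can be derived for a supplementary individual may be illustrated with a minimal sketch on a single discriminant function. It assumes the standard linear discriminant formulation (Gaussian groups with a pooled variance and priors proportional to group size). Whether this matches the exact computation in adegenet is an assumption, and the discriminant scores and group names are hypothetical.

```python
import math

def pooled_variance(groups):
    """Within-group variance pooled over all source groups."""
    n_total = sum(len(values) for values in groups.values())
    ss = 0.0
    for values in groups.values():
        m = sum(values) / len(values)
        ss += sum((x - m) ** 2 for x in values)
    return ss / (n_total - len(groups))

def membership_probabilities(score, groups):
    """Posterior probability of each source group for one supplementary
    individual, given its score on the discriminant function."""
    var = pooled_variance(groups)
    n_total = sum(len(values) for values in groups.values())
    weights = {}
    for name, values in groups.items():
        group_mean = sum(values) / len(values)
        prior = len(values) / n_total  # assumption: prior proportional to group size
        weights[name] = prior * math.exp(-((score - group_mean) ** 2) / (2.0 * var))
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# Hypothetical discriminant scores of the source individuals, by cluster.
source_scores = {
    "K1": [-2.1, -1.8, -2.4, -1.9],
    "K2": [1.7, 2.2, 2.0, 1.9],
}

# The model is built on the source individuals only; the supplementary
# individual is then projected and scored against the fixed groups.
probs = membership_probabilities(1.5, source_scores)
print(probs)
```

Because the supplementary individual plays no part in estimating the group means or the pooled variance, its membership probabilities depend only on the structure of the source dataset.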
In this study, both methods were used to investigate the potential source(s) of
New Guinean landraces in tropical America. We therefore first ran these methods on the
tropical America dataset, excluding New Guinean samples, which were then added as
supplementary individuals. However, both methods require groups to be defined beforehand.
K-means clustering and the DAPC method: The adegenet package allows running
the K-means clustering algorithm sequentially for increasing values of K and comparing the
resulting clustering solutions using the Bayesian Information Criterion (BIC), after
transforming the data with a PCA (notably to reduce the number of variables and speed up
the clustering algorithm), in order to identify an optimal number of genetic clusters to
describe the data. We ran the K-means clustering algorithm for K = 2 to K = 10 on the
tropical America dataset. Based on this analysis, three genetic clusters were considered
optimal to describe the data.
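The selection of K by the lowest BIC can be sketched as follows. This is an illustration in plain Python on hypothetical two-dimensional data, not the adegenet implementation. The BIC-style score used here, n·ln(WSS/n) + K·ln(n) with WSS the within-cluster sum of squares, and the deterministic farthest-first initialization are both assumptions. The range of K is kept small because the toy dataset has only 12 points.

```python
import math

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_first_centers(points, k):
    """Deterministic initialization: start from the first point, then
    repeatedly add the point farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers

def kmeans_wss(points, k, n_iter=100):
    """Run K-means (Lloyd iterations) and return the within-cluster sum of squares."""
    centers = farthest_first_centers(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(axis) / len(c) for axis in zip(*c)) if c else centers[j]
            for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def bic(n, wss, k):
    # Assumed BIC-style score: goodness of fit plus a penalty on K.
    return n * math.log(wss / n) + k * math.log(n)

# Hypothetical data: three well-separated groups of four points each
# (in practice, the coordinates would be the retained principal components).
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (5, 10), (6, 10), (5, 11), (6, 11)]

wss_by_k = {k: kmeans_wss(points, k) for k in range(1, 6)}
bic_by_k = {k: bic(len(points), wss, k) for k, wss in wss_by_k.items()}
best_k = min(bic_by_k, key=bic_by_k.get)
print(best_k, bic_by_k)
```

On real data the BIC curve may keep decreasing slowly beyond the main structure, so the choice of K usually relies on inspecting the curve rather than on the strict minimum alone.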
Figure 1: Inference of the number of clusters in the DAPC performed on the tropical America
dataset. A K value of 3 (the lowest BIC value) represents the best summary of the data.
The three genetic clusters are geographically restricted: two clusters (K2 and K3)
mostly group individuals from the Northern region (90.6%), while cluster K1 mostly groups
those from the Southern region (81.4%). The grouping obtained for K = 2 also provides an
accurate description of the tropical America dataset: cluster K1 mainly contained samples
from the Southern genepool (79%), and cluster K2 mostly those from the Northern genepool
(91%).
DAPC relies on data transformation using principal component analysis (PCA) as a
prior step. Retaining too many PCs can lead to overfitting of the discriminant functions, which
could then model any structure and discriminate virtually any set of clusters. The adegenet
package provides an optimization procedure to evaluate the optimal number of PCs to retain.
The procedure is based on the calculation of the α-score, which measures the difference
between the proportion of successful reassignment of the analysis (observed discrimination)
and values obtained using random groups (random discrimination). The number of retained
PCs can be chosen so as to optimize the α-score.
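The definition of the α-score given above can be illustrated with a simplified sketch. Here reassignment is done by nearest group mean on a single synthetic variable, and the "random groups" are obtained by permuting the group labels; both choices, as well as the scores and the number of permutations, are assumptions made for illustration and do not reproduce the adegenet procedure.

```python
import random
from statistics import mean

def reassignment_rate(scores, labels):
    """Proportion of individuals whose nearest group mean is their own group."""
    groups = sorted(set(labels))
    centers = {g: mean(s for s, l in zip(scores, labels) if l == g) for g in groups}
    hits = sum(
        min(groups, key=lambda g: abs(s - centers[g])) == l
        for s, l in zip(scores, labels)
    )
    return hits / len(scores)

def alpha_score(scores, labels, n_random=200, seed=1):
    """Observed discrimination minus mean discrimination for random groups."""
    rng = random.Random(seed)
    observed = reassignment_rate(scores, labels)
    random_rates = []
    for _ in range(n_random):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # random groups of the same sizes
        random_rates.append(reassignment_rate(scores, shuffled))
    return observed - mean(random_rates)

# Hypothetical scores for eight individuals in two clusters.
scores = [-2.1, -1.8, -2.4, -1.9, 1.7, 2.2, 2.0, 1.9]
labels = ["K1"] * 4 + ["K2"] * 4

alpha = alpha_score(scores, labels)
print(alpha)
```

A model that discriminates random groups almost as well as the true groups obtains a score close to zero, which is the signature of overfitting that the procedure is meant to detect.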
The α-score optimization graph (Figure 2) shows that only a few PCs need to be
retained for the assignment analysis. We tested DAPC for both groupings (K = 2 and K = 3),
retaining between 4 and 10 PCs for the prior data transformation. Assignment results were
broadly congruent, and in the article we present the results obtained with 5 PCs retained
(29.3% of the total variance).
Figure 2: optimization α-score graph
Bayesian clustering method: We also used a Bayesian clustering method on the
tropical America dataset to predefine groups in tropical America that may be used for the
assignment of New Guinea accessions. We ran STRUCTURE for K = 2 to K = 6, using the
admixture model, correlated allele frequencies, 50 000 burn-in iterations, 150 000 Markov
chain Monte Carlo (MCMC) steps, and the data coding that handles genotype ambiguity for
co-dominant markers in polyploids (Falush et al., 2007). We then plotted the ΔK criterion of
Evanno et al. (2005) to identify the optimal number of clusters to describe the data.
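For reference, the ΔK statistic can be sketched as the absolute second-order difference of the mean log-probability of the data across successive values of K, divided by the standard deviation of the log-probability among replicate runs at K. In the sketch below, the use of three replicate runs per K and all numerical values are hypothetical, and taking the second difference on the means of the runs is one common convention rather than a detail given in this supplement.

```python
from statistics import mean, stdev

def delta_k(log_probs):
    """log_probs maps each K to a list of ln P(D|K) values, one per replicate run.
    Returns Delta K for every K that has both K-1 and K+1 available."""
    result = {}
    for k in sorted(log_probs):
        if k - 1 not in log_probs or k + 1 not in log_probs:
            continue  # Delta K is undefined at the ends of the tested range
        second_diff = abs(
            mean(log_probs[k + 1]) - 2.0 * mean(log_probs[k]) + mean(log_probs[k - 1])
        )
        result[k] = second_diff / stdev(log_probs[k])
    return result

# Hypothetical log-probabilities for three replicate runs per K.
log_probs = {
    1: [-16500.0, -16510.0, -16490.0],
    2: [-15000.0, -15010.0, -14990.0],
    3: [-14600.0, -14650.0, -14550.0],
    4: [-14400.0, -14500.0, -14300.0],
}

dk = delta_k(log_probs)
print(dk)
```

Because ΔK is undefined for the smallest and largest K tested, it can never select K = 1, and a flat or multimodal ΔK curve leaves the number of clusters ambiguous.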
Figure 3: a) Variation of the posterior log-probability of the data as a function of the
number of clusters K. b) Variation of ΔK values
Following this method, the optimal number of clusters to describe the data was unclear. We
therefore retained the groupings obtained for K = 2 and K = 3 (as determined by the BIC
criterion) and compared their composition with those obtained with the DAPC method.
Results:
Figure 4: Tropical America landraces membership probabilities (DAPC analysis) or ancestry
values (Bayesian analysis) to K1 and K2 (DAPC K2 or Bayesian K2) or K1, K2 and K3
(DAPC K3 or Bayesian K3) clusters. Each individual is represented as a vertical bar, with
colours corresponding to probabilities of membership in K1 (black), K2 (dark gray), and K3
(light gray).
[Figure panels: DAPC K2, Bayesian K2, DAPC K3, Bayesian K3. Region labels: Northern, Southern.]
Both methods gave broadly congruent results: Neotropical sweet potatoes are
characterized by two distinct, geographically circumscribed genetic groups: one genetic group
corresponds to most of the accessions from the Southern region and the other to those
from the Northern region. For K = 3, a sub-structure is revealed in which Northern region
accessions are split into two genetic groups.
However, the phylogeographic pattern is not so clear-cut: in the Southern
region, we detected some accessions that were clearly attributed to the nuclear cluster(s)
characteristic of the Northern region (K2, or K2 and K3) or that had a mixed genetic
constitution. Likewise, in the Northern region, we identified several accessions attributed to
cluster K1 or with a mixed composition. As already discussed in a previous paper
(Roullier et al., 2011), this situation suggests that Neotropical sweet potatoes are
characterized by two originally differentiated genepools (probably related to independent
domestications in each region) and that clones were secondarily exchanged between the two
regions and then recombined with local material. This scenario is also supported by
chloroplast data (Roullier et al., 2011). This admixture is clearly highlighted by the
Bayesian clustering results, in which most of the Neotropical accessions exhibited a mixed
genetic constitution. With the DAPC method, individual assignment was more “contrasted”,
with only a few individuals exhibiting a mixed constitution.
Thus, it was difficult to use Bayesian clustering results on the tropical America dataset
to predefine groups for assignment of the New Guinean accessions. We preferred to use the
genetic grouping (and not the a priori regional grouping) inferred by the K-means clustering
for K = 2, which provides an accurate and simple summary of the Neotropical dataset, to
perform assignment of New Guinean accessions.
Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using
the software STRUCTURE: a simulation study. Mol Ecol 14:2611-2620.
Falush D, Stephens M, Pritchard JK (2007) Inference of population structure using multilocus
genotype data: dominant markers and null alleles. Mol Ecol Notes 7:574-578.
Jombart T (2008) Adegenet: a R package for the multivariate analysis of genetic markers.
Bioinformatics 24:1403-1405.
Manel S, Gaggiotti OE, Waples RS (2005) Assignment methods: matching biological
questions with appropriate techniques. Trends Ecol Evol 20:136-142.
Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using
multilocus genotype data. Genetics 155:945-959.
Roullier C, Rossel G, Tay D, McKey D, Lebot V (2011) Combining chloroplast and nuclear
microsatellites to investigate origin and dispersal of New World sweet potato
landraces. Mol Ecol 20:3963-3977.