Supplementary Materials and methods (doc 56K)

SUPPLEMENTARY MATERIALS AND METHODS

Population structure analyses
The pruned dataset of modern SNPs was used to calculate pairwise population Fst genetic
distances with the Arlequin package v. 3.5.1.2, and a graphical representation of the resulting
distance matrix was drawn by means of metric multidimensional scaling (MDS) (Cox and
Cox, 2001). Apportionment of genetic variance among the observed groups of populations
was then investigated with a locus-by-locus Analysis of Molecular Variance (AMOVA)
(Excoffier et al., 1992).
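
As an illustration of the quantity underlying the pairwise distances, the sketch below computes a
classical Wright-style Fst between two populations from per-SNP allele frequencies, as
(Ht - Hs) / Ht summed over loci. It is not the estimator implemented in Arlequin; the function
names, the equal weighting of the two populations and the ratio-of-sums averaging across loci
are assumptions made only for this example.

```python
def expected_het(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def pairwise_fst(freqs_a, freqs_b):
    """Wright-style Fst between two populations (illustrative only).

    freqs_a, freqs_b: per-SNP frequencies of the same allele in each population.
    Populations are weighted equally; loci are combined as a ratio of sums.
    """
    numerator = 0.0
    denominator = 0.0
    for pa, pb in zip(freqs_a, freqs_b):
        hs = (expected_het(pa) + expected_het(pb)) / 2.0  # mean within-population
        ht = expected_het((pa + pb) / 2.0)                # pooled population
        numerator += ht - hs
        denominator += ht
    # Loci that are monomorphic in the pooled sample carry no information.
    return numerator / denominator if denominator > 0 else 0.0
```

Identical frequencies in the two populations give a value of zero, whereas fixation of alternative
alleles gives a value of one.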

Admixture analysis
The model-based maximum likelihood (ML) clustering algorithm implemented in the
Admixture software (Alexander et al., 2009) was applied to explore potential population
structure at the loci included in the pruned dataset. ML estimates of SNP allele frequencies
in a predefined number (K) of hypothetical ancestral populations, as well as the probabilistic
assignment of each individual to these clusters, were obtained. Population structure was
investigated from K=2 to K=5. The matrices of ancestry fractions obtained for each K value were
imported into R version 2.15.1 (http://www.r-project.org/) and used to generate stacked
barplots. A cross-validation (CV) procedure was applied to identify the K for which the model
has the best predictive accuracy.
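
The likelihood that this kind of model maximizes can be sketched as follows. Under the model
described by Alexander et al. (2009), the probability of observing a given allele at SNP j in
individual i is the sum over the K ancestral populations of the individual's ancestry fraction
times the population allele frequency, and the log-likelihood is accumulated over all genotypes
(up to an additive constant). The code below is a minimal illustration of that formula, not the
Admixture implementation; the function and argument names are hypothetical, and no optimization
or cross-validation step is included.

```python
import math


def admixture_loglik(genotypes, q, f):
    """Log-likelihood of an admixture model (illustrative sketch).

    genotypes[i][j]: number of copies (0, 1 or 2) of the counted allele
                     carried by individual i at SNP j.
    q[i][k]: ancestry fraction of individual i in ancestral population k
             (each row is assumed to sum to one).
    f[k][j]: frequency of the counted allele at SNP j in population k
             (assumed strictly between 0 and 1).
    """
    n_pops = len(f)
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            # Individual-specific allele frequency at this SNP.
            p = sum(q[i][k] * f[k][j] for k in range(n_pops))
            total += g * math.log(p) + (2 - g) * math.log(1.0 - p)
    return total
```

For a fixed K, the ancestry fractions q are the values displayed in the stacked barplots.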

Multivariate analyses
The pruned dataset of SNPs was also used for multivariate analyses. Principal
Components Analysis (PCA) was performed using the R adegenet package. Moreover, to
provide further support for the identified population groups, cluster membership
probabilities were evaluated for each individual by means of Discriminant Analysis of Principal
Components (DAPC) (Jombart et al., 2010). This procedure is particularly well suited
for depicting diversity patterns observable among pre-defined groups of observations. DAPC
was repeated with different randomized groups for different numbers of retained PCs. The
optimal number (20) was identified as the one maximizing the mean α-score (i.e. the value closest
to one), computed as the difference between observed and random discrimination. Retained PCs
were passed to a Linear Discriminant Analysis that constructed discriminant functions as
linear combinations of the original variables in order to show the largest between-group
variance and the smallest within-group variance. Given the low number of clusters identified
by the other population structure analyses, all discriminant functions were retained and used
to compute individuals’ membership probabilities.
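
The criterion behind a discriminant function can be illustrated on a single axis: for a set of
scores and a grouping of the individuals, the ratio of between-group to within-group sums of
squares is large when the groups are well separated along that axis. The sketch below computes
this ratio only; it does not search for the optimal linear combination and is not the adegenet
implementation. The function name is hypothetical.

```python
def between_within_ratio(scores, groups):
    """Ratio of between-group to within-group sums of squares on one axis.

    scores: one numeric score per individual.
    groups: one group label per individual, in the same order.
    """
    grand_mean = sum(scores) / len(scores)
    by_group = {}
    for score, label in zip(scores, groups):
        by_group.setdefault(label, []).append(score)

    between = 0.0
    within = 0.0
    for values in by_group.values():
        group_mean = sum(values) / len(values)
        between += len(values) * (group_mean - grand_mean) ** 2
        within += sum((v - group_mean) ** 2 for v in values)

    # Perfectly homogeneous groups have no within-group spread.
    return between / within if within > 0 else float("inf")
```

Comparing the ratio obtained with the true grouping against the ratio obtained after shuffling the
labels gives an intuition for the contrast between observed and random discrimination.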
DAPC can also handle “supplementary individuals”, i.e. observations that do
not take part in model construction but whose positions can be predicted using a model
fitted on a different dataset. Data from the archaic species were thus not included in DAPC,
but were transformed using the centering and scaling of the modern data and, according to the
same discriminant coefficients as for the contributing individuals, were subsequently
projected onto the obtained discriminant functions.
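
The projection of supplementary individuals can be sketched in two steps: the centering and
scaling parameters are estimated on the contributing (modern) data only, and the supplementary
rows are then standardized with those parameters and multiplied by the discriminant
coefficients. The code below is a simplified illustration that applies the coefficients directly
to the standardized variables; the function names, the n - 1 denominator for the standard
deviation and the layout of the coefficient matrix are assumptions, not details taken from the
original analysis.

```python
import math


def fit_center_scale(train):
    """Column means and standard deviations of the contributing data."""
    n_rows = len(train)
    n_vars = len(train[0])
    means = [sum(row[j] for row in train) / n_rows for j in range(n_vars)]
    sds = []
    for j in range(n_vars):
        variance = sum((row[j] - means[j]) ** 2 for row in train) / (n_rows - 1)
        sds.append(math.sqrt(variance))
    return means, sds


def project(rows, means, sds, coefficients):
    """Project rows onto discriminant functions.

    coefficients[j][d]: weight of variable j in discriminant function d
    (hypothetical layout). Constant variables (sd of zero) are set to 0.
    """
    n_functions = len(coefficients[0])
    projected = []
    for row in rows:
        z = [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, sds)]
        projected.append(
            [sum(zj * coef[d] for zj, coef in zip(z, coefficients))
             for d in range(n_functions)]
        )
    return projected
```

Because the means and standard deviations come from the modern data alone, the supplementary
individuals have no influence on the discriminant functions.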

Clustering analysis
Scores from the 40 most informative PCs obtained by PCA, accounting for about 85% of the
observed variation, were used to perform a cluster analysis via the Model Based Clustering
algorithm (Fraley and Raftery, 2002) implemented in the R mclust package. This algorithm
explores a set of ten different models for Expectation-Maximization (EM), each one being
characterized by a different parameterization of the covariance matrix, for different numbers of
clusters, and finally chooses the best one according to the highest Bayesian Information Criterion
(BIC). This procedure enabled the definition of the parameters of both the maximum-BIC model
and the corresponding classification (i.e. the affiliation of each individual to one of the
inferred clusters).
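
The selection step can be illustrated as follows. In the convention used by mclust, the BIC is
computed as twice the maximized log-likelihood minus the number of estimated parameters times
the logarithm of the number of observations, so that larger values indicate a better model. The
sketch below only ranks fits that have already been obtained; it does not run EM. The function
names and the tuple layout describing a fit are hypothetical.

```python
import math


def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2 logL - n_params * log(n)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def best_model(fits, n_obs):
    """Return the fit with the highest BIC.

    fits: iterable of (model_name, n_clusters, loglik, n_params) tuples,
          one per combination of covariance model and number of clusters.
    """
    return max(fits, key=lambda fit: bic(fit[2], fit[3], n_obs))
```

A model with a slightly higher likelihood can still be rejected if it requires many more
covariance parameters, which is how the criterion balances fit against complexity.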

REFERENCES
Alexander DH, Novembre J, Lange K (2009). Fast model-based estimation of ancestry in
unrelated individuals. Genome Res 19: 1655-1664.
Cox TF, Cox MAA (2001). Multidimensional Scaling. Chapman & Hall: London.
Excoffier L, Smouse PE, Quattro JM (1992). Analysis of molecular variance inferred from
metric distances among DNA haplotypes: application to human mitochondrial DNA
restriction data. Genetics 131: 479-491.
Fraley C, Raftery AE (2002). Model-based Clustering, Discriminant Analysis and Density
Estimation. J Amer Statist Assoc 97: 611-631.
Jombart T, Devillard S, Balloux F (2010). Discriminant analysis of principal components: a
new method for the analysis of genetically structured populations. BMC Genetics 11: 94.