SUPPLEMENTARY MATERIALS AND METHODS

Population structure analyses
The pruned dataset of modern SNPs was used to calculate pairwise population Fst genetic distances with the Arlequin package v. 3.5.1.2, and the resulting distance matrix was represented graphically by means of metric multidimensional scaling (MDS) (Cox and Cox, 2001). Apportionment of genetic variance among the observed groups of populations was then investigated with a locus-by-locus Analysis of Molecular Variance (AMOVA) (Excoffier et al., 1992).

Admixture analysis
The model-based maximum likelihood (ML) clustering algorithm implemented in the Admixture software (Alexander et al., 2009) was applied to explore potential population structure at the loci included in the pruned dataset. ML estimates of SNP allele frequencies in a predefined number (K) of hypothetical ancestral populations, as well as the probabilistic assignment of each individual to these clusters, were obtained. Population structure was investigated at K=2 to K=5. The matrices of ancestry fractions obtained for each K value were imported into R version 2.15.1 (http://www.r-project.org/) and used to generate stacked barplots. A cross-validation (CV) procedure was applied to identify the K for which the model has the best predictive accuracy.

Multivariate analyses
The pruned dataset of SNPs was also used for multivariate analyses. Principal Components Analysis (PCA) was performed using the R adegenet package. Moreover, to provide further support for the identified population groups, cluster membership probabilities for each individual were evaluated by means of Discriminant Analysis of Principal Components (DAPC) (Jombart et al., 2010). This procedure is particularly well suited for depicting diversity patterns observable among pre-defined groups of observations.
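As an illustration of the cross-validation step described under Admixture analysis, the choice of K can be sketched as follows. This is a minimal Python sketch, not the code used in the study: the function name best_k and the CV error values are hypothetical placeholders, and it assumes that the best predictive accuracy corresponds to the lowest CV error, with ties resolved in favor of the smaller K.

```python
def best_k(cv_errors):
    """Return the K whose cross-validation error is lowest.

    cv_errors maps each tested K (int) to its CV error (float).
    Ties are resolved in favor of the smallest K (an assumption of this sketch).
    """
    if not cv_errors:
        raise ValueError("no cross-validation errors supplied")
    # Iterating over K in ascending order makes min() keep the smallest K on ties.
    return min(sorted(cv_errors), key=lambda k: cv_errors[k])


# Hypothetical placeholder values for K=2..5, not results from the study.
example_cv = {2: 0.60, 3: 0.55, 4: 0.57, 5: 0.58}
print(best_k(example_cv))
```

In practice the CV error for each K would be read from the output of the clustering runs rather than typed in by hand.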
DAPC was repeated with different randomized groups for different numbers of retained PCs, whose optimal number (20) was identified as that optimizing the mean α-score (i.e. the closest to one), obtained as the difference between observed and random discriminations. The retained PCs were passed to a Linear Discriminant Analysis, which constructed discriminant functions as linear combinations of the original variables so as to show the largest between-group variance and the smallest within-group variance. Given the low number of clusters identified by the other population structure analyses, all discriminant functions were retained and used to compute individuals’ membership probabilities.
DAPC can also handle “supplementary individuals”, that is, observations that do not participate in model construction but that can be predicted using a model fitted on a different dataset. Data from the archaic species were thus not included in DAPC, but were transformed using the centering and scaling of the modern data and then projected onto the obtained discriminant functions using the same discriminant coefficients as for the contributing individuals.

Clustering analysis
Scores from the 40 most informative PCs obtained by PCA, accounting for about 85% of the observed variation, were used to perform a cluster analysis via the Model Based Clustering algorithm (Fraley and Raftery, 2002) implemented in the R mclust package. This algorithm explores a set of ten different models for Expectation-Maximization (EM), each characterized by a different parameterization of the covariance matrix, for different numbers of clusters, and finally chooses the best one according to the highest Bayesian Information Criterion (BIC). This procedure enabled the definition of parameters of both the maximum-BIC model and the corresponding classification (i.e.
the affiliation of each individual to one of the inferred clusters).

REFERENCES
Alexander DH, Novembre J, Lange K (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19: 1655-1664.
Cox TF, Cox MAA (2001). Multidimensional Scaling. Chapman & Hall: London.
Excoffier L, Smouse PE, Quattro JM (1992). Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics 131: 479-491.
Fraley C, Raftery AE (2002). Model-based Clustering, Discriminant Analysis and Density Estimation. J Amer Statist Assoc 97: 611-631.
Jombart T, Devillard S, Balloux F (2010). Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genetics 11: 94.