Supplementary Information

Genotype Data and Quality Control

The genotype data for the individuals used in the analysis came from three sources: the Human Genome Diversity Panel (HGDP) Illumina 650k data [1], HapMap Phase 3 (draft release 2), and a case-control study on a population from Mozambique genotyped with the Affymetrix Human Immune and Inflammation 9K SNP array (manuscript in preparation). For the HGDP, we used only data from individuals in the H952 standardized subset [2]. For both HGDP and HapMap data, quality control (QC) consisted of removing SNPs with a call rate below 0.95, as well as SNPs with a minor allele frequency (MAF) below 0.01 across the entire panel. Stringent quality control was also applied to the case-control study data from Mozambique, details of which are described elsewhere (manuscript in preparation).

For the main analysis, we used PLINK (version 1.07) [3] to generate a merged dataset containing all autosomal SNPs with genotype data in all three source datasets (2,841 SNPs). To test whether this number of SNPs is sufficient for our analysis, we first generated another merged dataset of all autosomal SNPs with genotype data in both HGDP and HapMap (460,147 SNPs). From this "full" dataset, we then drew one random subsample each of 100,000, 10,000, 2,841, and 1,000 SNPs.

Principal Component Analysis

We carried out principal component analysis (PCA) on the SNP genotype data using EIGENSOFT (version 2.0) [4]. Analysis was carried out on autosomal SNPs only, using default parameters but with the linkage disequilibrium (LD) correction option enabled. To test how the number of SNPs influences the PCA results, we analyzed and plotted each of the test datasets described above. To further quantify the results, we used R to calculate the convex hull polygon and its area for each population in biplots of the first three principal components. Results are shown in Supplementary Figures 1 and 2.
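The convex hull area used above as a measure of within-population spread can be illustrated with a short sketch. The original calculation was carried out in R; the Python below is only an assumed, minimal re-implementation (monotone-chain hull followed by the shoelace formula). The function names and the (population, x, y) input layout are hypothetical and are not taken from the original analysis scripts.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2.0


def hull_area_by_population(samples):
    """samples: iterable of (population, pc_x, pc_y) tuples (assumed layout)."""
    groups = defaultdict(list)
    for pop, x, y in samples:
        groups[pop].append((x, y))
    return {pop: polygon_area(convex_hull(pts)) for pop, pts in groups.items()}
```

Called once per pair of principal components and per subsampled dataset, `hull_area_by_population` would yield one area per population. A larger area for the same population at a lower SNP count would correspond to the increased spread described in the text.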
As can be seen in Supplementary Figure 1, the mean PC coordinates of the populations remain very similar across all datasets, indicating that even with as few as 1,000 SNPs it is possible to obtain a general picture of the structure in the sample. As expected, interindividual variation increases as the number of SNPs is reduced, which is reflected in the larger polygon area covering the samples of a given population (Supplementary Figure 2). Notably, this effect only becomes pronounced when fewer than 10,000 SNPs are considered. Since the general picture of population structure is well recovered with the 2,841 SNPs in our main dataset, we concluded that this number was sufficient for our analysis.

PCA Geography Correlation

Examining the plot of PC1 against PC3, we found that the positioning of the populations resembled their geographic distribution within Sub-Saharan Africa. To produce the plot superimposing the PCA results on the map of Africa (Figure 1), we followed an approach similar to that of Novembre et al. [5]. Namely, we found the rotation angle θ that maximizes the sum of the correlation coefficients of PC1 versus longitude (long) and PC3 versus latitude (lat), for all individuals except the African Americans (ASW). Map coordinates for each individual were assigned as the sampling location of the respective population. The final correlation coefficients obtained after rotation were 0.80 (PC1 ~ long) and 0.83 (PC3 ~ lat). For plotting purposes, the rotated PC coordinates were transformed into map coordinates (longitude/latitude) by scaling and shifting. Both scale and shift parameters were chosen to minimize the sum of squared distances between the samples' rotated PC coordinates and their respective map coordinates. All calculations and plotting were carried out using R.

STRUCTURE Analysis

We used STRUCTURE (version 2.2) [6, 7] to further investigate the genetic structure of the merged dataset.
Analysis was performed using default parameters, with 5 repetitions for each number of clusters K from 2 to 8. Results of the runs for each value of K were combined using CLUMPP (version 1.1.1) [8]. For K = 2 to K = 4, we used the "full search" option, whereas for K > 4 the "greedy" option was used for computational feasibility. Plots of the results were generated using distruct (version 1.1) [9].

Supplementary Tables

Supplementary Table 1. Study populations, source dataset and coordinates of origin.

populationID    populationLabel  region  dataset  latitude  longitude
ASW             AfricanAmerican  SSAFR   HapMap3      0.00      24.00
Bantu           Bantu            SSAFR   HGDP       -25.80      25.00
BiakaPygmies    BiakaPygmies     SSAFR   HGDP         4.00      17.00
LWK             Luhya            SSAFR   HapMap3      0.60      34.78
MKK             Maasai           SSAFR   HapMap3     -0.02      37.91
Mandenka        Mandenka         SSAFR   HGDP        12.00     -12.00
MbutiPygmies    MbutiPygmies     SSAFR   HGDP         1.00      29.00
Mozambique      Mozambique       SSAFR   Malaria    -25.40      32.80
San             San              SSAFR   HGDP       -21.00      20.00
YRI             Yoruba_HapMap    SSAFR   HapMap3      7.38       3.90
Yoruba          Yoruba_HGDP      SSAFR   HGDP         8.00       5.00
Bedouin         Bedouin          MENA    HGDP        31.00      35.00
Druze           Druze            MENA    HGDP        32.00      35.00
Mozabite        Mozabite         MENA    HGDP        32.00       3.00
Palestinian     Palestinian      MENA    HGDP        32.50      35.50
Adygei          Adygei           EUR     HGDP        44.00      39.00
Basque          Basque           EUR     HGDP        43.00       0.00
CEU             EuropeCEU        EUR     HapMap3     48.86       2.35
French          French           EUR     HGDP        46.00       2.00
NorthItaly      NorthItaly       EUR     HGDP        43.00      10.00
Orcadian        Orcadian         EUR     HGDP        59.00      -3.00
Russian         Russian          EUR     HGDP        61.00      40.00
Sardinian       Sardinian        EUR     HGDP        40.00       9.00
TSI             Tuscan           EUR     HapMap3     43.77      11.26
Balochi         Balochi          CSASIA  HGDP        30.00      66.00
Brahui          Brahui           CSASIA  HGDP        31.00      67.00
Burusho         Burusho          CSASIA  HGDP        37.00      75.00
GIH             Gujarati         CSASIA  HapMap3     22.26      71.19
Hazara          Hazara           CSASIA  HGDP        33.00      70.00
Kalash          Kalash           CSASIA  HGDP        35.00      71.00
Makrani         Makrani          CSASIA  HGDP        26.00      64.00
NorthWestChina  NorthWestChina   CSASIA  HGDP        40.00      91.00
Pathan          Pathan           CSASIA  HGDP        32.00      69.00
Sindhi          Sindhi           CSASIA  HGDP        25.50      69.00
CHB             ChineseBeijing   EASIA   HapMap3     39.91     116.40
CHD             ChineseDenver    EASIA   HapMap3     35.86     104.20
Cambodian       Cambodian        EASIA   HGDP        12.00     105.00
Han             Han              EASIA   HGDP        32.50     114.00
JPT             JapaneseTokyo    EASIA   HapMap3     35.69     139.69
Japanese        Japanese         EASIA   HGDP        38.00     138.00
NorthEastChina  NorthEastChina   EASIA   HGDP        50.00     126.50
SouthChina      SouthChina       EASIA   HGDP        25.00     109.50
Yakut           Yakut            EASIA   HGDP        63.00     129.50
NANMelanesian   NANMelanesian    OCE     HGDP        -6.00     155.00
Papuan          Papuan           OCE     HGDP        -4.00     143.00
Colombian       Colombian        AME     HGDP         3.00     -68.00
Karitiana       Karitiana        AME     HGDP       -10.00     -63.00
MEX             Mexican          AME     HapMap3     23.63    -102.55
Maya            Maya             AME     HGDP        19.00     -91.00
Pima            Pima             AME     HGDP        29.00    -108.00
Surui           Surui            AME     HGDP       -11.00     -62.00

Region codes: SSAFR: Sub-Saharan Africa; MENA: Middle East and North Africa; EUR: Europe; CSASIA: Central and South Asia; EASIA: East Asia; OCE: Oceania; AME: Americas

Figure Legends

Supplementary Figure 1
Convex hull polygon areas for biplots of the first three principal components, for the full dataset as well as all random subsamples of the combined HGDP and HapMap 3 data. For clarity, only selected populations are plotted. The position of each population label indicates the mean coordinates of the samples of the respective population. As can be seen, mean PCs remain similar for all subsets, whereas the polygon area increases, particularly for populations that are well differentiated by the respective PC (e.g. YRI in PC1).

Supplementary Figure 2
Barplot showing convex hull polygon area for all populations in PC1-PC2 space, for each subsampled dataset.

References

1. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science 319:1100-1104
2. Rosenberg NA (2006) Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet 70:841-847
3. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, Sham PC (2007) PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet 81:559-575
4. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis. PLoS Genet 2:e190
5. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD (2008) Genes mirror geography within Europe. Nature 456:98-101
6. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics 164:1567-1587
7. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155:945-959
8. Jakobsson M, Rosenberg NA (2007) CLUMPP: a cluster matching and permutation program for dealing with label switching and multimodality in analysis of population structure. Bioinformatics 23:1801-1806
9. Rosenberg NA (2004) DISTRUCT: a program for the graphical display of population structure. Molecular Ecology Notes 4:137-138

Supplementary Figure 1

Supplementary Figure 2