Supplementary Information
Genotype Data and Quality Control
The genotype data for the individuals used in the analysis came from three different
sources: the Human Genome Diversity Panel (HGDP) Illumina 650k data [1], HapMap
Phase 3 (draft release 2), and a case-control study on a population from
Mozambique genotyped with the Affymetrix Human Immune and Inflammation 9K
SNP array (manuscript in preparation). For the HGDP, we used only data from
individuals of the H952 standardized subset [2]. For both HGDP and HapMap data,
quality control (QC) consisted of removing SNPs with a call rate below 0.95, as well as
SNPs with a minor allele frequency (MAF) below 0.01 across the entire panel.
Stringent quality control was also applied to the case-control study data from
Mozambique, details of which are described elsewhere (manuscript in preparation).
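The two QC thresholds for the HGDP and HapMap data can be illustrated with a minimal Python sketch. It is not the pipeline used in the study. The genotype coding (allele counts 0, 1, 2, with None for a missing call) and the function names are assumptions made for illustration only.

```python
from typing import Optional, Sequence

# Assumed coding: count of one allele (0, 1, 2); None marks a missing call.
Genotype = Optional[int]


def call_rate(genotypes: Sequence[Genotype]) -> float:
    """Fraction of individuals with a non-missing genotype call."""
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def minor_allele_frequency(genotypes: Sequence[Genotype]) -> float:
    """Frequency of the rarer allele among the called genotypes."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def passes_qc(genotypes: Sequence[Genotype],
              min_call_rate: float = 0.95,
              min_maf: float = 0.01) -> bool:
    """Keep a SNP only if it reaches both thresholds given in the text."""
    return (call_rate(genotypes) >= min_call_rate
            and minor_allele_frequency(genotypes) >= min_maf)


if __name__ == "__main__":
    common = [0, 1, 2, 1] * 25          # fully called, MAF 0.5
    sparse = [None] * 10 + [1] * 90     # call rate 0.90
    rare = [0] * 199 + [1]              # MAF 0.0025
    print(passes_qc(common), passes_qc(sparse), passes_qc(rare))
```

Here the MAF is computed across all individuals passed in, which corresponds to applying the filter "across the entire panel" rather than per population.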
To conduct the main analysis, we generated a merged dataset containing all
autosomal SNPs with genotype data in all three source datasets (2,841 SNPs), using PLINK
(version 1.07) [3]. To test whether this number of SNPs is sufficient for our analysis, we
first generated another merged dataset of all autosomal SNPs with genotype data in
both HGDP and HapMap (460,147 SNPs). From this "full" dataset, we then generated one
random subsample each of 100,000, 10,000, 2,841, and 1,000 SNPs.
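The merging and subsampling logic can be sketched as follows. This is only an illustration of the idea and does not reproduce the PLINK commands used in the study. The SNP identifiers, the function names, and the fixed random seed are hypothetical.

```python
import random
from typing import Iterable, List


def shared_snps(*snp_id_sets: Iterable[str]) -> List[str]:
    """SNP identifiers present in every one of the given datasets."""
    return sorted(set.intersection(*(set(ids) for ids in snp_id_sets)))


def subsample_snps(snp_ids: List[str], n: int, seed: int = 0) -> List[str]:
    """Draw n SNPs without replacement; the seed makes the draw repeatable."""
    rng = random.Random(seed)
    return sorted(rng.sample(snp_ids, n))


if __name__ == "__main__":
    # Toy identifiers standing in for the real SNP lists.
    hgdp = [f"rs{i}" for i in range(0, 1000)]
    hapmap = [f"rs{i}" for i in range(500, 1500)]
    full = shared_snps(hgdp, hapmap)
    for size in (100, 10):
        print(size, len(subsample_snps(full, size)))
```

Drawing each subsample independently from the full set, as here, is one possible reading of "one random subsample each"; nested subsamples would be another.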
Principal Component Analysis
We carried out principal component analysis (PCA) on the SNP genotype data using
EIGENSOFT (version 2.0) [4]. Analysis was carried out on autosomal SNPs only, using
default parameters, but including the linkage disequilibrium (LD) correction option.
To test how the number of SNPs used influences the results of the PCA, we analyzed
and plotted each of the test datasets described above. To further quantify
the results, we calculated the convex hull polygon and its area for each population for
the first three principal components, using R. Results are shown in Supplementary
Figures 1 and 2. As can be seen in Supplementary Figure 1, the mean PC coordinates
for populations remain very similar across all of the datasets, indicating that even with as
few as 1,000 SNPs it is possible to obtain a general picture of the structure in the
sample. As expected, interindividual variation increases as the number of
SNPs decreases, as can be seen in the increase of the polygon area covering the samples of a
particular population (Supplementary Figure 2). Notably, this effect becomes
pronounced only when fewer than 10,000 SNPs are considered. Since the
general picture of population structure is well recovered with the 2,841 SNPs in our
main dataset, we concluded that it was sufficient for our analysis.
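The convex hull area used as a measure of within-population spread can be computed as in the sketch below. The study used R for this step. The Python version is an illustrative stand-in, using Andrew's monotone chain algorithm for the hull and the shoelace formula for the area. The function names are assumptions.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: Iterable[Point]) -> List[Point]:
    """Hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(tuple(p) for p in points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


def polygon_area(vertices: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    n = len(vertices)
    if n < 3:
        return 0.0
    twice_area = 0.0
    for i in range(n):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def hull_area(points: Iterable[Point]) -> float:
    """Area of the convex hull of one population's samples in a PC biplot."""
    return polygon_area(convex_hull(points))
```

Each population would be passed in as a list of (PC_a, PC_b) pairs for one biplot, and the areas compared across the subsampled datasets.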
PCA Geography Correlation
Examining the PCA results plot for PC1 and PC3, we found that the positioning of the
populations resembled their geographic distribution within Sub-Saharan Africa. To
produce the plot superimposing the PCA results on the map of Africa (Figure 1), we
followed an approach similar to that of Novembre et al. [5]. Namely, we found the rotation angle θ that
maximizes the sum of the correlation coefficients of PC1 versus longitude (long) and
PC3 versus latitude (lat), for all individuals except the African Americans (ASW). Map
coordinates for each individual were assigned as the sampling location of the
respective population. The final correlation coefficients obtained after rotation were
0.80 (PC1 ~ long) and 0.83 (PC3 ~ lat). For plotting purposes, the original PC
coordinates were transformed into map coordinates (longitude/latitude) by
scaling and shifting the rotated PC coordinates. Both scale and shift parameters were
calculated so as to minimize the sum of squared distances of the samples' rotated PC
coordinates to their respective map coordinates. All calculations and plotting were
carried out using R.
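The rotation and the scale/shift fit can be sketched in Python as below. The study used R. This sketch rests on several assumptions that the text does not specify: the angle is found by a simple grid search, the rotation is a standard counter-clockwise rotation of the (PC1, PC3) plane, and scale and shift are fitted separately for each axis by ordinary least squares. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rotate(pc_a: Sequence[float], pc_b: Sequence[float],
           theta: float) -> Tuple[List[float], List[float]]:
    """Rotate the (pc_a, pc_b) coordinates counter-clockwise by theta."""
    c, s = math.cos(theta), math.sin(theta)
    new_a = [c * a - s * b for a, b in zip(pc_a, pc_b)]
    new_b = [s * a + c * b for a, b in zip(pc_a, pc_b)]
    return new_a, new_b


def best_rotation(pc_a, pc_b, lon, lat, steps: int = 3600):
    """Grid search for the angle maximizing cor(a', lon) + cor(b', lat)."""
    best_theta, best_score = 0.0, -math.inf
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        rot_a, rot_b = rotate(pc_a, pc_b, theta)
        score = pearson(rot_a, lon) + pearson(rot_b, lat)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


def fit_scale_shift(values: Sequence[float],
                    targets: Sequence[float]) -> Tuple[float, float]:
    """Least-squares scale and shift so that scale * value + shift ~ target."""
    n = len(values)
    mv, mt = sum(values) / n, sum(targets) / n
    scale = (sum((v - mv) * (t - mt) for v, t in zip(values, targets))
             / sum((v - mv) ** 2 for v in values))
    return scale, mt - scale * mv
```

In this sketch, the rotated first axis would be mapped to longitude and the rotated second axis to latitude, each with its own scale and shift, before plotting on the map.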
STRUCTURE Analysis
We used STRUCTURE (version 2.2) [6, 7] to further investigate the genetic structure of
the merged dataset. Analysis was performed using default parameters, with 5
repetitions for each number of clusters K from 2 to 8. Results of the runs for
each value of K were combined using CLUMPP (version 1.1.1) [8]. For K = 2 to
K = 4, we used the "full search" option, whereas for K > 4 the "greedy" option
was used for computational feasibility. Plots of the results were generated using
distruct (version 1.1) [9].
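Combining replicate runs requires matching cluster labels across runs, because the same cluster can receive a different label in each repetition. The sketch below illustrates the idea in a deliberately simplified form and is not CLUMPP's algorithm or its similarity statistic. It aligns every run to the first one by trying all column permutations and minimizing the squared difference, then averages the aligned membership matrices. The function names are hypothetical.

```python
from itertools import permutations
from typing import List

Matrix = List[List[float]]  # rows: individuals, columns: clusters


def align_run(reference: Matrix, run: Matrix) -> Matrix:
    """Reorder the columns of run to match reference as closely as possible."""
    k = len(reference[0])

    def cost(perm) -> float:
        # Sum of squared differences after relabelling clusters with perm.
        return sum((ref_row[j] - row[perm[j]]) ** 2
                   for ref_row, row in zip(reference, run)
                   for j in range(k))

    best = min(permutations(range(k)), key=cost)
    return [[row[j] for j in best] for row in run]


def average_runs(runs: List[Matrix]) -> Matrix:
    """Align all runs to the first run, then average the memberships."""
    reference = runs[0]
    aligned = [reference] + [align_run(reference, r) for r in runs[1:]]
    n, k = len(reference), len(reference[0])
    return [[sum(a[i][j] for a in aligned) / len(aligned) for j in range(k)]
            for i in range(n)]


if __name__ == "__main__":
    run_a = [[0.9, 0.1], [0.2, 0.8]]
    run_b = [[0.1, 0.9], [0.8, 0.2]]  # same solution, labels swapped
    print(average_runs([run_a, run_b]))
```

The number of permutations to test grows factorially with K, which gives an intuition for why an exhaustive search is practical only for small K.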
Supplementary Tables
Supplementary Table 1. Study populations, source dataset and coordinates of
origin.
populationID    populationLabel   region*  dataset   latitude  longitude
ASW             AfricanAmerican   SSAFR    HapMap3       0.00      24.00
Bantu           Bantu             SSAFR    HGDP        -25.80      25.00
BiakaPygmies    BiakaPygmies      SSAFR    HGDP          4.00      17.00
LWK             Luhya             SSAFR    HapMap3       0.60      34.78
MKK             Maasai            SSAFR    HapMap3      -0.02      37.91
Mandenka        Mandenka          SSAFR    HGDP         12.00     -12.00
MbutiPygmies    MbutiPygmies      SSAFR    HGDP          1.00      29.00
Mozambique      Mozambique        SSAFR    Malaria     -25.40      32.80
San             San               SSAFR    HGDP        -21.00      20.00
YRI             Yoruba_HapMap     SSAFR    HapMap3       7.38       3.90
Yoruba          Yoruba_HGDP       SSAFR    HGDP          8.00       5.00
Bedouin         Bedouin           MENA     HGDP         31.00      35.00
Druze           Druze             MENA     HGDP         32.00      35.00
Mozabite        Mozabite          MENA     HGDP         32.00       3.00
Palestinian     Palestinian       MENA     HGDP         32.50      35.50
Adygei          Adygei            EUR      HGDP         44.00      39.00
Basque          Basque            EUR      HGDP         43.00       0.00
CEU             EuropeCEU         EUR      HapMap3      48.86       2.35
French          French            EUR      HGDP         46.00       2.00
NorthItaly      NorthItaly        EUR      HGDP         43.00      10.00
Orcadian        Orcadian          EUR      HGDP         59.00      -3.00
Russian         Russian           EUR      HGDP         61.00      40.00
Sardinian       Sardinian         EUR      HGDP         40.00       9.00
TSI             Tuscan            EUR      HapMap3      43.77      11.26
Balochi         Balochi           CSASIA   HGDP         30.00      66.00
Brahui          Brahui            CSASIA   HGDP         31.00      67.00
Burusho         Burusho           CSASIA   HGDP         37.00      75.00
GIH             Gujarati          CSASIA   HapMap3      22.26      71.19
Hazara          Hazara            CSASIA   HGDP         33.00      70.00
Kalash          Kalash            CSASIA   HGDP         35.00      71.00
Makrani         Makrani           CSASIA   HGDP         26.00      64.00
NorthWestChina  NorthWestChina    CSASIA   HGDP         40.00      91.00
Pathan          Pathan            CSASIA   HGDP         32.00      69.00
Sindhi          Sindhi            CSASIA   HGDP         25.50      69.00
CHB             ChineseBeijing    EASIA    HapMap3      39.91     116.40
CHD             ChineseDenver     EASIA    HapMap3      35.86     104.20
Cambodian       Cambodian         EASIA    HGDP         12.00     105.00
Han             Han               EASIA    HGDP         32.50     114.00
JPT             JapaneseTokyo     EASIA    HapMap3      35.69     139.69
Japanese        Japanese          EASIA    HGDP         38.00     138.00
NorthEastChina  NorthEastChina    EASIA    HGDP         50.00     126.50
SouthChina      SouthChina        EASIA    HGDP         25.00     109.50
Yakut           Yakut             EASIA    HGDP         63.00     129.50
NANMelanesian   NANMelanesian     OCE      HGDP         -6.00     155.00
Papuan          Papuan            OCE      HGDP         -4.00     143.00
Colombian       Colombian         AME      HGDP          3.00     -68.00
Karitiana       Karitiana         AME      HGDP        -10.00     -63.00
MEX             Mexican           AME      HapMap3      23.63    -102.55
Maya            Maya              AME      HGDP         19.00     -91.00
Pima            Pima              AME      HGDP         29.00    -108.00
Surui           Surui             AME      HGDP        -11.00     -62.00

* SSAFR: Sub-Saharan Africa; MENA: Middle East and North Africa; EUR: Europe; CSASIA:
Central and South Asia; EASIA: East Asia; OCE: Oceania; AME: Americas
Figure Legends
Supplementary Figure 1
Convex hull polygon areas for biplots of the first three principal components, for the
full dataset as well as all randomly subsampled datasets of the combined HGDP and HapMap 3
data. For clarity, only selected populations are plotted. The positions of the population
labels indicate the mean coordinates of the samples of the respective population. As
can be seen, mean PCs remain similar for all subsets, whereas the polygon area
increases, particularly for populations that are well differentiated by the respective
PC (e.g. YRI in PC1).
Supplementary Figure 2
Barplot showing convex hull polygon area for all populations in PC1 – PC2 space, for
each subsampled dataset.
References
1. Li JZ, Absher DM, Tang H, Southwick AM, Casto AM, Ramachandran S, Cann
HM, Barsh GS, Feldman M, Cavalli-Sforza LL, Myers RM (2008) Worldwide
human relationships inferred from genome-wide patterns of variation. Science
319:1100-1104
2. Rosenberg NA (2006) Standardized subsets of the HGDP-CEPH Human Genome
Diversity Cell Line Panel, accounting for atypical and duplicated samples and
pairs of close relatives. Ann Hum Genet 70:841-847
3. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, Maller J,
Sklar P, de Bakker PI, Daly MJ, Sham PC (2007) PLINK: a tool set for
whole-genome association and population-based linkage analyses. Am J Hum Genet
81:559-575
4. Patterson N, Price AL, Reich D (2006) Population structure and eigenanalysis.
PLoS Genet 2:e190
5. Novembre J, Johnson T, Bryc K, Kutalik Z, Boyko AR, Auton A, Indap A, King
KS, Bergmann S, Nelson MR, Stephens M, Bustamante CD (2008) Genes
mirror geography within Europe. Nature 456:98-101
6. Falush D, Stephens M, Pritchard JK (2003) Inference of population structure using
multilocus genotype data: linked loci and correlated allele frequencies.
Genetics 164:1567-1587
7. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure
using multilocus genotype data. Genetics 155:945-959
8. Jakobsson M, Rosenberg NA (2007) CLUMPP: a cluster matching and permutation
program for dealing with label switching and multimodality in analysis of
population structure. Bioinformatics 23:1801-1806
9. Rosenberg NA (2004) DISTRUCT: a program for the graphical display of
population structure. Molecular Ecology Notes 4:137-138