Supplementary Material of
A panel of ancestry informative markers to estimate and correct
potential effects of population stratification in Han Chinese
Pengfei Qin,1,2,§ Zhiqiang Li,3,§ Wenfei Jin,1,2 Dongsheng Lu,1,2 Haiyi Lou,1,2
Jiawei Shen,4 Li Jin,1,2,5,* Yongyong Shi,6,* Shuhua Xu1,2,*
1 Max Planck Independent Research Group on Population Genomics, 2 Chinese
Academy of Sciences Key Laboratory of Computational Biology, Chinese Academy of
Sciences and Max Planck Society (CAS-MPG) Partner Institute for Computational
Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences,
Shanghai 200031, China;
3 Shanghai genomePilot Institutes for Genomics and Human Health, Shanghai 200030,
China;
4 Changning Mental Health Center, Shanghai 200335, China;
5 Ministry of Education (MOE) Key Laboratory of Contemporary Anthropology,
School of Life Sciences and Institutes of Biomedical Sciences, Fudan University,
Shanghai 200433, China;
6 Bio-X Center, MOE Key Laboratory for the Genetics of Developmental and
Neuropsychiatric Disorders, Shanghai Jiao Tong University, Shanghai 200030, China.
§ These authors contributed equally to this work.
* To whom correspondence should be addressed. E-mail: xushua@picb.ac.cn (S.X.);
lijin.fudan@gmail.com (L.J.); shiyongyong@gmail.com (Y.S.).
Text S1
Quality control for 1000 Genomes Project data
Because the 1000 Genomes Project phase 1 data are low-coverage and of limited
quality, we compared strands and genotypes between the 1000 Genomes sequencing
data and the HapMap genotyping data before integrating the two. We found
discordant strand information at some loci, and discordant genotypes even for
individuals present in both datasets.
SNPs with discordant strands were either filtered out or corrected in three steps.
First, SNPs whose alleles could not be reconciled by simply flipping the strand of
the 1000 Genomes locus were dropped. For example, the alleles of rs6700513 are C/G
in HapMap but A/C in 1000 Genomes. In this step, 431 SNPs were excluded. Second,
SNPs that could be reconciled by flipping the 1000 Genomes strand were corrected;
in this step, the strand of 2,644 SNPs was corrected. SNPs with alleles A/T or C/G
need particular care here, because flipping the strand of such a SNP yields the same
pair of alleles, so the flipped and original strands cannot be told apart from the
alleles alone. Third, some SNPs with alleles A/T or C/G were corrected on the basis
of allele frequency differences. Ideally, if a locus is 100% discordant between the
identical CHB samples in the two projects, the discordance drops to 0% once the
1000 Genomes strand is flipped. In this step, the strands of 28 SNPs were flipped
using the criterion |d_o - d_c| > 0.2, where d_o is the allele frequency difference
between HapMap CHB and 1000 Genomes CHB for the original allele and d_c is the
difference after flipping the strand (Table S1 a, b and c).
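As an illustration only, the three-step strand logic described above can be sketched in Python. The function names, the allele-pair representation, and the extra requirement that flipping must reduce the frequency difference are assumptions made for this sketch; they are not taken from the authors' pipeline.

```python
# Minimal sketch of strand reconciliation between two genotype datasets.
# All names here are hypothetical; only the 0.2 threshold comes from the text.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the allele pair as read from the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def classify_strand(ref_alleles, test_alleles):
    """Relate test alleles (e.g. 1000 Genomes) to reference alleles (e.g. HapMap)."""
    ref, test = frozenset(ref_alleles), frozenset(test_alleles)
    if ref == flip(ref):
        # A/T or C/G: flipping gives the same pair, so alleles alone cannot decide.
        return "ambiguous" if test == ref else "incompatible"
    if test == ref:
        return "concordant"
    if flip(test) == ref:
        return "flip"
    return "incompatible"


def flip_by_frequency(freq_ref, freq_test, threshold=0.2):
    """Decide whether to flip an ambiguous (A/T or C/G) biallelic SNP.

    d_o compares the frequency of the same allele label in both datasets;
    d_c compares it after flipping the test dataset. Requiring d_c < d_o is
    an assumption of this sketch.
    """
    d_o = abs(freq_ref - freq_test)
    d_c = abs(freq_ref - (1.0 - freq_test))
    return abs(d_o - d_c) > threshold and d_c < d_o
```

With these definitions, the rs6700513 example (C/G in HapMap, A/C in 1000 Genomes) is classified as incompatible and would be dropped in the first step.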
For genotype discordance, after filtering and flipping the strands of SNPs, we
calculated the discordance rate from the genotypes of each individual. On average,
0.38% of genotypes were discordant for each individual shared between HapMap CHB
and 1000 Genomes CHB (Table S1 d). We then excluded SNPs with a discordance rate
greater than 0.1%; that is, 3,382 of 1,406,525 SNPs were discarded from the
subsequent analysis.
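A discordance filter of this kind could look roughly like the following sketch. The genotype coding (allele counts 0/1/2, with None for a missing call), the dictionary layout, and the function names are assumptions for illustration; only the 0.1% cutoff is from the text.

```python
# Minimal sketch: discordance between two call sets for the same samples.
# Genotypes are assumed to be allele counts (0, 1, 2) or None if missing.
def discordance_rate(calls_a, calls_b):
    """Fraction of paired, non-missing genotype calls that differ."""
    pairs = [(a, b) for a, b in zip(calls_a, calls_b)
             if a is not None and b is not None]
    if not pairs:
        return None  # nothing to compare
    return sum(a != b for a, b in pairs) / len(pairs)


def filter_snps(geno_a, geno_b, max_rate=0.001):
    """Return SNP IDs whose discordance across shared samples is <= max_rate."""
    kept = []
    for snp in geno_a.keys() & geno_b.keys():
        rate = discordance_rate(geno_a[snp], geno_b[snp])
        if rate is not None and rate <= max_rate:
            kept.append(snp)
    return sorted(kept)


# Toy data with hypothetical SNP labels; samples are in the same order.
hapmap = {"snp_a": [0, 1, 2, 1], "snp_b": [0, 0, 1, 2]}
kgenomes = {"snp_a": [0, 1, 2, 1], "snp_b": [0, 1, 1, None]}
kept = filter_snps(hapmap, kgenomes)
```

The same `discordance_rate` function gives a per-individual rate if it is applied to one individual's calls across SNPs instead of one SNP's calls across individuals.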
Text S2
Excluding outliers from CHB and CHS
We downloaded Han Chinese sequencing data (CHB and CHS) from the 1000 Genomes
Project. The CHB samples were collected in Beijing, the capital of China, located in
the north of the country, so most of them are Northern Han Chinese. The CHS samples
were collected in Hunan and Fujian provinces, south of the Yangtze River, and can be
considered Southern Han Chinese. In total, there were 36,820,992 SNPs for these
samples. Given the low coverage and limited quality of the data, we used only SNPs
with a dbSNP ID. Samples whose geographic origin as provided by the 1000 Genomes
Project was discordant with their cluster in the PCA were considered outliers and
excluded from the subsequent analysis. All SNPs were filtered for Hardy-Weinberg
disequilibrium (HWD) (p < 0.00001) within each regional population. Finally, we
obtained 56 CHB and 83 CHS samples with 15,587,409 SNPs, which were used as
Northern Han and Southern Han, respectively, in the subsequent analyses.
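The text does not state which Hardy-Weinberg test was used. As a hedged illustration, the sketch below applies a one-degree-of-freedom chi-square goodness-of-fit test to genotype counts and flags SNPs with p < 0.00001; an exact test is often preferred in practice, and the function name is hypothetical.

```python
import math


def hwe_chisq_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return 1.0
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return 1.0  # monomorphic site: no test possible
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df.
    return math.erfc(math.sqrt(chi2 / 2.0))


def passes_hwe(n_aa, n_ab, n_bb, alpha=0.00001):
    """True if the SNP is retained under the p < alpha exclusion rule."""
    return hwe_chisq_pvalue(n_aa, n_ab, n_bb) >= alpha
```

Applied within each regional population separately, a SNP with genotype counts 25/50/25 would be retained, whereas one with counts 50/0/50 (no heterozygotes) would be removed.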
Supplementary Figures
Figure S1 Geographical locations and sample sizes in this study.
(a) Dataset 1, comprising 757 individuals, used for structure analysis and screening of AIMs.
(b) Dataset 2, comprising 4,783 individuals, used for validation of AIMs.
Figure S2 PCA plot of CHB and CHS from the 1000 Genomes Project.
Figure S3 Pairwise plots of PC1 to PC10 derived from 101,038 SNPs in samples of
dataset 1.
Each plot includes the CHB (yellow), CHS (green) and other Han (blue) samples in
dataset 1.
Figure S4 Distributions of FST over the whole genome.
We calculated and plotted FST values for all 738,937 autosomal SNPs between
northern and southern Han. Genes harboring the most highly differentiated SNPs are
labeled on the Manhattan plot.
Figure S5 Correlation between FST and I_n values using the top 1,000 AIMs.
Figure S6 Structure of 467 Han samples analyzed by FRAPPE using top 150 AIMs.
Iterations in FRAPPE were set to 10,000 assuming K = 2.
Figure S7 Values of inflation factors for different degrees of stratification from
simulated association analysis.
Figure S8 Q-Q plots of the p values from simulated association studies with or without correction for population stratification using AIMs.
Rows correspond to degrees of stratification of the 5,000 case and 5,000 control samples. Columns correspond to plots of uncorrected,
PCA-corrected and GC-corrected p values.
Figure S9 Classification of 467 Han samples using varying numbers of AIMs from a
previous panel.
This panel of AIMs was selected from a small dataset in a previous study.
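The two per-SNP statistics named in Figures S4 and S5 can be illustrated with a short sketch for a biallelic SNP and two populations. The FST estimator used by the authors is not stated here, so the code uses the simple heterozygosity-based form (H_T - H_S) / H_T with equal population weights as an assumption; I_n follows the usual informativeness-for-assignment definition with natural logarithms. Function names are hypothetical.

```python
import math


def fst_two_pops(p1, p2):
    """Heterozygosity-based FST for one biallelic SNP, equal population weights."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)  # expected heterozygosity, pooled
    if h_t == 0.0:
        return 0.0  # monomorphic in both populations
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within
    return (h_t - h_s) / h_t


def _xlogx(x):
    """x * ln(x), with the conventional limit 0 at x = 0."""
    return x * math.log(x) if x > 0.0 else 0.0


def informativeness(p1, p2):
    """Informativeness for assignment (I_n) for one biallelic SNP, K = 2."""
    total = 0.0
    # Loop over the two alleles: frequencies (p1, p2) and (1 - p1, 1 - p2).
    for a1, a2 in ((p1, p2), (1.0 - p1, 1.0 - p2)):
        mean = (a1 + a2) / 2.0
        total += -_xlogx(mean) + (_xlogx(a1) + _xlogx(a2)) / 2.0
    return total
```

Both statistics are 0 when the two populations share the same allele frequency and reach their maxima (1 for FST, ln 2 for I_n with two populations) when alternative alleles are fixed, which is why the two rankings tend to agree.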
Supplementary Tables
Table S1 Discordance of SNP allele strands and individual genotypes between
HapMap and 1000 Genomes. (See Excel file Table_S1.xls, which includes 4 sheets.)
Table S2 Top 1,000 highly differentiated SNPs between northern and southern Han
ranked by FST values. (See Excel file Table_S2.xls.)
Table S3 False positives due to population stratification and the performance of the PCA-based method and the genomic control method using our
AIMs panel.
                Uncorrected                 PCA-corrected               GC-corrected
                (odds ratio)                (odds ratio)                (odds ratio)
Stratification  1.2    1.4    1.6    2      1.2    1.4    1.6    2      1.2    1.4    1.6    2
0%              0.003  0.002  0.002  0.002  0.003  0.002  0.002  0.002  0.001  0.001  0.001  0.001
10%             0.02   0.016  0.015  0.015  0.006  0.004  0.004  0.004  nan    0      0      0
20%             0.193  0.157  0.153  0.151  0.001  0.001  0.001  0.001  nan    nan    nan    0
30%             0.54   0.478  0.47   0.465  0.008  0.005  0.005  0.005  nan    nan    nan    nan
40%             0.777  0.731  0.725  0.721  0.003  0.002  0.002  0.002  nan    nan    nan    nan
50%             0.871  0.841  0.836  0.834  0.003  0.002  0.002  0.002  nan    nan    nan    nan
60%             0.913  0.891  0.888  0.886  0.01   0.006  0.005  0.005  nan    nan    nan    nan
70%             0.935  0.918  0.915  0.914  0.01   0.005  0.004  0.004  nan    nan    nan    nan
80%             0.947  0.934  0.932  0.93   0.024  0.008  0.007  0.006  nan    nan    nan    nan
90%             0.955  0.944  0.942  0.941  0.128  0.019  0.014  0.013  nan    nan    nan    nan
100%            0.961  0.951  0.949  0.948  0.994  0.872  0.551  0.25   nan    nan    nan    nan
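For readers unfamiliar with the GC-corrected columns, genomic control rescales association test statistics by an inflation factor (compare Figure S7). The sketch below shows the standard form of that calculation for 1-df chi-square statistics; it is not the authors' code, the function names are hypothetical, and flooring the factor at 1 is a common convention assumed here.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def inflation_factor(chi2_stats):
    """Genomic inflation factor: observed median statistic over expected median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def gc_corrected_pvalues(chi2_stats, null_stats=None):
    """Divide each statistic by lambda (floored at 1) and convert to p values.

    null_stats, if given, is the set of statistics used to estimate lambda
    (for example, statistics at markers assumed to be unassociated).
    """
    source = chi2_stats if null_stats is None else null_stats
    lam = max(1.0, inflation_factor(source))
    # Survival function of a chi-square variable with 1 df.
    return [math.erfc(math.sqrt(x / lam / 2.0)) for x in chi2_stats]
```

Under strong stratification the inflation factor becomes large, so every statistic is shrunk heavily; this is consistent with the GC-corrected columns losing both false positives and true positives as stratification increases.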
Table S4 Statistical power to detect risk alleles.
                Uncorrected                 PCA-corrected               GC-corrected
                (odds ratio)                (odds ratio)                (odds ratio)
Stratification  1.2    1.4    1.6    2      1.2    1.4    1.6    2      1.2    1.4    1.6    2
0%              0.726  0.932  0.962  0.98   0.726  0.932  0.962  0.98   0.703  0.925  0.958  0.979
10%             0.727  0.932  0.962  0.98   0.712  0.93   0.961  0.98   0      0.236  0.686  0.857
20%             0.727  0.932  0.962  0.98   0.695  0.924  0.959  0.978  0      0      0      0.357
30%             0.727  0.932  0.962  0.98   0.655  0.91   0.951  0.976  0      0      0      0
40%             0.727  0.932  0.962  0.98   0.634  0.905  0.949  0.973  0      0      0      0
50%             0.727  0.932  0.962  0.98   0.58   0.896  0.942  0.97   0      0      0      0
60%             0.727  0.932  0.962  0.98   0.508  0.877  0.931  0.967  0      0      0      0
70%             0.727  0.932  0.962  0.98   0.381  0.856  0.923  0.964  0      0      0      0
80%             0.727  0.932  0.962  0.98   0.244  0.787  0.892  0.951  0      0      0      0
90%             0.727  0.932  0.962  0.98   0.082  0.623  0.819  0.924  0      0      0      0
100%            0.727  0.932  0.962  0.98   0.001  0.023  0.127  0.468  0      0      0      0