Supplementary Material of

A panel of ancestry informative markers to estimate and correct potential effects of population stratification in Han Chinese

Pengfei Qin,1,2,§ Zhiqiang Li,3,§ Wenfei Jin,1,2 Dongsheng Lu,1,2 Haiyi Lou,1,2 Jiawei Shen,4 Li Jin,1,2,5,* Yongyong Shi,6,* Shuhua Xu,1,2,*

1 Max Planck Independent Research Group on Population Genomics,
2 Chinese Academy of Sciences Key Laboratory of Computational Biology, Chinese Academy of Sciences and Max Planck Society (CAS-MPG) Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, China;
3 Shanghai genomePilot Institutes for Genomics and Human Health, Shanghai 200030, China;
4 Changning Mental Health Center, Shanghai 200335, China;
5 Ministry of Education (MOE) Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai 200433, China;
6 Bio-X Center, MOE Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Shanghai Jiao Tong University, Shanghai 200030, China.

§ These authors contributed equally to this work.
* To whom correspondence should be addressed. E-mail: xushua@picb.ac.cn (S.X.); lijin.fudan@gmail.com (L.J.); shiyongyong@gmail.com (Y.S.).

Text S1 Quality control for 1000 Genomes Project data

Because of the low coverage and unsatisfactory quality of the 1000 Genomes Project phase 1 data, we compared strands and genotypes between the 1000 Genomes sequencing data and the HapMap genotyping data before integrating them. We found discordant strand information for some loci, and discordant genotypes even for identical individuals. SNPs with discordant strands were either filtered out or corrected in three steps. First, SNPs that could not be reconciled simply by converting the strand of the locus in the 1000 Genomes data were dropped.
Taking rs6700513 as an example, the alleles of this locus are C/G in HapMap but A/C in 1000 Genomes. In this step, 431 SNPs were excluded. Second, the remaining discordant SNPs can be reconciled simply by converting the 1000 Genomes strand, but A/T and C/G SNPs require particular care: converting the strand of such a SNP yields the same pair of alleles, so the converted strand cannot be distinguished from the original. In this step, the strand of 2,644 SNPs was corrected. Third, some A/T and C/G SNPs were corrected on the basis of allele frequency differences. Ideally, if the discordance rate of a locus among the identical CHB samples from the two projects is 100%, it should decrease to 0% once the strand of the locus in the 1000 Genomes data is converted. In this step, 28 strands were converted for |d_o - d_c| > 0.2, where d_o is the allele frequency difference between HapMap CHB and 1000 Genomes CHB for the original allele and d_c is that for the allele after converting the strand (Table S1 a, b and c).

For genotype discordance, we calculated the discordance rate of each individual's genotypes after filtering SNPs and converting strands. On average, 0.38% of genotypes were discordant for each identical individual between HapMap CHB and 1000 Genomes CHB (Table S1 d). We then excluded SNPs with a discordance rate greater than 0.1%; that is, 3,382 of 1,406,525 SNPs were discarded from the subsequent analysis.

Text S2 Excluding outliers from CHB and CHS

We downloaded Han Chinese sequencing data (CHB and CHS) from the 1000 Genomes Project. The CHB samples were collected in Beijing, the capital of China, which is located in northern China; most of these samples therefore belong to northern Han Chinese. The CHS samples were collected from Hunan and Fujian provinces, south of the Yangtze River, and can be regarded as southern Han Chinese.
Overall, there were 36,820,992 SNPs for these samples. Given the low coverage and unsatisfactory quality, we used only SNPs with a dbSNP ID in our study. Samples whose geographic origin, as provided by the 1000 Genomes Project, was discordant with their cluster in the PCA were regarded as outliers and excluded from the subsequent analysis. All SNPs were filtered for Hardy-Weinberg disequilibrium (HWD) (p < 0.00001) within each regional population. In the end, we retained 56 CHB and 83 CHS samples with 15,587,409 SNPs, which were used as northern Han and southern Han, respectively, in the subsequent study.

Supplementary Figures

Figure S1 Geographical locations and sample sizes in this study. (a) Dataset 1, including 757 individuals used for structure analysis and screening of AIMs. (b) Dataset 2, including 4,783 individuals used for validation of AIMs.

Figure S2 PCA plot of CHB and CHS from the 1000 Genomes Project.

Figure S3 Pairwise plots of PC1 to PC10 derived from 101,038 SNPs in the samples of dataset 1. Each plot includes samples of CHB (yellow), CHS (green) and other Han (blue) in dataset 1.

Figure S4 Distribution of FST over the whole genome. We calculated and plotted FST values between northern and southern Han for all 738,937 autosomal SNPs. Locations of genes with the most highly differentiated SNPs are labeled on the Manhattan plot.

Figure S5 Correlation between FST and In values using the top 1,000 AIMs.

Figure S6 Structure of 467 Han samples analyzed by FRAPPE using the top 150 AIMs. The number of iterations in FRAPPE was set to 10,000, assuming K = 2.

Figure S7 Values of inflation factors for different degrees of stratification from simulated association analysis.

Figure S8 Q-Q plots of the p values from simulated association studies with or without correction for population stratification using AIMs. Rows correspond to degrees of stratification of the 5,000 case and 5,000 control samples.
Columns correspond to plots of uncorrected, PCA-corrected and GC-corrected p values.

Figure S9 Classification of 467 Han samples using varying numbers of AIMs from a previous panel. This panel of AIMs was selected from a small dataset in a previous study.

Supplementary Tables

Table S1 Discordance of SNP allele strands and individual genotypes between HapMap and 1000 Genomes. (see Excel file Table_S1.xls, including 4 sheets)

Table S2 Top 1,000 highly differentiated SNPs between northern and southern Han ranked by FST values. (see Excel file Table_S2.xls)

Table S3 False positives due to population stratification and the performance of the PCA-based method and the genomic control method using our AIMs panel.

                Uncorrected (odds ratio)      PCA-corrected (odds ratio)    GC-corrected (odds ratio)
Stratification  1.2    1.4    1.6    2        1.2    1.4    1.6    2        1.2    1.4    1.6    2
0%              0.003  0.002  0.002  0.002    0.003  0.002  0.002  0.002    0.001  0.001  0.001  0.001
10%             0.02   0.016  0.015  0.015    0.006  0.004  0.004  0.004    nan    0      0      0
20%             0.193  0.157  0.153  0.151    0.001  0.001  0.001  0.001    nan    nan    nan    0
30%             0.54   0.478  0.47   0.465    0.008  0.005  0.005  0.005    nan    nan    nan    nan
40%             0.777  0.731  0.725  0.721    0.003  0.002  0.002  0.002    nan    nan    nan    nan
50%             0.871  0.841  0.836  0.834    0.003  0.002  0.002  0.002    nan    nan    nan    nan
60%             0.913  0.891  0.888  0.886    0.01   0.006  0.005  0.005    nan    nan    nan    nan
70%             0.935  0.918  0.915  0.914    0.01   0.005  0.004  0.004    nan    nan    nan    nan
80%             0.947  0.934  0.932  0.93     0.024  0.008  0.007  0.006    nan    nan    nan    nan
90%             0.955  0.944  0.942  0.941    0.128  0.019  0.014  0.013    nan    nan    nan    nan
100%            0.961  0.951  0.949  0.948    0.994  0.872  0.551  0.25     nan    nan    nan    nan

Table S4 Statistical power to detect risk alleles.
                Uncorrected (odds ratio)      PCA-corrected (odds ratio)    GC-corrected (odds ratio)
Stratification  1.2    1.4    1.6    2        1.2    1.4    1.6    2        1.2    1.4    1.6    2
0%              0.726  0.932  0.962  0.98     0.726  0.932  0.962  0.98     0.703  0.925  0.958  0.979
10%             0.727  0.932  0.962  0.98     0.712  0.93   0.961  0.98     0      0.236  0.686  0.857
20%             0.727  0.932  0.962  0.98     0.695  0.924  0.959  0.978    0      0      0      0.357
30%             0.727  0.932  0.962  0.98     0.655  0.91   0.951  0.976    0      0      0      0
40%             0.727  0.932  0.962  0.98     0.634  0.905  0.949  0.973    0      0      0      0
50%             0.727  0.932  0.962  0.98     0.58   0.896  0.942  0.97     0      0      0      0
60%             0.727  0.932  0.962  0.98     0.508  0.877  0.931  0.967    0      0      0      0
70%             0.727  0.932  0.962  0.98     0.381  0.856  0.923  0.964    0      0      0      0
80%             0.727  0.932  0.962  0.98     0.244  0.787  0.892  0.951    0      0      0      0
90%             0.727  0.932  0.962  0.98     0.082  0.623  0.819  0.924    0      0      0      0
100%            0.727  0.932  0.962  0.98     0.001  0.023  0.127  0.468    0      0      0      0
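
Illustrative sketch for Text S1. The three-step strand check described in Text S1 can be sketched in Python as a function that classifies a single SNP as keep, flip, drop or ambiguous. This is a minimal sketch under stated assumptions, not the code used in this study. The function names (flip, is_ambiguous, classify_snp) and the argument layout are hypothetical. The quantities do and dc of Text S1 appear as d_o and d_c and are taken to be absolute frequency differences of the same allele label in HapMap CHB and 1000 Genomes CHB at a biallelic SNP, so the 1000 Genomes frequency after strand conversion is one minus the original frequency. The sketch further assumes that a strand is converted only when doing so reduces the difference (d_o > d_c), in addition to the stated criterion |d_o - d_c| > 0.2.

```python
# Hypothetical helper names; a sketch of the strand check in Text S1,
# not the authors' implementation.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip(alleles):
    """Return the allele pair as read from the opposite strand."""
    return frozenset(COMPLEMENT[a] for a in alleles)


def is_ambiguous(alleles):
    """A/T and C/G pairs are unchanged by a strand flip."""
    return flip(alleles) == frozenset(alleles)


def classify_snp(hapmap_alleles, kg_alleles,
                 freq_hapmap=None, freq_kg=None, threshold=0.2):
    """Classify one SNP as 'keep', 'flip', 'drop' or 'ambiguous'.

    freq_hapmap and freq_kg are the frequencies of the same allele
    label in the two data sets (assumption: biallelic SNP).
    """
    hapmap = frozenset(hapmap_alleles)
    kg = frozenset(kg_alleles)

    # Step 3: A/T or C/G SNPs with matching labels need frequencies.
    if hapmap == kg and is_ambiguous(hapmap):
        if freq_hapmap is None or freq_kg is None:
            return "ambiguous"
        d_o = abs(freq_hapmap - freq_kg)          # original strand
        d_c = abs(freq_hapmap - (1.0 - freq_kg))  # converted strand
        # Assumption: convert only if it reduces the difference.
        if abs(d_o - d_c) > threshold and d_o > d_c:
            return "flip"
        return "keep"

    if hapmap == kg:
        return "keep"
    # Step 2: alleles agree once the 1000 Genomes strand is converted.
    if flip(kg) == hapmap:
        return "flip"
    # Step 1: alleles cannot be reconciled by a strand conversion.
    return "drop"
```

For example, classify_snp(("C", "G"), ("A", "C")) returns "drop", which corresponds to the rs6700513 case in Text S1, where the alleles are C/G in HapMap and A/C in 1000 Genomes.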