Some practical notes about imputation

The most important precondition for imputation is that the study and reference panels are as compatible as possible. The whole process is based on the SNPs that the two panels have in common, so we have to make sure that these carry no discrepancies. Specifically, study and reference should be aligned to the same genome assembly, checked for inverted SNP alleles, and corrected for minor allele frequency (MAF) and Hardy-Weinberg equilibrium (HWE) differences.

If the reference and the study are aligned to different genome assemblies, we recommend re-aligning the study panel to the assembly of the reference rather than the opposite. The haplotype structure of the reference panel, which has been extracted through phasing, will be distorted if the positions of the markers are altered by the re-alignment. Re-aligning a genotype dataset to a new genome assembly is called liftover. GATK [6] is one of the tools that perform liftover.

Reference panels that have been pre-processed for imputation can be acquired from the websites of the imputation software:

IMPUTE2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download_reference_data
BEAGLE: http://faculty.washington.edu/browning/beagle/beagle.html#reference_panels
MACH: http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-PhaseIInterim.html

Alternatively, an imputation reference panel can be created from the raw data. For example, raw variant calls of the 1000 Genomes Project in compressed Variant Call Format [5] are available here (October 2011 release): ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521

One of the best-known tools for managing VCF files is vcftools. After downloading, we can use vcftools to extract the SNPs, disregard any other variation, and apply a basic quality filter that keeps SNPs with MAF above 1% and HWE p-value above 10^-4.
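The filter described above can be illustrated with a minimal Python sketch that works on the genotype counts of a single biallelic SNP. It is an illustration only and not what vcftools or PLINK run internally: the function names are hypothetical, and the HWE p-value comes from the one-degree-of-freedom chi-square approximation, whereas the tools may use an exact test.

```python
import math


def maf_and_hwe(n_aa, n_ab, n_bb):
    """Return (maf, chi2, p_value) for one biallelic SNP.

    n_aa, n_ab, n_bb are the observed genotype counts.
    The HWE test is the chi-square approximation with 1 degree of freedom
    (an assumption of this sketch).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no called genotypes")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    maf = min(p, q)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return maf, chi2, p_value


def passes_filter(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=1e-4):
    """Keep SNPs with MAF above min_maf and HWE p-value above min_hwe_p."""
    maf, _, p_value = maf_and_hwe(n_aa, n_ab, n_bb)
    return maf > min_maf and p_value > min_hwe_p
```

For example, passes_filter(25, 50, 25) keeps a SNP whose counts match the HWE expectation exactly, while passes_filter(50, 0, 50) rejects a SNP with no heterozygotes at all.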
We strongly advise scrutinizing the reference panel for any HWE or MAF discrepancy, as any such discrepancy is likely to be transferred to the imputation results. The Q-Q plot of the HWE chi-square values of the SNPs that passed the above filter reveals that 1KG contains an excess of SNPs with abnormal allele distributions. By applying a more stringent filter we can correct the inflation but, of course, we decrease the variation in the reference panel.

The next step is the format conversion. Table 1 presents the tools needed to perform the appropriate conversions. To impute with BEAGLE we have to convert the study and the reference from VCF to BEAGLE format. To impute with IMPUTE2 we have to convert the reference panel to hap and legend format and the study panel to gen and samples format.

Table 1

Origin format   Target format   Tool                  Notes
VCF             PED / MAP       vcftools [8]          This can be very slow for large datasets. We recommend converting from VCF to TPED/TFAM and then using PLINK to convert to PED/MAP.
VCF             hap, legend     vcftools              IMPUTE2 reference panel. Does not support missing genotypes. Only markers with 100% calling rate will be exported.
PED / MAP       gen, samples    gtool [9]             IMPUTE2 study panel.
PED / MAP       beagle          linkage2beagle [10]   BEAGLE study and reference panel.

The HWE and MAF filtering should be applied to the study panel as well.

The next step is the correction of possible strand alignment issues. IMPUTE2 and MACH contain options to fix misaligned alleles between study and reference panel by inverting the alleles whenever this is possible. In practice this corrects AC/GT and AG/CT misalignments, whereas all other combinations (for example AC/AG) are considered incompatible and the respective SNPs are removed from the study panel. We do not recommend relying on these corrections; the genotypes in the study panel should always be aligned to the assembly of the reference.
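The allele comparison described above (AC/GT and AG/CT can be inverted, AC/AG cannot) can be sketched in a few lines of Python. This is a hypothetical helper written for illustration, not the logic of IMPUTE2, MACH or the appendix script; the function name and the returned labels are assumptions of the sketch.

```python
# Complementary bases on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(study, reference):
    """Compare the allele pair of a SNP in the study and in the reference.

    Each argument is a pair of single-letter alleles, e.g. ("A", "C").
    Returns "match", "inverted" (a strand flip fixes it) or "incompatible".
    """
    if any(a not in COMPLEMENT for a in tuple(study) + tuple(reference)):
        return "incompatible"  # e.g. missing alleles or indel codes
    study_set = frozenset(study)
    reference_set = frozenset(reference)
    if study_set == reference_set:
        return "match"
    flipped = frozenset(COMPLEMENT[a] for a in study)
    if flipped == reference_set:
        return "inverted"
    return "incompatible"
```

With this helper, classify_alleles(("A", "C"), ("G", "T")) is reported as inverted and classify_alleles(("A", "C"), ("A", "G")) as incompatible. Note that A/T and C/G SNPs look identical on both strands, so a comparison of allele letters alone cannot detect a flip for them.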
IMPUTE2's reference datasets are aligned to the '+' strand, so if the architecture of the study's genotyping platform is known, the strand correction can be performed prior to the imputation.

An extra quality step is to check for significant differences in minor allele frequency between study and reference. An absolute difference of more than 0.25 indicates a major structural dissimilarity, possibly coming from a genotyping error.

Here we demonstrate how to apply the presented filters using the PLINK software [7]. We assume that the study and the reference are in PED / MAP format, and that reference_1 denotes the reference panel after the corresponding filtering.

Apply MAF and HWE filtering to the study:

    plink --file study --maf 0.01 --hwe 0.0001 --recode --out study_1

Substitute the SNP identifier with the chromosomal position:

    cat study_1.map | awk '{print $1, $1"_"$4, $3, $4}' > study_2.map
    cat reference_1.map | awk '{print $1, $1"_"$4, $3, $4}' > reference_2.map
    cp study_1.ped study_2.ped
    cp reference_1.ped reference_2.ped

Create an alphanumerically sorted file of the new position-based identifiers (the second column, since PLINK's --extract matches on SNP identifiers):

    cat study_2.map | awk '{print $2}' | sort > study.pos
    cat reference_2.map | awk '{print $2}' | sort > reference.pos

Get the common positions:

    comm -12 study.pos reference.pos > common.pos

Extract the allele frequencies of the SNPs common to study and reference:

    plink --file study_2 --freq --extract common.pos --out study_2
    plink --file reference_2 --freq --extract common.pos --out reference_2

Paste the two generated frequency files horizontally:

    paste study_2.frq reference_2.frq > common.frq

The "normalize.py" Python script in the appendix parses the common.frq file and reports the SNPs that have incompatible alleles, inverted alleles and abnormal frequencies:

    python normalize.py

Remove the incompatible and abnormal SNPs:

    plink --file study_2 --exclude abnormal.txt --recode --out study_3
    plink --file study_3 --exclude incompatible.txt --recode --out study_4

We can also flip the inverted alleles:

    plink --file study_4 --flip inverted.txt --recode --out study_5

Now we can convert the files
study_5.map and study_5.ped to the format required by the imputation tool of our choice, according to Table 1.

The final step is splitting the study panel into bins of samples and into chromosomal regions. The size of the bins depends on the available computational environment and the parallelization options.

Quality measures

Here we present the metrics that assess the quality of an imputation experiment. The metrics fall into two categories, depending on whether real (unimputed) genotypes are available or not.

The most common imputation metric is R2, which represents the squared correlation between the imputed and the real genotypes. Since the real genotypes are usually unknown, various statistics can be used to estimate it. Marchini and Howie [1] present a thorough review of the R2 metrics used by the MACH, BEAGLE, SNPTEST and IMPUTE v1 and v2 software. In the appendix, we provide source code for the estimation of these metrics. Another R2 metric [2] is the ratio of the variance of the imputed allele dosage to the variance of the true allele dosage. The variance of the true allele dosage is unknown, but it can be estimated as 2p(1−p) under Hardy-Weinberg equilibrium, where p is the estimated allele frequency. Other metrics include the allelic frequency error and the standardized allele frequency error [3].

After choosing a suitable R2 metric we can draw useful conclusions by plotting the percentage of SNPs that exhibit R2 above 0.8 for various minor allele frequency bins (for an example see [11]). This graph shows how well rare and common SNPs were imputed and can be used to compare different imputation reference panels.

When real data are available, the most useful metric is the R2 of the correlation between the allelic dosage from imputation and the true genotyped dosage (code available in the appendix). We can also assess the performance of the imputation R2 value by estimating the overconfident and under-confident genotypes.
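The two dosage-based R2 metrics of this section can be sketched as follows. This is a minimal illustration and not the appendix code: the function names are hypothetical, a dosage is assumed to be the expected count (0 to 2) of one allele per sample, and variances are population variances (divided by n).

```python
def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance (divides by n)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def variance_ratio_r2(dosages):
    """Variance of the imputed dosages over the HWE expectation 2p(1-p)."""
    p = mean(dosages) / 2.0           # estimated allele frequency
    expected = 2.0 * p * (1.0 - p)    # variance of the true dosage under HWE
    if expected == 0.0:
        return float("nan")           # undefined for a monomorphic SNP
    return variance(dosages) / expected


def squared_correlation(imputed, true):
    """Squared Pearson correlation between imputed and true dosages."""
    mean_i, mean_t = mean(imputed), mean(true)
    cov = sum((i - mean_i) * (t - mean_t) for i, t in zip(imputed, true))
    var_i = sum((i - mean_i) ** 2 for i in imputed)
    var_t = sum((t - mean_t) ** 2 for t in true)
    if var_i == 0.0 or var_t == 0.0:
        return float("nan")           # correlation undefined without variation
    return cov * cov / (var_i * var_t)
```

The first function needs only the imputed dosages, so it applies when no real genotypes exist; the second requires the true genotypes. When every sample receives the same dosage the variance ratio is 0, reflecting an imputation that carries no information about individual genotypes.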
Overconfident genotypes are wrongly imputed genotypes with high R2 values; under-confident genotypes are the opposite, that is, correctly imputed genotypes with low R2 values.

Another quality metric is the concordance between real and imputed genotypes. The large number of concordant genotypes that are homozygous for the major allele can easily bias the concordance estimate. This is why, if A and B are the major and minor alleles respectively, it is essential to provide an estimate of the non-reference sensitivity, that is, the concordance of the genotypes that are A/B or B/B in the "real" dataset. GATK contains options to perform these estimations for VCF files.

To compare different imputation methods we can plot the percentage of discordant genotypes against the percentage of missing genotypes for various thresholds of the genotype probability [4]. To construct the graph, we measure the discordance between imputed and real data and the percentage of missing genotypes while the threshold on the genotype probability increases from 0.33 to 0.99.

Reference datasets

The construction of a novel imputation reference dataset is a complex procedure that requires the dense genotyping and phasing of hundreds of individuals from a specific population. Additional filters and stringent quality controls have to be applied to exclude any potential artifacts and biases [11]. The most thoroughly documented and widely available imputation reference sets come from the HapMap and 1000 Genomes consortia. The HapMap 3 reference set comes from samples that have been separately genotyped on two different platforms, the Affymetrix Human SNP Array 6.0 and the Illumina Human 1M-single BeadChip. The dataset consists of 1,184 individuals from 11 populations typed at 1,440,616 SNPs [13]. A different approach was followed by the 1000 Genomes Project [14]. The genotypes come from low-coverage whole-genome sequencing of individuals belonging to the same populations.
Depending on the population, the number of SNPs that it includes varies from 6,273,441 (CHB+JPT) to 10,938,130 (YRI). The union across samples is 15,275,256 SNPs. It becomes evident that the main difference between HapMap 3 and the 1000 Genomes Project is that the former has fewer but higher-quality genotypes, while the latter contains previously unknown variation that is less reliable. It is important to note that the 1000 Genomes Project is an ongoing project that aims to sequence nearly all of the HapMap 3 samples. In its final form it will contain approximately 2,500 individuals from nearly 25 populations. IMPUTE2 has the option to combine two different reference datasets, which is very convenient for the presented datasets and, as expected, yields superior results compared with imputation with HapMap 3 or the 1000 Genomes Project alone [15].

Which reference population to select

The main problem of existing reference panels is their relatively small sample size. Especially for the imputation of low-frequency variants, the missing genotype might not be adequately represented in the reference dataset due to sampling error. Nevertheless, this genotype might exist in a haplotype of a different population, so population mixing usually improves imputation quality [16], [17]. The improvement that population mixing brings for low-frequency variants is substantial and can reach the quality of common variants [12]. IMPUTE2 includes a sophisticated method for selecting the subset of haplotypes from the complete reference dataset with which it will perform imputation. This eases the procedure, as the researcher does not have to predefine the reference population. Experiments with population mixing indicated that even when we mix populations of different ancestry (e.g. TSI and CHB+JPT) the performance increases, although the mixture percentage should not exceed 20% [12].

References

[1] Jonathan Marchini and Bryan Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11, 499-511. Supplementary information S3.
[2] P.I. de Bakker, M.A. Ferreira, X. Jia, B.M. Neale, S. Raychaudhuri, B.F. Voight. Practical aspects of imputation-driven meta-analysis of genome-wide association studies.
[3] Brian L. Browning and Sharon R. Browning. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals.
[4] Howie BN, Donnelly P, Marchini J (2009). A Flexible and Accurate Genotype Imputation Method for the Next Generation of Genome-Wide Association Studies. PLoS Genet 5(6): e1000529. doi:10.1371/journal.pgen.1000529
[5] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A., Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R., 1000 Genomes Project Analysis Group. The variant call format and VCFtools. Bioinformatics (2011) 27 (15): 2156-2158.
[6] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire, Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas, Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler and Mark J Daly. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature Genetics 43, 491–498 (2011). doi:10.1038/ng.806
[7] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ and Sham PC (2007). PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.
[8] http://vcftools.sourceforge.net/
[9] http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html
[10] http://faculty.washington.edu/browning/beagle_utilities/utilities.html
[11] Zhaoming Wang, Kevin B Jacobs, Meredith Yeager, Amy Hutchinson, Joshua Sampson, Nilanjan Chatterjee, Demetrius Albanes, Sonja I Berndt, Charles C Chung, W Ryan Diver, Susan M Gapstur, Lauren R Teras, Christopher A Haiman, Brian E Henderson, Daniel Stram, Xiang Deng, Ann W Hsing, Jarmo Virtamo, Michael A Eberle, Jennifer L Stone, Mark P Purdue, Phil Taylor, Margaret Tucker and Stephen J Chanock. Improved imputation of common and uncommon SNPs with a new reference set. Nature Genetics 44, 6–7 (2012). doi:10.1038/ng.1044. http://www.nature.com/ng/journal/v44/n1/full/ng.1044.html
[12] Luke Jostins, Katherine I Morley and Jeffrey C Barrett. Imputation of low-frequency variants using the HapMap3 benefits from large, diverse reference sets. European Journal of Human Genetics (2011) 19, 662–666. http://www.nature.com/ejhg/journal/v19/n6/full/ejhg201110a.html
[13] The International HapMap 3 Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58 (02 September 2010). http://www.nature.com/nature/journal/v467/n7311/full/nature09298.html
[14] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073 (28 October 2010). http://www.nature.com/nature/journal/v467/n7319/full/nature09534.html
[15] Kwangsik Nho, Li Shen, Sungeun Kim, Shanker Swaminathan, Shannon L. Risacher, Andrew J. Saykin, and the Alzheimer's Disease Neuroimaging Initiative (ADNI). AMIA Annu Symp Proc. 2011; 2011: 1013–1018. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243280/?tool=pubmed
[16] Bryan Howie, Jonathan Marchini and Matthew Stephens. Genotype Imputation with Thousands of Genomes. G3 November 1, 2011 vol. 1 no. 6 457-470.
[17] Ke Hao, Eugene Chudin, Joshua McElwee and Eric E Schadt. Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies. BMC Genetics 2009, 10:27. http://www.biomedcentral.com/1471-2156/10/27

Appendix

Scripts:

normalize.py: corrects for inverted SNPs, minor allele frequency differences and incompatible SNPs between reference and study panel.
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/normalize.py

BEAGLE's allelic R2:
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/beagle_r2.py

Real allelic R2:
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/real_allelic_r2.py
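As a complement to the scripts listed above, the concordance and the non-reference sensitivity described under Quality measures can be sketched in Python. This is an illustrative sketch, not one of the appendix scripts and not GATK's implementation: the function names are hypothetical, genotypes are assumed to be coded as the count of the minor allele B (0 = A/A, 1 = A/B, 2 = B/B), and None marks a missing call.

```python
def _called_pairs(real, imputed):
    """Pairs of genotypes where both the real and the imputed call exist."""
    return [(r, i) for r, i in zip(real, imputed)
            if r is not None and i is not None]


def concordance(real, imputed):
    """Fraction of called genotypes that are identical in both datasets."""
    pairs = _called_pairs(real, imputed)
    if not pairs:
        return float("nan")
    return sum(r == i for r, i in pairs) / len(pairs)


def non_reference_sensitivity(real, imputed):
    """Concordance restricted to genotypes that are A/B or B/B in the real data."""
    pairs = [(r, i) for r, i in _called_pairs(real, imputed) if r > 0]
    if not pairs:
        return float("nan")
    return sum(r == i for r, i in pairs) / len(pairs)


# Eight homozygous-major genotypes dominate the overall concordance.
real = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
imputed = [0, 0, 0, 0, 0, 0, 0, 0, 0, 2]
print(concordance(real, imputed))                # 0.9
print(non_reference_sensitivity(real, imputed))  # 0.5
```

The example shows the bias mentioned in the text: one of the two genotypes carrying the minor allele is imputed wrongly, yet the overall concordance still looks high.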
