Some practical notes about imputation

The most important precondition for imputation is that the study and reference panels are
as compatible as possible. The whole process is based on the SNPs that the two panels
share, so we have to make sure that no discrepancies exist among them. Specifically, study
and reference should be aligned to the same genome assembly, checked for inverted
SNP alleles, and checked for differences in minor allele frequency (MAF) and
Hardy-Weinberg Equilibrium (HWE). If the reference and the study are aligned to
different genome assemblies, we recommend re-aligning the study panel to the
assembly of the reference rather than the opposite. The haplotype structure of the
reference panel, which has been extracted through phasing, would be distorted if the
re-alignment changed the positions of its markers.
Re-aligning a genotype dataset to a new genome assembly is called lifting over
(liftover). GATK [6] is one of the tools that can perform it.
Reference panels that have been pre-processed for imputation can be acquired from
the websites of the imputation software:
IMPUTE2:
http://mathgen.stats.ox.ac.uk/impute/impute_v2.html#download_reference_data
BEAGLE:
http://faculty.washington.edu/browning/beagle/beagle.html#reference_panels
MACH:
http://www.sph.umich.edu/csg/abecasis/MACH/download/1000G-PhaseIInterim.html
Alternatively, an imputation reference panel can be created from the raw data. For
example, raw variant calls of the 1000 Genomes Project in compressed Variant Call
Format (VCF) [5] are available here (October 2011 release):
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521
One of the best-known tools for managing VCF files is vcftools [8]. After downloading, we
can use it to extract the SNPs, disregard any other variation, and apply a basic quality
filter: MAF above 1% and HWE p-value above 10^-4. We strongly advise scrutinizing the
reference panel for any HWE or MAF discrepancy, as it is likely to be transferred to the
imputation results. The qq-plot of the HWE chi-square values of the SNPs that passed the
above filter reveals that the 1000 Genomes (1KG) data contain an excess of SNPs with
abnormal allele distributions. Applying a more stringent filter corrects the
inflation but, of course, decreases the variation in the reference panel.
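The filter described above can be illustrated with a minimal Python sketch. It is not the code that vcftools or PLINK run (both may use an exact HWE test); it only shows the asymptotic chi-square test with one degree of freedom that corresponds to the chi-square values of the qq-plot. The function names and the genotype-count input are assumptions made for this illustration.

```python
import math


def maf(n_aa, n_ab, n_bb):
    """Minor allele frequency from the genotype counts AA, AB and BB."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    return min(p, 1 - p)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic of observed versus HWE-expected genotype counts."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic SNPs).
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def hwe_p_value(chi2):
    """Upper-tail probability of a chi-square value with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))


def passes_filter(n_aa, n_ab, n_bb, min_maf=0.01, min_hwe_p=1e-4):
    """True if the SNP passes the basic MAF and HWE filter of the text."""
    p_value = hwe_p_value(hwe_chi_square(n_aa, n_ab, n_bb))
    return maf(n_aa, n_ab, n_bb) >= min_maf and p_value >= min_hwe_p
```

For example, a SNP with 25 AA, 50 AB and 25 BB genotypes is in perfect equilibrium and passes, whereas a SNP with 50 AA, 0 AB and 50 BB genotypes has a chi-square value of 100 and is removed.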
The next step consists of the format conversions. Table 1 lists the tools needed
to perform the appropriate conversions. To impute with BEAGLE we have to
convert the study and the reference from VCF to BEAGLE format. To impute with IMPUTE2
we have to convert the reference panel to hap and legend format and the study panel to
gen and samples format.
Table 1. Tools for the format conversions.
Origin format | Target format | Tool                | Notes
VCF           | PED / MAP     | vcftools [8]        | This can be very slow for large datasets. We recommend converting from VCF to TPED/TFAM and then using PLINK to convert to PED/MAP.
VCF           | Hap, legend   | vcftools            | IMPUTE2 reference panel. Does not support missing genotypes. Only markers with 100% calling rate will be exported.
PED / MAP     | gen, samples  | gtools [9]          | IMPUTE2 study panel.
PED / MAP     | beagle        | linkage2beagle [10] | BEAGLE study and reference panel.
The HWE and MAF filtering should be applied to the study panel as well. The next step is
the correction of possible strand alignment issues. IMPUTE2 and MACH contain options
to fix misaligned alleles between study and reference panel by inverting the alleles
whenever this is possible. In practice this corrects AC/GT and AG/CT misalignments,
whereas all other combinations (for example AC/AG) are considered incompatible
and the respective SNPs are removed from the study panel. We do not recommend
relying on these corrections; instead, always align the genotypes of the study panel to
the reference beforehand. IMPUTE2’s reference datasets are aligned to the ‘+’ strand, so
if the architecture of the study’s genotyping platform is known, the strand correction can
be performed prior to the imputation. An extra quality step is to check for significant
differences in minor allele frequencies between study and reference. An absolute
difference of more than 0.25 indicates a major structural dissimilarity, possibly coming
from a genotyping error.
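The allele checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the logic of IMPUTE2, MACH or the normalize.py script of the appendix: the function names are hypothetical, alleles are given as pairs of single letters, and the extra "ambiguous" category (A/T and C/G SNPs, whose strand cannot be inferred from the alleles alone) is an addition of this sketch that the text does not discuss.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(study, reference):
    """Compare the two alleles of a SNP in the study and in the reference.

    Both arguments are pairs such as ("A", "C"); allele order is ignored.
    Returns "aligned", "inverted", "incompatible" or "ambiguous".
    """
    study_set = frozenset(study)
    reference_set = frozenset(reference)
    # Alleles as they would read on the opposite strand.
    flipped = frozenset(COMPLEMENT.get(a, a) for a in study)
    if study_set == flipped:
        return "ambiguous"      # A/T or C/G: flipping changes nothing
    if study_set == reference_set:
        return "aligned"
    if flipped == reference_set:
        return "inverted"       # e.g. AC in the study, GT in the reference
    return "incompatible"       # e.g. AC in the study, AG in the reference


def is_abnormal(study_maf, reference_maf, threshold=0.25):
    """True if the minor allele frequencies differ by more than the threshold."""
    return abs(study_maf - reference_maf) > threshold
```

With this sketch, classify_alleles(("A", "C"), ("G", "T")) reports an inverted SNP that can be flipped, while classify_alleles(("A", "C"), ("A", "G")) reports an incompatible SNP that has to be removed.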
Here we demonstrate how to apply the presented filters with the PLINK software
[7]. We assume that the study and the reference are in PED / MAP format.
Apply MAF and HWE filtering to the study:
plink --file study --maf 0.01 --hwe 0.0001 --recode --out study_1
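Substitute the SNP identifier with the chromosome and position (chromosome_position),
in both the study and the reference (stored here as reference_1.ped / reference_1.map):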
cat study_1.map | awk '{print $1, $1"_"$4, $3, $4}' > study_2.map
cat reference_1.map | awk '{print $1, $1"_"$4, $3, $4}' > reference_2.map
cp study_1.ped study_2.ped
cp reference_1.ped reference_2.ped
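Create an alphanumerically sorted file of the new position-based identifiers (the second
column of the .map files, which is what --extract matches against):
cat study_2.map | awk '{print $2}' | sort > study.pos
cat reference_2.map | awk '{print $2}' | sort > reference.pos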
Get the common positions:
comm -12 study.pos reference.pos > common.pos
Extract the allele frequencies of the common SNPs from the study and the reference:
plink --file study_2 --freq --extract common.pos --out study_2
plink --file reference_2 --freq --extract common.pos --out reference_2
Paste the two generated frequency files horizontally:
paste study_2.frq reference_2.frq > common.frq
The “normalize.py” python script in the appendix parses the common.frq file and
reports the SNPs that have incompatible alleles, inverted alleles and abnormal
frequencies:
python normalize.py
Remove the incompatible and abnormal SNPs:
plink --file study_2 --exclude abnormal.txt --recode --out study_3
plink --file study_3 --exclude incompatible.txt --recode --out study_4
We can also flip the inverted alleles:
plink --file study_4 --flip inverted.txt --recode --out study_5
Now we can convert the files study_5.map and study_5.ped to the format required by the
imputation tool of our choice according to Table 1. The final step is to split the
study panel into bins of samples and into chromosomal regions. The size of the bins and
regions depends on the available computational environment and the parallelization
options.
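The splitting step can be sketched as follows. This is a minimal illustration, not part of any imputation tool: the function names are hypothetical, and the region and bin sizes in the usage note are arbitrary values to be adapted to the available hardware.

```python
def region_chunks(start, end, size):
    """Yield inclusive (chunk_start, chunk_end) intervals that cover [start, end]."""
    position = start
    while position <= end:
        yield position, min(position + size - 1, end)
        position += size


def sample_bins(samples, bin_size):
    """Split a list of sample identifiers into consecutive bins of bin_size."""
    return [samples[i:i + bin_size] for i in range(0, len(samples), bin_size)]
```

For example, region_chunks(1, 50000000, 5000000) produces ten regions of 5 Mb each, and every combination of a region and a sample bin can then be submitted as an independent imputation job.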
Quality measures
Here we present the various metrics that assess the quality of an imputation
experiment. The metrics fall into two categories, depending on whether real (unimputed)
genotypes are available or not.
The most common imputation metric is R2, which represents the squared correlation
between the imputed and the real genotypes. Since the real genotypes are unknown, various
statistics can be used to estimate it. Marchini and Howie [1] present a thorough review of
the R2 metrics used by the MACH, BEAGLE, SNPTEST and IMPUTE v1 and v2 software. In
the appendix, we provide source code for the estimation of these metrics. Another R2
metric [2] is the ratio of the variance of the imputed allele dosage to the variance of the
true allele dosage. The variance of the true allele dosage is unknown, but it can be
estimated as 2p(1−p) under Hardy-Weinberg equilibrium, where p is the estimated
allele frequency.
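The variance-ratio metric can be written down directly from this definition. The sketch below is an illustration, not the code of the appendix: the function name is hypothetical, dosages are assumed to be expected allele counts between 0 and 2, p is estimated as half the mean dosage, and the variance is the population variance (division by n), which is one of several reasonable choices.

```python
def variance_ratio_r2(dosages):
    """Ratio of the observed dosage variance to the HWE expectation 2p(1-p).

    dosages -- imputed allele dosages of one SNP, one value in [0, 2] per sample.
    Returns None for monomorphic SNPs, where 2p(1-p) is zero.
    """
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2                      # estimated allele frequency
    expected_variance = 2 * p * (1 - p)
    if expected_variance == 0:
        return None
    observed_variance = sum((d - mean) ** 2 for d in dosages) / n
    return observed_variance / expected_variance
```

Confidently imputed dosages such as [0, 1, 2, 1] give a ratio of 1, while a SNP for which every sample receives the same uninformative dosage, such as [1, 1, 1, 1], gives 0.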
Other metrics include the allelic frequency error and the standardized allele
frequency error [3]. After computing a suitable R2 value for each SNP, we can draw useful
conclusions by plotting the percentage of SNPs with R2 above 0.8 for various minor
allele frequency bins (for an example see [11]). This graph shows how well rare and
common SNPs were imputed and can be used to compare different imputation
reference panels.
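The numbers behind such a graph can be computed with a short Python sketch. The function name, the (maf, r2) input pairs and the bin boundaries are assumptions of this illustration; the 0.8 threshold is the one used in the text.

```python
def percent_well_imputed(snps, bin_edges, r2_threshold=0.8):
    """Percentage of SNPs with R2 above the threshold in each MAF bin.

    snps      -- iterable of (maf, r2) pairs, one per imputed SNP
    bin_edges -- ascending MAF boundaries; bin i covers (edge[i], edge[i + 1]]
    Returns a list of (lower, upper, n_snps, percent); percent is None
    for bins that contain no SNP.
    """
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    totals = [0] * len(bins)
    good = [0] * len(bins)
    for maf, r2 in snps:
        for i, (lower, upper) in enumerate(bins):
            if lower < maf <= upper:
                totals[i] += 1
                if r2 > r2_threshold:
                    good[i] += 1
                break
    return [
        (lower, upper, n, 100.0 * g / n if n else None)
        for (lower, upper), n, g in zip(bins, totals, good)
    ]
```

Running the function once per reference panel on the same study data gives one curve per panel, which can then be plotted against the MAF bins for comparison.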
When real data are available, the most useful metric is the R2 of the correlation
between the allelic dosage from imputation and the true genotyped dosage (code available
in the appendix). We can also assess the performance of the imputation R2 value by
estimating the overconfident and under-confident genotypes. Overconfident genotypes are
wrongly imputed genotypes with high R2 values; under-confident genotypes are the
opposite, correctly imputed genotypes with low R2 values. Another quality metric is the
concordance between real and imputed genotypes. The large number of concordant
genotypes that are homozygous for the major allele can easily bias the concordance
estimate. This is why, if A and B are the major and minor alleles respectively, it is
essential to also provide an estimate of the non-reference sensitivity, that is, the
concordance of the genotypes that are A/B or B/B in the “real” dataset.
GATK contains options to perform these estimations for VCF files. To compare
different imputation methods, we can examine the graph of the percentage of discordance
versus the percentage of missing genotypes for various thresholds of the genotype
probability [4]. To construct the graph, we call only the genotypes whose probability
reaches the threshold, treat the rest as missing, and measure the discordance between
imputed and real data against the percentage of missing genotypes as the threshold
increases from 0.33 to 0.99.
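The concordance-based metrics can be sketched in Python as follows. This is an illustration under stated assumptions and not the GATK implementation: the function names are hypothetical, genotypes are coded as the number of minor alleles (0 for A/A, 1 for A/B, 2 for B/B), and each imputed genotype comes with a triple of probabilities for these three classes.

```python
def concordance(real, imputed):
    """Fraction of genotypes that are identical in the real and imputed data."""
    return sum(r == i for r, i in zip(real, imputed)) / len(real)


def non_reference_sensitivity(real, imputed):
    """Concordance restricted to genotypes that are A/B or B/B in the real data."""
    non_ref = [(r, i) for r, i in zip(real, imputed) if r > 0]
    return sum(r == i for r, i in non_ref) / len(non_ref)


def discordance_vs_missing(real, probabilities, thresholds):
    """Points of the discordance versus missing-genotypes graph.

    probabilities -- one (P(A/A), P(A/B), P(B/B)) triple per genotype
    Returns a list of (threshold, percent_missing, percent_discordant);
    percent_discordant is None when no genotype reaches the threshold.
    """
    curve = []
    for threshold in thresholds:
        called = wrong = 0
        for r, probs in zip(real, probabilities):
            best = max(range(3), key=lambda g: probs[g])  # most likely genotype
            if probs[best] >= threshold:
                called += 1
                wrong += best != r
        missing = 100.0 * (len(real) - called) / len(real)
        discordant = 100.0 * wrong / called if called else None
        curve.append((threshold, missing, discordant))
    return curve
```

A method whose curve lies lower, that is, with less discordance at the same percentage of missing genotypes, performs better.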
Reference datasets
The construction of a novel imputation reference dataset is a complex procedure that
requires the dense genotyping and phasing of hundreds of individuals from a specific
population. Additional filters and stringent quality controls have to be applied to
exclude any potential artifacts and biases [11]. The most thoroughly documented and
widely available imputation reference sets come from the HapMap and 1000
Genomes consortia. The HapMap 3 reference set comes from samples that have been
separately genotyped on two different platforms, the Affymetrix Human SNP Array 6.0 and
the Illumina Human 1M-single beadchip. The dataset consists of 1,184 individuals from 11
populations typed at 1,440,616 SNPs [13]. A different approach was followed by the
1000 Genomes Project [14]. Its genotypes come from low-coverage whole-genome
sequencing of individuals belonging to the same populations. Depending on the
population, the number of SNPs included varies from 6,273,441 (CHB+JPT) to
10,938,130 (YRI). The union across samples is 15,275,256 SNPs. It becomes evident
that the main difference between HapMap 3 and the 1000 Genomes Project is that the
former has fewer but higher-quality genotypes, while the latter contains previously
unknown but less reliable variation. It is important to note that the 1000 Genomes
Project is an ongoing project that aims to sequence nearly all of the HapMap 3 samples.
In its final form it will contain approximately 2,500 individuals from nearly 25
populations. IMPUTE2 has the option to combine two different reference datasets, which is
very convenient for the presented datasets and, as expected, yields superior results
compared with imputation with HapMap 3 or the 1000 Genomes Project alone [15].
Which reference population to select
The main problem of the existing reference panels is their relatively small sample size.
Especially for the imputation of low-frequency variants, the missing genotype might not
be represented adequately in the reference dataset due to sampling error.
Nevertheless, this genotype might exist in a haplotype of a different population, so
population mixing usually improves imputation quality [16], [17]. The improvement that
population mixing brings for low-frequency variants is substantial and can
reach the quality of common variants [12]. IMPUTE2 includes a sophisticated method for
selecting the haplotype subset from the complete reference dataset with which it will
perform imputation. This eases the procedure, as the researcher does not have to
pre-define the reference population. Experiments with population mixing indicated that
even when we mix populations of different ancestry (e.g. TSI and CHB+JPT) the
performance increases, although the mixture percentage should not exceed 20% [12].
References
[1] Jonathan Marchini & Bryan Howie. Genotype imputation for genome-wide
association studies. Nature Reviews Genetics 11, 499-511. Supplementary
information S3.
[2] P.I. de Bakker, M.A. Ferreira, X. Jia, B.M. Neale, S. Raychaudhuri, B.F. Voight.
Practical aspects of imputation-driven meta-analysis of genome-wide
association studies.
[3] Brian L. Browning and Sharon R. Browning. A Unified Approach to Genotype
Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and
Unrelated Individuals.
[4] Howie BN, Donnelly P, Marchini J (2009). A Flexible and Accurate Genotype
Imputation Method for the Next Generation of Genome-Wide Association
Studies. PLoS Genet 5(6): e1000529. doi:10.1371/journal.pgen.1000529
[5] Danecek, P., Auton, A., Abecasis, G., Albers, C. A., Banks, E., DePristo, M. A.,
Handsaker, R. E., Lunter, G., Marth, G. T., Sherry, S. T., McVean, G., Durbin, R.,
1000 Genomes Project Analysis Group. The variant call format and VCFtools.
Bioinformatics (2011) 27 (15): 2156-2158.
[6] Mark A DePristo, Eric Banks, Ryan Poplin, Kiran V Garimella, Jared R Maguire,
Christopher Hartl, Anthony A Philippakis, Guillermo del Angel, Manuel A Rivas,
Matt Hanna, Aaron McKenna, Tim J Fennell, Andrew M Kernytsky, Andrey Y
Sivachenko, Kristian Cibulskis, Stacey B Gabriel, David Altshuler & Mark J Daly. A
framework for variation discovery and genotyping using next-generation DNA
sequencing data. Nature Genetics 43, 491–498 (2011). doi:10.1038/ng.806
[7] Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J,
Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007). PLINK: a toolset for
whole-genome association and population-based linkage analysis. American Journal of
Human Genetics, 81.
[8] http://vcftools.sourceforge.net/
[9] http://www.well.ox.ac.uk/~cfreeman/software/gwas/gtool.html
[10] http://faculty.washington.edu/browning/beagle_utilities/utilities.html
[11] Zhaoming Wang, Kevin B Jacobs, Meredith Yeager, Amy Hutchinson, Joshua
Sampson, Nilanjan Chatterjee, Demetrius Albanes, Sonja I Berndt, Charles C
Chung, W Ryan Diver, Susan M Gapstur, Lauren R Teras, Christopher A Haiman,
Brian E Henderson, Daniel Stram, Xiang Deng, Ann W Hsing, Jarmo Virtamo,
Michael A Eberle, Jennifer L Stone, Mark P Purdue, Phil Taylor, Margaret
Tucker & Stephen J Chanock. Improved imputation of common and uncommon
SNPs with a new reference set. Nature Genetics 44, 6–7 (2012).
doi:10.1038/ng.1044
http://www.nature.com/ng/journal/v44/n1/full/ng.1044.html
[12] Luke Jostins, Katherine I Morley and Jeffrey C Barrett. Imputation of
low-frequency variants using the HapMap3 benefits from large, diverse reference
sets. European Journal of Human Genetics (2011) 19, 662–666.
http://www.nature.com/ejhg/journal/v19/n6/full/ejhg201110a.html
[13] The International HapMap 3 Consortium. Integrating common and rare genetic
variation in diverse human populations. Nature 467, 52–58 (02 September
2010).
http://www.nature.com/nature/journal/v467/n7311/full/nature09298.html
[14] The 1000 Genomes Project Consortium. A map of human genome variation from
population-scale sequencing. Nature 467, 1061–1073 (28 October 2010).
http://www.nature.com/nature/journal/v467/n7319/full/nature09534.html
[15] Kwangsik Nho, Li Shen, Sungeun Kim, Shanker Swaminathan, BTech, Shannon L.
Risacher, Andrew J. Saykin, and the Alzheimer’s Disease Neuroimaging Initiative
(ADNI). AMIA Annu Symp Proc. 2011; 2011: 1013–1018.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3243280/?tool=pubmed
[16] Bryan Howie, Jonathan Marchini and Matthew Stephens. Genotype Imputation
with Thousands of Genomes. G3 November 1, 2011 vol. 1 no. 6 457-470.
[17] Ke Hao, Eugene Chudin, Joshua McElwee and Eric E Schadt. Accuracy of
genome-wide imputation of untyped markers and impacts on statistical power for
association studies. BMC Genetics 2009, 10:27.
http://www.biomedcentral.com/1471-2156/10/27
Appendix
Scripts:
normalize.py. Correct for inverted SNPs, Minor Allele Frequency
differences and incompatible SNPs between reference and study panel:
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/normalize.py
BEAGLE’s Allelic R2:
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/beagle_r2.py
Real Allelic R2:
http://www.bbmriwiki.nl/svn/Imputation/PaperMethodology/scripts/real_allelic_r2.py