Analysis of imputed rare variants
Andrew Morris
Advanced Topics in GWAS
Toronto, 30 May 2012

Introduction
• GWAS have been successful in detecting novel loci for complex traits:
  • typically characterised by common variants of modest effect;
  • together explain relatively little of the heritability.
• Low-frequency and rare variation may contribute to the “missing” genetic component of complex traits:
  • IFIH1 and type 1 diabetes;
  • MYH6 and sick sinus syndrome.

Rare variants and complex disease
• Rare variants are likely to have arisen from founder effects in the last few generations.
• Rare variants are expected to have larger effects on complex traits than common variants.
• Statistical methods focus on the accumulation of minor alleles at rare variants (mutational load) within the same functional unit.

GRANVIL
• Test of association of phenotype with proportion of rare variants at which individuals carry minor alleles.
  Example (one individual, ten rare variants; 1 = carries minor allele): 1 0 0 0 0 1 0 0 0 1, so pi = 3/10.
• Model disease phenotype via regression on pi and any other covariates in GLM framework.
Reedik Magi
http://www.well.ox.ac.uk/GRANVIL/

Assaying rare genetic variation
• Gold-standard approach to assaying rare genetic variation is through re-sequencing, which is expensive on the scale of the whole genome.
• GWAS genotyping arrays are inexpensive, but are not designed to capture rare genetic variation.
• Increasing availability of large-scale reference panels of whole-genome re-sequencing data: 1000 Genomes Project and the UK10K Project.
• Impute into GWAS scaffolds up to these reference panels to recover genotypes at rare variants at no additional cost, other than computing.

GRANVIL: imputed variants
• Test of association of phenotype with proportion of rare variants at which individuals carry minor alleles.
  Example (one individual, ten rare variants; posterior probability of carrying minor allele): 0.9 0.1 0.2 0.1 0.1 0.8 0.1 0.1 0.1 0.6, so pi = 3.0/10.
• Replace direct genotypes with posterior probability of heterozygous or rare homozygous call from imputation.
• Model disease phenotype via regression on pi and any other covariates in GLM framework.

Study question
• Can we make use of imputation into GWAS scaffolds up to re-sequencing reference panels to detect rare variant associations with complex traits?
• Simulation study performed to compare power to detect association using GRANVIL for four alternative strategies for assaying rare genetic variation.

Design and analysis strategies
Figure: phased reference panel and analysis cohort under each strategy.
• Strategy 1. Re-sequence analysis cohort.
• Strategy 2. Genotype analysis cohort for variants in reference panel. Variants not present in reference panel will be missed.
• Strategy 3. Genotype analysis cohort with GWAS chip.
• Strategy 4. Genotype analysis cohort with GWAS chip and impute variants on reference panel. Recovery of rare variants on reference panel by imputation.

Simulation study
• Simulate 1,050 kb region of genome containing 50 kb gene in phased reference panel (120, 500 or 4,000 individuals) and analysis cohort (2,000 individuals).
• Select causal variants within gene subject to maximum MAF and total MAF.
• Simulate quantitative trait for analysis cohort given causal variants and contribution to overall trait variance.

Simulation study
• Apply each strategy and test for association of rare variants (MAF < 1% in analysis cohort) with quantitative trait using GRANVIL.
• Strategies 3 and 4: GWAS Illumina 660K chip.
• Strategy 4: Imputation performed using IMPUTEv2 allowing a “buffer” of 500 kb, with low-quality imputed variants (info score < 0.4) excluded from analysis.
• Assess power to detect association at a nominal 5% significance threshold.

Figure: power at nominal 5% significance threshold, assuming 5% contribution to trait variance. Maximum MAF of causal variant: 1%. Total MAF of causal variants: 5%.
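The trait-simulation step of the simulation study can be sketched as follows. This is a minimal illustration, not the code used for the study: the function name `simulate_trait` is hypothetical, and the sketch assumes that every causal minor allele has the same effect in the same direction, with the standardised causal allele count scaled to explain the chosen share of trait variance.

```python
import numpy as np


def simulate_trait(causal_genotypes, var_explained=0.05, rng=None):
    """Simulate a quantitative trait from causal rare variants.

    causal_genotypes: (n_individuals, n_causal) minor-allele counts (0, 1, 2).
    var_explained: expected share of trait variance due to the causal variants.
    Assumption (not from the slides): equal effect per causal minor allele.
    """
    rng = np.random.default_rng(rng)
    # Causal minor-allele count carried by each individual.
    load = np.asarray(causal_genotypes, dtype=float).sum(axis=1)
    if load.std() == 0:
        raise ValueError("no individual carries a causal minor allele")
    # Standardise so the genetic component has variance var_explained.
    load_std = (load - load.mean()) / load.std()
    noise = rng.standard_normal(load.shape[0])
    return np.sqrt(var_explained) * load_std + np.sqrt(1.0 - var_explained) * noise


# Ten causal variants, each with MAF 0.5%, in an analysis cohort of 2,000.
rng = np.random.default_rng(1)
genotypes = rng.binomial(2, 0.005, size=(2000, 10))
trait = simulate_trait(genotypes, var_explained=0.05, rng=rng)
```

Repeating the simulation many times and recording how often the association test rejects at the nominal 5% level would give an empirical power estimate of the kind reported on the results slides.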
Figure: power at nominal 5% significance threshold, assuming 5% contribution to trait variance. Maximum MAF of causal variant: 0.5%. Total MAF of causal variants: 2%.

Comments
• We can recover up to 80% of the power to detect rare variant associations attained through re-sequencing by imputation into GWAS data.
• Essential to include a “buffer” for imputation.
• As the MAF of causal variants decreases, larger reference panels offer greater power.
• Limiting assumptions of simulation study:
  • No re-sequencing or phasing errors in the reference panel, and no miscalled or missing genotypes in the analysis cohort.
  • Reference panel ascertained from same population as analysis cohort.

Application to WTCCC
• GWAS of seven complex human diseases from the UK (2,000 cases each and 3,000 shared controls from 1958 British Birth Cohort and National Blood Service):
  • bipolar disorder (BD), coronary artery disease (CAD), Crohn’s disease (CD), hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (T1D) and type 2 diabetes (T2D).
• Individuals genotyped using the Affymetrix GeneChip 500K Mapping Array Set.

Quality control
• Samples excluded on the basis of mismatch with external data, low call rate, outlying heterozygosity, duplication, relatedness, and non-European ancestry.
• SNPs excluded on the basis of:
  • call rate < 95% (< 99% if MAF < 5%);
  • extreme deviation from HWE (exact p < 5.7×10⁻⁷);
  • MAF < 1%.

Cohort      Samples passing QC
Controls    2,938
BD          1,868
CAD         1,926
CD          1,748
HT          1,952
RA          1,860
T1D         1,963
T2D         1,924

A total of 16,179 samples and 391,060 high-quality autosomal SNPs were carried forward for analysis.

Fine-scale UK population structure
• Fine-scale population structure may have greater impact on rare variants than on common SNPs because of recent founder effects.
• Utilised EIGENSTRAT to construct principal components to represent axes of genetic variation across the UK: 27,770 high-quality LD-pruned (r² < 0.2) common autosomal SNPs (MAF > 5%).
Figure: Fine-scale UK population structure.

Imputation
• SNPs mapped to NCBI build 37 of human genome.
• Samples imputed up to 1000 Genomes Phase 1 cosmopolitan reference panel (June 2011 interim release).
• 8.23M imputed autosomal rare variants (MAF < 1%) polymorphic in WTCCC.
• 5.38M (65.3%) were “well-imputed” (i.e. info score > 0.4) and carried forward for analysis.
• Mean info score was 0.618, and 17.3% had info score > 0.8.

Rare variant analysis
• Test for association of each disease with accumulation of rare variants (MAF < 1%) within genes using GRANVIL.
• Gene boundaries defined from UCSC human genome database (build 37).
• Analyses adjusted for three principal components to account for fine-scale UK population structure.
• Genome-wide significance threshold p < 1.7×10⁻⁶: Bonferroni adjustment for 30,000 genes (0.05/30,000).

Figure: No evidence of residual population structure.

Rare variant association with CAD
• Genome-wide significant evidence of association of CAD with rare variants in the gene PRDM10 (p = 4.9×10⁻⁸).
• Gene contains 122 well-imputed rare variants with mean MAF of 0.23%.
• Accumulations of minor alleles across these variants were associated with decreased risk of disease: odds ratio 0.828 (0.774-0.886) per minor allele.

Rare variant association with T1D
• Genome-wide significant evidence of association of T1D with rare variants in multiple genes from the MHC.
• Strongest signal of association observed for HLA-DRA (p = 2.0×10⁻¹³).
• Gene contains 23 well-imputed rare variants with mean MAF of 0.32%.
• Accumulations of minor alleles across these variants were associated with decreased risk of disease: odds ratio 0.556 (0.476-0.650) per minor allele.

T1D association across the MHC
• Ten genes achieve genome-wide significant evidence of rare variant association with T1D.
Genes labelled in figure: HLA-DRA, PBMUCL2, NCR3, SLC44A4, HLA-DRB5, PBX2, TNXA, EHMT2, AGPAT1, C6orf10.

T1D association across the MHC
• After additional adjustment for additive effect of lead GWAS common variant from the MHC (rs9268645).
Genes labelled in figure: PBX2, SLC44A4, PBMUCL2, EHMT2, HLA-DRA, SKIV2L, TNXB, AGPAT1, HLA-DRB5, HLA-DMA.

Figure: T1D association across the MHC.

Comments
• GRANVIL assumes the same direction of effect on the trait of all rare variants within the functional unit.
• Methods allowing for different directions of effect of rare variants are well established for re-sequencing data, and are being generalised to allow for imputation.
• The most powerful rare variant test will depend on the underlying genetic architecture of the trait.

Summary
• Simulations suggest that we can recover up to 80% of the power to detect rare variant associations attained through re-sequencing by imputation into GWAS data.
• Requires no additional cost, other than computation, which is not trivial!
• Imputation up to 1000 Genomes reference panel into GWAS data from WTCCC highlighted:
  • novel association of rare genetic variation in PRDM10 with CAD;
  • complex genetic architecture underlying T1D association across the MHC involving multiple genes.

Lab practical
• Use GRANVIL to test for association of T1D with imputed rare variants within genes across the MHC, using data from the WTCCC.
• Investigate the impact on results of:
  • the MAF threshold for inclusion of rare variants in the analysis;
  • filtering rare variants on the basis of annotation;
  • gene boundary definition.