Analysis of genome-wide association studies
Lecture 1: Introduction

Linkage studies
• The traditional approach to identifying genes for human traits and diseases was through linkage.
• For Mendelian diseases (e.g. Huntington’s disease) there is a clear co-segregation of genetic markers with disease within pedigrees.
• For complex traits (e.g. type 2 diabetes), linkage analysis has been less successful because the relationship between phenotype and genotype is less clear:
  • Non-genetic risk factors influence the outcome;
  • Many genes have an impact on the trait, each having only a small effect on the outcome.

Association studies
• Common disease, common variant hypothesis: complex traits will be determined by variants that occur frequently in the population, but each have only a small impact.
• Ascertain a sample of affected cases and unaffected controls (or a random sample for a quantitative trait) from the population.
• Compare allele frequencies between cases and controls (or mean trait values between alleles for a quantitative trait).
• With sufficient sample size, this is a powerful approach to identify loci contributing to complex traits.

Common variation in the genome
• It is impractical to genotype all common variants in the genome in large samples.
• The International HapMap Project genotyped more than three million SNPs in samples from multiple ancestry groups.
• Common variation is arranged on relatively few haplotypes that occur within blocks of strong linkage disequilibrium (LD) between recombination hotspots...

Common variation in the genome
[Figure: BLOCK 1, BLOCK 2 and BLOCK 3 separated by recombination HOTSPOTs; within each block, SNPs in strong LD and low diversity of haplotypes.]
• SNPs located within the same block will be in strong LD with each other. However, a pair of SNPs located in adjacent blocks will be uncorrelated due to the high levels of recombination at the flanking hotspot.
• The strong LD between SNPs in the same block means that we tend to observe fewer haplotypes than we might expect by chance. In fact, much of the diversity is accounted for by a small number of common haplotypes.

Common variation in the genome
• Jeffreys et al. (2001) looked for evidence of recombination in the sperm of six men across a ~200kb region of the MHC.
• Crossovers cluster in six hotspots: recombination is extremely rare elsewhere in the region.
• High-resolution LD analysis in a sample of 50 unrelated individuals identified blocks with boundaries corresponding to these recombination hotspots.
(Nature Genetics 29: 217-222)

Common variation in the genome
• Dawson et al. (2002) genotyped 90 unrelated UK individuals at over 1500 SNPs across chromosome 22.
• Defined haplotype blocks as regions of limited haplotype diversity.
(Thanks to Lon Cardon)

Common variation in the genome
Genotyping this subset of SNPs accounts for all common genetic variation in the block.

Measures of LD
• Linkage disequilibrium between marker allele M and disease variant X is quantified by the measure:
  $$D = P(MX) - P(M)\,P(X)$$
• Adjustment for allele frequencies:
  $$D' = \frac{D}{D_{\mathrm{MAX}}} \qquad r^2 = \frac{D^2}{P(M)\,P(m)\,P(X)\,P(x)}$$
• We may not have typed the disease variant, but understanding patterns of LD between markers will help interpretation of the results of association studies.

Genome-wide association arrays
• It is possible to “cover” much of the common genetic variation in the human genome by genotyping only a subset of SNPs.
• The most efficient approach is to select “tags” that cover common SNPs (at some $r^2$ threshold).

• Samples genotyped with Affymetrix GeneChip 500K Mapping Array Set.
• Identified novel loci and replicated association signals for five diseases.
• Established standard protocols for quality control and analysis of GWAS.
• Publicly available control cohort.
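The LD measures defined under "Measures of LD" can be sketched in a few lines of Python. This is an illustration only: the function name `ld_measures` and its arguments are hypothetical, and the definition of D_MAX used here is the usual Lewontin convention, which the slides do not spell out.

```python
def ld_measures(p_mx, p_m, p_x):
    """Return (D, D', r^2) for marker allele M and variant X.

    p_mx     : frequency of the haplotype carrying both M and X
    p_m, p_x : frequencies of alleles M and X, each strictly between 0 and 1
    """
    p_m_alt = 1.0 - p_m  # P(m)
    p_x_alt = 1.0 - p_x  # P(x)

    # D = P(MX) - P(M) P(X)
    d = p_mx - p_m * p_x

    # Assumed convention: D_MAX is the largest |D| attainable given the
    # allele frequencies, and depends on the sign of D.
    if d >= 0:
        d_max = min(p_m * p_x_alt, p_m_alt * p_x)
    else:
        d_max = min(p_m * p_x, p_m_alt * p_x_alt)

    d_prime = d / d_max if d_max > 0 else 0.0  # keeps the sign of D
    r2 = d * d / (p_m * p_m_alt * p_x * p_x_alt)
    return d, d_prime, r2


# X is only ever found on haplotypes carrying M, but M is much more common.
d, d_prime, r2 = ld_measures(p_mx=0.2, p_m=0.5, p_x=0.2)
print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

In this example D' reaches 1 while r^2 stays well below 1, because the two alleles differ in frequency. This is why tag selection is usually phrased in terms of an r^2 threshold.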
Published Genome-Wide Associations through 07/2012
[Figure: published GWA at $p \le 5 \times 10^{-8}$ for 18 trait categories. NHGRI GWA Catalog, www.genome.gov/GWAStudies, www.ebi.ac.uk/fgpt/gwas/]

Design issues
• Is the trait heritable?!
• Phenotype definition, case-control selection, and availability of non-genetic risk factors.
• Choice of genotyping product.
• Sample size requirements.
• The GENETIC POWER CALCULATOR (http://statgen.iop.kcl.ac.uk/gpc) can be used to calculate the power of case-control studies.

QUALITY CONTROL

Introduction
• Poor study design and errors in genotype calling can introduce systematic bias in association studies.
• The result is an increase in the false positive error rate and a decrease in power.
• Assess data quality to remove sub-standard genotypes, samples and SNPs from subsequent association analysis.

Genotype calling
• For large-scale GWA studies, automated genotype calling algorithms have been developed, e.g. GENCALL and GENOSNP.
• Estimate the probability that any specific genotype is AA, AB or BB.
• Apply a threshold to the probabilities in order to call the genotype; otherwise it is treated as missing.
• The choice of calling threshold will impact results:
  • Too low: include poor quality genotypes.
  • Too high: unnecessarily remove high quality genotypes, or may introduce bias by preferentially calling specific genotypes (e.g. rare homozygotes).

Sample quality control
• Remove samples on the basis of:
  • Low call rate (poor DNA quality).
  • Outlying heterozygosity across autosomes (DNA sample contamination or inbreeding).
  • Duplication or relatedness based on identity-by-state (samples should be independent).
  • Mismatches with external information (sample mix-up).
  • Outlying population ancestry (confounding due to population structure).

Call rate and heterozygosity
[Figure: call rate and heterozygosity by sample, annotated "23-30% heterozygosity" and "3% missing".]

Identity-by-state (IBS)
• Over M markers, the IBS between the ith and jth individuals is given by
  $$\mathrm{IBS}_{ij} = 1 - \frac{1}{2M} \sum_{k} \left| G_{ik} - G_{jk} \right|$$
  where $G_{ik}$ denotes the number of minor alleles (0, 1 or 2) carried by the ith individual at SNP k.
• Identical samples will share IBS near to 100% (allowing for genotyping errors).
• Related individuals will share higher IBS than unrelated individuals.
• It is common to plot a histogram of the IBS of each individual with their “nearest neighbour”.

IBS distribution
[Figure: IBS distribution, with "Duplicates" and "Relateds" labelled.]
• Remove one sample from each duplicate or related pair (usually the one with the lowest call rate).

X chromosome
• The distribution of heterozygosity is different in males and females.
• There should be no heterozygosity in males, but expect some genotyping error.
• Discrepancies with external gender information may reflect:
  • Errors in external data;
  • Sample mix-up;
  • Gender inconsistent with sex chromosomes.

X chromosome
[Figure: each individual plotted twice according to reported gender: females in red and males in blue.]
• Should these samples be removed from the study, or the sex corrected based on heterozygosity?
• This may impact on results if sex is adjusted for in the analysis or if sex-specific analyses are to be undertaken.

SNP quality control
• Remove SNPs on the basis of:
  • Low call rate, with the threshold varying by MAF (poor quality SNP).
  • Extreme deviation from Hardy-Weinberg equilibrium in cases, controls or both for autosomes (genotyping error).
  • Extreme differential call rates between cases and controls (calling bias).
  • Study-specific SNP QC filters (such as differences in allele frequencies between multiple control cohorts).
  • Low frequency SNPs (more prone to bias due to genotyping error and low power to detect association).
  • Visual inspection of cluster plots.

Effect of differential call rate
[Figure: cluster plot annotated "Individuals called as missing?" and "Fewer heterozygotes among cases."]

Cluster plot inspection: good SNP
[Figure: cluster plot.]

Cluster plot inspection: bad SNP
[Figure: cluster plot.]

Summary
• QC criteria are subjective and vary from one study to another.
• Sample QC filters should not be so stringent as to remove the majority of the analysis cohort!
• SNP QC filters should eliminate the worst quality markers without “throwing the baby out with the bathwater”.
• All SNPs demonstrating evidence for association should be followed up with visual inspection of cluster plots.

BASIC ANALYSIS OF GENOME-WIDE ASSOCIATION STUDIES

Introduction
• Association analyses focus on the identification of SNPs that differ in allele (genotype) frequency between cases and controls.
• Basic analyses utilise standard epidemiological tools, rather than the specialised methods that have been developed for analysing more traditional pedigree and family studies:
  • contingency table analysis;
  • logistic regression modelling.

Genotype-based test
• For a sample typed at a SNP marker of interest, we can represent the genotype data in a 2 x 3 contingency table.

            Cases    Controls   Total
  MM        n2A      n2U        n2·
  Mm        n1A      n1U        n1·
  mm        n0A      n0U        n0·
  Total     n·A      n·U        n··

• The usual test for independence of rows and columns in contingency tables can be applied to test the null hypothesis of no disease-marker association:
  $$X^2 = \sum_{i \in \{0,1,2\}} \sum_{j \in \{A,U\}} \frac{\left( n_{ij} - E[n_{ij}] \right)^2}{E[n_{ij}]} \qquad \text{where} \qquad E[n_{ij}] = \frac{n_{i\cdot}\, n_{\cdot j}}{n_{\cdot\cdot}}$$
• $X^2$ has a $\chi^2$ distribution with 2 degrees of freedom under the null hypothesis.
• Odds ratio for genotype MM relative to mm:
  $$\mathrm{OR}_{MM|mm} = \frac{n_{2A}\, n_{0U}}{n_{0A}\, n_{2U}}$$
• The odds that an affected individual has marker genotype MM rather than mm are $\mathrm{OR}_{MM|mm}$ times those of an unaffected individual.
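As a minimal sketch of the genotype-based test above, the following Python computes $X^2$, its p-value and the MM versus mm odds ratio from a 2 x 3 table using only the standard library. The function name `genotype_test` and the example counts are made up for illustration; the closed form `exp(-X2 / 2)` for the p-value is specific to a chi-square distribution with 2 degrees of freedom.

```python
import math


def genotype_test(cases, controls):
    """Chi-square test of independence on a 2 x 3 genotype table.

    cases, controls : genotype counts ordered (mm, Mm, MM), i.e.
                      (n0A, n1A, n2A) and (n0U, n1U, n2U); all assumed > 0.
    Returns (X2, p_value, odds ratio of MM relative to mm).
    """
    n_a, n_u = sum(cases), sum(controls)  # column totals n.A and n.U
    n = n_a + n_u                         # grand total n..

    x2 = 0.0
    for obs_a, obs_u in zip(cases, controls):
        row_total = obs_a + obs_u         # n_i.
        for obs, col_total in ((obs_a, n_a), (obs_u, n_u)):
            expected = row_total * col_total / n  # E[n_ij] = n_i. n_.j / n..
            x2 += (obs - expected) ** 2 / expected

    # Survival function of a chi-square variable with exactly 2 df.
    p_value = math.exp(-x2 / 2.0)

    # OR(MM|mm) = n2A n0U / (n0A n2U)
    odds_ratio = (cases[2] * controls[0]) / (cases[0] * controls[2])
    return x2, p_value, odds_ratio


# Hypothetical counts for 1000 cases and 1000 controls, ordered (mm, Mm, MM).
x2, p_value, odds_ratio = genotype_test(cases=(300, 500, 200),
                                        controls=(400, 450, 150))
print(f"X2 = {x2:.2f}, p = {p_value:.2e}, OR(MM|mm) = {odds_ratio:.2f}")
```

In practice the same test is available in standard statistical software; the explicit loop is only meant to mirror the formula term by term.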
Cochran-Armitage trend test
• Assume a multiplicative model of disease risks:
  $$\mathrm{OR}_{MM|mm} = \mathrm{OR}_{Mm|mm}^{2}$$

            Cases    Controls   Total
  MM        n2A      n2U        n2·
  Mm        n1A      n1U        n1·
  mm        n0A      n0U        n0·
  Total     n·A      n·U        n··

• The Cochran-Armitage trend test of association between disease and the marker SNP is given by
  $$X^2 = \frac{\left[ \left( p_{2A} + \tfrac{1}{2} p_{1A} \right) - \left( p_{2U} + \tfrac{1}{2} p_{1U} \right) \right]^2}{\left( \frac{1}{n_{\cdot U}} + \frac{1}{n_{\cdot A}} \right) \frac{1}{n_{\cdot\cdot}^{2}} \left[ n_{\cdot\cdot} \left( \tfrac{1}{4} n_{1\cdot} + n_{2\cdot} \right) - \left( \tfrac{1}{2} n_{1\cdot} + n_{2\cdot} \right)^2 \right]} \qquad \text{where} \qquad p_{ij} = \frac{n_{ij}}{n_{\cdot j}}$$
• $X^2$ has a $\chi^2$ distribution with 1 degree of freedom under the null hypothesis.
• Odds ratio for allele M relative to allele m, $\mathrm{OR}_{M|m}$:
  [Equation: estimator of $\mathrm{OR}_{M|m}$ in terms of the genotype counts $n_{ij}$ and the row totals $n_{i\cdot}$; not recoverable from the extracted text.]
• Relative to an unaffected individual, the odds that an affected individual has marker genotype MM rather than mm are multiplied by $\mathrm{OR}_{M|m}^{2}$, and the odds of Mm rather than mm by $\mathrm{OR}_{M|m}$.

Interpretation
A significant result in a test of disease-marker association may imply:
• The marker locus is causative, directly influencing disease risk: this needs to be established via functional studies.
• Alleles at the marker locus are correlated with alleles at the disease locus, but do not directly influence disease risk: linkage disequilibrium.
• Population substructure is not accounted for in the analysis, with different disease and marker allele frequencies in each subpopulation.
• A false positive signal of association.
[Figure: three diagrams linking marker M and Disease: directly; via disease locus D; and via population (Pop’n).]

Genome-wide significance
• A type I error occurs when we reject the null hypothesis of no association when in fact the null hypothesis is true.
• Specify the type I error rate, or significance level, at the design stage of the analysis.
• A lower type I error rate reduces the probability of detecting a false positive association, but with the penalty of reducing the power to detect association when it truly exists.
• It is important to correct for multiple testing to maintain the type I error rate for the experiment overall (i.e. all the SNPs tested in the association study).
• Genome-wide significance threshold: $p < 5 \times 10^{-8}$.
• Replication is necessary to confirm association.

Logistic regression modelling
• We model the log odds of disease for the ith individual as
  $$\log \frac{p_i}{1 - p_i} = \boldsymbol{\beta}^{T} \mathbf{x}_i$$
  where $p_i$ is the probability that the ith individual is affected by disease, $\mathbf{x}_i$ is a vector of measures of the risk factors, and $\boldsymbol{\beta}$ is a vector of the corresponding risk effects.
• Over all individuals, we can obtain estimates of the risk effects by maximising the log-likelihood
  $$l(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\beta}) = \sum_{i} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$
  where $y_i$ denotes the phenotype of the ith individual (0 for control and 1 for case).
• Extremely flexible modelling framework: can test for joint effects of risk factors (multi-locus analysis) and allow for covariates (adjustment for non-genetic effects).

Additive test of association
• Assuming a multiplicative model of disease risk (additive on the log scale), we can model the log-odds of disease for the ith individual by
  $$\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_M x_{Mi}$$
  where $\beta_M$ denotes the log-odds ratio of allele M relative to allele m, and $x_{Mi}$ is a variable that counts the number of M alleles (0, 1 or 2) carried by the ith individual.
• Test for association by maximising the likelihood of the model, and comparing with the maximised likelihood under the null hypothesis of no association (i.e. $\beta_M = 0$):
  $$X^2 = 2 \left[ l(\mathbf{y} \mid \mathbf{x}_M, \hat{\beta}_0, \hat{\beta}_M) - l(\mathbf{y} \mid \mathbf{x}_M, \hat{\beta}_0, \beta_M = 0) \right]$$
• $X^2$ has a $\chi^2$ distribution with one degree of freedom under the null hypothesis.
• Asymptotically equivalent to the Cochran-Armitage trend test.

Genotypic test of association
• To allow for deviations from the multiplicative disease model, we can model the log-odds of disease for the ith individual by
  $$\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_{Mm} x_{Mmi} + \beta_{MM} x_{MMi}$$
  where $\beta_{Mm}$ and $\beta_{MM}$ denote the log-odds ratios of genotypes Mm and MM relative to genotype mm, and $x_{Mmi}$ and $x_{MMi}$ are variables indicating that the ith individual carries genotype Mm and MM, respectively.
• Test for association by maximising the likelihood of the model, and comparing with the maximised likelihood under the null hypothesis of no association (i.e. $\beta_{Mm} = \beta_{MM} = 0$):
  $$X^2 = 2 \left[ l(\mathbf{y} \mid \mathbf{x}_{Mm}, \mathbf{x}_{MM}, \hat{\beta}_0, \hat{\beta}_{Mm}, \hat{\beta}_{MM}) - l(\mathbf{y} \mid \mathbf{x}_{Mm}, \mathbf{x}_{MM}, \hat{\beta}_0, \beta_{Mm} = \beta_{MM} = 0) \right]$$
• $X^2$ has a $\chi^2$ distribution with two degrees of freedom under the null hypothesis.

General disease models
• The same modelling framework can be used to test for association under alternative disease models, such as heterozygote advantage, recessive and dominant.

  Genotype   M recessive   M dominant   Heterozygote advantage
  MM         1             1            0
  Mm         0             1            1
  mm         0             0            0

• Testing many different models of association requires correction for multiple testing.
• We can compare the likelihoods of the genotypic and trend models of association to test for a deviation from the multiplicative model of disease risks.
• The trend test is generally most powerful unless there is extreme deviation from a multiplicative disease model.
• Testing for association at a marker SNP in LD with the causal SNP weakens the effect of the underlying disease model.

Allowing for covariates
• For complex diseases, we may wish to take account of non-genetic risk factors, such as exposure to specific environments, or the effects of established genetic loci.
• Assuming a multiplicative model of disease risk, we can model the log-odds of disease for the ith individual by
  $$\log \frac{p_i}{1 - p_i} = \beta_0 + \beta_C x_{Ci} + \beta_M x_{Mi}$$
  where $\beta_C$ denotes the effect of a covariate $x_C$, and $x_{Ci}$ denotes the value of the covariate for the ith individual.
• Obtain a $\chi^2$ test of association with one degree of freedom by maximising the likelihood of the model, and comparing with the maximised likelihood under the null hypothesis of no association (i.e. $\beta_M = 0$).
• The estimated log-odds ratio of allele M is adjusted for the effect of the covariate.
• Can be generalised to allow for any number of covariates and general models of disease risk.

Quantitative traits
• The methodology described here generalises to quantitative (continuous) traits.
It is straightforward to compare the mean response for each marker genotype by analysis of variance, assuming a normally distributed trait, within the standard linear regression framework.
• A powerful strategy is to ascertain individuals from the extremes of the quantitative trait distribution: cases and hyper controls.
• We can analyse trait values by linear regression, although this leads to biased estimates of mean trait values for marker genotypes.
• We can ignore the trait values, and analyse as a standard case-control sample.
• Are hyper controls representative, or are there polygenic effects involved?
• This strategy may not be cost effective if phenotyping is expensive relative to genotyping.

Software
• Contingency table analysis and generalised linear modelling can be performed using standard statistical software.
• Define indicator variables for specific genetic models from the observed SNP genotype data.
• Some statistical software packages include specific libraries of routines to perform genetic analyses (R, STATA).
• Specialised genetic analysis software:
  • PLINK. Whole genome association analysis toolset designed to perform a range of basic, large-scale analyses. Allows for data management and basic QC analyses. Performs simple case-control tests of association.
  • SNPTEST. Designed for analysis of whole genome association studies. Allows for flexible single-locus analysis of genotype data allowing for covariates.

Replication
• To confirm positive association signals from an initial study, it is essential to replicate the result in independent samples from the same and/or different populations.
• Replication of positive association signals has not proved to be easy: it will depend on the power of both the initial and replication studies.
• Multi-stage designs: genotype a proportion of samples with a GWAS array and follow up the strongest signals of association in the remaining samples through de novo genotyping.
• Collaboration between international groups studying the same trait allows for in silico replication.

Meta-analysis
• We can increase power to detect rarer variants of more modest effect by collecting larger and larger samples.
• Alternatively, we can combine the results of GWA studies of the same trait through meta-analysis, without direct exchange of genotype data.
• Exchange summary statistics for each SNP, including the “risk” allele, p-value, odds ratio (effect) and 95% confidence interval (standard error).
• GWA studies can be combined through meta-analysis even if they were genotyped directly for different sets of SNPs, through imputation.
• Software: GWAMA and METAL.

Summary
• Standard statistical procedures are available for the analysis of genotype data from genetic association studies.
• Logistic regression provides a flexible framework for modelling SNP association with disease.
• Signals of association should be validated through replication and/or meta-analysis.
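To illustrate how the summary statistics listed under "Meta-analysis" can be combined for a single SNP, here is a minimal Python sketch. It assumes fixed-effects inverse-variance weighting, which is one common choice and is not specified in the slides. It also assumes every study reports its odds ratio for the same "risk" allele. The function names and the example figures are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def log_or_and_se(odds_ratio, ci_lower, ci_upper):
    """Recover the log-odds ratio and its standard error from an OR and 95% CI."""
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2.0 * Z_95)
    return beta, se


def fixed_effects_meta(studies):
    """Combine per-study (OR, CI lower, CI upper) tuples for one SNP.

    Assumption: fixed-effects model with weights 1 / SE^2, and all odds
    ratios expressed relative to the same allele.
    Returns (combined OR, SE of the combined log-OR, two-sided p-value).
    """
    sum_w = 0.0
    sum_w_beta = 0.0
    for odds_ratio, ci_lower, ci_upper in studies:
        beta, se = log_or_and_se(odds_ratio, ci_lower, ci_upper)
        weight = 1.0 / se ** 2
        sum_w += weight
        sum_w_beta += weight * beta

    beta_meta = sum_w_beta / sum_w
    se_meta = math.sqrt(1.0 / sum_w)
    z = beta_meta / se_meta
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return math.exp(beta_meta), se_meta, p_value


# Made-up summary statistics from three studies of the same SNP.
studies = [(1.20, 1.05, 1.37), (1.15, 1.02, 1.30), (1.25, 1.04, 1.50)]
or_meta, se_meta, p_meta = fixed_effects_meta(studies)
print(f"combined OR = {or_meta:.3f}, SE(log OR) = {se_meta:.3f}, p = {p_meta:.2e}")
```

Dedicated tools such as GWAMA and METAL also handle allele alignment and genome-wide input files, none of which is attempted here.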