Genotype Phasing and Imputation in 1x Sequencing Data
Warren W. Kretzschmar
DPhil Genomic Medicine and Statistics
Wellcome Trust Centre for Human Genetics, Oxford, UK
Supervisor: Jonathan Marchini

Major Depression
• Commonest psychiatric disorder and the second-ranking cause of morbidity worldwide.
• Affects 1 in 10 people in their lifetime.
• Estimates of heritability range from 30% to 40%.

Top Ten causes of DALYs
DALY: Disability adjusted life year: number of years lost due to ill-health, disability or early death
Chart labels:
• Lower respiratory infections
• Other unintentional injuries
• Cerebrovascular disease
• Chronic obstructive pulmonary disease
• Major depressive disorders
• Violence
• Ischaemic heart disease
• Diabetes mellitus
• Road traffic accidents
• Alcohol use disorders

Genetics of Major Depression
Major Depressive Disorder Working Group of the Psychiatric GWAS Consortium (2012). A mega-analysis of genome-wide association studies for major depressive disorder. Molecular Psychiatry 18.4:497-511.
Study Design
• Unrelated Europeans
• 9240 cases
• 9519 controls
• 1.2 million SNPs
Hypotheses
• Depression has heterogeneous environmental and genetic causes
• Depression is a complex trait with genetic components of small effect size

CONVERGE (China, Oxford and VCU Experimental Research on Genetic Epidemiology)
59 hospitals, 45 cities, 21 provinces.
Genetically Homogeneous: All subjects are female and their grandparents are Han Chinese.
6,000 cases: typically severely affected: 85% qualify for a diagnosis of melancholia by DSM-IV. >25% reported a family history of MD in one or more first-degree relatives.
6,000 controls: patients undergoing minor surgical procedures.
Extensive Phenotyping: primary disorder of major depression, common comorbid disorders (e.g. generalized anxiety disorder, panic disorder), within-disorder symptoms (e.g. suicidal ideation), disorder subtypes (e.g. melancholia, dysthymia), possible endophenotypes (e.g. neuroticism) and a range of risk factors (e.g. child abuse, stressful life events, social and marital relationships, parenting, post-natal depression, demographics).
Sequencing: mean depth 1.7X using Illumina HiSeq at Beijing Genomics Institute

Current status
Sequencing finished. We have data on 12,000 samples.
For now we have only considered ~13M sites polymorphic in 1000 Genomes Asian samples.
Analysis ongoing…

Sequence analysis pipeline
Phase 1: genotype likelihood estimation (one sample at a time)
Raw reads -> Mapping (Stampy) -> Duplicate marking (Picard) -> Base quality recalibration (GATK) -> Genotype likelihood estimation (SNPTools) -> Genotype likelihoods
Phase 2: phasing and imputation (all samples together). My focus!
Genotype likelihoods -> Phasing and imputation -> Genotype probabilities
Data sizes shown on the diagram: 48 TB, 350 GB, 650 GB
Compute costs shown on the diagram: 2.7 CPU years, 5 CPU years, 4.6 CPU years

GENOTYPE PHASING AND IMPUTATION

Genotype Phasing Example
SNP chip data
Unphased: G/G A/T A/A T/T G/T A/T T/T A/A G/G G/C
After Phasing
Hap 1: G A A T T T T A G C
Hap 2: G T A T G A T A G G
Phase-informative Sites

Genotype Imputation from Haplotypes
J Marchini and B Howie. Nature Rev. Genet. 2010

GENOTYPE LIKELIHOODS

What is a Genotype Likelihood?
Genotype likelihoods (aka GLs) are defined on a site-by-site basis.
GLs are conditional probabilities.
Genotype Likelihood = Pr( R | G )
R = Reads; also known as the “observed data”
G = Genotype; usually one of ref/ref, ref/alt, alt/alt

How are Genotype Likelihoods Useful?
Genotype likelihoods allow us to quantify how much the reads support each possible genotype, independent of other information.
To determine the most likely genotype call, we need a genotype probability.
Genotype Probability = Pr( G | R ), which is proportional to Pr( R | G ) * Pr( G )
Pr( G ) = prior probability of G. May be determined through haplotype phasing and imputation approaches.

Genotype Likelihood Creation with SNPTools
observed reads
Pr(R|G = ref/ref) = 0.06
Pr(R|G = alt/alt) = 10e-6
Pr(R|G = ref/alt) = 10e-3
Y Wang, J Lu, J Yu, RA Gibbs, FL Yu. Genome Research. 2013

Genotype Phasing using Genotype Likelihoods
Reference Haplotypes
Hap 1: G A A T T A C A G G
Hap 2: G T A T T A T A G G
Hap 3: G T A T G A C A G G
Hap 4: G T A T G A T A G C
Example GL data
Pr(ref/ref):
Pr(ref/alt):
Pr(alt/alt):
Plausible Haplotypes after Phasing
Hap 5: G A A T T A T A G C
Hap 6: G T A T T A T A G G

General MCMC Scheme for Phasing from GLs
When using GLs, haplotype estimation is currently done in an iterative Markov Chain Monte Carlo (MCMC) scheme:
1. Initialize haplotypes for each sample randomly
2. For a predetermined number of iterations
   1. For each sample
      1. Find a plausible haplotype pair using its GLs and all other haplotypes as a reference panel
      2. Update that sample’s haplotypes with the plausible haplotype pair
3. Return each sample’s current pair of haplotypes

The Tools/Languages I use
Coding: Emacs
Scripting: Perl with DistributedMake for pipelines
Statistical Methods: C++
Figure Generation: R
Statistical Analysis & Report Writing: LaTeX with Sweave
Presentations: PowerPoint or LaTeX

A Bioinformatician’s Best Practices
according to Nick Loman & Mick Watson. Nature Biotechnology. 2013
see also: W. S. Noble. PLoS Computational Biology. 2009
- Understand your goals and choose appropriate methods
- Be suspicious and trust nobody
- Set traps for your own scripts and other people’s
- Be a detective
- You're a scientist, not a programmer
- Use version control software
- Pipelineitis is a nasty disease
- An Obama frame of mind
- Someone has already done this. Find them!

Good Directory Structure
according to W. S. Noble. PLoS Computational Biology. 2009

Thank you. Questions?
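
Backup: From Genotype Likelihoods to a Genotype Call
A minimal sketch of the relation on the “How are Genotype Likelihoods Useful?” slide, Pr( G | R ) proportional to Pr( R | G ) * Pr( G ). It reuses the three likelihoods from the SNPTools slide, read here as Python float literals (so 10e-3 is 0.01). The prior is an assumption for illustration only: a Hardy-Weinberg prior at a made-up alt-allele frequency of 0.2 stands in for the prior that phasing and imputation would supply. The function names are hypothetical and not part of SNPTools or any other tool mentioned in the talk.

```python
from typing import Dict

GENOTYPES = ("ref/ref", "ref/alt", "alt/alt")


def hardy_weinberg_prior(alt_freq: float) -> Dict[str, float]:
    """Illustrative prior Pr(G) from an assumed alt-allele frequency.

    This is a stand-in: in the talk, the prior comes from haplotype
    phasing and imputation, not from Hardy-Weinberg proportions.
    """
    ref_freq = 1.0 - alt_freq
    return {
        "ref/ref": ref_freq * ref_freq,
        "ref/alt": 2.0 * ref_freq * alt_freq,
        "alt/alt": alt_freq * alt_freq,
    }


def genotype_probabilities(
    likelihoods: Dict[str, float], prior: Dict[str, float]
) -> Dict[str, float]:
    """Return Pr(G | R) by normalizing Pr(R | G) * Pr(G) over genotypes."""
    unnormalized = {g: likelihoods[g] * prior[g] for g in GENOTYPES}
    total = sum(unnormalized.values())
    if total <= 0.0:
        raise ValueError("likelihoods and prior give zero total probability")
    return {g: value / total for g, value in unnormalized.items()}


# Likelihoods Pr(R | G) as written on the SNPTools slide.
likelihoods = {"ref/ref": 0.06, "ref/alt": 10e-3, "alt/alt": 10e-6}

# Assumed allele frequency; not taken from the talk.
prior = hardy_weinberg_prior(alt_freq=0.2)

posterior = genotype_probabilities(likelihoods, prior)
best_call = max(posterior, key=posterior.get)

for genotype in GENOTYPES:
    print(f"Pr(G = {genotype} | R) = {posterior[genotype]:.6f}")
print("Most likely genotype:", best_call)
```

With these inputs the reads and the prior both favour ref/ref, so the call is ref/alt or alt/alt only if the prior shifts strongly toward the alt allele. That is the sense in which a better prior from a reference panel can change calls at low depth.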