DTL Focus meeting: Using GRCh38 in NGS data analysis

Time slot     Speaker                    Subject
12:45-13:00   Coffee/tea                 Coffee/tea
13:00-13:20   Ies Nijman (UMCU)          Welcome & Introduction to GRCh38 (hg20)
13:20-13:40   Pieter Neerinx (UMCG)      Migration of tools, pipelines to support GRCh38
13:40-14:00   Pjotr Prins                BWA handling of ALT contigs
14:00-14:10   Tea break                  Tea break
14:10-14:30   Zuotian Tatum (LUMC)       New insights on Differential Gene Expression using GRCh38
14:30-14:50   Wibowo Arindrarto (LUMC)   Comparison of hg19 and GRCh38 in the study of DUX4 gene
14:50-15:30   Ies Nijman (UMCU)          Wrap-up and open discussions

GRCh38 / hg20

Human genome build hg20
• Basic new assembly released Dec 24th, 2013; now GRCh38.p2 (Dec 8th, 2014)
• 5-7 megabases of added sequence to the primary reference
• Many corrected regions (patches) to hg19
• 261 alternative loci: chromosomal regions with high variability (~66 Mb)
• 128 large unplaced sequence regions
• Human_herpes_virus (EBV) mapping decoy (171 kb)
• Centromere sequences: gaps are replaced by sequence models of the centromere repeats
• New mitochondrial sequence: Revised Cambridge Reference Sequence (rCRS) from MITOMAP
• 4 PAR regions
• This means that coordinates change! Lift-over strategies will not completely solve it.

Human genome build hg20
• New genebuild now available (20.364 coding genes; 2.101 in alternative loci)
• Only a few calling/annotation tools support hg20 yet (VEP, for instance)
• Ensembl default genome is hg20!! The latest hg19 site is being maintained through an archive link.
• dbSNP locations available for hg20
• 1000G data will be remapped and recalled (est. Q1/Q2 2015)

Human genome build hg20 -Challenges and opportunities-
• How to use these alternative loci? In hg19 only a few were present, and they were mostly blissfully ignored.
• Challenge I: mapping strategy and tools need to be changed
  • In prep: iBWA, srprism
  • BWA 0.7.12 (29 Dec 2014) supports ALTs in a two-step approach
• Challenge II: variant callers need to be aware of alternative references (and context)
• Challenge III: how to display this data in genome browsers etc., while maintaining context?
• Challenge IV: nomenclature
• The primary assembly contains all patches and fixes to hg19 and is still a good starting point.

What are these ALT loci?
• Scaffolds that provide an alternate representation of a locus found in the primary reference.
• Long regions with clustered variations (e.g. LRC/KIR on chr19 and the MHC/HLA loci on chr6)
• Besides different haplotype variants of genes, they also contain genes not in the primary assembly (20 protein-coding, ~40 predicted protein-coding, pseudogenes, lincRNAs)
• Mind: ALTernative approaches at NCBI and Ensembl: NCBI uses primary chromosomes plus ALT loci, while Ensembl builds a completely new ALT chromosome (so including identical sequence)

Usage scenarios
• I: use primary reference (toplevel chrs)
• II: use primary reference + mapping decoys (Un + EBV)
  • Improves mapping accuracy
  • Only feed primary reference to variant caller
• III: use primary reference + ALT loci + mapping decoys (Un + EBV)
  • Improves mapping accuracy (?)
  • A: Only feed primary reference to variant caller
  • B: Run variant caller on all loci…

Adding the mapping decoys

Class         Grch38_full_plus_analysisset   Grch38_full_analysisset
              Total bp                       Total bp
Primary       3.088.286.401                  3.088.286.401
Unlocalized       6.978.808                      6.978.808
Unplaced          4.485.509                      4.485.509
ALT             109.535.387                    109.535.387
decoy             5.964.345                        171.823
Total         3.215.250.450                  3.209.457.928

[Figure: graphs based on 11 X Ten WGS samples]

GRCh37.p13: Improved alignments outside of fix patch regions
[Figure: regions outside of fix patches, comparing hs37d5 and GRCh37.p13]
Jason Harris
Personalis, Inc.
| Confidential and Proprietary

Heng Li: BWA approach to ALT mapping
• ALTs supported in >v0.7.11 through an additional ID-list file, $ref.alt
• Advised to use the NCBI NGS analysis sets (3 flavors), with slightly modified sequences to facilitate mapping (hard-masked PAR and centromeric regions)
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only. The reported mapQ of an ALT hit is always computed across all hits.
2. An ALT hit is only reported if its score is better than all overlapping non-ALT hits. A reported ALT hit is flagged with 0x800 (supplementary) unless there are no non-ALT hits.
3. The mapQ of a non-ALT hit is reduced to zero if its score is less than 80% (controlled by option -g) of the score of an overlapping ALT hit. In this case, the original mapQ is moved to the om tag.

Variant calling on ALTs?
• By adding the ALT loci in mapping and calling we gain better haplotype-aware mappings/calls, but this is not clearly reflected in the VCF
• Adding ‘haplotyping’ to the VCF format (A. Quinlan, Virginia, GRC WS 2014)

Variant Annotation on HG20 / ALTs
• Ensembl VEP
• SnpEff
• dbNSFP in next release (~May)

Nomenclature
chr19_KI270938v1_alt
CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1
hg38 / GRCh38 not hg20 please…
GenBank: KI270886.1
RefSeq: NT_187640.1

Everything is in a state of flux, including the status quo. -Robert Byrne-
• Even 1.5 years after the release, many things are uncertain about the use of the full build.
• GATK is remarkably silent
• Ewan Birney and Richard Durbin agreed on March 24th to build a new reference/analysis set with a more standardized set of chrs, ALTs and decoys (pers. comm.).
• Heng Li: “The current BWA-MEM method is just a start. [...] We may make changes. It is also possible that we might make breakthrough on the representation of multiple genomes, in which case, we can even get rid of ALT contigs for good.”
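
Rules 2 and 3 listed under "Heng Li: BWA approach to ALT mapping" can be illustrated with a minimal Python sketch. This is not BWA's actual implementation: the `Hit` record and the `apply_alt_rules` function are hypothetical names, all hits passed in are assumed to overlap each other, and the mapQ values are assumed to have been computed already as described in rule 1. The 80% threshold is exposed as the `min_ratio` parameter.

```python
from dataclasses import dataclass, field

SUPPLEMENTARY = 0x800  # SAM flag bit named in rule 2


@dataclass
class Hit:
    """Simplified alignment hit (hypothetical record, not a BWA structure)."""
    contig: str
    score: int
    mapq: int      # assumed to be precomputed as described in rule 1
    is_alt: bool
    flag: int = 0
    tags: dict = field(default_factory=dict)


def apply_alt_rules(hits, min_ratio=0.8):
    """Apply rules 2 and 3 to the hits of one read.

    Assumption: every hit in `hits` overlaps every other hit.
    Hits are modified in place; the hits to report are returned.
    """
    non_alt = [h for h in hits if not h.is_alt]
    alt = [h for h in hits if h.is_alt]
    reported = []

    # Rule 3: zero the mapQ of a non-ALT hit whose score is below
    # min_ratio times the score of an ALT hit; keep the old value in "om".
    for h in non_alt:
        if any(h.score < min_ratio * a.score for a in alt):
            h.tags["om"] = h.mapq
            h.mapq = 0
        reported.append(h)

    # Rule 2: report an ALT hit only if it beats all non-ALT hits, and
    # flag it as supplementary unless there are no non-ALT hits at all.
    for a in alt:
        if not non_alt:
            reported.append(a)
        elif all(a.score > h.score for h in non_alt):
            a.flag |= SUPPLEMENTARY
            reported.append(a)
    return reported


# Example with made-up scores: a weak primary hit and a strong ALT hit.
example = [
    Hit("chr19", score=70, mapq=60, is_alt=False),
    Hit("chr19_KI270938v1_alt", score=100, mapq=60, is_alt=True),
]
for hit in apply_alt_rules(example):
    print(hit.contig, hit.mapq, hex(hit.flag), hit.tags)
```

The sketch keeps every non-ALT hit in the output, so a variant caller that is only fed the primary reference (usage scenario III-A) would still see those alignments, with mapQ 0 marking the reads that fit an ALT locus clearly better.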