Bioinformatic things to come to a computer near you in

advertisement
DTL Focus meeting:
Using GRCh38 in NGS data analysis
Time slot      Speaker                     Subject
12:45-13:00    Coffee/tea
13:00-13:20    Ies Nijman (UMCU)           Welcome & Introduction to GRCh38 (hg20)
13:20-13:40    Pieter Neerinx (UMCG)       Migration of tools, pipelines to support GRCh38
13:40-14:00    Pjotr Prins                 BWA handling of ALT contigs
14:00-14:10    Tea break
14:10-14:30    Zuotian Tatum (LUMC)        New insights on Differential Gene Expression using GRCh38
14:30-14:50    Wibowo Arindrarto (LUMC)    Comparison of hg19 and GRCh38 in the study of DUX4 gene
14:50-15:30    Ies Nijman (UMCU)           Wrap-up and open discussions
GRCh38 / hg20
Human genome build hg20
• Basic new assembly released Dec 24th, 2013; now GRCh38.p2 (Dec 8th, 2014)
• 5-7 megabases of added sequence to primary reference
• Many corrected regions (patches) to hg19
• 261 alternative loci: chromosomal regions with high variability (~66 MB)
• 128 large unplaced sequence regions
• Human_herpes_virus (EBV) mapping decoy (171 kb)
• Centromere sequences: gaps are replaced by sequence models of the
centromere repeats
• New mitochondrial sequence: Revised Cambridge Reference Sequence (rCRS)
from MITOMAP
• 4 PAR regions
• This means that coordinates change! Lift-over strategies will not completely solve it.
Human genome build hg20
• New genebuild now available (20.364 coding genes; 2.101 in alternative
loci)
• Only a few calling/annotation tools support hg20 yet (e.g. VEP)
• Ensembl default genome is hg20!! The latest hg19 site is being maintained
through an archive link.
• dbSNP locations available for hg20
• 1000G data will be remapped and recalled (est. Q1/Q2 2015)
Human genome build hg20
-Challenges and opportunities-
• How to use these alternative loci? In hg19 only a few were present, and
they were mostly blissfully ignored.
• Challenge I: mapping strategy and tools need to be changed
  • In prep: iBWA, srprism
  • BWA 0.7.12 (29 dec 2014) supports ALTs in a two-step approach
• Challenge II: variant callers need to be aware of alternative
references (and context)
• Challenge III: how to display this data in genome browsers etc, while
maintaining context?
• Challenge IV: nomenclature
• The primary assembly contains all patches and fixes to hg19 and is still a
good starting point.
What are these ALT loci?
• Scaffolds that provide an alternate representation of a locus found in the
primary reference.
• Long regions with clustered variation (e.g. LRC/KIR on chr19 and the MHC/HLA
loci on chr6)
• Besides different haplotype variants of genes, they also contain genes not in
the primary assembly (20 protein-coding, ~40 predicted protein-coding,
pseudogenes, lincRNAs)
• Mind: NCBI and Ensembl take ALTernative approaches: NCBI uses
primary chromosomes plus ALT loci, while Ensembl builds a completely
new ALT chromosome (so including the identical sequence)
Usage scenarios
• I: use primary reference (toplevel chrs)
• II: use primary reference + mapping decoys (Un + EBV)
  • Improves mapping accuracy
  • Only feed primary reference to variant caller
• III: use primary reference + ALT loci + mapping decoys (Un + EBV)
  • Improves mapping accuracy (?)
  • A: Only feed primary reference to variant caller
  • B: Run variant caller on all loci…
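A minimal sketch of how the "only feed primary reference to the variant caller" step could be scripted. It assumes UCSC-style contig names as used in the analysis sets (suffixes `_alt`, `_random`, `_decoy`, prefix `chrUn_`, and `chrEBV`); that naming scheme and the helper names `classify_contig` and `contigs_for_calling` are illustrative assumptions, not part of any tool.

```python
import re

# Assumption: primary chromosomes are named chr1..chr22, chrX, chrY, chrM.
_PRIMARY = re.compile(r"^chr([0-9]{1,2}|X|Y|M)$")


def classify_contig(name):
    """Classify a contig by its UCSC-style name (naming scheme is assumed)."""
    if _PRIMARY.match(name):
        return "primary"
    if name == "chrEBV" or name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "unlocalized"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"  # anything this sketch does not recognise


def contigs_for_calling(names):
    """Scenario II / III-A: keep only the primary chromosomes."""
    return [n for n in names if classify_contig(n) == "primary"]


# Illustrative contig names; map against all of them, call on a subset.
reference_contigs = [
    "chr1", "chr19", "chrM",
    "chr1_KI270706v1_random",
    "chrUn_KI270302v1",
    "chr19_KI270938v1_alt",
    "chrEBV",
]
print(contigs_for_calling(reference_contigs))
```

The resulting list could then be passed to a variant caller as its set of target regions, while the mapper still sees the decoys and ALT loci.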
Adding the mapping decoys
Total bp per class:

Class          Grch38_full_plus_analysisset   Grch38_full_analysisset
Primary        3.088.286.401                  3.088.286.401
Unlocalized    6.978.808                      6.978.808
Unplaced       4.485.509                      4.485.509
ALT            109.535.387                    109.535.387
decoy          5.964.345                      171.823
Total          3.215.250.450                  3.209.457.928

graphs based on 11 Xten WGS samples
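The totals can be checked by summing the per-class sizes listed above. The sketch below simply redoes that arithmetic; the dictionary layout and variable names are illustrative.

```python
# Base pairs per class, copied from the figures above.
# The two analysis sets differ only in the size of the decoy class.
shared_bp = {
    "Primary": 3_088_286_401,
    "Unlocalized": 6_978_808,
    "Unplaced": 4_485_509,
    "ALT": 109_535_387,
}
decoy_bp = {
    "Grch38_full_plus_analysisset": 5_964_345,
    "Grch38_full_analysisset": 171_823,
}

totals = {name: sum(shared_bp.values()) + bp for name, bp in decoy_bp.items()}

for name, total in totals.items():
    non_primary = total - shared_bp["Primary"]
    print(f"{name}: {total:,} bp total, "
          f"{100 * non_primary / total:.2f}% outside the primary assembly")
```

Both sums reproduce the Total row (3.215.250.450 and 3.209.457.928 bp).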
GRCh37.p13 Improved alignments outside of fix patch regions
Jason Harris
[Figure: regions outside of fix patches, comparing hs37d5 and GRCh37.p13]
Personalis, Inc. | Confidential and Proprietary
Heng Li: BWA approach to ALT mapping
• ALTs supported in >v0.7.11 through an additional ID-list file, $ref.alt
• Advised to use the NCBI NGS analysis sets (3 flavors), which have slightly
modified sequences to facilitate mapping (hard-masked PAR and centromeric
regions)
1. The original mapQ of a non-ALT hit is computed across non-ALT hits only.
The reported mapQ of an ALT hit is always computed across all hits.
2. An ALT hit is only reported if its score is better than all overlapping
non-ALT hits. A reported ALT hit is flagged with 0x800 (supplementary) unless
there are no non-ALT hits.
3. The mapQ of a non-ALT hit is reduced to zero if its score is less than 80%
(controlled by option -g) of the score of an overlapping ALT hit. In this
case, the original mapQ is moved to the om tag.
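Rules 2 and 3 can be illustrated with a toy model. This is a sketch under stated assumptions, not BWA's implementation: the `Hit` class and `apply_alt_rules` function are hypothetical, all hits passed in are assumed to overlap each other, and rule 1 is not modelled (mapQ values are taken as given).

```python
from dataclasses import dataclass, field

SUPPLEMENTARY = 0x800  # SAM flag named in rule 2


@dataclass
class Hit:
    """Hypothetical stand-in for one alignment hit of a read."""
    contig: str
    score: int
    mapq: int
    is_alt: bool
    flag: int = 0
    tags: dict = field(default_factory=dict)


def apply_alt_rules(hits, min_ratio=0.8):
    """Apply rules 2 and 3 to hits that are assumed to all overlap.

    min_ratio plays the role of the 80% threshold in rule 3.
    Returns the list of hits that would be reported.
    """
    non_alt = [h for h in hits if not h.is_alt]
    alt = [h for h in hits if h.is_alt]
    reported = list(non_alt)

    # Rule 2: report an ALT hit only if it beats every overlapping
    # non-ALT hit; flag it supplementary unless no non-ALT hit exists.
    for h in alt:
        if all(h.score > n.score for n in non_alt):
            if non_alt:
                h.flag |= SUPPLEMENTARY
            reported.append(h)

    # Rule 3: zero the mapQ of a non-ALT hit scoring below min_ratio times
    # an overlapping ALT hit; keep the original value in the "om" tag.
    best_alt = max((h.score for h in alt), default=None)
    for n in non_alt:
        if best_alt is not None and n.score < min_ratio * best_alt:
            n.tags["om"] = n.mapq
            n.mapq = 0
    return reported


primary = Hit("chr19", score=70, mapq=60, is_alt=False)
alternate = Hit("chr19_KI270938v1_alt", score=100, mapq=60, is_alt=True)
for h in apply_alt_rules([primary, alternate]):
    print(h.contig, h.mapq, hex(h.flag), h.tags)
```

In this example the ALT hit is reported as supplementary, and the primary hit keeps its position but drops to mapQ 0 with the original value preserved in `om`.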
Heng Li: BWA approach
Variant calling on ALTs?
• By adding the ALT loci in mapping and calling we gain better
haplotype-aware mappings/calls, but this is not clearly reflected in the VCF
• Adding ‘haplotyping’ to the VCF format
A. Quinlan, Virginia, GRC WS 2014
Variant Annotation on HG20 / ALTs
• Ensembl VEP
• SnpEff
• dbNSFP in next release (~May)
Nomenclature
chr19_KI270938v1_alt
CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1
hg38 / GRCh38 not hg20 please…
GenBank: KI270886.1
RefSeq: NT_187640.1
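The UCSC-style name appears to embed the GenBank accession, with the version dot written as `v` (for example `KI270938v1` for `KI270938.1`). Assuming that convention holds, a small helper can recover the accession; the function name `ucsc_to_genbank` and the regular expression are illustrative, and names in the Ensembl style (`CHR_HSCHR19KIR_...`) are deliberately not handled.

```python
import re

# Assumed pattern: chr<chrom>_<accession>v<version>[_alt|_random|_decoy|_fix]
_UCSC_NAME = re.compile(
    r"^chr(?P<chrom>[0-9A-Za-z]+)_"
    r"(?P<acc>[A-Z]+[0-9]+)v(?P<ver>[0-9]+)"
    r"(?:_(?P<kind>alt|random|decoy|fix))?$"
)


def ucsc_to_genbank(name):
    """Return the GenBank-style accession embedded in a UCSC-style name.

    Returns None for names that do not follow the assumed pattern,
    such as plain chromosomes or Ensembl-style names.
    """
    m = _UCSC_NAME.match(name)
    if m is None:
        return None
    return f"{m['acc']}.{m['ver']}"


print(ucsc_to_genbank("chr19_KI270938v1_alt"))
print(ucsc_to_genbank("CHR_HSCHR19KIR_G248_BA2_HAP_CTG3_1"))
```

Mapping GenBank accessions to RefSeq identifiers (such as NT_187640.1) needs a lookup table from the assembly report; it cannot be derived from the name.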
Everything is in a state of flux, including
the status quo.
-Robert Byrne-
• Even 1.5 years after the release, many things are uncertain about
the use of the full build.
• GATK is remarkably silent
• Ewan Birney and Richard Durbin agreed on March 24th to build a new
reference/analysis set with a more standardized set of chromosomes, ALTs and
decoys (pers. comm.).
• Heng Li: “The current BWA-MEM method is just a start. [...] We may make
changes. It is also possible that we might make breakthrough on the
representation of multiple genomes, in which case, we can even get rid
of ALT contigs for good.”