Considerations for Analyzing Targeted NGS Data – HLA

Tim Hague, CTO
Introduction
• Human leukocyte antigen (HLA) is the major histocompatibility complex (MHC) in humans.
• A group of genes (a 'superregion') on chromosome 6.
• Essentially encodes cell-surface antigen-presenting proteins.
Functions
HLA genes have functions in:
combating infectious diseases
graft/transplant rejection
autoimmunity
cancer
Alleles
• Large number of alleles (and proteins).
• Many alleles are already known.
The number of known alleles is increasing
HLA Class I
Gene      A     B     C
Alleles   2013  2605  1551
Proteins  1448  1988  1119
HLA Class II
Gene      DRA  DRB*  DQA1  DQB1  DPA1  DPB1
Alleles   7    1260  47    176   34    155
Proteins  2    901   29    126   17    134
HLA Class II - DRB Alleles
Gene      DRB1  DRB3  DRB4  DRB5
Alleles   1159  58    15    20
Proteins  860   46    8     17
Analysis Challenges
HLA genes have specific analysis challenges regardless of the sequencing technology.
High Polymorphism
High rate of polymorphism – up to 100 times
the average human mutation rate.
The HLA-DRB1 and HLA-B loci have the highest
sequence variation rate within the human genome.
High degree of heterozygosity – homozygotes are
the exception in this region.
Duplications
• High level of segmental duplications
• Lots of similar genes and lots of very similar pseudogenes.
• Duplicated segments can be more similar to each other within an individual than they are to the corresponding segments of the reference genome.
Complex Genetics
• Particularly HLA-DRB*
• The DR β-chain is encoded by 4 loci, but no more than 3 functional loci are present in a single individual, and at most 2 per chromosome.
Mitigating Factors
It's not all bad news:
Many HLA alleles are already well known – both in
terms of sequence and frequencies within the
population.
The HLA region is fairly small, so there is a high degree of linkage disequilibrium, and therefore lots of known haplotypes.
Traditional Typing
• SSO – low resolution, high throughput, cheap
• SSP – very fast results, low resolution
• SBT – sequence-based typing, high resolution, usually done by Sanger sequencing.
NGS Typing
High resolution, an alternative to Sanger-based SBT
Why is it needed?
Sanger and HLA
• Sanger data is still the gold standard in the genomic sequencing industry, even though it is very expensive compared to NGS.
• Error rate of 1 in 1,000 bases; if forward and reverse typing are both done, the error rate drops to 1 in 1,000,000.
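The drop from 1 in 1,000 to 1 in 1,000,000 follows if the forward and reverse reads err independently. That independence is an assumption made here for illustration; the slide does not state it. A minimal sketch of the arithmetic:

```python
from fractions import Fraction

# Per-base error rate quoted on the slide for a single Sanger read.
single_read_error = Fraction(1, 1000)

# Assumption: forward and reverse reads err independently, so a base is
# miscalled in both directions with probability p * p.
both_reads_error = single_read_error * single_read_error

print(f"single read: 1 in {int(1 / single_read_error)}")
print(f"forward + reverse: 1 in {int(1 / both_reads_error)}")
```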
So why is it bad for HLA?
Phase Resolution
• Two copies of chromosome 6
• Many loci, many alleles
• Lots of heterozygosity
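Allele Phasing problem
[Figure: against the reference sequence, the consensus sequence shows two heterozygous positions, G/T and T/A. The consensus alone cannot tell which base at the second position belongs with which base at the first, so two different Allele 1 / Allele 2 assignments are possible.]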
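The ambiguity in the phasing figure can be made concrete by listing every allele pair that is consistent with a set of heterozygous positions. The sketch below is illustrative only; the function name `possible_haplotype_pairs` is hypothetical, and the input is the two heterozygous positions from the slide (G/T and T/A).

```python
from itertools import product

def possible_haplotype_pairs(het_sites):
    """Enumerate every way to split heterozygous sites into two alleles.

    het_sites is a list of (base_a, base_b) tuples, one per heterozygous
    position, in sequence order. The first site is held fixed so that
    (allele 1, allele 2) and (allele 2, allele 1) are not counted twice.
    """
    if not het_sites:
        return []
    first, rest = het_sites[0], het_sites[1:]
    pairs = []
    for flips in product((False, True), repeat=len(rest)):
        allele_1 = [first[0]]
        allele_2 = [first[1]]
        for (a, b), flip in zip(rest, flips):
            allele_1.append(b if flip else a)
            allele_2.append(a if flip else b)
        pairs.append(("".join(allele_1), "".join(allele_2)))
    return pairs

# The two heterozygous positions from the slide: G/T and T/A.
for allele_1, allele_2 in possible_haplotype_pairs([("G", "T"), ("T", "A")]):
    print(allele_1, allele_2)
```

With n heterozygous positions the list has 2^(n-1) entries, which is why a highly heterozygous region quickly becomes ambiguous when only a single combined signal is available.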
The Problem with Sanger
• There is only one signal
• High degree of heterozygosity = high degree of ambiguity
• Requires statistical techniques based on known allele frequencies, plus manual intervention by trained operators
• Ambiguity can only be resolved statistically, which can lead to wrong assignment for rare types
HLA typing by Sanger method
GGACSGGRASACACGGAAWGTGAAGGCCCACTCACAGACTSACCGAGYGRACCTGGGGACCCTGCGCGGCTACTACAACCAGAGCGAGGMCGGT
[Figure: plot with axis "Number of potential alleles", scale 0 to 550.]
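The Sanger consensus above contains IUPAC ambiguity codes (S, R, W, Y, M) wherever the trace shows two overlapping bases. The following sketch, with hypothetical helper names, locates those positions and counts how many distinct allele pairs the consensus permits. Note that this counts sequence-level phasings; it is not the "number of potential alleles" plotted on the slide, which presumably refers to known database alleles.

```python
# IUPAC two-base ambiguity codes, as produced when a Sanger trace shows
# two overlapping peaks at a heterozygous position.
IUPAC_TWO_BASE = {
    "R": "AG", "Y": "CT", "S": "CG",
    "W": "AT", "K": "GT", "M": "AC",
}

def heterozygous_positions(consensus):
    """Return (index, possible_bases) for each two-base ambiguity code."""
    return [
        (i, IUPAC_TWO_BASE[base])
        for i, base in enumerate(consensus)
        if base in IUPAC_TWO_BASE
    ]

def unphased_pair_count(consensus):
    """Number of distinct allele pairs consistent with the consensus."""
    n = len(heterozygous_positions(consensus))
    return 1 if n == 0 else 2 ** (n - 1)

fragment = "GGACSGGRASACACGGAAWG"  # start of the sequence shown on the slide
print(heterozygous_positions(fragment))
print(unphased_pair_count(fragment))
```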
NGS Advantages
• Can reduce ambiguity
• Phase resolution - two signals, but lots of short reads
• Cheaper and faster than Sanger
• Less manual intervention required
NGS Data - Unphased
NGS Data - Phased
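One way short reads can resolve phase is that a single read covering two heterozygous positions shows which bases sit on the same chromosome. The sketch below uses made-up toy reads and a hypothetical function name; it illustrates the idea and is not the method used by any particular tool.

```python
from collections import Counter

def phase_from_reads(reads, site_a, site_b):
    """Count which base combinations co-occur on reads covering both sites.

    reads is a list of (start, sequence) tuples with 0-based start positions
    on a shared coordinate system; site_a and site_b are 0-based positions.
    Reads that do not span both sites carry no phase information.
    """
    support = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= site_a < end and start <= site_b < end:
            support[(seq[site_a - start], seq[site_b - start])] += 1
    return support

# Hypothetical toy reads over a locus with heterozygous sites at 2 and 6.
reads = [
    (0, "ACGTTTAC"),  # spans both sites: G ... A
    (1, "CGTTTAC"),   # spans both sites: G ... A
    (0, "ACTTTTTC"),  # spans both sites: T ... T
    (2, "TTTTTC"),    # spans both sites: T ... T
    (4, "TTAC"),      # covers only the second site
]
print(phase_from_reads(reads, 2, 6))
```

Here the spanning reads support G with A on one allele and T with T on the other. When heterozygous sites are further apart than the read length, no single read links them, which is the "lots of short reads" caveat.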
NGS Approaches
• HLA*IMP – chip-based imputation engine
• Reference-based alignment, followed by an HLA call based on the variants detected during alignment
• Search against a database of known alleles
NGS Reference-based
• Fraught with difficulties
• Very hard to align reads to this region
• The variant/HLA call is only as good as the alignment
• No coverage = no call
Has been attempted by Broad Institute (HLA Caller) and Roche
Alignment Efforts
RainDance provides a targeted HLA amplification kit called HLAseq.
Target: the whole MHC superregion (except for some tandem repeat regions)
Goal: align this data before making the variant/HLA call.
Diverse variant “density” in the MHC superregion (based on a single sample)
Default BWA alignment – no coverage at an exon of HLA-DMB
Low coverage and orphaned reads at an HLA-DRB1 exon
BWA vs more permissive alignment: higher coverage = higher noise
Large targeted region without usable coverage
NGS Reference-based
Not providing enough coverage everywhere
What about de novo?
De novo assembly (MIRA)
287 contigs (longest contig: 2199 bp)
Mean contig size: 268 bp
Median contig size: 209 bp
Total consensus: 77084 bp
RainDance target: ~ 3800000 bp
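The assembly figures above can be put in proportion with a quick calculation. The sketch uses only the numbers quoted on the slide; treating total consensus as covered target is an upper bound, since it assumes the contigs do not overlap each other.

```python
# Figures quoted on the slide.
total_consensus_bp = 77_084
contig_count = 287
target_bp = 3_800_000  # approximate RainDance target size

# Upper bound on the share of the target represented by the assembly,
# assuming no contig overlaps another.
fraction_assembled = total_consensus_bp / target_bp
mean_contig_bp = total_consensus_bp / contig_count

print(f"assembled fraction of target: {fraction_assembled:.1%}")
print(f"mean contig size: {mean_contig_bp:.1f} bp")
```

The consensus amounts to roughly 2% of the target, and the computed mean agrees with the 268 bp reported by the assembler.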
NGS De Novo Alignment
Too few contigs were produced, and they cover too little of the target region.
What about a hybrid approach?
De novo assembly with “backbone”
First align the reads to the backbone, then run de novo assembly
Backbone: 2220 contigs from HG19 chr 6 (sum: 3554852 bp) → almost the whole RainDance target
Results:
Max reads / backbone contig: 197
Max coverage: 71
NGS Typing - Alignment Based
We tried:
Burrows-Wheeler aligner (BWA)
More sensitive seed-and-extend aligner
De novo assembler
'Hybrid' de novo assembler
The variant/HLA call is only as good as the alignment
The alignments were not good enough
NGS Database Based
• Search against a 'database' of known alleles
• Such as the IMGT/HLA database, available from the EBI web site
Stanford, Connexio, JSI Medical, BC Cancer Agency and Omixon have all tried this approach.
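To illustrate the general idea of searching reads against known alleles, here is a deliberately simplified sketch. The allele names and sequences are invented for the example and are not taken from IMGT/HLA; the exact-substring scoring is an assumption chosen for brevity and is not how any of the tools named above is known to work.

```python
from itertools import combinations_with_replacement

def best_allele_pair(reads, alleles):
    """Pick the pair of known alleles that explains the most reads.

    A read counts as explained if it occurs verbatim in either allele of
    the pair. Ties are broken by the order of the alleles dict.
    """
    best_pair, best_score = None, -1
    for name_1, name_2 in combinations_with_replacement(alleles, 2):
        score = sum(
            read in alleles[name_1] or read in alleles[name_2]
            for read in reads
        )
        if score > best_score:
            best_pair, best_score = (name_1, name_2), score
    return best_pair, best_score

# Hypothetical allele names and sequences, not taken from IMGT/HLA.
alleles = {
    "toy*01": "ACGTACGGTA",
    "toy*02": "ACGTTCGGTA",
    "toy*03": "ACCTACGGAA",
}
reads = ["GTACGG", "CGTTCG", "ACGTA", "TTCGGT"]
print(best_allele_pair(reads, alleles))
```

Because whole candidate alleles are compared, the result is a named pair instead of a list of variants to interpret. A real tool would also need to tolerate sequencing errors and handle reads that match no known allele.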
DB Based Approach
Advantages
Fewer mapping headaches
Unambiguous results
Potential to be fast
Difficulties
Novel allele detection
Homozygous alleles
Results with Exome data
Exon level detail
Detailed results - short read pileup
Conclusions
• The DB-based approach to HLA typing is new but very promising
• NGS approaches can resolve much of the ambiguity of Sanger SBT
• The DB-based approach can also overcome the limitations of NGS reference-based alignment
Conclusions
Available DB-based HLA typing tools differ in:
Speed
Sequencers supported
Types of sequencing data supported (targeted, exome, whole genome)
Ease of use
Ambiguity of results
Degree of manual intervention required
Novel allele detection capabilities