Considerations for Analyzing Targeted NGS Data – Exomes

Tim Hague, CTO
Exome Analysis
• 3 sets of full exome sequences for the same individual, targeted by 3 different kits
• One set had data problems because reads were from 2 different sequencers
• Remaining 2 sets were analyzed both by the customer and by Omixon
Exome Targets
• Illumina TruSeq: ~62 Mbp
• NimbleGen SeqCap EZ Exome: ~64 Mbp
• ~35 Mbp overlap between the two targets
• Exons, ORFs and putative translated regions are captured
• 40M and 37M read pairs respectively, 101 bp read length
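The overlap between two target sets can be computed directly from their interval lists. The following is a minimal sketch, not the method used in this study: it assumes each kit's targets are available as half-open (start, end) intervals per chromosome, as in BED files. The function names and the toy coordinates are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def overlap_bp(targets_a, targets_b):
    """Count bases covered by both target sets.

    Each argument maps a chromosome name to a list of half-open
    (start, end) intervals.
    """
    total = 0
    for chrom in targets_a.keys() & targets_b.keys():
        a = merge_intervals(targets_a[chrom])
        b = merge_intervals(targets_b[chrom])
        i = j = 0
        while i < len(a) and j < len(b):
            lo = max(a[i][0], b[j][0])
            hi = min(a[i][1], b[j][1])
            if lo < hi:
                total += hi - lo
            # Advance whichever interval ends first.
            if a[i][1] < b[j][1]:
                i += 1
            else:
                j += 1
    return total


# Toy coordinates, invented purely for illustration.
kit_a = {"chr1": [(100, 200), (150, 300)], "chr2": [(0, 50)]}
kit_b = {"chr1": [(250, 400)], "chr3": [(0, 10)]}
print(overlap_bp(kit_a, kit_b))
```

Merging each set first avoids double counting where a kit lists overlapping targets.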
Full Analysis Pipelines
• In this case we are comparing two full NGS analysis pipelines
• Each includes mapping/alignment and a multi-step variant calling pipeline
• The Omixon pipeline for this analysis uses two variant callers
• The Omixon pipeline also uses recalibration and indel realignment
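One simple way to compare the output of two pipelines is to count shared and pipeline-specific calls. The sketch below is an illustration under stated assumptions, not the comparison procedure used here: it assumes both call sets are VCF text lines, and it keys each variant on (chromosome, position, REF, ALT). The function names are hypothetical. Indels that are written differently by the two pipelines would need to be normalized first, which this sketch does not do.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from VCF data lines."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Split multi-allelic records into one key per ALT allele.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_call_sets(lines_a, lines_b):
    """Count calls shared by both pipelines and unique to each."""
    a, b = variant_keys(lines_a), variant_keys(lines_b)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
```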
Finding long indels 1.
Better indel resolution 1.
Better indel resolution 2.
Indel Handling
• If indels are important to an analysis, this needs to be taken into account from the planning stage onwards
• BWA does better when indel realignment is used in combination with paired data
Fewer low-quality false positives
Quality and Coverage
• Some of these low-quality variants can be removed by filtering after variant calling
• Quality and coverage cut-offs have to be set appropriately in both the alignment and the variant calling steps
• Quality recalibration can also help to reduce low-quality false positives
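Post-call filtering on quality and coverage can be sketched as below. This is an illustration only: it assumes the variant caller writes the read depth as `DP` in the VCF INFO column, and the default thresholds are placeholders, not values recommended by this analysis. The function name is hypothetical.

```python
def passes_filters(vcf_line, min_qual=30.0, min_depth=10):
    """Return True if a VCF data line meets the QUAL and DP cut-offs.

    min_qual and min_depth are placeholder thresholds; suitable values
    depend on the kit, the coverage and the variant caller.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]
    if qual == "." or float(qual) < min_qual:
        return False
    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags.
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    depth = info.get("DP")
    return depth is not None and int(depth) >= min_depth


def filter_variants(vcf_lines, **thresholds):
    """Keep header lines and the data lines that pass the cut-offs."""
    return [
        line for line in vcf_lines
        if line.startswith("#") or passes_filters(line, **thresholds)
    ]
```

Records with a missing QUAL or without a `DP` entry are dropped here; a real pipeline may prefer to flag them instead.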
Variations next to coding areas
Splicing and Promoters
• Most of the exome kits also provide variant calls close to the coding regions
• These should be included in the analysis if possible
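One way to keep variants near the coding regions in scope is to pad each target interval before restricting the calls to the targets. The sketch below assumes half-open (start, end) intervals on a single chromosome; the function name and the 20 bp default flank are hypothetical choices, not values taken from this analysis.

```python
def pad_targets(intervals, flank=20, chrom_length=None):
    """Extend each half-open (start, end) interval by `flank` bases on
    both sides, clip to the chromosome, and merge any overlaps."""
    padded = []
    for start, end in sorted(intervals):
        start = max(0, start - flank)
        end = end + flank
        if chrom_length is not None:
            end = min(end, chrom_length)
        if padded and start <= padded[-1][1]:
            # The padded interval runs into the previous one: merge them.
            prev_start, prev_end = padded[-1]
            padded[-1] = (prev_start, max(prev_end, end))
        else:
            padded.append((start, end))
    return padded


# Toy coordinates, invented for illustration.
print(pad_targets([(10, 50), (80, 120)], flank=5))
```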
Fewer false positives in complex regions 1.
Fewer false positives in complex regions 2.
Fewer false positives in complex regions 3.
Fewer false positives in complex regions 4.
Higher coverage.
Fewer false positives in complex regions 5.
Lower coverage.
Complex regions
• Mismappings due to pseudogenes or repeats – or just complex regions?
• Sometimes more coverage can actually be bad
• Need to watch out for non-specific read mappings (reads mapping to multiple places)
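A rough check for non-specific mappings is the share of mapped reads with a very low mapping quality. The sketch below reads SAM text lines and relies on the assumption that the aligner reports MAPQ 0 for reads that fit equally well in more than one place; BWA behaves this way, but the convention is aligner-dependent. The function name is hypothetical.

```python
def ambiguous_fraction(sam_lines, max_mapq=0):
    """Fraction of mapped reads whose MAPQ is at or below `max_mapq`."""
    total = ambiguous = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.split("\t")
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        total += 1
        if mapq <= max_mapq:
            ambiguous += 1
    return ambiguous / total if total else 0.0
```

Applied to the reads of one region at a time, a high fraction suggests that the coverage there is inflated by reads that may belong elsewhere, for example to a pseudogene or a repeat.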
Regions where both aligners are confused 1.
Regions where both aligners are confused 2.
Very Complex Regions
• Some regions are extremely difficult to map with any technique
• A different approach to mapping/alignment may be required
• A different approach to variant calling may be required (local de novo, phasing, etc.)
Problems with sex chromosomes
• There are many heterozygous calls on the X and Y chromosomes that are certainly false positives or incorrect calls.
• This is true for both pipelines; read specificity and the variant calling procedure have to be improved for these chromosomes.
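Suspicious calls of this kind can be listed with a simple scan of the call set. The sketch below assumes a single-sample VCF with a `GT` entry in the FORMAT column, and it assumes a male sample, which the slides imply but do not state. It does not exclude the pseudoautosomal regions of X, where heterozygous calls can be genuine, so those would need separate handling. The function name and the chromosome naming are hypothetical.

```python
def het_calls_on_sex_chromosomes(vcf_lines, sex_chroms=("chrX", "chrY", "X", "Y")):
    """List (chrom, pos, genotype) for heterozygous calls on X and Y."""
    hits = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] not in sex_chroms or len(fields) < 10:
            continue  # not a sex chromosome, or no sample column
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        genotype = fields[9].split(":")[format_keys.index("GT")]
        alleles = genotype.replace("|", "/").split("/")
        # Heterozygous: two called alleles that differ.
        if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
            hits.append((fields[0], int(fields[1]), genotype))
    return hits
```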
Summary
• These kinds of comparative studies can be useful in analyzing the effectiveness of exome sequencing
• Different exome kits can give different results
• The data analysis and variant call tools chosen for the analysis can also have a big impact
• There is some potential to improve the quality of the customer's exome analysis pipeline