Analysis of exome sequencing data using the UCL Legion cluster

Francesco Lescai 1,2, Elia Stupka 1,3
1 UCL Genomics, 2 Division of Research Strategy, 3 UCL Cancer Institute, 72 Huntley Street, WC1E 6BT London, UK
Abstract
During the last year, we have set up and optimized an analysis pipeline for exome capture next generation sequencing data, exploiting the potential of the UCL Legion cluster. Given the pace at which the sequencing cost/throughput ratio is diminishing, optimization of the entire pipeline in a cluster environment becomes essential in order to process the data within a reasonable time frame. We have now implemented the entire exome analysis pipeline, including initial QC, alignment, analysis (SNP and InDel calling) and annotation, on the cluster. In this poster we present the overall flow of our pipeline and the main challenges encountered in its implementation, and we provide an overview of the most important biological observations emerging from the data analyzed so far, i.e. over 70 human exomes.
[Figure 1 flowchart: QC (FastQC statistics) → alignment to the assembled reference (MAQ, BWA, Novoalign; alignment statistics) → SNP calling (Samtools) and InDel calling (Dindel, after format conversion) → coverage (BEDtools; coverage statistics) → SNP/InDel annotation → pairwise comparison, VCF conversion and population statistics; Legion launcher scripts and R graphical reporting support every step]
Figure 1 - Scheme of the UCL Genomics pipeline
Every step is performed on the cluster with specific jobs. The structure of the pipeline is based on scripts that launch each analysis on multiple samples, scripts that monitor the activities and trigger the following steps, and jobs integrating Perl and R programming to report the results. Several QC steps are included in the workflow.
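
As an illustration of the launcher scripts described in the caption above, the following is a minimal Perl sketch of a per-sample job submitter; the SGE-style qsub directives, resource limits, index name and Novoalign command line are illustrative assumptions, not the production UCL Genomics code.

#!/usr/bin/env perl
# Minimal sketch of a per-sample job launcher (hypothetical, not the
# production UCL Genomics script). Reads a list of sample IDs and
# submits one alignment job each through the SGE qsub command.
use strict;
use warnings;

my $sample_list = shift @ARGV or die "Usage: $0 samples.txt\n";

open my $fh, '<', $sample_list or die "Cannot open $sample_list: $!";
while ( my $sample = <$fh> ) {
    chomp $sample;
    next unless $sample;

    # Write a small job script for this sample.
    my $job = "align_$sample.sh";
    open my $out, '>', $job or die "Cannot write $job: $!";
    print $out "#!/bin/bash\n";
    print $out "#\$ -N align_$sample\n";          # job name
    print $out "#\$ -l h_rt=12:00:00,mem=4G\n";   # assumed resource limits
    print $out "novoalign -d hg19.ndx -f ${sample}_1.fastq ${sample}_2.fastq "
             . "-o SAM > $sample.sam\n";
    close $out;

    # Submit the job to the scheduler.
    system( 'qsub', $job ) == 0 or warn "qsub failed for $sample\n";
}
close $fh;

A companion monitoring script would then poll the scheduler and trigger the calling steps once all alignment jobs for a sample have completed.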
Introduction
Targeted capture and the progressively decreasing costs of next generation sequencing have made it easier to access this technology and scan the human genome at high resolution. While the cost of whole genome sequencing still does not allow a large scale population approach, exome sequencing represents a reliable solution, especially in a clinical context where clinical information and family history may suggest that the causal genetic variant is located in a coding region. The 1000 Genomes Consortium recently provided a very broad overview as a result of the pilot phase of the project. They showed that each person carries approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders.
Results
In a first phase we used the samples to compare the performance of different aligners, assess the impact of different filtering criteria, and test the overlap between the calls of the different combinations of aligners and variant callers. An average of 80,694,156 PE reads was generated: on average MAQ resulted in 68,906,156 mapped PE reads, BWA in 79,181,231 and Novoalign in 73,138,377. On the capture probes, an average coverage of 71.22 was reported for MAQ, of 71.80 for BWA and of 67.87 for Novoalign. An average core of 11,596 described SNPs (±631) was called by all the methods applied; for novel SNPs the core shared by all four approaches includes 280±88 calls. In classifying the InDels as described or novel we applied the same approach selected by the 1000 Genomes Consortium, i.e. looking for InDels in dbSNP within a window spanning the 10 bases flanking the sequenced InDel. The core of calls shared by all the aligners and callers is 59±3 described InDels, while for the novel InDels the shared core is 10±4 variations. According to the results of our comparisons, we selected Novoalign as our aligner of choice, Samtools to call the SNPs and Dindel for InDel calling.
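
To make the notion of a shared core concrete, the following is a minimal Perl sketch of how the calls common to all aligner/caller combinations could be counted, keying each variant on chromosome, position, reference and alternate allele. It assumes simple VCF-like input files; the file names are illustrative, not those of our pipeline.

#!/usr/bin/env perl
# Sketch: count variant calls shared by every aligner/caller combination.
# Assumes simple VCF-like inputs; one file per combination, names illustrative.
use strict;
use warnings;

my @callsets = ( 'maq.vcf', 'bwa.vcf', 'novoalign.vcf' );

my %seen;    # variant key => number of call sets containing it
for my $file (@callsets) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my %in_this_set;
    while (<$fh>) {
        next if /^#/;    # skip VCF headers
        my ( $chrom, $pos, undef, $ref, $alt ) = split /\t/;
        $in_this_set{"$chrom:$pos:$ref:$alt"} = 1;
    }
    close $fh;
    $seen{$_}++ for keys %in_this_set;
}

# The "core" is the set of variants present in every call set.
my $core = grep { $seen{$_} == @callsets } keys %seen;
print "Core calls shared by all methods: $core\n";

The same counting, restricted to keys present or absent in dbSNP, would yield the described and novel cores reported above.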
Figure 4 - Comparison of InDel size between our calls, ENSEMBL and the 1000 Genomes calls
[Density plot: InDel length (bp), -10 to 5; datasets: 1000genome, ensembl, samples; density 0.0 to 0.5]
The figure shows a comparison of the density distributions of the InDel sizes called with Dindel on our sample set (exome capture regions only), those released by the 1000 Genomes Consortium and those in the ENSEMBL database. None of our calls exceeds 15bp, and most of them are shorter than 6bp. ENSEMBL InDels of 1bp length are much more frequent than in the other two datasets, while for other lengths the 1000 Genomes dataset includes relatively more variations than our dataset, with the exception of 3bp InDels.
Figure 2 - Scheme of the storage network and data generation
A central storage facility is located at Computer Science. Values are expressed per sample to be analyzed.
Setting up our exome pipeline in the Legion environment required several testing steps. The major obstacles were the data flow between several storage units, and between login nodes and compute nodes, as well as the connection to external databases for the annotation, which can trigger thousands of queries from the same IP address range.
Storage itself proved to be a major issue across UCL, and especially for next generation sequencing. The growth in the amount of data generated, and the volumes foreseen, have to be managed and planned carefully.
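
Because thousands of annotation queries issued from a single IP address range can overload, or be rejected by, an external database, one mitigation is to throttle the requests. The following Perl sketch illustrates the idea with LWP::UserAgent; the endpoint URL, the input format and the delay are hypothetical placeholders rather than the pipeline's actual settings.

#!/usr/bin/env perl
# Sketch: throttle annotation lookups so thousands of variants do not
# hammer an external database from the same IP range. The URL below is
# a placeholder, not the actual annotation service used by the pipeline.
use strict;
use warnings;
use LWP::UserAgent;
use Time::HiRes qw(sleep);

my $ua    = LWP::UserAgent->new( timeout => 30 );
my $delay = 0.5;    # seconds between requests (assumed rate limit)
my $base  = 'http://annotation.example.org/lookup';    # hypothetical endpoint

open my $fh, '<', 'variants.txt' or die "Cannot open variants.txt: $!";
while ( my $variant = <$fh> ) {
    chomp $variant;    # e.g. "1:12345:A:G"
    my $res = $ua->get("$base?variant=$variant");
    if ( $res->is_success ) {
        print $res->decoded_content, "\n";
    }
    else {
        warn "Query failed for $variant: ", $res->status_line, "\n";
    }
    sleep $delay;      # stay below the server's limits
}
close $fh;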
Figure 3 - Increase of the data storage during 2010
[Bar chart: Data (GB), 2,000 to 18,000; May to Nov. 2010; cumulative sample counts 7, 14, 21, 74, 80, 87]
The figure shows the quick pace at which storage requirements are growing. The main drivers of this growth are both the progressive improvement of the chemistry, allowing more data per sample to be generated, and the continuing decrease in cost, making this technology affordable and accessible to many research groups across UCL.
Figure 5 - Size correlation between our calls and ENSEMBL InDels
[Scatter plot: sequenced length (bp), -15 to 15, deletions and insertions, against average database length (bp)]
The figure shows a distribution plot of the sequenced length of our InDels and of the records hit in the ENSEMBL database within the 10bp flanking window. The insertions display an adjusted r² = 0.7642 (p value < 2.2×10⁻¹⁶), and the deletions an adjusted r² = 0.6118 (p value < 2.2×10⁻¹⁶).
The SNP calling on our samples using Novoalign and Samtools revealed on average 17,897.42 SNPs already described in dbSNP, 570.27 SNPs described only in the latest release of the 1000 Genomes project, and 536.80 novel SNPs across the entire dataset. On average, 999.8 described InDels and 222.63 novel InDel variants were called.
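
The classification above reduces to a lookup against two identifier sets. The following Perl sketch, with illustrative file names and a simple chromosome:position key, assigns each called SNP to one of the three categories (described in dbSNP, 1000 Genomes only, novel).

#!/usr/bin/env perl
# Sketch: classify called SNPs as described in dbSNP, described only in
# the 1000 Genomes release, or novel. File names are illustrative.
use strict;
use warnings;

my %dbsnp = load_keys('dbsnp_sites.txt');    # chrom:pos keys
my %kg    = load_keys('1000genomes_sites.txt');

my %count = ( dbsnp => 0, kg_only => 0, novel => 0 );
open my $fh, '<', 'sample_snps.txt' or die "Cannot open sample_snps.txt: $!";
while (<$fh>) {
    chomp;
    my ( $chrom, $pos ) = split /\t/;
    my $key = "$chrom:$pos";
    if    ( $dbsnp{$key} ) { $count{dbsnp}++ }
    elsif ( $kg{$key} )    { $count{kg_only}++ }
    else                   { $count{novel}++ }
}
close $fh;
printf "described in dbSNP: %d, 1000 Genomes only: %d, novel: %d\n",
    @count{qw(dbsnp kg_only novel)};

sub load_keys {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open $file: $!";
    my %keys;
    while (<$in>) {
        chomp;
        my ( $chrom, $pos ) = split /\t/;
        $keys{"$chrom:$pos"} = 1;
    }
    close $in;
    return %keys;
}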
Comments
The use of a cluster environment is fundamental in order to process several samples at the same time with such a large amount of data. The computation is not particularly intense, but the data often have to be processed in chunks, which in turn generates a high number of jobs per sample. This happened in particular for Dindel, where a single sample can generate 110 jobs. A central storage facility, giving access to the data without copying them between different locations, is essential.
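
As an illustration of the chunking issue, the following Perl sketch splits a list of Dindel realignment windows into fixed-size chunks and submits one cluster job per chunk; the chunk size, file names and command lines are illustrative assumptions rather than our exact settings.

#!/usr/bin/env perl
# Sketch: split a list of Dindel realignment windows into chunks and
# submit one cluster job per chunk. Chunk size, file names and the
# dindel command line are illustrative assumptions.
use strict;
use warnings;

my $chunk_size = 1000;    # windows per job (assumed)
my @windows;

open my $fh, '<', 'sample.windows.txt' or die "Cannot open windows file: $!";
chomp( @windows = <$fh> );
close $fh;

my $job = 0;
while (@windows) {
    my @chunk = splice @windows, 0, $chunk_size;
    $job++;

    # Write this chunk to its own windows file.
    my $winfile = "sample.windows.$job.txt";
    open my $out, '>', $winfile or die "Cannot write $winfile: $!";
    print $out "$_\n" for @chunk;
    close $out;

    # One qsub job per chunk; a sample can easily produce ~110 such jobs.
    my $cmd = "dindel --analysis indels --doDiploid --bamFile sample.bam "
            . "--ref hg19.fa --varFile $winfile --outputFile sample.stage3.$job";
    system("qsub -b y -N dindel_$job $cmd") == 0
        or warn "qsub failed for chunk $job\n";
}
print "Submitted $job Dindel jobs\n";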
Our results highlight the impact that read quality has on the downstream results of variant calling. For this reason the choice of the aligner cannot be considered independent of the calling software to be used. The removal of low quality regions before mapping ensures a better alignment, and therefore improved variant calling in the following step.
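
As a toy example of removing low quality regions before mapping, the following Perl sketch trims the 3' tail of each FASTQ read once the Phred quality drops below a cutoff; the Q20 threshold and the Sanger (Phred+33) encoding are assumptions, and a production pipeline would typically rely on a dedicated trimming tool.

#!/usr/bin/env perl
# Sketch: trim low quality 3' tails from FASTQ reads before alignment.
# Assumes Sanger (Phred+33) quality encoding; the cutoff is illustrative.
use strict;
use warnings;

my $min_q = 20;    # trim bases below Q20 at the 3' end (assumed cutoff)

while ( my $header = <STDIN> ) {
    my $seq  = <STDIN>;
    my $plus = <STDIN>;
    my $qual = <STDIN>;
    chomp( $seq, $qual );

    # Walk back from the 3' end while the quality stays below the cutoff.
    my $len = length $qual;
    $len-- while $len > 0
        && ord( substr( $qual, $len - 1, 1 ) ) - 33 < $min_q;

    print $header,
          substr( $seq,  0, $len ), "\n",
          $plus,
          substr( $qual, 0, $len ), "\n";
}

Run as: perl trim.pl < sample.fastq > sample.trimmed.fastq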
InDel calling still represents a challenge for several reasons, and a substantial limit in the detection of InDels has to be stressed. The annotation of InDels as novel or described is not straightforward, considering that most of the insertions and deletions populating dbSNP have been described and mapped with other technologies. Additional work is currently being performed to improve this area specifically.
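
The 10bp flanking window used to mark an InDel as described reduces to a positional comparison. The following is a minimal Perl sketch of that matching step, assuming both the call set and the dbSNP InDels are available as tab-delimited chromosome/position lists; the file names are illustrative.

#!/usr/bin/env perl
# Sketch: flag a called InDel as "described" if a dbSNP InDel starts
# within 10bp of its position, following the 1000 Genomes convention.
# Input files (tab-delimited chrom/pos) are illustrative.
use strict;
use warnings;

my $window = 10;    # flanking bases to search on either side

# Index dbSNP InDel positions by chromosome.
my %dbsnp;
open my $db, '<', 'dbsnp_indels.txt' or die "Cannot open dbSNP file: $!";
while (<$db>) {
    chomp;
    my ( $chrom, $pos ) = split /\t/;
    $dbsnp{$chrom}{$pos} = 1;
}
close $db;

open my $fh, '<', 'called_indels.txt' or die "Cannot open calls: $!";
while (<$fh>) {
    chomp;
    my ( $chrom, $pos ) = split /\t/;

    # Described if any dbSNP InDel lies within the +/-10bp window.
    my $described = 0;
    for my $offset ( -$window .. $window ) {
        if ( $dbsnp{$chrom}{ $pos + $offset } ) {
            $described = 1;
            last;
        }
    }
    print join( "\t", $chrom, $pos, $described ? 'described' : 'novel' ), "\n";
}
close $fh;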
Figure 6 - Distance between call position and ENSEMBL InDel
[Density plot: distance (bp), -20 to 20; datasets: sample, 1000genome; density 0.0 to 0.5]
The majority of described variations fall within 5bp of the starting point of the InDel called in our dataset, and most of them within 10bp. Notably, most of the variations appear to be located only downstream of our starting point.