Analysis of exome sequencing data using the UCL Legion cluster

Francesco Lescai1,2, Elia Stupka1,3
1 UCL Genomics, 2 Division of Research Strategy, 3 UCL Cancer Institute, 72 Huntley Street, London WC1E 6BT, UK

Abstract

During the last year, we have set up and optimized an analysis pipeline for exome capture next generation sequencing data, exploiting the potential of the UCL Legion cluster. Given the pace at which the sequencing cost/throughput ratio is diminishing, optimization of the entire pipeline in a cluster environment becomes essential in order to process the data within a reasonable time frame. We have now implemented the entire exome analysis pipeline on the cluster, including initial QC, alignment, analysis (SNP and INDEL calling) and annotation. In this poster we present the overall flow of our pipeline and the main challenges encountered in its implementation, and we give an overview of the most important biological observations emerging from the data analyzed so far, i.e. over 70 human exomes.

Figure 1 - Scheme of the UCL Genomics pipeline
[Diagram labels, in extraction order: QC; alignment; fastqc; statistics; reference assembling; legion launcher; R reporting; alignment statistics; MAQ; BWA; Samtools calling; Novoalign; conversion; Dindel calling; BEDtools coverage; coverage statistics; R graphical reporting; SNP/InDel annotation; SNP annotation; InDel annotation; pairwise comparison; VCF conversion; statistics; population statistics.]
Every step is performed in the cluster with specific jobs. The structure of the pipeline is based on scripts that launch each analysis on multiple samples, scripts that monitor the activities and trigger the following steps, and jobs integrating Perl and R programming to report the results. Several QC steps are included in the workflow.

Introduction

Targeted capture and the progressively decreasing costs of next generation sequencing have made it easier to access this technology and scan the human genome at high resolution. While the cost of whole genome sequencing still does not allow a large-scale population approach, exome sequencing represents a reliable solution, especially in a clinical context where clinical information and family history may suggest that the causal genetic variant is located in a coding region. The 1000 Genomes consortium recently provided a very broad overview as a result of the pilot phase of the project. They showed that each person carries approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders.

Results

Setting up our exome pipeline in the Legion environment required several testing steps. The major obstacles were the data flow between several storage units and between login nodes and compute nodes, as well as the connection to external databases for the annotation, which can trigger thousands of queries from the same IP address range.

Storage itself proved to be a major issue across UCL, and especially for next generation sequencing. The growth in the amount of data already generated, and in the amount foreseen, has to be managed and planned carefully. Figure 3 shows the quick pace at which storage requirements are growing. The main drivers of this growth are the progressive improvement of the chemistry, which allows more data per sample to be generated, and the continuing decrease in cost, which makes this technology affordable and accessible to many research groups across UCL.

Figure 2 - Scheme of the storage network and data generation
A central storage is located at Computer Science. Values are expressed per sample to be analyzed.

Figure 3 - Increase of the data storage during 2010
[Plot: data storage in GB (y axis, ticks from 2000 to 18000) by month from May 2010 to Nov. 2010, annotated with sample counts of 7, 14, 21, 74, 80 and 87 samples.]

In a first phase we used the samples to compare the performance of different aligners, assess the impact of different filtering criteria, and test the overlap between the calls of the different combinations of aligners and variant callers. An average of 80,694,156 PE reads were generated: on average Maq resulted in 68,906,156 mapped PE reads, BWA in 79,181,231 and Novoalign in 73,138,377. On the capture probes, an average coverage of 71.22 was reported for Maq, 71.80 for BWA and 67.87 for Novoalign.

An average core of 11,596 described SNPs (±631) was called by all the methods applied. For novel SNPs, the core shared by all four approaches includes 280±88 calls. In classifying the Indels as described or novel we applied the same approach selected by the 1000genome consortium, i.e. looking for Indels in dbSNP in a window spanning 10 flanking bases of the sequenced Indel. The core of calls shared by all the aligners and callers is 59±3 described Indels, while for the novel Indels the shared core is 10±4 variations. According to the results of our comparisons, we selected Novoalign as our aligner of choice, Samtools to call the SNPs and Dindel for Indel calling.

The SNP calling on our samples using Novoalign and Samtools revealed an average of 17,897.42 SNPs already described in dbSNP, an average of 570.27 SNPs described only in the latest release of the 1000genome project, and an average of 536.80 novel SNPs across the entire dataset. On average, 999.8 described Indels and 222.63 novel Indel variants were called.

Figure 4 - Comparison of InDel size between our calls, ensembl and the 1000genome calls
[Plot: density (y axis) against length (x axis); datasets: 1000genome, ensembl, samples.]
The figure shows a comparison of the density distributions of the Indel size called with Dindel on our sample set (exome capture regions only), those released by the 1000genome Consortium and those in the ENSEMBL database. None of our calls exceeds 15bp, and most of them are shorter than 6bp. ENSEMBL Indels of 1bp length are much more frequent than those in the other two datasets, while for other lengths the 1000genomes dataset includes relatively more variations than our dataset, with the exception of 3bp Indels.

Figure 5 - Size correlation between our calls and ensembl InDels
[Plot: average database length (bp) on the y axis against sequenced length (bp) on the x axis, with deletions on the negative side and insertions on the positive side.]
The figure shows a distribution plot of the sequenced length of our Indels and the length of the records hit in the ENSEMBL database within the 10bp flanking window. The insertions display an adjusted r² = 0.7642 (p value < 2.2 × 10⁻¹⁶), and the deletions an adjusted r² = 0.6118 (p value < 2.2 × 10⁻¹⁶).

Figure 6 - Distance between call position and ensembl InDel
[Plot: density (y axis) against distance (x axis); series: sample, 1000genome.]
The majority of described variations fall within 5bp of the starting point of the Indel called in our dataset, and most of them within 10bp. Notably, most of the variations appear to be located only downstream of our starting point.

Comments

The use of a cluster environment is fundamental in order to process several samples at the same time with such a large amount of data. Computation is not particularly intense, but the data often need to be processed in chunks, which in turn generates a large number of jobs per sample. This happened in particular for Dindel, where a single sample can generate 110 jobs. A central storage facility is essential to give access to the data without copying them from different locations.

Our results highlight the impact that read quality has on the downstream results of variant calling. For this reason the choice of the aligner cannot be considered independent of the calling software to be used. The removal of low quality regions before the mapping ensures a better alignment, and therefore an improved variant calling in the following step.

Indel calling still represents a challenge for several reasons. A substantial limit in Indel detection has to be stressed. The annotation of Indels as novel or described is not straightforward, considering that most of the insertions and deletions populating dbSNP have been described and mapped with other technologies. Additional work is currently being performed to improve this area specifically.
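The classification of Indels as described or novel, which the Results describe as a lookup in dbSNP within a window of 10 flanking bases, can be illustrated with a minimal sketch. This is not the pipeline's own code: the function name `classify_indels`, the `(chrom, pos)` input format, and the reading of the window as symmetric and inclusive (±10 bp around the call position) are all assumptions made for illustration.

```python
from bisect import bisect_left

FLANK_BP = 10  # window size stated in the text; symmetric use is an assumption


def classify_indels(calls, known, flank=FLANK_BP):
    """Label each called Indel as 'described' or 'novel'.

    calls: iterable of (chrom, pos) for called Indels (hypothetical format)
    known: iterable of (chrom, pos) for database records, e.g. a dbSNP export
    """
    # Index known positions per chromosome, sorted for binary search.
    by_chrom = {}
    for chrom, pos in known:
        by_chrom.setdefault(chrom, []).append(pos)
    for positions in by_chrom.values():
        positions.sort()

    labels = {}
    for chrom, pos in calls:
        positions = by_chrom.get(chrom, [])
        # First known position that is >= the left edge of the window.
        i = bisect_left(positions, pos - flank)
        hit = i < len(positions) and positions[i] <= pos + flank
        labels[(chrom, pos)] = "described" if hit else "novel"
    return labels


calls = [("chr1", 100), ("chr1", 200), ("chr2", 50)]
known = [("chr1", 108), ("chr1", 215), ("chr3", 50)]
labels = classify_indels(calls, known)
```

Here the call at chr1:100 is labelled described (a record lies 8 bp away), while chr1:200 (nearest record 15 bp away) and chr2:50 (no records on that chromosome) are labelled novel. A position-only match is deliberately loose; it does not compare Indel length or sequence.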
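The "core" of calls shared by all aligner and caller combinations, reported in the Results, amounts to a set intersection over the variants called by each method. The sketch below assumes each variant is identified by a hypothetical `(chrom, pos, ref, alt)` key and that exact key equality defines a shared call; the poster does not state the matching rule actually used, and the method names and variants shown are invented placeholders.

```python
def shared_core(call_sets):
    """Return the variants called by every method.

    call_sets: dict mapping a method name to an iterable of variant keys,
               here hypothetical (chrom, pos, ref, alt) tuples.
    """
    sets = [set(calls) for calls in call_sets.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


# Illustrative input only: method names and variants are made up.
example = {
    "method_a": [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T")],
    "method_b": [("chr1", 10, "A", "G"), ("chr2", 5, "G", "A")],
    "method_c": [("chr1", 10, "A", "G"), ("chr1", 20, "C", "T")],
}
core = shared_core(example)
```

Only the variant at chr1:10 is present in all three illustrative call sets, so it is the only member of the core.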
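The Comments note that processing data in chunks generates a large number of jobs per sample. A minimal sketch of that relationship follows, assuming the work for one sample is a list of units (for example candidate windows) and that each job takes a fixed number of them. The function name `chunk_jobs`, the chunk size and the unit counts are hypothetical and are not the settings used on Legion.

```python
def chunk_jobs(items, chunk_size):
    """Split a list of work units into consecutive chunks, one per job."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# Illustrative numbers only: 25 work units at 10 units per job.
windows = list(range(25))
jobs = chunk_jobs(windows, 10)
```

With these illustrative numbers the sample yields three jobs, the last holding the five remaining units. In general the number of jobs is the unit count divided by the chunk size, rounded up, so smaller chunks mean more but shorter jobs per sample.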