2014 Informatics Internships at Illumina UK

At Illumina, our goal is to apply innovative technologies and revolutionary assays to the analysis of genetic variation and function, making studies possible that were not even imaginable just a few years ago. Our products and services are used around the globe by a broad range of academic, government, pharmaceutical, biotechnology and other leading institutions.

The Computational Biology Group at Illumina UK is a multidisciplinary team that creates software and performs data analysis to support new and existing applications of Illumina’s DNA sequencing technology, including human genome re-sequencing, targeted sequencing, de novo assembly, gene expression, methylation, and copy number variation. Working closely with our colleagues in the UK and US and with our academic collaborators, we extract biological knowledge from large-scale datasets using skills in high-performance computing, data mining, visualisation and algorithmics.

We are occasionally able to offer internship placements to able and motivated students who are studying for higher degrees in relevant disciplines. Placement projects for this year will be selected from the proposals outlined on the following pages. To apply, please send your CV to the contact person for your project of interest, explaining why the project is of particular interest to you and why your skills and experience make you particularly suited to carry it out. Please do not send blanket applications to all projects.

Start date and duration are flexible, but you will be expected to work on site at Illumina’s headquarters in Chesterford for a minimum of 8 weeks. Chesterford is about 15 km south of Cambridge, UK, and can be reached from there by train and shuttle bus. We can sometimes offer practical help with obtaining temporary accommodation. A pro rata annual salary of between £14,000 and £18,000 will be offered.

Title: Visualising Genetic Variation Data on a Genome Scale
Supervisor: Peter Krusche

Genetic variation data describes the differences between a sequenced genome and a static reference sequence. Analysing such data is instrumental in understanding how specific genetic mutations can affect an organism; this has many applications, ranging from basic research in biology to modern medicine.

In this project, we are interested in creating an interactive visualisation that can be used to browse and compare one or more sets of genetic variants. The goal is to allow scientists to easily access genome-wide variation information, and to zoom in to the sequence level to see the effects caused by each variant. Moreover, we would like to connect the variation data to additional information about the changed DNA loci (e.g. genes or output from computational analyses). Circos [1] and MizBee [2] are examples of such visualisations. An example use case could be to provide easy access to the gold standard variant datasets used for improving the tools that detect such variants (see [3], [4]).

The challenge of this project lies in the fact that variant datasets can be very large (millions of records or more): it will be necessary to find a good way to make this data accessible and to implement a solution that is efficient and scalable.

The project provides a great opportunity to learn about high-throughput sequencing technology and its applications, as well as the relevant computational methods. We will provide access to modern computing infrastructure as well as plenty of data to experiment with.

This project would suit a student with a keen interest in interactive data visualisation. Students should have a strong background in web programming and databases (using e.g. D3.js – Data-driven documents http://d3js.org/).

Further Reading:
[1] Circos circular visualisations: http://circos.ca
[2] MizBee: http://www.cs.utah.edu/~miriah/mizbee/Overview.html
[3] 1000 Genomes: http://www.1000genomes.org/, http://samtools.github.io/hts-specs/VCFv4.1.pdf
[4] Platinum Genomes: http://www.illumina.com/platinumgenomes/
[5] BaseSpace: https://basespace.illumina.com/home/index

Title: New Models for Representing Genome Reference Data
Supervisors: Peter Krusche and Epameinondas Fritzilas

A common way to represent a genome is in the form of a static reference sequence: in the case of the human species, this sequence consists of more than 3 billion nucleotide characters (A/C/G/T/N) and is divided into 22+2 chromosomes (22 autosomes plus the X and Y sex chromosomes). High-throughput sequencing methods generate millions of small sequence fragments (“reads”), which are then placed along (“aligned to”) this reference sequence. Each individual sample will have differences from the reference (“genetic variants”), such as sequence insertions, deletions or more complex sequence rearrangements. The presence of such variants makes it difficult to place all reads accurately.

We can improve this accuracy by taking data about previously known genetic variants into account. For example, this could be implemented by allowing multiple possible reference sequences at the location of a variant, giving rise to a non-linear reference sequence.

In this project, we are interested in designing and implementing a prototype representation model for storing and querying such a non-linear reference sequence. The challenge is to implement a solution that is flexible and has the potential to scale to very large datasets. We would then be interested in evaluating how many more reads we can place using this approach.

The project provides a great opportunity to learn about high-throughput sequencing technology and its applications, making key contributions towards improving core computational methods.
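As a rough illustration of what a non-linear reference could look like, the following Python sketch stores a linear sequence together with known alternative alleles and enumerates every sequence the reference can spell within a window. It is a toy model under stated assumptions, not the project's intended design: the class name VariantAwareReference and its methods are hypothetical, variants are assumed not to overlap, and exhaustive enumeration would not scale to real datasets.

```python
from itertools import product


class VariantAwareReference:
    """Toy non-linear reference: a linear sequence plus known alternative alleles.

    Hypothetical illustration only; assumes non-overlapping variants.
    """

    def __init__(self, sequence):
        self.sequence = sequence
        # start position -> (reference allele, list of alternative alleles)
        self.variants = {}

    def add_variant(self, pos, ref, alt):
        """Register a known variant replacing `ref` at `pos` with `alt`."""
        if self.sequence[pos:pos + len(ref)] != ref:
            raise ValueError("reference allele does not match sequence at %d" % pos)
        known_ref, alts = self.variants.setdefault(pos, (ref, []))
        if known_ref != ref:
            raise ValueError("toy model allows one reference allele per position")
        alts.append(alt)

    def haplotypes(self, start, end):
        """Yield every sequence spelled by the window [start, end)."""
        segments = []
        pos = start
        for vpos in sorted(self.variants):
            ref, alts = self.variants[vpos]
            if vpos < pos or vpos + len(ref) > end:
                continue  # outside the window, or overlapping an earlier site
            segments.append([self.sequence[pos:vpos]])  # invariant stretch
            segments.append([ref] + alts)               # branching site
            pos = vpos + len(ref)
        segments.append([self.sequence[pos:end]])
        for combination in product(*segments):
            yield "".join(combination)

    def can_place(self, read, start, end):
        """True if the read matches exactly on at least one path in the window."""
        return any(read in hap for hap in self.haplotypes(start, end))


reference = VariantAwareReference("ACGTACGTAC")
reference.add_variant(3, "T", "G")    # a SNP
reference.add_variant(5, "CG", "C")   # a one-base deletion
print(sorted(reference.haplotypes(0, 10)))
print(reference.can_place("GGACT", 0, 10))
```

In this example the read GGACT carries both variants, so it cannot be matched exactly against the linear sequence but can be placed on one path of the non-linear reference. A practical design would more likely use a graph or index structure rather than enumerating paths.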
We will provide access to modern computing infrastructure as well as plenty of data to experiment with.

The project would suit a student interested in making the connection between models for genetic variation data and efficient algorithms and data structures for their representation. Students interested in this project should be familiar with algorithms and data structures (hashing, trees, graphs) and have programming experience (ideally in C++ and a scripting language like Perl, Python or R).

Further reading:
[1] The human genome reference: http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
[2] Some modern sequence alignment methods: http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html, http://bioinformatics.oxfordjournals.org/content/29/16/2041
[3] How genetic variation data is represented: http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

Title: Viral Quasispecies Reconstruction
Supervisors: Miao He and Lilian Janin

This internship will allow a student with a bioinformatics, mathematical or computing background to apply their expertise towards the creation of a tool to interpret viral “quasispecies”. The student will contribute to solving an important problem that could directly improve the treatment of viral infection.

HIV and hepatitis C virus (HCV) have very high mutation rates and evolve rapidly within a patient, resulting in a family of co-existing viruses related by similar mutations (a viral “quasispecies”) [1]. Reconstructing the genomes of the co-existing members of a quasispecies is clinically important, as high heterogeneity is associated with poor response to treatment [2]. Identifying the exact sequences contained within the quasispecies could improve our current ability to predict drug resistance and help doctors prescribe more effective medication [3].
We have been implementing a smart string permutation method, the Burrows-Wheeler transform, for several applications including pattern matching [4, 5]. By applying this method, the student will create a tool for reconstructing multiple genomes of a viral quasispecies from next-generation sequencing data.

This project would suit a student with solid programming skills and a passion for applying them to create solutions for real problems in biomedical research. Knowledge of algorithms and next-generation sequencing data is advantageous but not required. This will be an excellent opportunity for a student to work in an interdisciplinary area where biomedical sciences intersect with computing and mathematics.

Further reading:
[1] The quasispecies (extremely heterogeneous) nature of viral RNA genome populations: biological relevance--a review. Gene. 1985;40(1):1-8.
[2] Clinical implications of viral quasispecies heterogeneity in chronic hepatitis C. J Med Virol. 1996 Jul;49(3):242-7.
[3] Computational methods for the design of effective therapies against drug resistant HIV strains. Bioinformatics. 2005 Nov 1;21(21):3943-50.
[4] The Burrows-Wheeler transform: data compression, suffix arrays and pattern matching. Springer 2008, ISBN 978-0-387-78908-8
[5] metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA sequences. BMC Bioinformatics. 2013;14 Suppl 5:S2.

Title: Targeted deep sequencing quality control
Supervisors: John Peden and Jennifer Becq

This internship will allow a student with a bioinformatics, mathematical or computing background to apply their programming skills to develop a tool that helps assess the quality of biological data coming through our targeted sequencing pipelines. This project will contribute to Illumina’s goal of revolutionizing modern medicine.
Targeted deep sequencing of DNA will soon be routinely used for cancer diagnostics [1]. Targeted sequencing allows multiple samples to be indexed and pooled together, such that a single lane from a HiSeq2000 can contain 96 separate samples, and one run of a HiSeq would generate 16 lanes. In order to provide an accurate diagnosis, it is critical to assess the quality of every DNA sequencing experiment. Because of the heterogeneous nature of cancer, some mutations might be present at an extremely low frequency in the data. Large datasets may contain noise as well as variation in coverage, so it is crucial to attach a measure of statistical significance to each mutation we find.

The aim of the project is to complement our quality-control pipeline for targeted deep sequencing. Illumina uses the Picard suite to generate targeted sequencing metrics, and there is a need to provide more detailed reports and to visualize these metrics both within and between experiments.

A key component of the project will be to create a series of in silico experiments to assess the significance of low frequency mutations. Illumina has subtraction-based calling methods that simultaneously compare the observed allele frequency in tumour and normal samples in order to reliably detect low frequency somatic variants. These methods were developed on whole genome datasets, where the read depth at each position within an exon is broadly constant, yet the characteristics of a targeted panel are such that the read depth can vary markedly within an exon and between samples. An evaluation of the impact of this variation on detection sensitivity will be part of this project. In longitudinal studies, somatic variants are tracked across many samples, so it is important that we interpret the absence or presence of a variant within a sample in a proper statistical framework.

This project would suit a student with solid statistical knowledge and good programming skills.
Moreover, the student should have an interest in learning about the future of biomedicine and its intersection with information technology.

Further reading:
[1] TRACERx: http://scienceblog.cancerresearchuk.org/2013/07/18/a-new-era-in-lung-cancer-research-the-tracerx-study/

Title: Characterizing a gold standard set of genomic variants for diploid and somatic genomes
Supervisor: Philip Tedder

The cost of sequencing has dropped dramatically in recent years, and we have moved into an era where tens of thousands of whole human genomes are being sequenced every year. However, even though the cost and time of sequencing have decreased significantly, there are still issues with the reproducibility of the variants called from any one sequenced genome. The Platinum Genomes [1] and Platinum Cancer projects are attempts to collate a set of “gold standard” variants for a diploid and a somatic genome respectively. Both projects have a rich collection of data sets, including high depth, technical replicates, long insert and Moleculo data, that can greatly aid in this characterization.

Collation of SNPs/SNVs is well advanced and the investigation of small indels is in progress. Going forward, we have the far harder task of detecting larger variants, and this internship will contribute to it using a combination of currently available software packages, visualization tools and targeted resequencing. There is considerable opportunity for the intern to focus on areas of interest to them and to evolve the scope of the project.

This is a data analysis project that would suit a student with previous experience in bioinformatics, including reasonable programming skills in a scripting language (e.g. R, Perl or Python) and a strong understanding of genomics/next generation sequencing.

Further reading:
[1] www.platinumgenomes.org

Title: Learning confidence scores for de-novo assemblies of next-gen sequencing reads
Supervisors: Ole Schulz-Trieglaff and Jared O’Connell

A de-novo assembly is the reconstruction of a genome sequence from sequencing reads [1]. Despite recent progress, it is still considered a challenging computational problem, in particular for large genomes [2]. An assembly will usually contain sequences of varying quality: we will have high confidence in some unique regions of the sample genome and much less confidence in others, such as low-complexity or, more generally, repetitive regions.

The aim of this internship project is to use statistical learning to assign confidence values to nucleotides in the assembled sequence. Potentially useful features for a learning algorithm are the sequencing reads used to assemble the region in question, their quality values, and sequence metrics such as sequence diversity or GC content. Training data can be generated by aligning assembled contigs to a reference genome and also from simulations.

This is a great opportunity to gain experience in an interdisciplinary corporate R&D environment. We have a large number of real and simulated data sets for testing purposes. The successful candidate is expected to submit a written report summarizing the results of the project and to give a short presentation at the end of the internship.

This project is suitable for a candidate with good skills in programming and ideally some knowledge of statistics and machine learning. Previous experience with Illumina sequencing data and de-novo assembly is an advantage but not required.

Further reading:
[1] Flicek and Birney (2009) “Sense from sequence reads: methods for alignment and assembly” Nature Methods
[2] Bradnam et al. (2013) “Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species” GigaScience
[3] Mark Howison, Felipe Zapata and Casey W.
Dunn (2012) “Toward a statistically explicit understanding of de novo sequence assembly” Bioinformatics

Title: Algorithm for analysis of metagenomic DNA sequencing data
Supervisors: Sergii Ivakhno and Lilian Janin

Evaluation of the functional capacities of microorganisms long relied on laboratory cultivation of individual species. Recent improvements in DNA sequencing and sample preparation enable direct recovery of draft genomes of uncultivated bacteria from diverse natural communities, such as human gut or wound biofilms, offering new possibilities for environmental and clinical applications. Detection and quantification of thousands of microorganisms from raw sequencing data remains an open bioinformatics challenge.

Recent work at Illumina demonstrated the utility of the Burrows-Wheeler transform in facilitating fast detection of bacteria from metagenomic sequencing data [1]. The internship project will focus on extending the core functionality of the algorithm and its applications, including such areas as:
- Non-exact string matching to address the high mutation rate of bacterial genomes;
- Characterization of microbiomes based on prevailing functions rather than taxonomy;
- Comparing microbiomes to find significant changes in bacterial composition between different conditions.

The intern will work in a collaborative bioinformatics research environment and gain experience in the analysis of high-throughput DNA sequencing data, algorithm development and testing. The project will be most suitable for a student with strong C/C++ programming skills. Familiarity with algorithms and data structures for text indexing and string matching would be an advantage.

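For readers unfamiliar with the Burrows-Wheeler transform mentioned above, the following Python sketch shows the basic idea: the transform is built from the sorted rotations of a text, and the number of exact occurrences of a pattern can then be counted by backward search over the transformed string alone. This is a minimal teaching example under simplifying assumptions, not the method used in metaBEETL [1]: the function names are hypothetical, the construction is naive and memory-hungry, the rank computation is a linear scan, and only exact matching is covered.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text = text + "$"  # unique terminator, sorts before A/C/G/T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_string, pattern):
    """Count exact occurrences of `pattern` using backward search."""
    # first_index[c]: row of the first rotation that starts with character c
    first_index = {}
    for row, ch in enumerate(sorted(bwt_string)):
        first_index.setdefault(ch, row)

    def rank(ch, row):
        # occurrences of ch in bwt_string[:row]; a real index precomputes this
        return bwt_string[:row].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_index:
            return 0
        lo = first_index[ch] + rank(ch, lo)
        hi = first_index[ch] + rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("ACGTACGT")
print(transformed)
print(count_occurrences(transformed, "ACG"))
```

Each step of the backward search narrows a range of sorted rotations, so the cost of a query depends on the pattern length rather than on scanning the whole text, which is what makes the approach attractive for large read collections.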
Further reading:
[1] metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA sequences, BMC Bioinformatics http://www.biomedcentral.com/1471-2105/14/S5/S2
[2] Gut metagenome in European women with normal, impaired and diabetic glucose control, Nature http://www.nature.com/nature/journal/v498/n7452/full/nature12198.html
[3] High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity, Nature Reviews Microbiology http://www.nature.com/nrmicro/journal/v10/n9/full/nrmicro2850.html

Title: Algorithm for detection of sub-chromosomal rearrangements from DNA sequencing data for non-invasive prenatal testing
Supervisor: Sergii Ivakhno

Screening for chromosomal abnormalities in a child during pregnancy has become an important diagnostic tool for the detection of aneuploidies that lead to conditions such as Down syndrome. Non-invasive prenatal testing (NIPT) works by analysing DNA fragments present in the maternal plasma during pregnancy, which originate from the placenta and represent the unborn baby’s genome (cell-free fetal DNA, cffDNA). While sequence-based NIPT screening for aneuploidies has started to enter clinical practice, detection of smaller sub-chromosomal rearrangements, also known as copy number variants (CNVs), remains a bioinformatics challenge due to low sequence coverage, short read length and the presence of mosaic embryos, where variants are found in only a subset of cells.

The internship will focus on algorithms for CNV detection from NIPT DNA sequencing data by adapting or extending existing methods developed at Illumina [1, 2] and/or developing novel approaches that use pattern recognition and text matching algorithms, such as Hidden Markov Models and the Burrows-Wheeler transform.

The project will provide an opportunity to work in a translational bioinformatics environment and allow the student to gain experience in the analysis of high-throughput DNA sequencing data, algorithm development and testing.
The ideal candidate will have some background in machine learning and text matching algorithms, and knowledge of C/C++ and R or a related statistical data analysis language.

Further reading:
[1] CNAseg – a novel framework for identification of copy number changes in cancer from second-generation sequencing data, Bioinformatics http://bioinformatics.oxfordjournals.org/content/26/24/3051
[2] Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes, Lecture Notes in Computer Science http://arxiv.org/abs/1304.5535
[3] Noninvasive Detection of Fetal Subchromosome Abnormalities via Deep Sequencing of Maternal Plasma, American Journal of Human Genetics http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3567270/

Title: Metrics analytical framework for genomic data
Supervisor: Adrian Alexa

We are looking for an intern to develop an API and data mining methods within the R statistical language. The API will enable users to access and visualise data stored in JSON format, which is currently output by our metrics framework. These metrics assess the quality of high-throughput sequencing data and are vital both for the continual assessment of data quality and for the research and development of new methods and algorithms. Depending on the interests of applicants, the project could focus more on developing methods for the storage and access of the data using a NoSQL database like MongoDB, on efficient mining and aggregation, or on methods of visualisation.

This would be an ideal opportunity for someone interested in gaining experience of software development within a cutting-edge biotechnology company. The project requires strong programming skills, ideally in R, and an interest in data analysis, databases and/or visualization. Experience with, or an interest in learning, NoSQL databases would be an advantage, as would an interest in biological data.
The API will aim to enable users with limited familiarity with R to construct custom analyses of metrics with little knowledge of the underlying data structures, and to produce publication-quality graphics.

Title: BaseSpaceR - a package for developing NGS analysis methods on Illumina’s BaseSpace platform
Supervisor: Adrian Alexa

BaseSpace is Illumina's next-generation sequencing cloud computing environment, enabling storage, analysis and sharing of genomic data. The BaseSpaceR package provides an R interface to the BaseSpace REST API, enabling R-based methods development and worldwide deployment of bioinformatics tools.

We have an opportunity for an intern to develop and extend the current R API functionality. Depending on the interests of applicants, the project could focus more on developing and optimizing the library API calls, on implementing new methods for accessing alignment and variant data that would facilitate the development of whole-genome sequencing, RNA-Seq, ChIP-Seq and other applications, or on porting available Bioconductor workflows to BaseSpace.

This project is an opportunity to work in a cutting-edge R&D environment and gain significant intuition about the mechanics of handling high-throughput sequencing data, cloud technologies and data modeling. The intern will sharpen their data mining and programming skills, learn about data access to cloud resources, experiment with efficient data structures for genomic data and understand the inherent difficulties of the process.

In terms of technical skills, the candidate is expected to implement the methods described above, so it is essential to have experience with the R statistical environment and to be familiar with Bioconductor tools. Familiarity with cloud computing technologies, REST and HTTP protocols, and related web technologies is an advantage. Previous experience with high-throughput sequencing data would be ideal, but is not a strict requirement.