2014 Informatics Internships at Illumina UK
At Illumina, our goal is to apply innovative technologies and revolutionary assays to the analysis of
genetic variation and function, making studies possible that were not even imaginable just a few
years ago. Our products and services are used around the globe by a broad range of academic,
government, pharmaceutical, biotechnology, and other leading institutions.
The Computational Biology Group at Illumina UK is a multidisciplinary team that creates software
and performs data analysis to support new and existing applications of Illumina’s DNA sequencing
technology, including human genome re-sequencing, targeted sequencing, de novo assembly, gene
expression, methylation, and copy number variation. Working closely with our colleagues in the UK
and US and with our academic collaborators, we extract biological knowledge from large-scale
datasets using skills in high-performance computing, data mining, visualisation and algorithmics.
We are occasionally able to offer internship placements to able and motivated students who are
studying for higher degrees in relevant disciplines. Placement projects for this year will be selected
from the proposals outlined on the following pages. To apply, please send your CV to the contact
person for your project of interest, detailing why the project is of particular interest and why your
skills and experience make you particularly suited to carry it out. Please do not send blanket
applications to all projects.
Start date and duration are flexible but you will be expected to work on site at Illumina’s
headquarters in Chesterford for a minimum of 8 weeks. Chesterford is about 15km south of
Cambridge, UK and can be reached from there by train and shuttle bus. We can sometimes offer
practical help with obtaining temporary accommodation. A pro rata annual salary of between
£14,000 and £18,000 will be offered.
Title: Visualising Genetic Variation Data on a Genome Scale
Supervisor: Peter Krusche
Genetic variation data describes the differences between a sequenced genome and a static
reference sequence. Analysing such data is instrumental in understanding how specific genetic
mutations can affect an organism, and this has many applications ranging from basic research in
biology to modern medicine.
In this project, we are interested in creating an interactive visualisation that can be used to browse
and compare one or more sets of genetic variants. The goal is to allow scientists to easily access
genome-wide variation information, and be able to zoom in to a sequence level to see the effects
caused by each variant. Moreover, we would like to connect the variation data to additional
information about the changed DNA loci (e.g. genes or output from computational analyses). Circos
[1] and MizBee [2] are examples of such visualisations. An example use case could be to
provide easy access to the gold standard variant datasets used for improving the tools to detect such
variants (see [3], [4]).
The challenge of this project lies in the fact that variant datasets can be very large (millions of
records or more): it will be necessary to find a good way to make this data accessible, and to
implement a solution that is efficient and scalable.
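One possible way to keep millions of variant records browsable is to aggregate them into fixed-width windows for the genome-wide view and to fetch individual records only when the user zooms in. The following is a minimal sketch in Python, assuming variants arrive as (chromosome, position) pairs; the function name `bin_variants` and the window size are illustrative choices, not part of any existing tool.

```python
from collections import Counter

def bin_variants(variants, window=1_000_000):
    """Count variants per (chromosome, window index) for a zoomed-out view.

    `variants` is any iterable of (chromosome, 1-based position) pairs.
    The window size is a hypothetical default chosen for illustration.
    """
    counts = Counter()
    for chrom, pos in variants:
        # Convert the 1-based position to a 0-based window index.
        counts[(chrom, (pos - 1) // window)] += 1
    return counts

# Toy input: three variants on chr1 and one on chr2.
example = [("chr1", 10_500), ("chr1", 999_999), ("chr1", 1_000_001), ("chr2", 42)]
overview = bin_variants(example)
```

A front end could draw one bar per window from `overview` and request the underlying records for a window only when it is selected.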
The project provides a great opportunity to learn about high-throughput sequencing technology and
its applications, as well as the relevant computational methods. We will provide access to modern
computing infrastructure as well as plenty of data to experiment with.
This project would suit a student with a keen interest in interactive data visualisation. Students
should have a strong background in web programming and databases (using e.g. D3.js – Data-driven
documents http://d3js.org/).
Further Reading:
[1] Circos circular visualisations: http://circos.ca
[2] MizBee: http://www.cs.utah.edu/~miriah/mizbee/Overview.html
[3] 1000 Genomes: http://www.1000genomes.org/, http://samtools.github.io/htsspecs/VCFv4.1.pdf
[4] Platinum Genomes: http://www.illumina.com/platinumgenomes/
[5] BaseSpace: https://basespace.illumina.com/home/index
Title: New Models for Representing Genome Reference Data
Supervisors: Peter Krusche and Epameinondas Fritzilas
A common way to represent a genome is in the form of a static reference sequence: in the case of
the human species, this sequence consists of more than 3 billion nucleotide characters (A/C/G/T/N)
and is divided into 22+2 chromosomes. High-throughput sequencing methods generate millions of
small sequence fragments (“reads”), which are then placed along (“aligned to”) this reference
sequence. Each individual sample will have differences from the reference (“genetic variants”), such
as sequence insertions, deletions or more complex sequence rearrangements. The presence of such
variants makes it difficult to accurately place all reads.
We can improve this accuracy by taking data about previously known genetic variants into account.
For example, this could be implemented by allowing multiple possible reference sequences at the
location of a variant, giving rise to a non-linear reference sequence.
In this project, we are interested in designing and implementing a prototype representation model
for storing and querying such a non-linear reference sequence. The challenge is to implement a
solution that is flexible and has the potential to scale to very large datasets. Then, we would be
interested in evaluating how many more reads we can place using this approach.
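To illustrate what a non-linear reference could look like, the sketch below stores a linear reference together with alternative alleles at known variant sites and enumerates every sequence the resulting structure can spell. It is only a toy model under stated assumptions (0-based, sorted, non-overlapping variants with one alternative allele each); the function name `local_haplotypes` is hypothetical, and a scalable solution would need a far more compact representation than explicit enumeration.

```python
from itertools import product

def local_haplotypes(reference, variants):
    """Enumerate every sequence spelled by a reference with alternative alleles.

    `reference` is a string; `variants` is a list of
    (start, ref_allele, alt_allele) tuples with 0-based starts,
    assumed sorted and non-overlapping.
    """
    segments = []   # each segment lists the alternative strings at that step
    cursor = 0
    for start, ref_allele, alt_allele in variants:
        if reference[start:start + len(ref_allele)] != ref_allele:
            raise ValueError("reference allele does not match the reference")
        segments.append([reference[cursor:start]])     # shared sequence
        segments.append([ref_allele, alt_allele])      # branch point
        cursor = start + len(ref_allele)
    segments.append([reference[cursor:]])
    # Every path picks one alternative per segment.
    return {"".join(path) for path in product(*segments)}

# One substitution (G to T at offset 2) and one deletion (A at offset 4).
paths = local_haplotypes("ACGTACGT", [(2, "G", "T"), (4, "A", "")])
```

With two variant sites the structure spells four sequences; reads carrying either variant can then match one of them exactly instead of aligning with mismatches or gaps.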
The project provides a great opportunity to learn about high-throughput sequencing technology and
its applications, making key contributions towards improving core computational methods. We will
provide access to modern computing infrastructure as well as plenty of data to experiment with.
The project would suit a student interested in making the connection between models for genetic
variation data and efficient algorithms and data structures for their representation. Students
interested in this project should be familiar with algorithms and data structures (hashing, trees,
graphs) and have programming experience (ideally C++ and a scripting language like Perl, Python
or R).
Further reading:
[1] The human genome reference:
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/data/index.shtml
[2] Some modern sequence alignment methods:
http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html
http://bioinformatics.oxfordjournals.org/content/29/16/2041
[3] How genetic variation data is represented:
http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41
Title: Viral Quasispecies Reconstruction
Supervisors: Miao He and Lilian Janin
This internship will allow a student with a bioinformatics, mathematical or computing background to
apply their expertise towards the creation of a tool to interpret viral “quasispecies”. The student will
contribute to solving an important problem that could directly improve the treatment of viral
infection.
HIV and hepatitis C virus (HCV) have very high mutation rates, and evolve rapidly within a patient,
resulting in a family of co-existing viruses related by similar mutations (= a viral “quasispecies”) [1].
Reconstructing the genomes of the co-existing members of a quasispecies is clinically important as
high heterogeneity is associated with poor response to treatments [2]. Identifying the exact
sequences contained within the quasispecies could improve our current ability to predict drug
resistance and help doctors prescribe more effective medication [3].
We have been implementing a smart string permutation method, the Burrows-Wheeler transform,
for several applications including pattern matching [4, 5]. By applying this method, the student
will create a tool for reconstructing multiple genomes of a viral quasispecies from next-generation
sequencing data.
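For readers unfamiliar with the Burrows-Wheeler transform, the sketch below shows the textbook construction by sorting all rotations of a string, together with its inverse. This is a deliberately naive illustration, not the implementation used in our tools; practical software builds the transform without materialising the rotations. The `$` terminator is an assumption of this sketch and must not occur in the input.

```python
def bwt(text, terminator="$"):
    """Return the Burrows-Wheeler transform of `text` via sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the input")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in transformed:
        table = sorted(c + row for c, row in zip(transformed, table))
    # The row ending with the terminator is the original string.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

transformed = bwt("banana")
restored = inverse_bwt(transformed)
```

The transform tends to group identical characters together, which is what makes it useful for both compression and indexed pattern matching.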
This project would suit a student with solid programming skills and a passion for applying them to
create solutions for real problems in biomedical research. Knowledge of algorithms and next-generation
sequencing data is advantageous but not required. This will be an excellent opportunity
for a student to work in an interdisciplinary area where biomedical sciences intersect with
computing and mathematics.
Further reading:
[1] The quasispecies (extremely heterogeneous) nature of viral RNA genome populations: biological
relevance--a review. Gene. 1985;40(1):1-8.
[2] Clinical implications of viral quasispecies heterogeneity in chronic hepatitis C. J Med Virol. 1996
Jul;49(3):242-7.
[3] Computational methods for the design of effective therapies against drug resistant HIV strains.
Bioinformatics. 2005 Nov 1;21(21):3943-50.
[4] The Burrows-Wheeler transform: data compression, suffix arrays and pattern matching. Springer
2008, ISBN 978-0-387-78908-8
[5] metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA
sequences. BMC Bioinformatics. 2013;14 Suppl 5:S2.
Title: Targeted deep sequencing quality control
Supervisors: John Peden and Jennifer Becq
This internship will allow a student with a bioinformatics, mathematical or computing background to
apply their programming skills to develop a tool to assist with assessing the quality of biological data
coming through our targeted sequencing pipelines. This project will contribute to Illumina’s goal of
revolutionizing modern medicine.
Targeted deep sequencing of DNA will soon be routinely used for cancer diagnostics [1]. Targeted
sequencing allows multiple samples to be indexed and pooled together such that a single lane
from a HiSeq2000 can contain 96 separate samples, and one run of a HiSeq would generate 16 lanes.
In order to provide an accurate diagnostic it is critical to assess the quality of every DNA sequencing
experiment. Because of the heterogeneous nature of cancer, some mutations might be present at an
extremely low frequency in the data. Large datasets may contain noise as well as variation in
coverage, so it is crucial to assign a statistical significance to each mutation we find.
The aim of the project is to complement our quality-control pipeline for targeted deep sequencing.
Illumina uses the Picard suite to generate targeted sequencing metrics and there is a need to
provide more detailed reports and to visualize these metrics both within and between experiments.
A key component of the project will be to create a series of in silico experiments to assess the
significance of low frequency mutations. Illumina has subtraction-based calling methods to
simultaneously compare the observed allele frequency in both tumour and normal samples, in order
to reliably detect low frequency somatic variants. These methods were developed on whole genome
datasets where the read depth at each position within an exon is broadly constant, yet the
characteristics of a targeted panel are such that the read depth can vary markedly within an exon
and between samples. An evaluation of the impact of this variation on detection sensitivity will be
part of this project. In longitudinal studies somatic variants are tracked across many samples so it is
therefore important that we interpret the absence or presence of a variant within a sample in a
proper statistical framework.
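As a simple illustration of why read depth matters when judging low-frequency variants, the sketch below computes the probability of seeing at least a given number of variant-supporting reads purely from sequencing error, under a binomial model. The model, the function names and the per-base error rate of 0.001 are assumptions made for this example; they are not the subtraction-based calling methods mentioned above.

```python
from math import exp, lgamma, log

def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k, for 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def binomial_tail(alt_reads, depth, error_rate):
    """P(X >= alt_reads) for X ~ Binomial(depth, error_rate).

    Summing the upper tail in log space avoids overflow at high depth.
    """
    total = sum(exp(log_binom_pmf(k, depth, error_rate))
                for k in range(alt_reads, depth + 1))
    return min(1.0, total)

# The same 1% allele fraction observed at two different read depths,
# with a hypothetical per-base error rate of 0.001.
p_shallow = binomial_tail(alt_reads=1, depth=100, error_rate=0.001)
p_deep = binomial_tail(alt_reads=50, depth=5000, error_rate=0.001)
```

Under these assumptions a single supporting read at depth 100 is easily explained by error, whereas fifty supporting reads at depth 5000 are not, even though the observed allele fraction is identical.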
This project would suit a student with solid statistical knowledge and good programming skills.
Moreover, the student should have an interest in learning about the future of biomedicine and its
intersection with information technology.
Further reading:
[1] TRACERx: http://scienceblog.cancerresearchuk.org/2013/07/18/a-new-era-in-lung-cancer-research-the-tracerx-study/
Title: Characterizing a gold standard set of genomic variants for diploid and somatic genomes
Supervisor: Philip Tedder
The cost of sequencing has dropped dramatically recently and we have moved into the era where
tens of thousands of whole human genomes are being sequenced every year. However, even though
the cost and time of sequencing have decreased significantly, there are still issues with the
reproducibility of the variants called from any one sequenced genome.
The Platinum Genomes [1] and Platinum Cancer projects are attempts to collate a set of “gold standard”
variants for a diploid and a somatic genome respectively. Both projects have a rich collection of
data sets to support this characterization, including high-depth data, technical replicates, long-insert
and Moleculo data.
Collation of SNPs/SNVs is well advanced and the investigation of small indels is in progress. Going
forward, we have the far harder task of detecting larger variants and this internship will aid in this
task using a combination of currently available software packages, visualization tools and targeted
resequencing. There is considerable opportunity for the intern to focus on areas of interest to them
and evolve the scope of the project.
This is a data analysis project that would suit a student with previous experience in bioinformatics,
including reasonable programming skills in a scripting language (e.g. R, Perl or Python)
and a strong understanding of genomics/next-generation sequencing.
Further reading:
[1] www.platinumgenomes.org
Title: Learning confidence scores for de-novo assemblies of next-gen sequencing reads
Supervisors: Ole Schulz-Trieglaff and Jared O’Connell
A de-novo assembly is the reconstruction of a genome sequence from sequencing reads [1]. Despite
recent progress it is still considered a challenging computational problem, in particular for large
genomes [2]. An assembly will usually contain sequences of varying quality: we will have
high confidence in some unique regions of the sample genome and much less confidence in others,
such as low-complexity or, more generally, repetitive regions.
The aim of this internship project is to use statistical learning to assign confidence values to
nucleotides in the assembled sequence. Potentially useful features for a learning algorithm are the
sequencing reads used to assemble the region in question, their quality values and sequence metrics such
as sequence diversity or GC content. Training data can be generated by aligning assembled contigs
to a reference genome and also from simulations.
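Two of the sequence metrics mentioned above can be sketched in a few lines. The example computes GC content and, as one possible reading of "sequence diversity", the Shannon entropy of the k-mer distribution in a window; that interpretation, the function names and the choice of k are assumptions of this sketch rather than a fixed part of the project.

```python
from collections import Counter
from math import log2

def gc_content(seq):
    """Fraction of called bases (A/C/G/T) that are G or C."""
    called = [base for base in seq.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)

def kmer_entropy(seq, k=2):
    """Shannon entropy (bits) of the k-mer distribution in `seq`.

    Low values indicate low-complexity or repetitive sequence.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum(c / total * log2(c / total) for c in counts.values())

# Feature vector for one hypothetical assembled window.
window = "ACGTACGTTTTTTTTT"
features = (gc_content(window), kmer_entropy(window, k=2))
```

Features like these could be computed per window of an assembled contig and combined with read-level information as input to a learning algorithm.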
This is a great opportunity to gain experience in an interdisciplinary corporate R&D environment. We
have a large number of real and simulated data sets for testing purposes. The successful candidate is
expected to submit a written report summarizing the results of the project and to give a short
presentation at the end of the internship.
This project is suitable for a candidate with good skills in programming and ideally some knowledge
of statistics and machine learning. Previous experience with Illumina sequencing data and de-novo
assembly is an advantage but not required.
Further reading:
[1] Flicek and Birney 2009 “Sense from sequence reads: methods for alignment and assembly”
Nature Methods
[2] Bradnam et al. (2013) “Assemblathon 2: evaluating de novo methods of genome assembly in
three vertebrate species” GigaScience
[3] Mark Howison, Felipe Zapata and Casey W. Dunn (2012) “Toward a statistically explicit
understanding of de novo sequence assembly” Bioinformatics
Title: Algorithm for analysis of metagenomic DNA sequencing data
Supervisors: Sergii Ivakhno and Lilian Janin
Evaluation of the functional capacities of microorganisms has long relied on laboratory cultivation of
individual species. Recent improvements in DNA sequencing and sample preparation enable direct
recovery of draft genomes of uncultivated bacteria from diverse natural communities, such as
the human gut or wound biofilms, offering new possibilities for environmental and clinical applications.
Detection and quantification of thousands of microorganisms from raw sequencing data remains an
open bioinformatics challenge. Recent work at Illumina demonstrated the utility of the Burrows-Wheeler
transform in facilitating fast detection of bacteria from metagenomic sequencing data [1]. The
internship project will focus on extending the core functionality of the algorithm and its applications,
including such areas as:
- Non-exact string matching to address a high mutation rate of bacterial genomes;
- Characterization of microbiomes based on prevailing functions rather than taxonomy;
- Comparing microbiomes to find significant changes in bacterial composition between
  different conditions.
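The first of these areas, non-exact string matching, can be illustrated with a brute-force scan that reports every position where a pattern occurs with at most a given number of substitutions. This is only a baseline sketch; the function name is hypothetical, and an index-based method such as one built on the Burrows-Wheeler transform would be needed at the scale of real metagenomic data.

```python
def find_with_mismatches(text, pattern, max_mismatches=1):
    """Return (start, mismatches) for each window of `text` that matches
    `pattern` with at most `max_mismatches` substitutions."""
    hits = []
    m = len(pattern)
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[start:start + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break   # too many differences, abandon this window
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append((start, mismatches))
    return hits

hits = find_with_mismatches("ACGTACCT", "ACGT", max_mismatches=1)
```

Here the pattern is found exactly at offset 0 and with one substitution at offset 4.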
The intern will work in a collaborative bioinformatics research environment and gain experience in
the analysis of high-throughput DNA sequencing data, algorithm development and testing. The
project will be most suitable for a student with strong C/C++ programming skills. Familiarity with
algorithms and data structures for text indexing and string matching would be an advantage.
Further reading:
[1] metaBEETL: high-throughput analysis of heterogeneous microbial populations from shotgun DNA
sequences, BMC Bioinformatics http://www.biomedcentral.com/1471-2105/14/S5/S2
[2] Gut metagenome in European women with normal, impaired and diabetic glucose control, Nature
http://www.nature.com/nature/journal/v498/n7452/full/nature12198.html
[3] High-throughput bacterial genome sequencing: an embarrassment of choice, a world of
opportunity, Nature Reviews Microbiology
http://www.nature.com/nrmicro/journal/v10/n9/full/nrmicro2850.html
Title: Algorithm for detection of sub-chromosomal rearrangements from DNA sequencing data for
non-invasive prenatal testing
Supervisor: Sergii Ivakhno
Screening for chromosomal abnormalities in a child during pregnancy has become an important
diagnostic tool for detection of aneuploidies that lead to such conditions as Down syndrome. Non-invasive
prenatal testing (NIPT) works by analysing DNA fragments present in the maternal plasma
during pregnancy, which originate from the placenta and represent an unborn baby’s genome (cell-free
fetal DNA, cffDNA). While sequence-based NIPT screening for aneuploidies has started to enter
clinical practice, detection of smaller sub-chromosomal rearrangements, also known as copy number
variants (CNVs), still remains a bioinformatics challenge due to low sequence coverage, short read
length and the presence of mosaic embryos where variants are only found in a subset of cells.
The internship will focus on algorithms for CNV detection from NIPT DNA sequencing data by
adapting/extending existing methods developed at Illumina [1, 2] and/or developing novel
approaches utilizing pattern recognition and text matching algorithms, such as Hidden Markov
Models and the Burrows-Wheeler transform. The project will provide an opportunity to work in a
translational bioinformatics environment and allow the student to gain experience in the analysis of
high-throughput DNA sequencing data, algorithm development and testing. The ideal candidate will
have some background in machine learning and text matching algorithms and knowledge of C/C++
and R or a related statistical data analysis language.
Further reading:
[1] CNAseg – a novel framework for identification of copy number changes in cancer from second-generation sequencing data, Bioinformatics
http://bioinformatics.oxfordjournals.org/content/26/24/3051
[2] Comparing DNA Sequence Collections by Direct Comparison of Compressed Text Indexes, Lecture
Notes in Computer Science http://arxiv.org/abs/1304.5535
[3] Noninvasive Detection of Fetal Subchromosome Abnormalities via Deep Sequencing of Maternal
Plasma, American Journal of Human Genetics
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3567270/
Title: Metrics analytical framework for genomic data
Supervisor: Adrian Alexa
We are looking for an intern to develop an API and data mining methods within the R statistical
language. The API will enable users to access and visualise data stored in JSON format which is
currently output by our metrics framework. These metrics assess the quality of high-throughput
sequencing data and are vital both for the continual assessment of data quality and for research and
development of new methods and algorithms. Depending on the interest of applicants the project
could focus more on developing methods for the storage and access of the data by use of a NoSQL
database like mongoDB, efficient mining and aggregation, or on methods of visualisation.
This would be an ideal opportunity for someone interested in gaining experience of software
development within a cutting-edge biotechnology company. The project would require strong
programming skills, ideally in R, and an interest in data analysis, databases and/or visualization.
Experience with or interest in learning NoSQL databases would be an advantage, as would an interest in
biological data. The API will aim to enable users with limited familiarity with R to construct custom
analyses of metrics with little knowledge of the underlying data structures, and to produce
publication-quality graphics.
Title: BaseSpaceR - a package for developing NGS analysis methods on Illumina’s BaseSpace
platform
Supervisor: Adrian Alexa
BaseSpace is Illumina's next-generation sequencing cloud computing environment enabling storage,
analysis and sharing of genomic data. The BaseSpaceR package provides an R interface to the BaseSpace
REST API, enabling R-based methods development and worldwide deployment of bioinformatics
tools.
We have an opportunity for an intern to develop and extend the current R API functionality.
Depending on the interest of applicants the project could focus more on developing and optimizing
the library API calls, implementing new methods for accessing alignment and variant data that would
facilitate the development of whole-genome sequencing, RNA-Seq, ChIP-Seq, etc. applications, or
porting available Bioconductor workflows to BaseSpace.
This project is an opportunity to work in a cutting-edge R&D environment and gain significant
intuition about the mechanics of handling high-throughput sequencing data, cloud technologies and
data modeling. The intern will sharpen their data mining and programming skills, learn about data
access to cloud resources, experiment with efficient data structures for genomic data and
understand the inherent difficulties of the process.
In terms of technical skills, the candidate is expected to implement the above-stated methods; it is
therefore essential to have experience with the R statistical environment and to be familiar with
Bioconductor tools. Familiarity with cloud computing technologies, REST and HTTP protocols, and
related web technologies is an advantage. Previous experience with high-throughput sequencing
data would be ideal, but not a strict requirement.