Assembly and analysis group

While the technology for ultra-high-throughput DNA sequencing is advancing rapidly, the
storage, automated analysis, and dissemination of results from such a vast data set remain a
grand technical challenge. We foresee a distributed network of sequencing facilities feeding
into a large data center, which will conduct automated assembly and annotation, while
specialized analysis may be performed by external domain experts (Fig. 1). For this scheme to
succeed, several issues will need to be addressed, including sample quality, sample tracking,
coupling of taxonomic data with important phenotypic attributes of each species and clade,
algorithm development for efficient large-scale assembly and multispecies comparisons, and
computer cluster design. Below, we discuss some salient features of the collection, storage,
analysis, and dissemination of the results generated from the sequencing of 10,000 vertebrate
genomes.
To put the ambition of the Genome 10K project into perspective, consider the following
back-of-the-envelope calculation of sample throughput and processing requirements. Completing
the project in 5 years requires 2,000 genomes to be sequenced, assembled, and annotated every
year. It seems reasonable to distribute this workload over 20 sites, so that each sequencing
site completes 100 samples (genomes) per year. This translates to a required output of 2
genomes per week per site, including both sequencing and draft assembly.
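The throughput arithmetic above can be checked with a short script. This is only a sketch of the planning targets stated in the text; the assumption of roughly 50 working weeks per year is ours, not the project's.

```python
# Back-of-the-envelope throughput check for the Genome 10K plan.
TOTAL_GENOMES = 10_000
PROJECT_YEARS = 5
SITES = 20
WEEKS_PER_YEAR = 50  # assumption: ~50 working weeks per year

genomes_per_year = TOTAL_GENOMES // PROJECT_YEARS            # 2000 genomes/year
genomes_per_site_per_year = genomes_per_year // SITES        # 100 per site
genomes_per_site_per_week = genomes_per_site_per_year / WEEKS_PER_YEAR  # 2.0

print(genomes_per_year, genomes_per_site_per_year, genomes_per_site_per_week)
```

The numbers recover the targets quoted above: 2,000 genomes per year overall, 100 per site per year, and 2 per site per week.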
The logistical challenge of distributing 10,000 sequence-ready samples across 20 sequencing
sites at a rate of 100 per site per year (for 5 years) necessitates a highly structured workflow
that minimizes delays in sequencing. We therefore suggest that the sample stewards from the
four taxonomic clades interact with a sequencing steward at one of the 20 sequencing sites
(Fig. 1). With an anticipated rate of 2 genomes per week, an established pipeline of
sequencing samples needs to be implemented at each site (Fig. 1).
Within the proposed pipeline, we recommend policies for the distribution of specimens to
sequencing sites that guard against contamination and errors in sample tracking. Such a policy
might ensure, for example, that taxonomically related samples are sequenced at different sites.
Implementing the distribution policy might involve a central steward who coordinates sample
collections for each of the 20 sequencing sites.
Besides sequencing, the sequencing sites will conduct the primary analysis of base calling and
minimal assembly. The results of this analysis will be transferred to the data center in the
form of FASTQ files.
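For readers unfamiliar with the transfer format, FASTQ stores each read as a four-line record: a header, the base calls, a separator, and a per-base quality string. The following minimal parser is an illustrative sketch, not part of the proposed pipeline software.

```python
# Minimal FASTQ parser: each record is four lines
# (header starting with "@", sequence, "+" separator, quality string).
def parse_fastq(lines):
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+")
        records.append({"id": header[1:].strip(),
                        "seq": seq.strip(),
                        "qual": qual.strip()})
    return records

# A single hypothetical read with its quality string.
example = ["@read_001", "GATTACA", "+", "IIIIIII"]
print(parse_fastq(example))
```

The per-base quality scores are what allow the data center to weigh evidence during assembly, which is why FASTQ rather than bare FASTA is proposed for this transfer.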
The completeness of each species' assembly will depend on the sequencing technology
employed. The most useful assembly for annotation and comparative genomics is a "complete"
assembly: sequence coverage deep enough to yield (ideally) one sequence contig per
chromosome, together with the identification of DNA polymorphisms within and between
species. This is a central goal of the project.
Sample quality guidelines
The goal of the Genome 10K project is to produce assembled whole chromosomes of high
enough quality to support bioinformatic analyses including gene finding, comparative genomics,
and the identification of functional elements. Samples submitted for sequencing must meet
minimum reporting guidelines so that the quality of the DNA can be assessed. Because
sequencing technology is improving rapidly, future sequencers may be able to produce useful
reads from samples spanning a wide spectrum of quality. Therefore, rather than specifying a
minimum set of quality guidelines, we propose that samples meet a minimum set of reporting
guidelines. For example, while it might be preferable for samples to be viably frozen at a
controlled rate to preserve cellular components, not all samples can be stored in this manner.
Instead of rejecting samples that do not meet minimum storage requirements, each sample
must be accompanied by descriptors documenting how it has been preserved.
Because so many samples are anticipated, reporting should follow controlled terminologies as
much as possible. The Genome 10K effort can leverage existing data standardization schemes
for reporting the results of biological experiments, such as the Minimum Information for
Biological and Biomedical Investigations (MIBBI). Standards for reporting samples and
sequencing results will be crafted along MIBBI-compliant standards such as Minimum
Information about a Genome Sequence (MIGS) and Minimum Information about a
high-throughput Sequencing Experiment (MINSEQE). These standards outline the information
necessary to assess the quality and origin of analyzed specimens, including geographic
referencing, storage, and handling.
Because reporting samples in standard formats can be onerous for experimental groups, the
Genome 10K consortium will produce forms and Excel spreadsheets to help describe samples
according to the minimum reporting standards. Tools for validating submissions will also be
provided.
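A submission validator could be as simple as checking each sample record against the required descriptor set. The sketch below uses illustrative field names; the actual required fields would come from the MIGS/MINSEQE-derived standard, not from this example.

```python
# Sketch of a minimum-reporting validator. The required fields here are
# illustrative placeholders, not the project's actual MIGS/MINSEQE field set.
REQUIRED_FIELDS = {"species", "collection_date", "geographic_location",
                   "storage_method", "tissue_type"}

def validate_sample(record):
    """Return the set of missing or empty required descriptors (empty set = valid)."""
    return REQUIRED_FIELDS - {k for k, v in record.items() if v}

# A hypothetical sample record with one descriptor left blank.
sample = {"species": "Tursiops truncatus",
          "collection_date": "2009-06-01",
          "geographic_location": "Sarasota Bay, FL",
          "storage_method": "flash-frozen, -80C",
          "tissue_type": ""}
print(validate_sample(sample))  # {'tissue_type'}
```

Note that, in keeping with the reporting-over-quality philosophy above, the validator flags missing descriptors rather than rejecting samples for how they were stored.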
Computational challenges
Analysis and interpretation of 10,000 genomes require scalable resources for data collection
from the sequencing centers, data analysis, and data distribution. The size and scope of the
endeavor require centralization and strong coordination among the sequencing centers and
analysis groups.
Data Center
Assembly of vertebrate genomes using new sequencing technologies is still an active research
topic. Currently, a full assembly requires a large-memory machine (~1.5 TB of RAM), 128 CPUs,
and about 2 weeks to complete. Assembly methods are likely to improve rapidly over the next
few years; to take advantage of these improvements efficiently, we propose that a central data
center compute and house the final assemblies.
However, the raw data from a vertebrate sequencing run can occupy many terabytes of disk.
Transferring all raw data to a central location daily would exceed the capacity of most
networks, so we propose that all easily automated analysis be done at the sequencing sites.
We envision this being a 'minimal' assembly of all short reads into small contigs. The central
data center will then receive this assembly in a format that takes up far less storage space
(e.g., FASTA sequence with quality scores).
This 'minimal' assembly will be archived, accessioned, and made publicly available. The
current protocol for data release is that every contig over 1,000 bases is made public. As the
precise nature of the data release depends on the properties of the technology used, the spirit
of timely public data release should be maintained even though the details cannot yet be
finalized.
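The release rule stated above, that contigs over 1,000 bases become public, amounts to a simple length filter over the minimal assembly. The sketch below illustrates the policy only; contig names and the surrounding pipeline are hypothetical.

```python
# Sketch of the proposed release policy: publish contigs longer than 1000 bases.
MIN_RELEASE_LENGTH = 1000

def releasable(contigs):
    """Filter an assembly (name -> sequence) down to publicly releasable contigs."""
    return {name: seq for name, seq in contigs.items()
            if len(seq) > MIN_RELEASE_LENGTH}

# A toy minimal assembly with contigs of 1500, 400, and 1001 bases.
assembly = {"contig_1": "A" * 1500, "contig_2": "C" * 400, "contig_3": "G" * 1001}
print(sorted(releasable(assembly)))  # ['contig_1', 'contig_3']
```

As noted above, the exact threshold may change with the technology; only the principle of a length-based timely release is fixed here.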
Sample database
The sample database manager will work with the species groups to develop a schema to store
sample provenance, quality, and availability. This database will also serve as a tracking
database, displaying up-to-date information about current sequencing, assembly, and
annotation, as well as projections of future sequencing dates. The tracking database will be
viewable through the web so that any interested party can follow the progress of each species'
sequencing project.
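To make the proposal concrete, the provenance and tracking data described above could be captured in two tables, one for the sample itself and one for its progress through the pipeline. The schema below is an illustrative sketch in SQLite; the table names, columns, and stage labels are assumptions, to be replaced by the schema the sample database manager develops with the species groups.

```python
# Illustrative schema for the sample tracking database (names are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    species       TEXT NOT NULL,
    provenance    TEXT,           -- collector, locality, date
    quality_notes TEXT,
    available     INTEGER         -- 1 if material remains for sequencing
);
CREATE TABLE progress (
    sample_id      INTEGER REFERENCES sample(sample_id),
    stage          TEXT CHECK (stage IN
                   ('queued', 'sequencing', 'assembly', 'annotation', 'done')),
    updated        TEXT,          -- date of last status change
    projected_date TEXT           -- projected sequencing date, if still queued
);
""")
# A hypothetical sample mid-way through the pipeline.
conn.execute("INSERT INTO sample VALUES (1, 'Gallus gallus', 'zoo collection', NULL, 1)")
conn.execute("INSERT INTO progress VALUES (1, 'assembly', '2010-03-01', NULL)")
stage = conn.execute("SELECT stage FROM progress WHERE sample_id = 1").fetchone()[0]
print(stage)  # assembly
```

A web front end over such tables would provide the public progress view described above.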
Timing
Due to the anticipated rapid improvement in technology, some decisions should be postponed
as long as possible to allow better estimates of precise parameters (e.g., exact sample size
and assembly method). However, it is imperative that the sample tracking database be
operational as soon as possible in order to store the thousands of existing samples.
Finally, it should be emphasized that assembly and annotation on this scale have never been
attempted before and will require considerable effort to run smoothly. It would be unwise to
develop such a computational pipeline while sequence is being processed, so we suggest a
year's development time before sequencing starts. This time would be used to develop a
robust computational pipeline as well as to evaluate assembly and annotation methods.
Figure 1. An overview of the sample and data flow through the multiple sequencing centers. A
central tracking database (Sample DB) will store the progress of all species through the
sequencing and analysis process and will be publicly available.