Assembly and analysis group

While the technology for ultra-high-throughput DNA sequencing is advancing rapidly, the
storage, automated analysis, and dissemination of results from such a vast data set remain a
grand technical challenge. We foresee a distributed network of sequencing facilities feeding
into a large data center, which will conduct automated assembly and annotation, while
specialized analysis may be performed by external domain experts (Fig. 1). For this scheme to
succeed, several issues will need to be addressed, including sample quality, sample tracking,
coupling of taxonomic data with important phenotypic attributes of each species and clade,
algorithm development for efficient large-scale assembly and multispecies comparisons, and
computer cluster design. Below, we discuss some salient features of the collection, storage,
analysis, and dissemination of the results generated from the sequencing of 10,000 vertebrate
genomes.
To put the ambition of the Genome 10K project into perspective, consider the following
back-of-the-envelope calculation of sample throughput and processing requirements. Completing
the project in 5 years requires 2,000 genomes to be sequenced, assembled, and annotated every
year. It seems reasonable to distribute this workload over 20 sites, so that each sequencing
site completes 100 samples (genomes) per year. This translates to a required output of 2
genomes per week per site, including both sequencing and draft assembly.
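The throughput arithmetic above can be checked with a short script. This is only a sketch of the planning targets stated in the text; the assumption of roughly 50 working weeks per year is ours, not the project's.

```python
# Back-of-the-envelope throughput check for the Genome 10K plan.
TOTAL_GENOMES = 10_000
PROJECT_YEARS = 5
SITES = 20
WEEKS_PER_YEAR = 50  # assumption: ~50 working weeks per year

genomes_per_year = TOTAL_GENOMES // PROJECT_YEARS            # 2000 genomes/year
genomes_per_site_per_year = genomes_per_year // SITES        # 100 per site
genomes_per_site_per_week = genomes_per_site_per_year / WEEKS_PER_YEAR  # 2.0

print(genomes_per_year, genomes_per_site_per_year, genomes_per_site_per_week)
```

The numbers recover the targets quoted above: 2,000 genomes per year overall, 100 per site per year, and 2 per site per week.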
The logistical challenge of distributing 10,000 sequence-ready samples across 20 sequencing
sites at a rate of 100 per site per year (for 5 years) necessitates a highly structured workflow
that minimizes delays in sequencing. We therefore suggest that the sample stewards from the
four taxonomic clades interact with a sequencing steward at one of the 20 sequencing sites
(Fig. 1). With an anticipated rate of 2 genomes per week, an established pipeline of
sequencing samples needs to be implemented at each site (Fig. 1).
Within the proposed pipeline, we recommend policies for the distribution of specimens to
sequencing sites that guard against contamination and errors in sample tracking. Such a policy
might ensure, for example, that taxonomically related samples are sequenced at different sites.
Implementing the distribution policy might involve a central steward who coordinates sample
collections for each of the 20 sequencing sites.
Besides sequencing, the sequencing sites will conduct the primary analysis of base calling and
minimal assembly. The results of this analysis will be transferred to the data center in the
form of FASTQ files.
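For readers unfamiliar with the transfer format, FASTQ stores each read as a four-line record: a header, the base calls, a separator, and a per-base quality string. The following minimal parser is an illustrative sketch, not part of the proposed pipeline software.

```python
# Minimal FASTQ parser: each record is four lines
# (header starting with "@", sequence, "+" separator, quality string).
def parse_fastq(lines):
    records = []
    for i in range(0, len(lines), 4):
        header, seq, sep, qual = lines[i:i + 4]
        assert header.startswith("@") and sep.startswith("+")
        records.append({"id": header[1:].strip(),
                        "seq": seq.strip(),
                        "qual": qual.strip()})
    return records

# A single hypothetical read with its quality string.
example = ["@read_001", "GATTACA", "+", "IIIIIII"]
print(parse_fastq(example))
```

The per-base quality scores are what allow the data center to weigh evidence during assembly, which is why FASTQ rather than bare FASTA is proposed for this transfer.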
The completeness of each species' assembly will depend on the sequencing technology
employed. The most useful assembly for annotation and comparative genomics is a "complete"
assembly: sequence coverage deep enough to yield (ideally) one sequence contig per
chromosome, together with the identification of DNA polymorphisms within and between
species. This is a central goal of the project.
Sample quality guidelines
The goal of the Genome 10K project is to produce assembled whole chromosomes of high
enough quality to support bioinformatic analyses including gene finding, comparative genomics,
and the identification of functional elements. Samples submitted for sequencing must meet
minimum reporting guidelines so that the quality of the DNA can be assessed. Because
sequencing technology is improving rapidly, future sequencers may be able to produce useful
reads from samples spanning a wide spectrum of quality. Therefore, rather than specifying a
minimum set of quality guidelines, we propose that samples meet a minimum set of reporting
guidelines. For example, while it might be preferable for samples to be viably frozen at a
controlled rate to preserve cellular components, not all samples can be stored in this manner.
Instead of rejecting samples that do not meet minimum storage requirements, each sample
must be accompanied by descriptors documenting how it has been preserved.
Because so many samples are anticipated, reporting should follow controlled terminologies as
much as possible. The Genome 10K effort can leverage existing data standardization schemes
for reporting the results of biological experiments, such as the Minimum Information for
Biological and Biomedical Investigations (MIBBI). Standards for reporting samples and
sequencing results will be crafted along MIBBI-compliant standards such as Minimum
Information about a Genome Sequence (MIGS) and Minimum Information about a
high-throughput Sequencing Experiment (MINSEQE). These standards outline the information
necessary to assess the quality and origin of analyzed specimens, including geographic
referencing, storage, and handling.
Because reporting samples in standard formats can be onerous for experimental groups, the
Genome 10K consortium will produce forms and Excel spreadsheets to help describe samples
according to the minimum reporting standards. Tools for validating submissions will also be
provided.
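A submission validator could be as simple as checking each sample record against the required descriptor set. The sketch below uses illustrative field names; the actual required fields would come from the MIGS/MINSEQE-derived standard, not from this example.

```python
# Sketch of a minimum-reporting validator. The required fields here are
# illustrative placeholders, not the project's actual MIGS/MINSEQE field set.
REQUIRED_FIELDS = {"species", "collection_date", "geographic_location",
                   "storage_method", "tissue_type"}

def validate_sample(record):
    """Return the set of missing or empty required descriptors (empty set = valid)."""
    return REQUIRED_FIELDS - {k for k, v in record.items() if v}

# A hypothetical sample record with one descriptor left blank.
sample = {"species": "Tursiops truncatus",
          "collection_date": "2009-06-01",
          "geographic_location": "Sarasota Bay, FL",
          "storage_method": "flash-frozen, -80C",
          "tissue_type": ""}
print(validate_sample(sample))  # {'tissue_type'}
```

Note that, in keeping with the reporting-over-quality philosophy above, the validator flags missing descriptors rather than rejecting samples for how they were stored.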
Computational challenges
Analysis and interpretation of 10,000 genomes require scalable resources for data collection
from the sequencing centers, data analysis, and data distribution. The size and scope of the
endeavor require centralization and strong coordination among the sequencing centers and
analysis groups.
Data Center
Assembly of vertebrate genomes using new sequencing technologies is still an active research
topic. Currently, a full assembly requires a large-memory machine (~1.5 TB of RAM), 128 CPUs,
and about 2 weeks to complete. Assembly methods are likely to improve rapidly over the next
few years; to take advantage of these improvements efficiently, we propose that a central data
center compute and house the final assemblies.
However, the raw data from a vertebrate sequencing run can occupy many terabytes of disk.
Transferring all raw data to a central location daily would exceed the capacity of most
networks, so we propose that all easily automated analysis be done at the sequencing sites.
We envision this being a 'minimal' assembly of all short reads into small contigs. The central
data center will then receive this assembly in a format that takes up far less storage space
(e.g., FASTA sequence with quality scores).
This 'minimal' assembly will be archived, accessioned, and made publicly available. The
current protocol for data release is that every contig over 1,000 bases is made public. As the
precise nature of the data release depends on the properties of the technology used, the spirit
of timely public data release should be maintained even though the details cannot yet be
finalized.
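The release rule stated above, that contigs over 1,000 bases become public, amounts to a simple length filter over the minimal assembly. The sketch below illustrates the policy only; contig names and the surrounding pipeline are hypothetical.

```python
# Sketch of the proposed release policy: publish contigs longer than 1000 bases.
MIN_RELEASE_LENGTH = 1000

def releasable(contigs):
    """Filter an assembly (name -> sequence) down to publicly releasable contigs."""
    return {name: seq for name, seq in contigs.items()
            if len(seq) > MIN_RELEASE_LENGTH}

# A toy minimal assembly with contigs of 1500, 400, and 1001 bases.
assembly = {"contig_1": "A" * 1500, "contig_2": "C" * 400, "contig_3": "G" * 1001}
print(sorted(releasable(assembly)))  # ['contig_1', 'contig_3']
```

As noted above, the exact threshold may change with the technology; only the principle of a length-based timely release is fixed here.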
Sample database
The sample database manager will work with the species groups to develop a schema to store
sample provenance, quality, and availability. This database will also serve as a tracking
database, displaying up-to-date information about current sequencing, assembly, and
annotation, as well as projections of future sequencing dates. The tracking database will be
viewable through the web so that any interested party can follow the progress of each species'
sequencing project.
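To make the proposal concrete, the provenance and tracking data described above could be captured in two tables, one for the sample itself and one for its progress through the pipeline. The schema below is an illustrative sketch in SQLite; the table names, columns, and stage labels are assumptions, to be replaced by the schema the sample database manager develops with the species groups.

```python
# Illustrative schema for the sample tracking database (names are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sample (
    sample_id     INTEGER PRIMARY KEY,
    species       TEXT NOT NULL,
    provenance    TEXT,           -- collector, locality, date
    quality_notes TEXT,
    available     INTEGER         -- 1 if material remains for sequencing
);
CREATE TABLE progress (
    sample_id      INTEGER REFERENCES sample(sample_id),
    stage          TEXT CHECK (stage IN
                   ('queued', 'sequencing', 'assembly', 'annotation', 'done')),
    updated        TEXT,          -- date of last status change
    projected_date TEXT           -- projected sequencing date, if still queued
);
""")
# A hypothetical sample mid-way through the pipeline.
conn.execute("INSERT INTO sample VALUES (1, 'Gallus gallus', 'zoo collection', NULL, 1)")
conn.execute("INSERT INTO progress VALUES (1, 'assembly', '2010-03-01', NULL)")
stage = conn.execute("SELECT stage FROM progress WHERE sample_id = 1").fetchone()[0]
print(stage)  # assembly
```

A web front end over such tables would provide the public progress view described above.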
Timing
Due to the anticipated rapid improvement in technology, some decisions should be postponed
as long as possible to allow better estimates of precise parameters (e.g., exact sample size
and assembly method). However, it is imperative that the sample tracking database be
operational as soon as possible in order to store the thousands of existing samples.
Finally, it should be emphasized that assembly and annotation on this scale have never been
attempted before and will require considerable effort to run smoothly. It would be unwise to
develop such a computational pipeline while sequence is being processed, so we suggest a
year's development time before sequencing starts. This time would be used to develop a
robust computational pipeline as well as to evaluate assembly and annotation methods.
Figure 1. An overview of the sample and data flow through the multiple sequencing centers. A
central tracking database (Sample DB) will store the progress of all species through the
sequencing and analysis process and will be publicly available.