JanPlan 2012
• Find three different definitions of the word “bioinformatics”
• How is “bioinformatics” different from “computational biology”?
• What areas of biological research are dependent on bioinformatics?
• Database searching
• Sequence analysis
• Phylogenetic reconstruction
• Molecular evolution
• Gene expression
• Genome assembly
• Genome annotation
• Metagenomics
• NCBI, EMBL & DDBJ
• What role do these organizations play in global society?
• How do their missions differ?
• NCBI Training and Tutorials page
• The NCBI Handbook
• NCBI How-To page
• NCBI Help Manual
• Annotated collection of all publicly available nucleotide sequences and their protein translations.
• Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.
• Grows exponentially, doubling every 10 months
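To get a feel for what a 10-month doubling time means, the short sketch below converts it into a growth factor over any number of months. The function name and the assumption of perfectly smooth exponential growth are illustrative only; real database growth is not this regular.

```python
def growth_factor(months, doubling_time_months=10.0):
    """Return how many times larger the database is after `months`,
    assuming ideal exponential growth with a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


# Growth after one doubling period, one year, and five years.
for months in (10, 12, 60):
    print(f"{months} months: x{growth_factor(months):.1f}")
```

Under this idealized assumption, the database grows roughly 2.3-fold per year.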
• Initially built and maintained at Los Alamos National Laboratory.
• Transferred to NCBI in early 1990s by congressional mandate.
• Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.
• Submitters may keep their data confidential for a specified period of time prior to publication.
• A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).
• If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span is mapped.
• Example
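As a rough illustration of a conceptual translation, the sketch below translates an annotated span of a DNA sequence into protein using the standard genetic code. The function name `translate_cds` and the toy sequence are made up for this example. It assumes a forward-strand CDS with a single continuous span given in 1-based, inclusive coordinates, so joined or complementary features are not handled.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate_cds(sequence, start, end):
    """Translate the span start..end (1-based, inclusive) of a DNA sequence.

    Translation stops at the first stop codon, which is not included
    in the returned protein string.
    """
    cds = sequence[start - 1:end].upper()
    protein = []
    # Walk the span codon by codon, ignoring any trailing partial codon.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        amino_acid = CODON_TABLE.get(cds[pos:pos + 3], "X")  # X = unrecognized codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


record_sequence = "GGATGGCCTAAGG"  # made-up 13-base sequence
print(translate_cds(record_sequence, 3, 11))  # prints "MA"
```

Here the hypothetical CDS spans bases 3 to 11 (ATG GCC TAA), giving methionine, alanine, and a stop.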
• HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.
• Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.
• Data are submitted in four phases:
• Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.
• Phase 1: Entries are assembled into contigs that are separated by sequence gaps. The relative order and orientation of the contigs are not known.
• Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.
• Phase 3: Sequences are of finished quality and have no gaps.
For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
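The four phases above amount to a small decision rule, which can be sketched as follows. The function and parameter names are hypothetical and the rule is a simplification of the descriptions on this slide, not how genome centers or GenBank actually assign phases.

```python
def htgs_phase(assembled, order_and_orientation_known, finished):
    """Return the HTGS phase (0-3) suggested by the slide's descriptions.

    assembled: reads have been assembled into contigs
    order_and_orientation_known: contigs are correctly ordered and oriented
    finished: sequence is finished quality with no gaps
    """
    if finished:
        return 3
    if not assembled:
        return 0
    return 2 if order_and_orientation_known else 1


# A draft clone whose contigs exist but have not yet been ordered.
print(htgs_phase(assembled=True, order_and_orientation_known=False, finished=False))
```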
• Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.
• EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.
• STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.
• GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.
• HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.
• FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than ESTs, and sometimes full-length mRNAs. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.
• BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.
• Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.
• Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.
• All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.
• GenBank will not accept sequences constructed in silico.
• GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.
• GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.
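The minimal criteria above can be read as a checklist. The sketch below encodes them as one, purely for illustration: the `Submission` class, its field names, and the `rejection_reasons` function are all hypothetical, and everything except the length is supplied as a yes/no declaration because it cannot be determined from the sequence string alone. This is not NCBI's actual triage procedure.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    sequence: str
    sequenced_by_or_for_submitter: bool
    constructed_in_silico: bool
    has_unsequenced_spacers: bool
    has_physical_counterpart: bool


def rejection_reasons(sub):
    """Return a list of criteria from the slide that the submission fails."""
    reasons = []
    if len(sub.sequence) <= 50:
        reasons.append("sequence must be longer than 50 bp")
    if not sub.sequenced_by_or_for_submitter:
        reasons.append("not sequenced by, or on behalf of, the submitting group")
    if sub.constructed_in_silico:
        reasons.append("constructed in silico")
    if sub.has_unsequenced_spacers:
        reasons.append("noncontiguous, with internal unsequenced spacers")
    if not sub.has_physical_counterpart:
        reasons.append("no physical counterpart")
    return reasons


# A 40-base toy sequence fails only the length criterion.
too_short = Submission("ACGT" * 10, True, False, False, True)
print(rejection_reasons(too_short))
```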
• Submissions are checked to determine whether they are new or updates.
• Indexing:
• Biological validity: Translation, organism lineage, BLAST searches
• Vector contamination: Is there any vector DNA present in the sequence?
• Publication status: If published, citation is included in annotation and linked to Entrez
• Formatting and spelling
• Sequences are sent to submitter for final review before release into the public database.
• Sequences must become publicly available once the accession number or the sequence has been published.
• GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
• A curated collection of DNA, RNA, and protein sequences built by NCBI.
• Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.
• May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
• Limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).
• Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.
• Two types of records:
• Experimental: Annotation supported by wet-lab evidence
• Inferential: Annotation inferred only
• Bridges the gap between GenBank and RefSeq by permitting authors who publish new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.
• Protein sequence database that was formed through the merger of three protein databases:
1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute
2. Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes
3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)
• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf
• Linked on today’s web page