Bioinformatics Overview: NCBI & GenBank

JanPlan 2012

What is Bioinformatics?

• Find three different definitions of the word “bioinformatics”

• How is “bioinformatics” different from “computational biology”?

• What areas of biological research are dependent on bioinformatics?

What is Bioinformatics Used For?

• Database searching

• Sequence analysis

• Phylogenetic reconstruction

• Molecular evolution

• Gene expression

• Genome assembly

• Genome annotation

• Metagenomics

Introduction to NCBI

• NCBI, EMBL & DDBJ

• What role do these organizations play in global society?

• How do their missions differ?

• NCBI Training and Tutorials page

• The NCBI Handbook

• NCBI How-To page

• NCBI Help Manual

GenBank

• Annotated collection of all publicly available nucleotide sequences and their protein translations.

• Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.

• Grows exponentially, doubling every 10 months
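To get a feel for what a 10-month doubling time implies, the sketch below computes the growth factor after a given number of months. It assumes idealized exponential growth with a constant doubling time; the function name `growth_factor` and the sample intervals are illustrative and are not GenBank statistics.

```python
def growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Factor by which the database grows after `months`,
    assuming a constant doubling time (an idealization)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Sample intervals chosen only for illustration.
    for months in (10, 20, 60):
        print(f"after {months} months: x{growth_factor(months):.0f}")
```

Under this assumption, five years (60 months) corresponds to six doublings, that is, a 64-fold increase.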

GenBank

• Initially built and maintained at Los Alamos National Laboratory.

• Transferred to NCBI in early 1990s by congressional mandate.

• Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number can be cited.

• Submitters may keep their data confidential for a specified period of time prior to publication.

Direct Submission

• A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).

• If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span is mapped.

• Example
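The idea of a conceptual translation over an annotated span can be sketched in a few lines. The example below is a minimal illustration, assuming a simple forward-strand CDS given as 1-based inclusive coordinates and the standard genetic code; the function name `translate_cds` and the 16-base record are hypothetical, and real GenBank features may also use complement or joined spans, which this sketch does not handle.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the CDS spanning start..end (1-based, inclusive),
    stopping at the first stop codon."""
    cds = sequence.upper().replace("U", "T")[start - 1:end]
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of three")
    protein = []
    for i in range(0, len(cds), 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks codons with ambiguous bases
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical 16-base record with a CDS annotated at positions 3..14.
record = "GGATGGCCAAATAGTT"
print(translate_cds(record, 3, 14))  # prints MAK
```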

High-Throughput Genomic Sequence (HTGS)

• HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.

• Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.

High-Throughput Genomic Sequence (HTGS)

• Data are submitted in four phases.

• Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.

• Phase 1: Entries are assembled into contigs that are separated by sequence gaps, the relative order and orientation of which are not known.

• Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.

• Phase 3: Sequences are of finished quality and have no gaps. For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
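The four phase definitions above can be encoded as a small decision function. This is only a sketch of the definitions as stated on the slide: the class `HtgsEntry`, its field names, and the function `htgs_phase` are hypothetical, and in practice the phase is declared by the submitting center rather than computed.

```python
from dataclasses import dataclass


@dataclass
class HtgsEntry:
    """Hypothetical summary of an HTGS entry's assembly state."""
    assembled_into_contigs: bool
    has_gaps: bool
    contigs_ordered_and_oriented: bool
    finished_quality: bool


def htgs_phase(entry: HtgsEntry) -> int:
    """Map an entry's assembly state to HTGS phase 0-3."""
    if not entry.assembled_into_contigs:
        return 0  # unassembled reads of a single clone
    if entry.finished_quality and not entry.has_gaps:
        return 3  # finished, gap-free sequence
    if entry.has_gaps and not entry.contigs_ordered_and_oriented:
        return 1  # contigs whose order and orientation are unknown
    return 2      # unfinished; any gaps separate ordered, oriented contigs


draft = HtgsEntry(assembled_into_contigs=True, has_gaps=True,
                  contigs_ordered_and_oriented=False, finished_quality=False)
print(htgs_phase(draft))  # prints 1
```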

Whole Genome Shotgun Sequences (WGS)

• Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

EST, STS, and GSS

• EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.

• STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they can be specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.

• GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

HTC and FLIC

• HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.

• FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than HTC sequences, and sometimes represent full-length mRNAs. Usually annotated with genes and coding regions. May carry systematic gene names rather than functional names.
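Annotating the longest ORF as a coding region, as mentioned for HTC records, can be illustrated with a simple scan. The sketch below assumes that an ORF starts at ATG and ends at the first in-frame stop codon, and it looks only at the three forward-strand reading frames; the function name `longest_orf` and the test sequence are hypothetical, and real annotation pipelines typically also consider the reverse strand and partial ORFs.

```python
from typing import Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence: str) -> Tuple[int, int]:
    """Return (start, end) of the longest forward-strand ORF,
    1-based inclusive, stop codon included; (0, 0) if none is found."""
    seq = sequence.upper().replace("U", "T")
    best = (0, 0, 0)  # (length, start, end)
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None:
                if codon == "ATG":
                    orf_start = i  # open an ORF at the first ATG seen
            elif codon in STOP_CODONS:
                length = i + 3 - orf_start
                if length > best[0]:
                    best = (length, orf_start + 1, i + 3)
                orf_start = None  # close the ORF and keep scanning this frame
    return best[1], best[2]


# Hypothetical cDNA with a 9-base ORF (3..11) and a 15-base ORF (14..28).
print(longest_orf("AAATGCCCTAAGGATGAAACCCGGGTGATT"))  # prints (14, 28)
```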

Submission Tools

• BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.

• Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.

Sequence Data Flow and Processing

• Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.

• All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.

• GenBank will not accept sequences constructed in silico.

• GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.

• GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.

• Submissions are checked to determine whether they are new or updates.
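Some of the minimal criteria above lend themselves to an automated pre-check before submission. The sketch below is an illustration only and is not NCBI's actual triage procedure: the function name `triage_problems` is hypothetical, and treating an internal run of 20 or more N characters as an "unsequenced spacer" is an arbitrary assumption made for the example.

```python
import re
from typing import List

MIN_LENGTH_BP = 50                                # from the text: sequences must be > 50 bp
VALID_BASES = re.compile(r"[ACGTURYSWKMBDHVN]+")  # IUPAC nucleotide codes
SPACER_RUN = re.compile(r"N{20,}")                # assumed heuristic for an unsequenced spacer


def triage_problems(sequence: str) -> List[str]:
    """Return a list of reasons a sequence would fail these simplified checks."""
    seq = sequence.upper()
    problems = []
    if len(seq) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if not VALID_BASES.fullmatch(seq):
        problems.append("sequence contains non-nucleotide characters")
    # Strip leading/trailing Ns so that only internal runs are flagged.
    if SPACER_RUN.search(seq.strip("N")):
        problems.append("possible internal unsequenced spacer")
    return problems


print(triage_problems("ACGT" * 5))  # prints ['sequence must be longer than 50 bp']
```

An empty list means that none of these simplified checks failed; it says nothing about the criteria that require human review, such as whether the sequence has a physical counterpart.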

Sequence Data Flow and Processing

• Indexing:

• Biological validity: Translation, organism lineage, BLAST searches

• Vector contamination: Is there any vector DNA present in the sequence?

• Publication status: If published, citation is included in annotation and linked to Entrez

• Formatting and spelling

• Sequences are sent to submitter for final review before release into the public database.

• Sequences must become publicly available once the accession number or the sequence has been published.

• GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.
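The vector-contamination check listed under indexing asks whether any vector DNA is present in a submitted sequence. As a rough sketch of the idea, the example below reports every k-mer of a sequence that also occurs in a known vector sequence. Exact k-mer matching is a simplified stand-in for the similarity searches used in practice; the function name `shared_kmers`, the default `k`, and the toy sequences are all hypothetical.

```python
from typing import List, Tuple


def shared_kmers(sequence: str, vector: str, k: int = 12) -> List[Tuple[int, str]]:
    """Return (1-based position, k-mer) for every k-mer of `sequence`
    that also occurs in `vector`."""
    seq, vec = sequence.upper(), vector.upper()
    # Index every k-mer of the vector once for fast membership tests.
    vector_kmers = {vec[i:i + k] for i in range(len(vec) - k + 1)}
    return [
        (i + 1, seq[i:i + k])
        for i in range(len(seq) - k + 1)
        if seq[i:i + k] in vector_kmers
    ]


# Toy sequences; a small k is used only because the example is short.
print(shared_kmers("TTTTACCCGGAT", "AAACCCGGGTTT", k=6))  # prints [(5, 'ACCCGG')]
```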

RefSeq

• A curated collection of DNA, RNA, and protein sequences built by NCBI.

• Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

• May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.

• Limited to major organisms for which sufficient data are available (only 4,000 as of January 2007), whereas GenBank includes sequences for any organism submitted (~250,000 different organisms).

Third Party Annotation (TPA) database

• Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.

• Two types of records:

• Experimental: Annotation supported by wet-lab evidence

• Inferential: Annotation supported by inference only, not by direct experimentation

• Bridges the gap between GenBank and RefSeq by permitting authors who publish new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

Universal Protein Resource (UniProt)

• Protein sequence database that was formed through the merger of three protein databases:

1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute

2. The Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes

3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)

Problem Set

• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf

• Linked on today’s web page
