Bioinformatics Overview, NCBI & GenBank

JanPlan 2012

What is Bioinformatics

• Find three different definitions of the word “bioinformatics”

• How is “bioinformatics” different from “computational biology”?

• What areas of biological research are dependent on bioinformatics?

What is Bioinformatics Used For?

• Database searching

• Sequence analysis

• Phylogenetic reconstruction

• Molecular evolution

• Gene expression

• Genome assembly

• Genome annotation

• Metagenomics

Introduction to NCBI

• NCBI, EMBL & DDBJ

• What role do these organizations play in global society?

• How do their missions differ?

• NCBI Training and Tutorials page

• The NCBI Handbook

• NCBI How-To page

• NCBI Help Manual

GenBank

• Annotated collection of all publicly available nucleotide sequences and their protein translations.

• Receives sequences produced in laboratories throughout the world from more than 100,000 distinct organisms.

• Grows exponentially, doubling every 10 months
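
To make the doubling claim concrete, here is a minimal sketch that assumes the 10-month doubling time stated above stays constant. The function name is hypothetical and the model is a plain exponential, not a fit to real GenBank release statistics.

```python
def genbank_growth_factor(months: float, doubling_months: float = 10.0) -> float:
    """Return how many times larger the database is after `months`,
    assuming a constant doubling time (10 months, as quoted on the slide)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    for months in (10, 20, 60):
        factor = genbank_growth_factor(months)
        print(f"after {months} months: {factor:.0f}x the starting size")
```

Under that assumption, five years (60 months) is six doublings, a 64-fold increase.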

GenBank

• Initially built and maintained at Los Alamos National Laboratory.

• Transferred to NCBI in early 1990s by congressional mandate.

• Most journal publishers require deposition of sequence data into GenBank prior to publication so that an accession number may be cited.

• Submitters may keep their data confidential for a specified period of time prior to publication.
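
Once an accession number is cited in a paper, a reader can use it to retrieve the record. The sketch below only builds a retrieval URL and does not contact the network. It assumes NCBI's E-utilities `efetch` endpoint and the `nuccore` database, neither of which is mentioned in these slides, and the helper name is hypothetical. U49845 is used purely as an illustrative accession.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities efetch endpoint (not part of the original slides).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_flatfile_url(accession: str) -> str:
    """Build a URL requesting the GenBank flat file for one nucleotide accession."""
    params = {
        "db": "nuccore",    # nucleotide database
        "id": accession,    # accession number cited in the publication
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(genbank_flatfile_url("U49845"))
```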

Direct Submission

• A typical GenBank submission consists of a single, contiguous stretch of DNA or RNA sequence (a contig) with annotations (metadata).

• If part of a nucleotide sequence encodes a protein, a conceptual translation, called a CDS (coding sequence), is annotated and its span is mapped onto the nucleotide sequence.

• Example
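
A conceptual translation of an annotated CDS span can be sketched in a few lines. This is an illustration only: it assumes the standard genetic code, a forward-strand span given as 1-based inclusive coordinates (the convention used in GenBank feature tables), and a made-up toy sequence. The function name is hypothetical, and real records may use other translation tables, reverse-strand spans, or joined spans.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(sequence: str, start: int, end: int) -> str:
    """Translate the forward-strand span start..end (1-based, inclusive)."""
    cds = sequence[start - 1:end].upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("CDS span length is not a multiple of 3")
    # Codons containing ambiguous bases are reported as X.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein.rstrip("*")  # drop the terminal stop codon, if present


if __name__ == "__main__":
    toy_sequence = "GGATGGCTTAAGG"  # made-up sequence; CDS spans bases 3..11
    print(translate_cds(toy_sequence, 3, 11))
```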

High-Throughput Genomic Sequence (HTGS)

• HTGS entries are submitted in bulk by genome centers, processed by an automated system, and then released to GenBank.

• Currently, about 30 genome centers are submitting data for a number of organisms, including human, mouse, rat, rice, and Plasmodium falciparum.

High-Throughput Genomic Sequence (HTGS)

• Data are submitted in four phases.

• Phase 0: Sequences are one-to-few reads of a single clone and are not usually assembled into contigs. They are low-quality sequences that are often used to check whether another center is already sequencing a particular clone.

• Phase 1: Entries are assembled into contigs separated by sequence gaps; the relative order and orientation of the contigs are not known.

• Phase 2: Entries are also unfinished sequences that may or may not contain sequence gaps. If there are gaps, then the contigs are in the correct order and orientation.

• Phase 3: Sequences are of finished quality and have no gaps.

For each organism, the group overseeing the sequencing effort determines the definition of finished quality.
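
The four phase definitions above can be read as a small decision procedure. The sketch below is one way to encode them. The class, field, and function names are hypothetical, and whether an entry counts as finished is supplied by the caller, because that definition is set per organism by the group overseeing the project.

```python
from dataclasses import dataclass


@dataclass
class HtgsEntry:
    assembled_into_contigs: bool
    finished_quality: bool               # as defined by the overseeing group
    has_gaps: bool
    contigs_ordered_and_oriented: bool


def htgs_phase(entry: HtgsEntry) -> int:
    """Map an entry's properties to HTGS phase 0-3 as described on the slide."""
    if not entry.assembled_into_contigs:
        return 0  # a few unassembled, low-quality reads of a single clone
    if entry.finished_quality and not entry.has_gaps:
        return 3  # finished sequence with no gaps
    if not entry.has_gaps or entry.contigs_ordered_and_oriented:
        return 2  # unfinished; any gaps lie between ordered, oriented contigs
    return 1      # contigs separated by gaps, order and orientation unknown


if __name__ == "__main__":
    draft = HtgsEntry(True, False, True, False)
    print(htgs_phase(draft))
```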

Whole Genome Shotgun Sequences (WGS)

• Shotgun sequence reads are assembled into contigs, submitted, and updated as the sequencing project progresses and new assemblies are computed.

EST, STS, and GSS

• EST = Expressed Sequence Tags (dbEST): Short (< 1 kb), single-pass cDNA sequences from a particular tissue and/or developmental stage. They lack annotation.

• STS = Sequence Tagged Sites (dbSTS): Short genomic landmark sequences. They are operationally unique in that they are specifically amplified from the genome by PCR. They define a specific location on the genome and are thus useful for mapping.

• GSS = Genome Survey Sequences (dbGSS): Short sequences derived from genomic DNA, about which little is known.

HTC and FLIC

• HTC = High-Throughput cDNA/mRNA: Similar to ESTs, but often contain more information. May have a systematic gene name that is related to the lab or center that submitted them, and the longest ORF is often annotated as a coding region.

• FLIC = Full-Length Insert cDNA: Contains the entire sequence of a cloned cDNA/mRNA. Generally longer than ESTs and HTCs, and sometimes a full-length mRNA. Usually annotated with genes and coding regions. Gene names may be systematic rather than functional.

Submission Tools

• BankIt: Web-based form for submission of a small number of sequences with minimal annotation to GenBank.

• Sequin: More appropriate for complicated submissions containing a significant amount of annotation or many sequences. Stand-alone application available on NCBI’s FTP site.

Sequence Data Flow and Processing

• Triage: Within 48 hours of direct submission with BankIt or Sequin, the database staff reviews the submission to determine whether it meets the minimal criteria and then assigns an accession number.

• All sequences must be > 50 bp in length and be sequenced by, or on behalf of, the group submitting the sequence.

• GenBank will not accept sequences constructed in silico.

• GenBank will not accept noncontiguous sequences containing internal, unsequenced spacers.

• GenBank will not accept sequences for which there is not a physical counterpart, such as those derived from a mix of genomic DNA and mRNA.

• Submissions are checked to determine whether they are new or updates.
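
The minimal triage criteria listed above can be summarized as a checklist. The following sketch is purely illustrative and is not GenBank's actual review software. The class and function names are hypothetical, and every criterion except length is modeled as a flag supplied by the caller, since those properties cannot be read off the sequence itself.

```python
from dataclasses import dataclass
from typing import List

MIN_LENGTH_BP = 50  # sequences must be longer than this


@dataclass
class Submission:
    sequence: str
    sequenced_by_or_for_submitter: bool
    constructed_in_silico: bool
    has_unsequenced_internal_spacers: bool
    has_physical_counterpart: bool  # False for e.g. a genomic DNA/mRNA mix


def triage_problems(sub: Submission) -> List[str]:
    """Return the criteria a submission fails; an empty list means it passes."""
    problems = []
    if len(sub.sequence) <= MIN_LENGTH_BP:
        problems.append("sequence must be longer than 50 bp")
    if not sub.sequenced_by_or_for_submitter:
        problems.append("must be sequenced by, or on behalf of, the submitter")
    if sub.constructed_in_silico:
        problems.append("sequences constructed in silico are not accepted")
    if sub.has_unsequenced_internal_spacers:
        problems.append("noncontiguous sequences with unsequenced spacers are not accepted")
    if not sub.has_physical_counterpart:
        problems.append("sequence has no physical counterpart")
    return problems


if __name__ == "__main__":
    too_short = Submission("ACGT" * 10, True, False, False, True)  # 40 bp
    print(triage_problems(too_short))
```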

Sequence Data Flow and Processing

• Indexing:

• Biological validity: Translation, organism lineage, BLAST searches

• Vector contamination: Is there any vector DNA present in the sequence?

• Publication status: If published, citation is included in annotation and linked to Entrez

• Formatting and spelling

• Sequences are sent to submitter for final review before release into the public database.

• Sequences must become publicly available once the accession number or the sequence has been published.

• GenBank annotation staff process about 1900 submissions/month, or about 20,000 sequences.

RefSeq

• A curated collection of DNA, RNA, and protein sequences built by NCBI.

• Unlike GenBank, RefSeq provides only one example of each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

• May include separate linked records for genomic DNA, the gene transcripts, and the proteins arising from those transcripts.

• Limited to major organisms for which sufficient data is available (only 4000 as of Jan 2007), while GenBank includes sequences for any organism submitted (~250k different organisms).

Third Party Annotation (TPA) database

• Contains nucleotide sequences built from existing primary data with new annotation that has been published in a peer-reviewed scientific journal.

• Two types of records:

• Experimental: Annotation supported by wet-lab evidence

• Inferential: Annotation inferred only, without direct wet-lab evidence

• Bridges the gap between GenBank and RefSeq by permitting authors publishing new experimental evidence to re-annotate sequences in a public database as they think best, even if they are not the primary sequencer or the curator of a model organism database.

Universal Protein Resource (UniProt)

• Protein sequence database that was formed through the merger of three protein databases:

1. Swiss-Prot, from the Swiss Institute of Bioinformatics and the European Bioinformatics Institute

2. The Translated EMBL Nucleotide Sequence Data Library (TrEMBL), from the same two institutes

3. Georgetown University’s Protein Information Resource Protein Sequence Database (PIR-PSD)

Problem Set

• ftp://ftp.ncbi.nih.gov/pub/education/tutorials/genbank.pdf

• Linked on today’s web page
