Database Resources of the National Center for Biotechnology Information
David L. Wheeler et al.
Nucleic Acids Research, Vol. 33, Database issue

Baharak Rastegari
MEDG 505 presentation
February 3, 2005
baharak@cs.ubc.ca

NCBI! What is it?
• Created in 1988
• At the National Institutes of Health
• To develop information systems for molecular biology
• Maintains: GenBank(R) nucleic acid sequence database
• Provides: Data retrieval systems & computational resources

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

Entrez
• Text searching
  → using Boolean queries
  → of a diverse set of over 20 databases
• Simultaneous searches across all Entrez databases at speeds comparable to a single database search

Entrez
• Retrieved records can be displayed in a wide variety of formats
  → GenBank Flatfile, FASTA, XML, …
• Graphical display is offered for some types of records
• Search history
  → allows users to recall the results of previous searches and combine them using Boolean logic

Entrez
• PubMed
  → includes 12.8 million references and abstracts in MEDLINE(R)
  → with links to the full text of more than 4400 journals available on the web
• PubMed Central
  → digital archive of peer-reviewed journals in the life sciences
  → access to over 300 000 full-text articles
  → from over 160 journals
• Books database
  → Contains more than
    35 online scientific textbooks

Taxonomy
• Indexes over 165 000 named organisms
• Can be used to view taxonomic position or retrieve data from a database for a particular organism or group
• Searches can be made on whole, partial or phonetically spelled organism names
• Links to organisms commonly used in biological research are provided
• Displays custom taxonomic trees, representing user-defined subsets of the full NCBI taxonomy

Entrez Gene
• Successor to LocusLink
• Provides an interface to curated sequences and descriptive information about genes
• With links to gene-related resources
  → NCBI's Map Viewer, Evidence Viewer, BLAST Link, …

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

BLAST Family
• BLAST
  → Basic Local Alignment Search Tool
  → performs sequence-similarity searches against a variety of sequence databases
  → returns a set of gapped alignments between the query and database sequences
• BLAST2Sequences
  → compares two DNA or protein sequences
  → produces a dot-plot representation of the alignments

BLAST Family
• MegaBLAST
  → designed to search for nearly exact matches
  → handles batch nucleotide queries
  → operates up to 10 times faster than standard nucleotide BLAST
• BLASTLink (BLink)
  → displays pre-computed protein BLAST alignments for each protein in the Entrez databases
  → can display subsets of these alignments by taxonomic criteria, database of origin, …

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular
  modeling database, the conserved domain database search, CDART and protein interactions

UniGene
• System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters
• Each cluster contains sequences that represent a unique gene, and is linked to related information
• Human UniGene
  → over 4.5 million human ESTs
  → reduced 42-fold in number to approximately 107 000 sequence clusters
• Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression

ProtEST
• Analogous to BLASTLink
• Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences
• Reports are updated in tandem with UniGene protein similarities

Trace & Assembly Archives
• Trace Archive allows for flexible searching and download of sequencing traces
• Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank

HomoloGene
• System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes
• The new HomoloGene build is guided by the taxonomic tree and relies on:
  → conserved gene order & measures of DNA similarity among closely related species
  → protein similarity for more distantly related organisms

HomoloGene (continued)
• 'Ancestor' field
  → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry
  → using it, it is possible to limit a search to genes conserved in one of 22 ancestral groups
• 'Pairwise Score' display gives a table of pairwise statistics for members of a HomoloGene group that includes
  → percent amino acid and nucleotide identities
  → Jukes-Cantor genetic distance parameter
  → the ratio of non-synonymous to synonymous amino acid substitutions (Ka/Ks)

Reference Sequences
• RefSeq provides curated references for
  → transcripts, proteins and genomic
    regions
  → computationally derived nucleotide sequences and proteins
• Contains 1.3 million sequences
  → including more than 1 million protein sequences
  → representing more than 2400 organisms

ORF Finder and Spidey
• ORF Finder
  → performs a six-frame translation of a nucleotide sequence
  → returns the location of each ORF within a specified size range
• Spidey
  → alignment tool for eukaryotic genomic sequences
  → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models
  → returns exon alignments, protein translations and a summary showing the alignment quality, …

Electronic PCR (e-PCR)
• Forward e-PCR
  → searches for matches to STS primer pairs in the UniSTS database of over 450 000 markers
  → to increase sensitivity, allows the size of the primer segment to be matched, the number of mismatches, the number of gaps and the size of the STS to be adjusted
• Reverse e-PCR
  → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases

dbSNP
• Database of single nucleotide polymorphisms
• Repository for single-base nucleotide substitutions and short deletion and insertion polymorphisms
• Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

Entrez Genomes
• Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress
• Includes:
  → over 180 complete microbial genomes
  → more than 1600 viral genomes
  → over 550 reference sequences for eukaryotic
    organelles
  → …
• Complete genomes can be accessed hierarchically starting from either
  → an alphabetical listing
  → a phylogenetic tree for each of six principal taxonomic groups

COGs database
• Clusters of orthologous groups
• Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms
• Eukaryotic version, KOGs, is available for seven eukaryotes

Map & Evidence Viewer
• Map Viewer displays
  → genome assemblies
  → genetic and physical markers
  → the results of annotation and other analyses, using sets of aligned maps
• Evidence Viewer displays the alignments to a genomic contig of
  → RefSeq transcripts
  → GenBank mRNAs
  → known or potential transcripts
  → ESTs supporting a gene model

Cancer Chromosomes
• Consists of
  → NCI/NCBI SKY, M-FISH and CGH databases
  → NCI Mitelman Database of Chromosome Aberrations in Cancer
  → NCI Recurrent Chromosome Aberrations in Cancer database
• Three search formats are available
  → conventional Entrez query
  → Quick/Simple search: set of menus to select a disease site or diagnosis
  → Advanced search: combination of forms for more complex queries

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

SAGEmap
• Provides two-way mapping between
  → regular (10 base) and LongSAGE (17 base) SAGE tags
  → UniGene clusters
• SAGEmap repository contains
  → 381 SAGE experiments from 11 organisms
• Can also construct a user-configurable table of data comparing one group of SAGE libraries with another
• Is updated weekly

Gene Expression Omnibus
• Data repository and retrieval system for any high-throughput gene expression or molecular abundance data
• Contains
  → microarray-based experiments measuring the abundance
    of mRNA, genomic DNA and protein molecules
  → non-array-based technologies such as SAGE
  → mass spectrometry peptide profiling
• Now contains
  → high-throughput gene expression data from about 30 000 hybridization experiments
  → about 1000 array definitions
  → half a billion individual spot measurements, derived from over 100 organisms

OMIM
• Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University
• Contains information on disease phenotypes and genes
• Contains
  → about 16 000 entries

DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of gene expression and phenotypes
• The molecular modeling database, the conserved domain database search, CDART and protein interactions

MMDB
• Built by processing entries from the Protein Data Bank
• Structures are linked to sequences in Entrez and to the Conserved Domain Database (CDD)
• Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD
• Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI's 3D molecular structure viewer, Cn3D

HIV-1/Human Protein Interaction DB
• Concise summary of documented interactions between HIV-1 proteins and
  → host cell proteins
  → other HIV-1 proteins
  → proteins from disease organisms associated with HIV or AIDS
• Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, …, are presented

Summary / Conclusion
• NCBI provides many tools for data retrieval and analysis of data in GenBank and other biological data
• All of the tools and resources can be found easily on the website http://www.ncbi.nih.gov/ along with documentation and explanatory material
• The NCBI Handbook and several tutorials are available
• One can search for tools and information on the NCBI website by choosing NCBI Website as the database

Thank you!

Outline
• Introduction
• Related work
• Components of a Pseudoknotted Sec. Str.
• Parsing algorithm
• Enumerating loops
• Akutsu's structure class
• Conclusion & Future work
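
Backup: ORF scanning sketch
• The ORF Finder slide describes scanning all six reading frames of a nucleotide sequence and reporting each ORF within a specified size range. The Python sketch below illustrates that idea only; it is not NCBI's implementation. The function names, the ATG-to-stop definition of an ORF, the standard stop codons and the default size limits are all assumptions made for illustration, and the code scans codons directly rather than producing protein translations.

```python
from typing import List, Tuple

# Assumption: standard genetic code stop codons; ORFs start at ATG.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 30,
              max_len: int = 10_000) -> List[Tuple[str, int, int, int]]:
    """Scan all six reading frames for ATG..stop ORFs (hypothetical helper).

    Returns (strand, frame, start, end) tuples. start and end are 0-based,
    end-exclusive offsets on the scanned strand, so minus-strand hits are
    given in reverse-complement coordinates. Lengths are in nucleotides
    and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if min_len <= length <= max_len:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG", min_len=9):
        print(orf)
```

• The size limits play the role of the "specified size range" on the slide; an ORF that runs off the end of the sequence without a stop codon is not reported in this sketch.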