Wheeler, 2005

Database Resources of the National
Center for Biotechnology
Information
David L. Wheeler et al. Nucleic Acids Research, Vol. 33,
Database issue
Baharak Rastegari
MEDG 505 presentation
February 3, 2005
baharak@cs.ubc.ca
1
NCBI! What is it?
• Created in 1988
• At the National Institutes of Health
• To develop information systems for molecular
biology
• Maintains: GenBank(R) nucleic acid sequence
database
• Provides: Data retrieval systems & computational
resources
2
3
DB Resources Categories
• Database retrieval tools
• The BLAST family of sequence-similarity
search programs
• Resources for Gene-level sequences
• Resources for Genome-scale analysis
• Resources for the analysis of patterns of
gene expression and phenotypes
• The molecular modeling database, the
conserved domain database search, CDART
and Protein interactions
4
Entrez
• Text searching
→ using Boolean queries
→ of a diverse set of over 20 databases
• Simultaneous searches across all Entrez
databases at speeds comparable to a single
database search
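Boolean Entrez queries of this kind can also be issued programmatically. The sketch below only builds a request URL for NCBI's E-utilities ESearch service and makes no network call. The helper names `boolean_query` and `build_esearch_url`, the example search terms and the choice of database are illustrative assumptions and do not come from the slides.

```python
from urllib.parse import urlencode

# Address of NCBI's E-utilities ESearch service (assumed current address;
# the slides themselves only describe the web interface).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(*terms, op="AND"):
    """Join search terms with an Entrez Boolean operator (AND, OR, NOT)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(f"({term})" for term in terms)


def build_esearch_url(db, query, retmax=20):
    """Return an ESearch URL for one Entrez database; nothing is fetched."""
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical example: records for a gene name restricted to one organism.
query = boolean_query("BRCA1[Gene Name]", "Homo sapiens[Organism]")
url = build_esearch_url("nucleotide", query)
```

Keeping query construction separate from URL construction means the same Boolean expression can be reused against several Entrez databases by changing only the `db` argument.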
6
7
Entrez
• Retrieved records can be displayed in a wide
variety of formats
→ GenBank flat file, FASTA, XML, …
• Graphical display is offered for some types of
records
• Search history
→ allows users to recall the results of previous searches
and combine them using Boolean logic
8
Entrez
• PubMed
→ includes 12.8 million references and abstracts in
MEDLINE(R)
→ with links to the full text of more than 4400
journals available on web
• PubMed Central
→ digital archive of peer reviewed journals in life
sciences
→ access to over 300 000 full text articles
→ over 160 journals
• Books database
→ Contains more than 35 online scientific textbooks
9
Taxonomy
• Indexed over 165 000 named organisms
• Can be used to view taxonomic position or retrieve
data from a database for particular organism or
group
• Searches can be made on whole, partial or
phonetically spelled organism names
• Links to organisms commonly used in biological
research are provided
• Display custom taxonomic trees, representing user-defined
subsets of the full NCBI taxonomy
12
13
14
15
Entrez Gene
• Successor to LocusLink
• Provides an interface to curated sequences
and descriptive information about genes
• With links to gene related resources
→ NCBI’s Map Viewer, Evidence Viewer, BLAST Link, …
16
BLAST Family
• BLAST
→ Local alignment search tool
→ performing sequence-similarity searches against a
variety of sequence databases
→ returning a set of gapped alignments between the
query and database sequences
• BLAST2Sequences
→ comparing two DNA or protein sequences
→ producing a dot-plot representation of the
alignments
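BLAST itself is a fast heuristic, so it is not reproduced here. As a minimal sketch of what a gapped local alignment score means, the code below uses the exact Smith-Waterman dynamic programme, which is a different and much slower technique than BLAST. The scoring values (match 2, mismatch -1, linear gap -2) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for ca in a:
        curr = [0]  # column 0 is always 0 in a local alignment
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


# Identical sequences: every position matches.
identical = smith_waterman_score("ACGT", "ACGT")
# One base deleted from the second sequence forces a single gap.
gapped = smith_waterman_score("ACGTACGT", "ACGACGT")
```

Only two rows of the matrix are kept in memory because the sketch returns the score alone; recovering the alignment itself would require a traceback over the full matrix.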
18
19
BLAST Family
• MegaBLAST
→ designed to search for nearly exact matches
→ handles batch nucleotide queries
→ operates up to 10 times faster than standard
nucleotide BLAST
• BLASTLink (BLink)
→ displays pre-computed protein BLAST alignments
for each protein in the Entrez databases
→ can display subset of these alignments by
taxonomic criteria, database of origin, …
20
UniGene
• System for automatically partitioning GenBank
sequences, including ESTs, into a non-redundant set of
gene-oriented clusters
• Each cluster contains sequences that represent a
unique gene, and is linked to related information
• Human UniGene
→ over 4.5 million human ESTs
→ reduced 42-fold in number to approximately 107 000
sequence clusters
• Has been used as a source of unique sequences for the
fabrication of microarrays for the large-scale study
of gene expression
22
ProtEST
• Analogous to BLASTLink
• Presents pre-computed BLAST alignments between
protein sequences from model organisms and
six-frame translations of UniGene nucleotide
sequences
• Reports are updated in tandem with UniGene
protein similarities
23
Trace & Assembly Archives
• Trace Archive allows for flexible searching
and download of sequencing traces
• Assembly Archive links the raw sequence
information found in the Trace Archive with
assembly information found in GenBank
24
HomoloGene
• System for automated detection of homologs
among the annotated genes of several
completely sequenced eukaryotic genomes
• New HomoloGene build is guided by the
taxonomic tree, relies on:
→ conserved gene order & measures of DNA similarity
among closely related species
→ protein similarity for more distantly related
organisms
25
…HomoloGene
• ‘Ancestor’ field
→ refers to the taxonomic group of the last common ancestor of
the species represented in a HomoloGene entry
→ makes it possible to limit a search to genes conserved in one
of 22 ancestral groups
• ‘Pairwise Score’ display gives a table of pairwise
statistics for members of a HomoloGene group that
includes
→ percent amino acid and nucleotide identities
→ Jukes-Cantor genetic distance parameter
→ the ratio of non-synonymous to synonymous
substitution rates (Ka/Ks)
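Two of the pairwise statistics listed above can be illustrated on a pre-aligned pair of nucleotide sequences. This is a sketch of the textbook formulas, with the Jukes-Cantor distance taken as d = -(3/4) ln(1 - (4/3) p) for a proportion p of differing sites. How HomoloGene itself computes these values is not stated in the slides, and skipping gapped columns is an assumption made here.

```python
import math


def percent_identity(a, b):
    """Percent identical positions between two aligned, equal-length strings.

    Columns containing a gap character '-' in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def jukes_cantor_distance(a, b):
    """Jukes-Cantor distance for two aligned nucleotide sequences."""
    p = 1.0 - percent_identity(a, b) / 100.0  # proportion of differing sites
    if p >= 0.75:
        raise ValueError("sequences too divergent for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


identity = percent_identity("ACGTACGT", "ACGTACGA")       # 7 of 8 sites match
distance = jukes_cantor_distance("ACGTACGT", "ACGTACGA")  # slightly above p
```

The distance is slightly larger than the raw proportion of differences because the model corrects for multiple substitutions at the same site.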
26
Reference Sequences
• RefSeq provides curated references for
→ transcripts, proteins and genomic regions
→ computationally derived nucleotide sequences and
proteins
• Containing 1.3 million sequences
→ including more than 1 million protein sequences
→ representing more than 2400 organisms
28
ORF Finder and Spidey
• ORF finder
→ performs a six-frame translation of a nucleotide
sequence
→ returns the location of each ORF within a specified
size range
• Spidey
→ alignment tool for eukaryotic genomic sequences
→ takes into account predicted splice sites in
constructing its alignment, and can use one of four
splice-site models
→ returns exon alignments, protein translations and a
summary showing the alignment quality, …
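The ORF Finder step described above can be sketched as a scan of the three reading frames on each strand. This is a simplified illustration, not NCBI's implementation: it assumes ORFs run from an ATG to the first in-frame stop codon, reports coordinates rather than translated protein, and the function name `find_orfs` is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30, max_len=None):
    """Return (strand, frame, start, end) for ORFs in all six frames.

    start and end are 0-based, end-exclusive offsets on the strand being
    read; ORF length is counted in nucleotides and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOPS:
                    length = i + 3 - start
                    if length >= min_len and (max_len is None or length <= max_len):
                        orfs.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return orfs


# Toy sequence with one ORF (ATG AAA TTT GGG TAA) in forward frame 2.
orfs = find_orfs("CCATGAAATTTGGGTAACC", min_len=9)
```

The `min_len` and `max_len` arguments play the role of the specified size range mentioned on the slide.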
29
Electronic PCR (e-PCR)
• Forward e-PCR
→ searches for matches to STS primer pairs in the
UniSTS database of over 450 000 markers
→ to increase sensitivity, allows the size of primer
segment to be matched, number of mismatches,
number of gaps and the size of the STS to be
adjusted
• Reverse e-PCR
→ used to estimate the genomic binding site, amplicon
size and specificity for sets of primer pairs by
searching against the genomic and transcript
databases
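The forward e-PCR idea can be sketched as a naive scan for a primer pair that brackets a product of roughly the expected size. This is only an illustration under stated assumptions: it allows mismatches but not gaps, it searches one strand only, and the function names, default margin and toy sequences are hypothetical rather than taken from the e-PCR program.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def primer_sites(seq, primer, max_mismatches):
    """Start offsets where the primer matches with at most max_mismatches."""
    n = len(primer)
    return [i for i in range(len(seq) - n + 1)
            if mismatches(seq[i:i + n], primer) <= max_mismatches]


def forward_epcr(seq, fwd, rev, expected_size, margin=50, max_mismatches=0):
    """Return (start, end, size) for each candidate amplicon in seq.

    Both primers are given 5'->3'. The reverse primer binds the opposite
    strand, so its reverse complement is searched on the given strand.
    """
    rev_rc = reverse_complement(rev)
    hits = []
    for i in primer_sites(seq, fwd, max_mismatches):
        for j in primer_sites(seq, rev_rc, max_mismatches):
            size = j + len(rev_rc) - i
            # The reverse primer must lie downstream without overlapping.
            if j >= i + len(fwd) and abs(size - expected_size) <= margin:
                hits.append((i, i + size, size))
    return hits


# Toy template: forward primer site, 8-base spacer, reverse primer site.
template = "TT" + "ACGTCA" + "GGGGGGGG" + "GTAATC" + "AA"
hits = forward_epcr(template, "ACGTCA", "GATTAC", expected_size=20, margin=2)
```

Raising `max_mismatches` or `margin` trades specificity for sensitivity, which mirrors the adjustable parameters listed on the slide.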
30
31
32
dbSNP
• Database of single nucleotide polymorphisms
• Repository for single base nucleotide
substitutions and short deletion and insertion
polymorphisms
• Contains 9.8 million human SNPs as well as about
5 million from a variety of other organisms
33
Entrez Genomes
• Provides access to genomic data contributed by the
scientific community for species whose sequencing and
mapping is complete or in progress
• Includes:
→ over 180 complete microbial genomes
→ more than 1600 viral genomes
→ over 550 reference sequences for eukaryotic organelles
→…
• Complete genome can be accessed hierarchically starting
from either
→ an alphabetical listing
→ phylogenetic tree for each of six principal taxonomic groups
35
COGs database
• Clusters of orthologous groups
• Presents a compilation of orthologous groups of
proteins from 66 completely sequenced
organisms
• Eukaryotic version, KOGs, is available for seven
eukaryotes
36
Map Viewer & Evidence Viewer
• Map Viewer displays
→ genome assemblies
→ genetic and physical markers
→ the results of annotation and other analyses, using
sets of aligned maps
• Evidence Viewer displays the alignments to a
genomic contig of
→ RefSeq transcripts
→ GenBank mRNAs
→ known or potential transcripts
→ ESTs supporting a gene model
37
Cancer Chromosomes
• Consists of
→ NCI/NCBI SKY, M-FISH and CGH databases
→ NCI Mitelman Database of Chromosome Aberrations
in Cancer
→ NCI Recurrent Chromosome Aberrations in Cancer
database
• Three search formats are available
→ conventional Entrez query
→ Quick/Simple search: a set of menus to select a
disease site or diagnosis
→ Advanced search: a combination of forms for more
complex queries
39
SAGEmap
• Provides two-way mapping between
→ regular (10 base) and LongSAGE (17 base) SAGE tags
→ UniGene clusters
• SAGEmap repository contains
→ 381 SAGE experiments from 11 organisms
• Can also construct a user-configurable table of
data comparing one group of SAGE libraries with
another
• Is updated weekly
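The two-way mapping between tags and clusters can be sketched as a pair of dictionaries. The sketch assumes the standard SAGE convention that a tag is the stretch of bases immediately 3' of the last CATG (NlaIII) site in a transcript. The cluster identifiers, sequences and function names are made up for illustration, and SAGEmap's actual pipeline is not described in the slides.

```python
from collections import defaultdict

ANCHOR = "CATG"  # NlaIII site, assumed here as the SAGE anchoring enzyme site


def sage_tag(transcript, tag_len=10):
    """Return the tag following the 3'-most anchor site, or None.

    tag_len is 10 for regular SAGE and 17 for LongSAGE.
    """
    s = transcript.upper()
    pos = s.rfind(ANCHOR)
    if pos == -1:
        return None  # no anchoring site in this transcript
    tag = s[pos + len(ANCHOR):pos + len(ANCHOR) + tag_len]
    return tag if len(tag) == tag_len else None  # too close to the 3' end


def build_maps(clusters, tag_len=10):
    """Build tag -> clusters and cluster -> tags lookups.

    clusters maps a cluster identifier to a list of transcript sequences.
    """
    tag_to_clusters = defaultdict(set)
    cluster_to_tags = defaultdict(set)
    for cluster_id, transcripts in clusters.items():
        for transcript in transcripts:
            tag = sage_tag(transcript, tag_len)
            if tag is not None:
                tag_to_clusters[tag].add(cluster_id)
                cluster_to_tags[cluster_id].add(tag)
    return tag_to_clusters, cluster_to_tags


# Hypothetical clusters; the identifiers are placeholders, not UniGene IDs.
toy_clusters = {
    "cluster_A": ["AAACATGCCCCATGACGTACGTACGTTTTT"],
    "cluster_B": ["GGGCATGTTTTTTTTTTGGG", "AAAA"],
}
tag_to_clusters, cluster_to_tags = build_maps(toy_clusters)
```

Using sets on both sides allows for the cases where one tag maps to several clusters or one cluster yields several tags.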
41
42
Gene Expression Omnibus
• Data repository and retrieval system for any high-throughput
gene expression or molecular abundance data
• Contains
→ microarray-based experiments measuring the abundance of
mRNA, genomic DNA and protein molecules
→ non-array-based technologies such as SAGE and
mass spectrometry peptide profiling
• Now contains
→ high-throughput gene expression data from about 30 000
hybridization experiments
→ about 1000 array definitions
→ half a billion individual spot measurements derived from
over 100 organisms
43
OMIM
• Catalog of human genes and genetic disorders
authored and edited by Victor A. McKusick at
the Johns Hopkins University
• Contains information on disease phenotypes and
genes
• Contains
→ about 16 000 entries
44
MMDB
• Built by processing entries from the Protein
Data Bank
• Structures are linked to sequences in Entrez and
to the Conserved Domain Database.
• Conserved Domain Search can be used to search
a protein sequence for conserved domains in
CDD
• Wherever possible, CDD hits are linked to
structures, which can be viewed with NCBI’s 3D
molecular structure viewer, Cn3D
46
HIV-1/Human Protein Interaction DB
• Concise summary of documented interactions
between HIV-1 proteins and
→ host cell proteins
→ other HIV-1 proteins
→ proteins from disease organisms associated with HIV
or AIDS
• Summaries, including protein RefSeq accession
numbers, Entrez Gene ID number, … are
presented
47
Summary / Conclusion
• NCBI provides many tools for data retrieval and
analysis of data in GenBank and other biological
data
• All of the tools and resources can be found easily
on the website http://www.ncbi.nih.gov/ along
with documentation and explanatory material
• NCBI Handbook and several tutorials are
available
• One can search for tools and information on the
NCBI website by choosing NCBI Website as the
database
48
49
Thank you!
50
Outline
• Introduction
• Related work
• Components of a Pseudoknotted Sec. Str.
• Parsing algorithm
• Enumerating loops
• Akutsu’s structure class
• Conclusion & Future work
51