Tutorial - Protein Information Resource

Tutorial:
Bioinformatics Resources
(http://pir.georgetown.edu/~huz/class/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
March 25, 2005
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist
Protein Information Resource
National Biomedical Research Foundation, GUMC
What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
NIH Biomedical Information Science and Technology
Initiative (BISTI) Working Definition (2000) - Research,
development, or application of computational tools and
approaches for expanding the use of biological, medical,
behavioral or health data, including those to acquire, store,
organize, archive, analyze, or visualize such data.
Molecular Biology Database Collection
(http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5)
719 key databases in 14 categories
Database Collection in Nucleic Acids Res.
[Bar chart: NAR Molecular Biology Database Collection, number of databases by year]
1999: 202
2000: 226
2001: 281
2002: 335
2003: 386
2004: 548
2005: 719
http://pir.georgetown.edu/~huz/class/2005_database_update.html
Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Database of protein functions
V. Databases of protein structures
VI. Proteomics databases
Entrez Text Searches
(http://www.ncbi.nlm.nih.gov/Entrez/)
PubMed Literature Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
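A text search like the one above can also be scripted. The sketch below only builds a query URL and sends nothing. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned on the slide; the helper name `build_esearch_url` and the example search term are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not part of the original slide).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, db="pubmed", retmax=20):
    """Return an esearch URL for a text query; nothing is fetched."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)

# Hypothetical query, chosen to match the crystallin examples in this tutorial.
url = build_esearch_url("alpha crystallin AND rabbit")
print(url)
```

Building the URL separately from fetching it makes the query easy to inspect before any request is sent.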
UniProt Text Search
(http://www.pir.uniprot.org/cgi-bin/textSearch)
PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
What's different between CRAA_RABIT & CYRBAA?
How about Search: Crystallin and SuperFamily?
PIR Text Search (II)
Using PIR text search, can you find which crystallin has a determined 3D structure?
I. Sequence & Genomics Databases
GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
UniProt Consortium Database: Universal protein knowledgebase, a central resource of protein sequence and function from Swiss-Prot, TrEMBL and PIR.
Entrez Gene: Gene-centered information at NCBI.
UniGene: Unified clusters of ESTs and full-length mRNA sequences.
OMIM: Online Mendelian Inheritance in Man, a catalog of human genetic and genomic disorders.
Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
GeneCards: Integrated database of human genes, maps, proteins and diseases.
SNP Consortium Database
UniProt Consortium Database
(http://www.uniprot.org)
UniProtKB (knowledgebase)
UniRef (100, 90, 50)
UniParc (archive)
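The numbers 100, 90 and 50 after UniRef are sequence-identity levels (in percent) at which entries are grouped into clusters. The toy sketch below illustrates the idea only: it uses a naive position-by-position identity with no alignment and a simple greedy assignment, which is an assumption for illustration and not the procedure UniProt uses. The sequences are made up.

```python
def identity(a, b):
    """Fraction of matching positions; toy measure with no alignment."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))

def greedy_cluster(seqs, threshold):
    """Assign each sequence to the first representative it matches."""
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):  # longest first
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters

# Hypothetical 10-residue sequences, invented for this sketch.
toy = ["MDIAIQHPWF", "MDIAIQHPWF", "MDIAIQHPWY", "MDVTIQHAWF", "MKKLLPTAAA"]
for level in (1.0, 0.9, 0.5):
    print(level, len(greedy_cluster(toy, level)))
```

Lowering the identity level merges more sequences into each cluster, so the number of clusters shrinks from the 100 level to the 50 level.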
UniProt Sequence Report (I)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRAA_RABIT)
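The `id` in this report URL is a UniProtKB entry name, which joins a protein mnemonic and a species mnemonic with an underscore. A minimal sketch of splitting such a name follows; the helper `split_entry_name` is hypothetical, and the three names are the ones that appear in this tutorial.

```python
def split_entry_name(entry_name):
    """Split an entry name such as 'CRAA_RABIT' into (protein, species)."""
    protein, sep, species = entry_name.partition("_")
    if not sep or not protein or not species:
        raise ValueError(f"not a UniProtKB entry name: {entry_name!r}")
    return protein, species

# Entry names taken from the slides of this tutorial.
for name in ("CRAA_RABIT", "CRD2_ANAPL", "CRGE_RAT"):
    print(name, split_entry_name(name))
```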
UniProt Sequence Report (II)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)
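The identifier in this URL combines a UniRef identity level with the identifier of one member entry. A small sketch of pulling the two parts apart is shown below; `parse_uniref_id` is a hypothetical helper, and treating the part after the underscore as the cluster's representative entry is an assumption.

```python
import re

def parse_uniref_id(uniref_id):
    """Return (identity_level, member_id) for an ID like 'UniRef90_P02489'."""
    m = re.fullmatch(r"UniRef(100|90|50)_(\w+)", uniref_id)
    if m is None:
        raise ValueError(f"not a UniRef identifier: {uniref_id!r}")
    return int(m.group(1)), m.group(2)

print(parse_uniref_id("UniRef90_P02489"))
```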
Entrez Gene
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq
OMIM: Online Mendelian inheritance in man
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)
II. Protein Family Databases
Whole Proteins
• PIRSF: A Network Classification System of Protein Families
• COG (Clusters of Orthologous Groups) of Complete Genomes
• ProtoNet: Automated Hierarchical Classification of Proteins
Protein Domains
• Pfam: Alignments and HMM Models of Protein Domains
• SMART: Protein Domain Families
• CDD: Conserved Domain Database
Protein Motifs
• PROSITE: Protein Patterns and Profiles
• BLOCKS: Protein Sequence Motifs and Alignments
• PRINTS: Protein Sequence Motifs and Signatures
Integrated Family Databases
• iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
• InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily
Protein Clustering
COGs:
(http://www.ncbi.nlm.nih.gov/COG/)
KOGs: Eukaryotic Clusters
(http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi?KOG3591)
Domain Classification
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRAA_RABIT)
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=CRAA_RABIT)
Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)
Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)
PIRSF: Full Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)
Protein Motifs
PROSITE is a database of protein families and domains. It consists of
biologically significant sites, patterns and profiles. (http://us.expasy.org/prosite/)
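PROSITE patterns use a compact syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats) that maps naturally onto regular expressions. The sketch below is a minimal translator covering only those elements plus the `<` and `>` anchors; `prosite_to_regex` is a hypothetical helper, and the pattern and test sequence are illustrative, not taken from a specific PROSITE entry.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Separate an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, count = m.group(1), m.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # excluded residues
        # [ABC] and single letters are already valid regex.
        piece = core + ("{" + count + "}" if count else "")
        parts.append(("^" if anchor_start else "") + piece
                     + ("$" if anchor_end else ""))
    return "".join(parts)

# Illustrative pattern written in PROSITE syntax (assumption, not a cited entry).
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)

# Made-up sequence constructed to contain one match.
toy_sequence = "AACAACAAALAAAAAAAAHAAAHAA"
print(bool(re.search(regex, toy_sequence)))
```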
III. Databases of Protein Functions
Metabolic Pathways, Enzymes, and Compounds
• Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
• KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
• LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
• EcoCyc: Encyclopedia of E. coli Genes and Metabolism
• MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
• WIT: Functional Curation and Metabolic Models
• BRENDA: Enzyme Database
• UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
Cellular Regulation and Gene Networks
• EpoDB: Genes Expressed during Human Erythropoiesis
• BIND: Descriptions of interactions, molecular complexes and pathways
• DIP: Catalogs experimentally determined interactions between proteins
• BioCarta: Biological pathways of human and mouse
• GO: Gene Ontology Consortium Database
KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software, integrating our current knowledge
on molecular interaction networks, the information of genes and proteins, and of chemical
compounds and reactions. (http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)
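The query string of this pathway URL pairs a pathway identifier (hsa00220) with an EC number (4.3.2.1). An EC number has four levels, read left to right from the broad enzyme class to a serial number. The sketch below splits both; `parse_ec` is a hypothetical helper, and the class-name table is typed in by hand for illustration (an assumption, listing the six top-level classes in use when this tutorial was given).

```python
# Top-level EC classes, entered by hand for this sketch (assumption).
EC_CLASSES = {
    "1": "Oxidoreductases",
    "2": "Transferases",
    "3": "Hydrolases",
    "4": "Lyases",
    "5": "Isomerases",
    "6": "Ligases",
}

def parse_ec(ec_number):
    """Split an EC number such as '4.3.2.1' into its four levels."""
    levels = ec_number.split(".")
    if len(levels) != 4:
        raise ValueError(f"expected four levels, got {ec_number!r}")
    return {
        "class": EC_CLASSES.get(levels[0], "unknown"),
        "subclass": ".".join(levels[:2]),
        "sub_subclass": ".".join(levels[:3]),
        "serial": levels[3],
    }

# Query string taken from the KEGG pathway URL on the slide.
pathway_id, ec = "hsa00220+4.3.2.1".split("+")
info = parse_ec(ec)
print(pathway_id, ec, info["class"])
```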
BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases (http://biocyc.org/)
BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)
Protein-Protein Interaction: BIND
(http://www.bind.ca/)
Gene Ontology
(http://www.geneontology.org/)
Three GOs:
Molecular Function
Biological Process
Cellular Component
IV. Databases of Protein Structures
Protein Structure
• PDB: Structure Determined by X-ray Crystallography and NMR
• PDBsum: Summaries and analyses of PDB structures
• MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
• SWISS-MODEL Repository: Database of annotated protein 3D models
• ModBase: Annotated comparative protein structure models
Structure Classification
• CATH: Hierarchical Classification of Protein Domain Structures
• SCOP: Familial and Structural Protein Relationships
• FSSP: Protein Fold Classification Based on Structure-Structure Alignment
PDB 3D Structure
Rat gamma-crystallin, chain A, B.
Can you do a text search at PIR to find this?
(http://www.rcsb.org/pdb/)
PDBsum: Summary and Analysis
(http://www.biochem.ucl.ac.uk/bsm/pdbsum)
Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
(http://www.biochem.ucl.ac.uk/bsm/cath_new/)
Protein Structural Classification (2)
SCOP: comprehensive description of structural and evolutionary relationships
between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)
SWISS-MODEL Repository
A database of annotated three-dimensional comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2)
VI. Proteomic Resources
GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes; SWISS-2DPAGE (http://www.expasy.org/ch2d/)
PEP, Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences
Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes
Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes
Expression Profiling databases: GNF (http://expression.gnf.org/cgi-bin/index.cgi, human and mouse transcriptome), SMD (http://genome-www5.stanford.edu/MicroArray/SMD/, Stanford microarray data analysis), EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html, managing, storing and analyzing microarray data)
2D-Gel Image Databases (1)
(http://us.expasy.org/ch2d/2d-index.html)
(http://us.expasy.org/cgi-bin/nice2dpage.pl?P02489)
2D-Gel Image Databases (2)
(http://gelbank.anl.gov/2dgels/index.asp)
Expression Profiling
Human and Mouse Transcriptome
(http://expression.gnf.org/cgi-bin/index.cgi)
(http://genome-www.stanford.edu/serum/)
Lab:
Alpha crystallin (UniProt: CRAA_RABIT)
Delta crystallin II (Argininosuccinate lyase) (UniProt: CRD2_ANAPL)
Choose additional protein IDs to browse the variety of molecular biology databases each sequence report links to.
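For the lab, the report links for any protein ID can be generated from the URL patterns shown on the earlier slides. The sketch below is a convenience only: `report_urls` is a hypothetical helper, the templates assume the path segment is `cgi-bin`, and the addresses date from 2005 and may no longer resolve.

```python
# URL templates based on the report links shown earlier in this tutorial
# (assumption: 'cgi-bin' path; the 2005 addresses may have moved since).
TEMPLATES = {
    "UniProt report": "http://www.pir.uniprot.org/cgi-bin/unipEntry?id={id}",
    "iProClass report": "http://pir.georgetown.edu/cgi-bin/ipcEntry?id={id}",
}

def report_urls(entry_id):
    """Return one report URL per database for a UniProt entry name."""
    return {name: tpl.format(id=entry_id) for name, tpl in TEMPLATES.items()}

# The two lab proteins named on the slide.
for entry in ("CRAA_RABIT", "CRD2_ANAPL"):
    for name, link in report_urls(entry).items():
        print(entry, name, link)
```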