Tutorial: Bioinformatics Resources
(http://pir.georgetown.edu/~huz/class/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
March 25, 2005
Zhang-Zhi Hu, M.D.
Senior Bioinformatics Scientist
Protein Information Resource
National Biomedical Research Foundation, GUMC

What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000): Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Molecular Biology Database Collection
(http://nar.oupjournals.org/cgi/content/full/33/suppl_1/D5)
719 key databases in 14 categories

Database Collection in Nucleic Acids Res.
[Figure: NAR Molecular Biology Database Collection. Bar chart of database number by year, 1999-2005; bar labels read 202, 226, 281, 335, 386, 548 and 719.]

http://pir.georgetown.edu/~huz/class/2005_database_update.html

Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Databases of protein functions
V. Databases of protein structures
VI. Proteomics databases

Entrez Text Searches
(http://www.ncbi.nlm.nih.gov/Entrez/)

PubMed Literature Database
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)

UniProt Text Search
(http://www.pir.uniprot.org/cgi-bin/textSearch)

PIR Text Search (I)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
What is the difference between CRAA_RABIT and CYRBAA?
How about Search: Crystallin and SuperFamily?

PIR Text Search (II)
Using PIR text search, can you find which crystallin has a determined 3D structure?

I. Sequence & Genomics Databases
GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
UniProt Consortium Database: Universal protein knowledgebase, a central resource of protein sequence and function from Swiss-Prot, TrEMBL and PIR.
Entrez Gene: Gene-centered information at NCBI.
UniGene: Unified clusters of ESTs and full-length mRNA sequences.
OMIM: Online Mendelian Inheritance in Man, a catalog of human genetic and genomic disorders.
Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
GeneCards: Integrated database of human genes, maps, proteins and diseases.
SNP Consortium Database

UniProt Consortium Database
(http://www.uniprot.org)
UniProtKB (knowledgebase)
UniRef (100, 90, 50)
UniParc (archive)

UniProt Sequence Report (I)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRAA_RABIT)

UniProt Sequence Report (II)
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)

Entrez Gene
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq)

OMIM: Online Mendelian Inheritance in Man
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)

II. Protein Family Databases
Whole Proteins
  PIRSF: A Network Classification System of Protein Families
  COG (Clusters of Orthologous Groups) of Complete Genomes
  ProtoNet: Automated Hierarchical Classification of Proteins
Protein Domains
  Pfam: Alignments and HMM Models of Protein Domains
  SMART: Protein Domain Families
  CDD: Conserved Domain Database
Protein Motifs
  PROSITE: Protein Patterns and Profiles
  BLOCKS: Protein Sequence Motifs and Alignments
  PRINTS: Protein Sequence Motifs and Signatures
Integrated Family Databases
  iProClass: Superfamilies/Families, Domains, Motifs, Rich Links
  InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily

Protein Clustering
COGs: (http://www.ncbi.nlm.nih.gov/COG/)

KOGs: Eukaryotic Clusters
(http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi?KOG3591)

Domain Classification
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRAA_RABIT)
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=CRAA_RABIT)

Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)

Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)

PIRSF: Full Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)

Protein Motifs
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles.
(http://us.expasy.org/prosite/)

III. Databases of Protein Functions
Metabolic Pathways, Enzymes, and Compounds
  Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
  KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
  LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
  EcoCyc: Encyclopedia of E. coli Genes and Metabolism
  MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
  WIT: Functional Curation and Metabolic Models
  BRENDA: Enzyme Database
  UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
Cellular Regulation and Gene Networks
  EpoDB: Genes Expressed during Human Erythropoiesis
  BIND: Descriptions of interactions, molecular complexes and pathways
  DIP: Catalog of experimentally determined interactions between proteins
  BioCarta: Biological pathways of human and mouse
  GO: Gene Ontology Consortium Database

KEGG Metabolic & Regulatory Pathways
KEGG is a suite of databases and associated software that integrates current knowledge of molecular interaction networks, of genes and proteins, and of chemical compounds and reactions.
(http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)

BioCyc (EcoCyc/MetaCyc Metabolic Pathways)
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases.
(http://biocyc.org/)

BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)

Protein-Protein Interaction: BIND
(http://www.bind.ca/)

Gene Ontology
(http://www.geneontology.org/)
Three GOs: Molecular Function, Biological Process, Cellular Component

IV. Databases of Protein Structures
Protein Structure
  PDB: Structures Determined by X-ray Crystallography and NMR
  PDBsum: Summaries and analyses of PDB structures
  MMDB: NCBI's database of 3D structures, part of NCBI Entrez
  SWISS-MODEL Repository: Database of annotated protein 3D models
  ModBase: Annotated comparative protein structure models
Structure Classification
  CATH: Hierarchical Classification of Protein Domain Structures
  SCOP: Familial and Structural Protein Relationships
  FSSP: Protein Fold Classification Based on Structure-Structure Alignment

PDB 3D Structure
Rat gamma-crystallin, chains A and B. Can you do a text search at PIR to find this?
(http://www.rcsb.org/pdb/)

PDBsum: Summary and Analysis
(http://www.biochem.ucl.ac.uk/bsm/pdbsum)

Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
(http://www.biochem.ucl.ac.uk/bsm/cath_new/)

Protein Structural Classification (2)
SCOP: Comprehensive description of structural and evolutionary relationships between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)

SWISS-MODEL Repository
A database of annotated three-dimensional comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRGE_RAT&job=2)

VI. Proteomic Resources
GELBANK (http://gelbank.anl.gov): 2D-gel patterns from completed genomes
SWISS-2DPAGE (http://www.expasy.org/ch2d/)
PEP: Predictions for Entire Proteomes (http://cubic.bioc.columbia.edu/pep/): Summarized analyses of protein sequences
Proteome BioKnowledge Library (http://www.proteome.com): Detailed information on human, mouse and rat proteomes
Proteome Analysis Database (http://www.ebi.ac.uk/proteome/): Online application of InterPro and CluSTr for the functional classification of proteins in whole genomes
Expression Profiling databases:
  GNF (http://expression.gnf.org/cgi-bin/index.cgi): human and mouse transcriptome
  SMD (http://genome-www5.stanford.edu/MicroArray/SMD/): Stanford microarray data analysis
  EBI Microarray Informatics (http://www.ebi.ac.uk/microarray/index.html): managing, storing and analyzing microarray data

2D-Gel Image Databases (1)
(http://us.expasy.org/ch2d/2d-index.html)
(http://us.expasy.org/cgi-bin/nice2dpage.pl?P02489)

2D-Gel Image Databases (2)
(http://gelbank.anl.gov/2dgels/index.asp)

Expression Profiling
Human and Mouse Transcriptome
(http://genome-www.stanford.edu/serum/)
(http://expression.gnf.org/cgi-bin/index.cgi)

Lab:
Alpha crystallin (UniProt: CRAA_RABIT)
Delta crystallin II (Argininosuccinate lyase) (UniProt: CRD2_ANAPL)
Choose additional protein IDs to browse the variety of molecular biology databases each sequence report links to.
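
For the lab, it may help to generate the report links for several protein IDs at once. The sketch below only fills a protein ID into three URL patterns that appear in the slides (UniProt sequence report, iProClass report, Pfam domain view), assuming the `cgi-bin` form of each path. The names `REPORT_PATTERNS` and `report_urls` are illustrative, not part of any database's API, and these 2005-era addresses may no longer resolve.

```python
from urllib.parse import urlencode

# URL patterns copied from the slides (2005-era addresses; they may no longer
# resolve). Each entry is (base URL, name of the query parameter for the ID).
# The labels and the function name below are illustrative assumptions.
REPORT_PATTERNS = {
    "UniProt sequence report": ("http://www.pir.uniprot.org/cgi-bin/unipEntry", "id"),
    "iProClass report": ("http://pir.georgetown.edu/cgi-bin/ipcEntry", "id"),
    "Pfam domain view": ("http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl", "name"),
}


def report_urls(protein_id):
    """Return a {report label: URL} mapping for one UniProt entry name."""
    return {
        label: f"{base}?{urlencode({param: protein_id})}"
        for label, (base, param) in REPORT_PATTERNS.items()
    }


if __name__ == "__main__":
    # The two lab proteins; add further IDs to this tuple to browse more.
    for pid in ("CRAA_RABIT", "CRD2_ANAPL"):
        print(pid)
        for label, url in report_urls(pid).items():
            print(f"  {label}: {url}")
```

The script makes no network requests; it only prints links to open in a browser, so it can be used to check that the same ID is being looked up consistently across resources.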