Tutorial: Bioinformatics Resources
(http://pir.georgetown.edu/pirwww/workshop/bioinfo_resource.html)
Bio-Trac 25 (Proteomics: Principles and Methods)
April 4, 2008
Zhang-Zhi Hu, M.D.
Research Associate Professor
Protein Information Resource, Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

What is Bioinformatics?
computer (information) + mouse (biology) = bioinformatics
• NIH Biomedical Information Science and Technology Initiative (BISTI) Working Definition (2000) - Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Molecular Biology Database Collection
1078 key databases in 14 categories
(http://nar.oxfordjournals.org/cgi/content/full/36/suppl_1/D2)

Database Collection in Nucleic Acids Res.

Online Access to Database Collection
http://pir.georgetown.edu/pirwww/workshop/2005_database_update.html
2008: http://www.oxfordjournals.org/nar/database/cap/

Overview
Database Contents, Search and Retrieval
I. Text search / Information retrieval
II. Sequence & genomics databases
III. Protein family databases
IV. Databases of protein functions
V. Databases of protein structures
VI. Proteomics databases
Lab session

Entrez Text Searches (Lab)
Integrated one-stop search
(http://www.ncbi.nlm.nih.gov/Entrez/)

PubMed Literature Database (Lab)
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=Search&DB=PubMed)
Literature mining PMID:14640721

iProLINK: Protein Literature Mining Resource (Lab)
RLIMS-P: Text mining for protein phosphorylation
BioThesaurus: Gene/protein name thesaurus: synonyms, ambiguous names…
http://pir.georgetown.edu/iprolink/

BioThesaurus: Gene/protein name searches - synonyms, ambiguous names… (Lab)
Synonyms: CRYAA; crystallin, alpha A; CRYA1; HSPB4…
http://pir.georgetown.edu/iprolink/biothesaurus

RLIMS-P: Text mining for protein phosphorylation (Lab)
http://pir.georgetown.edu/iprolink/rlimsp/

UniProt Text Search (Lab)
(http://www.pir.uniprot.org/cgi-bin/textSearch)
Google-type search vs. Boolean searches: AND, OR, NOT

PIR Text Search (I) (Lab)
(http://pir.georgetown.edu/pirwww/search/textsearch.html)
Search: which alpha crystallin A chains are classified in protein families?
Search for synonyms

PIR Text Search (II) (Lab)
Argininosuccinate lyase (EC 4.3.2.1)
Search: which crystallins are enzymes, and which families do they belong to?
Can you find which crystallins have a determined 3D structure?

II. Sequence & Genomics Databases
• NCBI Resources
– GenBank: An annotated collection of all publicly available nucleotide and protein sequences.
– RefSeq: NCBI non-redundant set of reference sequences, including genomic DNA, transcript (RNA), and protein products.
– Entrez Gene: Gene-centered information at NCBI.
– UniGene: Unified clusters of ESTs and full-length mRNA sequences.
– OMIM: Online Mendelian Inheritance in Man: a catalog of human genetic and genomic disorders.
• UniProt Consortium Database: Universal protein resource, a central repository of protein sequence and function.
• Model Organism Genome Databases: MGD, RGD, SGD, FlyBase…
• GeneCards: Integrated database of human genes, maps, proteins and diseases.
• SNP Consortium Database (dbSNP); International HapMap Project: Genes associated with human diseases
(http://www.oxfordjournals.org/nar/database/cap/)

UniProt Consortium Databases
Universal Protein Resource, since October 2002
New! http://beta.uniprot.org/
5.8 million
(http://www.uniprot.org)

UniProt Sequence Report (I): UniProtKB (Lab)
What’s the difference between CRYAA_RABIT & CYRBAA?
(http://www.pir.uniprot.org/cgi-bin/unipEntry?id=CRYAA_RABIT)

UniProt Report (II): UniRef100 & 90
UniRef100 (http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef100_P02489)
UniRef90 (http://www.pir.uniprot.org/cgi-bin/unipEntry?id=UniRef90_P02489)

Entrez Gene – Gene-centric information
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=Graphics&list_uids=12954#ubor0_RefSeq

OMIM: Online Mendelian Inheritance in Man
Juvenile cataract of Down syndrome
Autosomal recessive congenital progressive cataract
(http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=123580)

III. Protein Family Databases
• Whole Proteins
– PIRSF: Nonoverlapping Classification of Full Length Proteins Based on Evolutionary Relationship
– COG (Clusters of Orthologous Groups) of Complete Genomes
– PANTHER: Proteins Classified into Families/Subfamilies of Shared Function
– ProtoNet: Automatic Hierarchical Classification of Proteins
• Protein Domains
– Pfam: Alignments and HMM Models of Protein Domains
– SMART: Protein Domain Identification and Annotation
– CDD: Conserved Domain Database
• Protein Motifs
– PROSITE: Protein Patterns and Profiles
– BLOCKS: Protein Sequence Motifs and Alignments
– PRINTS: Compendium of Protein Fingerprints (a group of conserved motifs)
• Integrated Family Databases
– InterPro: Integrates Pfam, PRINTS, PROSITE, ProDom, SMART, PIRSF, SuperFamily…

Protein Clustering (Lab)
Initial version: COGs (http://www.ncbi.nlm.nih.gov/COG/)
New version: Includes Eukaryotic Clusters (KOGs)

PIRSF: Full Length Classification
iProClass Family Report
(http://pir.georgetown.edu/cgi-bin/ipcSF?id=SF002280)

Domain Classification – Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/swisspfamget.pl?name=CRYAA_RABIT)
(http://pir.georgetown.edu/cgi-bin/ipcEntry?id=P02493)

Pfam Domain
(http://www.sanger.ac.uk/cgi-bin/Pfam/getacc?PF00525)

Protein Motifs: PROSITE
A database of protein families and domains. It consists of biologically significant sites, patterns and profiles.
(http://us.expasy.org/prosite/)

Integrated Family Classification
InterPro: An integrated resource unifying PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, and PIRSF.
(http://www.ebi.ac.uk/interpro/search.html)
Mapping of families

IV. Databases of Protein Functions
• Metabolic Pathways, Enzymes, and Compounds
– Enzyme Classification: Classification and Nomenclature of Enzyme-Catalysed Reactions (EC-IUBMB)
– KEGG (Kyoto Encyclopedia of Genes and Genomes): Metabolic Pathways
– LIGAND (at KEGG): Chemical Compounds, Reactions and Enzymes
– EcoCyc: Encyclopedia of E. coli Genes and Metabolism
– MetaCyc: Metabolic Encyclopedia (Metabolic Pathways)
– BRENDA: Enzyme Database
– UM-BBD: Microbial Biocatalytic Reactions and Biodegradation Pathways
• Inter-Molecular Interactions and Regulatory Pathways
– IntAct: Protein interaction data from literature and user submission
– BIND: Descriptions of interactions, molecular complexes and pathways
– DIP: Catalogs experimentally determined interactions between proteins
– Reactome: A curated knowledgebase of biological pathways
– BioCarta: Biological pathways of human and mouse
– GO: Gene Ontology Consortium Database
• Pathway Resources - Pathguide

Biological Pathway Resource Collection
http://www.pathguide.org/
• Protein-protein interactions
• Metabolic pathways
• Signaling pathways
• Pathway diagrams
• Transcription factors / gene regulatory networks
• Protein-compound interactions
• Genetic interaction networks

http://www.pathwaycommons.org/pc/home.do

KEGG Metabolic & Regulatory Pathways (Lab)
KEGG is a suite of databases and associated software, integrating our current knowledge on molecular interaction networks, the information of genes and proteins, and of chemical compounds and reactions.
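The KEGG pathway link for this slide ends in the query show_pathway?hsa00220+4.3.2.1. Below is a minimal Python sketch of how such a query string can be read. It assumes the common KEGG convention that a pathway identifier is a short prefix (an organism code such as hsa, or a reference prefix such as map) followed by a five-digit map number, and that items appended with + are objects to highlight on the map. PathwayQuery, parse_pathway_query and build_pathway_query are hypothetical helper names for illustration, not part of any KEGG software.

```python
import re
from typing import NamedTuple


class PathwayQuery(NamedTuple):
    prefix: str        # organism code such as "hsa" (assumed convention)
    map_number: str    # five-digit pathway map number, kept as text
    highlights: tuple  # objects to highlight, e.g. EC numbers


def parse_pathway_query(query):
    """Split a query such as 'hsa00220+4.3.2.1' into its parts."""
    map_id, *highlights = query.split("+")
    match = re.fullmatch(r"([a-z]{2,4})(\d{5})", map_id)
    if match is None:
        raise ValueError(f"unrecognized pathway identifier: {map_id!r}")
    return PathwayQuery(match.group(1), match.group(2), tuple(highlights))


def build_pathway_query(prefix, map_number, highlights=()):
    """Inverse of parse_pathway_query: rebuild the query string."""
    return "+".join([f"{prefix}{map_number}", *highlights])


if __name__ == "__main__":
    # The query taken from the slide's pathway link.
    print(parse_pathway_query("hsa00220+4.3.2.1"))
```

Read this way, the slide's link asks for map 00220 in the organism coded hsa, with EC 4.3.2.1 (argininosuccinate lyase, the enzyme used in the PIR text search exercise) highlighted.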
(http://www.genome.ad.jp/kegg/kegg2.html)
(http://www.genome.ad.jp/dbget-bin/show_pathway?hsa00220+4.3.2.1)

BioCyc: EcoCyc/MetaCyc Metabolic Pathways
The BioCyc Knowledge Library is a collection of Pathway/Genome Databases
(http://biocyc.org/)

BioCarta Cellular Pathways
(http://www.biocarta.com/index.asp)

Reactome: http://www.reactome.org/
• Collaboration of CSHL, EBI and GO Consortium
• Curated resource of core pathways and reactions in human biology
• Authored by biological researchers who are experts in their fields
• Cross-referenced with NCBI, Ensembl, UniProt, HapMap, KEGG…
• Inferred orthologous events in 22 non-human species (mouse, rat…)

Reactome: events and objects (including modified forms and complexes)
Transforming Growth Factor (TGF) beta signaling [Homo sapiens]
(http://reactome.org/cgi-bin/eventbrowser?DB=gk_current&FOCUS_SPECIES=Homo%20sapiens&ID=170834&)
Event -> REACT_6879.1: Activated type I receptor phosphorylates R-SMAD directly [Homo sapiens]
Object -> REACT_7364.1: Phospho-R-SMAD [cytosol]
Event -> REACT_6760.1: Phospho-R-SMAD forms a complex with CO-SMAD [Homo sapiens]
Object -> REACT_7344.1: Phospho-R-SMAD:CO-SMAD complex [cytosol]
Event -> REACT_6726.1: The phospho-R-SMAD:CO-SMAD transfers to the nucleus
Object -> REACT_7382.2: Phospho-R-SMAD:CO-SMAD complex [nucleoplasm]
……

Protein-Protein Interaction Database - IntAct
(http://www.ebi.ac.uk/intact/)

Gene Ontology (GO)
(http://www.geneontology.org/)
- Molecular Function
- Biological Process
- Cellular Component

V. Databases of Protein Structures
• Protein Structure
– PDB: Structures Determined by X-ray Crystallography and NMR
– PDBsum: Summaries and analyses of PDB structures
– MMDB: NCBI’s database of 3D structures, part of NCBI Entrez
– SWISS-MODEL Repository: Database of annotated protein 3D models
– ModBase: Annotated comparative protein structure models
• Structure Classification
– CATH: Hierarchical Classification of Protein Domain Structures
– SCOP: Familial and Structural Protein Relationships
– FSSP: Protein Fold Classification Based on Structure-Structure Alignment

PDB: Experimental 3D Structure Repository (Lab)
Rat gamma-crystallin (chains A, B)
Can you do a text search at PIR to find this (CRGE_RAT)?
(http://www.rcsb.org/pdb/)

PDBsum: Pictorial Database Providing Summaries and Analyses of PDB Entries
Search
3-D structure summary
2-D structure summary
(http://www.ebi.ac.uk/thornton-srv/databases/pdbsum/)

Protein Structural Classification (1)
CATH: Hierarchical domain classification of protein structures
(http://www.cathdb.info/latest/index.html)

Protein Structural Classification (2)
SCOP: comprehensive description of structural and evolutionary relationships between all proteins whose structure is known.
(http://scop.mrc-lmb.cam.ac.uk/scop/data/scop.b.html)

SWISS-MODEL Repository
http://swissmodel.expasy.org/
http://swissmodel.expasy.org/repository/
A database of annotated three-dimensional comparative protein structure models
(http://swissmodel.expasy.org/repository/smr.php?sptr_ac=CRBA1_MOUSE&job=2)

VI. Proteomic Resources
• GELBANK (http://gelbank.anl.gov): 2D-gel patterns of species with completed genomes.
• SWISS-2DPAGE (http://www.expasy.org/ch2d/): index of 2D-gels
• PEP (http://cubic.bioc.columbia.edu/pep/): Predictions for Entire Proteomes: summarized analyses of protein sequences
• Integr8 (http://www.ebi.ac.uk/integr8/): A browser for information relating to completed genomes and proteomes, based on data contained in Genome Reviews and the UniProt proteome sets
• PRIDE (http://www.ebi.ac.uk/pride/): PRoteomics IDEntifications database
Expression Profiling databases
• GPMdb (http://gpmdb.thegpm.org/): Mass spec proteomics databases
• PeptideAtlas (http://www.peptideatlas.org/): compendium of peptides identified in a large set of tandem mass spectrometry proteomic experiments
• HUPO (http://www.hupo.org/): Human Proteome Organization, to foster international proteomics initiatives.

2D-Gel Image Databases (Lab)
(http://us.expasy.org/ch2d/)
Part of WORLD-2DPAGE: index to 2-D PAGE databases and services
(http://us.expasy.org/swiss-2dpage/ac=P02489)

GPMdb: MS Data Search
(http://gpmdb.thegpm.org/)
Craig, et al., J Proteome Res. 2004, 3:1234-42.

PRIDE: centralized, standards-compliant, public data repository for proteomics data
http://www.ebi.ac.uk/pride/
HUPO Plasma Proteome Project

Lab:
I. Text search / Information retrieval
1. Literature search and text mining
– Finding synonyms (BioThesaurus)
– Information extraction (e.g., protein phosphorylation sites)
2. Find the sequence for the rabbit alpha crystallin A chain
3. Find all alpha crystallin A chains classified in protein families
4. Search for crystallins that have enzyme activity
5. Find crystallins that have determined 3D structures
II. Database contents (reports)
1. Sequence & genomics databases (UniProt)
2. Protein family databases (PIRSF)
3. Databases of protein functions (KEGG)
4. Databases of protein structures (PDB)
5. Proteomics databases (Swiss-2D)
Protein Examples
• Rabbit alpha crystallin A (UniProtKB: CRYAA_RABIT/P02493)
• Delta crystallin II (Argininosuccinate lyase) (UniProtKB: ARLY2_ANAPL/P24058)
• Any additional proteins of your interest for search and retrieval
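
For the text-search exercises, the Boolean operators AND, OR and NOT from the text search slides can also be combined programmatically. The following is a minimal Python sketch, not a definitive method. It assumes the NCBI E-utilities ESearch endpoint as one scriptable route into Entrez (the tutorial itself uses only the web forms), and the helper names phrase, boolean_query and esearch_url are hypothetical. The synonyms are the ones listed on the BioThesaurus slide. The code only builds the request URL and does not send it.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities ESearch endpoint (not named in the tutorial).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def phrase(term):
    """Quote multi-word terms so they are searched as one phrase."""
    return f'"{term}"' if " " in term else term


def boolean_query(all_of=(), any_of=(), none_of=()):
    """Combine terms with AND, OR and NOT into one query string."""
    parts = [phrase(t) for t in all_of]
    if any_of:
        parts.append("(" + " OR ".join(phrase(t) for t in any_of) + ")")
    if not parts:
        raise ValueError("need at least one all_of or any_of term")
    query = " AND ".join(parts)
    for term in none_of:
        query += " NOT " + phrase(term)
    return query


def esearch_url(db, query, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    params = {"db": db, "term": query, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Synonyms for alpha crystallin A chain, as listed by BioThesaurus.
    synonyms = ["CRYAA", "CRYA1", "HSPB4", "alpha crystallin A chain"]
    query = boolean_query(all_of=["rabbit"], any_of=synonyms)
    print(query)
    print(esearch_url("protein", query))
```

Grouping the synonyms with OR inside parentheses before applying AND and NOT keeps the query readable when gene and protein names are ambiguous, which is the situation the synonym search is meant to address.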