Other biological databases and ontologies

Biological systems
• Sequence data
• Protein folding and 3D structure
• Taxonomic data
• Literature
• Pathways and networks
• Protein families and domains
• Small molecules
• Whole genome data
• Ontologies (GO)

Ontologies
• An ontology is a formal specification of terms and the relationships between them; ontologies are widely used in biology and bioinformatics (e.g. taxonomy)
• The relationships are important and are represented as graphs
• Ontology terms should have definitions
• Ontologies are machine-readable
• They are needed for ordering and comparing large data sets

What’s in a name?
• What is a cell?

Ambiguities in naming
• Different names can be used to describe the same concept, e.g.:
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis
• All refer to the process of making glucose
• This makes it difficult to compare information
• Solution: use ontologies and data standards

Gene Ontology (GO)
http://www.geneontology.org
• Controlled vocabulary/ontology
• Introduced to provide a standardised way of annotating gene products
• Used for functional annotation of genes or proteins

GO ontologies
• Molecular function:
– tasks performed by a gene product
– e.g. G-protein coupled receptor
• Biological process:
– broad biological goals accomplished by one or more gene products
– e.g. G-protein signaling pathway
• Cellular component:
– part(s) of a cell of which a gene product is a component; includes the extracellular environment of cells
– e.g. nucleus, membrane etc.

GO term examples
• GO terms are arranged in a directed acyclic graph (DAG)
• Relationships between terms

How to annotate to GO
• See if the gene product is annotated already, e.g. by a model organism database (MOD) or GOA
• Manual annotation: needs evidence codes
• Blast2GO
• Using GO mapping files (e.g.
InterPro, EC, Swiss-Prot keyword)

Multiple GO terms
Process mappings:
– Cell communication (IPR2GO)
– GPCR pathways (SPKW2GO)
– GPCR pathways (IDA)
Select the most manual annotation first, then the most specific

Finding existing GO annotation
• Small-scale: QuickGO or AmiGO browsers
• Large-scale:
– GOA FTP site
   • GOA proteomes (>25% coverage)
   • GOA human, mouse, rat, cow, zebrafish, Arabidopsis, etc.
   • GOA UniProt
– Proteome Analysis

Searching GOA in QuickGO
• http://www.ebi.ac.uk/ego

Uses of GO annotation
Microarray data analysis: GO classification
Larkin JE et al, Physiol Genomics, 2004
Cunliffe HE et al, Cancer Res, 2003

Analysis of high-throughput data
Proteomics data analysis: GO classification

Open Biomedical Ontologies (OBO)
http://obo.sourceforge.net
• Central web location for accessing well-structured controlled vocabularies (CVs) and ontologies for use in the biological and medical sciences
• Provides a simple format for ontologies that encodes terms, relationships between terms and definitions of terms (not all OBO ontologies use this format, however)

Scope of OBO
• Anatomy
• Animal natural history and life history
• Chemical
• Development
• Ethology
• Evidence codes
• Experimental conditions
• Genomic and proteomic
• Metabolomics
• OBO relationship types
• Phenotype
• Taxonomic classification
• Vocabularies

Other Biological Databases
• Transcription factor binding sites: TRANSFAC
• Protein structure databases: PDB, SCOP, CATH
• Protein family databases: Pfam, Prints, PROSITE etc.
• Chemicals and small molecules: ChEBI
• Gene expression databases: GEO, ArrayExpress
• Metabolic pathways: Reactome, KEGG
• Genome databases: Ensembl, FlyBase, WormBase etc.
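
Before moving on to the individual databases, the GO and OBO slides above can be tied together with a small sketch. It is only an illustration under several assumptions: the term IDs (prefix EX:) are invented and are not real GO identifiers, the stanza layout merely imitates the style of an OBO flat file, IDA is treated as the only manual evidence code, and the number of ancestors in the graph is used as a rough proxy for how specific a term is. The function names are hypothetical. The code parses the toy terms, walks the is_a relationships upward, and applies the "most manual first, then most specific" rule to the three process mappings from the "Multiple GO terms" slide.

```python
# Toy ontology written in an OBO-like stanza layout.
# The EX: identifiers are invented for illustration; they are not GO IDs.
TOY_OBO = """
[Term]
id: EX:0000001
name: biological process

[Term]
id: EX:0000002
name: cell communication
is_a: EX:0000001

[Term]
id: EX:0000003
name: G-protein signaling pathway
is_a: EX:0000002
"""


def parse_terms(text):
    """Parse [Term] stanzas into {id: {"name": str, "is_a": [parent ids]}}."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "[Term]":
            current = {"name": "", "is_a": []}
            continue
        if not line or current is None:
            continue
        key, _, value = line.partition(": ")
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "is_a":
            # Drop any trailing "! comment" after the parent ID.
            current["is_a"].append(value.split("!")[0].strip())
    return terms


def ancestors(terms, term_id):
    """Return every term reachable from term_id by following is_a links."""
    seen = set()
    stack = list(terms[term_id]["is_a"])
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(terms[parent]["is_a"])
    return seen


# Assumption: only IDA counts as manual here; mapping-file sources
# such as IPR2GO and SPKW2GO are treated as electronic.
MANUAL_CODES = {"IDA"}


def pick_annotation(terms, candidates):
    """Prefer manual evidence, then the term with the most ancestors."""
    return max(
        candidates,
        key=lambda c: (c[1] in MANUAL_CODES, len(ancestors(terms, c[0]))),
    )


# (term id, evidence source) pairs mirroring the "Multiple GO terms" slide.
CANDIDATES = [
    ("EX:0000002", "IPR2GO"),
    ("EX:0000003", "SPKW2GO"),
    ("EX:0000003", "IDA"),
]

terms = parse_terms(TOY_OBO)
best = pick_annotation(terms, CANDIDATES)
print(terms[best[0]]["name"], best[1])
```

Because every ancestor of a term is implied by an annotation to that term, keeping only the most specific manually supported term loses no information in this toy setting. A real pipeline would read the full ontology and annotation files instead of an inline string.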

Transcription factor binding sites
• TRANSFAC: database of eukaryotic transcription factors: http://www.gene-regulation.com/pub/databases.html#transfac
• TESS (Transcription Element Search System): for predicting transcription factor binding sites, uses TRANSFAC: http://www.cbi.upenn.edu/tess
• TFsearch: for searching transcription factor binding sites: http://www.cbrc.jp/research/db/TFSEARCH.html

Protein structure databases
• Main resource is the Protein Data Bank (PDB): http://www.rcsb.org/pdb/
• Repository for solved structures
• Can search by PDB code
• Structural family databases based on PDB: SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) and CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)
• Predicted structures in SWISS-MODEL (http://swissmodel.expasy.org//SWISSMODEL.html)

Searching MSD
http://www.ebi.ac.uk/msd
– Search by PDB code
– Link to CATH

Protein family databases
• Databases that produce signatures for identifying protein families or domains
• Used for functional classification of proteins
• E.g. Pfam, PROSITE, Prints, SMART, TIGRFAMs etc.
• Integrated into a single resource, InterPro (http://www.ebi.ac.uk/interpro)

InterProScan sequence search
Stand-alone version available
Results for protein acc

Example InterPro entry

Chemicals and small molecules
• Chemical Abstracts: http://www.cas.org/
• ChEBI: http://www.ebi.ac.uk/chebi
• KEGG: part of it includes chemicals: http://www.genome.jp/kegg
• ChemIDplus: chemicals cited in NLM databases: http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
• MSD-Chem: ligands and chemicals in MSD

ChEBI example entry
Hierarchy for chemicals

Gene expression databases
• NCBI Gene Expression Omnibus (GEO): http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress: http://www.ebi.ac.uk/arrayexpress
• Stanford Microarray Database: http://genome-www5.stanford.edu/
• Can usually search for experiments or particular expression profiles

GEO search page
Profiles search results
Specific entry and experiment info
ArrayExpress search results

Metabolic Pathways
• PATHGUIDE: >200 pathways
• KEGG (Kyoto Encyclopedia of Genes and Genomes): http://www.genome.jp/kegg, includes:
– Database of chemicals, genes and networks (metabolic, regulatory etc.)
– Well-curated and quite specific
• EcoCyc (Encyclopedia of E. coli K12 genes and metabolism): http://ecocyc.org, curation of the entire genome
• Reactome: curated biological pathways: http://www.reactome.org/
• GenMAPP: pathways contributed by users

Pathway in Reactome
Example of a pathway in BioCyc

Protein-protein interaction databases
• Protein-protein interaction databases store pairwise interactions or complexes
• IntAct: http://www.ebi.ac.uk/intact
• DIP (Database of Interacting Proteins): http://dip.doe-mbi.ucla.edu/
• BIND (Biomolecular Interaction Network Database): http://submit.bind.ca:8080/bind/

Protein-protein interactions

Genome browsers
• Integrate sequence and functional data for a genome
• Ensembl: genome browser for major eukaryotic genomes, e.g. human, mouse etc.
http://www.ensembl.org
• UCSC browser: http://genome.ucsc.edu/
• FlyBase: Drosophila genome database: http://www.ebi.ac.uk/flybase
• WormBase: C. elegans: http://www.wormbase.org
• PlasmoDB: Plasmodium (malaria): http://plasmodb.org
• Etc.

Ensembl genome browser
Ensembl gene view 1
Ensembl gene view 2
Gene within context on chromosome