Other_biol_databases_TDR

Other biological databases
and ontologies
Biological systems
• Sequence data
• Protein folding and 3D structure
• Taxonomic data
• Literature
• Pathways and networks
• Protein families and domains
• Small molecules
• Whole genome data
• Ontologies (GO)
Ontologies
• An ontology is a formal specification of terms and the
relationships between them; ontologies are widely used in biology and
bioinformatics (e.g. taxonomy)
• The relationships are important and are represented as graphs
• Ontology terms should have definitions
• Ontologies are machine-readable
• They are needed for ordering and comparing large data
sets
What’s in a name?
• What is a cell?
Ambiguities in naming
• The same name can be used to describe different
concepts, and one concept can be given several names, e.g.:
– Glucose synthesis
– Glucose biosynthesis
– Glucose formation
– Glucose anabolism
– Gluconeogenesis
• All refer to the process of making glucose
• Makes it difficult to compare the information
• Solution: use Ontologies and Data Standards
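A controlled vocabulary removes the ambiguity by pointing every synonym at one term identifier. A minimal sketch, assuming a hand-built synonym table: the identifier `EX:0000001` and the helper `normalise_term` are invented for illustration and do not come from any real ontology.

```python
# Illustrative synonym table: every variant name points to one term ID.
# "EX:0000001" is a made-up identifier, not a real ontology accession.
SYNONYMS = {
    "glucose synthesis": "EX:0000001",
    "glucose biosynthesis": "EX:0000001",
    "glucose formation": "EX:0000001",
    "glucose anabolism": "EX:0000001",
    "gluconeogenesis": "EX:0000001",
}


def normalise_term(name):
    """Return the shared term ID for a free-text name, or None if unknown."""
    key = " ".join(name.lower().split())  # ignore case and extra spaces
    return SYNONYMS.get(key)


labels = ["Glucose synthesis", "gluconeogenesis", "Glucose  anabolism"]
ids = {normalise_term(label) for label in labels}
print(ids)
```

Once every record carries the same identifier, data sets annotated by different groups can be compared directly.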
Gene Ontology (GO)
http://www.geneontology.org
• Controlled vocabulary/ontology
• Introduced to provide a standardised way of annotating
gene products (http://www.geneontology.org)
• Used for functional annotation of genes or proteins
GO ontologies
• Molecular function:
– tasks performed by a gene product, e.g. G-protein coupled
receptor
• Biological process:
– broad biological goals accomplished by one or more gene
products, e.g. G-protein signaling pathway
• Cellular component:
– part(s) of a cell of which a gene product is a component;
includes the extracellular environment of cells, e.g. nucleus,
membrane etc.
GO term examples
• GO terms are arranged in a directed acyclic graph (DAG)
• Edges show the relationships between terms
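In a DAG a term may have more than one parent, so finding all broader terms means walking every parent edge. A minimal sketch, assuming a toy graph: the term names below are illustrative and are not copied from a GO release, and `ancestors` is a hypothetical helper.

```python
# Toy DAG: each term maps to its direct parents.
# Term names are illustrative and are not copied from a GO release.
PARENTS = {
    "biological process": [],
    "cell communication": ["biological process"],
    "response to stimulus": ["biological process"],
    "signal transduction": ["cell communication", "response to stimulus"],
    "GPCR signaling pathway": ["signal transduction"],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent edges upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("GPCR signaling pathway")))
```

A gene product annotated to a specific term is implicitly annotated to all of that term's ancestors, which is why the traversal matters when counting annotations.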
How to annotate to GO
• Check whether the gene product is already annotated, e.g. by a
model organism database (MOD) or GOA
• Manual annotation, which needs evidence codes
• Blast2GO
• Using GO mapping files (e.g. InterPro, EC,
Swiss-Prot keyword)
Multiple GO terms
Process mappings:
– Cell communication (IPR2GO)
– GPCR pathways (SPKW2GO)
– GPCR pathways (IDA)
Select the most manual annotation first, then the most specific
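The selection rule can be sketched as a two-part sort key. This is only an illustration under stated assumptions: the depth values are invented, IDA (inferred from direct assay) is treated as the manual source, IPR2GO and SPKW2GO are treated as automatic mappings, and `pick_annotation` is a hypothetical helper.

```python
# Each candidate: (GO term name, source of the annotation, depth in the DAG).
# Depth values are invented for illustration; deeper means more specific.
CANDIDATES = [
    ("Cell communication", "IPR2GO", 2),
    ("GPCR pathways", "SPKW2GO", 5),
    ("GPCR pathways", "IDA", 5),
]

# Assumption: IDA is curator-assigned, the other two sources are
# automatic mapping files.
MANUAL_SOURCES = {"IDA"}


def pick_annotation(candidates):
    """Prefer manual annotations, then break ties by greater depth."""
    return max(candidates, key=lambda c: (c[1] in MANUAL_SOURCES, c[2]))


best = pick_annotation(CANDIDATES)
print(best)
```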
Finding existing GO annotation
• Small-scale: QuickGO or AmiGO browsers
• Large-scale:
– GOA FTP site
• GOA proteomes (>25% coverage)
• GOA human, mouse, rat, cow, zebrafish, Arabidopsis, etc.
• GOA UniProt
– Proteome Analysis
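For large-scale use, annotation files are usually processed with a script. A minimal sketch, assuming the tab-delimited GO annotation file (GAF) layout, in which comment lines start with "!" and columns 2, 5, 7 and 9 hold the object ID, GO ID, evidence code and ontology aspect. The sample rows, accessions and the `read_gaf` helper are invented for illustration.

```python
import csv
import io

# Invented sample rows in GAF column order; not real annotations.
SAMPLE_GAF = (
    "!gaf-version: 2.0\n"
    "UniProtKB\tP00001\tGENE1\t\tGO:0000001\tREF:1\tIDA\t\tP\n"
    "UniProtKB\tP00001\tGENE1\t\tGO:0000002\tREF:2\tIEA\t\tF\n"
)


def read_gaf(handle):
    """Yield one dict per annotation row, skipping comments and short rows."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0].startswith("!") or len(row) < 9:
            continue
        yield {
            "object_id": row[1],
            "go_id": row[4],
            "evidence": row[6],
            "aspect": row[8],  # P = process, F = function, C = component
        }


records = list(read_gaf(io.StringIO(SAMPLE_GAF)))
manual = [r for r in records if r["evidence"] != "IEA"]
print(len(records), len(manual))
```

Filtering on the evidence code, as in the last lines, is one way to separate electronic annotations (IEA) from the rest.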
Searching GOA in QuickGO
• http://www.ebi.ac.uk/ego
Uses of GO annotation
Microarray data analysis
GO classification
Larkin JE et al, Physiol Genomics, 2004
Cunliffe HE et al, Cancer Res, 2003
Analysis of high-throughput data
Proteomics data analysis
GO classification
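A GO classification of a gene list is, at its simplest, a tally of genes per GO category. A minimal sketch with invented gene names and category assignments; a real analysis would take these from an annotation file and would normally test for enrichment as well.

```python
from collections import Counter

# Invented example: genes from one expression cluster, each with a
# single GO category assigned.
GENE_TO_CATEGORY = {
    "geneA": "signal transduction",
    "geneB": "signal transduction",
    "geneC": "metabolism",
    "geneD": "transport",
}

counts = Counter(GENE_TO_CATEGORY.values())
total = sum(counts.values())
for category, n in counts.most_common():
    print(f"{category}: {n} of {total} genes")
```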
Open Biomedical Ontologies (OBO)
http://obo.sourceforge.net
• Central web location for accessing well-structured
CVs and ontologies for use in the biological and
medical sciences.
• Provides a simple format for ontologies that
encodes terms, relationships between terms and
definitions of terms (Not all OBO ontologies use
this format however).
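The OBO flat-file format stores each term as a `[Term]` stanza of `tag: value` lines. A minimal sketch of a reader, assuming only the `id`, `name`, `def` and `is_a` tags; the identifiers in the sample are invented, `parse_obo` is a hypothetical helper, and a production parser would need to handle many more tags and escapes.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: EX:0000001
name: signal transduction
def: "Illustrative definition text." [EX:ref]
is_a: EX:0000002 ! cell communication

[Term]
id: EX:0000002
name: cell communication
"""


def parse_obo(text):
    """Parse [Term] stanzas into dicts keyed by term id."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {"is_a": []} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            if tag == "is_a":
                # Drop the trailing "! parent name" comment.
                current["is_a"].append(value.split(" ! ")[0])
            else:
                current[tag] = value
                if tag == "id":
                    terms[value] = current
    return terms


terms = parse_obo(SAMPLE_OBO)
print(terms["EX:0000001"]["name"], terms["EX:0000001"]["is_a"])
```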
Scope of OBO
• Anatomy
• Animal natural history and life history
• Chemical
• Development
• Ethology
• Evidence codes
• Experimental conditions
• Genomic and proteomic
• Metabolomics
• OBO relationship types
• Phenotype
• Taxonomic classification
• Vocabularies
Other Biological Databases
• Transcription factor binding sites: TRANSFAC
• Protein structure databases: PDB, SCOP, CATH
• Protein family databases: Pfam, Prints, PROSITE etc.
• Chemicals and small molecules: ChEBI
• Gene expression databases: GEO, ArrayExpress
• Metabolic pathways: Reactome, KEGG
• Genome databases: Ensembl, FlyBase, WormBase etc.
Transcription factor binding sites
• TRANSFAC, a database of eukaryotic transcription
factors: http://www.generegulation.com/pub/databases.html#transfac
• TESS (Transcription Element Search System), which
predicts transcription factor binding sites using
TRANSFAC: http://www.cbi.upenn.edu/tess
• TFsearch, for searching transcription factor binding
sites:
http://www.cbrc.jp/research/db/TFSEARCH.html
Protein structure databases
• Main resource is Protein Data Bank (PDB):
http://www.rcsb.org/pdb/
• Repository for solved structures
• Can search by PDB code
• Structural family databases based on PDB: SCOP
(http://scop.mrc-lmb.cam.ac.uk/scop/) and CATH
(http://www.biochem.ucl.ac.uk/bsm/cath/)
• Predicted structures in SWISS-MODEL
(http://swissmodel.expasy.org//SWISSMODEL.html)
Searching MSD
http://www.ebi.ac.uk/msd (search by PDB code)
Link to CATH
Protein family databases
• Databases that produce signatures for identifying
protein families or domains
• Used for functional classification of proteins
• E.g. Pfam, PROSITE, Prints, SMART,
TIGRFAMs etc.
• Integrated into single resource InterPro
(http://www.ebi.ac.uk/interpro)
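One kind of signature is a sequence pattern written in PROSITE syntax, which maps closely onto a regular expression. A minimal sketch, assuming only the basic syntax elements (`x` for any residue, `[...]` allowed residues, `{...}` forbidden residues, `(n)` or `(n,m)` repeats, `<` and `>` terminal anchors). The helper `prosite_to_regex` and the test sequences are invented, and other member databases use profiles or hidden Markov models instead of patterns.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")          # patterns end with a period
    regex = regex.replace("-", "")       # "-" only separates positions
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("x", ".")      # x matches any residue
    regex = regex.replace("<", "^").replace(">", "$")   # terminal anchors
    return regex


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(glyco)
print(finger)

# Invented sequences used only to exercise the patterns.
print(bool(re.search(glyco, "AANGSAA")))
print(bool(re.search(glyco, "AANPSAA")))
```

The braces are rewritten before the parentheses so that a repeat count such as `(2,4)` is not mistaken for a forbidden-residue set.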
InterProScan sequence search
Stand-alone version available
Results for protein accession
Example InterPro entry
Chemicals and small molecules
• Chemical Abstracts: http://www.cas.org/
• ChEBI: http://www.ebi.ac.uk/chebi
• KEGG, part of which covers chemicals:
http://www.genome.jp/kegg
• ChemIDplus, chemicals cited in NLM databases:
http://chem2.sis.nlm.nih.gov/chemidplus/chemidlite.jsp
• MSD-Chem: ligands and chemicals in MSD
ChEBI example entry
Hierarchy for chemicals
Gene expression databases
• NCBI Gene Expression Omnibus (GEO)
http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress http://www.ebi.ac.uk/arrayexpress
• Stanford microarray database http://genomewww5.stanford.edu/
• Can usually search for experiments or particular
expression profiles
GEO search page
Profiles search results
Specific entry and experiment info
ArrayExpress search results
Metabolic Pathways
• Pathguide: a listing of >200 pathway resources
• KEGG (Kyoto Encyclopedia of Genes and Genomes):
http://www.genome.jp/kegg - includes:
– Database of chemicals, genes and networks (metabolic,
regulatory etc.)
– Well-curated and quite specific
• EcoCyc (Encyclopedia of E. coli K12 genes and
metabolism): http://ecocyc.org - curation of the entire
genome
• Reactome, curated biological pathways:
http://www.reactome.org/
• GenMAPP: pathways contributed by users
Pathway in Reactome
Example of a pathway in BioCyc
Protein-protein interaction databases
• Protein-protein interaction databases store
pairwise interactions or complexes
• IntAct http://www.ebi.ac.uk/intact
• DIP (Database of Interacting Proteins)
http://dip.doe-mbi.ucla.edu/
• BIND (Biomolecular Interaction Network
Database) http://submit.bind.ca:8080/bind/
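Pairwise interactions are naturally held as an undirected graph. A minimal sketch, assuming interactions arrive as pairs of protein identifiers; the identifiers are invented and `build_network` is a hypothetical helper, not part of any of the databases listed.

```python
from collections import defaultdict

# Invented identifiers; real databases use protein accession numbers.
PAIRS = [("protA", "protB"), ("protB", "protC"), ("protA", "protB")]


def build_network(pairs):
    """Store each pairwise interaction in both directions, without duplicates."""
    network = defaultdict(set)
    for a, b in pairs:
        network[a].add(b)
        network[b].add(a)
    return network


network = build_network(PAIRS)
print(sorted(network["protB"]))
```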
Protein-protein interactions
Genome browsers
• Integrate sequence & functional data for a genome
• Ensembl: genome browser for major eukaryotic genomes,
e.g. human, mouse etc. http://www.ensembl.org
• UCSC browser: http://genome.ucsc.edu/
• FlyBase, Drosophila genome database:
http://www.ebi.ac.uk/flybase
• WormBase, C. elegans: http://www.wormbase.org
• PlasmoDB, Plasmodium (malaria): http://plasmodb.org
• Etc.
Ensembl genome browser
Ensembl gene view 1
Ensembl gene view 2
Gene within context on chromosome