Understanding proteins: resources for identification and annotation

The Gene Ontology: Annotating protein function, role and localization
Contact: Jane Lomax, Coordinator, GO Editorial Office, EBI-EMBL
jane@ebi.ac.uk

What is an ontology?
→ Collectibles & art → Stamps → UK (Great Britain) → Victoria → 1884 GREAT BRITAIN 10S SCOTT (11,999.99$)

A definition...
“A controlled representation of ideas, concepts or events in a given domain and the relationships between them.”

Why do we need ontologies?
Help with data retrieval: allow grouping of annotations
• Annotation counts: brain 20, hindbrain 15, rhombomere 10
• Query ‘brain’ without ontology: 20
• Query ‘brain’ with ontology: 45
Make data (re-)usable through standards
• Common structure and terminology (controlled vocabulary)
• Avoid redundancies (single data source)
• Allow common tools, techniques, training, validation...
Adapted from Barry Smith: http://ontology.buffalo.edu/smith/BioOntology_Course.html

Gene ontology
http://geneontology.org/

What is the gene ontology?
Organized, controlled vocabulary of terms that describe gene product characteristics.
• Represents gene product properties, not gene products themselves
• Three branches (domains):
  o Cellular component
  o Molecular function
  o Biological process
• Species-independent (with taxonomic restrictions)
• Represents physiological processes
• Goes up to the level of the cell

How does GO work?
The Gene Ontology is like a dictionary
term: transcription initiation
id: GO:0006352
definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.

GO tree and annotations
[Figure: GO tree showing is_a and part_of relationships. Clark et al., 2005]

An annotation example…
GO terms for Caspase 9

Which processes are up- or down-regulated?
[Figure: gene tree (Pearson correlation) of all genes (14010), attacked vs control, with clusters labelled: Defense response; Immune response; Response to stimulus; Toll regulated genes; JAK-STAT regulated genes; Puparial adhesion; Molting cycle hemocyanin; Amino acid catabolism; Lipid metabolism; Peptidase activity; Protein catabolism. Credit: Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.]

QuickGO: browsing GO
http://www.ebi.ac.uk/QuickGO/
• Term definition
• Term relationships (ancestors)
• Term relationships (children)
• Proteins annotated to term

Annotation and ontology files
www.geneontology.org/GO.downloads.shtml
Ontology files:
• Hold ontology terms and structure
• Species-independent
• You can get GO-slims
Annotation files:
• Hold list of terms and the proteins annotated with them
• You can get species-specific files or the whole annotation.
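
The effect of combining ontology structure with annotations, as in the ‘brain’ query earlier in this section, can be sketched in a few lines of Python. This is a toy illustration only: the three-term hierarchy, the names `PARENTS` and `DIRECT_COUNTS`, and the helper functions are made up for this example and are not part of any GO tool or file format. The counts are the ones shown on the ‘brain’ slide.

```python
# Toy ontology: child term -> parent terms.
# The parent links are an assumption made for illustration.
PARENTS = {
    "hindbrain": ["brain"],
    "rhombomere": ["hindbrain"],
}

# Number of annotations made directly to each term (figures from the slide).
DIRECT_COUNTS = {"brain": 20, "hindbrain": 15, "rhombomere": 10}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    found = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def count_without_ontology(term):
    """Count only the annotations made directly to `term`."""
    return DIRECT_COUNTS.get(term, 0)


def count_with_ontology(term):
    """Count annotations to `term` and to all of its descendants."""
    return sum(
        count
        for annotated, count in DIRECT_COUNTS.items()
        if annotated == term or term in ancestors(annotated)
    )


print(count_without_ontology("brain"))  # 20
print(count_with_ontology("brain"))     # 45
```

With the hierarchy in place, a query for ‘brain’ also picks up annotations made to the more specific terms, which is where the 45 (20 + 15 + 10) comes from.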
More about GO: EBI Train online
www.ebi.ac.uk/training/online/course/go-quick-tour
www.ebi.ac.uk/training/online/course/uniprot-goa-quick-tour

Acknowledgements & questions
Jane Lomax, Coordinator, GO Editorial Office, EBI-EMBL
jane@ebi.ac.uk

UniProt: A repository of annotated protein sequences
Contact: Duncan Legge, UniProt Content Team, EBI-EMBL
help@uniprot.org
dlegge@ebi.ac.uk

Background of UniProt
Since 2002, a merger and collaboration of three databases:
• Swiss-Prot
• TrEMBL
• PIR-PSD
Funded mainly by NIH (US) to be the highest quality, most thoroughly annotated protein sequence database

We Aim To Provide…
o A high quality protein sequence database
  A non-redundant protein database with maximal coverage, including splice isoforms, disease variants and PTMs. Sequence archiving is essential.
o Easy protein identification
  Stable identifiers and consistent nomenclature / controlled vocabularies
o Thorough protein annotation
  Detailed information on protein function, biological processes, molecular interactions and pathways, cross-referenced to external sources

The Two Sides of UniProtKB
UniProtKB/TrEMBL
• 1 entry per nucleotide submission
• Redundant, automatically annotated (unreviewed)
• Computationally annotated
UniProtKB/Swiss-Prot
• 1 entry per protein
• Non-redundant, high-quality manual annotation (reviewed)
• Manually annotated

Data sources of UniProtKB
[Diagram: sources feeding UniProt/TrEMBL; labels: PDB, ENA (EMBL) DNA database, FlyBase, WormBase, VEGA (Sanger), mRNA Data, Patent Data, Sub/ Peptide Data, Ensembl]

Curation of a UniProt/SwissProt entry
[Diagram: from UniProt/TrEMBL to UniProt/SwissProt; labels: References, Sequence, Sequence variants, Literature, Annotations, Nomenclature, Ontologies, Sequence features]

UniProt Website
www.uniprot.org

UniProt layout

Annotation comments
FUNCTION, SUBCELLULAR LOCATION, ALTERNATIVE PRODUCTS, TISSUE SPECIFICITY, DEVELOPMENTAL STAGE, INDUCTION, SIMILARITY, CATALYTIC ACTIVITY, COFACTOR, ENZYME REGULATION, BIOPHYSICOCHEMICAL PROPERTIES, PATHWAY, SUBUNIT, INTERACTION, PTM, RNA EDITING, MASS SPECTROMETRY, DOMAIN, POLYMORPHISM, DISRUPTION PHENOTYPE, ALLERGEN, DISEASE, TOXIC DOSE, BIOTECHNOLOGY, PHARMACEUTICAL, MISCELLANEOUS, CAUTION, SEQUENCE CAUTION, WEB RESOURCE
• Evidence tags to show source
• Controlled vocabularies used whenever possible

Proteomes in UniProt
Complete proteomes
Complete sets of proteins thought to be expressed by organisms whose genomes have been completely sequenced.
Reference proteomes
Some complete proteomes have been selected as reference proteome sets. These cover the proteomes of well-studied model organisms and other proteomes of interest for biomedical research.

Obtaining Proteomes

Help / Feedback
• Stuck? Just ask: there is an active help and support team
• Feedback: if you find something incorrect, outdated, missing etc., please tell us.
help@uniprot.org

Find out more: EBI online courses
www.ebi.ac.uk/training/online/course/uniprot-quick-tour/

Acknowledgements & questions
Duncan Legge, UniProt Content Team, EBI-EMBL
dlegge@ebi.ac.uk

InterPro: An integrated protein sequence analysis resource
Contact: Amaia Sangrador, InterPro curation Team, EBI-EMBL
interhelp@ebi.ac.uk
amaia@ebi.ac.uk

What is InterPro?
• InterPro is a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites
• It does this by combining predictive models (known as signatures) from different databases into a single functional analysis of protein sequences

The aim of InterPro
[Diagram: InterPro]

Protein annotation: a predictive approach
• Model the pattern of conserved amino acids at specific positions within a multiple sequence alignment
• We can use these models to infer relationships with the characterised sequences from which the alignment was constructed
• This is the approach taken by protein signature databases

Three (4) different protein signature approaches
• Single motif methods: Patterns
• Multiple motif methods: Fingerprints
• Full alignment methods: Profiles & Hidden Markov models (HMMs)

InterPro Consortium
[Diagram: member databases; labels: Hidden Markov Models, Fingerprints, Profiles, Patterns, HAMAP, Structural domains, Functional annotation of families/domains, Protein features (sites)]

InterPro signature integration process
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which sequences they match
• Curators manually inspect the matches before integrating the signatures into InterPro
• Signatures representing the same entity are integrated together
• Relationships between entries are traced, where possible
• Curators add literature-referenced abstracts, cross-refs to other databases, and GO terms
http://www.ebi.ac.uk/interpro/

Using InterPro
Let’s find some information about T-cell surface antigen CD4 in InterPro
Search using the key word: CD4
Results from the “CD4” key word search
Family-centered view: Type, Name, Identifier, Contributing signatures, Description, References, GO terms

Using InterPro
Search using human CD4 protein sequence
Protein-centered view: Identifier, Type, Name, Family, Domains
Domain-centered view: Type, Name, Contributing signatures, Identifier, Description, References
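
The single-motif (pattern) approach named among the signature methods in this section can be illustrated with a small Python sketch. It is a toy under stated assumptions, not how any InterPro member database builds its signatures: the function names `pattern_from_alignment` and `scan`, the `max_residues` threshold and the aligned sequences are all invented for illustration, and real patterns are curated by hand from much larger alignments.

```python
import re


def pattern_from_alignment(aligned, max_residues=2):
    """Build a regular expression from an ungapped alignment.

    Fully conserved columns become a literal residue, columns with at most
    `max_residues` different residues become a character class, and all
    other columns become a wildcard.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    parts = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])
        elif len(residues) <= max_residues:
            parts.append("[" + "".join(residues) + "]")
        else:
            parts.append(".")
    return "".join(parts)


def scan(pattern, sequence):
    """Return 1-based start positions of all matches, including overlaps."""
    # A lookahead lets overlapping matches be reported.
    return [m.start() + 1 for m in re.finditer("(?=" + pattern + ")", sequence)]


# Made-up ungapped alignment of a short motif.
alignment = ["CAHGC", "CSHAC", "CTHGC"]
pattern = pattern_from_alignment(alignment)
print(pattern)                          # C.H[AG]C
print(scan(pattern, "MKCQHACLLCAHGC"))  # [3, 10]
```

A match from such a pattern says only that the sequence fits the model; as with any signature, it suggests a relationship to the aligned sequences rather than proving one.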

Using InterPro with unknown sequences: InterProScan
Search with unknown protein sequence
InterProScan is the software package that allows sequences to be scanned against InterPro's signatures
Results show:
• InterPro entries and contributing signatures
• Unintegrated signatures (not reviewed)

InterPro usage
Within the EBI
• Used by UniProtKB curators in their annotation of Swiss-Prot proteins
• Forms part of the automated system that adds annotation to UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >60 million Gene Ontology (GO) mappings to >17 million distinct UniProtKB sequences
Outside the EBI
• 50,000 unique visitors to the web site per month
• >2 million sequences searched online per month
• Plus offline searches with downloadable version

Remember!
• We are using biologically-unaware search tools and probabilistic models
• Probabilistic models != biological certainty
• Ask questions, weigh the evidence

Caveats
• Sheer amount of data can be overwhelming
• Member databases do not always agree!
• InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!

We need your feedback!
• missing/additional references
• reporting problems
• requests
interhelp@ebi.ac.uk

Find out more: EBI online courses
www.ebi.ac.uk/training/online/course-list/introduction-protein-classification-ebi
www.ebi.ac.uk/training/online/course/interpro-quick-tour
www.ebi.ac.uk/training/online/course/interpro-functional-and-structural-analysis-protei

Acknowledgements & questions
Amaia Sangrador, InterPro curation team, EBI-EMBL
amaia@ebi.ac.uk

PDBe: Protein Data Bank in Europe
Contact: Gary Battle, Project Leader Outreach, PDBe
battle@ebi.ac.uk
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope

PDBe overview
• Mission: Bringing Structure to Biology
• Major activities:
  o Deposition and annotation site for structural data on biomacromolecules (X-ray, NMR, EM)
  o Integration of macromolecular structure data with important biological and chemical data resources
  o Provide tools and services for accessing, exploiting and disseminating structural data to the wider biomedical community

Worldwide Protein Data Bank (wwPDB)

PDBeXplore
Browse the PDB using familiar classification systems (enzymes, folds, families, compounds, taxonomy, sequence).
Latest structures: pdbe.org/pdbexplore

PDBePISA
Exploration of macromolecular (protein, DNA/RNA and ligand) interfaces and prediction of probable quaternary structures.
Predict quaternary structure: pdbe.org/pisa

PDBeFold
Interactive comparison, alignment and superposition based on protein secondary structure.
Find similar structures: pdbe.org/fold

PDBeMotif
Flexible 3D search and analysis of protein-ligand interactions, binding environments and structural motifs.
Analyse binding sites and motifs: pdbe.org/motif

NMR resources and services
Visualisation and validation of NMR models and data.
NMR resources: pdbe.org/nmr

EM resources and services
Comprehensive search and analysis tools for EMDB entries.
EM resources: pdbe.org/em

Electron Microscopy Data Bank (EMDB)
• Global public repository for EM density maps of macromolecular complexes and subcellular structures
• Founded at EBI in 2002
• Jointly operated by PDBe, RCSB and NCMI
• PDBe EM portal provides advanced search, visualisation and analysis services.
http://pdbe.org/emdb

Educational resources: Quips
Interactive exploration of interesting structures from the PDB
Quite interesting PDB structures: pdbe.org/quips

Stay informed…
http://www.facebook.com/proteindatabank
http://twitter.com/PDBeurope

Find out more: EBI online courses
www.ebi.ac.uk/training/online/course/pdbe-quick-tour/

Acknowledgements & questions
Gary Battle, EBI-EMBL
battle@ebi.ac.uk