Introduction to InterPro
Amaia Sangrador, InterPro curator
amaia@ebi.ac.uk
EBI is an Outstation of the European Molecular Biology Laboratory.

What is InterPro?
DIAGNOSTICS RESOURCE: InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins
• Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
• Adds information about the signatures and the types of proteins they match

InterPro Consortium
Consortium of 11 major signature databases

Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and in UniProtKB/Swiss-Prot by date, 5-Jan-04 to 5-Jan-10; y-axis from 0 to 14,000,000]

What is UniProt?
• Based on the original work on PIR, Swiss-Prot and TrEMBL
• Collaboration between EBI, SIB and PIR
The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.
• UniProtKB (protein knowledgebase)
  – UniProtKB/Swiss-Prot: reviewed, high-quality manual annotation
  – UniProtKB/TrEMBL: unreviewed, automatic annotation
• UniRef (sequence clusters): UniRef100, UniRef90, UniRef50
• UniMES: metagenomic and environmental sample sequences
• UniParc (sequence archive): current and obsolete sequences
• Sources: EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources

Annotation using InterPro
[Diagram labels: TrEMBL uncharacterised sequence; Swiss-Prot manually annotated sequence; automatic annotation pipeline; InterPro protein signatures; groups of related proteins (same family or share domains)]

Protein family classification
• Given a set of sequences, we usually want to know:
  – what are these proteins; to what family do they belong?
  – what is their function; how can we explain this in structural terms?
Protein family classification: BLAST (pairwise comparisons)

Limitations with pairwise comparisons
• BLAST alignment of 2 proteins:
• 60S acidic ribosomal protein P0 from 2 species

Protein family classification: signature databases
• Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously characterised sequences
• This is the approach taken by ‘signature’ databases

Protein signatures
• More sensitive homology searches
• Each member database creates signatures using different methods and methodologies:
  – manually created sequence alignments
  – automatic processes with some human input and correction
  – entirely automatic processes

What are protein signatures?
[Diagram labels: protein family/domain; multiple sequence alignment; build model; search it (UniProt); significant match (ITWKGPVCGLDGKTYRNECALL, AVPRSPVCGSDDVTYANECELK); mature model; protein analysis]

Member databases: methods
[Diagram labels: Hidden Markov Models; structural domains; fingerprints; profiles; patterns; protein features (active sites…); functional annotation of families/domains; sequence clusters; prediction of conserved domains]

Diagnostic approaches (sequence-based)
• Single motif methods: regex patterns (PROSITE)
• Full domain alignment methods: profiles (Profile Library), HMMs (Pfam)
• Multiple motif methods: identity matrices (PRINTS)

Patterns
[Diagram labels: sequence alignment; motif; define pattern; extract pattern sequences; build regular expression]
Pattern signature PS00000: C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C

Patterns
Advantages
• Anchoring the match to the extremity of a sequence: <M-R-[DE]-x(2,4)-[ALT]-{AM}
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines): C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• Simple but less powerful
Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites

PROSITE patterns
Pattern/motif in sequence, written as a regular expression
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)
A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTI
SANGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKT
QKCELDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAG
IKVCAIKAPGFGENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDT
VILDGAGDKKGIEERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVG
EKKDRVTDALNATKAAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTP
VYTIASNAGVEGAVIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAAS
VSSLLTTTEAVVVDLPKDESESGAAGAGMGGMGGMDY

Fingerprints
[Diagram labels: sequence alignment; Motif 1, Motif 2, Motif 3; define motifs; extract motif sequences; weight matrices; correct order (1, 2, 3); correct spacing]
Fingerprint signature PR00000

The significance of motif context
• Identify small conserved regions in proteins
• Several motifs characterise a family
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (order, interval)

PRINTS families are hierarchical
Different motifs describe subfamilies
• G protein-coupled receptors
  – rhodopsin-like, secretin-like, cAMP receptors, metabotropic glutamate receptors, etc.
    · opsin receptors, adenosine receptors, dopamine receptors, somatostatin receptors, histamine receptors, etc.
      · somatostatin receptor type 1, somatostatin receptor type 2, somatostatin receptor type 3, etc.

Profiles & HMMs
[Diagram labels: sequence alignment; define coverage (whole protein or entire domain); use entire alignment for domain or protein; build model]
Profile or HMM signature
Profiles
• Built using weight matrices
Hidden Markov Models (HMM)
• More sophisticated algorithm
• Models insertions and deletions
• More flexible (can use partial alignments)

PROSITE and HAMAP profiles: a functional annotation perspective
• PROSITE domains: high quality manually curated seeds (using biologically characterized UniProtKB/Swiss-Prot entries), documentation and annotation rules. Oriented toward functional domain discrimination.
• HAMAP families: manually curated bacterial, archaeal and plastid protein families (represented by profiles and associated rules), covering some highly conserved proteins and functions.
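The PROSITE pattern syntax shown on the Patterns slides maps closely onto ordinary regular expressions. The following is a minimal sketch of that mapping, not the PROSITE or InterProScan implementation: the function name prosite_to_regex is hypothetical, and it only handles the constructs that appear in this deck (single residues, x, [allowed] sets, {forbidden} sets, repeat counts, and the < and > anchors).

```python
import re

# One pattern element: a residue, x, an [allowed] set or a {forbidden} set,
# optionally followed by a repeat count such as (2) or (2,4).
_ELEMENT = re.compile(
    r"(?P<core>[A-Z]|x|\[[A-Z]+\]|\{[A-Z]+\})"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; covers only the syntax used
    in the examples of this presentation.
    """
    pattern = pattern.strip().rstrip(".")
    pieces = []
    for element in pattern.split("-"):
        element = element.strip()
        at_start = element.startswith("<")   # anchor to the N-terminus
        at_end = element.endswith(">")       # anchor to the C-terminus
        element = element.lstrip("<").rstrip(">")

        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: %r" % element)

        core = match.group("core")
        if core == "x":
            core = "."                        # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # forbidden residues

        repeat = ""
        if match.group("lo"):
            repeat = "{" + match.group("lo")
            if match.group("hi"):
                repeat += "," + match.group("hi")
            repeat += "}"

        pieces.append(("^" if at_start else "") + core + repeat
                      + ("$" if at_end else ""))
    return "".join(pieces)


# The cpn60 pattern (PS00296) against a fragment of the CH60A_ARATH sequence.
cpn60 = prosite_to_regex("A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]")
hit = re.search(cpn60, "AAVEEGILPGGGVALLYAARELEK")
print(cpn60, hit.group(0) if hit else None)
```

Applied to the PS00296 pattern, the resulting expression matches the stretch AAVEEGILPGGG of the CH60A_ARATH sequence. A plain regular expression gives only a yes/no match, which is the "simple but less powerful" trade-off noted for patterns.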
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• PFAM: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• GENE3D: models correspond to CATH domains

Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases

InterPro entry

The InterPro entry: types
• Family: proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure
• Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
• Repeats: short sequences typically repeated within a protein
• Sites: PTM, active site, binding site, conserved site

InterPro Entry
• Groups similar signatures together
• Adds extensive annotation
• Links to other databases
• Structural information and viewers

Groups similar signatures together
• Quality control
• Removes redundancy
• Hierarchical classification

InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other families
Parent/child relationships are based on:
• Comparison of protein hits
  – child should be a subset of parent
  – siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators

InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy

Adds extensive annotation
The Gene Ontology project provides a controlled vocabulary of terms for describing gene product characteristics

Links to other databases
UniProt; KEGG ...; Reactome ...; IntAct ...; UniProt taxonomy; PANDIT ...; MEROPS ...; Pfam clans ...; PubMed

Structural information and viewers
• PDB: 3-D structures
• SCOP: structural domains
• CATH: structural domain classification

Understanding signatures: non-overlapping signatures can be describing the same thing
It is not always possible to use signature overlap to determine how family signatures are related
e.g. High molecular weight glutenins
• PF03157: 336 protein hits
• PR00210: 331 protein hits
Two very different signatures both describing the same thing!
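The hit-based rules from the hierarchy slides (a child should be a subset of its parent; siblings should not have matches in common) amount to set comparisons on the proteins each signature matches. The sketch below illustrates that idea only; it is not InterPro's curation pipeline. The function names and the accessions (PROT1 to PROT4) are made up for illustration, and in practice existing member database hierarchies and curator knowledge are weighed as well.

```python
def compare_hits(hits_a, hits_b):
    """Classify how two signatures relate, using only their protein hit sets."""
    a, b = set(hits_a), set(hits_b)
    if a == b:
        return "identical"          # both signatures match the same proteins
    if a < b:
        return "a_subset_of_b"      # a is a candidate child of b
    if b < a:
        return "b_subset_of_a"      # b is a candidate child of a
    if a.isdisjoint(b):
        return "disjoint"           # candidate siblings: no shared matches
    return "partial_overlap"        # needs a closer look


def shared_fraction(hits_a, hits_b):
    """Jaccard index: shared hits divided by all hits of either signature."""
    a, b = set(hits_a), set(hits_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical hit lists (placeholder accessions, not real data).
family = {"PROT1", "PROT2", "PROT3", "PROT4"}
subfamily_1 = {"PROT1", "PROT2"}
subfamily_2 = {"PROT3"}

print(compare_hits(subfamily_1, family))
print(compare_hits(subfamily_1, subfamily_2))
print(shared_fraction(subfamily_1, family))
```

Comparing hit sets rather than match positions is what lets two signatures that do not overlap on the sequence, such as PF03157 and PR00210 above, still be recognised as describing the same proteins.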
Some signatures give us similar, but complementary information
• PFAM shows domain is composed of two types of repeated sequence motifs
• SUPERFAMILY shows the potential domain boundaries
www.ebi.ac.uk/interpro

Discontinuous signatures require interpretation
1) Signature method
   • e.g. PRINTS: discrete motifs
2) Duplicated domains
   • e.g. SSF: duplication consisting of 2 domains with same fold
3) Repeated elements
   • e.g. Kringle, WD40
4) Non-contiguous domains
   • Structural domains can consist of non-contiguous sequence

Searching InterPro: WHEN TO USE INTERPRO
Use InterPro to predict family, domain or active site information for a given protein or amino acid sequence.
You can search InterPro if you have
• a protein sequence
• a UniProtKB protein identifier
• a Gene Ontology term
• a protein structure code
• a general search term (keyword or short phrase)
and require further information regarding your protein of interest.
Search tools include:
• Text Search
• InterProScan (sequence search)
• BioMart (builds queries)
http://www.ebi.ac.uk/interpro/
Beta version: http://wwwdev.ebi.ac.uk/interpro/

InterPro Search
Search using:
• text
• protein ID
• InterPro ID
• GO term
e.g. ID: GO:0006915, Name: apoptosis

InterPro Search
Search results for GO:0006915 (apoptosis)

InterPro Search: protein ID

InterPro Search Results
[Screenshot labels: Family; Link to PDBe; Domains and sites; Unintegrated signatures; Structural data]

Structural information
• CATH and SCOP divide PDB structures into domains
• Note that one domain is discontiguous
• Swiss-Model and ModBase can predict structure for regions not covered by PDB

Searching InterPro: InterProScan
InterProScan: searching a new sequence
[Screenshot labels: Paste in unknown sequence; Additional options]
InterProScan: new search results
[Screenshot labels: Link to InterPro entry; Links to signature databases]

Searching InterPro: BioMart
BioMart Search
BioMart allows more powerful and flexible queries
• Large volumes of data can be queried efficiently
• The interface is shared with many other bioinformatics resources
• It allows federation with other databases:
  – PRIDE (mass spectrometry-derived proteins and peptides)
  – REACTOME (biological pathways)

BioMart Search
1) Choose Dataset
   a. Choose InterPro BioMart
   b. Choose InterPro entries or protein matches
2) Choose Filters
   Search specific entries, signatures or proteins, e.g. filter by specific proteins
3) Choose Attributes
   What results you want
4) Choose additional Dataset (optional)
   This is where you link results to PRIDE and Reactome

BioMart Search Results
• User manual
• Click to view results
• Output formats: HTML = web-formatted table; CSV = comma-separated values; TSV = tab-separated values; XLS = Excel spreadsheet

InterPro: the numbers
Our member databases all have their particular niche or focus... ...but InterPro is a combination of all their areas of expertise!
• InterPro 32.0: 21516 entries, 101175 signatures, covering 85.5% of UniProtKB
• Frequent releases: both protein and method updates
• 45 000 unique visitors per month
• The database has grown almost 10-fold in ~11 years

Caveats
InterPro is a predictive protein signature database.
• Small changes with a large impact may not be well represented, for example inactive peptidases such as Q8N3Z0, Q9W3H0
• InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!

We need your feedback!
• missing/additional references
• reporting problems
• requests
EBI support page.

Acknowledgements
InterPro Team: Alex Mitchell, David Lonsdale, Craig McAnulla, Siew-Yit Yong, Phil Jones, Anthony Quinn, Sebastien Pesseat, Matthew Fraser, Maxim Scheremetjew, Christopher Hunter, Prudence Mutowo, Amaia Sangrador, Sarah Hunter