Predicting protein structure and function using InterPro
and Alex Mitchell (V8.1, Sep. 2013)
www.ebi.ac.uk

Contents

Course Information
Course learning objectives
An introduction to InterPro
    What are protein signatures?
    InterPro entry types
    InterPro entry relationships
    The InterPro home page
    Searching InterPro
I. Searching InterPro using a text search
    Learning objectives
    What information can be found in the InterPro entry page?
    How do I interpret an InterPro protein view?
        Protein view: Overview page
        Protein view: Similar Proteins page
        Protein view: Structures page
    Summary
    Exercises
    Exercise 1 – Searching InterPro using a UniProt identifier
    Exercise 2 – Exploring InterPro entries
        General annotation
        Relationships
        GO (Gene Ontology) terms
        Proteins matched
        Domain organisation
        Taxonomy
        Structures
II. Querying sequences using InterProScan
    Learning objectives
    Summary
    Exercises
    Exercise 3 – Analysing sequences using InterProScan
Course summary
Further reading
Where to find out more

Course Information

Course description: This tutorial provides an introduction to InterPro, its web interface and content. You will learn how to search InterPro to obtain predicted information about protein function and about sequence and structural features.

Course level: Suitable for graduate-level scientists and above who have not used InterPro before. Regular users may benefit from going through the tutorial to familiarise themselves with the new InterPro website.

Pre-requisites: Basic knowledge of biology and basic computer skills

Subject area: Proteins and Proteomes, Genes and Genomes, Sequence Analysis

Target audience: Scientists interested in Sequence Analysis

Resources required: Internet access (a current browser such as the latest Firefox or Internet Explorer)

Approximate time needed: 2 hours

Course learning objectives

After completing this course, you should:
• Understand what InterPro is and how it can be useful in your research
• Know the different ways you can query InterPro
• Understand the information on the InterPro entry pages
• Be able to interpret the information on the InterPro protein analysis pages

An introduction to InterPro

InterPro provides functional analysis of proteins by classifying them into families and predicting the presence of important domains and sites. It does this by combining predictive models known as protein signatures from a number of different databases (referred to as member databases) into a single searchable resource. InterPro integrates the signatures, providing names and descriptive abstracts and, whenever possible, adding Gene Ontology (GO) terms, structural links and external database links. Together, the protein signatures combined within InterPro provide matches to approximately 80% of proteins in the UniProt database.

What are protein signatures?
Protein signatures are obtained by modelling the conservation of amino acids at specific positions within a group of related proteins (i.e., a protein family), or within the domains/sites shared by a group of proteins. InterPro’s member databases use different computational methods to produce protein signatures, and each has its own particular focus of interest: structural and/or functional domains, protein families, or protein features such as active sites or binding sites (see Figure 1).

Figure 1. InterPro member databases grouped by the methods used to construct their signatures. Their focus of interest is shown in blue text.

The protein signatures in InterPro are regularly run against the UniProt database of protein sequences, and all significant matches are reported on the InterPro web site, allowing users to check which proteins match a particular signature. This information is also used to aid UniProt curators in their annotation of Swiss-Prot proteins and by the automatic systems that add annotation to TrEMBL (the non-reviewed protein sequences of UniProtKB; see http://www.uniprot.org/).

InterPro entry types

The signatures provided by the member databases are integrated into InterPro entries. Each InterPro entry is assigned a type (see Figure 2): family, domain, repeat, or site (which can be a conserved site, active site, binding site or post-translational modification site).

Figure 2. InterPro entry types and definitions. More information can be found in the FAQ section of the InterPro web site (http://www.ebi.ac.uk/interpro/user_manual.html#faqs_04).

InterPro entry relationships

InterPro entries often share relationships with other entries in the database.
For example, where one entry might represent a protein family (such as the sugar transporter family), another entry might represent a specific subtype of that family (such as the glucose transporter type I subfamily). These relationships are stored in InterPro as hierarchies. When classifying proteins, the level at which we can place a protein in a hierarchy is important, since it determines how specific the functional information we can infer will be.

Figure 3. Examples of hierarchical relationships between InterPro entries.

Both family entries and domain entries are able to form hierarchical relationships, although family and domain hierarchies are kept separate in the database and do not overlap (for example, a subclass of a particular domain cannot be a subtype of a protein family).

The InterPro home page

You can find the InterPro home page at http://www.ebi.ac.uk/interpro/ (see Figure 4 below). The home page provides:
• search tools
• documentation, such as user manuals, release notes and recent publications
• links to tools for visualisation, complex data queries, etc.

Figure 4. The InterPro web site home page.

Searching InterPro

InterPro can be searched in a number of different ways:

Text search
• Via the text box on the main page or the search bar at the top right of all other InterPro web pages
• Search using: UniProtKB accessions or identifiers; InterPro entry names or identifiers; SCOP, CATH or PDB structural identifiers; GO identifiers; plain text

InterProScan
• Search using an amino acid sequence
• Use the sequence search box on the InterPro home page or follow the link to InterProScan for more search options

BioMart
• Allows more flexible and powerful querying; retrieve results in HTML, plain text or Microsoft Excel spreadsheet
• To access the InterPro BioMart, follow the link on the InterPro home page

I. Searching InterPro using a text search

Learning objectives

In this section, you will learn how to:
• query the InterPro web site using the text search option
• interpret the information in the InterPro protein view
• understand the information in the InterPro entry page

The text search is available via the text box on the main page and via the search bar at the top right of all other InterPro web pages. It can be used to search with plain text, a UniProtKB protein identifier or accession, an InterPro identifier or accession, a Gene Ontology (GO) identifier, or a protein structure code.

Figure 5. The InterPro search bar can be found at the top right of InterPro web pages.

To examine the pre-calculated analysis results that InterPro has stored for a protein in UniProtKB, perform a text search using its UniProtKB accession number or identifier (e.g., O15075). This will bring you to the protein view, which provides both the signature hits and structural information for the protein (see section “How do I interpret an InterPro protein view?”). If you have an accession number from GenBank, Xref, EMBL or Ensembl, you can convert it to a UniProtKB accession number using the EBI’s PICR service (http://www.ebi.ac.uk/Tools/picr/).

Searching with an InterPro identifier (e.g., IPR020405 or 20405) will provide the entry page with all the information corresponding to that entry (see section “What information can be found in the InterPro entry page?”). You can also use a member database signature accession (such as PR01445), which will return the InterPro entry that the signature is integrated into.

A simple text search query (with a word, GO term, etc.) will give you all the InterPro entries associated with that term.

What information can be found in the InterPro entry page?

The InterPro entry page consists of the following features (see Figure 6):
A. Entry type and name
B. Contributing signatures
C. Hierarchical relationships to other InterPro entries
D. Description of the entry, with links to references
E. GO terms that have been associated with that entry. GO terms are divided into three categories: biological process, molecular function and cellular component.

By following the links in the side menu (on the left hand side of the page), you can find information about the proteins matched by that entry, their domain organisation, the pathways and interactions in which they are involved, and their taxonomic coverage (i.e., the species in which the proteins are found).

Figure 6. The InterPro entry page for the endothelin receptor family (IPR000499).

How do I interpret an InterPro protein view?

The InterPro protein view is broken down into 3 pages: ‘Overview’, ‘Similar proteins’ and ‘Structures’ (where available). These can be accessed by selecting the appropriate link in the left hand side menu found on each of the pages.

Protein view: Overview page

This is the default protein view. It is an interactive page that contains a large amount of information (see Figure 7 below) and is broken down into the following sections:

Figure 7. The InterPro protein overview page.

Top Section (A)
This section summarises basic UniProtKB information about the protein: its UniProt name, short name and accession (with a link to UniProtKB), its length and taxonomic information.

Protein family membership (B)
The family that InterPro predicts the protein belongs to is presented in a hierarchy, where appropriate. Clicking on the links takes you to the InterPro entry pages for each level of the hierarchy matched.

Summary View (C)
The protein is represented as a grey bar, with its length in amino acids displayed along the bottom. A graphical representation is provided, summarising the domain and repeat information predicted by InterPro for the protein.
Domains and repeats are represented by coloured bars. Hovering over these bars with your mouse gives detailed information on their match locations and the different InterPro entries that they are summarising.

Sequence features (D)
This section contains detailed information showing all the InterPro member database signatures that match the protein. This is the raw information used to create the summary view above. It also contains site information (active sites, binding sites, etc.), where available.

The type and amount of information displayed in this section can be controlled using the ‘Filter view’ floating menu on the left hand side of the page. Family, domain, repeat and/or site matches (in any combination) can be selected for display. Matches to member database signatures not yet integrated into InterPro can also be displayed by clicking the ‘Unintegrated’ checkbox in the ‘Status’ section.

The colour scheme can be changed using the ‘Colour by’ option in the menu on the left hand side of the page. It is possible to colour by entry relationship (bars of the same colour are matches to entries in the same hierarchy) or by member database (bars of the same colour are from the same member database).

Hovering your mouse over the coloured bars will display the signature name, accession and the member database that the signature comes from. The amino acid coordinates that the signature matches are also shown. Clicking on the linked accession will take you to the member database page for that signature.

GO term prediction (E)
GO terms predicted for the protein, based on the InterPro entries that it matches, are found at the bottom of the page. More information about GO terms can be found at http://geneontology.org/.
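The behaviour of the ‘Filter view’ menu described above can be pictured with a small sketch. This is only an illustration of the selection logic, not InterPro's own code: the SignatureMatch class, the filter_matches function, the signature and database names, and all coordinates are hypothetical, and the sketch assumes that an unintegrated signature simply has no entry type.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SignatureMatch:
    """One member database signature match on a protein (hypothetical model)."""
    signature: str             # placeholder signature accession, not a real one
    member_db: str             # placeholder member database name
    entry_type: Optional[str]  # "family", "domain", "repeat" or "site"; None if unintegrated
    start: int                 # first matched amino acid position
    end: int                   # last matched amino acid position


def filter_matches(matches: Iterable[SignatureMatch],
                   types: Set[str],
                   show_unintegrated: bool = False) -> List[SignatureMatch]:
    """Keep matches whose entry type is ticked, plus unintegrated ones if requested."""
    selected = []
    for match in matches:
        if match.entry_type is None:
            # Assumption: unintegrated signatures are controlled by a separate checkbox.
            if show_unintegrated:
                selected.append(match)
        elif match.entry_type in types:
            selected.append(match)
    return selected


# Invented example data, for illustration only.
MATCHES = [
    SignatureMatch("SIG_FAMILY_1", "DB_A", "family", 1, 300),
    SignatureMatch("SIG_DOMAIN_1", "DB_B", "domain", 20, 120),
    SignatureMatch("SIG_SITE_1", "DB_C", "site", 50, 60),
    SignatureMatch("SIG_UNINT_1", "DB_D", None, 200, 280),
]

domains_and_sites = filter_matches(MATCHES, {"domain", "site"})
families_plus_unintegrated = filter_matches(MATCHES, {"family"}, show_unintegrated=True)
```

Ticking ‘Domain’ and ‘Site’ corresponds to the first call; ticking ‘Family’ together with ‘Unintegrated’ corresponds to the second.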
Protein view: Similar Proteins page

The Similar proteins page (illustrated in Figure 8) shows all the proteins in UniProt that have the same predicted domain organisation as the query protein. A cartoon summarising the domain organisation is displayed, followed by a list of matching proteins, with their UniProt accession numbers, names and species, the families to which InterPro predicts they belong, as well as their length and an indication as to whether structural information is available. Context-sensitive links to UniProt pages and InterPro entry and protein pages are available.

Figure 8. The InterPro Similar proteins page, showing all the proteins in UniProt that have the same predicted domain organisation as the query protein.

Protein view: Structures page

The Structures page (illustrated in Figure 9) shows all of the experimentally solved structures and predicted structural information for the query protein. Like the Overview page, the Structures page displays UniProt information about the protein, followed by a graphical summary of all of the domain and repeat information predicted by InterPro, so that this can be correlated with structural information. It then contains (where available):

Structural Features
Representative matches from PDB, SCOP and CATH are given. PDB contains information about experimentally determined structures, which can cover part or the whole of the protein, while CATH and SCOP break proteins into structural domains.

Structural Predictions
Two matches may be presented here, one to ModBase and the other to Swiss-Model. These are databases of homology models, which predict protein structure based on the closest homologue.

Solved structures for this protein
Links are provided to PDB structural information, along with a graphical representation of the solved structure.

Figure 9. The Structures page showing the sequence features predicted by InterPro and PDB structural information.

Summary

• The InterPro text search can be queried with UniProtKB accessions and identifiers, InterPro entry names and identifiers, GO terms, structural identifiers and plain text
• To return pre-calculated InterPro analysis results for a protein in UniProtKB, perform a text search using its UniProtKB accession number or identifier
• InterPro entry pages provide information related to that entry (including the contributing signatures) and links to the proteins matched by that entry
• The InterPro protein pages provide information about predicted protein family membership, sequence features, structural features and predictions, and GO terms for a particular protein

Exercises

Exercise 1 – Searching InterPro using a UniProt identifier

Open the InterPro homepage in a web browser (http://www.ebi.ac.uk/interpro/). Using the “Text” Search box mid-way down the page, type in the UniProtKB accession ‘O15075’ (without the quotes; that is a letter O at the start and a zero in the middle). Click on the purple “Search” button. You should now have an Overview page describing the signature matches for this protein.

Question 1: Looking at the InterPro protein view for O15075, how many InterPro entries (not individual signatures) match the query protein sequence? (Remember to include the protein family matches!)

Question 2: How many domains does InterPro predict the protein to have?

Question 3: How many member database signatures contribute to InterPro entry IPR003533?
Hint: You can see the member database signatures that contribute to InterPro entry IPR003533 in the “Sequence features” section, or alternatively click on the link to IPR003533, which will take you to the entry page for that domain.

Click on the checkbox next to Family in the ‘Filter view on’ menu, on the left hand side of the page.
This should show the location of family signature matches in the Sequence features section.

Question 4: Which member databases contribute signatures to InterPro entry IPR020636?

Try clicking on the checkbox next to Unintegrated in the ‘Filter view on’ menu, on the left hand side of the page. This should display the match positions for signatures not integrated into InterPro.

Question 5: How many unintegrated signatures match the sequence?

Scroll down the page until you reach the GO term prediction section.

Question 6: What are the Biological Process GO terms predicted for this sequence?

To find out more about the GO terms, including their exact definitions, click on the terms themselves, which will take you to the relevant QuickGO page.

On the Overview page, click on ‘Similar proteins’ in the left hand side menu. This will take you to a page listing all the sequences in UniProt with the same predicted domain organisation as your query sequence.

Question 7: Approximately how many proteins in UniProt share the same predicted domain architecture as the query protein?

On the Similar proteins page, click on ‘Structures’ in the left hand side menu. This will take you to the Structures page. In the “Structural Features” section, you will find the PDB structure. Its length indicates the region of the protein for which the structure is known. You will also see bars representing a CATH database match and a SCOP database match, both of which are structural classification databases that break down the PDB structures into their constituent domains.

Question 8: What region is covered by the PDB structure (i.e., which domain)?
Hint: Compare it to entry IPR003533 in the sequence features summary section.

Not all of the protein has been structurally characterised, shown by the fact that only a small region of this protein is covered by the PDB match.
To help address this problem, there are homology models from both ModBase and Swiss-Model, found under the “Structural Predictions” section. These are models based on aligning the protein with its closest homologue whose structure has been determined. (Note: these are predictive models that provide a ‘best guess’ at the remaining structure.)

Question 9: Why does IPR003533 have two domain hits compared with the single domain in PDB for this protein?

Clicking on the links in the ‘Solved structures for this protein’ section will take you to the relevant PDB page, where you can find out more about the structure and visualise it in 3-D (click on the icon on the PDB page).

Exercise 2 – Exploring InterPro entries

General annotation

Return to the Overview page for O15075. Under the Sequence features summary section, hover your mouse over the domain predicted to be located in the C terminus of the protein. Notice that there are several InterPro entries that cover approximately the same sequence position. Click on the hyperlink to ‘Protein kinase domain (IPR000719)’ and look at the “Contributing signatures” section. This section lists the signatures in an entry, the database they come from, and the number of proteins they match.

Question 10: Which member database signatures make up this entry?

Relationships

InterPro links related signatures through parent/child relationships, which indicate domain/family hierarchies. Child entries subdivide IPR000719 into more closely related subgroups.

Question 11: What “Child” entries is IPR000719 subdivided into?

Question 12: What is the name of the “Parent” of IPR000719?

In this case, the parent entry represents domains with a structural fold homologous to that of the protein kinase domain (even if they have no enzyme activity), whereas IPR000719 represents a more specific form of the domain that has catalytic protein kinase activity.
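As a rough illustration of how parent/child relationships of this kind can be navigated, the sketch below stores each entry's parent in a dictionary and walks from an entry up to the root of its hierarchy. It reuses the sugar transporter example from the introduction, keyed by name rather than by accession; the sibling subfamily, the PARENT_OF mapping and both function names are hypothetical, and this is not a description of how InterPro stores its hierarchies.

```python
from typing import Dict, List, Optional

# Child -> parent mapping; a parent of None marks the root of a hierarchy.
# The first two names come from the tutorial text; the third is invented.
PARENT_OF: Dict[str, Optional[str]] = {
    "sugar transporter family": None,
    "glucose transporter type I subfamily": "sugar transporter family",
    "hypothetical sibling subfamily": "sugar transporter family",
}


def lineage(entry: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the path from the root of the hierarchy down to `entry`."""
    chain = [entry]
    # Follow parent links until an entry with no parent is reached.
    while parent_of.get(chain[-1]) is not None:
        chain.append(parent_of[chain[-1]])
    return list(reversed(chain))


def children(entry: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return the direct children of `entry`, sorted by name."""
    return sorted(child for child, parent in parent_of.items() if parent == entry)


path = lineage("glucose transporter type I subfamily", PARENT_OF)
subtypes = children("sugar transporter family", PARENT_OF)
```

The deeper an entry sits in its lineage, the more specific the functional information that can be inferred from a match to it.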
GO (Gene Ontology) terms

Scroll down to the “GO terms annotation” section. InterPro provides mappings to GO terms based on the attributes of experimentally characterised proteins that match an entry. These are useful for the annotation of proteins that do not otherwise have GO terms associated with them.

Question 13: Which GO terms are associated with this entry?

Proteins matched

Now look at the proteins matched information in the menu on the left hand side of the page.

Question 14: Approximately how many proteins are matched by IPR000719?

Domain organisation

Click on the ‘Domain organisation’ link on the left hand side menu on the entry page for IPR000719. This will take you to a page showing domain organisations of all proteins in UniProt predicted to contain a protein kinase domain.

Question 15: Approximately how many proteins are predicted to possess a protein kinase catalytic domain followed by 3 PASTA domains?
Hint: you may have to mouse over the domain cartoons to identify the protein kinase domain and PASTA domains, as the way different domains are coloured is not very distinct (different domains can be assigned the same colour on the web page). We are currently working to fix this.

Taxonomy

Click on the ‘Species’ link on the left hand side of the InterPro entry page for IPR000719. You may explore the taxonomic spread of the sequences matching this entry by expanding the table in the Taxa section. InterPro divides all the protein hits in an entry by their taxonomy.

Question 16: What kind of taxonomic groups are sequences containing protein kinase domains found in?

Structures

Now click on the "Structures" link on the left hand side menu. InterPro provides a list of all the PDB entries associated with an entry. There are also structural links to SCOP and CATH at the bottom of the page, which provide structural classifications of the proteins that match this entry.
Scroll to the bottom of the page and follow the “SCOP d.144.1.7” link to the SCOP database to find out the structural classification of this domain.

Question 17: What type of structure does the protein kinase-like fold consist of?
Hint: look at the information under “Fold” in the “Lineage” section.

II. Querying sequences using InterProScan

Learning objectives

In this section, you will learn:
• how to perform sequence-based queries using InterProScan
• how to interpret the results of these queries

If you are using a novel protein sequence to query InterPro (i.e., a protein sequence that is not in UniProt), the simplest way is to copy and paste the amino acid sequence into the large box on the home page and click on the search button immediately to the right. This will run your sequence through InterProScan with the default parameters selected. For more advanced search options use the InterProScan link, which allows query sequences to be either entered directly or uploaded from a file in different formats (GCG, FASTA, EMBL, GenBank, PIR, etc.).

InterProScan incorporates all the analysis algorithms and result post-processing steps from the member databases. InterProScan outputs the resulting matches for a sequence in a graphical format. The matches can also be viewed as a table, which lists the signature match positions.

In addition to the online version of InterProScan, a stand-alone version can be downloaded from the ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/) and installed locally. Unlike the online version, the stand-alone version can accept multiple sequences as input. It can also process nucleic acid sequences as well as amino acid ones.

Summary

InterProScan can be used with sequence queries as a predictive tool for protein sequence classification.
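Because the stand-alone version accepts several sequences at once, it can help to check an input file before running it. The sketch below is one possible way to split multi-sequence FASTA text into records and to make a rough guess at whether each record is nucleic acid or protein. The function names and the alphabet heuristic are assumptions made for this illustration; the code does not call InterProScan and is not part of it.

```python
from typing import Dict

NUCLEIC_LETTERS = set("ACGTUN")
# The 20 standard amino acids plus X for an unknown residue.
PROTEIN_LETTERS = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text: str) -> Dict[str, str]:
    """Split FASTA-formatted text into {record name: sequence}."""
    records: Dict[str, list] = {}
    name = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            # Use the first word of the header as the record name.
            name = header.split()[0] if header else "unnamed"
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {rec_name: "".join(parts) for rec_name, parts in records.items()}


def guess_type(sequence: str) -> str:
    """Heuristic only: a short peptide made of A, C, G and T would be misclassified."""
    letters = set(sequence.upper())
    if letters <= NUCLEIC_LETTERS:
        return "nucleic"
    if letters <= PROTEIN_LETTERS:
        return "protein"
    return "unknown"


# Invented two-record input, for illustration only.
EXAMPLE = ">s1\nMPEF\nVPED\n>s2 some description\nACGTTGCA\n"
parsed = parse_fasta(EXAMPLE)
kinds = {rec_name: guess_type(seq) for rec_name, seq in parsed.items()}
```

A check like this catches common problems, such as a missing header line or a nucleotide sequence pasted into a protein-only search box, before a job is submitted.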
Exercises

Exercise 3 – Analysing sequences using InterProScan

For the next exercise we will use the following sequence:

>sequence1
MPEFVPEDLSGEEETVTECKDSLTKLLSLPYKSFSEKLHRYALSIKDKVVWETWERSGKRVRDYNLYTGVLGT
AYLLFKSYQVTRNEDDLKLCLENVEACDVASRDSERVTFICGYAGVCALGAVAAKCLGDDQLYDRYLARFRGI
RLPSDLPYELLYGRAGYLWACLFLNKHIGQESISSERMRSVVEEIFRAGRQLGNKGTCPLMYEWHGKRYWGAA
HGLAGIMNVLMHTELEPDEIKDVKGTLSYMIQNRFPSGNYLSSEGSKSDRLVHWCHGAPGVALTLVKAAQVYN
TKEFVEAAMEAGEVVWSRGLLKRVGICHGISGNTYVFLSLYRLTRNPKYLYRAKAFASFLLDKSEKLISEGQM
HGGDRPFSLFEGIGGMAYMLLDMNDPTQALFPG

Get the sequences from http://www.ebi.ac.uk/~hychang/.

First, we will analyse the sequence using InterProScan. Navigate to the InterPro homepage at http://www.ebi.ac.uk/interpro/ and paste the sequence into the text box labelled ‘FASTA Sequence’ located mid-way down the page. Press ‘Search’.

Note: depending on the load on the web servers, this step might take some time to complete. You might want to go ahead and start the next step while waiting for your analysis to finish.

Next, we will analyse the sequence using BLAST. Navigate to the EBI’s NCBI BLAST homepage at http://www.ebi.ac.uk/Tools/sss/ncbiblast/ and choose UniProt Knowledgebase as the protein database in Step 1 on the website (this should be selected by default). Copy and paste the above sequence into the box highlighted in Step 2 on the website (‘enter your input sequence’) and press ‘Submit’.

Question 18: Based on the InterPro matches, what family is the protein predicted to belong to?
Note: the InterProScan results page is currently formatted differently to the InterPro protein page in that the match information is not summarised; all the signature matches are shown. We are currently working on unifying the two views, so that the InterProScan results page will appear more like the protein page.

Question 19: What do your BLAST results suggest your protein to be?
Question 20: Are the results consistent with those returned by InterProScan? Is there anything in the InterPro annotation that might explain any discrepancies?
Hint: Examine InterPro entry IPR007822 and read the annotation.

Question 21: If you have time, you might like to investigate the UniProtKB/TrEMBL sequences A8T8V4 and A8UMV2. How are these sequences named and annotated in TrEMBL? What does InterPro suggest the function of these proteins to be?

Course summary

InterPro is a classification resource for protein families, domains and functional sites, which integrates the following protein signature databases: PROSITE, PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D and PANTHER.

By uniting the member databases, InterPro capitalises on their individual strengths, producing a powerful diagnostic tool and integrated resource.

InterPro can help if you have a sequence or set of sequences and want to know:
• what they are, and the protein family to which they belong
• what their function is and how it may be explained in structural terms

Further reading

Hunter S, et al. InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res. (2011) doi: 10.1093/nar/gkr948

Hunter S, et al. InterPro: the integrative protein signature database. Nucleic Acids Res. (2009) 37:D211-5.

Mulder NJ, Apweiler R. InterPro and InterProScan: tools for protein sequence classification and comparison. Methods in Molecular Biology (2007) 396:59-70.

Where to find out more

You can find links to documentation about InterPro (user manual, release notes) on our main web page, and also download information from our ftp server (ftp://ftp.ebi.ac.uk/pub/databases/interpro/).
More information on protein signatures and their use in protein classification can be found on the EBI’s online training portal (see http://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi).