An introduction to InterPro

Predicting protein structure and function using InterPro
Alex Mitchell (V8.1, Sep. 2013)
www.ebi.ac.uk
Contents
Course Information
Course learning objectives
An introduction to InterPro
  What are protein signatures?
  InterPro entry types
  InterPro entry relationships
  The InterPro home page
  Searching InterPro
I. Searching InterPro using a text search
  Learning objectives
  What information can be found in the InterPro entry page?
  How do I interpret an InterPro protein view?
    Protein view: Overview page
    Protein view: Similar Proteins page
    Protein view: Structures page
  Summary
  Exercises
    Exercise 1 – Searching InterPro using a UniProt identifier
    Exercise 2 – Exploring InterPro entries
      General annotation
      Relationships
      GO (Gene Ontology) terms
      Proteins matched
      Domain organisation
      Taxonomy
      Structures
II. Querying sequences using InterProScan
  Learning objectives
  Summary
  Exercises
    Exercise 3 – Analysing sequences using InterProScan
Course summary
Further reading
Where to find out more
Course Information
Course description
This tutorial provides an introduction to InterPro, its web interface and
content. You will learn how to search InterPro to obtain predicted information
about protein function and sequence and structural features.
Course level
Suitable for graduate-level scientists and above who have not used InterPro
before. Regular users may benefit from going through the tutorial to
familiarise themselves with the new InterPro website.
Pre-requisites
Basic knowledge of biology and basic computer skills
Subject area
Proteins and Proteomes, Genes and Genomes, Sequence Analysis
Target audience
Scientists interested in Sequence Analysis
Resources required
Internet access (a current browser such as the latest Firefox or Internet
Explorer)
Approximate time needed
2 hours
Course learning objectives
After completing this course, you should:
• Understand what InterPro is and how it can be useful in your research
• Know the different ways you can query InterPro
• Understand the information on the InterPro entry pages
• Be able to interpret the information on the InterPro protein analysis pages
An introduction to InterPro
InterPro provides functional analysis of proteins by classifying them into families
and predicting the presence of important domains and sites. It does this by
combining predictive models known as protein signatures from a number of
different databases (referred to as member databases) into a single searchable
resource. InterPro integrates the signatures, providing names and descriptive
abstracts and, whenever possible, adding Gene Ontology (GO) terms, structural
links and external database links. Together, the protein signatures combined
within InterPro provide matches to approximately 80% of proteins in the UniProt database.
What are protein signatures?
Protein signatures are obtained by modelling the conservation of amino acids at
specific positions within a group of related proteins (i.e., a protein family), or
within the domains/sites shared by a group of proteins. InterPro’s different
member databases use different computational methods to produce protein
signatures, and they each have their own particular focus of interest: structural
and/or functional domains, protein families, or protein features such as active
sites or binding sites (see Figure 1).
Figure 1. InterPro member databases grouped by the methods used to construct their
signatures. Their focus of interest is shown in blue text.
The protein signatures in InterPro are regularly run against the UniProt database
of protein sequences, and all significant matches are reported on the InterPro
web site, allowing users to check which proteins match a particular signature.
This information is also used to aid UniProt curators in their annotation of
Swiss-Prot proteins and by the automatic systems that add annotation to TrEMBL
(the non-reviewed protein sequences of UniProtKB; see http://www.uniprot.org/).
InterPro entry types
The signatures provided by the member databases are integrated into InterPro
entries. Each InterPro entry is assigned a type (see Figure 2): family, domain,
repeat, or site (which can be a conserved site, active site, binding site or
post-translational modification site).
Figure 2. InterPro entry types and definitions. More information can be found in
the FAQ section of the InterPro web site
(http://www.ebi.ac.uk/interpro/user_manual.html#faqs_04).
InterPro entry relationships
InterPro entries often share relationships with other entries in the database. For
example, where one entry might represent a protein family (such as the sugar
transporter family), another entry might represent a specific subtype of that family
(such as the glucose transporter type I subfamily). These relationships are stored
in InterPro as hierarchies. When classifying proteins, the level at which we can
place a protein in a hierarchy is important, since it determines the level of specific
functional information that we can infer.
Figure 3. Examples of hierarchical relationships between InterPro entries. Both family
entries and domain entries are able to form hierarchical relationships, although family and
domain hierarchies are kept separate in the database and do not overlap (for example, a
subclass of a particular domain cannot be a subtype of a protein family).
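The idea of a hierarchy can be sketched in a few lines of Python. This is only an illustration of the parent/child concept described above, not how InterPro stores its data: the `Entry` class, the `add_entry` and `lineage` helpers and the short keys are all hypothetical, and the keys are not real InterPro accessions. The entry names are the sugar transporter examples from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    key: str                      # hypothetical label, NOT a real InterPro accession
    name: str
    entry_type: str               # "family" or "domain"
    parent: Optional[str] = None  # key of the more general entry, if any


def add_entry(entries: Dict[str, Entry], entry: Entry) -> None:
    """Store an entry, refusing a parent of a different type.

    This mirrors the rule that family and domain hierarchies do not overlap.
    """
    if entry.parent is not None:
        if entries[entry.parent].entry_type != entry.entry_type:
            raise ValueError("family and domain hierarchies do not overlap")
    entries[entry.key] = entry


def lineage(entries: Dict[str, Entry], key: str) -> List[str]:
    """Return entry names from the most general ancestor down to `key`."""
    names: List[str] = []
    current: Optional[str] = key
    while current is not None:
        entry = entries[current]
        names.append(entry.name)
        current = entry.parent
    return list(reversed(names))


entries: Dict[str, Entry] = {}
add_entry(entries, Entry("sugar_tr", "Sugar transporter family", "family"))
add_entry(entries, Entry("glut1", "Glucose transporter type I subfamily",
                         "family", parent="sugar_tr"))

print(" > ".join(lineage(entries, "glut1")))
```

A protein placed at the subfamily level inherits everything that can be said about the parent family, plus the more specific annotation of the subfamily; a protein that can only be placed at the family level gets the general annotation alone.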
The InterPro home page
You can find the InterPro home page at http://www.ebi.ac.uk/interpro/ (see Fig. 4
below). The home page provides:
• search tools
• documentation, such as user manuals, release notes and recent
  publications
• links to tools for visualisation, complex data queries, etc.
Figure 4. The InterPro web site home page.
Searching InterPro
InterPro can be searched in a number of different ways:
• Text search
  - Via the text box on the main page or the search bar at the top
    right of all other InterPro web pages
  - Search using: UniProtKB accessions or identifiers; InterPro entry
    names or identifiers; SCOP, CATH or PDB structural identifiers;
    GO identifiers; plain text
• InterProScan
  - Search using an amino acid sequence
  - Use the sequence search box on the InterPro home page or follow
    the link to InterProScan for more search options
• BioMart
  - Allows more flexible and powerful querying; retrieve results in
    HTML, plain text or Microsoft Excel spreadsheet
  - To access the InterPro BioMart, follow the link on the InterPro
    home page
I. Searching InterPro using a text search
Learning objectives
In this section, you will learn how to:
• query the InterPro web site using the text search option
• interpret the information in the InterPro protein view
• understand the information in the InterPro entry page
The text search box can be found on the main page and at the top right of all
other InterPro web pages. It can be used to search with plain text, a UniProtKB
protein identifier or accession, an InterPro identifier or accession, a Gene
Ontology (GO) identifier, or a protein structure code.
Figure 5. The InterPro search bar can be found at the top right of InterPro web pages.
To examine the pre-calculated analysis results that InterPro has stored for a
protein in UniProtKB, perform a text search using its UniProtKB accession
number or identifier (e.g., O15075). This will bring you to the protein view, which
provides both the signature hits and structural information for the protein (see
section “How do I interpret an InterPro protein view?”). If you have an accession
number from GenBank, Xref, EMBL or Ensembl, you can convert it to a
UniProtKB accession number using the EBI’s PICR service
(http://www.ebi.ac.uk/Tools/picr/).
Searching with an InterPro identifier (e.g., IPR020405 or 20405) will provide the
entry page with all the information corresponding to that entry (see section “What
information can be found in the InterPro entry page?”). You can also use a
member database signature accession (such as PR01445), which will return the
InterPro entry that signature is integrated into.
A simple text search query (with a word, GO term, etc.) will give you all the
InterPro entries associated with that term.
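The two accepted forms of an InterPro identifier (IPR020405 or 20405) suggest a simple normalisation step if you prepare queries in a script. The following is a minimal sketch that assumes, based only on the examples in this tutorial, that an accession is "IPR" followed by six zero-padded digits; the function name is hypothetical and the sketch does not contact the InterPro web site.

```python
import re
from typing import Optional

# Assumption: an optional "IPR" prefix followed by up to six digits.
_IPR_QUERY = re.compile(r"^(?:IPR)?(\d{1,6})$", re.IGNORECASE)


def normalise_interpro_accession(query: str) -> Optional[str]:
    """Return the zero-padded IPR accession for a query, or None if it is not one."""
    match = _IPR_QUERY.match(query.strip())
    if match is None:
        return None
    return "IPR{:06d}".format(int(match.group(1)))


for text in ("IPR020405", "20405", "O15075", "PR01445"):
    print(text, "->", normalise_interpro_accession(text))
```

The first two queries both normalise to IPR020405. The UniProtKB accession O15075 and the member database signature PR01445 are not InterPro identifiers, so the function returns None for them and they would be handled as other query types.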
What information can be found in the InterPro entry page?
The InterPro entry page consists of the following features (see Fig. 6):
A. Entry type and name
B. Contributing signatures
C. Hierarchical relationships to other InterPro entries
D. Description of the entry, with links to references
E. GO terms that have been associated with that entry. GO terms are
divided into three categories: biological process, molecular function and
cellular component.
Following the links in the side menu (on the left hand side of the page),
information can be found about proteins matched by that entry, their domain
organisation, pathways & interactions in which they are involved, and their
taxonomic coverage (i.e., the species in which the proteins are found).
Figure 6. The InterPro entry page for the endothelin receptor family (IPR000499)
How do I interpret an InterPro protein view?
The InterPro protein view is broken down into 3 pages: ‘Overview’, ‘Similar
proteins’ and ‘Structures’ (where available). These can be accessed by selecting
the appropriate link in the left hand side menu found on each of the pages.
Protein view: Overview page
This is the default protein view and is an interactive page that contains a large
amount of information (see Fig. 7 below). It is broken down into the following
sections:
Figure 7. The InterPro protein overview page
Top Section (A)
• First of all we summarise basic UniProtKB information about the protein:
  its UniProt name, short name and accession (with a link to UniProtKB), its
  length and taxonomic information.
Protein family membership (B)
• The family that InterPro predicts the protein belongs to is presented in a
  hierarchy, where appropriate. Clicking on the links takes you to the
  InterPro entry pages for each level of the hierarchy matched.
Summary View (C)
• The protein is represented as a grey bar, with its length in amino acids
  displayed along the bottom.
• A graphical representation is provided, summarising the domain and
  repeat information predicted by InterPro for the protein.
• Domains and repeats are represented by coloured bars. Hovering over
  these bars with your mouse gives detailed information on their match
  locations and the different InterPro entries that they are summarising.
Sequence features (D)
• This section contains detailed information showing all the InterPro
  member database signatures that match the protein. This is the raw
  information used to create the summary view above. It also contains site
  information (active sites, binding sites, etc.), where available.
• The type and amount of information displayed in this section can be
  controlled using the ‘Filter view’ floating menu on the left hand side of the
  page.
• Family, domain, repeat and/or site matches (in any combination) can be
  selected for display.
• Matches to member database signatures not yet integrated into InterPro
  can also be displayed by clicking the ‘Unintegrated’ checkbox in the
  ‘Status’ section.
• The colour scheme can be changed using the ‘Colour by’ option in the
  menu on the left hand side of the page. It is possible to colour by entry
  relationship (bars of the same colour are matches to entries in the same
  hierarchy) or by member database (bars of the same colour are from the
  same member database).
• Hovering your mouse over the coloured bars will display the signature
  name, accession and the member database that the signature comes
  from. The amino acid coordinates that the signature matches are also
  shown. Clicking on the linked accession will take you to the member
  database page for that signature.
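The filtering behaviour described above can be pictured as a selection over a list of match records. The sketch below is an assumption-laden illustration, not the web site's implementation: the `SignatureMatch` record, the `filter_matches` function and all the signature and database labels are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SignatureMatch:
    signature: str             # placeholder label, not a real signature accession
    member_db: str             # placeholder member database name
    start: int                 # first matched residue (1-based)
    end: int                   # last matched residue (inclusive)
    entry_type: Optional[str]  # "family", "domain", "repeat", "site"; None if unintegrated


def filter_matches(matches: Iterable[SignatureMatch],
                   types: Set[str],
                   show_unintegrated: bool = False) -> List[SignatureMatch]:
    """Keep matches whose entry type is selected, plus unintegrated ones on request."""
    selected: List[SignatureMatch] = []
    for match in matches:
        if match.entry_type is None:
            if show_unintegrated:
                selected.append(match)
        elif match.entry_type in types:
            selected.append(match)
    return selected


matches = [
    SignatureMatch("SIG_A", "DB_ONE", 10, 120, "domain"),
    SignatureMatch("SIG_B", "DB_TWO", 5, 400, "family"),
    SignatureMatch("SIG_C", "DB_ONE", 200, 260, None),
]

# Domains only, then domains plus unintegrated signatures.
print([m.signature for m in filter_matches(matches, {"domain"})])
print([m.signature for m in filter_matches(matches, {"domain"}, show_unintegrated=True)])
```

Treating "unintegrated" as a separate switch rather than as another entry type reflects the separate ‘Status’ section in the menu: an unintegrated signature has no InterPro entry, and therefore no entry type to filter on.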
GO term prediction (E)
• GO terms predicted for the protein, based on the InterPro entries that it
  matches, are found at the bottom of the page. More information about GO
  terms can be found at http://geneontology.org/.
Protein view: Similar Proteins page
The Similar proteins page (illustrated in Figure 8) shows all the proteins in
UniProt that have the same predicted domain organisation as the query protein.
A cartoon summarising the domain organisation is displayed, followed by a list of
matching proteins, with their UniProt accession numbers, names and species,
the families to which InterPro predicts they belong, as well as their length and an
indication as to whether structural information is available. Context-sensitive links
to UniProt pages and InterPro entry and protein pages are available.
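To make "the same predicted domain organisation" concrete, the sketch below groups proteins by the ordered list of domains they contain. It is a simplified illustration under stated assumptions: the function name, protein labels and domain labels are hypothetical, and InterPro's own definition of a domain organisation may take more into account than the order of domains along the sequence.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each protein maps to a list of (start position, domain label) pairs.
DomainList = List[Tuple[int, str]]


def group_by_architecture(proteins: Dict[str, DomainList]) -> Dict[Tuple[str, ...], List[str]]:
    """Group protein labels by their domains ordered from N- to C-terminus."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for protein, domains in proteins.items():
        architecture = tuple(label for _, label in sorted(domains))
        groups[architecture].append(protein)
    return dict(groups)


# Hypothetical proteins and domain labels, for illustration only.
proteins = {
    "PROT_1": [(300, "KINASE"), (20, "REPEAT")],
    "PROT_2": [(15, "REPEAT"), (280, "KINASE")],
    "PROT_3": [(40, "KINASE")],
}

for architecture, members in group_by_architecture(proteins).items():
    print(" - ".join(architecture), ":", members)
```

Here PROT_1 and PROT_2 fall into the same group because both have a repeat followed by a kinase domain, even though the exact positions differ; PROT_3 has a different organisation and is listed separately.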
Figure 8. The InterPro Similar proteins page, showing all the proteins in UniProt that
have the same predicted domain organisation as the query protein.
Protein view: Structures page
The Structures page (illustrated in Figure 9) shows all of the experimentally
solved structures and predicted structural information for the query protein. Like
the Overview page, the Structures page displays UniProt information about the
protein, followed by a graphical summary of all of the domain and repeat
information predicted by InterPro, so that this can be correlated with structural
information. It then contains (where available):
Structural Features
• Representative matches from PDB, SCOP and CATH are given. PDB
  contains information about experimentally-determined structures and
  provides structures that can cover part or the whole protein, while CATH
  and SCOP break proteins into structural domains.
Structural Predictions
• Two matches may be presented here, one to ModBase and the other to
  Swiss-Model. These are homology-modelling databases that predict protein
  structure based on the closest homologue.
Solved structures for this protein
• Links are provided to PDB structural information, along with a graphical
  representation of the solved structure.
Figure 9. The Structures page showing the sequence features predicted by InterPro and
PDB structural information.
Summary
• The InterPro text search can be queried with UniProtKB accessions and
  identifiers, InterPro entry names and identifiers, GO terms, structural
  identifiers and plain text
• To return pre-calculated InterPro analysis results for a protein in
  UniProtKB, perform a text search using its UniProtKB accession number
  or identifier
• InterPro entry pages provide information related to that entry (including
  the contributing signatures) and links to the proteins matched by that
  entry
• The InterPro protein pages provide information about predicted protein
  family membership, sequence features, structural features and
  predictions, and GO terms for a particular protein.
Exercises
Exercise 1 – Searching InterPro using a UniProt identifier
Open the InterPro homepage in a web browser
(http://www.ebi.ac.uk/interpro/).
Using the “Text” Search box mid-way down the page, type in the UniProtKB
accession ‘O15075’ (without the quotes; that is a letter O at the start and a zero
in the middle). Click on the purple “Search” button.
You should now have an Overview page describing the signature matches for
this protein.
Question 1: Looking at the InterPro protein view for O15075, how many
InterPro entries (not individual signatures) match the query protein
sequence? (Remember to include the protein family matches!)
Question 2: How many domains does InterPro predict the protein to have?
Question 3: How many member database signatures contribute to InterPro
entry IPR003533?
Hint: You can see the contributing member database signatures to InterPro
entry IPR003533 in the “Sequence features” section, or alternatively click on the
link to IPR003533, which will take you to the entry page for that domain.
Click on the checkbox next to Family in the ‘Filter view on’ menu, on the left
hand side of the page. This should show the location of family signature matches
in the Sequence features section.
Question 4: Which member databases contribute signatures to InterPro
entry IPR020636?
Try clicking on the checkbox next to Unintegrated in the ‘Filter view on’
menu, on the left hand side of the page. This should display the match positions
for signatures not integrated into InterPro.
Question 5: How many unintegrated signatures match the sequence?
Scroll down the page until you reach the GO term prediction section.
Question 6: What are the Biological Process GO terms predicted for this
sequence?
To find out more about the GO terms, including their exact definitions, click
on the terms themselves, which will take you to the relevant QuickGO page.
On the Overview page, click on ‘Similar proteins’ in the left hand side menu.
This will take you to a page listing all the sequences in UniProt with the same
predicted domain organisation as your query sequence.
Question 7: Approximately how many proteins in UniProt share the same
predicted domain architecture as the query protein?
On the Similar proteins page, click on ‘Structures’ in the left hand side menu.
This will take you to the Structures page. In the “Structural Features” section, you
will find the PDB structure. Its length indicates the region of the protein for which
the structure is known. You will also see bars representing a CATH database
match and a SCOP database match, both of which are structural classification
databases that break down the PDB structures into their constituent domains.
Question 8: What region is covered by the PDB structure (i.e., which
domain)?
Hint: Compare it to entry IPR003533 in the sequence features summary
section.
Not all of the protein has been structurally characterised, shown by the fact that
only a small region of this protein is covered by the PDB match. To help address
this problem, there are homology models from both ModBase and Swiss-Model,
found under the “Structural Predictions” section. These are models based on
aligning the protein with its closest homologue whose structure has been
determined. (Note: these are predictive models that provide a ‘best guess’ at the
remaining structure).
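If you want to put a number on "only a small region", structural coverage can be estimated from match coordinates. The function below is a minimal sketch: its name is hypothetical, and the protein length and regions in the example are made-up values, not the real coordinates for O15075. Coordinates are assumed to be 1-based and inclusive, as displayed when hovering over a bar.

```python
from typing import Iterable, Tuple


def covered_residues(length: int, regions: Iterable[Tuple[int, int]]) -> int:
    """Count residues covered by at least one (start, end) region.

    Overlapping regions are counted once; regions are clipped to 1..length.
    """
    covered = 0
    last_end = 0  # highest residue already counted
    for start, end in sorted(regions):
        start = max(start, last_end + 1, 1)
        end = min(end, length)
        if start <= end:
            covered += end - start + 1
            last_end = end
    return covered


# Made-up example: a 100-residue protein with three structural matches,
# two of which overlap and one of which runs past the end of the protein.
length = 100
regions = [(10, 20), (15, 30), (90, 120)]
count = covered_residues(length, regions)
print(count, "of", length, "residues covered")  # prints: 32 of 100 residues covered
```

Sorting the regions first means each residue is counted at most once, so overlapping PDB, CATH and SCOP bars for the same region do not inflate the figure.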
Question 9: Why does IPR003533 have two domain hits compared with the
single domain in PDB for this protein?
Clicking on the links in the ‘Solved structures for this protein’ section will
take you to the relevant PDB page, where you can find out more
about the structure and visualise it in 3-D (click on the relevant
icon on the PDB page).
Exercise 2 – Exploring InterPro entries
General annotation
Return to the Overview page for O15075. Under the Sequence features summary
section, hover your mouse over the domain predicted to be located in the C
terminus of the protein. Notice that there are several InterPro entries that cover
approximately the same sequence position.
Click on the hyperlink to ‘Protein kinase domain (IPR000719)’ and look at the
“Contributing signatures” section. This section lists the signatures in an entry, the
database they come from, and the number of proteins they match.
Question 10: Which member database signatures make up this entry?
Relationships
InterPro links related signatures through parent/child relationships which indicate
domain/family hierarchies. Child entries subdivide IPR000719 into more closely
related subgroups.
Question 11: What “Child” entries is IPR000719 subdivided into?
Question 12: What is the name of the “Parent” of IPR000719?
In this case, the parent entry represents domains with a structural fold
homologous to that of the protein kinase domain (even if they have no enzyme
activity), whereas IPR000719 represents a more specific form of the domain that
has catalytic protein kinase activity.
GO (Gene Ontology) terms
Scroll down to the “GO terms annotation” section. InterPro provides
mappings to GO terms based on the attributes of experimentally characterised
proteins that match an entry. These are useful for the annotation of proteins that
do not otherwise have GO terms associated with them.
Question 13: Which GO terms are associated with this entry?
Proteins matched
Now look at the proteins matched information in the menu on the left hand
side of the page.
Question 14: Approximately how many proteins are matched by
IPR000719?
Domain organisation
Click on the ‘Domain organisation’ link on the left hand side menu on the
entry page for IPR000719. This will take you to a page showing domain
organisations of all proteins in UniProt predicted to contain a protein kinase
domain.
Question 15: Approximately how many proteins are predicted to possess a
protein kinase catalytic domain followed by 3 PASTA domains?
Hint: you may have to mouse over the domain cartoons to identify the
protein kinase domain and PASTA domains, as the way different domains are
coloured is not very distinct (different domains can be assigned the same colour
on the web page). We are currently working to fix this.
Taxonomy
Click on the ‘Species’ link on the left hand side of the InterPro entry page for
IPR000719. You may explore the taxonomic spread of the sequences matching
this entry by expanding the table in the Taxa section. InterPro divides all the
protein hits in an entry by their taxonomy.
Question 16: What kind of taxonomic groups are sequences containing
protein kinase domains found in?
Structures
Now click on the "Structures" link on the left hand side menu.
InterPro provides a list of all the PDB entries associated with an entry. There are
also structural links to SCOP and CATH at the bottom of the page, which provide
structural classifications of the proteins that match this entry.
Scroll to the bottom of the page and follow the “SCOP d.144.1.7” link to the
SCOP database to find out the structural classification of this domain.
Question 17: What type of structure does the protein kinase-like fold consist
of?
Hint: look at the information under “Fold” in the “Lineage” section.
II. Querying sequences using InterProScan
Learning objectives
In this section, you will learn:
• how to perform sequence-based queries using InterProScan
• how to interpret the results of these queries
If you are using a novel protein sequence to query InterPro (i.e., a protein
sequence that is not in UniProt), the simplest way is to copy and paste the amino
acid sequence into the large box on the home page and click on the search
button immediately to the right. This will run your sequence through InterProScan
with the default parameters selected. For more advanced search options use the
InterProScan link, which allows query sequences to be either entered directly or
uploaded from a file in different formats (GCG, FASTA, EMBL, GenBank, PIR,
etc).
InterProScan incorporates all the analysis algorithms and result post-processing
steps from the member databases. InterProScan outputs the resulting matches
for a sequence in a graphical format. The matches can also be viewed as a table,
which lists the signature match positions.
In addition to the online version of InterProScan, a stand-alone version can be
downloaded from the ftp server (ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/) and
installed locally. Unlike the online version of InterProScan, the stand-alone version
can accept multiple sequences as input. It can also process nucleic acid
sequences as well as amino acid ones.
Summary
• InterProScan can be used with sequence queries as a predictive
  tool for protein sequence classification.
Exercises
Exercise 3 – Analysing sequences using InterProScan
For the next exercise we will use the following sequence:
>sequence1
MPEFVPEDLSGEEETVTECKDSLTKLLSLPYKSFSEKLHRYALSIKDKVVWETWERSGKRVRDYNLYTGVLGT
AYLLFKSYQVTRNEDDLKLCLENVEACDVASRDSERVTFICGYAGVCALGAVAAKCLGDDQLYDRYLARFRGI
RLPSDLPYELLYGRAGYLWACLFLNKHIGQESISSERMRSVVEEIFRAGRQLGNKGTCPLMYEWHGKRYWGAA
HGLAGIMNVLMHTELEPDEIKDVKGTLSYMIQNRFPSGNYLSSEGSKSDRLVHWCHGAPGVALTLVKAAQVYN
TKEFVEAAMEAGEVVWSRGLLKRVGICHGISGNTYVFLSLYRLTRNPKYLYRAKAFASFLLDKSEKLISEGQM
HGGDRPFSLFEGIGGMAYMLLDMNDPTQALFPG
Get the sequences from http://www.ebi.ac.uk/~hychang/.
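The sequence above is in FASTA format: a header line starting with ">" followed by the sequence wrapped over several lines. If you want to check a sequence in a script before pasting it into a web form, a small parser such as the sketch below is enough. The function name is hypothetical, and the example record is a toy one that reuses only the first 20 residues of the exercise sequence.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each FASTA header (without the leading '>') to its joined sequence."""
    records: Dict[str, List[str]] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence lines are joined below
    return {name: "".join(parts) for name, parts in records.items()}


# Toy record: the first 20 residues of the exercise sequence, wrapped in two lines.
example = """>toy_sequence
MPEFVPEDLS
GEEETVTECK
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence), sequence)
```

To parse a file instead, pass an open file handle to `parse_fasta`; iterating over a text file yields its lines one by one.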
First, we will analyse the sequence using InterProScan. Navigate to the
InterPro homepage at http://www.ebi.ac.uk/interpro/ and paste the sequence into
the text box labelled ‘FASTA Sequence’ located mid way down the page. Press
‘Search’.
Note: depending on the load on the web servers, this step might take some time
to complete. You might want to go ahead and start the next step while waiting for
your analysis to finish.
Next, we will analyse the sequence using BLAST. Navigate to the EBI’s
NCBI BLAST homepage at http://www.ebi.ac.uk/Tools/sss/ncbiblast/ and choose
UniProt Knowledgebase as the protein database in Step 1 on the website (this
should be selected by default). Copy and paste the above sequence into the box
highlighted in Step 2 on the website (‘enter your input sequence’) and press
‘Submit’.
Question 18: Based on the InterPro matches, what family is the protein predicted to
belong to?
Note: the InterProScan results page is currently formatted differently to the
InterPro protein page in that the match information is not summarised - all the
signature matches are shown. We are currently working on unifying the two
views, so that the InterProScan results page will appear more like the protein
page.
Question 19: What do your BLAST results suggest your protein to be?
Question 20: Are the results consistent with those returned by
InterProScan? Is there anything in the InterPro annotation that might explain any
discrepancies?
Hint: Examine InterPro entry IPR007822 and read the annotation.
Question 21: If you have time, you might like to investigate the
UniProtKB/TrEMBL sequences A8T8V4 and A8UMV2. How are these sequences
named and annotated in TrEMBL? What does InterPro suggest the function of
these proteins to be?
Course summary
InterPro is a classification resource for protein families, domains and functional
sites, which integrates the following protein signature databases: PROSITE,
PRINTS, ProDom, Pfam, SMART, TIGRFAMs, PIRSF, SUPERFAMILY, Gene3D
and PANTHER. By uniting the member databases, InterPro capitalises on their
individual strengths, producing a powerful diagnostic tool and integrated
resource.
InterPro can help if you have a sequence or set of sequences and want to know:
• what they are, and the protein family to which they belong
• what their function is and how it may be explained in structural
  terms
Further reading
InterPro in 2011: new developments in the family and domain prediction
database. Hunter S, et al. Nucleic Acids Res. (2011) doi: 10.1093/nar/gkr948
InterPro: the integrative protein signature database. Hunter S, et al. Nucleic
Acids Res. (2009) 37:D211-5.
InterPro and InterProScan: tools for protein sequence classification and
comparison. Mulder NJ, Apweiler R. Methods in Molecular Biology (2007)
396:59-70.
Where to find out more
You can find links to documentation about InterPro (user manual, release notes)
on our main web page, and also download information from our ftp server
(ftp://ftp.ebi.ac.uk/pub/databases/interpro/). More information on protein
signatures and their use in protein classification can be found on the EBI’s online
training portal (see
http://www.ebi.ac.uk/training/online/course/introduction-protein-classification-ebi).