UniProt - European Bioinformatics Institute

Sandra Orchard (V1.1 05/05/09)
UniProtKB and InterPro - Annotating a protein sequence
The Starting Point – a sequence (peptide or protein)
BLAST (Basic Local Alignment Search Tool), finds regions of sequence
similarity and gives functional and evolutionary clues about the structure
and function of your novel sequence. WU-BLAST 2.0 and NCBI BLAST2
are distinctly different software packages, although they have a common
lineage for some portions of their code, so the two packages do their work
differently and obtain different results and offer different features. You can
also check for vector contamination with Blast2 EVEC.
Fasta can be very specific when identifying long regions of low similarity
especially for highly diverged sequences. You can also conduct sequence
similarity searching against complete proteome or genome databases
using the Fasta programs.
MPsrch – Smith and Waterman algorithm, capable of identifying hits in
cases where Blast and Fasta fail and also reports fewer false-positive hits.
For this exercise, we will use UniProt Blast
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 License. To view a copy
of this license, visit http://creativecommons.org/licenses/by-sa/3.0/ or send a letter to Creative Commons, 543
Howard Street, 5th Floor, San Francisco, California, 94105, USA.
 Paste your original sequence into BLAST, check Database =
UniProtKB and “Blast”
Scores should suggest a 100% match to Q9VCL7_DROME
 Click on the coloured bar to display aligned sequences
 On the right hand side of the display, find the series of check boxes
headed Annotations and select Hydrophobic. Would you expect this
protein to be transmembrane, looking at the distribution of its hydrophobic
The UniProt Consortium is comprised of the European Bioinformatics
Institute, the Swiss Institute of Bioinformatics and the Protein Information
Resource. The UniProt consortium aims to support biological research by
maintaining a high quality database that serves as a stable,
comprehensive, fully classified, richly and accurately annotated protein
sequence knowledgebase, with extensive cross-references and querying
interfaces freely accessible to the scientific community.
All data stored in UniProt can be downloaded from the Download Centre
at http://www.ebi.uniprot.org/database/download.shtml.
 Click on hyperlinked Q9VCL7_DROME open UniProt entry.
This is a UniProtKB/TrEMBL entry i.e. translated from the nucleotide
sequence and only automatic annotation and additional cross-referencing
Gene Name=Orf name – this will probably change when protein is
 Click on the taxonomy link to access organism detail.
? How many proteins are there in the Drosophila melanogaster complete
 Return back to the UniProtKB entry
 Scroll down to the cross-references
Nucleotide Database – original submission data, identical underlying
information stored in EMBL, Genbank and DDBJ but slightly different
Ensembl/FlyBase – give a gene-centric view. FlyBase often contains
additional literature references which may be of use.
HSSP – Swiss Homology Model. For proteins lacking a PDB entry, gives
most similar UniProtKB entry with a 3D structure.
IntAct – Molecular Interaction Database
IntAct is a freely available, open source database system and also
provides analysis tools for protein interaction data. All interactions are
derived from literature curation or direct user submissions and are freely
 Click on the hyperlinked Accession number in IntAct cross-reference
The protein has an interaction with annotated protein of known function –
further details of the interaction could tell you more about your protein.
? Which proteins does your sequence interact with?
IntAct stores an interaction in the context of the experimental method
which described this interaction. If an interaction has been described
several times using different methodologies, this will increase the “Number
of Interactions” and also our confidence in the veracity of the interaction.
 Click on ‘Interaction AC’ to see details of experimental methodology
We will return to IntAct later, but we will continue adding information to our
existing Drosophila sequence.
Return to UniProtKB entry
Searching UniProtKB
Our protein is a kinase and a single Y2H experiment suggests it may
interact with an ARF family protein.
?Are Arf family members regulated by phosphorylation?
Search in UniProtKB for Arf family members using the text search
 Press on the ‘Fields’ button, select Field=Taxonomy and type
‘Drosophila melanogaster’ into the Term box.
 Add and Search
 Add Field = cross-reference, Database= InterPro and ‘IPR006689’ into
the Term box (this is the ARF/SAR superfamily signature).
 Add and Search
You should end up with a set of approximately 21 proteins
 Click on “reviewed” entries to limit this to Swiss-Prot entries
 Open P61209 (ARF1_DROME) –
UniProtKB/Swiss-Prot entry.
fully annotated
? Scroll down to the feature table. Can you see any evidence that the
protein is regulated by phosphorylation? (Hint – this information would be
stored in the Sequence Annotation, if it were true).
Interaction data is generated by a number of techniques, all of which are
capable of generating False Positive data. It is possible that the single
Y2H data point suggesting that there is an interaction between an Arf
protein and a kinase may be incorrect.
 In the Swiss-Prot entry ARF1_DROME click on the link ‘Arf family’
under Comments – Sequence Similarity to find orthologous proteins in
other species.
 Open Q3SXY8 (AR13B_HUMAN). This entry has two isoforms. Look
what part of the molecule is missing in Isoform 2 and speculate on its
functional significance (Hint use the Sequence annotation (Features)
Use InterproScan to assign family membership, identify functional
domains etc.
 Paste the sequence given above into InterProScan
(www.ebi.ac.uk/InterProScan), add e-mail address and “Submit Job”
Looking at the results from InterProScan
InterPro is an integrated documentation resource for protein families,
domains and sites. InterPro combines a number of databases (referred to
as member databases) that use different methodologies and a varying
degree of biological information on well-characterised proteins to derive
protein signatures. By uniting the member databases, InterPro capitalises
on their individual strengths, producing a powerful integrated diagnostic
tool. InterPro unifies:
· PROSITE regular expressions and profiles
SUPERFAMILY hidden Markov models (HMMs)
· PRINTS, provider of fingerprints (groups of aligned, un-weighted motifs)
. PRODOM who use Clustr analysis to group sequences
Signatures describing the same protein family, domain repeat or site are
grouped into unique InterPro entries. Each combined InterPro entry has a
unique accession number, an abstract describing the features of proteins
associated with the entry and literature references and has links to the
relevant member database(s). All UniProtKB protein sequences that have
matches to a particular InterPro entry are listed in the Match Table
associated with that entry. There are also links to the InterPro graphical
views. The graphical views, which can be sorted by UniProtKB accession
number, structure or taxonomy, show the position of the signatures on the
protein, mousing over the signature brings up a pop-box, giving the
accession, name and position.
InterPro graphically represents the location of a protein domain and
information pertaining to the origin of that domain and the proteins that
contain it. Families are also defined and may contain several InterPro
domains which are often, but not always, in the same order. Through the
InterPro Domain Architecture view, the composition and order of the
different domains within a family are clearly displayed for easy
comparison, as well as for simple navigation between the entries for
individual domains.
InterPro and InterProScan are accessible for interactive use over the EBI
web server (www/ebi.ac.uk/interpro), they are distributed as stand-alone
copies by anonymous ftp.
InterPro entries are linked to one another through PARENT/CHILD and
CONTAINS/FOUND IN relationships. PARENT/CHILD relationships
indicate superfamily/family/subfamily relationships, as well as domain
hierarchies, where sequences can be subdivided into more specific subsets. CONTAINS/FOUND IN relationships apply to domains, repeats and
sites within families, and are used to describe the composition of protein
?What domains/sites does this protein contain?
 Tab through to raw output, check what positions these domains map to
on the sequence (note – these will differ slightly between different
 Click on IPR000719 to see what information you can gain about this
?What GO terms can you assign to your protein on the basis of InterPro
signature recognition?
 Choose one GO term and copy/paste the GO ID into the text search
box at the top of the InterPro entry page to list all the InterPro entries
associated with this GO term.
?Returning to the IPR000719 entry page, what relationships does this
entry have with other InterPro entries?
 At the top of the page are a number of different views possible for the
proteins within this entry. Type O15075 or DCLK1_HUMAN into the
InterPro search box (NOT the EBI-eye search), and scroll down to look at
the domains contained within its various isoforms (yellow background).
? Why are isoforms -3 and -4 significantly different from -1 and -2.
Looking back at DCLK1_HUMAN (O15075), this has a PDB structure
(green striped bar) for its doublecortin domain. Note the classification of
this domain in CATH (pink striped bar) and in SCOP (black striped bar).
The kinase domain has a structural homology domain predicted by
ModBase (yellow striped bar).
 Click on the hyperlink MB_O15075 to look at the predicted structure
Have a look at the structure of the doublecortin domain using
Astexviewer®, by clicking on the
symbol adjacent to the first CATH
domain ( Remaining in line view, scroll along the sequence
to residue 20 (Y - tyrosine) and clink on ‘Y’. This will zoom in and identify
the residue on the structure.
? Is this residue buried or on the external surface of the protein?
 Click on the adjacent residue, but this time clicking on the structure
?What residue is the tyrosine adjacent to?
Now zoom out again, and have a look at the structure using the ribbon
view by clicking on ‘cartoon’.
?What is the predominant topology of this protein, alpha helix or beta
sheet (note: you can depress your left mouse button whilst moving the
mouse to rotate the structure to any angle you want in order to get a better
view of the structure).
 Returning to the InterProScan page, click on IPR003533
?What additional information can you gain about this protein? Are isoforms
–3 and –4 of DCLK1_HUMAN likely to differ functionally from –1 and –2
(hint read the abstract).
Move down to the taxonomy wheel and open the other fruit fly
sequences which contain this domain.
? Is this domain only present in kinases?
Homology to other related proteins is another powerful tool for information
on a particular protein.
Further Reading
 The UniProt Consortium (2011) Ongoing and future developments
at the Universal Protein Resource.. Nucleic acids research, (39),
Leinonen R, Nardone F, Zhu W, Apweiler R “UniSave: the
Bioinformatics (2006) 22:1284-1285
Suzek BE, Huang H, McGarvey P, Mazumder R, Wu CH
“UniRef: comprehensive and non-redundant UniProt reference
clusters.” Bioinformatics (2007) 23:1282-1288
Hunter, S et al “InterPro: the integrative protein signature
database..” Nucleic Acids Res. (2009) 36:D211-215
McDowall J , Hunter S (2011) InterPro protein classification.
Methods Mol Biol 694:37-47
Aranda, B. et al “The IntAct molecular interaction database in
2010.” Nucleic Acids Res. (2010)38:D525-531