InterPro - European Bioinformatics Institute

Using InterPro for functional analysis of
protein sequences
Alex Mitchell
InterPro team
mitchell@ebi.ac.uk
EBI is an Outstation of the European Molecular Biology Laboratory.
Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and UniProtKB/Swiss-Prot over time. Y-axis: number of sequences (0 to 14,000,000); x-axis: date (5-Jan-04 to 5-Jan-10).]
• Given a set of uncharacterised sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
Pairwise alignment approaches (e.g., BLAST)
• Good at recognising similarity between closely related
sequences
• Perform less well at detecting divergent homologues
The protein signature approach
• Alternatively, we can model the conservation of amino acids at specific positions within a multiple sequence alignment, seeking ‘patterns’ across closely related proteins
• We can then use these models to infer relationships with previously characterised sequences
• This is the approach taken by protein signature databases
• They go about this in 3 different ways...
Protein signature methods
(patterns)
(fingerprints)
(profiles & HMMs)
Families
Domains
Sequence features
What are protein signatures?
[Diagram labels: Protein family/domain; Multiple sequence alignment; Build model; Search UniProt; Significant match; Mature model; Protein analysis]
Example aligned sequences:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
Diagnostic approaches (sequence-based)
• Single motif methods: Regex patterns (PROSITE)
• Multiple motif methods: Identity matrices (PRINTS)
• Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)
Patterns
• Sequence alignment: define the pattern (motif)
• Extract pattern sequences
• Build regular expression, e.g.
C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
• Result: pattern signature (PS00000)
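The "extract pattern sequences, then build a regular expression" step can be illustrated with a minimal sketch. It assumes ungapped motif sequences of equal length and a simple rule that is only an assumption for illustration (real PROSITE patterns are manually curated): a fully conserved column becomes a single residue, a column with a few residues becomes a bracketed set, and anything more variable becomes `x`. The function name, the `max_set` parameter and the example sequences are hypothetical.

```python
def pattern_from_alignment(seqs, max_set=4):
    """Derive a PROSITE-style pattern from equal-length, ungapped motif sequences."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("sequences must be aligned to the same length")

    elements = []
    for column in zip(*seqs):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                     # fully conserved position
        elif len(residues) <= max_set:
            elements.append("[" + "".join(residues) + "]")   # small set of allowed residues
        else:
            elements.append("x")                             # too variable: any residue

    # Collapse consecutive x positions into x(n).
    merged, run = [], 0
    for element in elements + [None]:                        # None flushes a trailing run
        if element == "x":
            run += 1
            continue
        if run:
            merged.append("x" if run == 1 else f"x({run})")
            run = 0
        if element is not None:
            merged.append(element)
    return "-".join(merged)


# Hypothetical aligned motif sequences, not taken from any database.
motif_seqs = ["CGSDGK", "CGLDGR", "CASDNK", "CGTDSK"]
print(pattern_from_alignment(motif_seqs, max_set=2))
```

Lowering `max_set` makes the pattern more permissive (more `x` positions); raising it makes the pattern more specific to the seed sequences.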
Patterns
Advantages
• A match can be anchored to the extremity of a sequence, e.g.
<M-R-[DE]-x(2,4)-[ALT]-{AM}
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines):
C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• Simple but less powerful
Patterns are mostly directed at functional residues: active sites, PTM sites, disulfide bridges, binding sites
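To make the pattern syntax concrete, here is a minimal sketch that translates the pattern elements used on these slides into a Python regular expression: `x` (any residue), `[...]` (allowed residues), `{...}` (forbidden residues), repeat counts such as `(3)` or `(2,4)`, and the `<` and `>` anchors. The function name is hypothetical, and rarer syntax (for example an anchor inside square brackets) is deliberately not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern string into a Python regular expression."""
    pattern = pattern.replace(" ", "").rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):              # anchor to the start of the sequence
            prefix, element = "^", element[1:]
        if element.endswith(">"):                # anchor to the end of the sequence
            suffix, element = "$", element[:-1]

        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)

        if core == "x":
            core = "."                           # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"       # forbidden residues
        # Single residues and [..] sets are already valid regex syntax.

        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


anchored = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
print(anchored)
# The test sequences below are made up for illustration.
print(bool(re.search(anchored, "MRDAAALK")))    # pattern starts at the first residue
print(bool(re.search(anchored, "AMRDAAALK")))   # same motif, but not at the start
```

Because of the `<` anchor, the second sequence is rejected even though it contains the motif one residue further in.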
Fingerprints
• Sequence alignment: define motifs (Motif 1, Motif 2, Motif 3)
• Extract motif sequences
• Build weight matrices, one per motif
• Matches must show the motifs in the correct order (1, 2, 3) and with the correct spacing
• Result: fingerprint signature (PR00000)
The significance of motif context
• Identify small conserved regions in proteins
• Several motifs together characterise a family
• Offer improved diagnostic reliability over single motifs by virtue of the biological context provided by motif neighbours (their order and the interval between them)
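The role of motif order and spacing can be sketched as follows. This is a toy illustration, not the PRINTS scoring scheme: each motif becomes an unweighted residue-count matrix, the best-scoring window per motif is located, and a sequence only matches if every motif scores well, the hits appear in the defined order, and the gaps between them stay within a limit. The function names and the `min_score` and `max_gap` thresholds are assumptions; the motif seeds are cut from the two example sequences shown earlier in the slides.

```python
from collections import Counter


def build_matrix(motif_seqs):
    """Per-column residue counts for one ungapped motif alignment."""
    return [Counter(column) for column in zip(*motif_seqs)]


def best_hit(matrix, seq):
    """Return (score, start) of the best-scoring window for one motif."""
    width = len(matrix)
    best = (-1, -1)
    for start in range(len(seq) - width + 1):
        score = sum(matrix[i][seq[start + i]] for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best


def match_fingerprint(motifs, seq, min_score, max_gap):
    """True if all motifs hit, in the given order, with acceptable spacing."""
    matrices = [build_matrix(m) for m in motifs]
    hits = [best_hit(mx, seq) for mx in matrices]
    if any(score < min_score for score, _ in hits):
        return False
    # Compare each hit with the next one: the gap must be non-negative
    # (correct order, no overlap) and no larger than max_gap.
    for (_, start_a), (_, start_b), mx in zip(hits, hits[1:], matrices):
        gap = start_b - (start_a + len(mx))
        if gap < 0 or gap > max_gap:
            return False
    return True


motifs = [["GPVCG", "SPVCG"], ["TYRNE", "TYANE"]]
print(match_fingerprint(motifs, "ITWKGPVCGLDGKTYRNECALL", min_score=7, max_gap=6))
print(match_fingerprint(motifs, "TYRNELDGKGPVCG", min_score=7, max_gap=6))  # motifs swapped
```

The second call contains both motifs but in the wrong order, so the fingerprint is not matched even though each individual motif scores well.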
Profiles & HMMs
• Sequence alignment: define coverage (entire domain or whole protein)
• Use entire alignment for domain or protein
• Build model, which also models insertions and deletions
• Result: profile or HMM signature
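As a rough illustration of the "use the entire alignment and build a model" idea, the sketch below turns an ungapped alignment into a position-specific log-odds profile and scores a sequence segment against it. It is a simplification under stated assumptions: a uniform background, a flat pseudocount, and no handling of insertions or deletions (which real profiles and HMMs do model). The function names are hypothetical; the two aligned sequences are the example pair shown earlier in the slides.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Log-odds score per column and residue, against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def score_segment(profile, segment):
    """Sum the column scores for a segment of the same length as the profile."""
    if len(segment) != len(profile):
        raise ValueError("segment must be as long as the profile")
    return sum(column[aa] for column, aa in zip(profile, segment))


alignment = ["ITWKGPVCGLDGKTYRNECALL", "AVPRSPVCGSDDVTYANECELK"]
profile = build_profile(alignment)
print(score_segment(profile, alignment[0]))   # a seed sequence scores high
print(score_segment(profile, "A" * 22))       # an unrelated segment scores low
```

Unlike a single short pattern, every column of the alignment contributes to the score, which is why full-domain models tend to detect more divergent homologues.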
PROSITE and HAMAP profiles:
a functional annotation perspective
• PROSITE domains: high quality manually curated seeds
(using biologically characterized UniProtKB/Swiss-Prot
entries), documentation and annotation rules. Oriented
toward functional domain discrimination.
• HAMAP families: manually curated bacterial, archaeal and
plastid protein families (represented by profiles and
associated rules), covering some highly conserved proteins
and functions.
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• Pfam: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• GENE3D: models correspond to CATH domains
Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual
strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of
uncharacterised proteins
– to provide concise information about the signatures and the
proteins they match, including consistent names, abstracts
(with links to original publications), GO terms and cross-references to other databases
Hidden Markov Models
Structural domains
Fingerprints
Profiles
Functional annotation of families/domains
InterPro
Patterns
Protein features (sites)
InterPro integration process
Member databases
InterPro
+ annotation
Protein
signatures
InterPro Entry
• Groups similar signatures together
• Adds extensive annotation
• Links to other databases
• Structural information and viewers
• Hierarchical classification
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other Families
Parent/Child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
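The two protein-hit rules (a child's matches are a subset of its parent's; siblings share no matches) amount to simple set comparisons. A minimal sketch, with hypothetical function names and placeholder protein identifiers:

```python
def is_valid_child(parent_hits, child_hits):
    """A child entry should only match proteins that its parent also matches."""
    return set(child_hits) <= set(parent_hits)


def siblings_disjoint(*sibling_hits):
    """Sibling entries should not have any matched protein in common."""
    seen = set()
    for hits in sibling_hits:
        hits = set(hits)
        if seen & hits:
            return False
        seen |= hits
    return True


# Placeholder identifiers, not real accessions.
parent = {"prot1", "prot2", "prot3", "prot4"}
child_a = {"prot1", "prot2"}
child_b = {"prot3"}

print(is_valid_child(parent, child_a) and is_valid_child(parent, child_b))
print(siblings_disjoint(child_a, child_b))
```

In practice such checks would only flag candidates; as the list above notes, existing member-database hierarchies and curator knowledge also inform the final relationships.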
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy
InterPro Entry: adds extensive annotation
The Gene Ontology project provides a
controlled vocabulary of terms for
describing gene product characteristics
InterPro Entry: links to other databases
UniProt
KEGG ... Reactome ... IntAct ...
UniProt taxonomy
PANDIT ... MEROPS ... Pfam clans ...
PubMed
InterPro Entry: structural information and viewers
PDB 3-D Structures
SCOP Structural domains
CATH Structural domain classification
Searching InterPro
Protein family membership
Domain organisation
Domains, repeats & sites
GO terms
InterProScan access
Interactive:
http://www.ebi.ac.uk/Tools/pfa/iprscan/
Web service (SOAP and REST):
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_rest
http://www.ebi.ac.uk/Tools/webservices/services/pfa/iprscan_soap
Downloadable:
ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/
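For the REST route, a programmatic submission might look roughly like the sketch below. Everything about the endpoint is an assumption rather than something stated on the slide: the base URL, the `run`, `status` and `result` paths, the `email` and `sequence` form fields, and the `tsv` result type follow the general layout of EBI job-dispatcher style services and should be checked against the current service documentation before use. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed base URL; it is NOT one of the URLs listed on the slide.
BASE_URL = "https://www.ebi.ac.uk/Tools/services/rest/iprscan5"


def build_run_request(email, sequence, base_url=BASE_URL):
    """Build (but do not send) a POST request that would submit one sequence."""
    data = urllib.parse.urlencode({"email": email, "sequence": sequence}).encode()
    return urllib.request.Request(base_url + "/run", data=data, method="POST")


def status_url(job_id, base_url=BASE_URL):
    """URL to poll for the state of a submitted job (assumed path layout)."""
    return f"{base_url}/status/{job_id}"


def result_url(job_id, result_type="tsv", base_url=BASE_URL):
    """URL to fetch a finished job's result in a given format (assumed path layout)."""
    return f"{base_url}/result/{job_id}/{result_type}"


def submit(email, sequence):
    """Send the request and return the job identifier from the response body."""
    with urllib.request.urlopen(build_run_request(email, sequence)) as response:
        return response.read().decode().strip()
```

A typical flow would call `submit(...)`, poll `status_url(job_id)` until the job is reported as finished, and then download `result_url(job_id)`. For large batches, the downloadable version is usually the more appropriate choice.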
Searching InterPro: BioMart
BioMart Search
BioMart allows more powerful and flexible queries
• Large volumes of data can be queried efficiently
• The interface is shared with many other bioinformatics resources
• It allows federation with other databases:
– PRIDE (mass spectrometry-derived proteins and peptides)
– Reactome (biological pathways)
BioMart Search
1) Choose Dataset
a. Choose InterPro BioMart
b. Choose InterPro entries or protein matches
2) Choose Filters
• Search specific entries, signatures or proteins, e.g. filter by specific proteins
3) Choose Attributes
• What results you want
4) Choose additional Dataset (optional)
• This is where you link results to PRIDE and Reactome
BioMart Search Results
User manual
Click to view results
HTML = web-formatted table
CSV = comma-separated values
TSV = tab-separated values
XLS = Excel spreadsheet
Download
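Downloaded TSV results are easy to post-process with standard tools. The sketch below groups proteins by entry using Python's `csv` module. The column names (`entry_id`, `protein_accession`) and the sample rows are hypothetical; the real header depends on the attributes chosen in step 3.

```python
import csv
import io
from collections import defaultdict


def proteins_per_entry(tsv_text, entry_col, protein_col):
    """Group protein identifiers by entry from tab-separated text with a header row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped = defaultdict(set)
    for row in reader:
        grouped[row[entry_col]].add(row[protein_col])
    return {entry: sorted(proteins) for entry, proteins in grouped.items()}


# Made-up sample data standing in for a downloaded result file.
sample = (
    "entry_id\tprotein_accession\n"
    "ENTRY_A\tPROT_1\n"
    "ENTRY_A\tPROT_2\n"
    "ENTRY_B\tPROT_1\n"
)
print(proteins_per_entry(sample, "entry_id", "protein_accession"))
```

For a file on disk, the same function can be fed the result of `open(path, newline="").read()`.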