interpro_slides_riga

Introduction to InterPro
Amaia Sangrador
InterPro curator
amaia@ebi.ac.uk
EBI is an Outstation of the European Molecular Biology Laboratory.
What is InterPro?
Diagnostic resource:
• InterPro uses signatures from several different databases (referred to as member databases) to predict information about proteins
• Provides functional analysis of proteins by classifying them into families and predicting domains and important sites
• Adds information about the signatures and the types of proteins they match
InterPro Consortium
Consortium of 11 major
signature databases
Why do we need predictive annotation tools?
[Figure: number of sequences in UniProtKB and UniProtKB/Swiss-Prot over time. x-axis: Date (5-Jan-04 to 5-Jan-10); y-axis: Number of sequences (0 to 14,000,000)]
What is UniProt?
Based on the original work on PIR, Swiss-Prot and TrEMBL
Collaboration between EBI, SIB and PIR
The mission of UniProt is to provide the scientific community with a
comprehensive, high-quality and freely accessible resource of
protein sequence and functional information.
UniProtKB: Protein knowledgebase
– UniProtKB/Swiss-Prot: Reviewed; high-quality manual annotation
– UniProtKB/TrEMBL: Unreviewed; automatic annotation
UniRef: Sequence clusters
– UniRef100
– UniRef90
– UniRef50
UniMES: Metagenomic and environmental sample sequences
UniParc - Sequence archive: Current and obsolete sequences
EMBL/GenBank/DDBJ, Ensembl, RefSeq, PDB, other resources
Annotation using InterPro
TrEMBL: uncharacterised sequence
CGCGCCTGTACGC
TGAACGCTCGTGA
CGTGTAGTGCGCG
Swiss-Prot: manually annotated sequence
Automatic annotation pipeline
InterPro: protein signatures
Groups of related proteins (same family or share domains)
Protein family classification
• Given a set of sequences, we usually want to know:
– what are these proteins; to what family do they belong?
– what is their function; how can we explain this in structural terms?
Protein family classification:
BLAST (pairwise comparisons)
Limitations with Pairwise comparisons
• BLAST alignment of 2 proteins:
• 60S acidic ribosomal protein P0 from 2 species
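A pairwise comparison reduces the relationship between two proteins to a single score such as percent identity. The following is a minimal Python sketch of that calculation, assuming two already-aligned sequences of equal length with "-" marking gaps. The function name and the two fragments are hypothetical and are not the ribosomal protein P0 sequences from the slide.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) columns where both residues are identical."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns that contain a gap
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Hypothetical aligned fragments, for illustration only
seq_a = "MKV-LTAEG"
seq_b = "MRVALT-DG"
identity = percent_identity(seq_a, seq_b)
print(f"{identity:.1f}% identity")
```

A single number like this does not say which positions are conserved across a whole family, which is one reason a pairwise score alone can be hard to interpret.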
Protein family classification:
signature databases
• Alternatively, we can seek ‘patterns’ that will allow us to infer relationships with previously-characterised sequences
• This is the approach taken by ‘signature’ databases
Protein signatures
• More sensitive homology searches
• Each member database creates signatures using different methods and methodologies:
– manually-created sequence alignments
– automatic processes with some human input and correction
– entirely automatic processes
What are protein signatures?
Protein family/domain
Multiple sequence alignment
Build model
Search
it.
Protein analysis
Significant
match
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
Mature
model
UniProt
Member databases
METHODS
Hidden Markov Models
Structural Domains
FingerPrints
Profiles
Patterns
Protein features
(active sites…)
Functional annotation of families/domains
Sequence
Clusters
Prediction of
conserved
domains
Diagnostic approaches (sequence-based)
• Single motif methods: Regex patterns (PROSITE)
• Full domain alignment methods: Profiles (Profile Library), HMMs (Pfam)
• Multiple motif methods: Identity matrices (PRINTS)
Patterns
Sequence alignment → Motif → Define pattern → Extract pattern sequences → Build regular expression
C-C-{P}-x(2)-C-[STDNEKPI]-x(3)-[LIVMFS]-x(3)-C
Pattern signature PS00000
Patterns
Advantages
• A match can be anchored to the extremity of a sequence, e.g.
<M-R-[DE]-x(2,4)-[ALT]-{AM}
• Some amino acids can be forbidden at specific positions, which can help to distinguish closely related subfamilies
• Short motif handling: a pattern with very little variability and with forbidden positions can produce significant matches, e.g. conotoxins (very short toxins with few conserved cysteines):
C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C
Drawbacks
• Simple but less powerful
Patterns are mostly directed against functional residues: active sites, PTM, disulfide bridges, binding sites
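The pattern syntax used above maps closely onto ordinary regular expressions. The following Python sketch shows one possible translation, assuming only the elements that appear on these slides: single residues, `[...]` allowed sets, `{...}` forbidden sets, `x` wildcards, `(n)` or `(n,m)` repeats, and `<` / `>` anchors. The function name `prosite_to_regex` is hypothetical and this is not the PROSITE scanning software itself.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Simplified sketch: handles residues, [..], {..}, x, (n), (n,m), < and >.
    """
    pattern = pattern.strip().rstrip(".")
    anchored_start = pattern.startswith("<")
    anchored_end = pattern.endswith(">")
    pattern = pattern.strip("<>")
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Split off an optional repeat suffix such as (2) or (2,4)
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            regex = "."                        # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"    # forbidden residues
        else:
            regex = core                       # residue or [allowed set]
        if repeat:
            regex += "{" + repeat + "}"
        parts.append(regex)
    body = "".join(parts)
    return ("^" if anchored_start else "") + body + ("$" if anchored_end else "")


anchored = prosite_to_regex("<M-R-[DE]-x(2,4)-[ALT]-{AM}")
conotoxin = prosite_to_regex("C-{C}(6)-C-{C}(5)-C-C-x(1,3)-C-C-x(2,4)-C-x(3,10)-C")
print(anchored)
print(conotoxin)
```

The `<` anchor becomes `^`, so the first pattern can only match at the start of a sequence, and each `{...}` group becomes a negated character class.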
Prosite patterns
Pattern/motif in sequence → regular expression
EXAMPLE: PS00296; Chaperonins cpn60 signature (PATTERN)
A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
>sp|P29197|CH60A_ARATH Chaperonin CPN60, mitochondrial OS=Arabidopsis thaliana
MYRFASNLASKARIAQNARQVSSRMSWSRNYAAKEIKFGVEARALMLKGVEDLADAVKVT
MGPKGRNVVIEQSWGAPKVTKDGVTVAKSIEFKDKIKNVGASLVKQVANATNDVAGDGTT
CATVLTRAIFAEGCKSVAAGMNAMDLRRGISMAVDAVVTNLKSKARMISTSEEIAQVGTISA
NGEREIGELIAKAMEKVGKEGVITIQDGKTLFNELEVVEGMKLDRGYTSPYFITNQKTQKCE
LDDPLILIHEKKISSINSIVKVLELALKRQRPLLIVSEDVESDALATLILNKLRAGIKVCAIKAPGF
GENRKANLQDLAALTGGEVITDELGMNLEKVDLSMLGTCKKVTVSKDDTVILDGAGDKKGI
EERCEQIRSAIELSTSDYDKEKLQERLAKLSGGVAVLKIGGASEAEVGEKKDRVTDALNATK
AAVEEGILPGGGVALLYAARELEKLPTANFDQKIGVQIIQNALKTPVYTIASNAGVEGA
VIVGKLLEQDNPDLGYDAAKGEYVDMVKAGIIDPLKVIRTALVDAASVSSLLTTTEAVVVDLP
KDESESGAAGAGMGGMGGMDY
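To see where the PS00296 pattern hits this protein, the pattern can be written by hand as a regular expression and run over the sequence. The sketch below uses only a short fragment of the CH60A_ARATH sequence shown above, so the reported positions are relative to that fragment and not to the full protein. The variable names are illustrative.

```python
import re

# PS00296 written by hand as a regular expression:
# A-[AS]-{L}-[DEQ]-E-{A}-{Q}-{R}-x-G(2)-[GA]
CPN60_REGEX = re.compile(r"A[AS][^L][DEQ]E[^A][^Q][^R].G{2}[GA]")

# Fragment of the CH60A_ARATH sequence (not the full protein)
fragment = "DALNATKAAVEEGILPGGGVALLYAARELEK"

# Report 1-based, inclusive coordinates within the fragment
hits = [(m.start() + 1, m.end(), m.group()) for m in CPN60_REGEX.finditer(fragment)]
for start, end, matched in hits:
    print(f"{start}-{end}: {matched}")
```

The single hit, `AAVEEGILPGGG`, shows how the forbidden positions (`{L}`, `{A}`, `{Q}`, `{R}`) accept any residue except the one listed.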
Fingerprints
Sequence alignment → Define motifs (Motif 1, Motif 2, Motif 3) → Extract motif sequences → Weight matrices → Fingerprint signature PR00000
Correct order (1, 2, 3) and correct spacing
The significance of motif context
• Identify small conserved regions in proteins
• Several motifs → characterise family
• Offer improved diagnostic reliability over single motifs by virtue of the
biological context provided by motif neighbours
order
interval
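The idea of motif context (order and interval) can be sketched as a simple check on where each motif was found. This is a simplified Python illustration, not the scoring scheme PRINTS actually uses. The function name, the motif lengths and the allowed gap ranges are all hypothetical.

```python
from typing import Sequence, Tuple


def motifs_in_context(
    starts: Sequence[int],
    lengths: Sequence[int],
    gap_ranges: Sequence[Tuple[int, int]],
) -> bool:
    """Return True if the motifs occur in order with the allowed spacing.

    starts[i]     : start position of the hit for motif i+1
    lengths[i]    : length of motif i+1
    gap_ranges[i] : (min, max) residues allowed between motif i+1 and motif i+2
    """
    if len(starts) != len(lengths) or len(gap_ranges) != len(starts) - 1:
        raise ValueError("inconsistent fingerprint description")
    for i, (min_gap, max_gap) in enumerate(gap_ranges):
        gap = starts[i + 1] - (starts[i] + lengths[i])
        # A negative gap means the motifs are out of order or overlap
        if gap < 0 or not (min_gap <= gap <= max_gap):
            return False
    return True


# Hypothetical three-motif fingerprint
lengths = [12, 15, 10]
gap_ranges = [(10, 30), (15, 40)]
print(motifs_in_context([10, 40, 75], lengths, gap_ranges))    # order and spacing fit
print(motifs_in_context([40, 10, 75], lengths, gap_ranges))    # wrong order
print(motifs_in_context([10, 100, 135], lengths, gap_ranges))  # wrong spacing
```

A match to one motif alone carries little weight; requiring its neighbours to appear in the expected order and interval is what gives the fingerprint its diagnostic reliability.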
PRINTS families are hierarchical
Different motifs describe subfamilies
G protein-coupled receptors
rhodopsin-like
secretin-like
cAMP receptors
adenosine receptors
metabotropic glutamate receptors
etc
opsin receptors
dopamine receptors
somatostatin receptors
histamine receptors
etc
somatostatin receptor type 1
somatostatin receptor type 2
somatostatin receptor type 3
etc
Profiles & HMMs
Sequence alignment → Define coverage (whole protein or entire domain) → Use entire alignment for domain or protein → Build model → Profile or HMM signature
Models insertions and deletions
Profiles
Built using weight matrices
Hidden Markov Models (HMM)
More sophisticated algorithm
Models insertions and deletions
More flexible (can use partial alignments)
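A weight matrix of the kind a profile is built from can be sketched in a few lines. The Python example below assumes an ungapped alignment, a uniform background frequency and a simple pseudocount. These are simplifications chosen for illustration and are not the procedure used by any member database. The alignment itself is hypothetical. Insertions and deletions, which HMMs model, are not handled here.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_weight_matrix(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds score per residue and column, against a uniform background."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment)
        matrix.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix: List[Dict[str, float]], segment: str) -> float:
    """Sum the column scores for a segment of the same length as the matrix."""
    return sum(column[aa] for column, aa in zip(matrix, segment))


# Hypothetical ungapped alignment of a short conserved region
alignment = ["PVCGLDG", "PVCGSDD", "PICGTDG"]
matrix = build_weight_matrix(alignment)
print(round(score(matrix, "PVCGLDG"), 2), round(score(matrix, "AAAAAAA"), 2))
```

Residues seen often in a column score above zero and unseen residues score below zero, so a segment resembling the alignment scores higher than an unrelated one.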
PROSITE and HAMAP profiles:
a functional annotation perspective
• PROSITE domains: high quality manually curated seeds
(using biologically characterized UniProtKB/Swiss-Prot
entries), documentation and annotation rules. Oriented
toward functional domain discrimination.
• HAMAP families: manually curated bacterial, archaeal and
plastid protein families (represented by profiles and
associated rules), covering some highly conserved proteins
and functions.
HMM databases
Sequence-based
• PIR SUPERFAMILY: families/subfamilies reflect the evolutionary relationship
• PANTHER: families/subfamilies model the divergence of specific functions
• TIGRFAM: microbial functional family classification
• Pfam: families & domains based on conserved sequence
• SMART: functional domain annotation
Structure-based
• SUPERFAMILY: models correspond to SCOP domains
• GENE3D: models correspond to CATH domains
Why we created InterPro
By uniting the member databases, InterPro capitalises on their individual
strengths, producing a powerful diagnostic tool & integrated database
– to simplify & rationalise protein analysis
– to facilitate automatic functional annotation of
uncharacterised proteins
– to provide concise information about the signatures and the proteins they match, including consistent names, abstracts (with links to original publications), GO terms and cross-references to other databases
InterPro entry
The InterPro entry: types
Family
Proteins share a common evolutionary origin, as reflected in their
related functions, sequences or structure
Domain
Distinct functional, structural or sequence units that may exist in a
variety of biological contexts
Repeats
Short sequences typically repeated within a protein
Sites
PTM, Active Site, Binding Site, Conserved Site
InterPro Entry
Groups similar signatures together
Adds extensive annotation
Links to other databases
Structural information and viewers
Quality control
Removes redundancy
Hierarchical classification
InterPro hierarchies: Families
FAMILIES can have parent/child relationships with other Families
Parent/Child relationships are based on:
• Comparison of protein hits
– child should be a subset of parent
– siblings should not have matches in common
• Existing hierarchies in member databases
• Biological knowledge of curators
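The two protein-hit rules above (child is a subset of parent, siblings share no matches) translate directly into set operations. The sketch below is a minimal Python illustration. The function names, entry names and accessions are hypothetical placeholders, and in practice the other criteria (member database hierarchies, curator knowledge) also apply.

```python
from typing import Dict, Set


def is_valid_child(parent_hits: Set[str], child_hits: Set[str]) -> bool:
    """A child entry should match a subset of the proteins its parent matches."""
    return child_hits <= parent_hits


def siblings_disjoint(siblings: Dict[str, Set[str]]) -> bool:
    """Sibling entries should not have any matched proteins in common."""
    seen: Set[str] = set()
    for hits in siblings.values():
        if seen & hits:
            return False
        seen |= hits
    return True


# Hypothetical entries and protein accessions, for illustration only
parent = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}
children = {
    "child_1": {"PROT_A", "PROT_B"},
    "child_2": {"PROT_C"},
}
print(all(is_valid_child(parent, hits) for hits in children.values()))
print(siblings_disjoint(children))
```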
InterPro hierarchies: Domains
DOMAINS can have parent/child relationships with other domains
Domains and Families may be linked through Domain Organisation Hierarchy
Adds extensive annotation
The Gene Ontology project provides a
controlled vocabulary of terms for
describing gene product characteristics
Links to other databases
UniProt
KEGG ... Reactome ... IntAct ...
UniProt taxonomy
PANDIT ... MEROPS ... Pfam clans ...
PubMed
Structural information and viewers
PDB: 3-D structures
SCOP: structural domains
CATH: structural domain classification
Understanding signatures:
Non-overlapping signatures can be describing the same thing
Not always possible to use signature overlap to determine how family signatures are
related
e.g. High molecular weight glutenins
PF03157: 336 protein hits
PR00210: 331 protein hits
Two very different signatures both describing the same thing!
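When two signatures do not overlap along the sequence, one way to judge whether they describe the same thing is to compare the sets of proteins they hit. The Python sketch below computes the shared count and a Jaccard index for two hit sets. The function name and the accessions are hypothetical placeholders, not the real PF03157 or PR00210 match lists.

```python
from typing import Set, Tuple


def compare_hits(hits_a: Set[str], hits_b: Set[str]) -> Tuple[int, float]:
    """Return (number of shared proteins, Jaccard index) for two hit sets."""
    shared = len(hits_a & hits_b)
    union = len(hits_a | hits_b)
    jaccard = shared / union if union else 0.0
    return shared, jaccard


# Hypothetical hit sets for two signatures
signature_1 = {"PROT_A", "PROT_B", "PROT_C", "PROT_D"}
signature_2 = {"PROT_A", "PROT_B", "PROT_C"}
shared, jaccard = compare_hits(signature_1, signature_2)
print(shared, jaccard)
```

A Jaccard index close to 1 suggests the two signatures pick out essentially the same protein family even though they cover different parts of the sequence.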
Some signatures give us similar, but complementary information
PFAM shows domain is
composed of two types of
repeated sequence motifs
SUPERFAMILY shows the
potential domain
boundaries
www.ebi.ac.uk/interpro
Discontinuous Signatures Require Interpretation
1) Signature method
• e.g. PRINTS – discrete motifs
2) Duplicated domains
• e.g. SSF - duplication consisting of 2 domains with same fold
3) Repeated elements
• e.g. Kringle, WD40
4) Non-contiguous domains
• Structural domains can consist of non-contiguous sequence
Searching InterPro:
WHEN TO USE INTERPRO
Use InterPro to predict family, domain or active site information for a
given protein or amino acid sequence.
You can search InterPro if you have
• a protein sequence
• a UniProtKB protein identifier
• a Gene Ontology term
• a protein structure code
• a general search term (keyword or short phrase)
and require further information regarding your protein of interest.
Search tools include:
• Text Search
• InterProScan (sequence search)
• BioMart (builds queries)
http://www.ebi.ac.uk/interpro/
Beta version: http://wwwdev.ebi.ac.uk/interpro/
InterPro Search
Search using:
• text
• protein ID
• InterPro ID
• GO term
ID: GO:0006915
Name: apoptosis
InterPro Search
Search results for GO:0006915 (apoptosis)
InterPro Search
protein ID
InterPro Search Results
Family
Link to PDBe
Domains and sites
Unintegrated signatures
Structural data
Structural information
CATH and SCOP divide PDB structures into domains
Note that one domain is discontiguous
Swiss-Model and ModBase can predict structure for regions not covered by PDB
Searching InterPro:
InterProScan
InterProScan – Searching New Sequence
Additional options
Paste in unknown
sequence
InterProScan New Search Results
Link to InterPro entry
Links to signature databases
Searching InterPro:
BioMart
BioMart Search
BioMart allows more powerful and flexible queries
• Large volumes of data can be queried efficiently
• The interface is shared with many other bioinformatics resources
• It allows federation with other databases
– PRIDE (mass spectrometry-derived proteins and peptides)
– REACTOME (biological pathways)
BioMart Search
1) Choose Dataset
a. Choose InterPro BioMart
b. Choose InterPro entries or protein matches
2) Choose Filters
– Search specific entries, signatures or proteins
– e.g. Filter by specific proteins
3) Choose Attributes
– What results you want
4) Choose additional Dataset (optional)
– This is where you link results to PRIDE and Reactome
BioMart Search Results
User manual
Click to view results
HTML = web-formatted table
CSV = comma-separated values
TSV = tab-separated values
XLS = Excel spreadsheet
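Results exported as TSV can be processed with a few lines of Python. The sketch below groups protein accessions by entry using the standard `csv` module. The column names (`entry_id`, `protein_accession`) and the rows are hypothetical. A real export will carry whatever attributes were chosen in step 3, so the header names would need adjusting.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List


def proteins_by_entry(tsv_text: str) -> Dict[str, List[str]]:
    """Group protein accessions by entry ID from tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in reader:
        grouped[row["entry_id"]].append(row["protein_accession"])
    return dict(grouped)


# Hypothetical export with placeholder identifiers
example_tsv = (
    "entry_id\tprotein_accession\n"
    "ENTRY_1\tPROT_A\n"
    "ENTRY_1\tPROT_B\n"
    "ENTRY_2\tPROT_C\n"
)
grouped = proteins_by_entry(example_tsv)
for entry_id, accessions in grouped.items():
    print(entry_id, len(accessions))
```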
InterPro – the numbers
Our member databases all have their particular niche or focus...
...but InterPro is a combination of all their areas of expertise!
• InterPro 32.0:
21516 entries
101175 signatures covering 85.5% of UniProtKB
• Frequent releases – both protein and method updates
• 45 000 unique visitors per month
• The database has grown almost 10-fold in ~11 years
Caveats
InterPro is a predictive protein signature database. Small changes with a large impact may not be well represented.
• for example, inactive peptidases, such as Q8N3Z0, Q9W3H0
InterPro entries are based on signatures supplied to us by our member databases
• ...this means no signature, no entry!
We need your feedback!
missing/additional references
reporting problems
requests
EBI support page.
Acknowledgements
InterPro Team:
Alex Mitchell
David Lonsdale
Craig McAnulla
Siew-Yit Yong
Phil Jones
Anthony Quinn
Sebastien Pesseat
Matthew Fraser
Maxim Scheremetjew
Christopher Hunter
Prudence Mutowo
Amaia Sangrador
Sarah Hunter