Protein function and classification
www.ebi.ac.uk/interpro
Hsin-Yu Chang
www.ebi.ac.uk

Protein classification can help scientists gain information about protein functions.

Greider and Blackburn discovered telomerase in 1984 and were awarded the Nobel Prize in 2009. Which model organism did they use for this study?
1. Tetrahymena
2. Saccharomyces cerevisiae
3. Mouse
4. Human

1984 Discovery of telomerase, Greider and Blackburn
1989 Telomere hypothesis of cell senescence, Szostak
1995 Clone hTR
1995/1997 Clone hTERT
1997 Telomerase knockout mouse
1998 Ectopic expression of telomerase in normal human epithelial cells causes the extension of their lifespan
1999/2000… Telomerase/telomere dysfunctions and cancer

A single Tetrahymena cell has 40,000 telomeres, whereas a human cell has only 92.
Gilson and Ségal-Bendirdjian, Biochimie, 2010.

Therefore, classifying proteins into families and identifying protein homologues can help scientists gather more information about their favourite proteins.

However, in the lab, what do we usually do to analyse protein sequences and find out their functions?

How can we annotate ProteinA?

>ProteinA
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVEL
TCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLND
RADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEV
QLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRS
PRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEF
KIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGS
GELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQ
MGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEV
NLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLP
TWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRR
QAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

What I used to do:
• Protein BLAST
• Publications - textbooks or papers
• UniProt
• PDB
• Specialized protein databases such as SGD, the Human Protein Atlas, etc.

BLAST (Basic Local Alignment Search Tool): compares protein sequences to sequence databases and calculates the statistical significance of matches.
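To make "statistical significance" concrete: BLAST reports an expect value (E-value), the number of matches with at least a given score expected by chance. Under Karlin-Altschul statistics this is approximately E = m × n × 2^(-S'), where S' is the bit score, m the query length and n the database length. The sketch below only evaluates that formula; the function name and the example lengths are illustrative assumptions, and real BLAST uses effective (edge-corrected) lengths, so its reported values will differ.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Simplification: uses raw lengths, not BLAST's effective lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical query and database sizes, chosen only for illustration.
    query_len = 450
    db_len = 10_000_000
    for bit_score in (30, 50, 100):
        e = expect_value(bit_score, query_len, db_len)
        print(f"bit score {bit_score:>3}: E ~ {e:.3g}")
```

Each extra bit of score halves the E-value, which is why strong matches between close homologues stand out clearly while weak matches to distant homologues are hard to separate from chance.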
BLAST
Advantages:
• Relatively fast
• User friendly
• Very good at recognising similarity between closely related sequences
Drawbacks:
• Sometimes struggles with multi-domain proteins
• Less useful for weakly similar sequences (e.g., divergent homologues)

Using BLAST to find clues of protein functions - when it goes well
Pairwise alignment of two proteins: CD4 from two closely related species

Using BLAST to find clues of protein functions - when it does not give you much information
Because BLAST performs local pairwise alignment, it:
• Cannot encode the information found in a multiple sequence alignment that shows you conserved sites.

Using pairwise alignment could miss out on conserved residues
60S acidic ribosomal protein P0: multiple sequence alignment

An alternative approach: protein signature search
• Construction of a multiple sequence alignment (MSA) from characterised protein sequences.
• Modelling the pattern of conserved amino acids at specific positions within an MSA.
• Using these models to infer relationships with the characterised sequences.
• This is the approach taken by protein signature databases.

Three different protein signature approaches
Each is derived from a sequence alignment:
• Patterns: single motif methods
• Fingerprints: multiple motif methods
• Profiles & hidden Markov models (HMMs): full alignment methods

Protein databases that use signature approaches
[Figure: database logos grouped by method (hidden Markov models, fingerprints, profiles, patterns; HAMAP) and by focus (structural domains, functional annotation of families/domains, protein features (sites))]

Patterns
Patterns are usually directed against functional sequence features such as active sites, binding sites, etc.
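Patterns of this kind are commonly written in PROSITE syntax, which maps directly onto a regular expression. A minimal sketch, assuming the example pattern [AC]-x-V-x(4)-{ED} and the four motif sequences from the Patterns slide; the helper name prosite_to_regex is hypothetical, and only a subset of the syntax is handled (x, [..], {..} and repeat counts, with no N- or C-terminal anchors).

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (4) or (2,4).
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pieces = []
    for element in pattern.replace(" ", "").rstrip(".").split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":                      # any amino acid
            piece = "."
        elif residue.startswith("{"):           # any amino acid except these
            piece = "[^" + residue[1:-1] + "]"
        else:                                   # single residue or [allowed set]
            piece = residue
        if low is not None:                     # repeat count
            piece += "{" + low + ("," + high if high else "") + "}"
        pieces.append(piece)
    return "".join(pieces)


regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)  # [AC].V.{4}[^ED]

motifs = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
for motif in motifs:
    print(motif, bool(re.fullmatch(regex, motif)))
```

All four motif sequences match, while a sequence ending in E or D (for example ALVKLISE) does not, which illustrates both the strictness and the inflexibility of a pattern.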
[Figure: from sequence alignment to motif to pattern signature]
Pattern sequences:
ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS
Regular expression: [AC]-x-V-x(4)-{ED}
Pattern signature: PS00000

Patterns
Advantages:
• Strict - a pattern has very little variability and can produce highly accurate matches
Drawbacks:
• Simple but less flexible

Fingerprints
Fingerprints: a multiple motif approach
[Figure: sequence alignment; define motifs (Motif 1, Motif 2, Motif 3); motif sequences; weight matrices; fingerprint signature PR00000]

The significance of motif context
• Identify small conserved regions in proteins
• Several motifs characterise a family
[Figure: motifs 1, 2 and 3 with their order and the intervals between them]

Fingerprints
• Good at modelling the often small differences between closely related proteins
• Distinguish individual subfamilies within protein families, allowing functional characterisation of sequences at a high level of specificity

Profiles & HMMs
[Figure: sequence alignment of a whole protein or an entire domain; define coverage (use the entire alignment of the domain or protein family); build model (profile or HMM); profile or HMM signature]

Profiles
• Start with a multiple sequence alignment
• Amino acids at each position in the alignment are scored according to the frequency with which they occur
• Scores are weighted according to evolutionary distance using a BLOSUM matrix
• Good at identifying homologues

HMMs
• Start with a multiple sequence alignment
• Amino acid frequencies at each position in the alignment and their transition probabilities are encoded
• Insertions and deletions are also modelled
• Can model very divergent regions of alignment
• Very good at identifying evolutionarily distant homologues

The aim of InterPro
Protein sequences
• Family entry: description, proteins matched and more information.
• Domain entry: description, proteins matched and more information.
• Site entry: description, proteins matched and more information.

What is InterPro?
• InterPro is an integrated sequence analysis resource
• It combines predictive models (known as signatures) from different databases
• It provides functional analysis of protein sequences by classifying them into families and predicting domains and important sites

Facts about InterPro
• First release in 1999
• 11 partner databases
• Adds annotation to UniProtKB/TrEMBL
• Provides matches to over 80% of UniProtKB
• Source of >85 million Gene Ontology (GO) mappings to >24 million distinct UniProtKB sequences
• 50,000 unique visitors to the web site per month
• >2 million sequences searched online per month, plus offline searches with the downloadable version of the software

InterPro signature integration process
• Signatures are provided by member databases
• They are scanned against the UniProt database to see which sequences they match
• Curators manually inspect the matches before integrating the signatures into InterPro
• Signatures representing the same entity are integrated together
• Relationships between entries are traced, where possible
• Curators add literature-referenced abstracts, cross-references to other databases, and GO terms
[Figure: InterPro curators]

http://www.ebi.ac.uk/interpro/

How can we annotate ProteinA by using InterPro?
>ProteinA
MNRGVPFRHLLLVLQLALLPAATQGKKVVLGKKGDTVEL
TCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPSKLND
RADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEV
QLLVFGLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRS
PRGKNIQGGKTLSVSQLELQDSGTWTCTVLQNQKKVEF
KIDIVVLAFQKASSIVYKKEGEQVEFSFPLAFTVEKLTGS
GELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPKLQ
MGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEV
NLVVMRATQLQKNLTCEVWGPTSPKLMLSLKLENKEAK
VSKREKAVWVLNPEAGMWQCLLSDSGQVLLESNIKVLP
TWSTPVQPMALIVLGGVAGLLLFIGLGIFFCVRCRHRRR
QAERMSQIKRLLSEKKTCQCPHRFQKTCSPI

Search using protein sequences
[Screenshot: sequence search result; labels: Family, Type]

InterPro entry types
• Family: Proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure. Ex. Telomerase family.
• Domain: Distinct functional, structural or sequence units that may exist in a variety of biological contexts. Ex. DNA binding domain.
• Repeats: Short sequences typically repeated within a protein. Ex. Tubulin binding repeats in microtubule associated protein Tau.
• Sites: PTM, active site, binding site, conserved site. Ex. Phosphorylation sites, ion binding sites, tubulin conserved site.

[Screenshots: InterPro entry pages; labels: Type, Name, Identifier, Contributing signatures, Relationships, Description, References, GO terms]

InterPro family and domain relationships

Family relationships in InterPro:
Interleukin-15/Interleukin-21 family (IPR003443)
    Interleukin-15 (IPR020439)
        Interleukin-15 Avian (IPR020451)
        Interleukin-15 Fish (IPR020410)
        Interleukin-15 Mammal (IPR020466)
    Interleukin-21 (IPR028151)

InterPro relationships: domains
Protein kinase-like domain
    Protein kinase catalytic domain
        Serine/threonine kinase catalytic domain
        Tyrosine kinase catalytic domain

A brief diversion into the Gene Ontology...

Inconsistency in naming of biological concepts
English is not a very precise language
• Same name for different concepts
• Different names for the same concept
An example: Taction? Tactition? Tactile sense? All are "sensory perception of touch"; GO:0050975

Gene Ontology
• Unify the representation of gene and gene product attributes across species
• Allow cross-species and/or cross-database comparisons

The Gene Ontology
• A way to capture biological knowledge in a written and computable form
• A set of concepts and their relationships to each other arranged as a hierarchy
[Figure: GO hierarchy, from less specific concepts to more specific concepts; www.ebi.ac.uk/QuickGO]

The Concepts in GO
1. Molecular Function
• protein kinase activity
• insulin receptor activity
2. Biological Process
• cell cycle
• microtubule cytoskeleton organisation
3. Cellular Component
Example terms: GO:0006955 immune response; GO:0016020 membrane

Search using keywords

Summary
• Protein classification can help scientists gain information about protein functions.
• BLAST is fast and easy to use but has its drawbacks.
• Alternative approach: protein signature databases build models (protein signatures) by using different methods (patterns, fingerprints, profiles and HMMs).
• InterPro integrates these signatures from 11 member databases. It serves as a sequence analysis resource that classifies sequences into protein families and predicts important domains and sites.

Why use InterPro?
• Large amounts of manually curated data
    • 35,634 signatures integrated into 25,214 entries
    • Cites 38,877 PubMed publications
• Large coverage of protein sequence space
• Regularly updated
    • ~8 week release schedule
    • New signatures added
    • Scanned against latest version of UniProtKB

Caution
• InterPro is a predictive protein signature database - results are predictions, and should be treated as such
• InterPro entries are based on signatures supplied to us by our member databases... this means no signature, no entry!

And one more thing… We need your feedback!
• Missing/additional references
• Reporting problems
• Requests
EBI support page.
The InterPro Team:
Alex Mitchell, Craig McAnulla, Siew-Yit Yong, Amaia Sangrador, Hsin-Yu Chang, Sarah Hunter, Sebastien Pesseat, Gift Nuka, Matthew Fraser, Maxim Scheremetjew, Louise Daugherty

Database | Basis | Institution | Built from | Focus | URL
Pfam | HMM | Sanger Institute | Sequence alignment | Family & domain based on conserved sequence | http://pfam.sanger.ac.uk/
Gene3D | HMM | UCL | Structure alignment | Structural domain | http://gene3d.biochem.ucl.ac.uk/Gene3D/
Superfamily | HMM | Uni. of Bristol | Structure alignment | Evolutionary domain relationships | http://supfam.cs.bris.ac.uk/SUPERFAMILY/
SMART | HMM | EMBL Heidelberg | Sequence alignment | Functional domain annotation | http://smart.embl-heidelberg.de/
TIGRFAM | HMM | J. Craig Venter Inst. | Sequence alignment | Microbial functional family classification | http://www.jcvi.org/cms/research/projects/tigrfams/overview/
Panther | HMM | Uni. S. California | Sequence alignment | Family functional classification | http://www.pantherdb.org/
PIRSF | HMM | PIR, Georgetown, Washington D.C. | Sequence alignment | Functional classification | http://pir.georgetown.edu/pirwww/dbinfo/pirsf.shtml
PRINTS | Fingerprints | Uni. of Manchester | Sequence alignment | Family functional classification | http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
PROSITE | Patterns & profiles | SIB | Sequence alignment | Functional annotation | http://expasy.org/prosite/
HAMAP | Profiles | SIB | Sequence alignment | Microbial protein family classification | http://expasy.org/sprot/hamap/
ProDom | Sequence clustering | PRABI: Rhône-Alpes Bioinformatics Center | Sequence alignment | Conserved domain prediction | http://prodom.prabi.fr/prodom/current/html/home.php

Thank you!
www.ebi.ac.uk
Twitter: @emblebi
Facebook: EMBLEBI
YouTube: EMBLMedia

The BLOSUM (BLOcks SUbstitution Matrix) matrix is a substitution matrix used for sequence alignment of proteins. BLOSUM matrices are used to score alignments between evolutionarily divergent protein sequences.
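The Profiles slide describes scoring each alignment column by how often each amino acid occurs there, with BLOSUM-based weighting on top. A minimal sketch of the frequency part only, assuming a uniform background (1/20 per amino acid) and a simple pseudocount in place of BLOSUM weighting; the function names are hypothetical and the toy alignment reuses the four motif sequences from the Patterns slide.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    Assumptions: ungapped alignment, uniform background frequency,
    additive pseudocounts instead of BLOSUM-based weighting.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_sequence(profile, sequence):
    """Sum the column scores for a sequence of the same length as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


alignment = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]
profile = build_profile(alignment)
for candidate in ("ALVKLISG", "CHVESISC", "WWWWWWWW"):
    print(candidate, round(score_sequence(profile, candidate), 2))
```

Unlike a pattern, a profile does not reject a sequence outright for one mismatch: each position contributes a graded score, so a candidate that resembles the family at most positions still scores well.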