Lecture - MisPred

Prediction of protein function
Lars Juhl Jensen
EMBL Heidelberg
Overview
• Part 1
– Homology-based transfer of annotation
– Function prediction from protein domains
• Part 2
– Prediction of functional motifs from sequence
– Feature-based prediction of protein function
• Part 3
– Prediction of functional interaction networks
Why do we need to predict function?
What do we mean by function?
• The concept “function” is not clearly defined
– A structural biologist, a cell biologist, and a medical
doctor will have very different views
• Many levels of granularity
– For the overall definition of “function”, the knowledge
and description can be more or less specific
• Functional categories are somewhat artificial
– People like to put things in boxes …
Descriptions of protein function
• Controlled vocabularies
– Gene Ontology
– SwissProt keywords
– KEGG pathways
– EcoCyc pathways
• Interaction networks
• More accurate data models
– Reactome
– Systems Biology Markup Language (SBML)
Molecular function
• Molecular function describes activities, such as
catalytic or binding activities, at the molecular
level
• GO molecular function terms represent activities
rather than the entities that perform the actions,
and do not specify where or when, or in what
context, the action takes place
• Examples of broad functional terms are catalytic
activity or transporter activity; an example of a
narrower term is adenylate cyclase activity
Biological process
• A biological process is a series of events
accomplished by one or more ordered assemblies
of molecular functions
• An example of a broad GO biological process
term is signal transduction; examples of more
specific terms are pyrimidine metabolism or alpha-glucoside transport
• It can be difficult to distinguish between a
biological process and a molecular function
Cellular component
• A cellular component is just that, a component of a
cell that is part of some larger object
• It may be an anatomical structure (for example,
the rough endoplasmic reticulum or the nucleus)
or a gene product group (for example, the
ribosome, the proteasome or a protein dimer)
• The cellular component categories are probably
the best defined categories since they correspond
to actual entities
Homology-based
transfer of annotation
Detection of homologs
• Pairwise sequence similarity searches
– BLAST (fastest)
– FASTA
– Full Smith-Waterman (most sensitive)
• Profile-based similarity searches
– PSI-BLAST
– Hidden Markov Models (HMMs)
• Sequence similarity should always be evaluated at
the protein level
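As a rough illustration of what a full Smith-Waterman search computes, the sketch below scores the best local alignment between two protein sequences. The match, mismatch, and gap values are placeholder assumptions chosen for readability; a real search would use a substitution matrix such as BLOSUM62 and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Linear gap penalty and flat match/mismatch scores are illustrative
    assumptions, not the settings of any particular search tool.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Because every cell of the dynamic programming matrix is filled, this is the most sensitive but also the slowest of the pairwise methods listed above; BLAST and FASTA trade some sensitivity for speed by using heuristics.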
Sequence similarity, sequence
homology, and functional homology
• Sequence similarity means that the sequences
are similar – no more, no less
• Sequence homology implies that the proteins are
encoded by genes that share a common ancestry
• Functional homology means that two proteins
from two organisms have the same function
• Sequence similarity or sequence homology does
not guarantee functional homology
Orthologs vs. paralogs
Functional consequences
of gene duplication
• Neofunctionalization
– One copy has retained the ancestral function and can
be treated as a 1–to–1 ortholog (functional homolog)
– The other copy has changed its function and behaves
much like a paralog
• Subfunctionalization
– Each copy has taken on a part of the ancestral function
– A functional homolog cannot be defined
– Each ortholog typically has the same molecular function
in a different sub-process or location
1–to–1 orthology
• A single gene in one organism corresponds to a
single gene in another organism
• These can generally be assumed to encode
functionally equivalent proteins
– Same molecular function
– Same biological process
– Same localization
• 1–to–1 orthology is fairly common in prokaryotes
and among very closely related organisms
1–to–many orthology
• A single gene in one organism corresponds to
multiple genes in another organism
• Any mixture of neo- and sub-functionalizations
can have occurred
– Typically same molecular function
– Often different biological process or sub-process
– Often different sub-cellular localization or tissue
• 1–to–many orthology is very common between
simple model organisms and higher eukaryotes
Many–to–many orthology
• Many genes in each organism have arisen from a
single gene in their last common ancestor
• Different neo- and sub-functionalizations have
likely taken place in each lineage
– Typically same molecular function
– Often different biological process or sub-process
– Often different sub-cellular localization or tissue
• Many–to–many orthology is common between
higher eukaryotes that are distantly related
Detection of orthologs
• Reconstruction of phylogenetic trees
– The theoretically most correct way
– Works for analyzing particular genes of interest
• Methods based on reciprocal matches
– What currently works at the genomic scale
• Manual curation
– Detection of very remote orthologs may require that
knowledge on gene synteny and/or protein function is
taken into account
Construction of gene trees
• Identify the relevant proteins
– Sequence similarity and possibly additional information
• Construct a blocked multiple sequence alignment
– Use, for example, Muscle and Gblocks
• Reconstruct the most likely phylogenetic tree
– Use, for example, PhyML
• Orthologs and paralogs can be trivially extracted
based on a gene tree
Reciprocal matches
• Simple “best reciprocal match” is a bad choice
– Can only deal with one-to-one orthology
• Detection of in-paralogs
– Similarity higher within species than between species
• Orthologs can now be detected based on best
reciprocal matches between in-paralogous groups
• One or more out-group organisms can optionally
be used to improve the definition of orthologs
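The two steps above can be sketched as follows, assuming that all-against-all similarity scores are already available as nested dictionaries. The function names and the data layout are hypothetical, and the in-paralog rule is a simplified reading of the bullet above rather than the exact procedure of any published tool.

```python
def reciprocal_best_hits(scores_ab, scores_ba):
    """Return (a, b) pairs that are each other's best-scoring match.

    scores_ab[a][b] is the similarity of protein a (species A) against
    protein b (species B); scores_ba holds the reverse searches.
    """
    best_ab = {a: max(hits, key=hits.get) for a, hits in scores_ab.items() if hits}
    best_ba = {b: max(hits, key=hits.get) for b, hits in scores_ba.items() if hits}
    return [(a, b) for a, b in best_ab.items() if best_ba.get(b) == a]


def in_paralogs(seed, ortholog_score, within_scores):
    """Same-species proteins that are more similar to `seed` than `seed`
    is to its ortholog in the other species."""
    hits = within_scores.get(seed, {})
    return sorted(p for p, s in hits.items() if s > ortholog_score)
```

Used on its own, `reciprocal_best_hits` only recovers one-to-one relationships; grouping each seed with its in-paralogs first is what allows one-to-many and many-to-many orthology to be represented.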
Orthologous groups
• Orthologs and paralogs are in principle always
defined with respect to two organisms
• Orthologous groups instead try to encompass an
entire set of organisms
• The “inclusiveness” of the orthologous groups
depends on how broad a set of organisms the
groups cover
Definition of orthologous groups
COGs, KOGs, and NOGs
• The COGs and KOGs were manually curated
– These were automatically expanded to more species
• Tri-clustering
– Detection of in-paralogs
– Identification of triangles of best reciprocal matches
– Merging of triangles that share an edge
• Broad phylogenetic coverage
– COGs and NOGs cover all three domains of life
– KOGs cover all eukaryotes
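The triangle steps of tri-clustering can be sketched as below. This is a minimal illustration that assumes the input edges are best reciprocal matches between genes from different species and that in-paralogs have already been collapsed; it does not check that each triangle spans three species, and the function name is hypothetical.

```python
from itertools import combinations


def merge_triangles(edges):
    """Group genes by merging triangles of reciprocal matches that share an edge."""
    edges = list(edges)
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    # A triangle is an edge plus any gene adjacent to both of its ends.
    triangles = set()
    for u, v in edges:
        for w in neighbours[u] & neighbours[v]:
            triangles.add(frozenset((u, v, w)))
    triangles = list(triangles)

    # Union-find over triangles; two triangles sharing two genes share an edge.
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())
```

Genes connected only by a single reciprocal match, with no third gene closing the triangle, are left out of every group in this sketch.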
Clustering based on similarity
• All-against-all sequence similarity is calculated
• A standard clustering method is applied to define
groups of homologous genes
– TribeMCL
– Hierarchical clustering
• These methods generally detect groups of
homologous genes, but are not good for
distinguishing between orthologs and paralogs
Meta-servers
• Since numerous methods exist for identifying
groups of orthologous proteins, meta-servers have
begun to emerge
• These can be very useful for “fishing expeditions”
where one is looking for a remote ortholog of a
particular protein of interest
• However, such meta-servers do not attempt to
unify the different orthologous groups and are thus
not useful for genome-wide studies
Function prediction
from protein domains
When homology searches fail
• Sometimes no orthologs or even paralogs can be
identified by sequence similarity searches, or they
are all of unknown function
• No functional information can thus be transferred
based on simple sequence homology
• By instead analyzing the various parts that make
up the complete protein, it is nonetheless often
possible to predict the protein function
Protein domains
• Many eukaryotic proteins consist of multiple
globular domains that can fold independently
• These domains have been mixed and matched
through evolution
• Each type of domain contributes towards the
molecular function of the complete protein
• Numerous resources are able to identify such
domains from sequence alone using HMMs
Which domain resource should I use?
• SMART is focused on signal transduction domains
• Pfam is very actively developed and thus tends to
have the most up-to-date domain collection
• InterPro is useful for genome annotation since the
domains are annotated with GO terms
• CDD is conveniently integrated with the NCBI
BLAST web interface
Predicting globular domains and
intrinsically disordered regions
• Not all globular domains have been discovered
and the databases are thus not comprehensive
• Methods exist for predicting from sequence which
regions are globular and which are disordered
– GlobPlot uses a simple propensity scale
– DisEMBL, DISOPRED, and PONDR all use ensembles
of artificial neural networks
• Many disordered regions are important for protein
function and they should thus not be ignored
Summary
• Functional annotation
– Molecular function vs. biological process
– Inference of molecular function by sequence similarity
– Biological process only transferable between orthologs
• Detection of orthologs
– In-depth studies: phylogenetic trees
– Automated analysis: InParanoid and COG/KOG/NOG
• Profile searches for protein domains
– Each domain contributes a different molecular function
Acknowledgments
Christian von Mering
Christopher Creevey
Ivica Letunic
Rune Linding
Tobias Doerks
Francesca Ciccarelli
Berend Snel
Martijn Huynen
Toby Gibson
Rob Russell
Peer Bork
Prediction of functional
motifs from sequence
Proteins – more than just
globular domains
• Transmembrane helices
• Disordered regions
• Eukaryotic linear motifs (ELMs)
– Modification sites, e.g. phosphorylation sites
– Ligand peptides, e.g. SH3 binding sites
– Targeting signals, e.g. nuclear localization sequences
• The short functional motifs are as important as the
globular domains
Insulin Receptor Substrate 1
Databases of functional motifs
• Fewer and smaller databases
– General databases of motifs: ProSite and ELM
– Phosphorylation sites: Phospho.ELM and PhosphoSite
– These databases contain far fewer instances than
protein domain databases
• Curation is more difficult
– Protein domain databases can be constructed based on
analysis of protein sequences alone
– Short functional motifs must be curated based on
experimental evidence
Prediction of ELMs
• Most functional motifs are “information poor”
– Weak/short consensus sequences for ELMs
– The typical ELM only has three conserved residues
– Some variance is often allowed even for these
• ELMs are very hard to predict from sequence
– Simple consensus sequences match everywhere
– Even more advanced methods like PSSMs, ANNs, or
SVMs give poor specificity
– The full information is not in the site itself
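To make the "match everywhere" problem concrete, the sketch below scans a sequence for the PxxP core that many SH3 ligand peptides contain, written as a regular expression. The pattern is used purely as an example of a short consensus; the example sequence in the usage note is invented.

```python
import re

# PxxP: proline, any two residues, proline. The lookahead makes
# overlapping occurrences count as separate matches.
PXXP = re.compile(r"(?=(P..P))")


def find_motif(sequence, pattern=PXXP):
    """Return (0-based position, matched text) for every occurrence."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence)]
```

For the made-up sequence `MPAAPPLLPQ` this reports two matches, `PAAP` and `PLLP`. With only two fixed positions, a pattern like this is expected to occur by chance in many unrelated proteins, which is why consensus matching alone gives poor specificity.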
Construction of data sets
• Compiling an initial data set
– Positive examples can be obtained from existing
databases or curated from the literature
– Good negative examples are often harder to get
• Separate training and test sets
– A method may be able to learn the training examples
but fail to generalize to new examples
• Homology reduction!
– It is crucial that there is no significant sequence
similarity between examples in the training and test sets
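One way to enforce this is to cluster the examples by sequence similarity first and then assign whole clusters to either set. The sketch below assumes the significantly similar pairs have already been found by a similarity search; the function name, the `test_fraction` parameter, and the "smallest clusters go to the test set" rule are illustrative choices, not a published protocol.

```python
def split_by_homology(ids, similar_pairs, test_fraction=0.2):
    """Split ids into (train, test) so that no similar pair is separated."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Single-linkage clustering: similar sequences end up in one cluster.
    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)

    train, test = [], []
    target = test_fraction * len(ids)
    # Fill the test set with the smallest clusters until the target is reached.
    for members in sorted(clusters.values(), key=lambda m: (len(m), m)):
        if len(test) + len(members) <= target:
            test.extend(members)
        else:
            train.extend(members)
    return train, test
```

Splitting at the level of clusters rather than individual sequences means that test performance reflects generalization to unrelated proteins instead of recognition of near-duplicates.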
Machine learning
• Numerous algorithms exist
– Artificial neural networks
– Support vector machines
– Decision trees
• The choice of algorithm is not so important
• Providing the relevant input is important
• Having high-quality training data is crucial
Kinase-specific prediction of
phosphorylation sites (NetPhosK)
• Artificial neural networks
(ANNs) were trained for
several different kinases
• The sequence logos show
only the positive examples
• Negative examples also
provide information
• Also, ANNs and SVMs can
capture correlations
between positions
Prediction of signal peptides
from sequence (SignalP)
• Function
– Eukaryotic proteins are
targeted to the ER
– Prokaryotic proteins are
targeted for secretion
• Architecture
– Positively charged N-terminus
– Hydrophobic core
– Short, more polar region
– Cleavage site
• Signal peptides can be
accurately predicted
Machine learning can help identify
errors in curated databases
• Some of the manually curated databases contain
obvious errors that can be eliminated
• General “SIGNAL” errors
– Wrong signal peptide cleavage site
– The secreted protein is processed by proteases
– Signal peptide includes propeptide
– Wrong start codon used
[Figure: N-terminal region of a secreted protein, showing signal peptide, propeptide, and mature protein, with the signal peptide cleavage site, the propeptide cleavage site, and a wrong start codon marked]
Use of short linear motifs
for function prediction
• Only a few motifs (mostly localization signals) can
be predicted with high accuracy
– Even in these cases advanced machine learning
methods are typically needed
– These can be treated in the same way as domains
• Most motifs are weak, and predictions should be
approached with care
– To tell if these sites are likely to be true, one needs to
consider the context
– An experiment is needed to prove that it is functional
Feature-based prediction
of protein function
Function prediction from post
translational modifications
• Proteins with similar function
may not be related in sequence
• Still they must perform their
function in the context of the
same cellular machinery
• Similarities in features such as
PTMs and physical/chemical
properties could be expected for
proteins with similar function
Henrik Nielsen, CBS, DTU Lyngby
The concept of ProtFun
Function prediction on the
human prion sequence
############## ProtFun 1.1 predictions ##############
>PRIO_HUMAN
# Functional category                  Prob     Odds
  Amino_acid_biosynthesis              0.020    0.909
  Biosynthesis_of_cofactors            0.032    0.444
  Cell_envelope                        0.146    2.393
  Cellular_processes                   0.053    0.726
  Central_intermediary_metabolism      0.130    2.063
  Energy_metabolism                    0.029    0.322
  Fatty_acid_metabolism                0.017    1.308
  Purines_and_pyrimidines              0.528    2.173
  Regulatory_functions                 0.013    0.081
  Replication_and_transcription        0.020    0.075
  Translation                          0.035    0.795
  Transport_and_binding             => 0.831    2.027
# Enzyme/nonenzyme                     Prob     Odds
  Enzyme                               0.250    0.873
  Nonenzyme                         => 0.750    1.051
# Enzyme class                         Prob     Odds
  Oxidoreductase (EC 1.-.-.-)          0.070    0.336
  Transferase    (EC 2.-.-.-)          0.031    0.090
  Hydrolase      (EC 3.-.-.-)          0.057    0.180
  Isomerase      (EC 4.-.-.-)          0.020    0.426
  Ligase         (EC 5.-.-.-)          0.010    0.313
  Lyase          (EC 6.-.-.-)          0.017    0.334
ProtFun data sets
• Labeling of training and test data
– Cellular role categories: human SwissProt sequences
were categorized using EUCLID
– Enzyme categories: top-level enzyme classifications
were extracted from human SwissProt description lines
– Gene Ontology terms were transferred from InterPro
• The sequences were divided into training and test
sets without significant sequence similarity
• Binary predictors were trained for each category
Prediction performance on
cellular role categories
Prediction performance on
enzyme categories
Predictive performance on
Gene Ontology categories
Non-classical secretion
• Some proteins without N-terminal signal peptides
are secreted via alternative secretion pathways
– Several growth factors, e.g. FGF1 and FGF2
– Interleukin-1 beta
– HIV-1 Tat
• No consensus sequence motif is known
• Maybe they have some features in common with
other secreted proteins …
SecretomeP data sets
• Training and test set
– Positive examples: 3321 extracellular mammalian
proteins with their signal peptides removed
– Negative examples: 3654 mammalian proteins from
cytoplasm or nucleus
• Validation set
– 14 known non-classically secreted proteins
Secreted proteins are typically small
ROC plot for SecretomeP
Similar properties of classically and
non-classically secreted proteins
A look into the black box
• Neural networks are often criticized for being a
“black box” method
• However, there are several ways to investigate
what a neural network ensemble has learned
– What fraction of the ensemble uses a certain feature?
– How good a performance can be attained using each of
the features individually?
– How much does performance decrease if the neural
networks are retrained without a certain feature (or
combination of features)?
SecretomeP feature usage
ProtFun performance for
other organisms
• Our predictors work in
general for eukaryotes
– Best performance on
metazoan proteins
• Some categories work
quite well for prokaryotes
– Most metabolism categories
– Transport and binding
• While other categories fail
– Energy metabolism
– Regulatory functions
Mapping category performances
onto input features
Performance contribution of
sequence derived features
• The correlations between
features and function are
conserved for eukaryotes
• Some correlations extend
to archaea and bacteria
– Physical/chemical properties
– Secondary structure and
transmembrane helices
• Other correlations only
hold for eukaryotes
– PTMs and subcellular
localization features
Evolution conserves protein
features and function
• Protein features are more
conserved between
orthologs than paralogs
• This leads to ProtFun
predicting orthologs to be
more likely to share
function than paralogs
• That prediction is fully
consistent with the notion
that it is best to infer
function from orthologous
proteins
Conclusions
• Short linear motifs are likely equally important for
protein function as the large well-studied domains
• These are much harder to predict from sequence
– Reasonable accuracy can be obtained by applying
machine learning methods on high-quality datasets
• Many classes of proteins can be predicted based
on such sequence-derived protein features
– These methods are not nearly as reliable as homology
– However, they are often the only option
Acknowledgments
Ramneek Gupta
Can Kesmir
Jannick Dyrløv Bendtsen
Henrik Nielsen
Nikolaj Blom
Francesca Diella
Rune Linding
Damien Devos
Alfonso Valencia
Søren Brunak
Toby Gibson
Prediction of functional
interaction networks
What is an interaction?
• Physical protein interactions
– Proteins that physically touch each other
– Members of the same stable complex
– Transient interactions, e.g. a kinase and its substrate
• The pragmatic definition – whatever the assay in
question can measure
• Functional interactions
– Neighbors in metabolic networks
– Members of the same pathway
The use of interaction networks
for function prediction
• A functional interaction implies that two proteins
are involved in the same biological process
• However, the networks do not divide proteins into
a predefined set of functional classes such as the
Gene Ontology terms
• Functional associations do not require homology
to proteins of known function, and can complement
the predictions even when homology is present
Functional interaction networks
Evidence types
• Genomic context methods
– Phylogenetic profiles, gene neighborhood, and fusion
• Primary experimental data
– Physical protein interactions and gene expression data
• Manually curated databases
– Pathways and protein complexes
• Automatic literature mining
– Co-occurrence and Natural Language Processing
Phylogenetic profiles
[Figure: phylogenetic profile example with the labels Cell, Cellulosomes, and Cellulose]
Formalizing the phylogenetic
profile method
Align all proteins against all
Calculate best-hit profile
Join similar species by PCA
Calculate PC profile distances
Calibrate against KEGG maps
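The core idea of comparing profiles can be sketched in a much simplified form. The version below uses binary presence/absence profiles and Jaccard similarity, which is a swapped-in simplification: the pipeline above works on best-hit profiles, joins similar species by PCA, and calibrates the distances against KEGG maps, none of which is reproduced here. The threshold is an arbitrary assumption.

```python
def profile_similarity(p, q):
    """Jaccard similarity between two presence/absence profiles (0/1 per species)."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0


def similar_profiles(profiles, threshold=0.75):
    """Return gene pairs whose profiles are at least `threshold` similar."""
    names = sorted(profiles)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if profile_similarity(profiles[a], profiles[b]) >= threshold
    ]
```

Genes that are gained and lost together across species end up with near-identical profiles, which is taken as evidence that they function in the same process.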
Gene neighborhood
Identify runs of adjacent genes
with the same direction
Score each gene pair based on
intergenic distances
Calibrate against KEGG maps
Infer associations
in other species
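The first step, identifying runs of adjacent genes with the same direction, can be sketched as follows. The tuple layout and the `max_gap` cutoff are assumptions made for the example; the approach described above scores each gene pair by its intergenic distance instead of applying a hard cutoff.

```python
def same_strand_runs(genes, max_gap=300):
    """Group adjacent genes on the same strand into runs.

    genes: list of (name, start, end, strand) tuples sorted by start
    position on a single chromosome or plasmid. max_gap is a hypothetical
    limit on the intergenic distance in base pairs.
    """
    runs, current = [], []
    for gene in genes:
        if current:
            previous = current[-1]
            same_direction = gene[3] == previous[3]
            close_enough = gene[1] - previous[2] <= max_gap
            if same_direction and close_enough:
                current.append(gene)
                continue
            runs.append(current)
        current = [gene]
    if current:
        runs.append(current)
    # Single genes are not informative for neighborhood-based associations.
    return [[g[0] for g in run] for run in runs if len(run) > 1]
```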
Gene fusion
Calculate all-against-all
pairwise alignments
Find genes in A that match
the same gene in B
Exclude overlapping
alignments
Calibrate against
KEGG maps
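The middle two steps can be sketched as below, assuming the pairwise alignments are already available as tuples giving the aligned region on the gene in B. The tuple layout and function name are hypothetical.

```python
def fusion_candidates(hits):
    """Find pairs of genes in A that match non-overlapping parts of one gene in B.

    hits: list of (gene_in_A, gene_in_B, start_on_B, end_on_B) alignments.
    Returns sorted (gene_a1, gene_a2, gene_b) triples.
    """
    by_target = {}
    for a, b, start, end in hits:
        by_target.setdefault(b, []).append((a, start, end))

    candidates = set()
    for b, matches in by_target.items():
        for i, (a1, s1, e1) in enumerate(matches):
            for a2, s2, e2 in matches[i + 1:]:
                # Keep the pair only if the two aligned regions do not overlap.
                if a1 != a2 and (e1 < s2 or e2 < s1):
                    candidates.add((min(a1, a2), max(a1, a2), b))
    return sorted(candidates)
```

Excluding overlapping alignments matters because two genes in A that align to the same region of the gene in B are more likely to be homologs of each other than two separate genes that were fused in B.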
Calibration of quality scores
• Different pieces of evidence are not directly comparable
– A different raw quality score is used for each evidence type
– Quality differences exist among data sets of the same type
• Solved by calibrating all scores against a common reference
– The accuracy relative to a “gold standard” is calculated within
score intervals
– The resulting points are approximated by a sigmoid
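The calibration step can be sketched as follows: raw scores are sorted, cut into intervals, and the fraction of gold-standard pairs in each interval gives one calibration point. The equal-sized intervals, the function names, and the three-parameter sigmoid are assumptions for illustration; fitting the sigmoid parameters to the points is left out.

```python
import math


def accuracy_per_interval(scored_pairs, gold, n_bins=3):
    """Return (mean raw score, accuracy) for each score interval.

    scored_pairs: list of (pair, raw_score); gold: set of pairs treated
    as true, for example proteins on the same KEGG map.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    if not ranked:
        return []
    size = math.ceil(len(ranked) / n_bins)
    points = []
    for i in range(0, len(ranked), size):
        chunk = ranked[i:i + size]
        mean_score = sum(score for _, score in chunk) / len(chunk)
        accuracy = sum(1 for pair, _ in chunk if pair in gold) / len(chunk)
        points.append((mean_score, accuracy))
    return points


def sigmoid(x, top, slope, midpoint):
    """Assumed calibration curve shape; its parameters would be fitted to the points."""
    return top / (1.0 + math.exp(-slope * (x - midpoint)))
```

Once each evidence type has its own fitted curve, any raw score can be converted into an estimated accuracy, which puts all evidence types on a common scale.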
Data integration
Protein-protein interaction databases
• Imported databases
– BIND, Biomolecular Interaction Network Database
– DIP, Database of Interacting Proteins
– GRID, General Repository for Interaction Datasets
– HPRD, Human Protein Reference Database
– MINT, Molecular Interactions Database
• Databases to be added
– IntAct
– PDB
Physical protein interactions
Make binary representation of complexes
Yeast two-hybrid data sets are inherently binary
Calculate score from number of (co-)occurrences
Calculate score from non-shared partners
Calibrate against KEGG maps
Combine evidence from experiments
Infer associations in other species
Binary representations
of purification data
Topology based quality scores
• Scoring scheme for yeast two-hybrid data:
– S1 = -log((N1+1)·(N2+1))
– N1 and N2 are the numbers of non-shared interaction partners
– Similar scoring schemes have been published by Saito et al.
• Scoring scheme for complex pull-down data:
– S2 = log[(N12·N)/((N1+1)·(N2+1))]
– N12 is the number of purifications containing both proteins
– N1 is the number containing protein 1, N2 is defined similarly
– N is the total number of purifications
• Both schemes aim at identifying ubiquitous interactors
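Both formulas translate directly into code. The sketch below assumes the natural logarithm, since the base is not stated above, and the function names are made up for the example.

```python
import math


def two_hybrid_score(n1, n2):
    """S1 = -log((N1+1)*(N2+1)), with N1 and N2 the numbers of
    non-shared interaction partners of the two proteins."""
    return -math.log((n1 + 1) * (n2 + 1))


def pull_down_score(n12, n1, n2, n):
    """S2 = log((N12*N) / ((N1+1)*(N2+1))).

    n12: purifications containing both proteins (must be at least 1),
    n1, n2: purifications containing protein 1 or protein 2,
    n: total number of purifications.
    """
    return math.log((n12 * n) / ((n1 + 1) * (n2 + 1)))
```

In both cases a protein that turns up with many different partners, or in many purifications, lowers the score of every pair it takes part in.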
Mining microarray
expression databases
Re-normalize arrays by modern method to remove biases
Build expression matrix
Combine similar arrays by PCA
Construct predictor by Gaussian kernel density estimation
Calibrate against KEGG maps
Infer associations in other species
Databases of curated knowledge
• Pathway databases
– BioCarta
– KEGG, Kyoto Encyclopedia of Genes and Genomes
– Reactome
– STKE, Signal Transduction Knowledge Environment
• Curated protein complexes
– MIPS, Munich Information center for Protein Sequences
• Databases to be added
– Gene Ontology annotation
Co-occurrence in the scientific texts
Associate abstracts with species
Identify gene names in title/abstract
Count (co-)occurrences of genes
Test significance of associations
Calibrate against KEGG maps
Infer associations in other species
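The counting step can be sketched as below, using exact matching of whitespace-separated tokens against a set of gene names. Real name recognition has to handle synonyms and ambiguous names, and the observed-over-expected ratio is only a simple stand-in for the significance test mentioned above. All function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_cooccurrences(abstracts, gene_names):
    """Count in how many abstracts each gene, and each gene pair, is mentioned.

    gene_names must be a set of exact, case-sensitive names.
    """
    singles, pairs = Counter(), Counter()
    for text in abstracts:
        words = set(text.replace(",", " ").replace(".", " ").split())
        found = sorted(words & gene_names)
        singles.update(found)
        pairs.update(combinations(found, 2))
    return singles, pairs


def observed_over_expected(pair, singles, pairs, n_abstracts):
    """Ratio of observed co-mentions to the number expected if the two
    genes were mentioned independently of each other."""
    a, b = pair
    expected = singles[a] * singles[b] / n_abstracts
    return pairs[pair] / expected if expected else 0.0
```

Pairs are stored in alphabetical order, so lookups in the `pairs` counter should use sorted names, for example `("CYC1", "HAP1")`.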
Databases used for text mining
• Corpora
– Medline
– OMIM, Online Mendelian
Inheritance in Man
– SGD, Saccharomyces
Genome Database
– The Interactive Fly
• These text sources are all
parsed and converted into
a unified format
• Gene synonyms
– Ensembl
– SwissProt
– HUGO
– LocusLink
– SGD
– TAIR
• Cross references and sequence comparison are
used for merging
Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxgene The GAL4 gene]
[nxexpr The expression of
[nxgene the cytochrome genes
[nxpg CYC1 and CYC7]]]
is controlled by
[nxpg HAP1]
Multiple types of interactions
Transfer of evidence
• STRING “red” – COG mode
– Each node in the network represents a COG
– For each pair of COGs, the highest confidence score for
each evidence type counts from each clade
– The scores are combined using naïve Bayes
• STRING “blue” – protein mode
– Each node in the network represents a single locus
– Evidence from other organisms is transferred based
on fuzzy orthology
– The scores are combined using naïve Bayes
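One common reading of combining scores in a naïve Bayes fashion is to treat each calibrated score as the probability that an association is true and to assume the evidence types are independent. The sketch below implements only that assumption; the actual scoring scheme may include further corrections that are not described here.

```python
def combine_scores(scores):
    """Combine per-evidence confidence scores under an independence assumption.

    The result is the probability that at least one evidence type is
    correct: 1 - product(1 - s_i).
    """
    all_wrong = 1.0
    for score in scores:
        all_wrong *= 1.0 - score
    return 1.0 - all_wrong
```

Under this scheme two independent lines of evidence with a confidence of 0.5 each combine to 0.75, so several weak signals can add up to a confident prediction.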
Evidence transfer based
on “fuzzy orthology”
• Orthology transfer is tricky
– Correct assignment of orthology is difficult for
distant species
– Functional equivalence is not guaranteed for paralogs
• These problems are addressed by our “fuzzy
orthology” scheme
– Functional equivalence scores are calculated from
all-against-all alignment
– Evidence is distributed across possible pairs
[Figure: evidence transfer from a source species to a target species]
The power of cross-species transfer
and evidence integration
The big challenge
Prediction of “mode of action”
Summary
• Functional interaction networks are useful for
predicting the biological role of a protein
• Many algorithms and types of data can be used
for predicting functional interactions
– Each method must be benchmarked
– The different types of evidence should be integrated in
a probabilistic scoring scheme
• To make the most of the available data, evidence
should also be transferred between organisms
Acknowledgments
Christian von Mering
Jasmin Saric
Berend Snel
Sean Hooper
Rossitza Ouzounova
Samuel Chaffron
Julien Lagarde
Mathilde Foglierini
Isabel Rojas
Martijn Huynen
Peer Bork