Computational Proteomics: Structure/Function Prediction & the Protein Interactome

Computational Proteomics:
Structure/Function Prediction
& the Protein Interactome
Jaime Carbonell (jgc@cs.cmu.edu), with Betty Cheng, Yan Liu,
Eric Xing, Yanjun Qi, Judith Klein-Seetharaman, and Oznur Tastan
Carnegie Mellon University
Pittsburgh PA, USA
December, 2008
Simplified View of Biology
[Figure: protein sequence → protein structure; image credit Nobelprize.org]
© 2003, Jaime Carbonell
PROTEINS
(Borrowed from: Judith Klein-Seetharaman)
Sequence → Structure → Function
Primary Sequence
MNGTEGPNFY
PLNYILLNLA
KPMSNFRFGE
HFIIPLIVIF
SDFGPIFMTI
VPFSNKTGVV
VADLFMVFGG
NHAIMGVAFT
FCYGQLVFTV
PAFFAKTSAV
RSPFEAPQYY
FTTTLYTSLH
WVMALACAAP
KEAAAQQQES
YNPVIYIMMN
LAEPWQFSML
GYFVFGPTGC
PLVGWSRYIP
ATTQKAEKEV
KQFRNCMVTT
AAYMFLLIML
NLEGFFATLG
EGMQCSCGID
TRMVIIMVIA
LCCGKNPLGD
GFPINFLTLY
GEIALWSLVV
YYTPHEETNN
FLICWLPYAG
DEASTTVSKT
VTVQHKKLRT
LAIERYVVVC
ESFVIYMFVV
VAFYIFTHQG
ETSQVAPA
Folding
3D Structure
Complex function within network of proteins → Normal
PROTEINS
Sequence → Structure → Function
Primary Sequence: (same sequence as on the previous slide)
Folding
3D Structure
Complex function within network of proteins → Disease
Motivation: Protein Structure and Function Prediction
• Ultimate goal: Sequence → Function
– …and Function → Sequence (drug design, …)
– Potential active binding sites are a good start, but how about stability, external accessibility, energetics, …?
• Intermediate goal: Sequence → Structure
– Only 1.2% of proteins have been structurally resolved
– What-if analysis (precursor of mutagenesis experiments)
• Machine Learning & Language Technology methods
– Power tools to model and predict structure & function
– CompBio challenges are starting to drive new research in Machine Learning & Language Technologies
OUTLINE
• Motivation: sequence → structure → function
• Vocabulary-based classification approaches (Betty Cheng, Jaime Carbonell, Judith Klein-Seetharaman)
– GPCR subfamily classification
– Protein-protein coupling specificity
• Solving the “Folding Problem”: Machine Learning approaches to structure prediction (Yan Liu, Jaime Carbonell, et al.)
– Tertiary folds: β-helix prediction via segmented CRFs
– Quaternary folds: viral adhesin and capsid complexes
• Conclusions and future directions
GPCR Super-family: G-Protein Coupled Receptors
• Transmembrane protein
• Target of 60% of drugs (Moller, 2002)
• Involved in cancer, cardiovascular disease, Alzheimer’s and Parkinson’s diseases, stroke, diabetes, and inflammatory and respiratory diseases
[Figure: seven-transmembrane topology; N-terminus and extracellular loops outside the membrane, helices I–VII spanning it, intracellular loops and C-terminus inside]
Protein Family & Subfamily Classification (applied to GPCRs)
Subfamily classification based on pharmaceutical properties
Comparative Study – Karchin et al., 2002
Traditionally, hidden Markov models, k-nearest neighbours and BLAST have been used. Recently, more complicated classifiers have been used (Karchin et al., 2002). Karchin et al. (2002) studied a range of classifiers of varied complexity in GPCR subfamily classification (and protein-protein interaction prediction), from complex (SVM, Hidden Markov Models, Neural Nets, Clustering) to simple (k-Nearest Neighbours, BLAST, Decision Trees, Naïve Bayes).
Hypothesis: Bio-vocabulary selection is crucial for sub-family classification. But what about simple classifiers at the other end of the scale?
Study “segments” with different vocabulary: AA, chemical groups, properties of AA
Computing Chi-Square

χ²(x) = Σ_{c ∈ C} [o(c, x) − e(c, x)]² / e(c, x),  where e(c, x) = n_c · t_x / N

with o(c, x) the observed number of sequences in class c with feature x, e(c, x) the expected number, n_c the number of sequences in class c, t_x the number of sequences with feature x, and N the total number of sequences.
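The χ² feature score above can be sketched in a few lines of Python. This is an illustrative sketch, not the original system's code; the function and argument names are invented.

```python
from collections import Counter

def chi_square_score(feature_presence, labels):
    """Chi-square score of one binary feature x against class labels.

    feature_presence: 0/1 per sequence, whether it contains feature x.
    labels: class label per sequence, parallel to feature_presence.
    Score = sum over classes c of (o(c,x) - e(c,x))^2 / e(c,x),
    with expected count e(c,x) = n_c * t_x / N.
    """
    N = len(labels)
    t_x = sum(feature_presence)                  # sequences with feature x
    n_c = Counter(labels)                        # sequences per class
    o = Counter(lab for has, lab in zip(feature_presence, labels) if has)
    score = 0.0
    for c, n in n_c.items():
        e = n * t_x / N                          # expected count for class c
        if e > 0:
            score += (o.get(c, 0) - e) ** 2 / e
    return score
```

A feature perfectly correlated with the class gets a high score, an uninformative one scores zero, which is exactly why the score ranks n-gram features well.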
Level I Subfamily Optimization
[Plot: accuracy vs. number of features for Decision Trees and Naïve Bayes, with binary features and with n-gram counts]
Level I Subfamily Results

| Classifier | # of Features | Type of Features | Accuracy |
| Naïve Bayes | 5500-7700 | Binary | 93.0 % |
| Naïve Bayes | 3300-6900 | N-gram counts | 90.6 % |
| Naïve Bayes | All (9702) | N-gram counts | 90.0 % |
| SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4 % |
| BLAST | – | Local sequence alignment | 83.3 % |
| Decision Tree | 900-2800 | Binary | 77.3 % |
| Decision Tree | 700-5600 | N-gram counts | 77.3 % |
| Decision Tree | All (9723) | N-gram counts | 77.2 % |
| SAM-T2K HMM | – | An HMM built for each protein subfamily | 69.9 % |
| kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0 % |
Level II Subfamily Results

| Classifier | # of Features | Type of Features | Accuracy |
| Naïve Bayes | 8100 | Binary | 92.4 % |
| SVM | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3 % |
| Naïve Bayes | 5600 | N-gram counts | 84.2 % |
| SVMtree | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 82.9 % |
| Naïve Bayes | All (9702) | N-gram counts | 81.9 % |
| BLAST | – | Local sequence alignment | 74.5 % |
| Decision Tree | 1200 | N-gram counts | 70.8 % |
| Decision Tree | 2300 | Binary | 70.2 % |
| SAM-T2K HMM | – | An HMM built for each protein subfamily | 70.0 % |
| Decision Tree | All (9723) | N-gram counts | 66.0 % |
| kernNN | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0 % |
[Figure: top 20 selected “words” for Class B GPCRs; they correlate with identified motifs. Helices 3 and 7 are known to be important for signal transduction; Loop 1 is a suspected common binding site.]
Generalization to Other Superfamilies: Nuclear Receptors

| Dataset | Feature Type | # of Features | Testing Accuracy | Validation Accuracy |
| Family | Binary | 1500-4200 | 96.96% | 94.53% |
| Family | N-gram counts | 400-4900 | 95.75% | 91.79% |
| Level I Subfamily | Binary | 1500-3100 | 98.09% | 97.77% |
| Level I Subfamily | N-gram counts | 500-1100 | 93.95% | 91.40% |
| Level II Subfamily | Binary | 1500-2100 | 95.32% | 93.62% |
| Level II Subfamily | N-gram counts | 3100-5600 | 86.39% | 85.54% |
G-Protein Coupling Specificity Problem
• Predict which one or more families of G-proteins a GPCR can couple with, given the GPCR sequence
• Locate regions in the GPCR sequence where the majority of coupling specificity information lies

| G-Protein Family | Function |
| Gs | Activates adenylyl cyclase |
| Gi/o | Inhibits adenylyl cyclase |
| Gq/11 | Activates phospholipase C |
| G12/13 | Unknown |
N-gram Based Component
• Extract n-grams from all possible reading frames of the test sequence (e.g. MGNASNDSQSEDCETRQWLPPGESPAI …) and form counts of all n-grams
• Use a set of binary k-NN classifiers, one for each G-protein family, to predict whether the receptor couples to that family
• Predict coupling to family C if the k-NN outputs Pr(coupling to family C) ≥ a trained threshold; otherwise predict no coupling to family C
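The n-gram extraction step can be sketched as below, assuming "all possible reading frames" means every overlapping window over the amino-acid sequence; the function names are illustrative, not from the original system.

```python
from collections import Counter

def ngram_counts(seq, n_values=(1, 2)):
    """Counts of all overlapping n-grams in a protein sequence,
    i.e. windows starting at every possible offset."""
    counts = Counter()
    for n in n_values:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def count_vector(seq, vocab, n_values=(1, 2)):
    """Fixed-length feature vector over a chosen n-gram vocabulary,
    ready to feed a k-NN classifier."""
    counts = ngram_counts(seq, n_values)
    return [counts[g] for g in vocab]
```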
Alignment-Based Component
• BLAST the test sequence (e.g. MGNASNDSQSEDCETRQWLPPGESPAI …) to retrieve the K1 most similar sequences
• A set of binary classifiers, one for each G-protein family, predicts whether the receptor couples to that family
• Predict coupling to family C if more than x% of the retrieved sequences couple to the family; otherwise predict no coupling to family C
• 2 parameters:
– Number of neighbours, K
– Threshold x%
Our Hybrid Method: Combining Alignment and N-grams
• Run the BLAST k-NN on the test sequence with x% = 100%: if it fires (all retrieved neighbours couple to family C), predict coupling to family C
• Otherwise fall back to the n-gram k-NN: predict coupling to family C if it fires, else predict no coupling to family C
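The two-stage decision above can be sketched as a short cascade; this is a hedged illustration of the rule, with invented names, not the paper's code.

```python
def hybrid_predict(neighbours_couple, ngram_prob, ngram_threshold=0.66):
    """Hybrid coupling decision for one G-protein family C.

    neighbours_couple: booleans, whether each of the K1 BLAST
    neighbours couples to family C.  The alignment stage uses
    x% = 100%, i.e. it fires only on unanimous neighbours;
    otherwise the n-gram k-NN probability decides."""
    if neighbours_couple and all(neighbours_couple):
        return True                       # unanimous alignment evidence
    return ngram_prob >= ngram_threshold  # fall back to n-gram k-NN
```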
Evaluation Metrics & Dataset

Precision = A / (A + B)   Recall = A / (A + C)
F1 = 2PR / (P + R)   Accuracy = (A + D) / (A + B + C + D)

|  | Truth: Couplings | Truth: Non-Couplings |
| Predict: Couplings | A | B |
| Predict: Non-Couplings | C | D |

Dataset: (Cao et al., 2003); 81.3% training set, same test set.
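The four metrics follow directly from the confusion counts; a minimal sketch (the function name is invented):

```python
def coupling_metrics(a, b, c, d):
    """Precision, recall, F1 and accuracy from the slide's confusion
    counts: A = correctly predicted couplings, B = predicted couplings
    that are truly non-couplings, C = missed couplings, D = correctly
    predicted non-couplings."""
    precision = a / (a + b)
    recall = a / (a + c)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (a + d) / (a + b + c + d)
    return precision, recall, f1, accuracy
```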
Results on Cao et al. Dataset

| Method | N-gram Threshold | Prec | Recall | F1 |
| Hybrid | 0.66 | 0.698 | 0.952 | 0.805 |
| N-gram | 0.34 | 0.658 | 0.794 | 0.719 |
| Cao et al. | – | 0.577 | 0.889 | 0.700 |

• Hybrid method outperformed Cao et al. in precision, recall and F1
• Suggests alignment contains information not found in n-grams

| Method | Max | Prec | Recall | F1 |
| Whole Seq Alignment | F1 | 0.779 | 0.841 | 0.809 |
| Hybrid | F1 | 0.775 | 0.873 | 0.821 |
| Whole Seq Alignment | Precision | 0.793 | 0.730 | 0.760 |
| Hybrid | Precision | 0.803 | 0.778 | 0.790 |

• Suggests n-grams contain information not found in alignment
Feature Selection of N-grams
• Pre-processing step to remove noisy or redundant features that may confuse the classifier
• Many feature selection algorithms are available; chi-square was used because of its success in GPCR subfamily classification
• Pipeline: counts of all n-grams → chi-square feature selection → selected n-gram counts → k-NN classifier → predict coupling to family C if Pr(coupling to family C) ≥ threshold, else predict no coupling to family C
IC Domain Combination Analysis
• Of the 4 intracellular (IC) domains, the 2nd domain yielded the best F1, followed by the 1st, 3rd and 4th domains
• Most information in IC1 is already found in IC2

| IC | Prec | Rec | F1 | Acc |
| 1 | 0.782 | 0.703 | 0.739 | 0.796 |
| 2 | 0.820 | 0.799 | 0.808 | 0.845 |
| 3 | 0.661 | 0.721 | 0.682 | 0.730 |
| 4 | 0.632 | 0.755 | 0.670 | 0.694 |
| 1, 2 | 0.820 | 0.805 | 0.811 | 0.847 |
| 1, 3 | 0.799 | 0.765 | 0.780 | 0.825 |
| 1, 4 | 0.780 | 0.755 | 0.765 | 0.807 |
| 2, 3 | 0.837 | 0.825 | 0.828 | 0.861 |
| 2, 4 | 0.828 | 0.816 | 0.821 | 0.853 |
| 3, 4 | 0.773 | 0.807 | 0.788 | 0.821 |
| 1, 2, 3 | 0.822 | 0.814 | 0.816 | 0.850 |
| 1, 2, 4 | 0.807 | 0.809 | 0.807 | 0.843 |
| 1, 3, 4 | 0.792 | 0.807 | 0.797 | 0.832 |
| 2, 3, 4 | 0.839 | 0.820 | 0.828 | 0.861 |
| 1, 2, 3, 4 | 0.824 | 0.813 | 0.817 | 0.853 |
Tertiary Protein Fold Prediction
• Protein function is strongly modulated by structure
• Predicting folds, domains and other regular structures requires modeling local and long-distance interactions in low-homology sequences
– Long distance: not addressed by n-grams, HMMs, etc.
– Low homology: not addressed by BLAST-style algorithms
• We focus on minimal mathematical structural modeling
– Segmented conditional random fields
– Layered graphical models
– Fully trainable to recognize new instances of structures
• First acid-test: β-helix super-secondary structure prediction (with data and guidance from Prof. J. King at MIT)
Protein Structure Determination
• Lab experiments: time, cost, uncertainty, …
– X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
– NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wüthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
[Structures pictured: 1MBN, 1BUS]
Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
• The gap between the known protein sequences and structures:
– 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
– Therefore we need to predict structures in silico

Predicting Tertiary Folds
• Super-secondary structures
– Common protein domains and scaffolding patterns, such as regular combinations of β-sheets and/or α-helices
• Our task
– Given a protein sequence, predict super-secondary structures and their components (e.g. β-helices and the location of each rung therein)
• Examples:
– Parallel right-handed β-helix
– Leucine-rich repeats
Parallel Right-handed β-Helix
• Structure
– A regular super-secondary structure with an elongated helix whose successive rungs are composed of beta-strands
– Highly-conserved T2 turn
• Computational importance
– Long-range interactions
– Repeat patterns
• Biological importance
– Functions such as the bacterial infection of plants, binding the O-antigen, etc.
Conditional Random Fields
• Hidden Markov model (HMM) [Rabiner, 1989]

P(x, y) = ∏_{i=1}^{N} P(x_i | y_i) P(y_i | y_{i−1})

• Conditional random fields (CRFs) [Lafferty et al., 2001]

P(y | x) = (1 / Z_0) exp( Σ_{i=1}^{N} Σ_{k=1}^{K} λ_k f_k(x, i, y_{i−1}, y_i) )

– Model the conditional probability directly (discriminative models, directly optimizable)
– Allow arbitrary dependencies in the observation
– Adaptive to different loss functions and regularizers
– Promising results in multiple applications
– But need to scale up (computationally) and extend to long-distance dependencies
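To make the CRF equation concrete, here is a brute-force toy implementation: it enumerates all label sequences to compute Z_0, so it is only an illustration (real CRFs use dynamic programming), and all names are invented.

```python
import math
from itertools import product

def crf_prob(x, y, labels, feats, lam):
    """P(y|x) = (1/Z0) exp( sum_i sum_k lam_k * f_k(x, i, y_{i-1}, y_i) ),
    with Z0 computed by enumerating every label sequence (toy only)."""
    def unnorm(ys):
        score = 0.0
        prev = None                      # y_0: no previous label
        for i, yi in enumerate(ys, start=1):
            score += sum(l * f(x, i, prev, yi) for l, f in zip(lam, feats))
            prev = yi
        return math.exp(score)
    z0 = sum(unnorm(ys) for ys in product(labels, repeat=len(x)))
    return unnorm(tuple(y)) / z0

def match(x, i, y_prev, y_i):
    """Example feature: fires when the label equals the observation."""
    return 1.0 if y_i == x[i - 1] else 0.0
```

With one weighted "match" feature, label sequences that agree with the observations get most of the probability mass, and the probabilities over all sequences sum to 1.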
Our Solution: Conditional Graphical Models
• Local dependency vs. long-range dependency
• Outputs Y = {M, {W_i}}, where W_i = {p_i, q_i, s_i}
• Feature definition
– Node feature: f_k(w_i, x) = f′_k(x, p_i, q_i) · I(s_i = s′, q_i − p_i + 1 = d′)
– Local interaction feature: f_k(w_{i−1}, w_i, x) = I(s_i = s, s_{i−1} = s′, p_i = q_{i−1} + 1)
– Long-range interaction feature: f_k(w_i, w_j, x) = g′_k(x, p_i, q_i, p_j, q_j) · I(s_i = s, s_j = s′)
Linked Segmentation CRF
• Nodes: secondary structure elements and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of y given x is defined over the joint labels as

P(y¹, …, y^R | x¹, …, x^R) = (1/Z) exp( Σ_{y_{i,j} ∈ V_G} Σ_k λ_k f_k(x^i, y_{i,j}) + Σ_{(y_{i,j}, y_{a,b}) ∈ E_G} Σ_l μ_l g_l(x^i, x^a, y_{i,j}, y_{a,b}) )
Linked Segmentation CRF (II)
• Classification:

y* = argmax_y Σ_{c ∈ C_G} Σ_{k=1}^{K} λ_k f_k(x, y_c)

• Training: learn the model parameters λ
– Minimize the regularized negative log loss, i.e. maximize

L_λ = Σ_{c ∈ C_G} Σ_{k=1}^{K} λ_k f_k(x, y_c) − log Z − Φ(‖λ‖²)

– Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations:

∂L/∂λ_k = Σ_{c ∈ C_G} ( f_k(x, y_c) − E_{p(y|x)}[f_k(x, y_c)] ) − Φ′(λ_k) = 0

• Complex graphs result in huge computational complexity
Model Roadmap
Generalized discriminative graphical models:
• Conditional random fields [Lafferty et al., 2001]
• Beyond Markov dependencies: Semi-Markov CRFs [Sarawagi & Cohen, 2005]
• Long-range, trading off local vs. long-range: Segmentation CRFs (Liu & Carbonell, 2005); Chain graph model (Liu, Xing & Carbonell, 2006)
• Inter-chain long-range: Linked Segmentation CRFs (Liu & Carbonell, 2007)
Tertiary Fold Recognition: β-Helix Fold
• Histogram and ranks for known β-helices against the PDB-minus dataset
• The chain graph model reduces the real running time of the SCRF model by around 50 times
Fold Alignment Prediction: β-Helix
• Predicted alignment for known β-helices on cross-family validation
Discovery of New Potential β-helices
• Run the structural predictor seeking potential β-helices in Uniprot (structurally unresolved) databases
– The full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on 3 proteins, from different organisms, whose structures were later experimentally resolved:
– 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
– 1PXZ: The Major Allergen From Cedar Pollen
– GP14 of Shigella bacteriophage, a β-helix protein
– Not a single false positive!
Predicting Quaternary Folds
• Triple beta-spirals [van Raaij et al., Nature 1999]
– Virus fibers in adenovirus, reovirus and PRD1
• Double barrel trimer [Benson et al., 2004]
– Coat protein of adenovirus, PRD1, STIV, PBCV
Features for Protein Fold Recognition
Experiment Results: Quaternary Fold Recognition
• Triple beta-spirals
• Double barrel-trimers

Experiment Results: Alignment Prediction
• Triple beta-spirals, with four states: B1, B2, T1 and T2
• Correct alignment: B1: i–o; B2: a–h
• Predicted alignment shown for B1 and B2
Experiment Results: Discovering New Membership Proteins
• Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
• Membership proteins of the double barrel-trimer suggested by biologists [Benson, 2005], compared with L-SCRF predictions
Conclusions & Challenges for Protein Structure/Function Prediction
• Methods from modern Machine Learning and Language Technologies really work in Computational Proteomics
– Family/subfamily/sub-subfamily predictions
– Protein-protein interactions (GPCRs ↔ G-proteins)
– Accurate tertiary & quaternary fold structural predictions
• Next generation of model sophistication…
• Addressing new challenges
– Structure → Function: structural predictions combined with binding-site & specificity analysis
– Predictive inversion: Function → Structure → Sequence for new hyper-specific drug design (anti-viral, oncology)
Proteins and Interactions
• Every function in the living cell depends on proteins
• Proteins are made of a linear sequence of amino acids and folded into unique 3D structures
• Proteins can bind to other proteins physically
– This enables them to carry out diverse cellular functions
Protein-Protein Interaction (PPI) Network
• PPIs play key roles in many biological systems
• A complete PPI network (naturally a graph) is
– Critical for analyzing protein functions & understanding the cell
– Essential for disease studies & drug discovery
PPI Biological Experiments
• Small-scale PPI experiments
– One protein or several proteins at a time
– Small amount of available data
– Expensive and slow lab process
• Large-scale PPI experiments
– Hundreds / thousands of proteins at a time
– Noisy and incomplete data
– Little overlap among different sets
⇒ A large portion of the PPIs is still missing or noisy!
Learning of PPI Networks
• Goal I: Pairwise PPI (links of the PPI graph)
– Most protein-protein interactions (pairwise) have not been identified, or are noisy
– ⇒ Missing link prediction!
• Goal II: “Complex” (important groups)
– Proteins often interact stably and perform functions together as one unit (a “complex”)
– Most complexes have not been discovered
– ⇒ Important group detection!
[Diagram: pairwise interactions → link prediction → PPI network; protein complex → group detection]
Goal I: Missing Link Prediction
[Diagram: pairwise interactions → PPI network]
Related Biological Data
• Overall, four categories:
– Direct high-throughput experimental data: two-hybrid screens (Y2H) and mass spectrometry (MS)
– Indirect high-throughput data: gene expression, protein-DNA binding, etc.
– Functional annotation data: Gene Ontology annotation, MIPS annotation, etc.
– Sequence-based data sources: domain information, gene fusion, homology-based PPIs, etc.
⇒ Utilize implicit evidence and available direct experimental results together
Related Data Evidence
[Diagram: attribute evidence of each protein (sequence, expression, structure, annotation, …) and relational evidence between proteins (synthetic lethality, relation expanding, …)]
Feature Vector for (Pairwise) Pairs
– For data representing protein-protein pairs, use the evidence directly
– For data representing a single protein (gene), calculate the (biologically meaningful) similarity between the two proteins for each evidence source

Example:
Protein A: Sequence mtaaqaagee…; GeneExp 233.94, 162.85, …
Protein B: Sequence mrpsgtagaa…; GeneExp 109.4, 975.3, …
Pair A-B: Sequence similarity; GeneExp correlation coefficient; Synthetic lethal: 1; … → Pair A-B: fea1, fea2, fea3, …
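A sketch of turning per-protein evidence into pair features, assuming a Pearson correlation for the expression profiles; the field names ("gene_exp", "synthetic_lethal_with") are illustrative, not the paper's actual feature set.

```python
import math

def pearson(u, v):
    """Pearson correlation between two gene-expression profiles."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def pair_features(prot_a, prot_b):
    """Feature vector for pair A-B: single-protein evidence becomes a
    similarity; relational evidence (synthetic lethality) is used directly."""
    return {
        "geneexp_corr": pearson(prot_a["gene_exp"], prot_b["gene_exp"]),
        "synthetic_lethal": int(prot_b["name"] in prot_a.get("synthetic_lethal_with", set())),
    }
```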
Problem Setting
• For each protein-protein pair:
– Target function: interacts or not?
– Treat as a binary classification task
• Feature set
– Features are heterogeneous
– Most features are noisy
– Most features have missing values
• Reference set:
– Small-scale PPI set as positive training (hundreds to thousands)
– No negative set (non-interacting pairs) available
– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs
» Estimated: 1 out of ~600 in yeast; 1 out of ~1000 in human
PPI Inference via ML Methods
• Jansen, R., et al., Science 2003: Bayes classifier
• Lee, I., et al., Science 2004: sum of log-likelihood ratios
• Zhang, L., et al., BMC Bioinformatics 2004: decision tree
• Bader, J., et al., Nature Biotech 2004: logistic regression
• Ben-Hur, A., et al., ISMB 2005: kernel method
• Rhodes, D.R., et al., Nature Biotech 2005: Naïve Bayes
Present focus: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006
Predicting Pairwise PPIs
– Prediction target (three types)
» Physical interaction
» Co-complex relationship
» Pathway co-membership inference
– Feature encoding
» (1) “detailed” style, and (2) “summary” style
» Feature importance varies
– Classification methods
» Random Forest & Support Vector Machine
Details in the paper: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006
Human Membrane Receptors
[Diagram: ligands binding Type I and Type II (GPCR) receptors and other membrane proteins, spanning the extracellular, transmembrane and cytoplasmic regions, feeding signal transduction cascades]
PPI Predictions for Human Membrane Receptors
• A combined approach (Y. Qi, et al., 2008)
– Binary classification
– Global graph analysis
– Biological feedback & validation

Binary Classification
• Random Forest classifier
– A collection of independent decision trees (an ensemble classifier)
– Each tree is grown on a bootstrap sample of the training set
– Within each tree’s training, for each node, the split is chosen from a bootstrap sample of the attributes
[Diagram: example trees splitting on features such as TAP, Y2H, GeneExpress, HMS-PCI, GOProcess, GOLocalization, ProteinExpress, SynExpress, GeneOccur, Domain]
• Robust to noisy features
• Can handle different types of features
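The bullets above (bootstrap samples of rows, random subsets of attributes, majority vote) can be sketched as a toy random-forest-style ensemble of one-split trees. This is an illustration of the idea under those assumptions, not the classifier used in the paper; all names are invented.

```python
import random
from collections import Counter

def train_stump(X, y, feats):
    """Best single-feature threshold split among candidate features."""
    best_acc, best = -1.0, None
    for f in feats:
        for t in sorted(set(row[f] for row in X)):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            lmaj = Counter(left).most_common(1)[0][0]
            rmaj = Counter(right).most_common(1)[0][0]
            acc = (sum(l == lmaj for l in left) + sum(r == rmaj for r in right)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (f, t, lmaj, rmaj)
    if best is None:                       # degenerate sample: constant prediction
        maj = Counter(y).most_common(1)[0][0]
        best = (0, float("inf"), maj, maj)
    return best

def predict_stump(stump, row):
    f, t, lmaj, rmaj = stump
    return lmaj if row[f] <= t else rmaj

def train_forest(X, y, n_trees=15, n_feats=1, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]   # bootstrap rows
        feats = rng.sample(range(len(X[0])), n_feats)          # random attribute subset
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest

def predict_forest(forest, row):
    # Majority vote over the ensemble of stumps.
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]
```

Real random forests grow full trees and re-sample attributes at every node; the stump version just makes the bootstrap-and-vote mechanics visible.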
Classifier Comparison
• Compare classifiers on receptor PPI (sub-network) vs. general human PPI prediction
• (27 features extracted from 8 different data sources, modified with biological feedback)
Global Graph Analysis
• Degree distribution / hub analysis / disease checking
• Graph module analysis (from a bi-clustering study)
• Protein-family based graph patterns (receptors / receptor subclasses / ligands / etc.)

Network analysis reveals interesting features of the human membrane receptor PPI graph. For instance:
• Two types of receptors: GPCR and non-GPCR (Type I)
• GPCRs are less densely connected than non-GPCRs
(Green: non-GPCR receptors; blue: GPCRs)
Experimental Validation
• Five predictions were chosen for experiments and three were verified:
– EGFR with HCK (pull-down assay)
– EGFR with Dynamin-2 (pull-down assay)
– RHO with CXCL11 (functional assays, fluorescence spectroscopy, docking)
– Experiments @ U. Pitt School of Medicine
Details in the paper: Y. Qi, et al., 2008
Motivation
• Current situation of the PPI task
– Only a small positive (interacting) set available
– No negative (non-interacting) set available
– Highly skewed class distribution
» Many more non-interacting pairs than interacting pairs
– The cost of misclassifying an interacting pair is higher than for a non-interacting pair
– The accuracy measure is not appropriate here
• Try to handle this task with ranking
– Rank the known positive pairs as high as possible
– At the same time, have the ability to rank the unknown positive pairs as high as possible
Split Features into Multi-View
• Overall, four feature groups:
– P (Direct): direct high-throughput experimental data: two-hybrid screens (Y2H) and mass spectrometry (MS)
– E (Genomic): indirect high-throughput data: gene expression, protein-DNA binding, etc.
– F (Functional): functional annotation data: Gene Ontology annotation, MIPS annotation, etc.
– S (Sequence): sequence-based data sources: domain information, gene fusion, homology-based PPIs, etc.
Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, BMC Bioinformatics 2007
Mixture of Feature Experts (MFE)
• Make protein interaction predictions (F, S, P, E → interact?) by
– Weighted voting over the four roughly homogeneous feature categories
– Treating each feature group as a prediction expert
– Weights that are also dependent on the input example
• A hidden variable M modulates the choice of expert:

p(Y | X) = Σ_M p(Y | X, M) p(M | X)

Mixture of Four Feature Experts
• Expert P: direct PPI high-throughput experimental data
• Expert E: indirect high-throughput experimental data
• Expert F: function annotation of proteins
• Expert S: sequence- or structure-based evidence

p(y^(n) | x^(n)) = Σ_{i=1}^{4} p(m_i^(n) = 1 | x^(n), v) · p(y^(n) | x^(n), m_i^(n) = 1, w_i)

• The parameters (w_i, v) are trained using EM
• The experts and the root gate use logistic regression (ridge estimator)
Mixture of Four Feature Experts (II)
• Handling missing values
– Add an additional feature column for each feature having low feature coverage
– MFE uses the present/absent information when weighting the different feature groups
• The posterior weight for expert i in predicting pair n
– The weight can be used to indicate the importance of that feature view (expert) for this specific pair:

h_i^(n) = P(m_i^(n) = 1 | y^(n), x^(n), v^t, w^t)
        = P(m_i^(n) = 1 | x^(n), v^t) · p(y^(n) | x^(n), m_i^(n) = 1, w_i^t) / Σ_{j=1}^{4} P(m_j^(n) = 1 | x^(n), v^t) · p(y^(n) | x^(n), m_j^(n) = 1, w_j^t)
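The posterior expert weight above is just a normalization of gate probability times expert likelihood; a minimal sketch, with invented names, assuming the gate and expert outputs are already computed:

```python
def expert_posteriors(gate_probs, expert_likelihoods):
    """Posterior responsibility h_i of each feature expert for one pair:
    h_i is proportional to p(m_i = 1 | x, v) * p(y | x, m_i = 1, w_i)."""
    joint = [g * e for g, e in zip(gate_probs, expert_likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]
```

With a uniform gate, the expert that best explains the observed label gets the largest posterior weight, which is what makes the weights interpretable per pair.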
Performance
• 162 features for the yeast physical PPI prediction task
• Features extracted in “detail” encoding
• Under “detail” encoding, the ranking method performs almost the same as RF (not shown)
Functional Expert Dominates
• 300 candidate protein pairs
• 51 predicted interactions
• 33 already validated
• 18 newly predicted
Figure: the frequency at which each of the four experts has the maximum contribution among validated and predicted pairs
Protein Complex
⇒ Group detection within the PPI network
• Proteins form stable associations with multiple protein binding partners (termed a “complex”)
• Each complex member interacts with part of the group, and the members work as a unit together
• Identification of these important sub-structures is essential to understanding activities in the cell
Identifying Complexes in the PPI Graph
• PPI network as a weighted undirected graph
– Edge weights derived from supervised PPI predictions
• Previous work
– Unsupervised, graph-clustering style
– All rely on the assumption that complexes correspond to dense regions of the network
• Related facts
– Many other topological structures are possible
– A small number of complexes are available from reliable experiments
– Complexes also have functional/biological properties (like weight / size / …)
Possible Topological Structures
• Make use of the small number of known complexes ⇒ supervised
• Model the possible topological structures ⇒ subgraph statistics
• Model the biological properties of complexes ⇒ subgraph features
[Figure: example subgraph topologies, edge weight color-coded]
Properties of a Subgraph
• Subgraph properties as features in the BN
– Various topological properties from the graph
– Biological attributes of complexes

| No. | Sub-Graph Property |
| 1 | Vertex size |
| 2 | Graph density |
| 3 | Edge weight avg / var |
| 4 | Node degree avg / max |
| 5 | Degree correlation avg / max |
| 6 | Clustering coefficient avg / max |
| 7 | Topological coefficient avg / max |
| 8 | First two eigenvalues |
| 9 | Fraction of edge weights > a certain cutoff |
| 10 | Complex member protein size avg / max |
| 11 | Complex member protein weight avg / max |
Modeling Complexes Probabilistically
⇒ Assume a probabilistic model (Bayesian network) for representing complex sub-graphs
• Bayesian network (BN)
– C: whether this subgraph is a complex (1) or not (0)
– N: number of nodes in the subgraph
– X_i: properties of the subgraph

L = log [ p(c = 1 | n, x₁, x₂, …, x_m) / p(c = 0 | n, x₁, x₂, …, x_m) ]
Modeling Complexes Probabilistically (II)
• BN parameters trained with MLE
– Trained from known complexes and randomly sampled non-complexes
– Continuous features are discretized
– A Bayesian prior smooths the multinomial parameters
• Evaluate candidate subgraphs with the log-ratio score L:

L = log [ p(c=1 | n, x₁, …, x_m) / p(c=0 | n, x₁, …, x_m) ]
  = log [ p(c=1) p(n | c=1) ∏_{k=1}^{m} p(x_k | n, c=1) ] − log [ p(c=0) p(n | c=0) ∏_{k=1}^{m} p(x_k | n, c=0) ]
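The log-ratio score factors as a Naïve-Bayes-style sum of log terms; a minimal sketch, assuming discretized feature values and pre-estimated conditional probability tables (all names invented):

```python
import math

def complex_log_ratio(n, xs, params):
    """Log-ratio score L of a candidate subgraph under the slide's BN:
    L = log p(c=1) p(n|c=1) prod_k p(x_k|n,c=1)
      - log p(c=0) p(n|c=0) prod_k p(x_k|n,c=0)

    params[c]: {'prior': p(c), 'p_n': {n: prob}, 'p_x': [ {value: prob}, ... ]}
    xs: discretized subgraph feature values x_1 .. x_m."""
    def log_side(c):
        p = params[c]
        lp = math.log(p["prior"]) + math.log(p["p_n"][n])
        for xk, table in zip(xs, p["p_x"]):
            lp += math.log(table[xk])
        return lp
    return log_side(1) - log_side(0)
```

A positive L favors the complex hypothesis; terms whose distributions are identical under both classes cancel out of the score.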
Experimental Setup
• Positive training data:
– Set 1: MIPS yeast complex catalog, a curated set of ~100 protein complexes
– Set 2: TAP05 yeast complex catalog, a reliable experimental set of ~130 complexes
– Complex size (number of nodes) follows a power law
• Negative training data:
– Generated from randomly selected nodes in the graph
– Size distribution follows the same power law as the positive complexes
Evaluation
• Train-test style (Set 1 & Set 2)
• Precision / recall / F1 measures
• A predicted cluster “detects” a known complex if

C / (A + C) ≥ p  and  C / (B + C) ≥ p

where A = number of proteins only in the cluster, B = number of proteins only in the complex, C = number of shared proteins, and the overlapping threshold p is set to, e.g., 50%
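The detection criterion above can be sketched directly; the function name is invented:

```python
def detects(cluster, complex_, p=0.5):
    """A predicted cluster 'detects' a known complex when the shared
    proteins C cover at least fraction p of the cluster (C/(A+C))
    and of the complex (C/(B+C))."""
    cluster, complex_ = set(cluster), set(complex_)
    c = len(cluster & complex_)
    return c > 0 and c / len(cluster) >= p and c / len(complex_) >= p
```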
Performance Comparison
• On the yeast predicted PPI graph (~2000 nodes)
• Compared to a popular complex detection package, MCODE (searches for highly interconnected regions)
• Compared to local search relying on density evidence only
• Compared to local search with a complex score from an SVM (also supervised)

| Method | Precision | Recall | F1 |
| Density | 0.180 | 0.462 | 0.253 |
| MCODE | 0.219 | 0.075 | 0.111 |
| SVM | 0.211 | 0.377 | 0.269 |
| BN | 0.266 | 0.513 | 0.346 |
Learning PPI Networks: Summary
[Diagram of the research line:
• Pairwise interactions → PPI network (PSB 05; PROTEINS 06; BMC Bioinfo 07; CCR 08)
• Protein complex (ISMB 08); pathway
• Human PPI (in revision, 08); HIV-human PPI (in revision)
• Domain/motif interactions; function implication, Func A → Func ? (Genome Biology 08)]
Inter-species Interactome
What are the interacting proteins between two organisms?
HIV-1 Host Protein Interactions
HIV-1 depends on the cellular machinery in every aspect of its life cycle: fusion, reverse transcription, transcription, budding, maturation.
(Peterlin and Trono, Nature Rev Immu 2003)
HIV-1 Host Protein Interactions (II)
[Figure: bipartite interaction network between HIV-1 proteins and human proteins]
FIN
Questions?