Computational Proteomics: Structure/Function Prediction & the Protein Interactome
Jaime Carbonell (jgc@cs.cmu.edu), with Betty Cheng, Yan Liu, Eric Xing, Yanjun Qi, Judith Klein-Seetharaman, and Oznur Tastan
Carnegie Mellon University, Pittsburgh PA, USA
December 2008

Simplified View of Biology
• Protein sequence → protein structure [figure from Nobelprize.org]

PROTEINS: Sequence → Structure → Function (borrowed from Judith Klein-Seetharaman)
• Primary sequence:
  MNGTEGPNFY PLNYILLNLA KPMSNFRFGE HFIIPLIVIF SDFGPIFMTI VPFSNKTGVV VADLFMVFGG NHAIMGVAFT FCYGQLVFTV PAFFAKTSAV RSPFEAPQYY FTTTLYTSLH WVMALACAAP KEAAAQQQES YNPVIYIMMN LAEPWQFSML GYFVFGPTGC PLVGWSRYIP ATTQKAEKEV KQFRNCMVTT AAYMFLLIML NLEGFFATLG EGMQCSCGID TRMVIIMVIA LCCGKNPLGD GFPINFLTLY GEIALWSLVV YYTPHEETNN FLICWLPYAG DEASTTVSKT VTVQHKKLRT LAIERYVVVC ESFVIYMFVV VAFYIFTHQG ETSQVAPA
• Folding → 3D structure → complex function within a network of proteins → Normal

PROTEINS: Sequence → Structure → Function
• Same primary sequence as above
• Folding → 3D structure → complex function within a network of proteins → Disease

Motivation: Protein Structure and Function Prediction
• Ultimate goal: Sequence → Function
  – ...and Function → Sequence (drug design, ...)
  – Potential active binding sites are a good start, but what about stability, external accessibility, energetics, ...?
• Intermediate goal: Sequence → Structure
  – Only 1.2% of proteins have been structurally resolved
  – What-if analysis (precursor of mutagenesis experiments)
• Machine Learning & Language Technology methods
  – Power tools to model and predict structure & function
  – Computational biology challenges are starting to drive new research in Machine Learning & Language Technologies

OUTLINE
• Motivation: sequence → structure → function
• Vocabulary-based classification approaches (Betty Cheng, Jaime Carbonell, Judith Klein-Seetharaman)
  – GPCR subfamily classification
  – Protein-protein coupling specificity
• Solving the "Folding Problem": machine learning approaches to structure prediction (Yan Liu, Jaime Carbonell, et al.)
  – Tertiary folds: β-helix prediction via segmented CRFs
  – Quaternary folds: viral adhesin and capsid complexes
• Conclusions and future directions

GPCR Super-family: G-Protein Coupled Receptors
• Transmembrane protein: seven transmembrane helices (I-VII), an extracellular N-terminus and extracellular loops, intracellular loops and a C-terminus
• Target of 60% of drugs (Moller, 2002)
• Involved in cancer, cardiovascular disease, Alzheimer's and Parkinson's diseases, stroke, diabetes, and inflammatory and respiratory diseases

Protein Family & Subfamily Classification (applied to GPCRs)
• Subfamily classification is based on pharmaceutical properties

Comparative Study (Karchin et al., 2002)
• Karchin et al. (2002) studied a range of classifiers of varied complexity for GPCR subfamily classification
• Traditionally, hidden Markov models, k-nearest neighbours and BLAST have been used; recently, more complicated classifiers have been applied: Support Vector Machines, neural nets, clustering, decision trees, Naïve Bayes
• SVMs are about the best among these, but what about the simple classifiers at the other end of the scale?
• Hypothesis: bio-vocabulary selection is crucial for subfamily classification (and protein-protein interaction prediction)
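To make the vocabulary idea concrete, here is a minimal Python sketch (not the authors' code; the sequence fragment and n-gram size are illustrative) of extracting amino-acid n-gram counts from a protein sequence, the basic feature type used in the classification experiments that follow.

```python
# Toy sketch: counting amino-acid n-grams as "vocabulary" features
# for protein sequence classification.
from collections import Counter

def ngram_counts(sequence: str, n: int = 2) -> Counter:
    """Count all overlapping n-grams of amino-acid letters in a sequence."""
    seq = sequence.replace(" ", "").upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))

if __name__ == "__main__":
    fragment = "MNGTEGPNFYVPFSNKTGVVRSPFEAPQYY"  # illustrative fragment only
    counts = ngram_counts(fragment, n=2)
    print(counts.most_common(5))
```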
Study "Segments" with Different Vocabularies
• Vocabularies: amino acids, chemical groups, properties of amino acids

Computing Chi-Square
• For each candidate feature x, compare the observed and expected number of sequences containing x in each class (a code sketch follows at the end of this subsection):
  $\chi^2(x) = \sum_{c \in C} \frac{\big(o(c,x) - e(c,x)\big)^2}{e(c,x)}, \qquad e(c,x) = \frac{n_c \, t_x}{N}$
  where o(c,x) is the observed number of sequences in class c with feature x, n_c is the number of sequences in class c, t_x is the number of sequences with feature x, and N is the total number of sequences.

Level I Subfamily Optimization
• [Figure: accuracy vs. number of selected features for Decision Trees and Naïve Bayes, with binary features and n-gram counts]

Level I Subfamily Results
Classifier    | # of Features                | Type of Features                                                                      | Accuracy
Naïve Bayes   | 5500-7700                    | Binary                                                                                | 93.0%
Naïve Bayes   | 3300-6900                    | N-gram counts                                                                         | 90.6%
Naïve Bayes   | All (9702)                   | N-gram counts                                                                         | 90.0%
SVM           | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 88.4%
BLAST         |                              | Local sequence alignment                                                              | 83.3%
Decision Tree | 900-2800                     | Binary                                                                                | 77.3%
Decision Tree | 700-5600                     | N-gram counts                                                                         | 77.3%
Decision Tree | All (9723)                   | N-gram counts                                                                         | 77.2%
SAM-T2K HMM   |                              | An HMM model built for each protein subfamily                                         | 69.9%
kernNN        | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 64.0%

Level II Subfamily Results
Classifier    | # of Features                | Type of Features                                                                      | Accuracy
Naïve Bayes   | 8100                         | Binary                                                                                | 92.4%
SVM           | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 86.3%
Naïve Bayes   | 5600                         | N-gram counts                                                                         | 84.2%
SVMtree       | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 82.9%
Naïve Bayes   | All (9702)                   | N-gram counts                                                                         | 81.9%
BLAST         |                              | Local sequence alignment                                                              | 74.5%
Decision Tree | 1200                         | N-gram counts                                                                         | 70.8%
Decision Tree | 2300                         | Binary                                                                                | 70.2%
SAM-T2K HMM   |                              | An HMM model built for each protein subfamily                                         | 70.0%
Decision Tree | All (9723)                   | N-gram counts                                                                         | 66.0%
kernNN        | 9 per match state in the HMM | Gradient of the log-likelihood that the sequence is generated by the given HMM model | 51.0%

[Figure: top 20 selected "words" for Class B GPCRs; they correlate with identified motifs. Helix 3 and helix 7 are known to be important for signal transduction; loop 1 is a suspected common binding site.]
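A minimal sketch of the chi-square feature score defined above, assuming binary presence/absence of an n-gram per sequence; the function and variable names are mine, not taken from the original system.

```python
# Toy sketch of the chi-square feature score used for n-gram selection:
# chi2(x) = sum_c (o(c,x) - e(c,x))^2 / e(c,x),  with  e(c,x) = n_c * t_x / N.
from collections import defaultdict

def chi_square(feature_presence, labels):
    """feature_presence: 0/1 flags (does each sequence contain feature x?);
    labels: class labels aligned with feature_presence."""
    N = len(labels)
    t_x = sum(feature_presence)          # sequences containing feature x
    n_c = defaultdict(int)               # sequences per class
    o_cx = defaultdict(int)              # observed count with x per class
    for present, c in zip(feature_presence, labels):
        n_c[c] += 1
        o_cx[c] += present
    score = 0.0
    for c in n_c:
        e_cx = n_c[c] * t_x / N          # expected count under independence
        if e_cx > 0:
            score += (o_cx[c] - e_cx) ** 2 / e_cx
    return score

# Example: a feature present mostly in class "A" gets a higher score.
flags  = [1, 1, 1, 0, 0, 0]
labels = ["A", "A", "A", "B", "B", "B"]
print(chi_square(flags, labels))
```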
Generalization to Other Superfamilies: Nuclear Receptors
Dataset            | Feature Type  | # of Features | Validation Accuracy | Testing Accuracy
Family             | Binary        | 1500-4200     | 96.96%              | 94.53%
Family             | N-gram counts | 400-4900      | 95.75%              | 91.79%
Level I Subfamily  | Binary        | 1500-3100     | 98.09%              | 97.77%
Level I Subfamily  | N-gram counts | 500-1100      | 93.95%              | 91.40%
Level II Subfamily | Binary        | 1500-2100     | 95.32%              | 93.62%
Level II Subfamily | N-gram counts | 3100-5600     | 86.39%              | 85.54%

G-Protein Coupling Specificity Problem
• Predict which one or more families of G-proteins a GPCR can couple with, given the GPCR sequence
• Locate the regions in the GPCR sequence where the majority of the coupling-specificity information lies

G-Protein Family | Function
Gs               | Activates adenylyl cyclase
Gi/o             | Inhibits adenylyl cyclase
Gq/11            | Activates phospholipase C
G12/13           | Unknown

N-gram Based Component
• Extract n-grams from all possible reading frames of the test sequence and count them
• Use a set of binary k-NN classifiers, one for each G-protein family, to predict whether the receptor couples to that family
• Predict coupling to family C if the k-NN output Pr(coupling to family C) is at or above a trained threshold; otherwise predict no coupling

Alignment-Based Component
• BLAST the test sequence and retrieve the K1 most similar sequences
• A set of binary decisions, one for each G-protein family: predict coupling to family C if more than x% of the retrieved sequences couple to C
• Two parameters: the number of neighbours K1 and the threshold x%

Our Hybrid Method: Combining Alignment and N-grams
• Apply the BLAST-based component with x% = 100%: if it predicts coupling to family C, accept that prediction
• Otherwise, fall back to the n-gram k-NN to decide coupling vs. no coupling

Evaluation Metrics & Dataset
• Confusion-matrix counts: A = predicted couplings that are true couplings, B = predicted couplings that are not, C = true couplings predicted as non-couplings, D = true non-couplings predicted as non-couplings
• $\text{Precision} = \frac{A}{A+B}$, $\text{Recall} = \frac{A}{A+C}$, $F_1 = \frac{2PR}{P+R}$, $\text{Accuracy} = \frac{A+D}{A+B+C+D}$
• Dataset from Cao et al. (2003): 81.3% of their training set, and the same test set

Results on the Cao et al. Dataset
Method     | N-gram Threshold | Prec  | Recall | F1
Hybrid     | 0.66             | 0.698 | 0.952  | 0.805
N-gram     | 0.34             | 0.658 | 0.794  | 0.719
Cao et al. |                  | 0.577 | 0.889  | 0.700
• The hybrid method outperformed Cao et al. in precision, recall and F1
• Suggests that alignment contains information not found in n-grams

Method              | Metric maximized | Prec  | Recall | F1
Whole-Seq Alignment | F1               | 0.779 | 0.841  | 0.809
Hybrid              | F1               | 0.775 | 0.873  | 0.821
Whole-Seq Alignment | Precision        | 0.793 | 0.730  | 0.760
Hybrid              | Precision        | 0.803 | 0.778  | 0.790
• Suggests that n-grams contain information not found in alignment

Feature Selection of N-grams
• A pre-processing step to remove noisy or redundant features that may confuse the classifier
• Many feature selection algorithms are available; chi-square was used because of its success in GPCR subfamily classification
• Pipeline: count all n-grams in the test sequence, apply chi-square feature selection, feed the selected n-gram counts to the k-NN classifier, and predict coupling to family C if Pr(coupling to family C) ≥ threshold
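A toy sketch of the hybrid coupling decision described above: accept the alignment-based vote when all retrieved BLAST neighbours couple to the family (x% = 100%), and otherwise fall back to the n-gram k-NN probability and its trained threshold. The inputs are assumed to be precomputed (neighbour labels from BLAST, a probability from the k-NN); this is an illustration, not the authors' implementation.

```python
# Toy sketch of the hybrid coupling rule: accept if ALL retrieved BLAST
# neighbours couple to the family (x% = 100%); otherwise fall back to the
# n-gram k-NN probability compared against a trained threshold.
def predict_coupling(neighbor_couples, ngram_probability, threshold):
    """neighbor_couples: booleans for the K1 most similar sequences
    (does each neighbour couple to family C?); ngram_probability: output
    of the n-gram k-NN for family C."""
    if neighbor_couples and all(neighbor_couples):
        return True                        # alignment component is unanimous
    return ngram_probability >= threshold  # n-gram component decides

# Example: neighbours disagree, so the n-gram k-NN output decides.
print(predict_coupling([True, True, False], ngram_probability=0.72, threshold=0.66))
```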
IC Domain Combination Analysis
• Of the 4 intracellular (IC) domains, the 2nd domain yielded the best F1, followed by the 1st, 3rd and 4th domains
• Most of the information in IC1 is already found in IC2
IC Domains | Prec  | Rec   | F1    | Acc
1          | 0.782 | 0.703 | 0.739 | 0.796
2          | 0.820 | 0.799 | 0.808 | 0.845
3          | 0.661 | 0.721 | 0.682 | 0.730
4          | 0.632 | 0.755 | 0.670 | 0.694
1, 2       | 0.820 | 0.805 | 0.811 | 0.847
1, 3       | 0.799 | 0.765 | 0.780 | 0.825
1, 4       | 0.780 | 0.755 | 0.765 | 0.807
2, 3       | 0.837 | 0.825 | 0.828 | 0.861
2, 4       | 0.828 | 0.816 | 0.821 | 0.853
3, 4       | 0.773 | 0.807 | 0.788 | 0.821
1, 2, 3    | 0.822 | 0.814 | 0.816 | 0.850
1, 2, 4    | 0.807 | 0.809 | 0.807 | 0.843
1, 3, 4    | 0.792 | 0.807 | 0.797 | 0.832
2, 3, 4    | 0.839 | 0.820 | 0.828 | 0.861
1, 2, 3, 4 | 0.824 | 0.813 | 0.817 | 0.853

Tertiary Protein Fold Prediction
• Protein function is strongly modulated by structure
• Predicting folds, domains and other regular structures requires modeling local and long-distance interactions in low-homology sequences
  – Long distance: not addressed by n-grams, HMMs, etc.
  – Low homology: not addressed by BLAST-style algorithms
• We focus on minimal mathematical structural modeling
  – Segmented conditional random fields
  – Layered graphical models
  – Fully trainable to recognize new instances of structures
• First acid test: β-helix super-secondary structure prediction (with data and guidance from Prof. J. King at MIT)

Protein Structure Determination
• Lab experiments: time, cost, uncertainty, ...
  – X-ray crystallography (months to crystallize, uncertain outcome); Nobel Prize, Kendrew & Perutz, 1962
  – NMR spectroscopy (only works for small proteins or domains); Nobel Prize, Kurt Wuthrich, 2002
• The gap between sequence and structure necessitates computational methods of protein structure determination
  – 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
  – [Figure: example structures 1MBN and 1BUS]

Predicting Protein Structures
• Protein structure is a key determinant of protein function
• Crystallography to resolve protein structures experimentally in vitro is very expensive, and NMR can only resolve very small proteins
• The gap between known protein sequences and structures (3,023,461 sequences vs. 36,247 resolved structures, 1.2%) means we need to predict structures in silico

Predicting Tertiary Folds
• Super-secondary structures: common protein domains and scaffolding patterns such as regular combinations of β-sheets and/or α-helices
• Our task: given a protein sequence, predict super-secondary structures and their components (e.g. β-helices and the location of each rung therein)
• Examples: parallel right-handed β-helix, leucine-rich repeats

Parallel Right-handed β-Helix
• Structure: a regular super-secondary structure, an elongated helix whose successive rungs are composed of β-strands, with a highly conserved T2 turn
• Computational importance: long-range interactions, repeat patterns
• Biological importance: functions such as the bacterial infection of plants, binding the O-antigen, etc.
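For reference, a small helper (a sketch, not the authors' evaluation script) computing the precision, recall, F1 and accuracy reported in the IC-domain table above, from the confusion-matrix counts A, B, C, D defined on the earlier evaluation slide; the example counts are made up.

```python
# Sketch of the evaluation measures used above, from the confusion-matrix
# counts: A = predicted & true couplings, B = predicted but not true,
# C = true but not predicted, D = correctly predicted non-couplings.
def evaluation_measures(A, B, C, D):
    precision = A / (A + B)
    recall = A / (A + C)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (A + D) / (A + B + C + D)
    return precision, recall, f1, accuracy

print(evaluation_measures(A=82, B=18, C=21, D=79))
```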
Conditional Random Fields
• Hidden Markov model (HMM) [Rabiner, 1989]:
  $P(\mathbf{x}, \mathbf{y}) = \prod_{i=1}^{N} P(x_i \mid y_i)\, P(y_i \mid y_{i-1})$
• Conditional random fields (CRFs) [Lafferty et al., 2001]:
  $P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_0} \exp\Big( \sum_{i=1}^{N} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, i, y_{i-1}, y_i) \Big)$
  – Model the conditional probability directly (discriminative models, directly optimizable)
  – Allow arbitrary dependencies on the observation
  – Adaptive to different loss functions and regularizers
  – Promising results in multiple applications
  – But need to scale up (computationally) and extend to long-distance dependencies

Our Solution: Conditional Graphical Models
• Capture both local dependencies and long-range dependencies
• Outputs Y = {M, {W_i}}, where each segment W_i = {p_i, q_i, s_i} has start position p_i, end position q_i and state s_i
• Feature definition:
  – Node feature: $f_k(w_i, \mathbf{x}) = f_{k'}(\mathbf{x}, p_i, q_i)\, I(s_i = s',\; q_i - p_i + 1 = d')$
  – Local interaction feature: $f_k(w_{i-1}, w_i, \mathbf{x}) = I(s_i = s,\; s_{i-1} = s',\; p_i = q_{i-1} + 1)$
  – Long-range interaction feature: $f_k(w_i, w_j, \mathbf{x}) = g_{k'}(\mathbf{x}, p_i, q_i, p_j, q_j)\, I(s_i = s,\; s_j = s')$

Linked Segmentation CRF
• Nodes: secondary structure elements and/or simple folds
• Edges: local interactions and long-range inter-chain and intra-chain interactions
• L-SCRF: the conditional probability of y given x is defined over the joint labels as
  $P(y^1, \dots, y^R \mid x^1, \dots, x^R) = \frac{1}{Z} \exp\Big( \sum_{y_{i,j} \in V_G} \sum_k \lambda_k f_k(x^i, y_{i,j}) + \sum_{(y_{i,j}, y_{a,b}) \in E_G} \sum_l \mu_l\, g_l(x^i, x^a, y_{i,j}, y_{a,b}) \Big)$

Linked Segmentation CRF (II)
• Classification: $y^* = \arg\max_{y} \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, y_c)$
• Training: learn the model parameters λ by maximizing the regularized log-likelihood
  $L(\lambda) = \sum_{c \in C_G} \sum_{k=1}^{K} \lambda_k f_k(\mathbf{x}, y_c) - \log Z - \frac{\|\lambda\|^2}{2\sigma^2}$
  – Iterative search seeks the direction in which the empirical feature values agree with their expectation:
  $\frac{\partial L}{\partial \lambda_k} = \sum_{c \in C_G} \big( f_k(\mathbf{x}, y_c) - E_{P(y \mid \mathbf{x})}[f_k(\mathbf{x}, y_c)] \big) - \frac{\lambda_k}{\sigma^2} = 0$
• Complex graphs result in huge computational complexity

Model Roadmap
• Generalized discriminative graphical models
  – Conditional random fields [Lafferty et al., 2001]
  – Beyond Markov dependencies: semi-Markov CRFs [Sarawagi & Cohen, 2005]
  – Long-range dependencies: segmentation CRFs (Liu & Carbonell, 2005)
  – Trade-off between local and long-range: chain graph model (Liu, Xing & Carbonell, 2006)
  – Inter-chain long-range dependencies: linked segmentation CRFs (Liu & Carbonell, 2007)

Tertiary Fold Recognition: β-Helix Fold
• [Figure: histogram and ranks for known β-helices against the PDB-minus dataset]
• The chain graph model reduces the real running time of the SCRF model by around 50 times

Fold Alignment Prediction: β-Helix
• [Figure: predicted alignment for known β-helices in cross-family validation]

Discovery of New Potential β-Helices
• Ran the structural predictor to seek potential β-helices in the (structurally unresolved) UniProt databases
  – The full list (98 new predictions) can be accessed at www.cs.cmu.edu/~yanliu/SCRF.html
• Verification on 3 proteins from different organisms whose structures were later experimentally resolved:
  – 1YP2: potato tuber ADP-glucose pyrophosphorylase
  – 1PXZ: the major allergen from cedar pollen
  – GP14 of Shigella bacteriophage as a β-helix protein
  – Not a single false positive!
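To illustrate the segmentation-CRF scoring above on a toy scale: a candidate labeling is a set of segments (p_i, q_i, s_i), its score sums weighted node features and long-range pair features, and the conditional probability normalizes over candidate segmentations. Real models search an exponentially large space with dynamic programming or approximate inference; this sketch simply enumerates two hand-made candidates, and all features, weights and values are invented for illustration.

```python
# Toy sketch of segmentation-CRF scoring: score(y) = sum_i w*f(node) +
# sum_(i,j) w*g(pair), and P(y|x) = exp(score) normalized over candidates.
# Everything here (features, weights, candidates) is illustrative only.
import math

def node_features(x, seg):
    p, q, s = seg                                    # segment start, end, state
    length = q - p + 1
    return {f"len_{s}": length,                      # segment length under state s
            f"hydroph_{s}": sum(x[p:q + 1]) / length}  # mean residue score

def pair_features(x, seg_i, seg_j):
    # Crude long-range interaction feature: similarity of the two segments.
    _, _, s_i = seg_i
    _, _, s_j = seg_j
    fi, fj = node_features(x, seg_i), node_features(x, seg_j)
    return {f"match_{s_i}_{s_j}": -abs(fi[f"hydroph_{s_i}"] - fj[f"hydroph_{s_j}"])}

def score(x, segmentation, weights):
    total = 0.0
    for seg in segmentation:
        total += sum(weights.get(k, 0.0) * v for k, v in node_features(x, seg).items())
    for i in range(len(segmentation)):
        for j in range(i + 1, len(segmentation)):
            total += sum(weights.get(k, 0.0) * v
                         for k, v in pair_features(x, segmentation[i], segmentation[j]).items())
    return total

def conditional_probabilities(x, candidates, weights):
    scores = [score(x, y, weights) for y in candidates]
    z = sum(math.exp(s) for s in scores)             # partition over the candidates
    return [math.exp(s) / z for s in scores]

# x: a per-residue numeric feature (e.g. a hydrophobicity index), toy values.
x = [0.2, 0.9, 0.8, 0.7, 0.1, 0.2, 0.9, 0.8, 0.6, 0.1]
candidates = [
    [(0, 3, "B1"), (5, 8, "B2")],                    # two beta-strand "rungs"
    [(0, 4, "B1"), (5, 9, "B2")],
]
weights = {"len_B1": 0.1, "len_B2": 0.1, "hydroph_B1": 1.0,
           "hydroph_B2": 1.0, "match_B1_B2": 2.0}
print(conditional_probabilities(x, candidates, weights))
```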
Predicting Quaternary Folds
• Triple β-spirals [van Raaij et al., Nature 1999]: virus fibers in adenovirus, reovirus and PRD1
• Double barrel trimer [Benson et al., 2004]: coat protein of adenovirus, PRD1, STIV, PBCV

Features for Protein Fold Recognition
• [Figure: feature types used for protein fold recognition]

Experiment Results: Quaternary Fold Recognition
• [Figures: recognition results for triple β-spirals and double barrel trimers]

Experiment Results: Alignment Prediction
• Triple β-spirals have four states: B1, B2, T1 and T2
• Correct alignment: B1: i - o, B2: a - h
• [Figure: predicted alignment of B1 and B2]

Experiment Results: Discovering New Membership Proteins
• Predicted membership proteins of triple β-spirals can be accessed at http://www.cs.cmu.edu/~yanliu/swissprot_list.xls
• Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005] were compared with the L-SCRF predictions

Conclusions & Challenges for Protein Structure/Function Prediction
• Methods from modern Machine Learning and Language Technologies really work in Computational Proteomics
  – Family/subfamily/sub-subfamily predictions
  – Protein-protein interactions (GPCRs and G-proteins)
  – Accurate tertiary & quaternary fold structural predictions
• Next generation of model sophistication...
• Addressing new challenges
  – Structure → Function: structural predictions combined with binding-site & specificity analysis
  – Predictive inversion: Function → Structure → Sequence for new hyper-specific drug design (anti-viral, oncology)

Proteins and Interactions
• Every function in the living cell depends on proteins
• Proteins are made of a linear sequence of amino acids folded into unique 3D structures
• Proteins can bind to other proteins physically, which enables them to carry out diverse cellular functions

Protein-Protein Interaction (PPI) Network
• PPIs play key roles in many biological systems
• A complete PPI network (naturally a graph) is
  – Critical for analyzing protein functions & understanding the cell
  – Essential for disease studies & drug discovery

PPI Biological Experiments
• Small-scale PPI experiments: one protein or several proteins at a time; a small amount of available data; an expensive and slow lab process
• Large-scale PPI experiments: hundreds or thousands of proteins at a time; noisy and incomplete data; little overlap among different sets
• A large portion of the PPIs is still missing or noisy!

Learning of PPI Networks
• Goal I: pairwise PPIs (the links of the PPI graph)
  – Most pairwise protein-protein interactions have not been identified, or are noisy
  – Missing-link prediction!
• Goal II: "complexes" (important groups)
  – Proteins often interact stably and perform functions together as one unit (a "complex")
  – Most complexes have not been discovered
  – Important-group detection!

Goal I: Missing Link Prediction
• [Figure: pairwise interactions assembled into the PPI network]

Related Biological Data
• Overall, four categories:
  – Direct high-throughput experimental data: two-hybrid screens (Y2H) and mass spectrometry (MS)
  – Indirect high-throughput data: gene expression, protein-DNA binding, etc.
  – Functional annotation data: Gene Ontology annotation, MIPS annotation, etc.
  – Sequence-based data sources: domain information, gene fusion, homology-based PPIs, etc.
• Utilize implicit evidence and the available direct experimental results together
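Since the slides treat the PPI network as a weighted graph, here is a minimal sketch (assuming the networkx library; the proteins and confidence weights are made up) of building such a graph and inspecting node degrees, the kind of hub analysis used later in the global graph analysis.

```python
# Toy sketch (made-up proteins and confidence weights; networkx assumed
# installed): a PPI network as a weighted undirected graph, with a quick
# look at node degrees to spot hub proteins.
import networkx as nx

edges = [  # (protein, protein, predicted interaction confidence)
    ("EGFR", "HCK", 0.91), ("EGFR", "DNM2", 0.84),
    ("RHO", "CXCL11", 0.77), ("EGFR", "GRB2", 0.95),
]

G = nx.Graph()
G.add_weighted_edges_from(edges)

# Degree distribution / hub analysis: proteins sorted by number of partners.
hubs = sorted(G.degree, key=lambda kv: kv[1], reverse=True)
print(hubs)              # e.g. [('EGFR', 3), ...]
print(nx.density(G))     # how densely connected the toy network is
```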
Related Data Evidence
• Attribute evidence for each protein: sequence, expression, structure, annotation, ...
• Relational evidence between proteins: synthetic lethality, ...
• [Figure: expanding single-protein attributes into relations between protein pairs]

Feature Vector for (Pairwise) Protein Pairs
• For data that already describe protein-protein pairs, use the values directly
• For data that describe a single protein (gene), calculate a (biologically meaningful) similarity between the two proteins for each evidence source
• Example: Protein A (sequence mtaaqaagee..., gene expression 233.94, 162.85, ...) and Protein B (sequence mrpsgtagaa..., gene expression 109.4, 975.3, ...) become pair A-B features: sequence similarity, gene-expression correlation coefficient, synthetic lethality, ...

Problem Setting
• For each protein-protein pair:
  – Target function: interacts or not?
  – Treat as a binary classification task
• Feature set:
  – Features are heterogeneous
  – Most features are noisy
  – Most features have missing values
• Reference set:
  – A small-scale PPI set as positive training data (hundreds to thousands of pairs)
  – No negative set (non-interacting pairs) available
  – Highly skewed class distribution: many more non-interacting pairs than interacting pairs (estimated 1 out of ~600 pairs in yeast; 1 out of ~1000 in human)

PPI Inference via ML Methods
• Jansen, R., et al., Science 2003: Bayes classifier
• Lee, I., et al., Science 2004: sum of log-likelihood ratios
• Zhang, L., et al., BMC Bioinformatics 2004: decision tree
• Bader, J., et al., Nature Biotech 2004: logistic regression
• Ben-Hur, A., et al., ISMB 2005: kernel method
• Rhodes, D.R., et al., Nature Biotech 2005: Naïve Bayes
• Present focus: Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006

Predicting Pairwise PPIs
• Prediction targets (three types): physical interaction, co-complex relationship, and pathway co-membership inference
• Feature encoding: (1) "detailed" style and (2) "summary" style; feature importance varies
• Classification methods: Random Forest & Support Vector Machine
• Details in the paper (Y. Qi, Z. Bar-Joseph, J. Klein-Seetharaman, Proteins 2006)

Human Membrane Receptors
• [Figure: ligands; Type I and Type II (GPCR) receptors and other membrane proteins, with extracellular, transmembrane and cytoplasmic regions; signal transduction cascades]

PPI Predictions for Human Membrane Receptors
• A combined approach (Y. Qi, et al., 2008):
  – Binary classification
  – Global graph analysis
  – Biological feedback & validation
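A toy sketch of the pair-feature construction described above: per-protein evidence is turned into pair features such as a gene-expression correlation coefficient and a crude sequence-similarity proxy. The similarity measures and data here are placeholders, not the actual features of the published system.

```python
# Toy sketch of building a feature vector for a protein pair from
# per-protein evidence: expression correlation + a crude sequence
# similarity proxy (shared 3-mers). Real systems use many more sources.
import numpy as np

def expression_correlation(expr_a, expr_b):
    """Pearson correlation of the two expression profiles."""
    return float(np.corrcoef(expr_a, expr_b)[0, 1])

def kmer_jaccard(seq_a, seq_b, k=3):
    """Crude sequence similarity: Jaccard overlap of k-mer sets."""
    ka = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    kb = {seq_b[i:i + k] for i in range(len(seq_b) - k + 1)}
    return len(ka & kb) / len(ka | kb)

def pair_features(protein_a, protein_b):
    return [expression_correlation(protein_a["expr"], protein_b["expr"]),
            kmer_jaccard(protein_a["seq"], protein_b["seq"]),
            float(protein_a.get("synthetic_lethal_with") == protein_b["name"])]

A = {"name": "A", "seq": "MTAAQAAGEEAGG", "expr": [233.9, 162.8, 310.2],
     "synthetic_lethal_with": "B"}
B = {"name": "B", "seq": "MRPSGTAGAAGEE", "expr": [109.4, 975.3, 220.1]}
print(pair_features(A, B))   # [expression corr., k-mer overlap, 1.0]
```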
Binary Classification
• Random Forest classifier (a minimal code sketch appears at the end of this subsection)
  – A collection of independent decision trees (an ensemble classifier)
  – Each tree is grown on a bootstrap sample of the training set
  – Within each tree, the split at each node is chosen from a bootstrap sample of the attributes
  – [Figure: example trees splitting on features such as TAP, GeneExpress, Y2H, GOProcess, GeneOccur, HMS-PCI, GOLocalization, ProteinExpress, SynExpress, Domain]
• Robust to noisy features
• Can handle different types of features

Classifier Comparison
• Compared classifiers on receptor PPI (sub-network) prediction vs. general human PPI prediction, using 27 features extracted from 8 different data sources, modified with biological feedback
• [Figure: comparison results]

Global Graph Analysis
• Degree distribution / hub analysis / disease checking
• Graph module analysis (from a bi-clustering study)
• Protein-family based graph patterns (receptors / receptor subclasses / ligands / etc.)

Global Graph Analysis (continued)
• Network analysis reveals interesting features of the human membrane-receptor PPI graph, for instance:
  – Two types of receptors: GPCR and non-GPCR (Type I)
  – GPCRs are less densely connected than non-GPCRs
  – [Figure: green = non-GPCR receptors, blue = GPCRs]

Experimental Validation
• Five predictions were chosen for experiments and three were verified:
  – EGFR with HCK (pull-down assay)
  – EGFR with Dynamin-2 (pull-down assay)
  – RHO with CXCL11 (functional assays, fluorescence spectroscopy, docking)
• Experiments performed at the U. Pitt School of Medicine; details in the paper (Y. Qi, et al., 2008)

Motivation
• The current situation of the PPI task:
  – Only a small positive (interacting) set is available
  – No negative (non-interacting) set is available
  – Highly skewed class distribution: many more non-interacting pairs than interacting pairs
  – The cost of misclassifying an interacting pair is higher than that of misclassifying a non-interacting pair
  – The accuracy measure is therefore not appropriate here
• Handle the task with ranking:
  – Rank the known positive pairs as high as possible
  – At the same time, be able to rank the unknown positive pairs high as well

Split Features into a Multi-View Representation
• Overall, four feature groups (Y. Qi, J. Klein-Seetharaman, Z. Bar-Joseph, BMC Bioinformatics 2007):
  – P (direct): direct high-throughput experimental data, i.e. two-hybrid screens (Y2H) and mass spectrometry (MS)
  – E (genomic): indirect high-throughput data, e.g. gene expression, protein-DNA binding
  – F (functional): functional annotation data, e.g. Gene Ontology annotation, MIPS annotation
  – S (sequence): sequence-based data sources, e.g. domain information, gene fusion, homology-based PPIs
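A minimal sketch of the Random Forest setup described at the start of this subsection, assuming scikit-learn and synthetic data: each pair is a heterogeneous, partly missing feature vector with an interact/not label; missing values are imputed crudely and the skewed classes are reweighted. This illustrates the classifier family, not the published pipeline.

```python
# Minimal sketch (scikit-learn assumed, synthetic data): Random Forest as a
# binary classifier over heterogeneous, partly-missing pair features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 27))           # e.g. 27 features from 8 data sources
X[rng.random(X.shape) < 0.2] = np.nan    # many feature values are missing
y = rng.random(500) < 0.05               # skewed: few interacting pairs

model = make_pipeline(
    SimpleImputer(strategy="median"),    # crude missing-value handling
    RandomForestClassifier(
        n_estimators=200,                # many bootstrap-sampled trees
        max_features="sqrt",             # random feature subset at each split
        class_weight="balanced",         # counter the skewed class distribution
        random_state=0),
)
model.fit(X, y)
print(model.predict_proba(X[:3])[:, 1])  # predicted interaction probabilities
```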
Mixture of Feature Experts (MFE)
• Make the protein interaction prediction by weighted voting over the four roughly homogeneous feature categories (P, E, F, S)
  – Treat each feature group as a prediction expert
  – The weights also depend on the input example
• A hidden variable M modulates the choice of expert (a toy sketch appears at the end of this subsection):
  $p(Y \mid X) = \sum_{M} p(Y \mid X, M)\, p(M \mid X)$

Mixture of Four Feature Experts
• Expert P: direct high-throughput PPI experimental data; Expert E: indirect high-throughput experimental data; Expert F: functional annotation of proteins; Expert S: sequence- or structure-based evidence
  $p(y^{(n)} \mid x^{(n)}) = \sum_{i=1}^{4} p(m_i^{(n)} = 1 \mid x^{(n)}, v)\; p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i)$
• The parameters (w_i, v) are trained using EM
• The experts and the root gate use logistic regression (with a ridge estimator)

Mixture of Four Feature Experts (continued)
• Handling missing values:
  – Add an additional feature column for each feature with low coverage
  – MFE uses the present/absent information when weighting the different feature groups
• The posterior weight for expert i in predicting pair n indicates the importance of that feature view (expert) for this specific pair:
  $h_i^{(n)} = \frac{P(m_i^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_i^{(n)} = 1, w_i^t)}{\sum_{j=1}^{4} P(m_j^{(n)} = 1 \mid x^{(n)}, v^t)\; p(y^{(n)} \mid x^{(n)}, m_j^{(n)} = 1, w_j^t)}$

Performance
• 162 features for the yeast physical PPI prediction task, extracted with the "detailed" encoding
• Under the "detailed" encoding, the ranking method performs almost the same as the Random Forest (not shown)

Functional Expert Dominates
• Of 300 candidate protein pairs, 51 interactions were predicted; 33 were already validated and 18 are newly predicted
• [Figure: the frequency with which each of the four experts makes the maximum contribution among validated and newly predicted pairs]

Protein Complexes: Group Detection within the PPI Network
• Proteins form stable associations with multiple protein binding partners (termed "complexes")
• A complex member interacts with part of the group, and the group works together as a unit
• Identifying these important sub-structures is essential to understanding activities in the cell

Identifying Complexes in the PPI Graph
• Treat the PPI network as a weighted undirected graph, with edge weights derived from supervised PPI predictions
• Previous work: unsupervised graph clustering, all relying on the assumption that complexes correspond to the dense regions of the network
• Related facts:
  – Many other topological structures are possible
  – A small number of complexes are available from reliable experiments
  – Complexes also have functional/biological properties (such as weight, size, ...)

Possible Topological Structures
• Make use of the small number of known complexes → supervised learning
• Model the possible topological structures → subgraph statistics
• Model the biological properties of complexes → subgraph features
• [Figure: example subgraphs with edge weights color-coded]
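A toy illustration of the MFE prediction rule given earlier in this subsection, p(y|x) = Σ_i p(m_i = 1 | x, v) · p(y | x, m_i = 1, w_i), with a softmax gate and logistic experts over the four views P, E, F, S. The weights here are random placeholders; the real model fits them with EM.

```python
# Toy sketch of the mixture-of-feature-experts prediction rule:
# p(y=1|x) = sum_i gate_i(x) * expert_i(x), with a softmax gate over the
# concatenated input and one logistic expert per feature view (P, E, F, S).
# Weights are random placeholders; the real model trains them with EM.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def mfe_probability(x_views, gate_w, expert_w):
    """x_views: dict view -> feature vector; gate_w: view -> weights over the
    concatenated input; expert_w: view -> weights over that view's features."""
    x_all = np.concatenate([x_views[v] for v in sorted(x_views)])
    gate_scores = np.array([gate_w[v] @ x_all for v in sorted(x_views)])
    gates = softmax(gate_scores)                        # p(m_i = 1 | x, v)
    experts = np.array([sigmoid(expert_w[v] @ x_views[v])
                        for v in sorted(x_views)])      # p(y | x, m_i = 1, w_i)
    return float(gates @ experts)

rng = np.random.default_rng(0)
views = {"P": rng.normal(size=3), "E": rng.normal(size=4),
         "F": rng.normal(size=2), "S": rng.normal(size=3)}
gate_w = {v: rng.normal(size=12) for v in views}        # 3 + 4 + 2 + 3 = 12 inputs
expert_w = {v: rng.normal(size=views[v].size) for v in views}
print(mfe_probability(views, gate_w, expert_w))
```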
Properties of a Subgraph
• Subgraph properties serve as features in the BN:
  – Various topological properties of the subgraph
  – Biological attributes of complexes
No. | Sub-Graph Property
1   | Vertex size
2   | Graph density
3   | Edge weight (average / variance)
4   | Node degree (average / maximum)
5   | Degree correlation (average / maximum)
6   | Clustering coefficient (average / maximum)
7   | Topological coefficient (average / maximum)
8   | First two eigenvalues
9   | Fraction of edge weights above a certain cutoff
10  | Complex member protein size (average / maximum)
11  | Complex member protein weight (average / maximum)

Model Complexes Probabilistically
• Assume a probabilistic model (a Bayesian network) for representing complex subgraphs
• Bayesian network (BN) with variables C, N and X_1, ..., X_m:
  – C: whether this subgraph is a complex (1) or not (0)
  – N: the number of nodes in the subgraph
  – X_i: properties of the subgraph
• Score a candidate subgraph with the log ratio
  $L = \log \frac{p(c = 1 \mid n, x_1, x_2, \dots, x_m)}{p(c = 0 \mid n, x_1, x_2, \dots, x_m)}$

Model Complexes Probabilistically (continued)
• The BN parameters are trained with MLE:
  – Trained from known complexes and randomly sampled non-complexes
  – Continuous features are discretized
  – A Bayesian prior smooths the multinomial parameters
• Candidate subgraphs are evaluated with the log-ratio score
  $L = \log \frac{p(c = 1 \mid n, x_1, \dots, x_m)}{p(c = 0 \mid n, x_1, \dots, x_m)} = \log \frac{p(c = 1)\, p(n \mid c = 1) \prod_{k=1}^{m} p(x_k \mid n, c = 1)}{p(c = 0)\, p(n \mid c = 0) \prod_{k=1}^{m} p(x_k \mid n, c = 0)}$

Experimental Setup
• Positive training data:
  – Set 1: the MIPS yeast complex catalog, a curated set of ~100 protein complexes
  – Set 2: the TAP05 yeast complex catalog, a reliable experimental set of ~130 complexes
  – Complex size (number of nodes) follows a power law
• Negative training data:
  – Generated from randomly selected nodes in the graph
  – The size distribution follows the same power law as the positive complexes

Evaluation
• Train-test style (Set 1 & Set 2); precision / recall / F1 measures
• A cluster "detects" a known complex if, with A = the number of proteins only in the cluster, B = the number of proteins only in the complex, and C = the number of proteins shared,
  $\frac{C}{A + C} \ge p \quad \text{and} \quad \frac{C}{B + C} \ge p$
  where the overlap threshold p is set to 50%

Performance Comparison
• On the yeast predicted PPI graph (~2000 nodes)
• Compared to a popular complex-detection package, MCODE (which searches for highly interconnected regions), to local search relying on density evidence only, and to local search with a complex score from an SVM (also supervised)
Method  | Precision | Recall | F1
Density | 0.180     | 0.462  | 0.253
MCODE   | 0.219     | 0.075  | 0.111
SVM     | 0.211     | 0.377  | 0.269
BN      | 0.266     | 0.513  | 0.346

Learning PPI Networks
• [Roadmap figure linking pairwise interactions, the PPI network, protein complexes, pathways, domain/motif interactions and function implication; associated papers: PSB 05, Proteins 06, BMC Bioinformatics 07, CCR 08, ISMB 08, Human PPI (in revision, 08), HIV-Human PPI (in revision), Genome Biology 08]

Inter-Species Interactome
• What are the interacting proteins between two organisms?

HIV-1 Host Protein Interactions
• HIV-1 depends on the cellular machinery in every aspect of its life cycle: fusion, reverse transcription, transcription, budding, maturation (Peterlin and Trono, Nature Reviews Immunology 2003)

HIV-1 Host Protein Interactions (continued)
• [Figure: network of human proteins and HIV-1 proteins]

FIN
• Questions?