Exciting Bioinformatics Adventures
Limsoon Wong
Institute for Infocomm Research

Plan
• Treatment optimization of childhood ALL
• Treatment prognosis of DLBC lymphoma
• Prediction of translation initiation site
• Prediction of vaccine target
• Reliability assessment of Y2H expts

Treatment Optimization of Childhood Leukemia
Image credit: FEER

Childhood ALL
• Major subtypes are: T-ALL, E2A-PBX, TEL-AML, MLL genome rearrangements, Hyperdiploid>50, BCR-ABL
• Diff subtypes respond differently to same Tx
• Over-intensive Tx
– Development of secondary cancers
– Reduction of IQ
• Under-intensive Tx
– Relapse
• The subtypes look similar
• Conventional diagnosis
– Immunophenotyping
– Cytogenetics
– Molecular diagnostics
• Unavailable in most ASEAN countries
Copyright © 2004, 2005 by Jinyan Li and Limsoon Wong

Single-Test Platform of Microarray & Machine Learning
Image credit: Affymetrix

Multidimensional Scaling Plot Subtype Diagnosis

Is there a new subtype?
• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

Conclusions
Conventional Tx:
• intermediate intensity to everyone
• 10% suffer relapse
• 50% suffer side effects
• costs US$150m/yr
Our optimized Tx:
• high intensity to 10%
• intermediate intensity to 40%
• low intensity to 50%
• costs US$100m/yr
Outcome:
• High cure rate of 80%
• Fewer relapses
• Fewer side effects
• Save US$51.6m/yr

References
• E.-J. Yeoh et al., “Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling”, Cancer Cell, 1:133-143, 2002

Treatment Prognosis for DLBC Lymphoma
Image credit: Rosenwald et al, 2002

Diffuse Large B-Cell Lymphoma
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
• DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• Intl Prognostic Index (IPI)
– age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
• Not very good for stratifying DLBC lymphoma patients for therapeutic trials
• Use gene-expression profiles to predict outcome of chemotherapy?
Copyright © 2005 by Limsoon Wong. Adapted from Huiqing Liu

Knowledge Discovery from Gene Expression of “Extreme” Samples
Figure (workflow): 240 samples, 7399 genes; “extreme” sample selection (< 1 yr vs > 8 yrs) gives 47 short-term survivors and 26 long-term survivors; knowledge discovery from gene expression gives 84 genes; 80 samples are scored.
• T is long-term if S(T) < 0.3
• T is short-term if S(T) > 0.7
Copyright © 2005 by Jinyan Li, Huiqing Liu, and Limsoon Wong

Kaplan-Meier Plot for 80 Test Cases
• p-value of log-rank test: < 0.0001
• Risk score thresholds: 0.7, 0.3

Improvement Over IPI
• (A) IPI low, p-value = 0.0063
• (B) IPI intermediate, p-value = 0.0003

Merit of “Extreme” Samples
• (A) W/o sample selection (p = 0.38)
• (B) With sample selection (p = 0.009)
• Without training sample selection, there is no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study

References
• H. Liu et al., “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382-392

Protein Translation Initiation Site Recognition

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
• What makes the second ATG the TIS?
Copyright © 2005 by Limsoon Wong

Approach
• Training data gathering
• Signal generation
– k-grams, distance, domain know-how, ...
• Signal selection
– Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
– SVM, ANN, PCL, CART, C4.5, kNN, ...

Amino-Acid Features

Amino Acid K-grams Discovered (by entropy)

Validation Results (on Hatzigeorgiou’s)
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

Validation Results (on Chr X and Chr 21)
Figure: our method compared with ATGpr
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s dataset

References
• L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192-200, 2002

Vaccine Target Prediction
Image credit: Asif Khan

T-Cell Epitope Prediction
• Why?
• Challenges:
– Only 1%-5% of peptides from a protein bind to any one HLA molecule
– Traditional approaches are slow, & inapplicable to large-scale screening
– There are ~2000 variants of HLA classified in ~20 supertypes
– Relatively little expt data on peptides that bind HLA molecules; for the majority of HLA molecules, expt data do not exist

Computer Modeling
– Enable systematic screening for HLA binders
– Minimize number of expts
– Reduce cost 10x
Figure: peptides P1-P4 (promiscuous peptides) and HLA molecules H1-H4 (one supertype)
Copyright © 2005 by Limsoon Wong. Adapted from Asif Khan.

Multipred Approach
Copyright © 2005 by Asif Khan, Guanglan Zhang, Vladimir Brusic

Expt Validation
Figure: FP and FN relative to the cut-off threshold; HCV IB protein sequence, DR supertype

Accuracy of Multipred
       A-0201  A-0202  A-0204  A-0205  A-0206  average
ANN    0.87    0.76    0.88    0.93    0.91    0.87
HMM    0.93    0.73    0.92    0.94    0.88    0.88
SVM    0.90    0.81    0.93    0.97    0.85    0.89

Conclusions
• Computer models are necessary to aid in identification of vaccine targets
• Prediction models built are both sensitive and specific
• MULTIPRED can identify promiscuous peptides and immunological hot-spots which are useful for vaccine design
• Hot-spots are ideal for development of epitope-based vaccines

References
• K.N. Srinivasan et al., “Predictions of Class I T-cell epitopes: Evidence of presence of immunological hot spots inside antigens”, Bioinformatics, 20:i297-i302, 2004

Assessing Reliability of Protein-Protein Interaction Expts
Figure: % of TP based on shared cellular role (I = .95); % of TP based on shared cellular role (I = 1); % of TP based on co-localization; TP = ~50%
Image credit: Sprinzak et al, 2003

Some Protein Interaction Data Sets
• Large disagreement between methods
• Can we find a way to rank candidate interacting pairs according to their reliability?
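
As a rough illustration of what ranking pairs by reliability could look like, the sketch below scores each reported pair by the length of the shortest alternative path between the two proteins once the direct edge is ignored. This scoring rule, the function names `shortest_alternative_path` and `rank_pairs`, and the toy protein names are all assumptions made for illustration; this is not the ipr measure of Chen et al, whose definition is not given in these slides.

```python
from collections import deque


def shortest_alternative_path(edges, u, v):
    """Return the length (in edges) of the shortest u-v path that avoids
    the direct edge u-v, or None if no such path exists.
    Hypothetical helper: an illustration, not the published ipr formula."""
    # Build an undirected adjacency map from the reported pairs.
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if node == u and nxt == v:
                continue  # ignore the direct interaction itself
            if nxt == v:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def rank_pairs(edges):
    """Order pairs so that those with a short alternative path come first
    and those with no alternative path come last."""
    scored = [(pair, shortest_alternative_path(edges, *pair)) for pair in edges]
    return sorted(scored, key=lambda item: (item[1] is None, item[1] or 0))


# Toy interaction list (made-up protein names).
toy_edges = [("A", "B"), ("A", "C"), ("C", "B"), ("B", "D"), ("D", "E")]
for pair, length in rank_pairs(toy_edges):
    print(pair, length)
```

In the toy network, the three pairs inside the A-B-C triangle each have an alternative path of length 2 and rank first, while B-D and D-E have no alternative path and rank last.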
Copyright © 2005 by Limsoon Wong. Adapted from Sprinzak et al, 2003

Some “Reasonable” Speculations
• A true interacting pair is often connected by at least one alternative path (reason: a biological function is performed by a highly interconnected network of interactions)
• The shorter the alternative path, the more likely the interaction (reason: evolution of life is through “add-on” interactions of other or newer folds onto existing ones)
• Existence of a strong short alternative path connecting an interacting pair indicates that the interaction is “reliable”
Copyright © 2005 by Limsoon Wong. Adapted from Chen et al, 2004

Interaction Pathway Reliability

Evaluation wrt Reproducible Interactions
• The number of pairs not in the intersection of Ito & Uetz does not change much wrt the ipr value of the pairs
• The number of pairs in the intersection of Ito & Uetz increases wrt the ipr value of the pairs
• “ipr” correlates well to “reproducible” interactions
• “ipr” seems to work

Evaluation wrt Common Cellular Role, etc
• “ipr” correlates well to common cellular roles, localization, & expression
• At the ipr threshold that eliminated 80% of pairs, ~85% of the remaining pairs have common cellular roles

Evaluation wrt “Many-few” Interactions
Figure: part of the network of physical interactions reported by Ito et al., PNAS, 2001
• Number of “Many-few” interactions increases when a more “reliable” IPR threshold is used to filter interactions
• Consistent with the Maslov-Sneppen prediction

Evaluation wrt “Cross-Talkers”
• A MIPS functional cat:
– 02 | ENERGY
– 02.01 | glycolysis and gluconeogenesis
– 02.01.01 | glycolysis methylglyoxal bypass
– 02.01.03 | regulation of glycolysis & gluconeogenesis
• First 2 digits give the top cat
• Other digits add more granularity to the cat
• Compare non-colocalized high- & low-IPR pairs to find the number that fall into the same cat. If more high-IPR pairs are in the same cat, then IPR works
• For top cat
– 148/257 high-IPR pairs are in same cat
– 65/260 low-IPR pairs are in same cat
• For fine-granularity cat
– 135/257 high-IPR pairs are in same cat
– 37/260 low-IPR pairs are in same cat
• IPR works
• IPR pairs that are not co-localized are real cross-talkers!

Conclusions
• There are latent local & global “motifs” that indicate the likelihood of protein interactions
• These motifs can be exploited in computational elimination of false positives from high-throughput Y2H expts

References
• J. Chen et al., “Mining high-throughput experimental data for reliable protein interaction data using network”, 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Florida, November 15-17, 2004

Acknowledgements
• Childhood ALL:
– Jinyan Li, Huiqing Liu
– Allen Yeoh
• DLBC Lymphoma:
– Jinyan Li, Huiqing Liu
• Translation Initiation:
– Fanfan Zeng, Roland Yap
– Huiqing Liu
• T-Cell Epitopes:
– Vladimir Brusic, Asif Khan, Guanglan Zhang
– Tom August, KN Srinivasan
• Protein Interaction Reliability:
– Jin Chen, Mong Li Lee, Wynne Hsu
– See-Kiong Ng
– Prasanna Kolatkar, JerMing Chia
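
Note on the cross-talker counts: the comparison of high-IPR and low-IPR pairs in the “Cross-Talkers” evaluation comes down to two proportions per category level. The short sketch below only restates the counts quoted on that slide as rates; the function name `same_category_rate` and the dictionary layout are illustrative choices, not part of the original analysis.

```python
def same_category_rate(in_same_cat, total):
    """Fraction of non-colocalized pairs that share a MIPS functional cat."""
    return in_same_cat / total


# Counts as quoted on the "Cross-Talkers" slide: (pairs in same cat, all pairs).
counts = {
    "top cat": {"high": (148, 257), "low": (65, 260)},
    "fine-granularity cat": {"high": (135, 257), "low": (37, 260)},
}

for level, groups in counts.items():
    high = same_category_rate(*groups["high"])
    low = same_category_rate(*groups["low"])
    print(f"{level}: high-IPR {high:.1%}, low-IPR {low:.1%}, ratio {high / low:.2f}")
```

At the top category level this is roughly 58% of high-IPR pairs against 25% of low-IPR pairs; at the fine-granularity level, roughly 53% against 14%.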