From Informatics to Bioinformatics: The Knowledge Discovery Perspective
Limsoon Wong
Institute for Infocomm Research, Singapore
Copyright 2003 limsoon wong

Plan
• Overview of recent knowledge discovery successes in bioinformatics
• Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy
• Recognition of translation initiation sites from DNA sequences

Overview of recent knowledge discovery successes in bioinformatics

What is Datamining?
Jonathan’s blocks / Jessica’s blocks
Whose block is this?
Jonathan’s rules: Blue or Circle
Jessica’s rules: All the rest

What is Datamining?
Question: Can you explain how?

What is Bioinformatics?

Bioinformatics brings benefits
• To the patient: Better drug, better treatment
• To the pharma: Save time, save cost, make more $
• To the scientist: Better science

To figure these out, we bet on...
“solution” = Data Mgmt + Knowledge Discovery
Data Mgmt = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

History: 8 years of bioinformatics R&D in Singapore
[Timeline figure, 1994 to 2002 (ISS, KRDL, LIT/I2R): Integration Technology (Kleisli); MHC-Peptide Binding (PREDICT); Protein Interactions Extraction (PIES); Cleansing & Warehousing (FIMM); Gene Expression & Medical Record Datamining (PCL); Gene Feature Recognition (Dragon); Venom Informatics; GeneticXchange; Molecular Connections; Biobase]

Predict Epitopes, Find Vaccine Targets
• Vaccines are often the only solution for viral diseases
• Finding & developing effective vaccine targets (epitopes) is a slow and expensive process

Recognize Functional Sites, Help Scientists
• Effective recognition of initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
• Data mining of bio seqs to find rules for recognizing & understanding functional sites
Dragon’s 10x reduction of TSS recognition false positives

Diagnose Leukaemia, Benefit Children
• Childhood leukaemia is a heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different experts are needed for diagnosis
Curable in USA, fatal in Indonesia

Understand Proteins, Fight Diseases
• Understanding function and role of protein needs organised info on interaction pathways
• Such info is often reported in scientific papers but is seldom found in structured databases
• Knowledge extraction system to process free text
  – extract protein names
  – extract interactions

Risk assignment of childhood ALL patients to optimize risk-benefit ratio of therapy

Childhood ALL: Heterogeneous Disease
• Major subtypes are
  – T-ALL
  – E2A-PBX1
  – TEL-AML1
  – MLL genome rearrangements
  – Hyperdiploid>50
  – BCR-ABL

Childhood ALL: Treatment Failure
• Overly intensive treatment leads to
  – Development of secondary cancers
  – Reduction of IQ
• Insufficiently intensive treatment leads to
  – Relapse

Childhood ALL: Risk-Stratified Therapy
• Different subtypes respond differently to the same treatment intensity
[Figure: risk spectrum. Labels, in slide order: Generally good-risk, lower intensity; TEL-AML1, Hyperdiploid>50; T-ALL; Generally high-risk, higher intensity; E2A-PBX1; BCR-ABL, MLL]
Match patient to optimum treatment intensity for his subtype & prognosis

Childhood ALL: Risk Assignment
• The major subtypes look similar
• Conventional diagnosis requires
  – Immunophenotyping
  – Cytogenetics
  – Molecular diagnostics

Mission
• Conventional risk assignment procedure requires difficult expensive tests and collective judgement of multiple specialists
• Generally available only in major advanced hospitals
Can we have a single-test easy-to-use platform instead?

Single-Test Platform of Microarray & Machine Learning

Overall Strategy
Diagnosis of subtype → Subtype-dependent prognosis → Risk-stratified treatment intensity
• For each subtype, select genes to develop classification model for diagnosing that subtype
• For each subtype, select genes to develop prediction model for prognosis of that subtype

Childhood ALL Subtype Diagnosis by PCL
• Gene expression data collection
• Gene selection by χ²
• Classifier training by emerging pattern
• Classifier tuning (optional for some machine learning methods)
• Apply classifier for diagnosis of future cases by PCL

Childhood ALL Subtype Diagnosis: Our Workflow
A tree-structured diagnostic workflow was recommended by our doctor collaborator

Childhood ALL Subtype Diagnosis: Training and Testing Sets

Childhood ALL Subtype Diagnosis: Signal Selection Basic Idea
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance

Childhood ALL Subtype Diagnosis: Signal Selection by χ²

Childhood ALL Subtype Diagnosis: Emerging Patterns
• An emerging pattern is a set of conditions
  – usually involving several features
  – that most members of a class satisfy
  – but none or few of the other class satisfy
• A jumping emerging pattern is an emerging pattern that
  – some members of a class satisfy
  – but no members of the other class satisfy
• We use only jumping emerging patterns

Childhood ALL Subtype Diagnosis: PCL (Prediction by Collective Likelihood)

Childhood ALL Subtype Diagnosis: Accuracy of PCL (vs. other classifiers)
The classifiers are all applied to the 20 genes selected by χ² at each level of the tree

Multidimensional Scaling Plot: Subtype Diagnosis

Multidimensional Scaling Plot: Subtype-Dependent Prognosis
• Similar computational analysis was carried out to predict relapse and/or secondary AML in a subtype-specific manner
• >97% accuracy achieved

Childhood ALL: Is there a new subtype?
• Hierarchical clustering of gene expression profiles reveals a novel subtype of childhood ALL

Childhood ALL: Cure Rates in ASEAN Countries
• Conventional risk assignment procedure requires difficult expensive tests and collective judgement of multiple specialists
Not available in less advanced ASEAN countries
[Bar chart: cure rate (axis 0% to 80%) for Cambodia, Vietnam, Thailand, Philippines, Indonesia, Malaysia, Singapore]

Childhood ALL: Treatment Cost
• Treatment for childhood ALL over 2 yrs
  – Intermediate intensity: US$60k
  – Low intensity: US$36k
  – High intensity: US$72k
• Treatment for relapse: US$150k
• Cost for side-effects: Unquantified

Childhood ALL in ASEAN Countries: Current Situation (2000 new cases/yr)
• Intermediate intensity conventionally applied in less advanced ASEAN countries
Over intensive for 50% of patients, thus more side effects
Under intensive for 10% of patients, thus more relapse
5-20% cure rates
• US$120m (US$60k * 2000) for intermediate intensity treatment
• US$30m (US$150k * 2000 * 10%) for relapse treatment
• Total US$150m/yr plus un-quantified costs for dealing with side effects

Childhood ALL in ASEAN Countries: Using Our Platform (2000 new cases/yr)
• Low intensity applied to 50% of patients
• Intermediate intensity to 40% of patients
• High intensity to 10% of patients
Reduced side effects
Reduced relapse
75-80% cure rates
• US$36m (US$36k * 2000 * 50%) for low intensity
• US$48m (US$60k * 2000 * 40%) for intermediate intensity
• US$14.4m (US$72k * 2000 * 10%) for high intensity
• Total US$98.4m/yr
Save US$51.6m/yr

Acknowledgements

Recognition of translation initiation sites from DNA sequences

Translation Initiation Site

A Sample mRNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
(position markers on the slide: 80, 160, 240)
What makes the second ATG the translation initiation site?

Translation Initiation Site Recognition: Steps of a General Approach
• Training data gathering
• Signal generation
  k-grams, colour, texture, domain know-how, ...
• Signal selection
  Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Site Recognition: Training & Testing Data
• Vertebrate dataset of Pedersen & Nielsen [ISMB’97]
• 3312 sequences
• 13503 ATG sites
• 3312 (24.5%) are TIS
• 10191 (75.5%) are non-TIS
• Use for 3-fold x-validation expts

Translation Initiation Site Recognition: Signal Generation
• K-grams (i.e., k consecutive letters)
  – K = 1, 2, 3, 4, 5, …
  – Window size vs. fixed position
  – Up-stream, downstream vs. anywhere in window
  – In-frame vs. any frame
[Bar chart: counts (axis 0 to 3) of A, C, G, T for seq1, seq2, seq3]

Signal Generation: An Example
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
(position markers on the slide: 80, 160, 240)
• Window = 100 bases
• In-frame, downstream
  – GCT = 1, TTT = 1, ATG = 1, …
• Any-frame, downstream
  – GCT = 3, TTT = 2, ATG = 2, …
• In-frame, upstream
  – GCT = 2, TTT = 0, ATG = 0, ...

Signal Generation: Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms

Translation Initiation Site Recognition: Signal Selection (e.g., χ²)

Translation Initiation Site Recognition: Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
  – Correlation-based Feature Selection
  – A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other

Signal Selection: Sample k-grams Selected
• Position –3 (Kozak consensus)
• in-frame upstream ATG (Leaky scanning)
• in-frame downstream
  – TAA, TAG, TGA (Stop codon)
  – CTG, GAC, GAG, and GCC (Codon bias)

Translation Initiation Site Recognition: Signal Integration
• kNN
  Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM
  Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, PCL, ...
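
Before any of the classifiers above can be applied, each candidate ATG has to be turned into k-gram counts. The Python sketch below is an illustration only, not the code used for the experiments. The function name kgram_counts and its parameters are made up here. The sequence is the 239-base excerpt shown in the Signal Generation example. Two conventions are assumed rather than taken from the slides: the downstream window starts right after the ATG, and "in-frame" means the k-gram starts a multiple of 3 bases away from the ATG.

```python
from collections import Counter

# 239-base excerpt of the sample mRNA (HSU27655.1) as shown on the slide.
SEQ = (
    "CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG"
    "CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA"
    "GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA"
    "CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT"
)


def kgram_counts(seq, atg, k=3, window=100, upstream=False, in_frame=True):
    """Count k-grams in a window next to the ATG that starts at index `atg`.

    Assumed conventions (hypothetical, chosen for this sketch):
    - upstream window   = the `window` bases ending just before the ATG
    - downstream window = the `window` bases starting just after the ATG
    - in-frame k-grams start at an offset from the ATG divisible by 3
    """
    if upstream:
        lo, hi = max(0, atg - window), atg
    else:
        lo, hi = atg + 3, min(len(seq), atg + 3 + window)
    counts = Counter()
    for j in range(lo, hi - k + 1):
        if in_frame and (j - atg) % 3 != 0:
            continue  # skip k-grams that are out of frame with the ATG
        counts[seq[j:j + k]] += 1
    return counts


# The second ATG in the excerpt is the annotated translation initiation site.
first_atg = SEQ.find("ATG")
tis = SEQ.find("ATG", first_atg + 1)

settings = [
    ("in-frame, downstream ", dict(upstream=False, in_frame=True)),
    ("any-frame, downstream", dict(upstream=False, in_frame=False)),
    ("in-frame, upstream   ", dict(upstream=True, in_frame=True)),
]
for label, kwargs in settings:
    c = kgram_counts(SEQ, tis, **kwargs)
    print(label, {gram: c[gram] for gram in ("GCT", "TTT", "ATG")})
```

With these conventions the three printed lines agree with the counts listed in the example for GCT, TTT and ATG: 1, 1, 1 for in-frame downstream; 3, 2, 2 for any-frame downstream; 2, 0, 0 for in-frame upstream.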

Translation Initiation Site Recognition: Results (on Pedersen & Nielsen’s mRNA)

                  TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes       84.3%          86.1%          66.3%          85.7%
SVM               73.9%          93.2%          77.9%          88.5%
Neural Network    77.6%          93.2%          78.8%          89.4%
Decision Tree     74.0%          94.4%          81.1%          89.4%

Translation Initiation Site Recognition: mRNA→protein
How about using k-grams from the translation?
[Figure: genetic code chart mapping codons to amino acids and stop]

Signal Generation: Amino-Acid Features

Signal Selection: Amino Acid K-grams Discovered

Translation Initiation Site Recognition: Results (based on amino acid features)
Performance based on amino-acid features is better than performance based on DNA seq. features
[Figures: the two result tables being compared]

Acknowledgements
• Huiqing Liu
• Jinyan Li
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen

To give this lecture to SMA students.
Date: 28 Oct 2003
Time: 10-11.30am
Venue: Video Conference Room, S15-04-30
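
Note on the four measures in the DNA-based results table: TP/(TP + FN) is sensitivity, TN/(TN + FP) is specificity, TP/(TP + FP) is precision, and accuracy is (TP + TN) over all ATG sites. As a rough consistency check, the last two columns can be recovered from the first two together with the class counts of the dataset (3312 TIS, 10191 non-TIS). The Python sketch below does this. The function name derived_metrics is made up here, and pooling all sites into one confusion matrix is an assumption, since the reported figures come from 3-fold cross-validation.

```python
def derived_metrics(sensitivity, specificity, n_pos, n_neg):
    """Recover precision and accuracy from sensitivity and specificity.

    Assumes (for illustration) a single pooled confusion matrix over
    n_pos positive (TIS) and n_neg negative (non-TIS) ATG sites.
    """
    tp = sensitivity * n_pos      # true positives
    tn = specificity * n_neg      # true negatives
    fp = n_neg - tn               # false positives
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (n_pos + n_neg)
    return precision, accuracy


N_TIS, N_NON_TIS = 3312, 10191  # class counts from the dataset slide

# (sensitivity, specificity) pairs copied from the results table.
reported = {
    "Naive Bayes": (0.843, 0.861),
    "SVM": (0.739, 0.932),
    "Neural Network": (0.776, 0.932),
    "Decision Tree": (0.740, 0.944),
}

for name, (sens, spec) in reported.items():
    precision, accuracy = derived_metrics(sens, spec, N_TIS, N_NON_TIS)
    print(f"{name:15s} precision={precision:.1%} accuracy={accuracy:.1%}")
```

Under that pooling assumption the recomputed precision and accuracy round to the values in the table, which suggests the four columns are mutually consistent.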