From Datamining to Bioinformatics
  Limsoon Wong
  Laboratories for Information Technology
  Singapore

What is Bioinformatics?

Themes of Bioinformatics
  Bioinformatics = Data Mgmt + Knowledge Discovery
  Data Mgmt = Integration + Transformation + Cleansing
  Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
  To the patient: Better drug, better treatment
  To the pharma: Save time, save cost, make more $
  To the scientist: Better science

From Informatics to Bioinformatics
  8 years of bioinformatics R&D in Singapore
  [Timeline figure; year marks 1994, 1996, 1998, 2000, 2002; organisations ISS, KRDL, LIT]
    Integration Technology (Kleisli)
    MHC-Peptide Binding (PREDICT)
    Protein Interactions Extraction (PIES)
    Gene Expression & Medical Record Datamining (PCL)
    Cleansing & Warehousing (FIMM)
    Gene Feature Recognition (Dragon)
    Venom Informatics

Quick Samplings

Epitope Prediction
  TRAP-559AA
    MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
    EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
    LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
    LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
    TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
    FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
    TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
    CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
    IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
    KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
    QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
    RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
    KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
    GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
  Prediction by our ANN model for HLA-A11
    29 predictions
    22 epitopes
    76% specificity
  Prediction by BIMAS matrix for HLA-A*1101
    [Figure: number of experimental binders by BIMAS rank (axis marks 1, 66, 100); labels 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
    age  sex  chol  ecg   heart  sick
    49   M    266   Hyp   171    N
    64   M    211   Norm  144    N
    58   F    283   Hyp   162    N
    58   M    284   Hyp   160    Y
    58   M    224   Abn   173    Y
  Looking for patterns that are
    valid
    novel
    useful
    understandable

Gene Expression Analysis
  Classifying gene expression profiles
    find stable differentially expressed genes
    find significant gene groups
    derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
  PCL, a novel “emerging pattern” method
  Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
  Works well for gene expressions
  Cancer Cell, March 2002, 1(2)

Behind the Scene
  Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
  Allen Chong, Judice Koh, SPT Krishnan, Huiqing Liu, Seng Hong Seah, Soon Heng Tan, Guanglan Zhang, Zhuo Zhang
  and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators...

Questions?

A More Detailed Account

What is Datamining?
  Jonathan’s blocks
  Jessica’s blocks
  Whose block is this?
  Jonathan’s rules: Blue or Circle
  Jessica’s rules: All the rest

What is Datamining?
  Question: Can you explain how?

The Steps of Data Mining
  Training data gathering
  Signal generation
    k-grams, colour, texture, domain know-how, ...
  Signal selection
    Entropy, χ², CFS, t-test, domain know-how, ...
  Signal integration
    SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample cDNA
  299 HSU27655.1 CAT U27655 Homo sapiens
    CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
    CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
    GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
    CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
    ............................................................
    ................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
  (Position markers in the original layout: 80, 160, 240)
  What makes the second ATG the translation initiation site?

Signal Generation
  K-grams (i.e., k consecutive letters)
    K = 1, 2, 3, 4, 5, ...
    Window size vs. fixed position
    Up-stream, downstream vs. anywhere in window
    In-frame vs. any frame
  [Figure: bar chart over A, C, G, T for seq1, seq2, seq3; vertical axis 0 to 3]

Too Many Signals
  For each value of k, there are 4^k * 3 * 2 k-grams
  If we use k = 1, 2, 3, 4, 5, we have 4 + 24 + 96 + 384 + 1536 + 6144 = 8188 features!
  This is too many for most machine learning algorithms

Signal Selection (Basic Idea)
  Choose a signal w/ low intra-class distance
  Choose a signal w/ high inter-class distance
  Which of the following 3 signals is good?

Signal Selection (e.g., t-statistics)

Signal Selection (e.g., MIT-correlation)

Signal Selection (e.g., χ²)

Signal Selection (e.g., CFS)
  Instead of scoring individual signals, how about scoring a group of signals as a whole?
  CFS
    A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
  Homework: find a formula that captures the key idea of CFS above

Sample k-grams Selected
  Position –3 (Kozak consensus)
  in-frame upstream ATG (leaky scanning)
  in-frame downstream
    TAA, TAG, TGA (stop codon)
    CTG, GAC, GAG, and GCC (codon bias)

Signal Integration
  kNN
    Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
  SVM
    Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
  Naïve Bayes, ANN, C4.5, ...

Results (on Pedersen & Nielsen’s mRNA)
                    TP/(TP + FN)  TN/(TN + FP)  TP/(TP + FP)  Accuracy
    Naïve Bayes     84.3%         86.1%         66.3%         85.7%
    SVM             73.9%         93.2%         77.9%         88.5%
    Neural Network  77.6%         93.2%         78.8%         89.4%
    Decision Tree   74.0%         94.4%         81.1%         89.4%

Acknowledgements
  Roland Yap
  Zeng Fanfan
  A.G. Pedersen
  H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle
  Consider this scenario
    Given classes C1 and C2 w/ explicit signals
    Use χ² on C1 and C2 to select signals s1, s2, s3
    Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
  Is the accuracy really 90%?
  What can be wrong with this?
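
One way to probe the scenario above is to run the same protocol on pure noise, where an honest accuracy estimate should stay near 50%. The Python sketch below is an illustration under stated assumptions, not code from the talk: it swaps in a t-like score as a stand-in for χ², uses a simple nearest-centroid classifier, and picks arbitrary sizes (40 samples, 5000 random features, 10 features kept, 5 folds). All function names are hypothetical. It compares selecting signals once on all the data against selecting them again inside each training fold.

```python
import random

def t_score(values, labels):
    """Absolute two-sample t-like score for one feature (stand-in for chi-square)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return abs(mean_a - mean_b) / ((var_a / len(a) + var_b / len(b)) ** 0.5 + 1e-12)

def select_features(X, y, rows, n_keep):
    """Rank features using only the given rows; return the top n_keep indices."""
    labels = [y[i] for i in rows]
    scores = [(t_score([X[i][j] for i in rows], labels), j) for j in range(len(X[0]))]
    return [j for _, j in sorted(scores, reverse=True)[:n_keep]]

def nearest_centroid_predict(X, y, train, feats, sample):
    """Predict the class whose training centroid is closest on the chosen features."""
    best_label, best_dist = None, float("inf")
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroid = [sum(X[i][j] for i in members) / len(members) for j in feats]
        dist = sum((X[sample][j] - c) ** 2 for j, c in zip(feats, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def cross_validate(X, y, k, n_keep, select_inside_fold):
    """k-fold accuracy; features are chosen per fold or once on all data."""
    n = len(X)
    fold_of = [(i // 2) % k for i in range(n)]  # labels alternate, so folds stay balanced
    all_rows = list(range(n))
    global_feats = None if select_inside_fold else select_features(X, y, all_rows, n_keep)
    correct = 0
    for fold in range(k):
        train = [i for i in all_rows if fold_of[i] != fold]
        test = [i for i in all_rows if fold_of[i] == fold]
        feats = select_features(X, y, train, n_keep) if select_inside_fold else global_feats
        correct += sum(nearest_centroid_predict(X, y, train, feats, i) == y[i] for i in test)
    return correct / n

rng = random.Random(0)
n_samples, n_features = 40, 5000
y = [i % 2 for i in range(n_samples)]  # two balanced classes, labels carry no signal
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]

leaky = cross_validate(X, y, k=5, n_keep=10, select_inside_fold=False)
honest = cross_validate(X, y, k=5, n_keep=10, select_inside_fold=True)
print(f"selection on all data:  {leaky:.2f}")
print(f"selection inside folds: {honest:.2f}")
```

Because the features are random, any accuracy well above chance in the first figure comes from the held-out samples having already influenced which signals were chosen.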

Phil Long’s Experiment
  Let there be classes C1 and C2 w/ 100000 features having randomly generated values
  Use χ² to select 20 features
  Run k-fold x-validation on C1 and C2 w/ these 20 features
  Expect: 50% accuracy
  Get: 90% accuracy!
  Lesson: choose features at each fold

Apples vs Oranges
  Consider this scenario:
    Fanfan reported 89% accuracy on his TIS prediction method
    Hatzigeorgiou reported 94% accuracy on her TIS prediction method
    So Hatzigeorgiou’s method is better
  What is wrong with this conclusion?

Apples vs Oranges
  Differences in datasets used:
    Fanfan’s expt used Pedersen’s dataset
    Hatzigeorgiou’s used her own dataset
  Differences in counting:
    Fanfan’s expt was on a per ATG basis
    Hatzigeorgiou’s expt used the scanning rule and thus was on a per cDNA basis
  When Fanfan ran his method on the same dataset and counted the same way as Hatzigeorgiou, he got 94% also!

Questions?
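
The counting difference in the Apples vs Oranges slides can be made concrete with a toy example. The Python sketch below is illustrative only: the scores, labels, and the 0.5 threshold are invented, the function names are hypothetical, and the scanning rule is assumed here to mean that the first ATG from the 5' end predicted positive is reported as the TIS. The same predictor outputs are scored once per ATG and once per cDNA.

```python
from typing import List, Optional, Tuple

# One ATG is (predictor score, whether it is the true TIS); a cDNA lists its ATGs 5' to 3'.
ATG = Tuple[float, bool]

def per_atg_accuracy(cdnas: List[List[ATG]], threshold: float = 0.5) -> float:
    """Fraction of individual ATGs whose thresholded prediction matches the label."""
    decisions = [(score >= threshold) == is_tis
                 for cdna in cdnas for score, is_tis in cdna]
    return sum(decisions) / len(decisions)

def scanning_choice(cdna: List[ATG], threshold: float = 0.5) -> Optional[int]:
    """Index of the first ATG predicted positive, or None if there is none."""
    for idx, (score, _) in enumerate(cdna):
        if score >= threshold:
            return idx
    return None

def per_cdna_accuracy(cdnas: List[List[ATG]], threshold: float = 0.5) -> float:
    """Fraction of cDNAs where the ATG picked by the scanning rule is the true TIS."""
    hits = 0
    for cdna in cdnas:
        idx = scanning_choice(cdna, threshold)
        if idx is not None and cdna[idx][1]:
            hits += 1
    return hits / len(cdnas)

# Invented toy data: three cDNAs with made-up predictor scores.
toy_cdnas = [
    [(0.2, False), (0.9, True), (0.7, False), (0.6, False)],
    [(0.8, True), (0.6, False)],
    [(0.6, False), (0.9, True), (0.1, False)],
]

atg_acc = per_atg_accuracy(toy_cdnas)
cdna_acc = per_cdna_accuracy(toy_cdnas)
print(f"per-ATG accuracy:  {atg_acc:.3f}")
print(f"per-cDNA accuracy: {cdna_acc:.3f}")
```

On this toy data, false positives downstream of the chosen ATG lower the per-ATG figure but are ignored by the per-cDNA count, so the two numbers are not directly comparable.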