From Datamining to Bioinformatics

Limsoon Wong
Laboratories for Information Technology, Singapore

What is Bioinformatics?

Themes of Bioinformatics
• Bioinformatics = Data Mgmt + Knowledge Discovery
• Data Mgmt = Integration + Transformation + Cleansing
• Knowledge Discovery = Statistics + Algorithms + Databases

Benefits of Bioinformatics
• To the patient: Better drug, better treatment
• To the pharma: Save time, save cost, make more $
• To the scientist: Better science

From Informatics to Bioinformatics
8 years of bioinformatics R&D in Singapore
[Timeline figure, 1994 to 2002, spanning ISS, KRDL, and LIT: Integration Technology (Kleisli); MHC-Peptide Binding (PREDICT); Protein Interactions Extraction (PIES); Gene Expression & Medical Record Datamining (PCL); Cleansing & Warehousing (FIMM); Gene Feature Recognition (Dragon); Venom Informatics]

Quick Samplings

Epitope Prediction
TRAP-559AA
MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results
• Prediction by our ANN model for HLA-A11
  - 29 predictions
  - 22 epitopes
  - 76% specificity
• Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders by BIMAS rank (axis marks 1, 66, 100): 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis
age   sex   chol   ecg    heart   sick
49    M     266    Hyp    171     N
64    M     211    Norm   144     N
58    F     283    Hyp    162     N
58    M     284    Hyp    160     Y
58    M     224    Abn    173     Y
• Looking for patterns that are
  - valid
  - novel
  - useful
  - understandable

Gene Expression Analysis
• Classifying gene expression profiles
  - find stable differentially expressed genes
  - find significant gene groups
  - derive coordinated gene expression

Medical Record & Gene Expression Analysis Results
• PCL, a novel "emerging pattern" method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of 32 UCI benchmarks
• Works well for gene expressions
Cancer Cell, March 2002, 1(2)

Behind the Scene
• Vladimir Bajic
• Vladimir Brusic
• Jinyan Li
• See-Kiong Ng
• Limsoon Wong
• Louxin Zhang
• Allen Chong
• Judice Koh
• SPT Krishnan
• Huiqing Liu
• Seng Hong Seah
• Soon Heng Tan
• Guanglan Zhang
• Zhuo Zhang
and many more: students, folks from geneticXchange, MolecularConnections, and other collaborators...

Questions?

A More Detailed Account

What is Datamining?
Jonathan's blocks
Jessica's blocks
Whose block is this?
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest

What is Datamining?
Question: Can you explain how?

The Steps of Data Mining
• Training data gathering
• Signal generation
  - k-grams, colour, texture, domain know-how, ...
• Signal selection
  - Entropy, χ², CFS, t-test, domain know-how, ...
• Signal integration
  - SVM, ANN, PCL, CART, C4.5, kNN, ...

Translation Initiation Recognition

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
............................................................
................................iEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
What makes the second ATG the translation initiation site?

Signal Generation
• K-grams (i.e., k consecutive letters)
  - K = 1, 2, 3, 4, 5, ...
  - Window size vs. fixed position
  - Upstream, downstream vs. anywhere in window
  - In-frame vs. any frame
[Chart: values from 0 to 3 for the letters A, C, G, T in seq1, seq2, seq3]

Too Many Signals
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms

Signal Selection (Basic Idea)
• Choose a signal w/ low intra-class distance
• Choose a signal w/ high inter-class distance
• Which of the following 3 signals is good?

Signal Selection (e.g., t-statistics)

Signal Selection (e.g., MIT-correlation)

Signal Selection (e.g., χ²)

Signal Selection (e.g., CFS)
• Instead of scoring individual signals, how about scoring a group of signals as a whole?
• CFS
  - A good group contains signals that are highly correlated with the class, and yet uncorrelated with each other
• Homework: find a formula that captures the key idea of CFS above

Sample k-grams Selected
• Position -3 (Kozak consensus)
• In-frame upstream ATG (leaky scanning)
• In-frame downstream
  - TAA, TAG, TGA (stop codons)
  - CTG, GAC, GAG, and GCC (codon bias)

Signal Integration
• kNN: Given a test sample, find the k training samples that are most similar to it. Let the majority class win.
• SVM: Given a group of training samples from two classes, determine a separating plane that maximises the margin between the two classes.
• Naïve Bayes, ANN, C4.5, ...

Results (on Pedersen & Nielsen's mRNA)
                 TP/(TP + FN)   TN/(TN + FP)   TP/(TP + FP)   Accuracy
Naïve Bayes      84.3%          86.1%          66.3%          85.7%
SVM              73.9%          93.2%          77.9%          88.5%
Neural Network   77.6%          93.2%          78.8%          89.4%
Decision Tree    74.0%          94.4%          81.1%          89.4%

Acknowledgements
• Roland Yap
• Zeng Fanfan
• A.G. Pedersen
• H. Nielsen

Questions?

Common Mistakes

Self-fulfilling Oracle
• Consider this scenario
  - Given classes C1 and C2 w/ explicit signals
  - Use χ² on C1 and C2 to select signals s1, s2, s3
  - Run 3-fold x-validation on C1 and C2 using s1, s2, s3 and get accuracy of 90%
• Is the accuracy really 90%?
• What can be wrong with this?
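
The scenario above can be tried out on pure noise. The following is a minimal sketch, not the author's code: it assumes 30 samples, 2000 random Gaussian features, a simple difference-of-class-means score standing in for the selection statistic, 20 selected signals, and a nearest-centroid classifier. All function names are hypothetical. It runs 3-fold cross-validation twice, once with signals chosen from all the data beforehand and once with signals re-chosen inside each fold.

```python
import random

def make_noise(n_samples, n_features, rng):
    """Pure-noise data: labels alternate 0/1, features are unrelated Gaussians."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y

def class_mean(X, rows, j):
    """Mean of feature j over the given sample indices."""
    return sum(X[i][j] for i in rows) / len(rows)

def select_signals(X, y, rows, m):
    """Keep the m features with the largest |difference of class means| on `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = [abs(class_mean(X, pos, j) - class_mean(X, neg, j))
              for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:m]

def predict(x, signals, c0, c1):
    """Nearest-centroid rule restricted to the selected signals."""
    d0 = sum((x[j] - c) ** 2 for j, c in zip(signals, c0))
    d1 = sum((x[j] - c) ** 2 for j, c in zip(signals, c1))
    return 1 if d1 < d0 else 0

def cross_validate(X, y, k, m, select_inside_fold):
    """k-fold accuracy; fold membership is index modulo k (balanced here for odd k)."""
    rows = list(range(len(X)))
    global_signals = select_signals(X, y, rows, m)  # has seen every test fold
    correct = 0
    for fold in range(k):
        test = [i for i in rows if i % k == fold]
        train = [i for i in rows if i % k != fold]
        signals = select_signals(X, y, train, m) if select_inside_fold else global_signals
        train0 = [i for i in train if y[i] == 0]
        train1 = [i for i in train if y[i] == 1]
        c0 = [class_mean(X, train0, j) for j in signals]
        c1 = [class_mean(X, train1, j) for j in signals]
        correct += sum(predict(X[i], signals, c0, c1) == y[i] for i in test)
    return correct / len(X)

rng = random.Random(0)
X, y = make_noise(n_samples=30, n_features=2000, rng=rng)
biased = cross_validate(X, y, k=3, m=20, select_inside_fold=False)
honest = cross_validate(X, y, k=3, m=20, select_inside_fold=True)
print(f"signals selected on all data first: {biased:.2f}")
print(f"signals selected inside each fold:  {honest:.2f}")
```

Since the features carry no information, the second figure should hover around chance, and any excess in the first comes from letting the held-out samples take part in signal selection.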

Phil Long's Experiment
• Let there be classes C1 and C2 w/ 100000 features having randomly generated values
• Use χ² to select 20 features
• Run k-fold x-validation on C1 and C2 w/ these 20 features
• Expect: 50% accuracy
• Get: 90% accuracy!
• Lesson: choose features at each fold

Apples vs Oranges
• Consider this scenario:
  - Fanfan reported 89% accuracy on his TIS prediction method
  - Hatzigeorgiou reported 94% accuracy on her TIS prediction method
• So Hatzigeorgiou's method is better
• What is wrong with this conclusion?

Apples vs Oranges
• Differences in datasets used:
  - Fanfan's expt used Pedersen's dataset
  - Hatzigeorgiou's expt used her own dataset
• Differences in counting:
  - Fanfan's expt was on a per-ATG basis
  - Hatzigeorgiou's expt used the scanning rule and thus was on a per-cDNA basis
• When Fanfan ran his method on the same dataset and counted the same way as Hatzigeorgiou, he also got 94%!

Questions?
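
The counting difference described above can be made concrete with a toy example. This is only a sketch under stated assumptions: the three cDNAs and their predictions are invented for illustration, the function names are hypothetical, and the scanning rule is taken to mean "scan from the 5' end and call the first ATG predicted positive". The same set of predictions is scored once per ATG and once per cDNA.

```python
def per_atg_accuracy(cdnas):
    """Every candidate ATG counts as one test case."""
    total = correct = 0
    for atgs in cdnas:
        for predicted, is_tis in atgs:
            total += 1
            correct += (predicted == is_tis)
    return correct / total

def per_cdna_accuracy(cdnas):
    """Scanning rule (assumed): the first ATG predicted positive, read 5' to 3',
    is the call for the whole cDNA, so every cDNA counts as one test case."""
    correct = 0
    for atgs in cdnas:
        call = next((i for i, (predicted, _) in enumerate(atgs) if predicted), None)
        truth = next((i for i, (_, is_tis) in enumerate(atgs) if is_tis), None)
        correct += (call == truth)
    return correct / len(cdnas)

# Hypothetical data: each cDNA is a list of (predicted_positive, is_true_TIS)
# pairs, one per ATG, ordered from the 5' end.
toy_cdnas = [
    [(False, False), (True, True), (True, False), (False, False)],
    [(True, True), (False, False), (True, False)],
    [(True, False), (False, True), (False, False)],
]

atg_acc = per_atg_accuracy(toy_cdnas)
cdna_acc = per_cdna_accuracy(toy_cdnas)
print(f"per-ATG accuracy:  {atg_acc:.3f}")
print(f"per-cDNA accuracy: {cdna_acc:.3f}")
```

The two figures differ even though the underlying predictions are identical, which is why accuracies reported under different counting rules, or on different datasets, cannot be compared directly.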
