Some Data Mining Challenges Learned From Bioinformatics & Actions Taken
Limsoon Wong
National University of Singapore
Bertinoro, Nov 2005
Copyright 2005 © Limsoon Wong

Plan
• Bioinformatics Examples
  – Treatment prognosis of DLBC lymphoma
  – Prediction of translation initiation site
  – Prediction of protein function from PPI data
• What have we learned from these projects?
• What have I been looking at recently?
  – Statistical measures beyond frequent items
  – Small changes that have large impact
  – Evolution of pattern spaces

Example #1: Treatment Prognosis for DLBC Lymphoma
Ref: H. Liu et al., “Selection of patient samples and genes for outcome prediction”, Proc. CSB2004, pages 382--392

Diffuse Large B-Cell Lymphoma
Image credit: Rosenwald et al., 2002
• DLBC lymphoma is the most common type of lymphoma in adults
• Can be cured by anthracycline-based chemotherapy in 35 to 40 percent of patients
  DLBC lymphoma comprises several diseases that differ in responsiveness to chemotherapy
• Intl Prognostic Index (IPI)
  – age, “Eastern Cooperative Oncology Group” performance status, tumor stage, lactate dehydrogenase level, sites of extranodal disease, ...
• Not very good for stratifying DLBC lymphoma patients for therapeutic trials
  Use gene-expression profiles to predict outcome of chemotherapy?

Knowledge Discovery from Gene Expression of “Extreme” Samples
[Flow diagram labels, in order: 240 samples; “extreme” sample selection (< 1 yr vs > 8 yrs); knowledge discovery from gene expression; 47 short-term survivors; 26 long-term survivors; 7399 genes; 80 samples; 84 genes]
• T is long-term if S(T) < 0.3
• T is short-term if S(T) > 0.7

Kaplan-Meier Plot for 80 Test Cases
[Figure: Kaplan-Meier survival curves for the low-risk and high-risk groups]
• p-value of log-rank test: < 0.0001
• Risk score thresholds: 0.7, 0.3
• If no training sample selection is conducted, there is no clear difference in the overall survival of the 80 samples in the validation group of the DLBCL study

Example #2: Protein Translation Initiation Site Recognition
Ref: L. Wong et al., “Using feature generation and feature selection for accurate prediction of translation initiation sites”, GIW 13:192-200, 2002

A Sample cDNA
299 HSU27655.1 CAT U27655 Homo sapiens
CGTGTGTGCAGCAGCCTGCAGCTGCCCCAAGCCATGGCTGAACACTGACTCCCAGCTGTG
CCCAGGGCTTCAAAGACTTCTCAGCTTCGAGCATGGCTTTTGGCTGTCAGGGCAGCTGTA
GGAGGCAGATGAGAAGAGGGAGATGGCCTTGGAGGAAGGGAAGGGGCCTGGTGCCGAGGA
CCTCTCCTGGCCAGGAGCTTCCTCCAGGACAAGACCTTCCACCCAACAAGGACTCCCCT
(position markers in the original figure: 80, 160, 240)
• What makes the second ATG the TIS?
• Approach
  – Training data gathering
  – Signal generation
    • k-grams, distance, domain know-how, ...
  – Signal selection
    • Entropy, χ², CFS, t-test, domain know-how, ...
  – Signal integration
    • SVM, ANN, PCL, CART, C4.5, kNN, ...

Too Many Signals / Feature Selection
• For each value of k, there are 4^k * 3 * 2 k-grams
• If we use k = 1, 2, 3, 4, 5, we have 24 + 96 + 384 + 1536 + 6144 = 8184 features!
• This is too many for most machine learning algorithms
• Choose a signal with low intra-class distance
• Choose a signal with high inter-class distance
• E.g., [figure not preserved]

Sample k-grams Selected by CFS
• Position –3 (Kozak consensus)
• in-frame upstream ATG (leaky scanning)
• in-frame downstream
  – TAA, TAG, TGA (stop codon)
  – CTG, GAC, GAG, and GCC (codon bias?)

Validation Results (on Chr X and Chr 21)
[Figure: results for our method and for ATGpr]
• Using top 100 features selected by entropy and trained on Pedersen & Nielsen’s

Example #3: Protein Function Prediction from Protein Interactions

An Illustrative Case of Indirect Functional Association?
[Figure: SH3 proteins and SH3-binding proteins, with level-1 and level-2 neighbours marked]
• Is indirect functional association plausible?
• Is it found often in real interaction data?
• Can it be used to improve protein function prediction from protein interaction data?
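
The second question above can be explored with a small sketch. It assumes, hypothetically, that the interaction data is held as an adjacency dictionary `graph` and the annotations as a dictionary `funcs` of function-code sets; these names, the helper functions, and the toy proteins A to E are illustrative and not taken from the study. Level-2 neighbours are taken here to be partners of partners, which may overlap with level-1 neighbours. The code reports the fraction of proteins that share a function with some level-1 neighbour, and the fraction that share a function with a level-2 neighbour but with no level-1 neighbour.

```python
from typing import Dict, Set, Tuple

Graph = Dict[str, Set[str]]
Annotations = Dict[str, Set[str]]


def level1(graph: Graph, p: str) -> Set[str]:
    """Direct interaction partners of p (self-interactions ignored)."""
    return set(graph.get(p, set())) - {p}


def level2(graph: Graph, p: str) -> Set[str]:
    """Partners of p's partners, excluding p itself; may overlap with level 1."""
    result: Set[str] = set()
    for q in level1(graph, p):
        result |= level1(graph, q)
    return result - {p}


def shares_function(funcs: Annotations, p: str, others: Set[str]) -> bool:
    """True if p has at least one function code in common with any protein in others."""
    return any(funcs.get(p, set()) & funcs.get(q, set()) for q in others)


def association_fractions(graph: Graph, funcs: Annotations) -> Tuple[float, float]:
    """Return (fraction sharing with level 1, fraction sharing with level 2 only)."""
    proteins = sorted(graph)
    if not proteins:
        return 0.0, 0.0
    with_l1 = 0
    l2_only = 0
    for p in proteins:
        if shares_function(funcs, p, level1(graph, p)):
            with_l1 += 1
        elif shares_function(funcs, p, level2(graph, p)):
            l2_only += 1
    return with_l1 / len(proteins), l2_only / len(proteins)


# Toy, made-up network: edges A-B, A-E, B-C, C-D (stored symmetrically).
graph: Graph = {
    "A": {"B", "E"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"A"},
}
# Toy, made-up function codes.
funcs: Annotations = {
    "A": {"f1"}, "B": {"f2"}, "C": {"f1"}, "D": {"f2"}, "E": {"f1"},
}

frac_l1, frac_l2_only = association_fractions(graph, funcs)
print(frac_l1, frac_l2_only)
```

In this toy network, B, C, and D share a function only with proteins two steps away, which is the kind of indirect association the questions above ask about.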

Frequency of Indirect Functional Association
• 59.2% of proteins in the dataset share some function with level-1 neighbours
  [Figure: interaction neighbourhood of yeast proteins YAL012W, YJR091C, YMR300C, YPL149W, YBR055C, YMR101C, YPL088W, YBR293W, YDR158W, YBL072C, YBR023C, YLR330W, YBL061C, and YLR140W, each labelled with its function codes (e.g. YAL012W: 1.1.6.5, 1.1.9); the full node-to-code layout is not recoverable]
• 27.9% share some function with level-2 neighbours but share no function with level-1 neighbours
  [Figure: interaction neighbourhood of yeast proteins YMR047C, YKL006W, YOR312C, YPL193W, YDL081C, YDR091C, and YPL013C, each labelled with its function codes (12.1.1 recurs across most nodes); the full node-to-code layout is not recoverable]

Over-Rep of Functions in L1 & L2 Neighbours
[Figure, left panel: “Fraction of Neighbours with Functional Similarity”; x-axis: Similarity (≥0.1 to ≥0.9); y-axis: Fraction; series: L1 - L2, L2 - L1, L1 ∩ L2, L3 - (L1 U L2), All Proteins]
[Figure, right panel: “Sensitivity vs Precision”; x-axis: Precision; y-axis: Sensitivity; series: L1 - L2, L2 - L1, L1 ∩ L2]

Performance Evaluation
• Prediction performance improves after incorporation of L1, L2, & interaction reliability info
[Figure: “Informative FCs”; x-axis: Precision; y-axis: Sensitivity; series: NC, Chi², PRODISTIN, Weighted Avg, Weighted Avg R]

What Have We Learned?
Some of those “techniques” frequently needed in analysis of biomedical data are insufficiently studied by current data mining researchers
• Recognizing what samples are relevant and what are not
• Recognizing what features are relevant and what are not & handling missing or incorrect values
• Recognizing trends, changes, and their causes

Action #1: Going Beyond Frequent Patterns to Recognize What Features Are Relevant and What Are Not

Going Beyond Frequent Patterns
• Statisticians use a battery of “interestingness” measures to decide if a feature/factor is relevant
• Examples:
  – Odds ratio
  – Relative risk
  – Gini index
  – Yule’s Q & Y
  – etc
• Odds ratio
  $$OR(P,D) = \frac{P_{D,ed} \,/\, P_{D,\neg e d}}{P_{D,e \neg d} \,/\, P_{D,\neg e \neg d}}$$

Challenge: Frequent Pattern Mining Relies on Convexity for Efficiency, But …
• Proposition: Let $S_k^{OR}(ms,D) = \{ P \in F(ms,D) \mid OR(P,D) \ge k \}$. Then $S_k^{OR}(ms,D)$ is not convex
• i.e., the space of odds ratio patterns is not convex. Ditto for many other types of patterns
[Figure: OR search space, with example patterns {A,B}:1, {A,B,C}:3, {A}:∞]
  – E.g., along the chain {A} ⊂ {A,B} ⊂ {A,B,C} the odds ratios are ∞, 1, and 3, so a threshold k = 2 admits {A} and {A,B,C} but not {A,B}, which lies between them

Solution: Luckily They Become Convex When Decomposed Into Plateaus
• Theorem: Let $S_{n,k}^{OR}(ms,D) = \{ P \in F(ms,D) \mid P_{D,ed} = n,\ OR(P,D) \ge k \}$.
  Then $S_{n,k}^{OR}(ms,D)$ is convex
• The space of odds ratio patterns becomes convex when stratified into plateaus based on support levels on the positive (or negative) dataset
• Proposition: Let $Q \in [P]_D$, then $OR(Q,D) = OR(P,D)$
• The plateau space can be further divided into convex equivalence classes on the whole dataset
• The space of equivalence classes can be concisely represented by generators and closed patterns

Performance
• Mining odds ratio and relative risk patterns depends on GC-growth
• GC-growth mines both generators and closed patterns
• It is comparable in speed to the fastest algorithms that mine only closed patterns

Action #2: Tipping Factors---The Small Changes With Large Impact

Tipping Events
• Given a data set, such as those related to human health, it is interesting to determine important cohorts and important factors causing transition between cohorts
  – i.e., to determine tipping events
  – Tipping factors are “action items” for causing transitions
• “Tipping event” is two or more population cohorts that are significantly different from each other
• “Tipping factors” (TF) are small patterns whose presence or absence causes significant difference in population cohorts
• “Tipping base” (TB) is the pattern shared by the cohorts in a tipping event
• “Tipping point” (TP) is the combination of TB and a TF

Impact-To-Cost-Ratio of Tipping Points

Some Simple Results Useful For Constructing TPs

Action #3: Evolution of Pattern Spaces---How Do They Change When the Sample Space Changes?
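
To make the question concrete, here is a minimal brute-force sketch in Python. It groups the patterns of a tiny, made-up transaction database into equivalence classes (patterns occurring in exactly the same transactions) and reports each class by its closed pattern and its generators. It assumes that “key patterns” means generators, i.e. the minimal patterns of a class. The function name `equivalence_classes` and the toy data are hypothetical, and exhaustive enumeration is used only for illustration; it is not GC-growth or any algorithm from the talk.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List

Pattern = FrozenSet[str]


def equivalence_classes(transactions: List[Pattern]) -> Dict[Pattern, List[Pattern]]:
    """Map each closed pattern to the generators of its equivalence class.

    Two patterns are equivalent if they occur in exactly the same transactions.
    Exhaustive enumeration: only suitable for toy inputs.
    """
    items = sorted(set().union(*transactions))
    classes: Dict[FrozenSet[int], List[Pattern]] = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            pattern = frozenset(combo)
            tids = frozenset(
                i for i, t in enumerate(transactions) if pattern <= t
            )
            if tids:  # keep only patterns that occur at least once
                classes.setdefault(tids, []).append(pattern)

    result: Dict[Pattern, List[Pattern]] = {}
    for patterns in classes.values():
        # The closed pattern is the unique largest pattern of the class.
        closed = max(patterns, key=len)
        # Generators are the patterns with no proper subset in the same class.
        generators = [p for p in patterns if not any(q < p for q in patterns)]
        result[closed] = sorted(generators, key=sorted)
    return result


# Toy, made-up database before and after adding one transaction.
before_db = [frozenset("abc"), frozenset("ab"), frozenset("ac")]
after_db = before_db + [frozenset("bc")]

before = equivalence_classes(before_db)
after = equivalence_classes(after_db)

for name, classes in (("before", before), ("after", after)):
    print(name, len(classes), "equivalence classes")
    for closed, generators in sorted(classes.items(), key=lambda kv: sorted(kv[0])):
        print("  closed:", sorted(closed), "generators:", [sorted(g) for g in generators])
```

In this toy case, adding the transaction {b, c} splits the class whose closed pattern is {a, b, c}: its generator changes from {b, c} to {a, b, c}, and {b, c} becomes a closed pattern in its own right.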

Impact of Adding New Transactions on Key and Closed Patterns

Impact of Removing Items From All Transactions

Acknowledgements
• DLBC Lymphoma:
  – Jinyan Li, Huiqing Liu
• Translation Initiation:
  – Fanfan Zeng, Roland Yap
  – Huiqing Liu
• Protein Function Prediction:
  – Kenny Chua, Ken Sung
• Odds Ratio & Relative Risk:
  – Mengling Feng, Yap-Peng Tan
  – Haiquan Li, Jinyan Li
• Tipping Points:
  – Guimei Liu, Jinyan Li
  – Guozhu Dong
• Pattern Space Evolution:
  – Mengling Feng, Yap-Peng Tan
  – Guozhu Dong
  – Jinyan Li