Knowledge Discovery from Biological and Clinical Data: BASIC BACKGROUND

What is Data Mining?
[Figure: Jonathan's blocks and Jessica's blocks. Whose block is this?]
Jonathan's rules: Blue or Circle
Jessica's rules: All the rest

What is Data Mining?
Question: Can you explain how?

Knowledge Discovery from Biological and Clinical Data: MOTIVATION

Driving Forces: Genes, Proteins, Interactions, Diagnosis, & Cures
GENOME
• Complete genomes are now available
• Knowing the genes is not enough to understand how biology functions
PROTEOME
• Proteins, not genes, are responsible for many cellular activities
INTERACTOME
• Proteins function by interacting with other proteins and biomolecules

If we figure out how these work, we get these Benefits
To the patient: Better drug, better treatment
To the pharma: Save time, save cost, make more $
To the scientist: Better science

To figure these out, we bet on...
"solution" = Data Management + Knowledge Discovery
Data Management = Integration + Transformation + Cleansing
Knowledge Discovery = Statistics + Algorithms + Databases

Knowledge Discovery from Biological and Clinical Data: ACCOMPLISHMENT

Predict Epitopes, Find Vaccine Targets
• Vaccines are often the only solution for viral diseases
• Finding and developing effective vaccine targets (epitopes) is a slow and expensive process
• Develop systems to recognize protein peptides that bind MHC molecules
• Develop systems to recognize hot spots in viral antigens

Recognize Functional Sites, Help Scientists
• Effective recognition of the initiation, control, and termination of biological processes is crucial to speeding up and focusing scientific experiments
• Data mining of biological sequences to find rules for recognizing and understanding functional sites
[Figure: Dragon's 10x reduction of TSS (transcription start site) recognition false positives]

Diagnose Leukaemia, Benefit Children
• Childhood leukaemia is a heterogeneous disease
• Treatment is based on subtype
• 3 different tests and 4 different experts are needed for diagnosis
Curable in the USA, fatal in Indonesia
• A single-platform diagnosis based on gene expression
• Data mining to discover rules that are easy for doctors to understand

Understand Proteins, Fight Diseases
• Understanding the function and role of a protein needs organised information on interaction pathways
• Such information is often reported in scientific papers but is seldom found in structured databases
• Knowledge extraction system to process free text
  – extract protein names
  – extract interactions

Knowledge Discovery from Biological and Clinical Data: OPPORTUNITY

Direction & Plan
• Objectives
  – Translate inspiration from biological systems into advancement of life and computing sciences
  – Advance data mining technologies in decision systems for complex problems
• To work on practical systems for
  – data mining
  – data cleansing
  – knowledge extraction
• Applied to
  – gene regulation
  – protein interaction
  – clinical data analysis
  – ligand-receptor interaction

E.g., How to Get More Out of the Same Experiments?
• How to recognize false positives from two-hybrid and other types of high-throughput protein interaction experiments?
• Some initial thoughts: It seems that configuration a is less likely than configuration b. Can we exploit this?
[Figure: interaction configurations a and b]

E.g., How to Improve Classifier Algorithms?
• SVM, ANN, etc.
  – Good accuracy,
  – but not easy to understand
• C4.5, CART, etc.
  – Clear rules,
  – but lower accuracy
• Why can't we have a classifier algorithm that
  – handles high dimension
  – achieves high accuracy
  – provides understandable rules

Who will you be working with...
Affiliations: SOC, I2R
• Vladimir Bajic
• Vladimir Brusic
• See-Kiong Ng
• Jinyan Li
• Limsoon Wong
• Wynne Hsu
• Mong Li Lee
• Ken Sung
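
Appendix sketch: the block-sorting question from BASIC BACKGROUND
The opening slides ask how a rule such as "Blue or Circle" could be found from the blocks alone. One possible answer, offered only as a minimal sketch, is to try every rule of the form "color is C or shape is S" and keep those that agree with all labelled blocks. The block list, the function names find_rules and classify, and the restriction to this one rule shape are all assumptions made for illustration; none of them come from the slides.

```python
from itertools import product

# Hypothetical labelled blocks (color, shape, owner); invented for this sketch.
BLOCKS = [
    ("blue", "square", "Jonathan"),
    ("blue", "triangle", "Jonathan"),
    ("red", "circle", "Jonathan"),
    ("green", "circle", "Jonathan"),
    ("red", "square", "Jessica"),
    ("green", "triangle", "Jessica"),
    ("red", "triangle", "Jessica"),
]


def find_rules(blocks, target):
    """Return every (color, shape) pair for which the rule
    'color == c or shape == s' is true exactly on the target's blocks."""
    colors = sorted({c for c, _, _ in blocks})
    shapes = sorted({s for _, s, _ in blocks})
    consistent = []
    for c, s in product(colors, shapes):
        # The rule must fire on all of the target's blocks and on no others.
        if all(
            (color == c or shape == s) == (owner == target)
            for color, shape, owner in blocks
        ):
            consistent.append((c, s))
    return consistent


def classify(color, shape, rule, target, other):
    """Apply one discovered rule; everything it does not cover goes to 'other'."""
    c, s = rule
    return target if (color == c or shape == s) else other


rules = find_rules(BLOCKS, "Jonathan")
print(rules)
print(classify("green", "square", rules[0], "Jonathan", "Jessica"))
```

Exhaustive search is workable here only because there are two attributes with a handful of values each. The high-dimensional data discussed in the OPPORTUNITY slides would need a smarter search, which is the gap those slides point at.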