From Informatics to Bioinformatics
Limsoon Wong
Kent Ridge Digital Labs
Singapore
Show & Tell

What is Bioinformatics?

What are the Themes of Bioinformatics?
• Bioinformatics = Data Management + Knowledge Discovery
• Data Management = Integration + Transformation + Cleansing
• Knowledge Discovery = Statistics + Algorithms + Databases

What are the Benefits of Bioinformatics?
• To the patient: better drug, better treatment
• To the pharma: save time, save cost, make more $
• To the scientist: better science

Data Integration

A DOE “impossible query”: for each gene on a given cytogenetic band, find its non-human homologs.

source   type     location    remarks
GDB      Sybase   Baltimore   flat tables; SQL joins; location info
Entrez   ASN.1    Bethesda    nested tables; keywords; homolog info

Data Integration Results

• Using Kleisli:
  • Clear
  • Succinct
  • Efficient
  • Handles
    • heterogeneity
    • complexity

    sybase-add (#name: "GDB", ...);
    create view L from locus_cyto_location using GDB;
    create view E from object_genbank_eref using GDB;
    select
      #accn: g.#genbank_ref,
      #nonhuman-homologs: H
    from
      L as c,
      E as g,
      (select u
       from g.#genbank_ref.na-get-homolog-summary as u
       where not(u.#title string-islike "%Human%")
       andalso not(u.#title string-islike "%H.sapien%")) as H
    where
      c.#chrom_num = "22"
      andalso g.#object_id = c.#locus_id
      andalso not (H = { });

Data Warehousing

Motivation
• efficiency
• availability
• “denial of service”
• data cleansing

Requirements
• efficient to query
• easy to update
• model data naturally

    {(#uid: 6138971,
      #title: "Homo sapiens adrenergic ...",
      #accession: "NM_001619",
      #organism: "Homo sapiens",
      #taxon: 9606,
      #lineage: ["Eukaryota", "Metazoa", …],
      #seq: "CTCGGCCTCGGGCGCGGC...",
      #feature: {
        (#name: "source",
         #continuous: true,
         #position: [
           (#accn: "NM_001619", #start: 0, #end: 3602, #negative: false)],
         #anno: [
           (#anno_name: "organism", #descr: "Homo sapiens"), …] ),
        …)}

Data Warehousing Results

• A relational DBMS alone is insufficient because it forces us to fragment data into 3NF.
• Kleisli turns a flat relational DBMS into a nested relational DBMS.
• It can use flat relational DBMSs such as Sybase, Oracle, MySQL, etc. as its updatable complex object store.
• It can even use all of these systems simultaneously!

    ! Log in
    oracle-cplobj-add (#name: "db", ...);

    ! Define table
    create table GP (#uid: "NUMBER", #detail: "LONG") using db;

    ! Populate table with GenPept reports
    select #uid: x.#uid, #detail: x
    into GP
    from aa-get-seqfeat-general "PTP" as x
    using db;

    ! Map GP to that table
    create view GP from GP using db;

    ! Run a query to get title of 131470
    select x.#detail.#title
    from GP as x
    where x.#uid = 131470;

Epitope Prediction

TRAP-559AA

    MNHLGNVKYLVIVFLIFFDLFLVNGRDVQNNIVDEIKYSE
    EVCNDQVDLYLLMDCSGSIRRHNWVNHAVPLAMKLIQQLN
    LNDNAIHLYVNVFSNNAKEIIRLHSDASKNKEKALIIIRS
    LLSTNLPYGRTNLTDALLQVRKHLNDRINRENANQLVVIL
    TDGIPDSIQDSLKESRKLSDRGVKIAVFGIGQGINVAFNR
    FLVGCHPSDGKCNLYADSAWENVKNVIGPFMKAVCVEVEK
    TASCGVWDEWSPCSVTCGKGTRSRKREILHEGCTSEIQEQ
    CEEERCPPKWEPLDVPDEPEDDQPRPRGDNSSVQKPEENI
    IDNNPQEPSPNPEEGKDENPNGFDLDENPENPPNPDIPEQ
    KPNIPEDSEKEVPSDVPKNPEDDREENFDIPKKPENKHDN
    QNNLPNDKSDRNIPYSPLPPKVLDNERKQSDPQSQDNNGN
    RHVPNSEDRETRPHGRNNENRSYNRKYNDTPKHPEREEHE
    KPDNNKKKGESDNKYKIAGGIAGGLALLACAGLAYKFVVP
    GAATPYAGEPAPFDETLGEEDKDLDEPEQFRLPEENEWN

Epitope Prediction Results

Prediction by our ANN model for HLA-A11
• 29 predictions
• 22 epitopes
• 76% specificity

Prediction by BIMAS matrix for HLA-A*1101
[Figure: number of experimental binders against rank by BIMAS (axis marks 1, 66, 100); bar labels 19 (52.8%), 5 (13.9%), 12 (33.3%)]

Gene Expression Analysis

• Clustering gene expression profiles
• Classifying gene expression profiles
• Finding stable differentially expressed genes

Gene Expression Analysis Results

The Discovery System
• Correlation test
• Voter selection
• Class prediction

Protein Interaction Extraction

“What are the protein-protein interaction pathways from the latest reported discoveries?”

Protein Interaction Extraction Results

Rule-based system for processing free text in scientific abstracts

Specialized in
• extracting protein names
• extracting protein-protein interactions

Transcription Start Prediction

Transcription Start Prediction Results

Medical Record Analysis

age   sex   chol   ecg    heart   sick
49    M     266    Hyp    171     N
64    M     211    Norm   144     N
58    F     283    Hyp    162     N
58    M     284    Hyp    160     Y
58    M     224    Abn    173     Y

Looking for patterns that are
• valid
• novel
• useful
• understandable

Medical Record Analysis Results

• DeEPs, a novel “emerging pattern” method
• Beats C4.5, CBA, LB, NB, TAN in 21 out of
  32 UCI benchmarks
• Works for gene expressions

Behind the Scene

Research: Vladimir Bajic, Vladimir Brusic, Jinyan Li, See-Kiong Ng, Limsoon Wong, Louxin Zhang
Industry Assignees: Peter Saunders
Business: Hao Han (gX), Rahul Despande (MC)
Engineering: Allen Chong, Judice Koh, SPT Krishnan, Seng Hong Seah, Guanglan Zhang, Zhuo Zhang
Students: Huiqing Liu, Song Zhu, Kun Yu