Applying Semantic Web Standards to Drug Discovery and Development
Eric Neumann
W3C HCLS co-chair

Knowledge
“--is the human acquired capacity (both potential and actual) to take effective action in varied and uncertain situations.”
How does this translate into using Information Systems better in support of Innovation?

Knowledge Predictiveness
• Knowledge of Target Mechanisms
• Knowledge of Toxicity
• Knowledge of Patient-Drug Profiles

Where Information Advances are Most Needed
• Supporting Innovative Applications in R&D
  – Mol Diagnostics (Biomarkers)
  – Molecular Mechanisms (Systems)
  – Data Provenance, Rich Annotation
• Clinical Information
  – eHealth Records + EDC
  – Clinical Submission Documents
  – Safety Information, Pharmacovigilance, Adverse Events
  – Handling Biomarker evidence
• Standards
  – Central Data Sources
    • Genomics, Diseases, Chemistry, Toxicology
  – MetaData
    • Ontologies
    • Vocabularies

[Diagram labels: Raw Data; MAGE ML; GO; CDISC; BioPAX; Psi XML; ICH; ASN.1; XLS; SAS Tables; CSV; Semantic Bridge; Decision Support; Biomarker Qualification; Translational Research; Target Validation; New Applications; Safety; Tox]

Losing Connectedness in Tables
Fast Uptake and ease of use, but loose binding to entities and terms
[Diagram labels: ?; Genes; Tissues]

Data Integration?
• Querying Databases is not sufficient
• Data needs to include the Context of Local Scientists
• Concepts and Vocabulary need to be associated
• More about Sociology than Technology
[Diagram labels: Information; Knowledge]

Data Integration: Biology Requirements
[Diagram labels: Papers; Disease; Proteins; Genes; Retention Policy; Samples; Compounds; Audit Trail; Curation; Ontology; Experiment; Tools]

Standards - Why Not?
• Good when there’s a majority of agreement
• By vendors, for vendors?
• Mainly about Data Packing-- should be more about Semantics (user-defined)
• Ease and Expressivity
• Too often they’re Brittle and Slow to develop
• “They’re great, that’s why there are so many of them”

Data Integration Enables Business Integration: Efficiency and Innovation
• Searching
• Visualization
• Analysis
• Reporting
• Notification
• Navigation

Searching…
#1 way for finding information in companies…

Semantic Web Data Integration
[Diagram labels: R&D Scientist; Dynamic, Linked, Searchable; LIMS; Bioinformatics; Cheminformatics; Public Data Sources]

The Current Web
What the computer sees: “Dumb” links
No semantics - <a href> treated just like <bold>
Minimal machine-processable information

The Semantic Web
Machine-processable semantic information
Semantic context published – making the data more informative to both humans and machines

The Web of Data
• URIs are universal IDs
• Distributed data references
• Non-locality of data
• Named Graphs can help segment external references
• New meaning for Annotation
[Diagram labels: target; target; gene; pathway]

Case Study: Omics
ApoA1 …
… is produced by the Liver
… is expressed less in Atherosclerotic Liver
… is correlated with DKK1
… is cited regarding Tangier disease
… has Tx Reg elements like HNFR1
[Diagram labels: Subject; Verb; Object]

Example: Knowledge Aggregation
[Figure: courtesy of BG-Medicine]

Tim Berners-Lee’s App View

Semantic Web Drug DD Application Space
[Diagram labels: Therapeutics; Chem Lib; manufacturing; NDA; Production; Genomics; Clinical Studies; HTS; eADME; Biology; Compound Opt; DMPK; genes; Patent informatics]

W3C Launches Semantic Web for HealthCare and Life Sciences Interest Group
• Interest Group formally launched Nov 2005: http://www.w3.org/2001/sw/hcls
• First Domain Group for W3C - “…take SW through its paces”
• An Open Scientific Forum for Discussing, Capturing, and Showcasing Best Practices
• Recent life science members: Pfizer, Merck, Partners HealthCare, Teranode, Cerebra, NIST, U Manchester, Stanford U, AlzForum
• SW Supporting Vendors: Oracle, IBM, HP, Siemens, AGFA
• Co-chairs: Dr. Tonya Hongsermeier (Partners HealthCare); Eric Neumann (Teranode)

HCLS Objectives
• Share use cases, applications, demonstrations, experiences
• Exposing collections
• Developing vocabularies
• Building / extending (where appropriate) core vocabularies for data integration

HCLS Activities
• BioRDF - data as RDF
• BioNLP - unstructured data
• BioONT - ontology coordination
• Clinical Trials - CDISC/HL7
• Scientific Publishing - evidence management
• Adaptive Healthcare Protocols

Semantic Web in R&D
[Diagram labels: Progression Manager; Toxicogenomicist; Scientist; A Single Compound; Shared Annotations; Notified of Alternatives; Reporting on Progression; Notify Others of Decisions; Found Determinations; Noted Alternatives]
Open Data Format and Flexible Linking Enabled Data Integration and Collaboration

R&D Applications in the Semantic Web
[Diagram labels: Progression Manager; Project Dashboard; Toxicogenomicist; Experiment Manager; Scientist; R&D Commons; A Single Compound]

Other Benefits of Semantic Web
• Enterprise Distributed Connectivity
  – Uniform Resource Identifiers (URI)
• Authenticity
  – Auditability (Sarbanes-Oxley)
  – Authorship Non-repudiability
• Privacy
  – Encryptibility and Trust Networks
• Security
  – At any level of granularity

What is the Semantic Web?
It’s Semantic Webs
It’s Text Extraction
It’s AI
It’s Web 2.0
It’s Data Tracking
It’s a Global Conspiracy
It’s Ontologies
• http://www.w3.org/2006/Talks/0125-hclsig-em/

W3C Roadmap
• Semantic Web foundation specifications
  – RDF, RDF Schema and OWL are W3C Recommendations as of Feb 2004
• Standardization work is underway in Query, Best Practices and Rules
• Goal of moving from a Web of Documents to a Web of Data
The Only Open and Web-based Data Integration Model Game in Town

Leveraging with Semantic Web Benefit #1
• Free Data from Applications…
  – Data uniquely defined by URIs, even across multiple databases
  – Mapped through a common graph semantic model
  – Data can be distributed (not in one location)
  – New relations and attributes dynamically added
• As easy as spreadsheets, but with semantics and web locations

Leveraging with Semantic Web Benefit #2
• All things on the Web can have semantics added to them
  – Ability to define and link in ontologies
  – Document Management through Links
  – Changed data and semantics can be managed as versions
  – Semantics can be used to define and apply policies
  – No Need for complex Middleware

Leveraging with Semantic Web Benefit #3
• Supporting the Management of Knowledge
  – All data nodes and doc resources can be linked
  – Ability to represent Assertions and Hypotheses
    • Include authorship and assumptions
    • Use of KD45 logic
  – Both Local and Global Knowledge
    • Scientists can upload partially validated facts
  – View Data and Interpretations through Points-of-View (Semantic Lenses)
    • Share views with others

The Technologies: RDF
• Resource Description Framework
• Think: "Relational Data Format"
• W3C standard for making statements of fact or belief about data or concepts
• Descriptive statements are expressed as triples: (Subject, Verb, Object)
  – We call the verb a “predicate” or a “property”

    Subject:  <Patient HB2122>
    Property: <shows_sign>
    Object:   <Disease Pneumococcal_Meningitis>

What RDF Gets You
Universal, semantic connectivity supports the construction of elaborate structures.

What does RDF get you?
• Structure is not format-rigid (i.e. tree)
  – Semantics not implicit in Syntax
  – No new parsers need to be defined for new data
• Entities can be anywhere on the web (URI)
• Define semantics into graph structures (ontologies)
  – Use rules to test data consistency and extract important relations
• Data can be merged into complete graphs
• Multiple ontologies supported

RDF vs. XML example
Wang et al., Nature Biotechnology, Sept 2005
[Figure labels: AGML; HUPML]

RDF Stripe Mode
Node > Edge > Node > Edge …

RDF Graph

    gsk:KENPAL
        rdf:type :Compound ;
        dc:source http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=14698171 ;
        chemID "3820" ;
        clogP "2.4" ;
        kA "e-8" ;
        mw "327.17" ;
        ic50 {
            rdf:type :IC50 ;
            value "23" ;
            units :nM ;
            forTarget gsk:GSK3beta
        } ;
        chemStructure "C16H11BrN2O" ;
        rdfs:label "kenpaullone" ;
        synonym "bromo-paullone" ;
        smiles "C1C2=C(C3=CC=CC=C3NC1=O)NC4=C2C=C(C=C4)B" ;
        inChI "1/C16H11BrN2O/c17-9-5-6-14-11(7-9)12-8-15(20)18-13-4-2-1-3-10(13)16(12)1914/h1-7,19H,8H2,(H,18,20)/f/h18H" ;
        xref http://pubchem.ncbi.nlm.nih.gov/summary/summary.cgi?cid=3820 .
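The compound description above can be read as a set of (subject, property, object) triples, and the claim that "data can be merged into complete graphs" then amounts to a set union over shared identifiers. The following is only a minimal sketch of that idea, assuming plain Python tuples stand in for an RDF store (no RDF library is used). The blank-node name `_:ic50_1`, the `inPathway` property, the `pathway:WNT` identifier, the whole second graph, and the helper functions are hypothetical and are not part of the original slides.

```python
from typing import List, Set, Tuple

# A triple is (subject, property, object); a graph is a set of triples.
Triple = Tuple[str, str, str]
Graph = Set[Triple]

# A few statements taken from the kenpaullone example. The nested ic50
# block is flattened into a blank node, here named "_:ic50_1" (an assumption).
compound_graph: Graph = {
    ("gsk:KENPAL", "rdf:type", ":Compound"),
    ("gsk:KENPAL", "rdfs:label", "kenpaullone"),
    ("gsk:KENPAL", "chemID", "3820"),
    ("gsk:KENPAL", "mw", "327.17"),
    ("gsk:KENPAL", "ic50", "_:ic50_1"),
    ("_:ic50_1", "rdf:type", ":IC50"),
    ("_:ic50_1", "value", "23"),
    ("_:ic50_1", "units", ":nM"),
    ("_:ic50_1", "forTarget", "gsk:GSK3beta"),
}

# Hypothetical second graph, e.g. published by a biology group.
target_graph: Graph = {
    ("gsk:GSK3beta", "rdf:type", ":Target"),
    ("gsk:GSK3beta", "inPathway", "pathway:WNT"),
}


def merge(*graphs: Graph) -> Graph:
    """Merge graphs by set union; shared URIs join the data automatically."""
    merged: Graph = set()
    for graph in graphs:
        merged |= graph
    return merged


def objects(graph: Graph, subject: str, prop: str) -> List[str]:
    """Return all objects of (subject, prop, ?) in sorted order."""
    return sorted(o for s, p, o in graph if s == subject and p == prop)


def pathways_for_compound(graph: Graph, compound: str) -> List[str]:
    """Follow compound -> ic50 -> forTarget -> inPathway."""
    found = set()
    for measurement in objects(graph, compound, "ic50"):
        for target in objects(graph, measurement, "forTarget"):
            found.update(objects(graph, target, "inPathway"))
    return sorted(found)


if __name__ == "__main__":
    merged = merge(compound_graph, target_graph)
    print(pathways_for_compound(merged, "gsk:KENPAL"))
```

Neither graph alone can answer "which pathways does this compound touch?"; the merged graph can, because both sides use the same identifier for the target. A real deployment would presumably use full URIs and an RDF store rather than prefixed strings in a Python set.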
Mapping from Current Formats
[Diagram label: DB]

Excel => RDF

    ls:indivCell ${
        rdf:type ls:GE_Cell ;
        ls:probeHub gl:CASP2 ;
        ls:GE_Expected_Ratio "0.2726" ;
        ls:conditionHub gl:BREAST_MALIGNANT
    };
    ls:indivCell ${
        rdf:type ls:GE_Cell ;
        ls:probeHub gl:TNFRS ;
        ls:GE_Expected_Ratio "0.0138" ;
        ls:conditionHub gl:BREAST_MALIGNANT
    };
    ls:indivCell ${
        rdf:type ls:GE_Cell ;
        ls:probeHub gl:CASP2 ;
        ls:GE_Expected_Ratio "0.1275" ;
        ls:conditionHub gl:BREAST_NORMAL
    };

[Diagram labels: Casp2; TNFRS; Breast Malig]

W3C Launches Semantic Web for HealthCare and Life Sciences Interest Group
• Interest Group formally launched Nov 2005: http://www.w3.org/2001/sw/hcls
• First Domain Group for W3C - “…take SW through its paces”
  – Not a standards group, but a group to identify the best implementations of current SW Standards!
• An Open Scientific Forum for Discussing, Capturing, and Showcasing Best Practices
• Co-chairs: Dr. Tonya Hongsermeier (Partners HealthCare); Eric Neumann (Teranode)

W3C Launches Semantic Web for HealthCare and Life Sciences Interest Group
• First formal meeting: Jan 25-26, 2006 Cambridge, MA
• SW Supporting Vendors: Oracle, IBM, HP, Siemens, Agfa
• Recent life science members: Pfizer, Merck, Partners HealthCare, Teranode, Cerebra, NIST, U Manchester, Stanford U, U Bolzano, AlzForum
• Joining W3C gets you in as a group member
  – Early access to technology and discussions
  – Interaction with potential partners and clients

Multiple Ontologies Used Together
[Diagram labels: Disease; OMIM; UMLS; Group; FOAF; Disease Polymorphisms; SNP; Drug target ontology; UniProt; Protein; BioPAX; Person; PubChem; Patent ontology; Chemical entity]
[Legend: Extant ontologies; Under development; Bridge concept]

Potential Linked Clinical Ontologies
[Diagram labels: Clinical Obs; Disease Descriptions; SNOMED; Applications; CDISC; ICD10; RCRIM (HL7); Clinical Trials ontology; Disease Models; Mechanisms; Pathways (BioPAX); IRB; Tox; Genomics; Molecules]
[Legend: Extant ontologies; Under development; Bridge concept]

Case Studies

Case Study: NeuroCommons.org
• Public Data & Knowledge for CNS R&D
• Forum
• Available for industry and academia
• All based on Semantic Web Standards

NeuroCommons
The Recontribution of Knowledge
Publications are usually copyrighted…
Knowledge of Nature should be openly shareable!

NeuroCommons.org
The Neurocommons project, a collaboration between Science Commons and the Teranode Corporation, is creating a free, public Semantic Web for neurological research. The project has three distinct goals:
1. To demonstrate that scientific impact and innovation are directly related to the freedom to legally reuse and technically transform scientific information.
2. To establish a legal and technical framework that increases the impact of investment in neurological research in a public and clearly measurable manner.
3. To develop an open community of neuroscientists, funders of neurological research, technologists, physicians, and patients to extend the Neurocommons work in an open, collaborative, distributed manner.

NeuroCommons First Steps
The first stage is underway:
• Using NLP and other automated technologies, extract machine-readable representations of neuroscience-related knowledge as contained in free text and databases
• Assemble those representations into a graph
• Publish the graph with no intellectual property rights or contractual restrictions on reuse

HCLS Neuro Tasks
• Aggregate facts and models around Parkinson’s Disease
• SWAN: scientific annotations and evidence
• Use RDF and OWL to describe
  – Brain scans in The Whole Brain Atlas
  – Neural entries in NCBI’s Entrez Gene Database
  – 'Brain Connectivity'
  – Neuronal data in SenseLab
  – Neurological Disease entries in OMIM

Case Study: BioPAX (Pathways)

    <bp:PATHWAYSTEP rdf:ID="xDshToXGSK3bPathwayStep">
      <bp:next-step rdf:resource="#xGSK3bToBetaCateninPathwayStep"/>
      <bp:step-interactions>
        <bp:MODULATION rdf:ID="xDshToXGSK3b">
          <bp:left rdf:resource="#xDsh"/>
          <bp:right rdf:resource="#xGSK-3beta"/>
          <bp:participants rdf:resource="#xGSK-3beta"/>
          <bp:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Dishevelled to GSK3beta</bp:name>
          <bp:direction rdf:datatype="http://www.w3.org/2001/XMLSchema#string">IRREVERSIBLE-LEFT-TO-RIGHT</bp:direction>
          <bp:control-type rdf:datatype="http://www.w3.org/2001/XMLSchema#string">INHIBITION</bp:control-type>
          <bp:participants rdf:resource="#xDsh"/>
        </bp:MODULATION>
      </bp:step-interactions>
    </bp:PATHWAYSTEP>

Case Study: BioPAX (Pathways)
The same pathway step, with the modulation linked to a compound through an added drug:affectedBy statement:

    <bp:PATHWAYSTEP rdf:ID="xDshToXGSK3bPathwayStep">
      <bp:next-step rdf:resource="#xGSK3bToBetaCateninPathwayStep"/>
      <bp:step-interactions>
        <bp:MODULATION rdf:ID="xDshToXGSK3b">
          <bp:left rdf:resource="#xDsh"/>
          <bp:right rdf:resource="#xGSK-3beta"/>
          <bp:participants rdf:resource="#xGSK-3beta"/>
          <bp:name rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Dishevelled to GSK3beta</bp:name>
          <bp:direction rdf:datatype="http://www.w3.org/2001/XMLSchema#string">IRREVERSIBLE-LEFT-TO-RIGHT</bp:direction>
          <bp:control-type rdf:datatype="http://www.w3.org/2001/XMLSchema#string">INHIBITION</bp:control-type>
          <drug:affectedBy rdf:resource="http://pharma.com/cmpd/CHIR99102"/>
          <bp:participants rdf:resource="#xDsh"/>
        </bp:MODULATION>
      </bp:step-interactions>
    </bp:PATHWAYSTEP>

[Diagram labels: Modulation; affectedBy; CHIR99102]

Case Study: Drug Discovery Dashboards
• Dashboards and Project Reports
• Next generation browsers for semantic information via Semantic Lenses
• Renders OWL-RDF, XML, and HTML documents
• Lenses act as information aggregators and logic style-sheets

    add { ls:TheraTopic hs:classView:TopicView }

Drug Discovery Dashboard
http://www.w3.org/2005/04/swls/BioDash
[Screenshot labels]
Topic: GSK3beta Topic
Disease: DiabetesT2
Alt Dis: Alzheimers
Target: GSK3beta
Cmpd: SB44121
CE: DBP
Team: GSK3 Team
Person: John
Related Set
Path: WNT

Bridging Chemistry and Molecular Biology
Semantic Lenses: Different Views of the same data
[Diagram labels: BioPax Components; Target Model; urn:lsid:uniprot.org:uniprot:P49841]
Apply Correspondence Rule:

    if ?target.xref.lsid == ?bpx:prot.xref.lsid
    then ?target.correspondsTo.?bpx:prot

Bridging Chemistry and Molecular Biology
• Lenses can aggregate, accentuate, or even analyze new result sets
• Behind the lens, the data can be persistently stored as RDF-OWL
• Correspondence does not need to mean “same descriptive object”, but may mean objects with identical references

Case Study: Drug Safety ‘Safety Lenses’
• Lenses can ‘focus’ data in specific ways
  – Hepatotoxicity, genotoxicity, hERG, metabolites
• Can be “wrapped” around statistical tools
• Aggregate other papers and findings (knowledge) in context with a particular project
• Align animal studies with clinical results
• Support special “Alert-channels” by regulators for each different toxicity issue
• Integrate JIT information on newly published mechanisms of action

GeneLogic GeneExpress Data
• Additional relations and aspects can be defined
[Diagram labels: Diseased Tissue; Links to OMIM (RDF)]

ClinDash: Clinical Trials Browser
• Values can be normalized across all measurables (rows)
• Samples can be aligned to their subjects using RDF rules
• Clustering can now be done over all measurables (rows)
[Diagram labels: Subjects; Clinical Obs; Expression Data]

Case Study: Nokia
• Developer’s Forum Portal

Case Study: TERANODE Design Suite
Supports Laboratory Data and Workflow
• Protocol Modeler
  – Accelerates workflow development
  – Eliminates database programming
• Protocol Player
  – Guides users through workflow
  – Automates data capture
  – Automates complex data flow plates
  – Integrates lab data with project and enterprise data

Conclusions: Key Semantic Web Principles
• Plan for change
• Free data from the application that created it
• Lower reliance on overly complex Middleware
• The value in "as needed" data integration
• Big wins come from many little ones
• The power of links - network effect
• Open-world, open solutions are cost effective
• Importance of "Partial Understanding"

Efficiency and Innovation: Semantic Web Applications Roadmap