UPDATE OF SEMANTIC INTEGRATION AT ASTRAZENECA
“SIEVE”: SAFETY INFORMATION EVALUATION AND VISUAL EXPLORATION

Authors: Suzanne Tracy1, Robert Stanley2, Jason Eshleman2, Rodj Martin2, Tom Plasterer1, David Brott1
Presenter - Conference: Robert Stanley - ISWC, Philadelphia PA, October 2015
1 AstraZeneca (Patient Safety Systems, Design and Interpretation Center of Excellence, Patient Safety, Safety Science)
2 IO Informatics (Software and Knowledge Engineering services)
Confidential © 2015 ISWC, Philadelphia PA

CONTENTS
• Goals, Challenges
• Project Metrics
• Content, outcomes, next steps
• Methods and Tools
• Staging RDF => Formal RDF (succinct, strategically aligned)
• Lexical matching, ontologies, inference
• Automation, provenance, alerting, subgraphs/data pools
• Outcome for Searching, Reporting, Visualization
• Summary Benefits, Uses for SIEVE

GOALS AND CHALLENGES

GOALS, CHALLENGES
Use Cases
• Cross-study safety research, biomarker identification and qualification, study design and interpretation, translational applications
Technical Goal
• Single resource for search and retrieval of user-defined clinical trial data from multiple studies, positioned for long-term interoperability
Challenges
• Sources: heterogeneous file structure and nomenclature, inconsistency, partial content across multiple files, duplication
• Identification: missing/incomplete identifiers, undefined headers, test files
• Harmonization: heterogeneity of terms
• End User performance and interaction requirements
Patient Safety Science needed to efficiently retrieve, search and evaluate clinical trial data and their biometric assessments in a homogeneous manner across studies…

SIEVE 1 AND SIEVE 1.5 DATA AND SYSTEM INFORMATION

SIEVE I DATA METRICS 2013-14
Clinical Trials Integration Source Data
• Subjects: 41,582
• Studies: 199
• Studies with lab codes: 159
• Subjects with lab codes: 28,300
• Studies with Adverse Events: 137
• Unique Terms (post-harmonization): 4,800
• Subjects missing Gender or DoB: 6,000
• Subjects with Gender or DoB conflicts: 40
Sample Lab Results Counts
• Cystatin C: 2,761
• Creatinine: 165,879
• Bilirubin, total: 137,496
• ALP: 123,000

SIEVE 1.5 WHAT’S NEW?
• User feedback: UAT, UI mods
• Move to production quality platform: multiple servers, LDAP
• Automation pipeline: alerts, check points
• Extending rule-set: increasing familiarity with data diversity
• New data + lower cost import training: “diverse” data vs. expected data

SIEVE 1.5 DATA METRICS 2014 - 2015
New Source Data
• 14 new studies imported*
• 2,484 subjects
• 350,210 lab results
• 2,404 different column headers in 272 different files
• 1,425 headers appear in only one file**
* Automation is useful, but when each new dataset is different, decisions may need to be made. For example, should Study 12497a and Study 12497b be modeled as one or two studies?
** Note regarding data diversity: most column headers appeared only once.

SIEVE 2 – WHAT’S NEXT?
• Dual LDAP requirements
• Growing ETL rules with automation, alerts, human guidance when needed
• ‘Assisted learning’ to handle diverse new data
• Assessing new Web Query architecture
  • Adds micro-services/programmability, web-based graph views, horizontal scalability

METHODS

TACTICAL => STRATEGIC INTEGRATION
STAGING RDF - SIEVE I
• Bottom-up (data), top-down (use cases)
• Staging model to curated terms and relationships, accurate identifiers
  • StudyID, SubjectID, Sex, DoB, RowID [Treatment, Outcome, …]
• Apply lexical matching, SPARQL/inference, ontologies, dictionaries
• Select predicates and URIs for best practices, inference, elegance
• Surface conceptually meaningful and searchable data model
• Visually iterate decision-making with Subject Matter Experts
• Maintain provenance on decisions, track to source
UPPER HARMONIZATION - SIEVE I.5
• Data model refinement => ontology / vocabulary including
  • AZ Corporate and Lab Code Dictionaries
  • SDTM (AZ / CDISC)
  • MedDRA for Adverse Events
  • NCI Thesaurus
• Succinct ontology (reasoning, interoperability, subgraphs)

FUNDAMENTAL INTEGRATION PROCESS

STARTING WITH DIVERSITY
I want to be able to ask questions that include all of my clinical trials data but I currently can’t do this.
The data is in separate applications, using different standards and databases.
It can take days (even months) to manually sort through these data to find an answer.

Clinical Trial 1 Data Set
Patient Name    Cond    Trtmnt
[Patient x]     Alz     Azilect™

Clinical Trial 2 Data Set
Pt ID           Disease Diag.   Rx
[Pt ID xx4x]    Parkinsons      Rasagaline

Copyright © 2015 IO Informatics Inc.

CLINICAL DATA MODELING
First, “drop” the data into the system for analysis.
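As a rough illustration of what “dropping” tabular data into a staging model could look like, the sketch below turns each table row into subject/predicate/object tuples while keeping the original column headers untouched. It is not the SIEVE implementation: the `staging:` prefix, the `stage_rows` function and the row identifier scheme are hypothetical, and plain Python tuples stand in for real RDF triples.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def stage_rows(study_id: str, rows: List[Dict[str, str]], id_column: str) -> List[Triple]:
    """Turn each table row into staging triples without interpreting the headers."""
    triples: List[Triple] = []
    for row_number, row in enumerate(rows, start=1):
        # Hypothetical row identifier: study id plus row position.
        subject = f"{study_id}/row/{row_number}"
        triples.append((subject, "staging:inStudy", study_id))
        triples.append((subject, "staging:sourceId", row[id_column]))
        for header, value in row.items():
            if header == id_column or not value:
                continue
            # Keep the original header as the predicate so the source stays traceable.
            triples.append((subject, f"staging:{header}", value))
    return triples


# The two example data sets from the slide, with their original headers and terms.
trial_1 = [{"Patient Name": "Patient x", "Cond": "Alz", "Trtmnt": "Azilect"}]
trial_2 = [{"Pt ID": "xx4x", "Disease Diag.": "Parkinsons", "Rx": "Rasagaline"}]

staged = stage_rows("trial1", trial_1, "Patient Name") + stage_rows("trial2", trial_2, "Pt ID")
for triple in staged:
    print(triple)
```

Nothing is harmonized at this point: “Cond” and “Disease Diag.” remain separate predicates, which is what makes the later rule-based alignment step necessary.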
Creation and application of staging RDF reduces manual review requirements by over 90%.*

Clinical Trial 1 Data Set (is transformed)
Patient Name    Cond    Trtmnt
[Patient x]     Alz     Azilect™

Semantic Clinical Trials Network (for initial discovery, harmonization; adds meaningful relationships)
Patient [Preferred ID #] -> has diagnosis -> Alzheimers Disease
Patient [Preferred ID #] -> has treatment -> Azilect™

* Applying SPARQL and visualization, AZ data dictionary, SKOS, (UMLS, PROV, etc.) for alignment with AZ’s “upper” ontologies and nomenclature

CREATE AND APPLY RULES
VISUALIZATION, LEXICAL MATCHING, QUERIES / ENTAILMENT
Bringing the next dataset into the system applies lexical matching, ontologies/vocabularies and inference for curation with formal provenance.*

Clinical Trial 2 Data Set (is enhanced into)
Pt ID           Disease Diag.   Rx
[Pt ID xx4x]    Parkinsons      Rasagaline

Semantic Clinical Trials Network (aligns content with preferred terms, synonyms)
Patient [Preferred ID #] -> has diagnosis -> Parkinsons Disease
Patient [Preferred ID #] -> has treatment -> Rasagiline (source term: Rasagaline)
Rasagiline -> has brand name -> Azilect™

* Combine lexical matching, inference / entailment, ontology/thesauri with visual modeling and rule creation

CLINICAL DATA MODELING OUTCOME - LINKING WELL-FORMED DATA
Staging RDF is linked by common concepts, instance terminology and relationships, with the fewest predicates needed to ensure well-formed, searchable data.*
The Semantic Trial 1 Network and Semantic Trial 2 Network are automatically linked by all common terms.

Semantic Trial 1 Network
Patient [Preferred ID #] -> has diagnosis -> Alzheimers Disease
Patient [Preferred ID #] -> has treatment -> Azilect™

Semantic Trial 2 Network
Patient [Preferred ID #] -> has diagnosis -> Parkinsons Disease
Patient [Preferred ID #] -> has treatment -> Rasagiline (source term: Rasagaline)
Rasagiline -> has brand name -> Azilect™

* SIEVE 1.5 aligns with AZ drug dictionaries and corporate standards; NCI Thesaurus, SDTM, SKOS, also VOID for metrics

OUTCOME…
All data is harmonized and deeply searchable.
Find patients diagnosed with both Parkinsons and Alzheimers disease who were treated with Azilect.
Meaningful searches across linked terms and relationships.

Patient [Preferred ID #] (automatically linked by common terms)
-> has diagnosis -> Alzheimers Disease
-> has diagnosis -> Parkinsons Disease
-> has treatment -> Rasagaline -> has brand name -> Azilect™

SOFTWARE TOOLS
SENTIENT APPLICATIONS
• Knowledge Explorer
• Web Query
• Custom Web Applications (ASK)
Complementary Tools
• SDBs, Knime/PLP, R, …

SOFTWARE-ASSISTED VISUAL DATA MODELING
INTEGRATES FILES, PUBLIC AND RELATIONAL RESOURCES
1) Tabular, relational, XML, semantic resources are imported (e.g. NCBI resources, GEO datasets)
2) Data-driven, transparent, iterative “knowledge engineering” applies inference, ontologies, dictionaries, SME interaction
3) Extensible integration mappers, queries and rules…

VISUALLY SUPPORTED MODELING WORKFLOW
Mapping tools create interoperable RDF from text, spreadsheets, relational databases and other sources. Select source (Step 1), create mapping (Step 2) and review it (Step 3).
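The harmonization and cross-trial search described in the preceding slides can be sketched in a few lines. This is only an illustration under stated assumptions, not the Sentient or SIEVE implementation: the lookup tables (`PREDICATES`, `PREFERRED_TERMS`, `BRAND_NAMES`, `PREFERRED_ID`), the identifier `patient:001` and both functions are hypothetical stand-ins for the dictionaries, thesauri and SPARQL rules named in the slides, and plain Python tuples stand in for RDF triples.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical lookup tables standing in for data dictionaries and thesauri.
PREDICATES = {
    "Cond": "has diagnosis",
    "Disease Diag.": "has diagnosis",
    "Trtmnt": "has treatment",
    "Rx": "has treatment",
}
PREFERRED_TERMS = {
    "alz": "Alzheimers Disease",
    "parkinsons": "Parkinsons Disease",
    "rasagaline": "Rasagiline",  # misspelled source term mapped to the preferred term
    "azilect": "Azilect",
}
BRAND_NAMES = {"Rasagiline": "Azilect"}
# Assumption: both source identifiers resolve to the same preferred patient ID.
PREFERRED_ID = {"Patient x": "patient:001", "xx4x": "patient:001"}


def harmonize(records: Iterable[Tuple[str, str, str]]) -> Set[Triple]:
    """Map (source id, header, value) records onto preferred IDs, predicates and terms."""
    triples: Set[Triple] = set()
    for source_id, header, value in records:
        term = PREFERRED_TERMS.get(value.strip().lower(), value)
        triples.add((PREFERRED_ID[source_id], PREDICATES[header], term))
    for generic, brand in BRAND_NAMES.items():
        triples.add((generic, "has brand name", brand))
    return triples


def patients_with(triples: Set[Triple], diagnoses: Set[str], brand: str) -> List[str]:
    """Find patients having all given diagnoses and treated with the brand or its generic."""
    generics = {s for s, p, o in triples if p == "has brand name" and o == brand}
    accepted = generics | {brand}
    patients = {s for s, p, _ in triples if p in ("has diagnosis", "has treatment")}
    hits = []
    for patient in sorted(patients):
        found = {o for s, p, o in triples if s == patient and p == "has diagnosis"}
        treated = {o for s, p, o in triples if s == patient and p == "has treatment"}
        if diagnoses <= found and treated & accepted:
            hits.append(patient)
    return hits


records = [
    ("Patient x", "Cond", "Alz"),
    ("Patient x", "Trtmnt", "Azilect"),
    ("xx4x", "Disease Diag.", "Parkinsons"),
    ("xx4x", "Rx", "Rasagaline"),
]
linked = harmonize(records)
matches = patients_with(linked, {"Alzheimers Disease", "Parkinsons Disease"}, "Azilect")
print(matches)
```

Because both trials resolve to the same preferred patient ID and the same preferred terms, the example search (“both Parkinsons and Alzheimers, treated with Azilect”) succeeds even though neither source file contains all three facts on its own.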
INSTANCE LEVEL DATA MODEL SUBSET
INTERACTIVE VISUAL REVIEW & MODELING
• Direct-to-concept data model useful for SMEs and information scientists (KEs)
• Iterative modeling and enrichment environment
• Create bottom-up data models; import, apply, refine ontologies
• Test and apply SPARQL, rules, scripts, thesauri

ETL WORKFLOW FOR AUTOMATED DELIVERY TO SIEVE
1) Deep semantic interaction is provided: interoperability, extensibility, inference, pattern-based searching, (…)
2) SPARQL queries, inference and updates on the pipeline assure reliability and currency with minimal downtime for End Users
3) Applications, web services
3a) Fast, full-featured Query and Reporting for End Users
3b) Integrated data for visualization, statistics, chem search, web services, (…)

SEARCHING, VISUALIZATION, REPORTING

SEARCH INTERFACES
ADVANCED AND WEB-BASED QUERY UIS
• Expert SPARQL and ‘no training required’ End User search options
• Ad hoc, saved, shared queries
• Selectively accessible Views, Filters, faceted browsing
• Nested, range, substructure, etc. functions

VISUALIZATION AND REPORTING
(GRAPH AND) WEB-BASED REPORT UIS
• Expert Network Views
• Tabular, sortable, customizable Reports
• Faceted subsearching, list-based queries
• Nested, range, substructure, etc. functions
• Modify queries from Reports
• Charting and view connection options (Spotfire, etc.)
• “ASK” pattern-based querying and alerting options
• Export .xml, .xls, .tsv, .rdf (.n3, .nt, .ttl, .owl)

BENEFITS

BENEFITS
SUBSTANTIVE / IMMEDIATE BENEFITS
• Content for critical uses is available to researchers within minutes rather than weeks / months
• User interface provides research-facing, rapid access to actual study data rather than just an index of trials
• Targeted views, variables (including clinical assay, therapy area, adverse event, subject demographics, …)
• Faceted search and reporting provide sensitive and specific identification of potential biomarkers and information about adverse events
• “Questions that used to take six months to answer… are now answered in six seconds.”
TECHNICAL / LONGER TERM BENEFITS
• Semantic resource prepared for delivery to web services, other apps
• Data curation, routines and loading procedures are leading the way for other clinical data projects
• “Cooperation without Collaboration”: SIEVE data resource is positioned for rapid, agile extension / federation with complementary data resources

THANKS!
AstraZeneca
Suzanne Tracy, Tom Plasterer, Michael Goodman, Kerstin Forsberg, Kaushal Desai, David Cook, (…)
IO Informatics
Jason Eshleman, Erich Gombocz, Alexander DeLeon, Robert “Rodj” Martin, Sergey Nikitin, Jane Condrashina, (…)
And others, including Franz (Allegrograph), Openlink (Virtuoso), Peter Bogetti, Stephen Furlong
For more information contact: rstanley@io-informatics.com or visit http://www.io-informatics.com

USES FOR SIEVE

ACTIVE USES
Study Design & Interpretation
• Provide ranges on trial endpoints for evidence-based uncertainty analysis
• Support interpretation of trial outcomes through analysis of evidence and summary data
Safety & Biomarker Project Support
• Support Safety Physicians for ongoing applications
• Identify previous biomarker work to facilitate vendor selection
Biomarker Qualification (Example: Cystatin C)
• Access and review clinical biomarker data with lab test range and variability information
• Provide biomarker levels in different disease populations (age, gender, etc.)
Translational Medicine (Example: Adverse Event - Dizziness)
• Provide clinical insights back to preclinical to incorporate clinical findings
• Provide information on incidence of an adverse event
• Provide biomarker lab values that can be used to help translate pre-clinical toxicity studies

DISCUSSION