IOI_SIEVE_for_AstraZeneca_AZ_ISWC_20151012

UPDATE OF SEMANTIC INTEGRATION AT ASTRAZENECA
“SIEVE”
SAFETY INFORMATION EVALUATION AND VISUAL EXPLORATION
Authors:
Suzanne Tracy (1), Robert Stanley (2), Jason Eshleman (2), Rodj Martin (2), Tom Plasterer (1), David Brott (1)
Presenter - Conference:
Robert Stanley - ISWC, Philadelphia PA, October 2015
(1) AstraZeneca (Patient Safety Systems, Design and Interpretation Center of Excellence, Patient Safety, Safety Science)
(2) IO Informatics (Software and Knowledge Engineering services)
Confidential © 2015 ISWC, Philadelphia PA
CONTENTS
• Goals, Challenges
• Project Metrics
• Content, outcomes, next steps
• Methods and Tools
• Staging RDF => Formal RDF (succinct, strategically aligned)
• Lexical matching, ontologies, inference
• Automation, provenance, alerting, subgraphs/data pools
• Outcome for Searching, Reporting, Visualization
• Summary Benefits, Uses for SIEVE
GOALS AND CHALLENGES
GOALS, CHALLENGES
Use Cases
• Cross-study safety research, biomarker identification and qualification, study design and interpretation, translational applications
Technical Goal
• Single resource for search and retrieval of user-defined clinical trial data from multiple studies, positioned for long-term interoperability
Challenges
• Sources – heterogeneous file structure and nomenclature, inconsistency, partial content across multiple files, duplication
• Identification – missing/incomplete identifiers, undefined headers, test files
• Harmonization – heterogeneity of terms
• End User performance and interaction requirements
Patient Safety Science needed to efficiently retrieve, search and evaluate clinical trial data and their biometric assessments in a homogeneous manner across studies…
SIEVE 1 AND SIEVE 1.5
DATA AND SYSTEM INFORMATION
SIEVE I DATA METRICS
2013-14 Clinical Trials Integration
Source Data
• Subjects: 41,582
• Studies: 199
• Studies w/ lab codes: 159
• Subjects w/ lab codes: 28,300
• Studies w/ Adverse Events: 137
• Unique Terms (post-harmonization): 4,800
• Subjects missing Gender or DoB: 6,000
• Subjects w/ Gender or DoB conflicts: 40
Sample Lab Results Counts
• Cystatin C: 2,761
• Creatinine: 165,879
• Bilirubin, total: 137,496
• ALP: 123,000
SIEVE 1.5
WHAT’S NEW?
• User feedback
• UAT, UI Mods
• Move to production quality platform
• Multiple servers, LDAP
• Automation pipeline
• Alerts, check points
• Extending rule-set
• Increasing familiarity with data diversity
• New data + lower cost import training
• “Diverse” data vs. expected data
SIEVE 1.5 DATA METRICS
2014 - 2015 New Source Data
• 14 New Studies imported*
• 2,484 Subjects
• 350,210 Lab results
• 2,404 different column headers in 272 different files
• 1,425 headers appear in only one file**
* Automation is useful, but when each new dataset is different, decisions may need to be made. For example, should Study 12497a and Study 12497b be modeled as one or two studies?
** Note regarding data diversity: most column headers (1,425 of 2,404) appeared in only one file.
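Header-diversity figures of this kind can be gathered by counting how many files each column header appears in. The following is a minimal sketch, assuming each source file has already been reduced to its list of column headers; the file names and headers are invented for illustration and are not AZ data.

```python
from collections import Counter


def header_file_counts(headers_by_file):
    """Count, for each column header, how many files it appears in."""
    counts = Counter()
    for headers in headers_by_file.values():
        # Use a set so a header repeated within one file is counted once.
        counts.update(set(headers))
    return counts


def singleton_headers(counts):
    """Return the headers that appear in exactly one file, sorted by name."""
    return sorted(h for h, n in counts.items() if n == 1)


# Hypothetical input: file name -> column headers found in that file.
example = {
    "study_a_labs.csv": ["SubjectID", "ALT", "Creatinine"],
    "study_b_labs.csv": ["SubjectID", "CREAT", "Visit"],
    "study_c_demog.csv": ["SubjectID", "Sex", "DoB"],
}

counts = header_file_counts(example)
singles = singleton_headers(counts)
print(len(counts), "distinct headers;", len(singles), "appear in only one file")
```

Headers seen in only one file (here "Creatinine" vs. "CREAT") are the natural candidates for harmonization review.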
SIEVE 2 – WHAT’S NEXT?
• Dual LDAP requirements
• Growing ETL rules with automation, alerts, human guidance when needed
• ‘Assisted learning’ to handle diverse new data
• Assessing new Web Query architecture
• Adds micro-services/programmability, web-based graph views, horizontal scalability
METHODS
TACTICAL => STRATEGIC INTEGRATION
STAGING RDF - SIEVE I
• Bottom-up (data), top-down (use cases)
• Staging model to curated terms and relationships, accurate identifiers
• StudyID, SubjectID, Sex, DoB, RowID [Treatment, Outcome, …]
• Apply lexical matching, SPARQL/inference, ontologies, dictionaries
• Select predicates and URIs for best practices, inference, elegance
• Surface conceptually meaningful and searchable data model
• Visually iterate decision-making with Subject Matter Experts
• Maintain provenance on decisions, track to source
UPPER HARMONIZATION – SIEVE 1.5
• Data model refinement => ontology / vocabulary including
  • AZ Corporate and Lab Code Dictionaries
  • SDTM – (AZ / CDISC)
  • MedDRA for Adverse Events
  • NCI Thesaurus
• Succinct ontology (reasoning, interoperability, subgraphs)
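One way to picture "maintain provenance on decisions, track to source" is to record every header-mapping decision together with the file and rule it came from. The sketch below is an assumption about how such a record could look, not the SIEVE implementation; the rule dictionary, the `MappingDecision` record and the file name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical dictionary of raw column headers -> staging identifiers.
HEADER_RULES = {
    "pt id": "SubjectID",
    "study": "StudyID",
    "gender": "Sex",
    "sex": "Sex",
    "date of birth": "DoB",
    "dob": "DoB",
}


@dataclass(frozen=True)
class MappingDecision:
    """One mapping decision, kept so it can be traced back to its source."""
    source_file: str
    raw_header: str
    staging_field: Optional[str]
    rule: str


def map_headers(source_file, raw_headers):
    """Map raw headers to staging fields and log how each decision was made."""
    decisions = []
    for raw in raw_headers:
        key = raw.strip().lower()
        field = HEADER_RULES.get(key)
        # Unmapped headers are flagged for human review instead of guessed.
        rule = f"dictionary:{key}" if field else "unmapped:needs SME review"
        decisions.append(MappingDecision(source_file, raw, field, rule))
    return decisions


decisions = map_headers("trial2.csv", ["Pt ID", "Gender", "Rx"])
for d in decisions:
    print(d)
```

Keeping unmapped headers in the log, rather than dropping them, gives Subject Matter Experts a worklist for the next iteration.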
FUNDAMENTAL
INTEGRATION PROCESS
STARTING WITH DIVERSITY
I want to be able to ask questions that include all of my clinical trials data but I currently can’t do this.
Clinical Trial 1 Data Set
Patient Name    Cond            Trtmnt
[Patient x]     Alz             Azilect™
Clinical Trial 2 Data Set
Pt ID           Disease Diag.   Rx
[Pt ID xx4x]    Parkinsons      Rasagaline
The data is in separate applications, using different standards and databases.
It can take days (even months) to manually sort through these data to find an answer.
Copyright © 2015 IO Informatics Inc.
CLINICAL DATA MODELING
• First, “drop” the data into the system for analysis. Creation and application of staging RDF reduces manual review requirements by over 90%.*
Clinical Trial 1 Data Set (is transformed)
Patient Name    Cond            Trtmnt
[Patient x]     Alz             Azilect™
Semantic Clinical Trials Network (for initial discovery, harmonization; adds meaningful relationships)
Patient [Preferred ID #] -- has diagnosis --> Alzheimers Disease
Patient [Preferred ID #] -- has treatment --> Azilect™
* Applying SPARQL and visualization, AZ data dictionary, SKOS, (UMLS, PROV, etc.) for alignment with AZ’s “upper” ontologies and nomenclature
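The transformation of a tabular row into a small network of relationships can be sketched with plain (subject, predicate, object) tuples. This is an illustration only, assuming a hand-written column-to-predicate mapping; the subject identifier and the alias dictionary are hypothetical, and a real staging model would use RDF URIs rather than strings.

```python
def row_to_triples(subject, row, column_predicates, value_aliases=None):
    """Convert one tabular row into (subject, predicate, object) triples."""
    value_aliases = value_aliases or {}
    triples = []
    for column, predicate in column_predicates.items():
        value = row.get(column)
        if not value:
            continue  # empty or absent cell: assert nothing
        # Replace abbreviations with the curated term when one is known.
        triples.append((subject, predicate, value_aliases.get(value, value)))
    return triples


# Illustrative mapping for the Clinical Trial 1 layout.
TRIAL1_PREDICATES = {"Cond": "has diagnosis", "Trtmnt": "has treatment"}
# Hypothetical dictionary entry expanding an abbreviated condition.
ALIASES = {"Alz": "Alzheimers Disease"}

row = {"Patient Name": "[Patient x]", "Cond": "Alz", "Trtmnt": "Azilect"}
triples = row_to_triples("patient:preferred-id", row, TRIAL1_PREDICATES, ALIASES)
for triple in triples:
    print(triple)
```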
CREATE AND APPLY RULES
VISUALIZATION, LEXICAL MATCHING, QUERIES / ENTAILMENT
• Bringing the next dataset into the system applies lexical matching, ontologies/vocabularies and inference for curation with formal provenance.
Clinical Trial 2 Data Set (is enhanced into)
Pt ID           Disease Diag.   Rx
[Pt ID xx4x]    Parkinsons      Rasagaline
Semantic Clinical Trials Network (aligns content with preferred terms, synonyms)
Patient [Preferred ID #] -- has diagnosis --> Parkinsons Disease
Patient [Preferred ID #] -- has treatment --> Rasagaline (aligned with preferred term Rasagiline)
Rasagiline -- has brand name --> Azilect™
* Combine lexical matching, inference / entailment, ontology/thesauri with visual modeling and rule creation
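Lexical matching of a misspelled source term ("Rasagaline") against a list of preferred terms can be approximated with Python's standard `difflib` module. This is a sketch of the general technique, not the matching method used in SIEVE; the vocabulary and the 0.85 cutoff are assumptions chosen for the example.

```python
import difflib

# Illustrative vocabulary; a real one would come from a drug dictionary.
PREFERRED_TERMS = ["Rasagiline", "Selegiline", "Donepezil"]


def align_term(raw_term, preferred_terms, cutoff=0.85):
    """Return (preferred_term, similarity) for the closest match, or (None, 0.0)."""
    lookup = {term.lower(): term for term in preferred_terms}
    matches = difflib.get_close_matches(
        raw_term.lower(), list(lookup), n=1, cutoff=cutoff
    )
    if not matches:
        return None, 0.0  # nothing close enough: leave for human review
    score = difflib.SequenceMatcher(None, raw_term.lower(), matches[0]).ratio()
    return lookup[matches[0]], score


term, score = align_term("Rasagaline", PREFERRED_TERMS)
print(term, round(score, 2))
```

A high cutoff keeps the matcher conservative, so that unrelated terms return no match and fall through to curation instead of being aligned silently.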
CLINICAL DATA MODELING
OUTCOME - LINKING WELL-FORMED DATA
• Staging RDF is linked by common concepts, instance terminology and relationships, with the fewest predicates needed to ensure well-formed, searchable data
Semantic Trial 1 Network
Patient [Preferred ID #] -- has diagnosis --> Alzheimers Disease
Patient [Preferred ID #] -- has treatment --> Azilect™
Semantic Trial 2 Network
Patient [Preferred ID #] -- has diagnosis --> Parkinsons Disease
Patient [Preferred ID #] -- has treatment --> Rasagaline / Rasagiline
Rasagiline -- has brand name --> Azilect™
(The two networks are automatically linked by all common terms.)
* SIEVE 1.5 aligns with AZ drug dictionaries and corporate standards; NCI Thesaurus, SDTM, SKOS, also VOID for metrics
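Linking by common terms falls out of the graph model itself: once two datasets use the same identifier for a concept, taking the union of their triples connects them. A minimal sketch with string identifiers standing in for URIs; the patient identifiers are hypothetical.

```python
def nodes(triples):
    """All subjects and objects that occur in a set of triples."""
    return {s for s, _, _ in triples} | {o for _, _, o in triples}


trial1 = {
    ("patient:1", "has diagnosis", "Alzheimers Disease"),
    ("patient:1", "has treatment", "Azilect"),
}
trial2 = {
    ("patient:2", "has diagnosis", "Parkinsons Disease"),
    ("patient:2", "has treatment", "Rasagiline"),
    ("Rasagiline", "has brand name", "Azilect"),
}

# Merging is a set union; no join keys have to be declared in advance.
merged = trial1 | trial2
# Terms present in both networks are the points where they connect.
shared = nodes(trial1) & nodes(trial2)
print(sorted(shared))
```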
OUTCOME…
All data is harmonized and deeply searchable
• Find patients diagnosed with both Parkinsons and Alzheimers disease who were treated with Azilect.
Patient [Preferred ID #] (automatically linked by common terms)
Patient [Preferred ID #] -- has diagnosis --> Alzheimers Disease
Patient [Preferred ID #] -- has diagnosis --> Parkinsons Disease
Patient [Preferred ID #] -- has treatment --> Rasagaline -- has brand name --> Azilect™
Meaningful searches across linked terms and relationships.
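A question such as "patients with both diagnoses who were treated with Azilect" becomes a pattern match over the linked triples. In a triple store this would typically be written in SPARQL; the Python below is only a stand-in to show the logic, with invented patient identifiers and a one-step "has brand name" lookup in place of real inference.

```python
def treated_with(triples, patient, drug):
    """True if the drug is a recorded treatment, directly or via 'has brand name'."""
    treatments = {o for s, p, o in triples if s == patient and p == "has treatment"}
    brands = {o for s, p, o in triples if s in treatments and p == "has brand name"}
    return drug in treatments | brands


def find_patients(triples, diagnoses, drug):
    """Patients who have every listed diagnosis and were treated with the drug."""
    patients = {s for s, p, _ in triples if p == "has diagnosis"}
    result = []
    for patient in sorted(patients):
        found = {o for s, p, o in triples if s == patient and p == "has diagnosis"}
        if set(diagnoses) <= found and treated_with(triples, patient, drug):
            result.append(patient)
    return result


# Hypothetical harmonized data.
data = {
    ("patient:A", "has diagnosis", "Alzheimers Disease"),
    ("patient:A", "has diagnosis", "Parkinsons Disease"),
    ("patient:A", "has treatment", "Rasagiline"),
    ("patient:B", "has diagnosis", "Parkinsons Disease"),
    ("patient:B", "has treatment", "Rasagiline"),
    ("Rasagiline", "has brand name", "Azilect"),
}

hits = find_patients(data, ["Alzheimers Disease", "Parkinsons Disease"], "Azilect")
print(hits)
```

The match on "Azilect" succeeds even though the source recorded the generic name, because the brand-name relationship is part of the graph.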
SOFTWARE TOOLS
SENTIENT APPLICATIONS
• Knowledge Explorer
• Web Query
• Custom Web Applications (ASK)
Complementary Tools
• SDBs, Knime/PLP, R, …
SOFTWARE-ASSISTED VISUAL DATA MODELING
INTEGRATES FILES, PUBLIC AND RELATIONAL RESOURCES
1) Tabular, relational, XML, semantic resources are imported (e.g. NCBI resources, GEO datasets)
2) Data-driven, transparent, iterative “knowledge engineering” applies inference, ontologies, dictionaries, SME interaction
3) Extensible integration mappers, queries and rules…
Visually supported
Modeling Workflow
Mapping tools create interoperable RDF from text, spreadsheets, relational database and other
sources. Select source (Step 1), create mapping (Step 2) and review it (Step 3).
INSTANCE LEVEL DATA MODEL SUBSET
INTERACTIVE VISUAL REVIEW & MODELING
• Direct-to-concept data model useful for SMEs and information scientists (KEs)
• Iterative modeling and enrichment environment
• Create bottom-up data models, import, apply, refine ontologies
• Test and apply SPARQL, rules, scripts, thesauri
ETL WORKFLOW
FOR AUTOMATED DELIVERY TO SIEVE
1) Deep semantic interaction is provided – interoperability, extensibility, inference, pattern-based searching, (…)
2) SPARQL queries, inference, updates on pipeline assure reliability and currency with minimal downtime for End Users
3) Applications, web services
3a) Fast, full-featured Query and Reporting for End Users
3b) Integrated data for visualization, statistics, chem search, web services, (…)
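An automated pipeline of this kind usually includes checkpoints that raise alerts before data reaches end users. As one hedged example of such a check, the sketch below flags subjects whose Sex or DoB is missing or conflicting across source records; the record layout and field names are assumptions for illustration, not the SIEVE pipeline.

```python
from collections import defaultdict


def check_demographics(records):
    """Return (subject, field, problem) alerts for missing or conflicting values."""
    values = defaultdict(lambda: {"Sex": set(), "DoB": set()})
    for rec in records:
        subject = values[rec["SubjectID"]]
        for field in ("Sex", "DoB"):
            if rec.get(field):
                subject[field].add(rec[field])
    alerts = []
    for subject_id in sorted(values):
        for field in ("Sex", "DoB"):
            found = values[subject_id][field]
            if not found:
                alerts.append((subject_id, field, "missing"))
            elif len(found) > 1:
                alerts.append((subject_id, field, "conflict"))
    return alerts


# Hypothetical records gathered from several source files.
records = [
    {"SubjectID": "S1", "Sex": "F", "DoB": "1950-01-01"},
    {"SubjectID": "S1", "Sex": "M"},
    {"SubjectID": "S2", "Sex": "M"},
]

alerts = check_demographics(records)
for alert in alerts:
    print(alert)
```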
SEARCHING, VISUALIZATION, REPORTING
SEARCH INTERFACES
ADVANCED AND WEB-BASED QUERY UIS
• Expert SPARQL and ‘no training required’ End User search options
• Ad hoc, saved, shared queries
• Selectively accessible Views, Filters, faceted browsing
• Nested, range, substructure, etc. functions
VISUALIZATION AND REPORTING
(GRAPH AND) WEB-BASED REPORT UIS
• Expert Network Views
• Tabular, sortable, customizable Reports
• Faceted subsearching, list-based queries
• Nested, range, substructure, etc. functions
• Modify queries from Reports
• Charting and view connection options (Spotfire, etc.)
• “ASK” pattern-based querying and alerting options
• Export .xml, .xls, .tsv, .rdf (.n3, .nt, .ttl, .owl)
BENEFITS
BENEFITS
SUBSTANTIVE / IMMEDIATE BENEFITS
• Content for critical uses is available to researchers within minutes rather than weeks / months
• User interface provides research-facing, rapid access to actual study data rather than just an index of trials
• Targeted views, variables (including clinical assay, therapy area, adverse event, subject demographics, …)
• Faceted search and reporting provides sensitive and specific identification of potential biomarkers, information about adverse events.
“Questions that used to take six months to answer… are now answered in six seconds.”
TECHNICAL / LONGER TERM BENEFITS
• Semantic resource prepared for delivery to web services, other apps
• Data curation, routines and loading procedures are leading the way for other clinical data projects
• “Cooperation without Collaboration” – SIEVE data resource is positioned for rapid, agile extension / federation with complementary data resources
THANKS!
AstraZeneca
Suzanne Tracy, Tom Plasterer, Michael Goodman, Kerstin Forsberg, Kaushal Desai, David Cook, (…)
IO Informatics
Jason Eshleman, Erich Gombocz, Alexander DeLeon, Robert “Rodj” Martin, Sergey Nikitin, Jane Condrashina, (…)
And others, including
Franz (AllegroGraph), OpenLink (Virtuoso), Peter Bogetti, Stephen Furlong
For more information contact:
rstanley@io-informatics.com or visit http://www.io-informatics.com
USES FOR SIEVE
ACTIVE USES
Study Design & Interpretation
• Provide ranges on trial endpoints for evidence-based uncertainty analysis
• Support interpretation of trial outcomes through analysis of evidence and summary data
Safety & Biomarker Project Support
• Support Safety Physicians for ongoing applications
• Identify previous biomarker work to facilitate vendor selection
Biomarker Qualification (Example: Cystatin C)
• Access and review clinical biomarker data with lab test range and variability information
• Provide biomarker levels in different disease populations (age, gender, etc.)
Translational Medicine (Example: Adverse Event - Dizziness)
• Provide clinical insights back to preclinical to incorporate clinical findings
• Provide information on incidence of an adverse event
• Provide biomarker lab values that can be used to help translate pre-clinical toxicity studies
DISCUSSION
For more information contact:
rstanley@io-informatics.com or visit http://www.io-informatics.com