Data - Open PHACTS

Open PHACTS
“Data integration for all”
Andrew Leach
Task, workflow and results
Task: create a focussed set to identify leads against voltage-gated potassium channels
AUREUS search (targets: voltage-gated potassium channels)
Apply filters (MW, cLogP, Lipinski + remove undesirable target) ⇒ ~1000 molecules
Similarity searches (RG, TP, Daylight)
Cluster analysis ⇒ ~10000 molecules selected
IonWorks© single shot screening
240 single shot hits progressed into full curve assay
5 full curve actives (in at least one test occasion)
Series for lead optimisation
Stefan Senger, ca. 2004
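The "Apply filters" step above can be illustrated with a minimal sketch. It assumes the descriptors (molecular weight, cLogP, hydrogen-bond donors and acceptors) have already been computed, and it uses the textbook rule-of-five thresholds; the slides do not state the cut-offs actually used in the 2004 workflow. The `Compound` class and the two example records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Hypothetical record holding pre-computed descriptors."""
    name: str
    mw: float     # molecular weight, g/mol
    clogp: float  # calculated logP
    hbd: int      # hydrogen-bond donors
    hba: int      # hydrogen-bond acceptors


def passes_lipinski(c: Compound, max_violations: int = 0) -> bool:
    """Count rule-of-five violations and compare with the allowed number."""
    violations = sum([c.mw > 500, c.clogp > 5, c.hbd > 5, c.hba > 10])
    return violations <= max_violations


def apply_filters(compounds):
    """Keep only the compounds that pass the Lipinski check."""
    return [c for c in compounds if passes_lipinski(c)]


# Made-up example data, for illustration only.
compounds = [
    Compound("cpd-A", 342.4, 3.1, 2, 5),
    Compound("cpd-B", 612.7, 6.2, 4, 11),
]
kept = apply_filters(compounds)
```

Setting `max_violations=1` would give the more permissive variant of the rule that tolerates a single violation.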
We (may) know where the data is, but integrating
is a pain, bespoke, and often only for experts
Q: Identify all oxidoreductase inhibitors with an activity <100nM in both
mouse and human
Q: The current Factor Xa lead series is characterised by substructure X.
Retrieve all bioactivity data in serine protease assays for molecules
that contain substructure X.
Q: For a given interaction profile, give me compounds similar to it.
ChEMBL
Gene
Ontology
DrugBank
ChEBI
Uniprot
Wikipathways
UMLS
ChemSpider
ConceptWiki
etc.
Internal
The Innovative Medicines Initiative
Biggest public-private partnership in
area of medicine
Collaboration between European
Commission and European
Federation of Pharmaceutical
Industries and Associations (EFPIA)
Promotion of medical innovation in
Europe
Tackle key bottlenecks
Recognises “in kind” contributions
Focus on key problems
– Efficacy, Safety, Education &
Training, Knowledge
Management
Public Domain Drug Discovery Data
Pharma are accessing, processing, storing & re-processing
[Figure: GSK, AZ, Pfizer and Merck each separately take Literature, Genbank, Patents and PubChem through Databases → Downloads → Data Integration → Data Analysis → Firewalled Databases]
Why repeat at each company?
Information Tombs
– Built for primary use-case
– Tailored indexes
– Tailored GUIs
– Unique language & metadata
– Poor interoperability/integration
In vivo
Portfolio Literature
HR
Synthesis
SAR
Docs
Safety
Etc
Project Partners
Pfizer Limited – Coordinator
Universität Wien – Managing entity
Technical University of Denmark
University of Hamburg, Center for Bioinformatics
BioSolveIT GmbH
Consorci Mar Parc de Salut de Barcelona
Leiden University Medical Centre
Royal Society of Chemistry
Vrije Universiteit Amsterdam
Spanish National Cancer Research Centre
University of Manchester
Maastricht University
Aqnowledge
University of Santiago de Compostela
Rheinische Friedrich-Wilhelms-Universität Bonn
AstraZeneca
GlaxoSmithKline
Esteve
Novartis
Merck Serono
H. Lundbeck A/S
Eli Lilly
Netherlands Bioinformatics Centre
Swiss Institute of Bioinformatics
ConnectedDiscovery
EMBL-European Bioinformatics Institute
Janssen
OpenLink
A use-case driven approach, focussed on delivery
for the real world
Main architecture, technical implementation and primary
capabilities driven by a set of prioritised research questions
Based on the main research questions define prioritised data
sources
Develop three Exemplars to demonstrate the capabilities of
the Open PHACTS System and to define interfaces and
input/output standards
Work Streams
Build: Service layer and resource integration
Drive: Development of exemplar work packages & Applications
Sustain: Community engagement and long-term sustainability
‘Consumer’
Firewall
Target
Dossier
Pharmacological
Networks
Compound
Dossier
OPS Service Layer
Assertion & Meta Data Mgmt
Transform / Translate
Integrator
Std Public
Vocabularies
Business
Rules
Supplier
Firewall
Work Stream 2: Exemplar Drug Discovery Informatics tools
Develop exemplar services to test OPS Service Layer
Target Dossier (Data Integration)
Pharmacological Network Navigator (Data Visualisation)
Compound Dossier (Data Analysis)
Work Stream 1: Open Pharmacological
Space (OPS) Service Layer
Standardised software layer to allow public
DD resource integration
Db 2, Db 3, Db 4, Corpus 1, Corpus 5
Define standards and construct OPS service layer
Develop interface (API) for data access, integration
and analysis
Develop secure access models
Existing Drug Discovery (DD) Resource
Integration
Platform
Explorer
Apps
API
Standards
Prioritised research questions
Number | Sum | Nr of 1 | Question
15 | 12 | 9 | All oxidoreductase inhibitors active <100nM in both human and mouse
18 | 14 | 8 | Given compound X, what is its predicted secondary pharmacology? What are the on- and off-target safety concerns for a compound? What is the evidence and how reliable is that evidence (journal impact factor, KOL) for findings associated with a compound?
24 | 13 | 8 | Given a target find me all actives against that target. Find/predict polypharmacology of actives. Determine ADMET profile of actives.
32 | 13 | 8 | For a given interaction profile, give me compounds similar to it.
37 | 13 | 8 | The current Factor Xa lead series is characterised by substructure X. Retrieve all bioactivity data in serine protease assays for molecules that contain substructure X.
38 | 13 | 8 | Retrieve all experimental and clinical data for a given list of compounds defined by their chemical structure (with options to match stereochemistry or not).
41 | 13 | 8 | A project is considering Protein Kinase C Alpha (PRKCA) as a target. What are all the compounds known to modulate the target directly? What are the compounds that may modulate the target directly? i.e. return all cmpds active in assays where the resolution is at least at the level of the target family (i.e. PKC) both from structured assay databases and the literature.
44 | 13 | 8 | Give me all active compounds on a given target with the relevant assay data
46 | 13 | 8 | Give me the compound(s) which hit most specifically the multiple targets in a given pathway (disease)
59 | 14 | 8 | Identify all known protein-protein interaction inhibitors
Kamal Azzaoui et al, DDT in press 2013
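Once the data are integrated, a question such as "all oxidoreductase inhibitors active <100nM in both human and mouse" reduces to a filter followed by a per-compound check across species. The sketch below shows that logic over a small in-memory list. The record layout, field names and values are hypothetical and are not taken from the Open PHACTS API or any of its data sources.

```python
from collections import defaultdict

# Hypothetical, hand-made activity records for illustration only.
ACTIVITIES = [
    {"compound": "cpd-1", "target_class": "oxidoreductase", "species": "human", "ic50_nM": 12.0},
    {"compound": "cpd-1", "target_class": "oxidoreductase", "species": "mouse", "ic50_nM": 80.0},
    {"compound": "cpd-2", "target_class": "oxidoreductase", "species": "human", "ic50_nM": 40.0},
    {"compound": "cpd-2", "target_class": "oxidoreductase", "species": "mouse", "ic50_nM": 450.0},
    {"compound": "cpd-3", "target_class": "kinase", "species": "human", "ic50_nM": 5.0},
    {"compound": "cpd-3", "target_class": "kinase", "species": "mouse", "ic50_nM": 7.0},
]


def potent_in_all_species(records, target_class, species, threshold_nM):
    """Return compounds below the threshold in every requested species."""
    hits = defaultdict(set)  # compound -> species with a sub-threshold activity
    for r in records:
        if r["target_class"] == target_class and r["ic50_nM"] < threshold_nM:
            hits[r["compound"]].add(r["species"])
    required = set(species)
    return sorted(c for c, seen in hits.items() if required <= seen)


result = potent_in_all_species(ACTIVITIES, "oxidoreductase", ["human", "mouse"], 100.0)
```

The hard part in practice is not this filter but getting target classification, species and activity values from different sources into one comparable form, which is the job of the integration layer.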
Pathways
Interactions
Proteins
Pharmacological
Activities
Genes
Transcripts
Clinical Drug
Applications
Biological
Processes
Diseases
Pathological
Processes
Indications
Drugs
Chemicals
Compounds
Open PHACTS will be built upon semantic technologies and
standards, providing an opportunity to:
• Demonstrate that semantic technologies can perform to the same degree
as existing systems
• Provide an open platform to address common drug discovery questions;
expose pharma’s use-cases and knowledge
• Create a pre-competitive infrastructure that can be sustained and
expanded into new areas; providing the platform for future collaboration
Why Semantic Technologies?
• Rapidly developing technology, powerful algorithms for integration and
querying of data
• “schema free”
• Open standards – facilitating sharing public, private, commercial
• A community of developers, leverage work going on elsewhere
User Interfaces & Applications
Linked Data API
Linked Data Cache
Domain
Specific
Services
Data
Identity
Mapping
Service
Identity
Resolution
Service
Key architecture components
Open PHACTS
Explorer
1st Gen Apps
Partner Apps
Oct. 2012
Core Platform
App
Framework
Identity
Resolution
Service
(ConceptWiki)
“Adenosine
receptor 2a”
Identifier
Management
Service
(BridgeDb+)
P12374
EC2.43.4
CS4532
Linked Data API (RDF/XML, TTL, JSON)
Semantic Workflow Engine
(LARKC)
Chemistry
Normalisation
& Q/C
ChemSpider
Data Cache
(Triple Store)
VoID
VoID
VoID
Nanopub
Public
Ontologies
Db
Domain
Specific
Services
Db
Public Content
VoID
Nanopub
Db
Db
Commercial
VoID
Nanopub
User
Annotations
Data
Import
Building Quality
High quality chemical names and synonyms. Leverage ChemSpider and
ConceptWiki curation, Q/C and mapping
ChemSpider Validation and Standardization Platform (CVSP) for flagging
chemical representation issues
Basic curation interface for editing concept terms available through
ConceptWiki
Data quality issues detected in data sources reported back to depositors for
their evaluation
Quantitative Data Challenges
STANDARD_TYPE     UNIT_COUNT
----------------  ----------
AC50              7
Activity          421
EC50              39
IC50              46
ID50              42
Ki                23
Log IC50          4
Log Ki            7
Potency           11
log IC50          0
>5000 types
Implemented using the Quantities, Dimension, Units, Types
Ontology (http://www.qudt.org/)
STANDARD_UNITS      COUNT(*)
------------------  --------
nM                  829448
ug.mL-1             41000
(blank)             38521
ug/ml               2038
ug ml-1             509
mg kg-1             295
molar ratio         178
ug                  117
%                   113
uM well-1           52
p.p.m.              51
ppm                 36
uM-1                25
nM kg-1             25
milliequivalent     22
kJ m-2              20
~ 100 units
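The unit tables above show the same unit written several ways ("ug.mL-1", "ug/ml", "ug ml-1"; "p.p.m.", "ppm"). A minimal sketch of the clean-up follows: collapse spelling variants to one canonical label, then convert to nM where that is possible. The slides state that the platform does this with the QUDT ontology; the hand-written alias table and the function names here are illustrative assumptions, not QUDT and not the project's code.

```python
import re

# Illustrative alias table: keys are unit strings with spaces and dots
# removed and lower-cased; values are the canonical label.
UNIT_ALIASES = {
    "ugml-1": "ug/mL",
    "ug/ml": "ug/mL",
    "ppm": "ppm",
    "nm": "nM",
    "um": "uM",
}

# Multipliers from molar units to nM.
TO_NM = {"nM": 1.0, "uM": 1e3}


def canonical_unit(raw: str) -> str:
    """Map a raw unit string to its canonical label, if one is known."""
    key = re.sub(r"[\s.]", "", raw).lower()
    return UNIT_ALIASES.get(key, raw.strip())


def to_nanomolar(value: float, raw_unit: str, mol_weight=None) -> float:
    """Convert a value to nM; mass concentrations need a molecular weight (g/mol)."""
    unit = canonical_unit(raw_unit)
    if unit in TO_NM:
        return value * TO_NM[unit]
    if unit == "ug/mL":
        if mol_weight is None:
            raise ValueError("molecular weight required to convert ug/mL to nM")
        # 1 ug/mL = 1e-3 g/L; divide by g/mol to get mol/L, then scale by 1e9.
        return value * 1e6 / mol_weight
    raise ValueError(f"no conversion to nM for unit {unit!r}")
```

Units such as "%" or "kJ m-2" cannot be expressed as a concentration at all, which is why the conversion raises an error for them instead of guessing.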
Chemistry within Open PHACTS
The challenges associated with handling chemistry data require the
support of a publicly accessible platform to integrate, standardise and
host the data.
ChemSpider, an online database from the Royal Society of Chemistry,
hosts the chemical compound collection underpinning Open PHACTS
and is responsible for standardising the chemical compounds and
providing both regular updates and ongoing data curation.
To serve the Open PHACTS platform, a structure validation and
standardisation platform (CVSP) has been developed to ensure
chemical structures are normalised to rules derived from the FDA
structure standardisation guidelines and modified based on input from the
EFPIA members.
The many challenges of chemistry representation…
Identities within Open PHACTS
Open PHACTS integrates information from multiple different databases, many
of which use unique identifiers. The Identity Mapping Service (IMS)
ensures these identifiers are linked and available for use interchangeably
throughout the Open PHACTS platform.
Synonyms:
Aspirin
Dispril
2-Acetoxybenzoic acid
Acetyl salicylic acid
Salicylic acid, acetyl-
DrugBank ID: APRD00264
ChEBI ID: CHEBI:15365
ChemSpider ID: 2157
FDA: 16030
IMS
To maintain vocabulary heterogeneity and provide interoperability, the
ConceptWiki is used. The ConceptWiki is an open access system that accepts
essentially unlimited numbers of synonyms, in multiple languages, and
then maps all the terms correctly back to one unique concept identifier,
alleviating vocabulary problems and identifier differences.
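The idea behind the Identity Mapping Service can be sketched as a set of equivalence groups: once two identifiers are linked, either can be used to look up the other, and links are transitive. The class below is a toy illustration using the aspirin identifiers from the slide. It is not the IMS or BridgeDb API; the class and method names are hypothetical.

```python
class IdentityMap:
    """Toy identifier-equivalence store; identifiers are (source, id) pairs."""

    def __init__(self):
        self._group_of = {}  # identifier -> shared set of equivalent identifiers

    def link(self, a, b):
        """Declare a and b equivalent and merge their groups."""
        group = self._group_of.get(a, {a}) | self._group_of.get(b, {b})
        for member in group:
            self._group_of[member] = group

    def equivalents(self, ident):
        """Return every identifier known to be equivalent to ident."""
        return self._group_of.get(ident, {ident})

    def translate(self, ident, target_source):
        """Return the equivalent identifiers issued by target_source."""
        return sorted(i for s, i in self.equivalents(ident) if s == target_source)


ims = IdentityMap()
ims.link(("DrugBank", "APRD00264"), ("ChEBI", "CHEBI:15365"))
ims.link(("ChEBI", "CHEBI:15365"), ("ChemSpider", "2157"))

chemspider_ids = ims.translate(("DrugBank", "APRD00264"), "ChemSpider")
```

Although DrugBank and ChemSpider were never linked directly, the lookup succeeds because both were linked to the same ChEBI identifier.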
Explorer
Why Provenance Matters
Using a community specification known as
“VoID” (Vocabulary of Interlinked Datasets)
Record version, author, derivations
Builds trust with users – know what you are
querying (and why it might have changed)
Provides mechanism to provide usage
statistics back to providers, help them
understand the value
Easier to track errors and ensure quality
Actively participating in community
provenance programme (W3C)
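A dataset description recording version, author and derivation might look like the sketch below, which builds a small Turtle document by hand. The vocabulary terms used (`void:Dataset`, `dcterms:title`, `dcterms:creator`, `pav:version`, `prov:wasDerivedFrom`) are real terms from VoID, Dublin Core, PAV and PROV, but the choice of terms, the `DatasetDescription` class and all `example.org` URIs are assumptions made for illustration, not the project's actual metadata.

```python
from dataclasses import dataclass, field

PREFIXES = {
    "void": "http://rdfs.org/ns/void#",
    "dcterms": "http://purl.org/dc/terms/",
    "pav": "http://purl.org/pav/",
    "prov": "http://www.w3.org/ns/prov#",
}


@dataclass
class DatasetDescription:
    """Hypothetical container for the provenance facts named on the slide."""
    uri: str
    title: str
    version: str
    creator: str
    derived_from: list = field(default_factory=list)

    def to_turtle(self) -> str:
        """Serialise the description as a single Turtle subject block."""
        lines = [f"@prefix {p}: <{ns}> ." for p, ns in PREFIXES.items()]
        lines.append("")
        body = [
            f"<{self.uri}> a void:Dataset",
            f'    dcterms:title "{self.title}"',
            f'    pav:version "{self.version}"',
            f"    dcterms:creator <{self.creator}>",
        ]
        body += [f"    prov:wasDerivedFrom <{src}>" for src in self.derived_from]
        lines.append(" ;\n".join(body) + " .")
        return "\n".join(lines)


# All URIs below are placeholders.
description = DatasetDescription(
    uri="http://example.org/datasets/chembl-rdf",
    title="ChEMBL RDF (example)",
    version="13",
    creator="http://example.org/people/curator",
    derived_from=["http://example.org/datasets/chembl-source"],
)
turtle = description.to_turtle()
```

Keeping the version and derivation next to the data is what lets a user see why a query result changed between releases.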
What does Open PHACTS do?
Currently integrated databases
Database                 Number of triples (million)
-----------------------  ---------------------------
ACD Labs / ChemSpider    161.34
ChEBI                    0.91
ChEMBL_v13               146.08
ConceptWiki              3.74
DrugBank                 0.52
Enzyme                   0.07
Gene Ontology            0.85
SwissProt                156.57
WikiPathways             0.14
TOTAL                    470.21
Open PHACTS draws together multiple sources of publicly available
pharmacological and chemical data, allowing public access to the
information via the Open PHACTS Explorer, an intuitive interface.
Licensing: 3 “public” databases
All are available as “open” RDF you can download right now. But:
DrugBank
OMIM
Comparative Toxicogenomics Database
“CUTTING THE GORDIAN KNOT”
What are the problems with licensing we had to address?
– To make the data and software generated by the project usable and reusable
– Multiplicity of unclear or non-standard licenses on original data sources
  • ‘Public’ can mean use but not redistribute, use in commercial environment
  • Legal position on use and reuse extremely unclear
  • Different issues than just linking to data
– What is the legal status of integrated collections of the above, and of derived knowledge from such a collection?
– Appropriate software license selection
– Legal clarity for EFPIA and end users
– Approaches for commercial data integration, EFPIA in-house data
AIM: to enable maximum possible dissemination and usability of the integrated data and
architecture generated by the project, with approaches that will be applicable in other
data integration projects
Data Licensing Solution
Chose John Wilbanks as consultant
A framework built around STANDARD well-understood
Creative Commons licences – and how they interoperate
Deal with the problems by:
Interoperable licences
Appropriate terms
Declare expectations to users and
data publishers
One size won’t fit all requirements
Open PHACTS and the scientific community
Associated partners (need MoU)
• Support, information
• Exchange of ideas, data, technology
• Opportunities to demo at community webinars
Development partnerships (need MoU and annexe)
• Influence on API developments
• Opportunities to demo ideas & use cases to core team
Consortium
• 28 current members
Example applications
Advanced analytics
ChemBioNavigator
Navigating at the interface of chemical and
biological data with sorting and plotting options
TargetDossier
Interconnecting Open PHACTS with multiple
target centric services. Exploring target
similarity using diverse criteria
PharmaTrek
Interactive Polypharmacology space of
experimental annotations
UTOPIA
Semantic enrichment of scientific PDFs
Predictions
GARFIELD
Prediction of target pharmacology based on the
Similar Ensemble Approach
eTOX collector
Automatic extraction of data for building
predictive toxicology models in eTOX project
ChemBioNavigator
Matthias Rarey et al
PharmaTrek
Jordi Mestres et al
Call for expressions of interest
Open PHACTS ENSO proposal
The Open PHACTS Foundation
Open PHACTS intends to submit a
proposal for IMI ENSO funding.
Open PHACTS has a successor
organisation, the Open PHACTS
Foundation.
We are currently drafting our ENSO
proposal and invite all EFPIA
companies with an interest in Open
PHACTS to contact us to discuss
opportunities for involvement.
Please register your interest with us
for further information on membership
and other opportunities to get involved
within Open PHACTS.
For more information and/or to register interest email us at
pmu@openphacts.org
Acknowledgements
Stefan Senger
Gerhard Ecker
The Open PHACTS consortium
SERVICES
Application
(Knowledge)
Fact Visualisation
e.g. Target Dossiers;
SAR Visualisation
Assertions
e.g. Gene-to-Disease;
Compound-to-Target;
Compound-to-ADR
Standards
Ontology/taxonomy;
Minimum information guide;
Dictionaries; Interchange mapping
Data
Targets; Chemistry; Pharmacology;
Literature; Patents
After Barnes et al., Nature Reviews Drug Discovery 2009, doi:10.1038/nrd2944
Nanopublications – Capturing scientific information in
the Triple Store