The AZ Chemical Dictionary: Overview and User Guide

Drugs, Names, Compounds, Structures and Mixtures.
By Chris Southan, Sep 2015 (n.b. this was adapted from a 2011 report but the generalities still apply)
Drug Definition
The Federal Food, Drug, and Cosmetic Act (FD&C Act) defines drugs, in part, as "articles
intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease". The
Wikipedia entry for “drug” expands on this and readers of this document will be aware of the
complications. However, the informatics challenges are such that, in practice, the concept
defers to records classified as “drugs” in commercial competitive intelligence (CI) sources
(e.g. Thomson Reuters, Citeline, Adis et al.). This brings in everything from rosuvastatin
strontium and atorvastatin iron to pomegranate juice and zinc acetate lozenges. The
pragmatic advantage is that it absolves those users who can pay for it from the massive
filtration, curation and collation exercise from primary sources. Nevertheless, this presents
three main challenges. Firstly, the information from pre-approval stages is publicly
declared largely by commercial organisations in the context of shareholder interests, company
profiling and competitive obfuscation. Thus, historically at least (and despite recent
declarations regarding “transparency”, “precompetitive” and “openness”), these organisations
have not been inclined to disclose their complete development portfolios at maximal
informatics resolution using standardised formats, metadata or chemical structure
representation (doing so would largely obviate the CI brokerage industry).
As they move closer to regulatory submission such as the US Investigational New Drug
Application (INDA) the level of detail around candidates generally increases as some of this
becomes mandatory and/or published in journals. However, early disclosures can still show
clear elements of selective information gaps and other forms of obfuscation such as the
persistent blinding of links between code names and structures even into phase I clinical
studies going back to 2008 (e.g. CTS-21166). Thus, CI data is inherently “fuzzy”. Secondly,
commercial sources, despite producing powerful (not to mention expensive) web applications
and insightful commentaries, show little evidence of curatorial rigour or standardisation. In
part this is because they consider their primary sources as sacrosanct (i.e. not modifiable).
Unfortunately, this means they collate and propagate the fuzz rather than resolving it. For CI
source integration this is compounded by the emergence of a second layer of fuzz because of
differences in their capture, curation, and data schema for the primary information. The third
challenge comes from the inherent complexities of chemical representation that we are well
aware of (some of which are referred to below) but that, unfortunately, add yet a third
layer of fuzz.
Drugs by the numbers
Given the scale and importance of regulatory approval it seems paradoxical that there is no
single source of mappings between unique and “correct” electronic chemical structure
representations of approved small-molecule drugs and their names. Notwithstanding, a
number of FDA databases capture most of the information (but differently) and the
Wikipedians are progressing what might eventually default to an “official” set (see the
Rosuvastatin entry). There is also no exact consensus on what this number is, but estimates
are between 1,200 and 1,500. The nearest approximation to a definitive public structure set
(but not updated) is the FDA Maximum (Recommended) Daily Dose Database somewhat
oddly instantiated as PubChem BioAssay ID 1195 with 1,216 structures.
This discordance between nominal collections of approved drugs has been highlighted in this
AZ paper (Southan et al. 2009) and the need for curation in more recent publications (Huang
et al. 2011, Williams and Ekins 2011), but we can return to this theme when considering
structure mappings. The next obvious category, drugs currently in clinical trials, is also not
easy to resolve but the figure of 1,423 from this release of TDD seems reasonable. The next
category up the scale has generated a notable quote “The path from lead to clinical drug
candidate represents the most idiosyncratic segment of drug discovery and development”
(Hefti, 2008). Nevertheless, both commercial and public sources provide estimates of 9,000
to 12,000 compounds as “development candidates” globally in any one year (the open
Citeline Pharma Annual R&D Review 2015 gives a useful breakdown of current development
statistics). GVKBIO take a broader window with 17,167 structures in their Clinical
Candidates Database subset of GOSTAR. However, after passing the threshold of public
declaration many of these will default to “no further development reported” in CI sources. So
how many compounds have ever passed this threshold? PharmaProjects reports a historical
total of ~35,000 records from 1980 to 2007 and Thomson Reuters Pharma 33,000 drug
monographs. These are in the same range as the ~45,000 compound total that ChEMBL
predicted for their erstwhile CandiStore project.
Table 1. Approximate small-molecule drug and proto-drug numbers
Category                              Approximate number
Historical development entry          ~35,000
Approached regulatory entry (INNs)    ~8,000
Between clinical phases               ~15,000
In active trials                      ~1,500
FDA approved                          ~1,400
INNs issued per year                  ~150
Discontinued (post approval)          ~50
New approvals per year                ~15
Drugs by the names
Given this triage from ~35,000 historical candidates down to ~1,500 currently approved drugs
we can consider the progression of the four key name types. The first is the systematic
IUPAC name by which the structure is likely to be exemplified and defined in the first patent
claims (although there are cases where these rest only on the image or Markush enumeration).
The second is the ubiquitous practice of assigning a company code name for the public
declaration of advanced testing, lead status, entry into development or specification in posters
and journal articles. The third is the application for and approval of an INN also called the
generic name. This is a single name selected by an expert committee to have worldwide
acceptability for each active substance that is to be marketed as a pharmaceutical. INNs are
systematically stemmed according to the therapeutic class (e.g. statins) and usually harmonised
with relevant national authorities as, for example, a USAN or BAN, so differences between
these and INNs are rare. There were 164 USANs approved in 2010. The fourth is the
assignment of a trade name for the prescription product which may be language specific but
should be clearly differentiated from the INN.
We can make a rough historical estimate that ~ 5 million data-supported patent-specified
exemplars have been triaged to ~ 200,000 leads for selection of the 40,000 development
candidates. Of these ~ 8000 have progressed to INN approval, one of the conditions of which
is that they have entered a clinical trial. The timings are even more approximate, but we can
estimate ~ 1 to 5 years from patent to an externally declared code name, with anything up to
10 years to the INN. Trade names tend to follow soon after the INNs (more background on
generic and trade names is given on both the WHO and USAN websites). The Sept 2015 CID
counts by PubChem synonym search are 8274 INNs, 5671 USAN with an OR union of
10952.
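As a rough check on these counts, inclusion-exclusion gives the overlap implied by the two synonym searches and their OR union. The sketch below assumes that all three figures count distinct CIDs; the variable names are illustrative only and are not part of any PubChem interface.

```python
# Sept 2015 CID counts quoted in the text (PubChem synonym search).
inn_cids = 8274      # CIDs tagged with an INN
usan_cids = 5671     # CIDs tagged with a USAN
union_cids = 10952   # CIDs tagged with an INN OR a USAN

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap = inn_cids + usan_cids - union_cids
inn_only = inn_cids - overlap
usan_only = usan_cids - overlap

print(f"CIDs with both INN and USAN: {overlap}")    # 2993
print(f"CIDs with INN only:          {inn_only}")   # 5281
print(f"CIDs with USAN only:         {usan_only}")  # 2678
```

The three partitions sum back to the quoted union of 10952, so the figures are at least internally consistent.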
However, unless I have missed it, there is no public estimate of the number of “blinded” code
names in circulation that are publicly declared by companies (or academic groups) as
being at some level of testing or development. These will generally have structures specified
per se in patents (unless they are well hidden in Markush nests) but these cannot be
back-linked before the assignees (or licensees) go public with the necessary code
name-to-structure linkage, typically in a poster or a journal paper but occasionally via
portfolio announcements. There are also many instances of “no-name” structures, typically from
smaller Markush enumerations or SAR listings in journal papers, where implicit structures
are given informative designations like “8a” or “compound 65”. In general GVKBIO and
ChEMBL do a good job of capturing these (as the core value of their databases) but their
journal coverage is restricted and these name types are of such low specificity that they have
to be linked to a document identifier. Even without layering on the complications associated
with mapping these names to chemical structures, the constitutive challenges associated with
the name space alone (“synonym spaghetti”) are well recognised:
• Code names vary by syntactical mutation (presumably from the form used by the
  original source but even these may not be consistent), for example BW- 348U87, BW
  348U, BW 348U87 or BW-348U.
• Companies may publicly use multiple codes for the same structure, including legacy
  codes from mergers, for example GSK/SB, Wyeth/Pfizer and AH/AZ/AZD.
• Licensing deals may lead to changes in prefix or completely different numbers.
• INNs can be assigned to prodrugs and active drug metabolites.
• USANs may be assigned to salt forms and parent structures.
• Common INN usage (by PubChem submitters, MeSH in PubChem or PubMed) may
  default to parent even if USAN is assigned to a salt.
• Generics companies obtain INN approvals for salt forms that surface in data
  sources but are unlikely to get FDA approval and/or be the subject of patent disputes.
• Fixed-dose combinations do not get an INN or USAN but they can get a BAN (via the
  prefix of "co-") and may get a trade name.
• Combination descriptions have to include name types, sometimes all three, for each of
  the components.
• CI sources may surface salts or combinations that are effectively patent-only and
  unlikely to acquire any development data.
• Trade names are language- and country-specific (Drugs.com lists 24K medication
  descriptions for the US and 40K for non-US).
• Trade names can be suffixed (e.g. -CR, -XR or -IV) for different formulations. These
  are arguably not the “same” drug but have the same chemical structures.
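The syntactic code-name variants in the first item above (spacing, hyphenation, case) can be collapsed with a simple normalisation key. What follows is a minimal sketch, not the dictionary's actual method: the function names are hypothetical, and it assumes that whitespace, dash characters and letter case are never significant in a code name.

```python
import re
from collections import defaultdict


def normalise_code_name(name: str) -> str:
    """Return a comparison key: strip whitespace and dashes, then uppercase."""
    # Assumption: ASCII hyphens, Unicode dashes (U+2010..U+2015) and spaces
    # carry no meaning in a company code name.
    return re.sub(r"[\s\-\u2010-\u2015]+", "", name).upper()


def group_variants(names):
    """Group raw code names that share the same normalised key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalise_code_name(name)].append(name)
    return dict(groups)


variants = ["BW- 348U87", "BW 348U", "BW 348U87", "BW-348U"]
groups = group_variants(variants)
for key, members in groups.items():
    print(key, members)
# BW348U87 ['BW- 348U87', 'BW 348U87']
# BW348U ['BW 348U', 'BW-348U']
```

Note that this only merges punctuation and spacing variants. The truncated form (BW 348U versus BW 348U87) still yields a separate key, and deciding whether those refer to the same compound needs curation rather than string handling.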