Drugs, Names, Compounds, Structures and Mixtures
By Chris Southan, Sep 2015 (n.b. this was adapted from a 2011 report but the generalities still apply)

Drug Definition

The Federal Food, Drug, and Cosmetic Act (FD&C Act) defines drugs, in part, as "articles intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease". The Wikipedia entry for “drug” expands on this and readers of this document will be aware of the complications. However, the informatics challenges are such that, in practice, the concept defers to records classified as “drugs” in commercial competitive intelligence (CI) sources (e.g. Thomson Reuters, Citeline, Adis et al.). This brings in everything from rosuvastatin strontium and atorvastatin iron to pomegranate juice and zinc acetate lozenges. The pragmatic advantage is that it absolves those users who can pay for it from the massive filtration, curation and collation exercise from primary sources. Nevertheless, this presents three main challenges.

Firstly, the information from pre-approval stages is publicly declared largely by commercial organisations in the context of shareholder interests, company profiling and competitive obfuscation. Thus, historically at least (and despite recent declarations regarding “transparency”, “precompetitive” and “openness”), these organisations have not been inclined to disclose their complete development portfolio at maximal informatics resolution using standardised formats, metadata or chemical structure representation (and this would largely obviate the CI brokerage industry if they did). As they move closer to regulatory submission, such as the US Investigational New Drug (IND) application, the level of detail around candidates generally increases as some of this becomes mandatory and/or published in journals.
However, early disclosures can still show clear elements of selective information gaps and other forms of obfuscation, such as the persistent blinding of links between code names and structures even into phase I clinical studies, going back to 2008 (e.g. CTS-21166). Thus, CI data is inherently “fuzzy”.

Secondly, commercial sources, despite producing powerful (not to mention expensive) web applications and insightful commentaries, show little evidence of curatorial rigour or standardisation. In part this is because they consider their primary sources as sacrosanct (i.e. not modifiable). Unfortunately, this means they collate and propagate the fuzz rather than resolving it. For CI source integration this is compounded by the emergence of a second layer of fuzz because of differences in their capture, curation, and data schema for the primary information.

The third challenge comes from the inherent complexities of chemical representation that we are well aware of (some of which are referred to below) but which, unfortunately, add yet a third layer of fuzz.

Drugs by the numbers

Given the scale and importance of regulatory approval it seems paradoxical that there is no single source of mappings between unique and “correct” electronic chemical structure representations of approved small-molecule drugs and their names. Notwithstanding, a number of FDA databases capture most of the information (but differently) and the Wikipedians are progressing what might eventually default to an “official” set (see the Rosuvastatin entry). There is also no exact consensus on the number of approved small-molecule drugs, but estimates are between 1,200 and 1,500. The nearest approximation to a definitive public structure set (but not updated) is the FDA Maximum (Recommended) Daily Dose Database, somewhat oddly instantiated as PubChem BioAssay ID 1195 with 1,216 structures.
This discordance between nominal collections of approved drugs has been highlighted in this AZ paper (Southan et al. 2009) and the need for curation in more recent publications (Huang et al. 2011, Williams and Ekins 2011), but we can return to this theme when considering structure mappings.

The next obvious category, drugs currently in clinical trials, is also not easy to resolve but the figure of 1,423 from this release of TDD seems reasonable. The next category up the scale has generated a notable quote: “The path from lead to clinical drug candidate represents the most idiosyncratic segment of drug discovery and development” (Hefti, 2008). Nevertheless, both commercial and public sources provide estimates of 9,000 to 12,000 compounds as “development candidates” globally in any one year (the open Citeline Pharma Annual R&D Review 2015 gives a useful breakdown of current development statistics). GVKBIO take a broader window with 17,167 structures in their Clinical Candidates Database subset of GOSTAR. However, after passing the threshold of public declaration many of these will default to “no further development reported” in CI sources.

So how many compounds have ever passed this threshold? PharmaProjects reports a historical total of ~35,000 records from 1980 to 2007 and Thomson Reuters Pharma 33,000 drug monographs. These are in the same range as the ~45,000 compound total that ChEMBL predicted for their erstwhile CandiStore project.

Table 1. Approximate small-molecule drug and proto-drug numbers

Category                                  Approximate number
Historical development entry              ~35,000
Approached regulatory entry (INNs)        ~8,000
Between clinical phases                   ~15,000
In active trials                          ~1,500
FDA approved                              ~1,400
INNs issued per year                      ~150
Discontinued (post approval)              ~50
New approvals per year                    ~15

Drugs by the names

Given this triage from ~35,000 historical candidates down to ~1,500 currently approved drugs we can consider the progression of the four key name types.
The first is the systematic IUPAC name by which the structure is likely to be exemplified and defined in the first patent claims (although there are cases where these rest only on the image or Markush enumeration). The second is the company code name, ubiquitously assigned for the public declaration of advanced testing, lead status, entry into development or specification in posters and journal articles. The third is the INN (International Nonproprietary Name), also called the generic name, which has to be applied for and approved. This is a single name selected by an expert committee to have worldwide acceptability for each active substance that is to be marketed as a pharmaceutical. INNs are systematically stemmed according to the therapeutic class (e.g. statins) and usually harmonised with relevant national authorities as, for example, a USAN or BAN, so differences between these and INNs are rare. There were 164 USANs approved in 2010. The fourth is the trade name assigned to the prescription product, which may be language-specific but should be clearly differentiated from the INN.

We can make a rough historical estimate that ~5 million data-supported patent-specified exemplars have been triaged to ~200,000 leads for selection of the 40,000 development candidates. Of these, ~8,000 have progressed to INN approval, one of the conditions of which is that they have entered a clinical trial. The timings are even more approximate but we can estimate ~1 to 5 years from patent to an externally declared code name, with anything up to 10 years to the INN, while trade names tend to follow soon after the INNs (more background on generic and trade names is given on both the WHO and USAN websites). The Sept 2015 CID counts by PubChem synonym search are 8,274 INNs and 5,671 USANs, with an OR union of 10,952.
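The overlap implied by these three PubChem counts can be worked out by inclusion-exclusion. The following is only a minimal sketch: it assumes the three figures come from the same search snapshot and each counts distinct CIDs, and the variable names are illustrative.

```python
# Counts quoted in the text (PubChem synonym search, Sept 2015).
inn_cids = 8274      # CIDs returned for the INN synonym search
usan_cids = 5671     # CIDs returned for the USAN synonym search
union_cids = 10952   # CIDs returned for the OR union of the two

# Inclusion-exclusion: |INN and USAN| = |INN| + |USAN| - |INN or USAN|
both = inn_cids + usan_cids - union_cids
inn_only = inn_cids - both
usan_only = usan_cids - both

print(f"both: {both}, INN only: {inn_only}, USAN only: {usan_only}")
```

Under those assumptions, only a minority of the CIDs are returned by both searches. This reflects how synonyms are attached to records in PubChem and does not by itself say anything about how often INN and USAN names differ.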
However, unless I have missed it, there is no public estimate of the number of “blinded” code names in circulation that are publicly declared by the company (or academic groups) as being at some level of testing or development. These will generally have structures per se specified in patents (unless they are well hidden in Markush nests) but these cannot be backlinked before the assignees (or licensees) go public with the necessary code name-to-structure linkage, typically in a poster or a journal paper but occasionally via portfolio announcements. There are also many instances of “no-name” structures, typically from smaller Markush enumerations or SAR listings in journal papers, where implicit structures are given informative designations like “8a” or “compound 65”. In general GVKBIO and ChEMBL do a good job of capturing these (as the core value of their databases) but their journal coverage is restricted, and these name types are of such low specificity that they have to be linked to a document identifier.

Even without layering on the complications associated with mapping these to chemical structures, the constitutive challenges associated with the name space alone (“synonym spaghetti”) are well recognised:

- Code names vary by syntactical mutation (presumably from the form used by the original source but even these may not be consistent), for example BW- 348U87, BW 348U, BW 348U87 or BW-348U.
- Companies may publicly use multiple codes for the same structure, including legacy codes from mergers, for example GSK/SB, Wyeth/Pfizer and AH/AZ/AZD.
- Licensing deals may lead to changes in prefix or completely different numbers.
- INNs can be assigned to prodrugs and active drug metabolites.
- USANs may be assigned to salt forms and parent structures.
- Common INN usage (by PubChem submitters, MeSH in PubChem or PubMed) may default to parent even if USAN is assigned to a salt.
- Generics companies obtain INN approvals for salt forms that surface in data sources but are unlikely to get FDA approval and/or be the subject of patent disputes.
- Fixed-dose combinations do not get an INN or USAN but they can get a BAN (via the prefix of "co-") and may get a trade name.
- Combination descriptions have to include name types, sometimes all three, for each of the components.
- CI sources may surface salts or combinations that are effectively patent-only and unlikely to acquire any development data.
- Trade names are language and country-specific (Drugs.com lists 24K medication descriptions for the US and 40K for non-US).
- Trade names can be suffixed (e.g. -CR, -XR or -IV) for different formulations. These are arguably not the “same” drug but have the same chemical structures.
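The syntactical mutation of code names noted in the list above can be partly tamed by a simple normalisation key. The following is a minimal sketch, not a description of how any CI source or database actually does this: the function names are hypothetical and the rule (drop whitespace and dashes, then upper-case) is an assumption chosen for illustration.

```python
import re
from collections import defaultdict


def normalise_code_name(name: str) -> str:
    """Return a comparison key with whitespace and dashes removed, upper-cased."""
    # Strip spaces, ASCII hyphens and the Unicode dash range U+2010 to U+2015.
    return re.sub(r"[\s\-\u2010-\u2015]+", "", name).upper()


def group_variants(names):
    """Group raw code names by their normalised key."""
    groups = defaultdict(list)
    for raw in names:
        groups[normalise_code_name(raw)].append(raw)
    return dict(groups)


# The four variants quoted in the text.
variants = ["BW- 348U87", "BW 348U", "BW 348U87", "BW-348U"]
groups = group_variants(variants)

for key, members in groups.items():
    print(key, members)
```

With this rule the four variants collapse to two keys, because the truncated form (348U) and the full form (348U87) remain distinct. Deciding whether those two refer to the same compound is a curation question that string normalisation alone cannot answer.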