Uploaded by Vaibhav Topiwala

ADMET AND CHEMICAL DATABASE 4.0

ADMET AND CHEMICAL DATABASE
ADMET DATABASE
Introduction
 Current pharmaceutical research and development (R&D) is a high-risk investment that often
meets with unexpected, even disastrous, failures at different stages of drug discovery.
 One main reason for R&D failures is deficiencies in efficacy and safety, which are related largely
to absorption, distribution, metabolism and excretion (ADME) properties and various toxicities (T).
 Therefore, rapid ADMET evaluation is urgently needed to minimize failures in the drug discovery
process.
 We believe that this web platform will facilitate the drug discovery process by enabling
early drug-likeness evaluation, rapid ADMET virtual screening or filtering, and prioritization of
chemical structures.
LAZAR: a modular predictive toxicology
framework
 introduction
 lazar (lazy structure–activity relationships) is a modular framework for predictive toxicology.
Similar to the read across procedure in toxicological risk assessment, lazar creates local
QSAR (quantitative structure–activity relationship) models for each compound to be predicted.
 Model developers can choose between a large variety of algorithms for
descriptor calculation and selection, chemical similarity indices, and model building.
This paper presents a high level description of the lazar framework and discusses the
performance of example classification and regression models.
 The main objective of lazar is to provide a generic tool for the prediction of complex toxicological
endpoints, such as carcinogenicity, long-term toxicity, and reproductive toxicity.
 As these endpoints involve a huge number of complex (and probably unknown) biological mechanisms,
lazar does not intend to model all involved biological processes (as in molecular modeling or various systems
biology approaches), but follows a data driven approach.
 lazar uses data mining algorithms to derive predictions for untested compounds from experimental training
data.
 Any dataset with chemical structures and biological activities can be used as training data. This makes lazar
a generic prediction algorithm for any biological endpoint with sufficient experimental data.
 At present, lazar does not consider chemical, biological, or toxicological expert knowledge, but derives
computational models from statistical criteria.
 Such an approach has the distinct advantage that incomplete, wrong, or incorrectly formulated
background knowledge cannot affect predictions, because they are based on objective, traceable, and
reproducible statistical criteria.
 Although lazar does not use explicit background knowledge for predictions, it was
created with an intent to support mechanistic based risk assessment.
 For this purpose, rationales for predictions are presented together with a hypothesis about
possible biological mechanisms that is based on statistically significant properties of the
underlying data.
 As both predictions and mechanisms are statistically derived (not causally or
mechanistically), the toxicological expert is a key part of the process. The expert should review
and interpret the output in order to identify, e.g., training data errors, chance correlations,
systematic problems, or findings that contradict current knowledge, and discard
results if necessary.
 In contrast to most machine learning and QSAR methods, which create a global
prediction model from all training data, lazar uses local QSAR models.
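The local-model idea can be illustrated with a minimal sketch: for each query compound, only training compounds above a similarity threshold are kept, and they cast a similarity-weighted vote. The feature-set fingerprints, the Tanimoto index, the 0.3 threshold and the function names are illustrative assumptions, not lazar's actual algorithms or defaults.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto similarity of two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def local_predict(query: set, training: list, min_sim: float = 0.3):
    """Predict a binary label for `query` from its neighbors only.

    training: list of (feature_set, label) pairs with label in {0, 1}.
    Returns (label, neighbors); label is None when no training compound
    is similar enough, i.e. the query has no usable local model.
    """
    neighbors = [(tanimoto(query, feats), label) for feats, label in training]
    neighbors = [(sim, label) for sim, label in neighbors if sim >= min_sim]
    if not neighbors:
        return None, []
    # Similarity-weighted vote; a tie is resolved as inactive (0).
    active = sum(sim for sim, label in neighbors if label == 1)
    inactive = sum(sim for sim, label in neighbors if label == 0)
    return (1 if active > inactive else 0), neighbors
```

Returning the neighbors alongside the label keeps the rationale for each prediction visible, which is what lets an expert review it.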
limitations
 Scientific limitations:
• Limited capability of some quantitative structure–activity relationship (QSAR) algorithms (e.g.,
linear regression) to handle complex relationships
• Missing, improper, ambiguous, or poorly reproducible definitions of applicability domains
• Improper application of validation procedures, ignorance of applicability domains
• Poor validation of applicability domain concepts
• Poor consideration of biological mechanisms
• Irreproducible results, because proprietary algorithms are not disclosed
 Technical limitations:
• Hard to use and unintuitive software
• Standalone solutions with poor integration of external databases, ontologies etc.
 Social limitations:
• Insufficient translation of statistics/data mining/QSAR concepts into toxicological
terminology
• Poor understanding of the significance of validation results
• Poor and/or too technical documentation of algorithms, which is hard to understand for
non-computer scientists
admetSAR: A Comprehensive Source and Free Tool for Assessment of Chemical
ADMET Properties
 Absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties play key roles in the
discovery/development of drugs, pesticides, food additives, consumer products, and industrial chemicals.

This information is especially useful when conducting environmental and human hazard assessment. The
most critical rate-limiting step in the chemical safety assessment workflow is the availability of high-quality
data. This paper describes an ADMET structure−activity relationship database, abbreviated as admetSAR.

It is an open source, text and structure searchable, and continually updated database that collects,
curates, and manages available ADMET-associated properties data from the published literature.
 In admetSAR, over 210 000 ADMET annotated data points for more than 96 000 unique compounds with 45
kinds of ADMET-associated properties, proteins, species, or organisms have been carefully curated from a
large and diverse body of literature.

The database provides a user-friendly interface to query a specific chemical profile, using either CAS registry
number, common name, or structure similarity. In addition, the database includes 22 qualitative
classification and 5 quantitative regression models with high predictive accuracy, allowing estimation of
ecological/mammalian ADMET properties for novel chemicals.
Methods

Database Source.

Data Collection: The core ADMET-associated data in admetSAR are extracted from the full text of peer-reviewed scientific
publications via weekly PubMed and Google Scholar searches from 2002 to 2011.

A literature search was performed with general terms: “computational (in silico) ADME”, “computational toxicology”, etc.
ADMET-associated keywords, such as water solubility, human intestinal absorption, oral bioavailability, blood−brain barrier
penetration, transporter, plasma protein binding, volume of distribution, CYP450, toxicity, etc., were used to refine the
search results.

Review articles and publications with fewer than 10 data points were removed. In order to control the quality of data points,
only publications in a variety of high-quality journals, such as The Journal of Chemical Information and Modeling, Molecular
Informatics, Chemical Research in Toxicology, Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters, etc.,
were selected for further manual checking.

Data Extraction and Preparation: As shown in the figure, from each eligible publication, detailed raw data on the compounds
tested and any ADMET-associated protein, organism, or species information for these assays are abstracted and manually
checked by our experts. Some data points and chemical structures were downloaded from the supporting information of
publications.

Fuzzy, uncertain, or obviously incorrect data points were removed. Finally, molecules with compound name, ADMET end
points, detailed test materials, structural information, and data source (full journal name) were collected.
Methods
 CAS registry number (CASRN) information for each environmental chemical or drug was
extracted from the US-EPA ACToR (http://actor.epa.gov/actor/faces/ACToRHome.jsp) and
DrugBank [32] databases.
 The structure of each compound was drawn in full in SMILES format and converted to
canonical SMILES using OpenBabel v2.3.1 [33]. The IUPAC names of all molecules were
generated using Marvin v5.8.0 (http://www.chemaxon.com/). The DrugBank ID numbers were
validated by mapping against DrugBank database v3.0 [32].
 Calculation of Physicochemical Properties: Calculations of small-molecule physicochemical
properties are important for computationally filtering molecules by their “druglikeness”, “leadlikeness” and
toxicity potential.
 In admetSAR, five classic physicochemical properties, namely the number of hydrogen bond
acceptors and donors, Log P, topological polar surface area (TopoPSA), and molecular weight
(MW), were calculated for all compounds using OpenBabel v2.3.1 [33].
Methods
 Development of Computational Models:
 Prediction of ADMET-associated properties of new chemicals is a big challenge for the free ADMET
research community. In admetSAR, 22 qualitative classification models were implemented,
which were developed using a support vector machine classification algorithm and an in-house
substructure pattern recognition method [20]. In addition, five quantitative regression models
were built and implemented, which were developed using a support vector machine
regression algorithm.
 The robustness of all models was validated based on 5-fold cross validation, and the
predictability of several models was validated using available external validation sets. Only
models with high predictive accuracy were selected and implemented in admetSAR.
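The 5-fold cross-validation loop mentioned above can be sketched as follows. This is a generic illustration, not the admetSAR validation code: the `fit` and `predict` callables are hypothetical placeholders for whatever model is being validated (the source used support vector machines), and accuracy is used as the only metric for simplicity.

```python
import random


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_validate(X, y, fit, predict, k: int = 5, seed: int = 0) -> float:
    """Mean accuracy over k folds (assumes len(X) >= k).

    fit(X_train, y_train) -> model; predict(model, x) -> label.
    """
    accuracies = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

Every compound is used for testing exactly once, so the averaged score reflects performance on data the model did not see during fitting.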
 In the process of model development, all compounds were represented using MACCS keys
implemented with OpenBabel v2.3.1 [33]. Detailed descriptions of the model building procedure,
modeling algorithms, and model validation criteria are given in the Supporting Information.
methods
 Database Design and Implementation:
 The data extracted from publications and manually checked were managed through MySQL
v5.1.61 (http://www.mysql.com/). The admetSAR system (Figure 1) was built using Django v1.4.0
on Apache v2.2.20 with mod_wsgi v3.3, installed on Ubuntu Server v11.10. admetSAR provides
user-friendly web interfaces, built with cascading style sheets (CSS) and Python scripts, to generate
a chemical profile by either text or structural similarity search and to run computational predictions.
 Similarity Search: JSDraw v1.3.1 (http://www.scilligence.com/web/download.aspx?prod=JSDraw)
was implemented as the built-in molecule editor. Structural similarity is assessed by the
Tanimoto coefficient using the MACCS keys implemented with OpenBabel v2.3.1 [33].
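As an illustration of the Tanimoto coefficient on key-based fingerprints, the sketch below stores each fingerprint as a Python integer whose set bits mark the keys present, and computes T = c / (a + b − c), where a and b are the numbers of bits set in each fingerprint and c the number set in both. The toy 4-bit fingerprints, the library dictionary and the 0.5 threshold are assumptions for demonstration; real MACCS keys would come from a cheminformatics toolkit.

```python
def popcount(x: int) -> int:
    """Number of bits set in a non-negative integer."""
    return bin(x).count("1")


def tanimoto_bits(fp_a: int, fp_b: int) -> float:
    """Tanimoto coefficient of two bit-vector fingerprints."""
    common = popcount(fp_a & fp_b)
    total = popcount(fp_a) + popcount(fp_b) - common
    return common / total if total else 0.0


def similarity_search(query_fp: int, library: dict, threshold: float = 0.5):
    """Return (name, score) pairs at or above threshold, best first."""
    scored = ((name, tanimoto_bits(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy example with hypothetical 4-bit fingerprints.
library = {"A": 0b1110, "B": 0b0001, "C": 0b0110}
hits = similarity_search(0b0110, library)
```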
 Visualization Features: The two-dimensional (2D) chemical structures were displayed by images,
which were generated using Marvin v5.8.0 (http://www.chemaxon.com/).
SwissADME: a free web tool to evaluate pharmacokinetics,
druglikeness and medicinal chemistry friendliness of small molecules
 To be effective as a drug, a potent molecule must reach its target in the body in sufficient
concentration, and stay there in a bioactive form long enough for the expected biologic
events to occur. Drug development involves assessment of absorption, distribution,
metabolism and excretion (ADME) increasingly earlier in the discovery process, at a stage
when considered compounds are numerous but access to the physical samples is limited.
 Here, the SwissADME web tool is presented. It gives free access to a pool of fast yet robust predictive
models for physicochemical properties, pharmacokinetics, drug-likeness and medicinal
chemistry friendliness, among which are in-house proficient methods such as the BOILED-Egg,
iLOGP and Bioavailability Radar.
 SwissADME's strong points include, non-exhaustively: different input methods, computation for
multiple molecules, and the possibility to display, save and share results per individual
molecule or through global intuitive and interactive graphs.
 Knowing whether a compound is a substrate or non-substrate of the permeability
glycoprotein (P-gp, suggested to be the most important member among ATP-binding cassette
transporters or ABC-transporters) is key to appraising active efflux through biological
membranes, for instance from the gastrointestinal wall to the lumen or from the brain [40].
One major role of P-gp is to protect the central nervous system (CNS) from xenobiotics [41].
Importantly as well, P-gp is overexpressed in some tumour cells and leads to multidrug-resistant cancers.
 SwissADME therefore enables estimation of whether a chemical is a substrate of P-gp or an inhibitor of
the most important CYP isoenzymes.
 Moreover, SwissADME is an integral part of the SwissDrugDesign program, an ambitious
initiative driven by the Molecular Modeling Group of the SIB Swiss Institute of Bioinformatics
that aims at providing a collection of free Web-based tools covering many aspects of
CADD.
SwissADME submission page.
 The actual input is a list of SMILES on the right-hand side of the submission page. It contains
one molecule per line, with an optional name separated by a space, and can be edited as
standard text. Molecules can be typed or pasted directly in SMILES format, or inserted through
the molecular sketcher. The latter enables importing from databases, opening a local file or
drawing a 2D chemical structure to be transferred to the list by clicking on the double-arrow button.
 When the list of molecules is ready to be submitted, the user can start the calculations by
clicking on the “Run” button.
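The input format described above (one molecule per line, SMILES first, then an optional name after a space) can be parsed with a few lines of code. This is a sketch of the format only, not SwissADME's own parser; the function name and the choice of `None` for a missing name are assumptions.

```python
def parse_smiles_list(text: str):
    """Parse 'SMILES [name]' lines into (smiles, name) tuples.

    Blank lines are skipped; name is None when the line has no name.
    Everything after the first space is treated as the name.
    """
    molecules = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        smiles, _, name = line.partition(" ")
        molecules.append((smiles, name.strip() or None))
    return molecules


example = "CCO ethanol\n\nc1ccccc1\nCC(=O)O acetic acid\n"
parsed = parse_smiles_list(example)
```

SMILES strings contain no spaces, which is why splitting at the first space is enough to separate the structure from a multi-word name.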
In this figure, computed parameter values are grouped in the different sections of the
one-panel-per-molecule output (such as Physicochemical Properties, Lipophilicity,
Pharmacokinetics, Drug-likeness and Medicinal Chemistry). Moreover, the Bioavailability
Radar is displayed for a rapid appraisal of drug-likeness.
The Bioavailability Radar enables a first glance at the drug-likeness of a molecule. The
pink area represents the optimal range for each property (lipophilicity: XLOGP3 between
−0.7 and +5.0; size: MW between 150 and 500 g/mol; polarity: TPSA between 20 and 130 Å²;
solubility: log S not higher than 6; saturation: fraction of carbons in the sp3 hybridization not
less than 0.25; and flexibility: no more than 9 rotatable bonds). In this example, the
compound is predicted not to be orally bioavailable, because it is too flexible and too polar.
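The six ranges listed in the caption translate directly into a range check. The sketch below is an illustration with hypothetical property names, not SwissADME code. One point is an interpretation: the solubility bound "log S not higher than 6" is read here as −log S ≤ 6 (that is, log S ≥ −6), which is an assumption about the sign convention.

```python
def radar_violations(props: dict) -> list:
    """Return the names of radar axes that fall outside the optimal range.

    props keys (hypothetical names): xlogp3, mw (g/mol), tpsa (square
    angstroms), log_s, fraction_csp3, rotatable_bonds.
    """
    checks = {
        "lipophilicity": -0.7 <= props["xlogp3"] <= 5.0,
        "size": 150 <= props["mw"] <= 500,
        "polarity": 20 <= props["tpsa"] <= 130,
        # Assumed reading of "log S not higher than 6": -log S <= 6.
        "solubility": -props["log_s"] <= 6,
        "saturation": props["fraction_csp3"] >= 0.25,
        "flexibility": props["rotatable_bonds"] <= 9,
    }
    return [axis for axis, ok in checks.items() if not ok]


# Made-up values for a compound that is too polar and too flexible.
compound = {"xlogp3": 2.0, "mw": 300.0, "tpsa": 150.0,
            "log_s": -3.0, "fraction_csp3": 0.4, "rotatable_bonds": 12}
violations = radar_violations(compound)
```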
The Bioavailability Radar
The BOILED-Egg [17] allows for intuitive evaluation of passive gastrointestinal absorption (HIA) and brain penetration (BBB) as a function of the
position of the molecules in the WLOGP-versus-TPSA referential. The white region is for high probability of passive absorption by the
gastrointestinal tract, and the yellow region (yolk) is for high probability of brain penetration. Yolk and white areas are not mutually exclusive.
In addition, the points are coloured in blue if predicted as actively effluxed by P-gp (PGP+) and in red if predicted as non-substrate of P-gp
(PGP−). For an interactive analysis, the user can leave the mouse over a dot to show the structure of the molecule and click on the dot to
scroll to the corresponding output panel. In this example, Lapatinib is predicted as not absorbed and not brain penetrant (outside the Egg),
Omeprazole is predicted as well absorbed but not accessing the brain (in the white) and PGP+ (blue dot), Sunitinib is predicted as passively
crossing the BBB (in the yolk) but pumped out of the brain (blue dot), and Palonosetron is predicted as brain-penetrant (in the yolk) and
not subject to active efflux (red dot). One molecule is predicted as not absorbed and not BBB permeant because it lies outside the range of the
plot (streptomycin, with a TPSA of 331.43 Å² and a WLOGP of −7.74).
FAF-Drugs4
 FAF-Drugs4 is a free toolkit to assist in silico screening experiments as well as
experimental screening, as it helps to select compounds for in silico / in vitro / in
cellulo assays.
 A related goal is the computational prediction of some ADME-Tox
properties (Absorption, Distribution, Metabolism, Excretion and Toxicity).
 Additionally, without running the whole filtering process, FAF-Drugs has basic
capabilities such as removing salts and counterions, removing duplicates or computing
the ADME-Tox descriptors without filtering the database.
o Both services (FAF-Drugs4 and FAF-QED) share a Data Curation procedure (which can also be
applied using the Bank Cleaner online tool) prior to the data calculation and filtering steps.
o That step removes large compounds, inorganics, mixtures, counterions, empty structures, and
211 salts known to be employed in medicinal chemistry.
o Then, after a neutralization step, the standardization procedure is performed
using the ChemAxon Standardizer Academic package,
which involves a ring aromatization protocol and in-house SMARTS depicting
patterns for key chemical functions (amine, nitro, carboxylic acid,
phosphonamide, amide, sulfonamide, phosphonate and sulfonate).
o Duplicates are detected using internal CANSMILES definitions (Pybel) for each
normalized compound.
o During this step, stereochemistry is considered in order to keep stereoisomers.
o Note that users can employ the Bank Formatter service to produce a
corrected input SDF file (single or multiple) in the standard format needed for
FAF-Drugs.
o One can also use the Filter Editor service to design a tuned physchem filter
to apply in FAF-Drugs4.
o Then, for both services, the input file is divided into sub-libraries, each of which is
managed by its own subprocess.
o If the user runs FAF-Drugs4, physchem descriptors are computed using OpenBabel
and ChemAxon.
o In order to properly compute protonation-dependent descriptors (HBA, HBD,
charges...), a benchmarked in-house procedure employing the
ChemAxon [22] cxcalc library at physiological pH is applied dynamically during
calculations (also applied for FAF-QED).
o Toxicophore and interferent SMARTS definitions were refined for 137
structural alerts and 515 PAINS.
o Depending on the filtering ranges, Accepted, Intermediate or Rejected files are
written, together with all their associated csv result files.
o If the user runs FAF-QED, physchem descriptors and alerts are computed as in FAF-Drugs4, and a csv
file giving descriptor values and QED estimations is written.
o When the entire process is done, all results and graphical reports are presented
on the Mobyle RPBS Portal and are available for download.
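The division of the input file into sub-libraries, one per subprocess, can be pictured with the sketch below. It only shows how a list of records might be cut into nearly equal contiguous chunks; the function name and the chunking strategy are assumptions, since the source does not describe how FAF-Drugs4 actually splits its input.

```python
def split_into_sublibraries(records: list, n_parts: int) -> list:
    """Split records into at most n_parts contiguous, nearly equal chunks."""
    if not records:
        return []
    n_parts = max(1, min(n_parts, len(records)))  # never create empty chunks
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional record.
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


chunks = split_into_sublibraries(list(range(10)), 3)
```

Each chunk could then be handed to a separate worker, for example with `multiprocessing.Pool.map`, and the per-chunk results concatenated in order.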
FAFDrugs4 detailed results
• July-2006: Development and deployment of FAF-Drugs on the RPBS platform.
• Sept-2008: Release of the FAF-Drugs2 standalone version.
• 2009-2010: Several improvements dedicated to a web-server version.
• Sept-2014: Major improvements regarding rules and toxicophore detection.
• July-2015: Upgrade of FAF-Drugs2 to the FAF-Drugs3 web server; revision of the algorithm
for the input of toxicophore detection.
• April-2016: Major revision of SMARTS definitions for better sensitivity of
toxicophore detection.
• August-2017: Complete revision of the FAF-Drugs4 general algorithm in order to allow
multiprocessing calculations and speed up computations.
Chemical Database
Introduction
 Chemical databases/resources are the backbone of computer-aided drug discovery,
whether it is chemoinformatics or bioinformatics. These databases give information that
can be used to build knowledge-based models for discovering and designing drug
molecules. Here, we provide a list of major databases that are freely available and
widely used.
Chemical databases:
Database | Description
PubChem | PubChem is a database of chemical molecules that maintains three types of information, namely substances, compounds and BioAssays.
ChEMBL | This database provides comprehensive information about 1 million bioactive compounds (small drug-like molecules) with 8200 drug targets.
NCI (National Cancer Institute) | The NCI database has more than 275,000 small-molecule structures, a very useful resource for researchers working in the field of cancer/AIDS.
ChemDB | A database of five million chemicals that includes predicted or experimentally determined physicochemical properties, such as 3D structure, melting temperature and solubility.
HIT (Herb Ingredients Targets) | HIT is a comprehensive database for protein targets for FDA-approved drugs as well as the promising precursors. It currently contains about 1,301 known protein targets (221 proteins are described as direct targets).
Chemical databases:
Database | Description
SuperNatural | A freely available database of approximately 50,000 natural compounds.
SuperDrug | This database contains approximately 2500 3D structures of active ingredients of essential marketed drugs.
PDBbind | A collection of binding affinities for protein-ligand complexes with known three-dimensional structures. It contains 5671 protein-ligand complexes.
PDBeChem | It provides comprehensive information on ligands, small molecules and monomers. Presently it consists of 15502 ligands.
HMDB (Human Metabolome Database) | A database containing detailed information about small-molecule metabolites found in the human body.
ZINC 15 − Ligand Discovery for Everyone
• Many questions about the biological activity and availability of small molecules remain inaccessible to investigators who could most benefit from
their answers.
• To narrow the gap between chemoinformatics and biology, the authors developed a suite of ligand annotation, purchasability, target, and biology
association tools, incorporated into ZINC and meant for investigators who are not computer specialists.
• The new version contains over 120 million purchasable “drug-like” compounds (effectively all organic molecules that are for sale), a quarter of
which are available for immediate delivery.
• ZINC connects purchasable compounds to high-value ones such as metabolites, drugs, natural products, and annotated compounds from the
literature.
• Compounds may be accessed by the genes for which they are annotated as well as the major and minor target classes to which those genes
belong.
• It offers new analysis tools that are easy for non-specialists yet with few limitations for experts.
• ZINC retains its original 3D roots: all molecules are available in biologically relevant, ready-to-dock formats.
• ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully
programmable for chemoinformaticians and computational biologists.
• ZINC15 is a new research tool for ligand discovery that connects biological activities by gene product, drugs, and natural products with commercial
availability.
• ZINC15 draws upon third-party databases such as ChEMBL, HMDB, DrugBank, and https://ClinicalTrials.gov to annotate high-information compounds
that are active in, or created by, nature.
ZINC 15 − Ligand Discovery for Everyone
 ZINC (ZINC Is Not Commercial) is a public access database and tool set, initially
developed to enable ready access to compounds for virtual screening, that has become
ever more widely used for virtual screening, ligand discovery, pharmacophore screens,
benchmarking, and force field development.
 Increasingly, however, investigators have tried to interrogate it for questions that it was not
designed to answer. Simple questions, such as how many endogenous human
metabolites there are, which of these are purchasable, or what natural product or drug
a compound most closely resembles, were surprisingly difficult to answer.
Example questions
 With a target in mind, investigators often wanted a focused library biased toward ligands
for that target.
 With new compounds discovered, they often wanted to find the most similar ligands
already known for that target.
 To optimize that ligand, they might look to the availability of starting products for synthesis,
asking, for instance, how many boronic acids that contain an indole ring may be
purchased in preparative quantities and how soon they will arrive.
Introduction
 The specific goals of ZINC15 are:
 To integrate and curate biological activity, chemical property, and commercial availability
data for small molecules from public sources, supplemented by additional calculated
properties, into a chemistry-aware relational database.
 To build a general query language and report generator that is Web URL compatible.
 To design a graphical user interface that requires no programming to interrogate the database
using this query language.
 To demonstrate and document the use of this tool to answer previously difficult questions.
 Figure: molecule detail page. A. Main section: (1) ZINC ID, name if known, subset membership,
(2) properties and 2D depiction, (3) 3D representations if available, (4) purchasing information,
(5) annotated catalog membership, (6) breadcrumbs indicating current location, (7) selection tool
for refinement of query, and (8) download tool. B. Interesting bioactive and biogenic analogs section of
molecule detail page: (1) similar biogenic compounds, (2) similar bioactive compounds,
(3) compounds with a shared scaffold, (4) similar aggregators, and (5) similar purchasable
compounds (currently too slow to calculate). A "find more" button in each case finds more
of the same.
CMNPD: a comprehensive marine natural products database towards
facilitating drug discovery from the ocean
 Marine organisms are expected to be an important source of inspiration for drug discovery
after terrestrial plants and microorganisms. Despite the remarkable progress in the field of
marine natural products (MNPs) chemistry, there are only a few open access databases
dedicated to MNPs research.
 CMNPD currently contains more than 31 000 chemical entities with various
physicochemical and pharmacokinetic properties, standardized biological activity data,
systematic taxonomy and geographical distribution of source organisms, and detailed
literature citations. It is an integrated platform for structure dereplication (assessment of
novelty) of (marine) natural products, discovery of lead compounds, data mining of
structure-activity relationships and investigation of chemical ecology.
Aim
 CMNPD aims to provide an open access knowledge base for not only professional MNPs
researchers but also the broader scientific community to facilitate the research and
development of marine drugs.
 CMNPD contains 31 561 distinct chemical entities of MNPs from over 13 000 sampling
organisms. These organisms are distributed in 7 kingdoms, 38 phyla, 93 classes, 289 orders,
682 families, 1480 genera and 3354 species. There are 15 774 active compounds mapped
to 2652 targets with 72 343 bioactivities. These targets include 1122 single proteins, 923 cell
lines, 459 organisms and several other types. The document library includes 128 488
scientific articles and patents, of which ∼11 000 articles describe the discovery of new
compounds and structure revisions.
Schematic overview of the CMNPD content.
Phylogenetic tree of marine organisms. Core nodes include domains, kingdoms, phyla, classes, orders and
families. Internal nodes include their corresponding superior and subordinate taxonomic units, which
are based on the NCBI Taxonomy database. The bar chart shows the number of compounds in each family.
This figure was produced using the Interactive Tree Of Life (iTOL) web server.
Knowledge graph of the chemical entity.
Statistics of various data collections in CMNPD.
o Quick search is available in the middle of the homepage and on the toolbar in
the upper right corner of each page. The free-text search allows users to enter
any CMNPD identifier, compound name, organism name or target name without
specifying a search entity. The search bar provides suitable suggestions as the term
is typed.
In addition, a powerful advanced search capability is provided on the query builder page. This allows users
to specify any number of query conditions. Available query conditions, which can be combined with the
Boolean operators ‘AND’, ‘OR’ or ‘NOT’, include structure (drawing structure, structural classification),
compound representations (e.g. compound name, molecular formula), physicochemical properties (e.g.
molecular weight, ALogP), ADMET prediction (e.g. blood brain barrier penetration level, human intestinal
absorption level), resources (organism name, collection site), bioactivities (e.g. target name, assay
type) and bibliography (e.g. authors, DOI). Multiple query conditions can easily be grouped together by
dragging and dropping them. The inner Boolean operations of the grouped conditions are
executed first.
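Grouped Boolean conditions of this kind can be modelled as a nested expression that is evaluated recursively, so that inner groups are resolved before the operator that contains them. The sketch below is an illustration only: the tuple-based query format, the field names and the small operator table are hypothetical and do not reflect CMNPD's query builder internals.

```python
import operator

# Comparison operators available to leaf conditions (illustrative subset).
OPS = {"==": operator.eq, "<=": operator.le, ">=": operator.ge}


def matches(record: dict, query: tuple) -> bool:
    """Evaluate a nested query against one record.

    A query is either a leaf (field, op, value) or a group
    ("AND" | "OR", sub1, sub2, ...) or ("NOT", sub).
    """
    head = query[0]
    if head == "AND":
        return all(matches(record, sub) for sub in query[1:])
    if head == "OR":
        return any(matches(record, sub) for sub in query[1:])
    if head == "NOT":
        return not matches(record, query[1])
    field, op, value = query
    return OPS[op](record[field], value)


records = [
    {"name": "A", "molecular_weight": 350.0, "organism": "sponge"},
    {"name": "B", "molecular_weight": 620.0, "organism": "sponge"},
    {"name": "C", "molecular_weight": 410.0, "organism": "alga"},
]
query = ("AND",
         ("molecular_weight", "<=", 500.0),
         ("OR", ("organism", "==", "sponge"),
                ("NOT", ("organism", "==", "alga"))))
selected = [r["name"] for r in records if matches(r, query)]
```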
 Yet only a small number of databases are dedicated to MNPs research. The
commercial databases MarinLit (http://pubs.rsc.org/marinlit) and Dictionary of Marine
Natural Products (http://dmnp.chemnetbase.com) are currently the most exhaustive and
complete MNPs databases, but subscription fees may prevent broader access for
academic research. The recently established free academic database MarinChem3D
(http://mc3d.qnlm.ac) provides 3D structures of MNPs, but its biological activity data is
limited. Some open access databases such as the Seaweed Metabolite Database
(SWMD) (6) and the Dragon Exploration System on Marine Sponge Compounds
Interactions (DESMSCI) (7) contain only natural products produced by certain types of
marine organisms.
IMPPAT: A curated database of Indian Medicinal Plants, Phytochemistry And
Therapeutics
 Phytochemicals of medicinal plants encompass a diverse chemical space for drug discovery.
 India is rich with a flora of indigenous medicinal plants that have been used for centuries in traditional Indian
medicine to treat human maladies.
 A comprehensive online database on the phytochemistry of Indian medicinal plants will enable
computational approaches towards natural product based drug discovery.
 IMPPAT is the largest database on phytochemicals of Indian medicinal plants to date, and this resource is the
culmination of the authors' efforts to digitize the wealth of information contained within traditional Indian medicine.
 IMPPAT provides an integrated platform to apply cheminformatic approaches to accelerate natural
product based drug discovery.
 IMPPAT is a manually curated database of 1742 Indian Medicinal Plants, 9596 Phytochemicals, And 1124
Therapeutic uses spanning 27074 plant-phytochemical associations and 11514 plant-therapeutic
associations.
 Notably, the curation effort led to a non-redundant in silico library of 9596 phytochemicals with standard
chemical identifiers and structure information.
 Using cheminformatic approaches, the authors computed the physicochemical, ADMET
(absorption, distribution, metabolism, excretion, toxicity) and drug-likeliness properties of
the IMPPAT phytochemicals.
 In addition, the IMPPAT database has linked Indian medicinal plants to 974 openly accessible traditional Indian
medicinal formulations.

Importantly, our curation eforts have led to a non-redundant in silico chemical library of 9596
phytochemicals with two-dimensional (2D) and three-dimensional (3D) chemical structures.
 For the 9596 phytochemicals in this database, they have computed physicochemical properties and
predicted Absorption, distribution, metabolism, excretion and toxicity (ADMET) properties using
cheminformatic tools .

Then employed cheminformatic approaches to evaluate the drug-likeliness of the phytochemicals in our in
silico chemical library using multiple scoring schemes such as Lipinski’s rule of fve (RO5), Oral PhysChem
Score (Trafc Lights), GlaxoSmithKline’s (GSK’s) 4/400, Pfzer’s 3/75, Veber rule and Egan rule.
 Based on multiple scoring schemes, they identified a subset of 960 potentially druggable phytochemicals
of Indian medicinal plants within the chemical library of 9596 phytochemicals.
 They also provide predicted interactions between phytochemicals in the database and human target
proteins from the STITCH database.
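The rule-based scoring schemes named above can be sketched in a few lines. This is a minimal illustration, not the IMPPAT implementation: the thresholds are the commonly cited ones for each rule, boundary handling (<= versus <) is an assumption, the Oral PhysChem Score (Traffic Lights) is omitted, and the `Descriptors` class, function names and the "pass every rule" criterion are all hypothetical. How IMPPAT combines the schemes to reach its druggable subset is not stated here.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed physicochemical descriptors for one compound (hypothetical container)."""
    mol_weight: float       # molecular weight in g/mol
    logp: float             # octanol-water partition coefficient
    tpsa: float             # topological polar surface area in square angstroms
    h_donors: int           # number of hydrogen bond donors
    h_acceptors: int        # number of hydrogen bond acceptors
    rotatable_bonds: int    # number of rotatable bonds


def lipinski_ro5(d: Descriptors) -> bool:
    # Rule of five; this sketch requires zero violations (an assumption).
    return (d.mol_weight <= 500 and d.logp <= 5
            and d.h_donors <= 5 and d.h_acceptors <= 10)


def veber(d: Descriptors) -> bool:
    return d.rotatable_bonds <= 10 and d.tpsa <= 140


def gsk_4_400(d: Descriptors) -> bool:
    return d.logp <= 4 and d.mol_weight <= 400


def pfizer_3_75(d: Descriptors) -> bool:
    # The 3/75 rule flags compounds with logP > 3 and TPSA < 75; passing means "not flagged".
    return not (d.logp > 3 and d.tpsa < 75)


def egan(d: Descriptors) -> bool:
    return d.logp <= 5.88 and d.tpsa <= 131.6


RULES = {
    "Lipinski RO5": lipinski_ro5,
    "Veber": veber,
    "GSK 4/400": gsk_4_400,
    "Pfizer 3/75": pfizer_3_75,
    "Egan": egan,
}


def evaluate(d: Descriptors) -> dict:
    """Return a pass/fail flag per rule."""
    return {name: rule(d) for name, rule in RULES.items()}


def passes_all(d: Descriptors) -> bool:
    """Assumed combination criterion: the compound must pass every rule."""
    return all(evaluate(d).values())


if __name__ == "__main__":
    # Illustrative descriptor values only; not taken from IMPPAT.
    example = Descriptors(mol_weight=164.2, logp=2.1, tpsa=29.5,
                          h_donors=1, h_acceptors=2, rotatable_bonds=3)
    for rule_name, passed in evaluate(example).items():
        print(f"{rule_name}: {'pass' if passed else 'fail'}")
```

Keeping each rule as a separate function makes it easy to report which scheme a compound fails, rather than only a single yes/no verdict.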
Snapshot of the result of a standard query for phytochemicals of an
Indian medicinal plant. In this example, we show the plant-phytochemical associations for Ocimum tenuiflorum, commonly
known as Tulsi, from the IMPPAT database.
Snapshot of the dedicated page containing detailed information on 2D and 3D chemical structure,
physicochemical properties, druggability scores, predicted ADMET properties and predicted target
human proteins for a chosen phytochemical. From the dedicated page for each phytochemical, users can
download the 2D and 3D structure of the phytochemical as an SDF, MOL, MOL2, PDB or PDBQT file.
Snapshot of the result of a standard query for therapeutic uses of an Indian medicinal plant. In this
example, we show the therapeutic uses of Ocimum tenuiflorum from the IMPPAT database.
Snapshot of the advanced search options, which enable users to filter phytochemicals based
on their physicochemical properties, druggability scores or chemical similarity with a query
compound.
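A chemical similarity search of the kind described above can be sketched as follows. The slides do not say which similarity measure IMPPAT uses; the Tanimoto coefficient on fingerprint bit sets is assumed here because it is a widely used choice. The fingerprints, compound names and threshold are hypothetical placeholders.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def most_similar(query: set, library: dict, threshold: float = 0.5) -> list:
    """Return (name, score) pairs with score >= threshold, best match first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical fingerprints; real ones would come from a cheminformatics toolkit.
    library = {
        "compound_A": {1, 2, 3, 4},
        "compound_B": {1, 2},
        "compound_C": {7, 8, 9},
    }
    for name, score in most_similar({1, 2, 3}, library):
        print(f"{name}: {score:.2f}")
```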
 (a) Pie chart showing the distribution of the 1742 Indian medicinal plants in the IMPPAT database
across different taxonomic families. (b) Pie chart showing the distribution of the 9596 IMPPAT
phytochemicals across different chemical super-classes obtained from ClassyFire (50). (c)
Histogram of the number of Indian medicinal plants which produce a given
phytochemical in our database. (d) Histogram of the number of therapeutic uses per
Indian medicinal plant in our database. (e–j) Histograms of the molecular weight (in g/mol),
logP, TPSA (in Å²), number of hydrogen bond donors, number of hydrogen bond
acceptors and number of rotatable bonds of the phytochemicals in our database.
 Comparison of the physicochemical properties of IMPPAT phytochemicals with other small
molecule collections. (a) Box plot showing the distribution of the stereochemical complexity of the
small molecule collections CC, DC’, NP, IMPPAT phytochemicals and TCM-Mesh
phytochemicals. The median, mean and standard deviation (SD) of the stereochemical
complexity for each small molecule collection are shown below the box plot. (b) Box plot showing
the distribution of the Fsp3 for the small molecule collections CC, DC’, NP, IMPPAT
phytochemicals and TCM-Mesh phytochemicals. The median, mean and SD of the Fsp3 for each
small molecule collection are shown below the box plot. Note that the lower end of the box shows the
first quartile, the upper end of the box shows the third quartile, the brown line shows the median and
the green line shows the mean of the distribution of stereochemical complexity or Fsp3 in the two
box plots. (c) Median, mean and SD of six physicochemical properties, namely molecular
weight, logP, TPSA, number of hydrogen bond donors, number of hydrogen bond acceptors
and number of rotatable bonds, for the small molecule collections CC, DC’, NP, IMPPAT
phytochemicals and TCM-Mesh phytochemicals.
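The summary statistics reported under the box plots can be reproduced with a short sketch. Fsp3 is taken in its usual sense as the fraction of carbon atoms that are sp3 hybridized; stereochemical complexity is assumed here to be the fraction of carbon atoms that are stereogenic, which the slides do not define. The function names are hypothetical, the atom counts are expected to come from a cheminformatics toolkit, and the use of the sample standard deviation is an assumption.

```python
import statistics


def fsp3(sp3_carbons: int, total_carbons: int) -> float:
    """Fraction of sp3-hybridized carbons among all carbons."""
    if total_carbons == 0:
        return 0.0  # convention chosen here for carbon-free molecules
    return sp3_carbons / total_carbons


def stereochemical_complexity(stereo_carbons: int, total_carbons: int) -> float:
    """Assumed definition: fraction of carbons that are stereogenic."""
    if total_carbons == 0:
        return 0.0
    return stereo_carbons / total_carbons


def summarize(values: list) -> dict:
    """Median, mean and sample standard deviation of one collection (needs >= 2 values)."""
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
    }


if __name__ == "__main__":
    # Illustrative (sp3 carbons, total carbons) counts; not real collection data.
    counts = [(3, 10), (6, 12), (9, 10)]
    print(summarize([fsp3(sp3, total) for sp3, total in counts]))
```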
ATDB: a uni-database platform for animal
toxins
 Venomous animals possess an arsenal of toxins for predation and defense. These toxins have
great diversity in function and structure as well as evolution and therefore are of value in both
basic and applied research.
 Recently, toxinomics research using cDNA library sequencing and proteomics profiling has
revealed a large number of new toxins. Although several groups have attempted to
manage these data, most of their resources are restricted to certain taxonomic groups and/or lack
effective systems for data query and access.
 In addition, the description of the function and the classification of toxins is rather inconsistent,
which creates a barrier to exchanging and comparing the data. Here, we report the ATDB
database and website, which contains more than 3235 animal toxins from UniProtKB/Swiss-Prot
and TrEMBL and related toxin databases as well as published literature.
 A new ontology (Toxin Ontology, TO) was constructed to standardize the toxin annotations; it
includes 745 distinct terms within four term spaces. Furthermore, more than 8423 TO terms have
been manually assigned to 2132 toxins by trained biologists.
• Schematic overview of the pipeline of data integration in ATDB. All sequence data were
downloaded by December 2006. Signal peptide sequences were extracted by an in-house Perl
script. Taking these sequences as probes, we searched the NCBI-RefSeq
database by BLASTP and filtered by the key word ‘venom gland’ in tissue specificity
annotations. Toxin ontology construction and annotation were mainly done manually by
trained biologists.
Steps
1) (1) Users can focus on and expand the branches of the tree by clicking leaves (the Serpentes
suborder in this figure); detailed information about the taxonomic group is shown
in the table on the right.
2) (2) To get all toxins related to the term, click the ‘getSequence’ button to
display the toxin list.
3) (3, 4 and 5) Users can select toxins manually and filter them by keywords.
4) (6) The selected sequences can be downloaded as an Excel file or a FASTA file by
clicking the ‘Excel download’ and ‘Fasta download’ buttons, respectively.
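The keyword filtering and FASTA download steps above can be sketched as follows. This is an illustration of the workflow only, not ATDB code: the `Toxin` record, the accession numbers, the names and the sequences are hypothetical placeholders, and matching the keyword against the toxin name is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Toxin:
    """Minimal toxin record (hypothetical; real entries carry many more fields)."""
    accession: str
    name: str
    sequence: str


def filter_by_keyword(toxins: list, keyword: str) -> list:
    """Keep toxins whose name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [t for t in toxins if needle in t.name.lower()]


def to_fasta(toxins: list, width: int = 60) -> str:
    """Format records as FASTA text, wrapping sequence lines at `width` characters."""
    lines = []
    for t in toxins:
        lines.append(f">{t.accession} {t.name}")
        for start in range(0, len(t.sequence), width):
            lines.append(t.sequence[start:start + width])
    return "".join(line + "\n" for line in lines)


if __name__ == "__main__":
    # Placeholder records; accessions and sequences are made up for illustration.
    toxins = [
        Toxin("TOX0001", "Hypothetical neurotoxin", "MKTLLLTLVVVTIVCLDLGYT"),
        Toxin("TOX0002", "Hypothetical phospholipase", "MRTLWIVAVLLVGVEG"),
    ]
    print(to_fasta(filter_by_keyword(toxins, "neurotoxin"), width=10), end="")
```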
Information
 Category:
• This term space describes the issue of toxin classification and includes two top branches:
functional categories and species categories. The first is based on molecular functions
across species. The second follows species classification at the top level and then follows
the characterization of structure or function, as accepted by the related communities.
 Bio-activity:
• This term space covers most of the mechanisms by which the toxins take effect, such as
cytolysis, membrane interaction, channel transport regulation and vesicle transport regulation.
Information
 Targets:
 This term space has three branches to describe the targets of a toxin. The Organism branch
lists the species or tissues affected by a toxin. Assigning the ‘Mammal’ term (TX : 0000075)
to a toxin means that the toxin can act on mammals.
 The Cell branch describes the types of cells and organelles affected by a toxin. The Molecule
branch contains a detailed classification of the molecules that interact with toxins, such as
enzymes, GPCRs (G protein-coupled receptors) or ion channels.
 Symptom:
 This term space has two branches. The first one (individual symptom) describes the
symptoms that appear in an individual animal. These effects are divided into two parts:
local/regional effects and systemic effects.
 The other branch covers physiological model symptoms, which record the symptoms induced by a
toxin in certain physiological preparations (such as a nerve–muscle preparation).
Information
• More than 8423 TO terms have been manually assigned to 2132 toxins based on annotations of
the Tox-Prot, GO annotations and related publications. Each term assignment was independently
reviewed by at least two biologists to avoid human error. Additionally, we defined five TO
evidence codes to describe how these annotations were assigned and what type of evidence
supports each annotation.
Future updates
 There is a major release of ATDB every 3 months with incremental updates as appropriate.
Current and future work includes populating the database with more data entries.
 The TO system will be further examined and optimized to accommodate the
development of toxinomics research. Additionally, it is planned to integrate the multiple
sequence alignment tool ClustalW (14) into ATDB for fast sequence comparison.
 An updated version of the Animal Toxin Database (ATDB 2.0) is now available and provides a
new bioinformatics resource for analyzing toxin-channel (T-C) interactions. Data on more
than 54,000 T-C interactions, including 9193 high-confidence interactions, have been
extracted, formatted and mapped to toxin and ion channel databases.
Download