ADMET AND CHEMICAL DATABASE

ADMET DATABASE

Introduction
Current pharmaceutical research and development (R&D) is a high-risk investment that often faces unexpected, even disastrous, failures at different stages of drug discovery. One main reason for R&D failures is deficiencies in efficacy and safety, which are largely related to absorption, distribution, metabolism and excretion (ADME) properties and various toxicities (T). Therefore, rapid ADMET evaluation is urgently needed to minimize failures in the drug discovery process. We believe that this web platform will facilitate the drug discovery process by enabling early drug-likeness evaluation, rapid ADMET virtual screening or filtering, and prioritization of chemical structures.

LAZAR: a modular predictive toxicology framework

Introduction
lazar (lazy structure–activity relationships) is a modular framework for predictive toxicology. Similar to the read-across procedure in toxicological risk assessment, lazar creates local QSAR (quantitative structure–activity relationship) models for each compound to be predicted. Model developers can choose from a large variety of algorithms for descriptor calculation and selection, chemical similarity indices, and model building. This paper presents a high-level description of the lazar framework and discusses the performance of example classification and regression models. The main objective of lazar is to provide a generic tool for the prediction of complex toxicological endpoints, such as carcinogenicity, long-term toxicity, and reproductive toxicity. As these endpoints involve a huge number of complex (and probably unknown) biological mechanisms, lazar does not intend to model all the biological processes involved (as in molecular modeling or various systems biology approaches), but follows a data-driven approach. lazar uses data mining algorithms to derive predictions for untested compounds from experimental training data.
Any dataset with chemical structures and biological activities can be used as training data. This makes lazar a generic prediction algorithm for any biological endpoint with sufficient experimental data. At present, lazar does not consider chemical, biological, or toxicological expert knowledge, but derives computational models from statistical criteria. Such an approach has the distinct advantage that incomplete, wrong, or incorrectly formulated background knowledge cannot affect predictions, because they are based on objective, traceable, and reproducible statistical criteria. Although lazar does not use explicit background knowledge for predictions, it was created with the intent to support mechanism-based risk assessment. For this purpose, rationales for predictions are presented together with a hypothesis about possible biological mechanisms that is based on statistically significant properties of the underlying data. As both predictions and mechanisms are derived statistically (not causally or mechanistically), the toxicological expert is a key part of the process. The expert should review and interpret the output in order to identify, e.g., training data errors, chance correlations, systematic problems, or findings that contradict current knowledge, and discard results if necessary. In contrast to most machine learning and QSAR methods, which create a global prediction model from all training data, lazar uses local QSAR models.

Limitations
SCIENTIFIC LIMITATIONS:
• Limited capability of some quantitative structure–activity relationship (QSAR) algorithms (e.g., linear regression) to handle complex relationships
• Missing, improper, ambiguous, or poorly reproducible definitions of applicability domains
• Improper application of validation procedures, ignorance of applicability domains
• Poor validation of applicability domain concepts
• Poor consideration of biological mechanisms
• Irreproducible results, because proprietary algorithms are not disclosed
TECHNICAL LIMITATIONS:
• Hard-to-use and unintuitive software
• Standalone solutions with poor integration of external databases, ontologies, etc.
SOCIAL LIMITATIONS:
• Insufficient translation of statistics/data mining/QSAR concepts into toxicological terminology
• Poor understanding of the significance of validation results
• Poor and/or overly technical documentation of algorithms, which is hard for non-computer scientists to understand

admetSAR: A Comprehensive Source and Free Tool for Assessment of Chemical ADMET Properties
Absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties play key roles in the discovery and development of drugs, pesticides, food additives, consumer products, and industrial chemicals. This information is especially useful when conducting environmental and human hazard assessments. The most critical rate-limiting step in the chemical safety assessment workflow is the availability of high-quality data. This paper describes an ADMET structure-activity relationship database, abbreviated as admetSAR. It is an open source, text- and structure-searchable, continually updated database that collects, curates, and manages available ADMET-associated property data from the published literature. In admetSAR, over 210 000 ADMET-annotated data points for more than 96 000 unique compounds, covering 45 kinds of ADMET-associated properties, proteins, species, or organisms, have been carefully curated from a large and diverse body of literature.
The database provides a user-friendly interface to query a specific chemical profile using a CAS registry number, a common name, or structure similarity. In addition, the database includes 22 qualitative classification models and 5 quantitative regression models with high predictive accuracy, which allow ecological/mammalian ADMET properties to be estimated for novel chemicals.

Methods
Database Source.
Data Collection: The core ADMET-associated data in admetSAR are extracted from the full text of peer-reviewed scientific publications via weekly PubMed and Google Scholar searches covering 2002 to 2011. A literature search was performed with general terms: “computational (in silico) ADME”, “computational toxicology”, etc. ADMET-associated keywords, such as water solubility, human intestinal absorption, oral bioavailability, blood-brain barrier penetration, transporter, plasma protein binding, volume of distribution, CYP450, toxicity, etc., were used to refine the search results. Review articles and publications with fewer than 10 data points were removed. In order to control the quality of data points, only publications in a variety of high-quality journals, such as The Journal of Chemical Information and Modeling, Molecular Informatics, Chemical Research in Toxicology, Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters, etc., were selected for further manual checking.
Data Extraction and Preparation: As shown in the corresponding figure, from each eligible publication, detailed raw data on the compounds tested and any ADMET-associated protein, organism, or species information for these assays are abstracted and manually checked by experts. Some data points and chemical structures were downloaded from the supporting information of publications. Fuzzy, uncertain, or obviously incorrect data points were removed. Finally, molecules with compound name, ADMET end points, detailed test materials, structural information, and data source (full journal name) were collected.
CAS registry number (CASRN) information for each environmental chemical or drug was extracted from the US-EPA ACToR (http://actor.epa.gov/actor/faces/ACToRHome.jsp) and DrugBank databases [32]. The structure of each compound was drawn in full in SMILES format and converted to canonical SMILES using OpenBabel v2.3.1 [33]. The IUPAC names of all molecules were generated using Marvin v5.8.0 (http://www.chemaxon.com/). The DrugBank ID numbers were validated by mapping against DrugBank database v3.0 [32].
Calculation of Physicochemical Properties: Calculating the physicochemical properties of small molecules is important for computationally filtering on “druglikeness”, “leadlikeness” and toxicity potential. In admetSAR, five classic physicochemical properties, namely the number of hydrogen bond acceptors, the number of hydrogen bond donors, Log P, topological polar surface area (TopoPSA), and molecular weight (MW), were calculated for all compounds using OpenBabel v2.3.1 [33].
Development of Computational Models: Predicting the ADMET-associated properties of new chemicals is a major challenge for the free ADMET research community. In admetSAR, 22 qualitative classification models were implemented, which were developed using a support vector machine classification algorithm and an in-house substructure pattern recognition method [20]. In addition, five quantitative regression models were built and implemented, which were developed using a support vector machine regression algorithm. The robustness of all models was validated by 5-fold cross validation, and the predictability of several models was validated using available external validation sets. Only models with high predictive accuracy were selected and implemented in admetSAR. During model development, all compounds were represented using MACCS keys as implemented in OpenBabel v2.3.1 [33]. Detailed descriptions of the model building procedure, modeling algorithms, and model validation criteria are given in the Supporting Information.
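The 5-fold cross validation mentioned above can be illustrated with a generic splitting routine. This is a minimal sketch of k-fold partitioning in general, not the admetSAR implementation (which is described only in its Supporting Information); the function name k_fold_indices, the round-robin fold assignment and the fixed random seed are all assumptions made for illustration.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        # Every fold except the i-th one is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 12 hypothetical compounds, identified only by their position in a dataset.
    for fold_no, (train, test) in enumerate(k_fold_indices(12, k=5), start=1):
        print(f"fold {fold_no}: {len(train)} training / {len(test)} held-out compounds")
```

Each compound is held out exactly once, so a model trained on the remaining folds is always evaluated on compounds it has not seen.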
Database Design and Implementation: The data extracted from publications and manually checked were managed through MySQL v5.1.61 (http://www.mysql.com/). The admetSAR system (Figure 1) was built using Django v1.4.0 on Apache v2.2.20 with mod_wsgi v3.3, installed on Ubuntu Server v11.10. admetSAR provides user-friendly web interfaces, built with cascading style sheets (CSS) and Python scripts, to generate a chemical profile by either text or structural similarity search, and to run computational predictions.
Similarity Search: JSDraw v1.3.1 (http://www.scilligence.com/web/download.aspx?prod=JSDraw) was implemented as the built-in molecule editor. Structural similarity is assessed by the Tanimoto coefficient using the MACCS keys as implemented in OpenBabel v2.3.1 [33].
Visualization Features: The two-dimensional (2D) chemical structures are displayed as images, which were generated using Marvin v5.8.0 (http://www.chemaxon.com/).

SwissADME: a free web tool to evaluate pharmacokinetics, druglikeness and medicinal chemistry friendliness of small molecules
To be effective as a drug, a potent molecule must reach its target in the body in sufficient concentration, and stay there in a bioactive form long enough for the expected biologic events to occur. Drug development involves assessment of absorption, distribution, metabolism and excretion (ADME) increasingly early in the discovery process, at a stage when the compounds under consideration are numerous but access to physical samples is limited. The SwissADME web tool gives free access to a pool of fast yet robust predictive models for physicochemical properties, pharmacokinetics, drug-likeness and medicinal chemistry friendliness, among which are proficient in-house methods such as the BOILED-Egg, iLOGP and the Bioavailability Radar.
The strong points of SwissADME include, non-exhaustively: different input methods, computation for multiple molecules, and the possibility to display, save and share results per individual molecule or through global, intuitive and interactive graphs. Knowing whether a compound is a substrate or non-substrate of the permeability glycoprotein (P-gp, suggested to be the most important member of the ATP-binding cassette transporters, or ABC-transporters) is key to appraising active efflux through biological membranes, for instance from the gastrointestinal wall to the lumen or from the brain [40]. One major role of P-gp is to protect the central nervous system (CNS) from xenobiotics [41]. Importantly as well, P-gp is overexpressed in some tumour cells and leads to multidrug-resistant cancers. SwissADME therefore estimates whether a chemical is a substrate of P-gp or an inhibitor of the most important CYP isoenzymes. Moreover, SwissADME is an integral part of the SwissDrugDesign program, an ambitious initiative driven by the Molecular Modeling Group of the SIB Swiss Institute of Bioinformatics that aims at providing a collection of free Web-based tools covering many aspects of CADD.

SwissADME submission page.
Molecules can be directly pasted or typed in SMILES format, or inserted through the molecular sketcher. The latter enables importing from databases, opening a local file or drawing a 2D chemical structure to be transferred to the list by clicking on the double-arrow button. When the list of molecules is ready to be submitted, the user can start the calculations by clicking on the “Run” button. The list of SMILES on the right-hand side of the submission page, which contains one molecule per line with an optional name separated by a space, is the actual input for computation.
It can be edited as standard text, allowing SMILES to be typed or pasted. In the output figure, computed parameter values are grouped into the different sections of the one-panel-per-molecule output (Physicochemical Properties, Lipophilicity, Pharmacokinetics, Drug-likeness and Medicinal Chemistry). Moreover, the Bioavailability Radar is displayed for a rapid appraisal of drug-likeness.

The Bioavailability Radar
The Bioavailability Radar enables a first glance at the drug-likeness of a molecule. The pink area represents the optimal range for each property (lipophilicity: XLOGP3 between −0.7 and +5.0; size: MW between 150 and 500 g/mol; polarity: TPSA between 20 and 130 Å²; solubility: log S not higher than 6; saturation: fraction of carbons in the sp3 hybridization not less than 0.25; and flexibility: no more than 9 rotatable bonds). In this example, the compound is predicted not to be orally bioavailable, because it is too flexible and too polar.

The BOILED-Egg [17] allows for intuitive evaluation of passive gastrointestinal absorption (HIA) and brain penetration (BBB) as a function of the position of the molecules in the WLOGP-versus-TPSA plot. The white region corresponds to a high probability of passive absorption by the gastrointestinal tract, and the yellow region (yolk) to a high probability of brain penetration. The yolk and white areas are not mutually exclusive. In addition, the points are coloured blue if the molecule is predicted to be actively effluxed by P-gp (PGP+) and red if it is predicted to be a non-substrate of P-gp (PGP−). For an interactive analysis, the user can hover the mouse over a dot to show the structure of the molecule and click on the dot to scroll to the corresponding output panel.
In the BOILED-Egg example, Lapatinib is predicted as not absorbed and not brain penetrant (outside the Egg); Omeprazole is predicted as well absorbed but not accessing the brain (in the white) and PGP+ (blue dot); Sunitinib is predicted as passively crossing the BBB (in the yolk) but pumped out of the brain (blue dot); and Palonosetron is predicted as brain penetrant (in the yolk) and not subject to active efflux (red dot). One molecule is predicted as not absorbed and not BBB permeant because it lies outside the range of the plot (streptomycin, with a TPSA of 331.43 Å² and a WLOGP of −7.74).

FAF-Drugs4
FAF-Drugs4 is a free toolkit that assists in silico screening experiments as well as experimental screening, as it helps to select compounds for in silico / in vitro / in cellulo assays. A related goal is the computational prediction of some ADME-Tox properties (Absorption, Distribution, Metabolism, Excretion and Toxicity). Additionally, without running the whole filtering process, FAF-Drugs offers basic capabilities such as removing salts and counterions, removing duplicates, or computing the ADME-Tox descriptors without filtering the database.
o Both services (FAF-Drugs4 and FAF-QED) share a Data Curation procedure (which can also be applied using the Bank Cleaner online tool) prior to the data calculation and filtering steps.
o This step removes large compounds, inorganics, mixtures, counterions, empty structures, and 211 salts known to be employed in medicinal chemistry.
o Then, after a neutralization step, the standardization procedure is performed using the ChemAxon Standardizer Academic package, which involves a ring aromatization protocol and in-house SMARTS patterns depicting key chemical functions (amine, nitro, carboxylic acid, phosphonamide, amide, sulfonamide, phosphonate and sulfonate).
o Duplicates are detected using internal CANSMILES definitions (Pybel) for each normalized compound.
o During this step, stereochemistry is considered in order to keep stereoisomers.
o Note that users can employ the Bank Formatter service to process and correct an input SDF file (single or multiple) into the standard format needed by FAF-Drugs.
o One can also use the Filter Editor service to design a tuned physchem filter to be applied in FAF-Drugs4.
o Then, for both services, the input file is divided into sub-libraries, which are handled by as many subprocesses as there are subdivisions of the file.
o If the user runs FAF-Drugs4, physchem descriptors are computed using OpenBabel and ChemAxon.
o In order to properly compute protonation-dependent descriptors (HBA, HBD, charges...), a benchmarked in-house procedure employing the ChemAxon [22] cxcalc library at physiological pH is applied dynamically during calculations (also applied for FAF-QED).
o Toxicophore and interferent SMARTS definitions were refined for 137 structural alerts and 515 PAINS.
o Depending on the filtering ranges, Accepted, Intermediate or Rejected files are written, together with all their associated csv result files.
o If the user runs FAF-QED, physchem descriptors and alerts are computed as in FAF-Drugs4, and a csv file giving descriptor values and QED estimations is written.
o When the entire process is done, all results and graphical reports are provided on the Mobyle RPBS Portal and are available for download.

FAF-Drugs4 detailed results
July-2006: Development and deployment of FAF-Drugs on the RPBS platform.
April-2016: Major revision of SMARTS definitions for better sensitivity of toxicophore detection.
July-2015: Sept-2008: Release of the FAF-Drugs2 standalone version. Upgrade of FAF-Drugs2 to the FAF-Drugs3 web server; revision of the algorithm for the input of toxicophore detection.
2009-2010: Sept-2014: Several improvements dedicated to a web server version. Major improvements regarding rules and toxicophore detection.
August-2017: Complete revision of the FAF-Drugs4 general algorithm in order to allow multiprocessing calculations and speed up computations.
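The sorting of compounds into Accepted, Intermediate and Rejected files described in the FAF-Drugs4 workflow above can be sketched as a simple triage function. This is only an illustration under stated assumptions, not the FAF-Drugs4 algorithm: the descriptor names, the numeric ranges in FILTER_RANGES, the triage function and the rule that in-range compounds carrying structural alerts become "Intermediate" are all hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical (min, max) filter ranges; in FAF-Drugs4 the physchem filter is
# user-tunable, so these numbers are placeholders, not the tool's defaults.
FILTER_RANGES: Dict[str, Tuple[float, float]] = {
    "MW": (100.0, 600.0),
    "logP": (-3.0, 6.0),
    "HBD": (0.0, 7.0),
}


def triage(descriptors: Dict[str, float], n_alerts: int, max_alerts: int = 0) -> str:
    """Return 'Accepted', 'Intermediate' or 'Rejected' for one compound."""
    for name, (low, high) in FILTER_RANGES.items():
        if not low <= descriptors[name] <= high:
            return "Rejected"  # at least one descriptor is outside its range
    if n_alerts > max_alerts:
        # Assumed rule: in range, but flagged by structural alerts.
        return "Intermediate"
    return "Accepted"


if __name__ == "__main__":
    # Made-up descriptor values for three hypothetical compounds.
    library = {
        "cpd_A": ({"MW": 320.4, "logP": 2.1, "HBD": 2}, 0),
        "cpd_B": ({"MW": 410.5, "logP": 3.3, "HBD": 1}, 2),
        "cpd_C": ({"MW": 812.9, "logP": 1.0, "HBD": 4}, 0),
    }
    for name, (desc, alerts) in library.items():
        print(name, triage(desc, alerts))
```

Keeping the ranges in a single dictionary mirrors the idea of a tunable filter: changing the thresholds does not require touching the triage logic.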
Chemical Database

Introduction
Chemical databases and resources are the backbone of computer-aided drug discovery, whether in chemoinformatics or bioinformatics. These databases give information that can be used to build knowledge-based models for discovering and designing drug molecules. Here, we provide a list of major databases that are freely available and widely used.

Chemical databases
Database: Description
PubChem: PubChem is a database of chemical molecules which maintains three types of information, namely substances, compounds and BioAssays.
ChEMBL: This database provides comprehensive information about 1 million bioactive compounds (small drug-like molecules) with 8200 drug targets.
NCI (National Cancer Institute): The NCI database has more than 275,000 small-molecule structures and is a very useful resource for researchers working in the field of cancer/AIDS.
ChemDB: A database of five million chemicals that contains predicted or experimentally determined physicochemical properties, such as 3D structure, melting temperature and solubility.
HIT (Herb Ingredients' Targets): HIT is a comprehensive database of protein targets for FDA-approved drugs as well as promising precursors. It currently contains about 1,301 known protein targets (221 proteins are described as direct targets).
SuperNatural: A freely available database of approximately 50,000 natural compounds.
SuperDrug: This database contains approximately 2500 3D structures of active ingredients of essential marketed drugs.
PDBbind: A collection of binding affinities for protein-ligand complexes with known three-dimensional structures. It contains 5671 protein-ligand complexes.
PDBeChem: It provides comprehensive information on ligands, small molecules and monomers. Presently it consists of 15502 ligands.
HMDB (Human Metabolome Database): A database containing detailed information about small-molecule metabolites found in the human body.

ZINC 15 − Ligand Discovery for Everyone
• Many questions about the biological activity and availability of small molecules remain inaccessible to investigators who could most benefit from their answers.
• To narrow the gap between chemoinformatics and biology, a suite of ligand annotation, purchasability, target, and biology association tools was developed, incorporated into ZINC and meant for investigators who are not computer specialists.
• The new version contains over 120 million purchasable “drug-like” compounds (effectively all organic molecules that are for sale), a quarter of which are available for immediate delivery.
• ZINC connects purchasable compounds to high-value ones such as metabolites, drugs, natural products, and annotated compounds from the literature.
• Compounds may be accessed by the genes for which they are annotated as well as by the major and minor target classes to which those genes belong.
• It offers new analysis tools that are easy for non-specialists to use, yet with few limitations for experts.
• ZINC retains its original 3D roots: all molecules are available in biologically relevant, ready-to-dock formats.
• ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully programmable for chemoinformaticians and computational biologists.
• ZINC15 is a new research tool for ligand discovery that connects biological activities by gene product, drugs, and natural products with commercial availability.
• ZINC15 draws upon third-party databases such as ChEMBL, HMDB, DrugBank, and https://ClinicalTrials.gov to annotate high-information compounds that are active in, or created by, nature.
ZINC (ZINC Is Not Commercial) is a public access database and tool set, initially developed to enable ready access to compounds for virtual screening, that has become widely used for virtual screening, ligand discovery, pharmacophore screens, benchmarking, and force field development. Increasingly, however, investigators have tried to interrogate it for questions that it was not designed to answer. Simple questions, such as how many endogenous human metabolites there are, which of these are purchasable, or what natural product or drug a compound most closely resembles, were surprisingly difficult to answer. With a target in mind, investigators often wanted a focused library biased toward ligands for that target. With new compounds discovered, they often wanted to find the most similar ligands already known for that target. To optimize a ligand, they might look to the availability of starting products for synthesis, asking, for instance, how many boronic acids that contain an indole ring may be purchased in preparative quantities and how soon they will arrive.

Introduction
ZINC15 is designed to bring together biology and chemoinformatics with a tool that is easy to use for nonexperts, while remaining fully programmable for chemoinformaticians and computational biologists. Its aims are:
• To integrate and curate biological activity, chemical property, and commercial availability data for small molecules from public sources, supplemented by additional calculated properties, into a chemistry-aware relational database.
• To build a general query language and report generator that is Web URL compatible.
• To design a graphical user interface that requires no programming to interrogate the database using this query language.
• To demonstrate and document the use of this tool to answer previously difficult questions.

Molecule detail page: (1) ZINC ID, name if known, subset membership, (2) properties and 2D depiction, (3) 3D representations if available, (4) purchasing information, (5) annotated catalog membership, (6) breadcrumbs indicating current location, (7) selection tool for refinement of query, and (8) download tool.
B. Interesting bioactive and biogenic analogs section of the molecule detail page: (1) similar biogenic compounds, (2) similar bioactive compounds, (3) compounds with a shared scaffold, (4) similar aggregators, and (5) similar purchasable compounds (currently too slow to calculate). A "find more" button in each case will find more of the same.

CMNPD: a comprehensive marine natural products database towards facilitating drug discovery from the ocean
Marine organisms are expected to be an important source of inspiration for drug discovery after terrestrial plants and microorganisms. Despite the remarkable progress in the field of marine natural products (MNPs) chemistry, there are only a few open access databases dedicated to MNPs research. CMNPD currently contains more than 31 000 chemical entities with various physicochemical and pharmacokinetic properties, standardized biological activity data, systematic taxonomy and geographical distribution of source organisms, and detailed literature citations. It is an integrated platform for structure dereplication (assessment of novelty) of (marine) natural products, discovery of lead compounds, data mining of structure-activity relationships and investigation of chemical ecology.

Aim
CMNPD aims to provide an open access knowledge base for not only professional MNPs researchers but also the broader scientific community, to facilitate the research and development of marine drugs. CMNPD contains 31 561 distinct chemical entities of MNPs from over 13 000 sampling organisms.
These organisms are distributed across 7 kingdoms, 38 phyla, 93 classes, 289 orders, 682 families, 1480 genera and 3354 species. There are 15 774 active compounds mapped to 2652 targets with 72 343 bioactivities. These targets include 1122 single proteins, 923 cell lines, 459 organisms and several other types. The document library includes 128 488 scientific publications and patents, of which ∼11 000 articles describe the discovery of new compounds and structure revisions.

Schematic overview of the CMNPD content.
Phylogenetic tree of marine organisms. Core nodes include domains, kingdoms, phyla, classes, orders and families. Internal nodes include their corresponding superior and subordinate taxonomic units, which are based on the NCBI Taxonomy database. The bar chart shows the number of compounds in each family. This figure was produced using the Interactive Tree Of Life (iTOL) web server.
Knowledge graph of the chemical entity.
Statistics of various data collections in CMNPD.

o Quick search is available in the middle of the homepage and on the toolbar in the upper right corner of each page. The free-text search allows users to enter any CMNPD identifier, compound name, organism name or target name without specifying a search entity. The search bar provides suitable suggestions as the term is typed. In addition, a powerful advanced search capability is provided on the query builder page. This allows users to specify any number of query conditions. Available query conditions, which can be combined with the Boolean operators ‘AND’, ‘OR’ or ‘NOT’, include structure (drawing structure, structural classification), compound representations (e.g. compound name, molecular formula), physicochemical properties (e.g. molecular weight, ALogP), ADMET prediction (e.g. blood brain barrier penetration level, human intestinal absorption level), resources (organism name, collection site), bioactivities (e.g. target name, assay type) and bibliography (e.g. authors, DOI).
Multiple query conditions can be easily grouped together by dragging and dropping them. The inner Boolean operations of the grouped conditions are executed first.

Yet there are still only a few databases dedicated to MNPs research. The commercial databases MarinLit (http://pubs.rsc.org/marinlit) and the Dictionary of Marine Natural Products (http://dmnp.chemnetbase.com) are currently the most exhaustive and complete MNPs databases, but subscription fees may prevent broader access to them in academic research. The recently established free academic database MarinChem3D (http://mc3d.qnlm.ac) provides 3D structures of MNPs, but its biological activity data is limited. Some open access databases, such as the Seaweed Metabolite Database (SWMD) (6) and the Dragon Exploration System on Marine Sponge Compounds Interactions (DESMSCI) (7), contain only natural products produced by certain types of marine organisms.

IMPPAT: A curated database of Indian Medicinal Plants, Phytochemistry And Therapeutics
Phytochemicals of medicinal plants encompass a diverse chemical space for drug discovery. India is rich with a flora of indigenous medicinal plants that have been used for centuries in traditional Indian medicine to treat human maladies. A comprehensive online database on the phytochemistry of Indian medicinal plants will enable computational approaches towards natural product based drug discovery. IMPPAT is the largest database on phytochemicals of Indian medicinal plants to date, and this resource is the culmination of efforts to digitize the wealth of information contained within traditional Indian medicine. IMPPAT provides an integrated platform for applying cheminformatic approaches to accelerate natural product based drug discovery. IMPPAT is a manually curated database of 1742 Indian Medicinal Plants, 9596 Phytochemicals, And 1124 Therapeutic uses, spanning 27074 plant-phytochemical associations and 11514 plant-therapeutic associations.
Notably, the curation effort led to a non-redundant in silico library of 9596 phytochemicals with standard chemical identifiers and two-dimensional (2D) and three-dimensional (3D) structure information. In addition, the IMPPAT database links Indian medicinal plants to 974 openly accessible traditional Indian medicinal formulations. For the 9596 phytochemicals in this database, the authors computed physicochemical properties and predicted absorption, distribution, metabolism, excretion and toxicity (ADMET) properties using cheminformatic tools. They then evaluated the drug-likeliness of the phytochemicals in the in silico chemical library using multiple scoring schemes, such as Lipinski’s rule of five (RO5), Oral PhysChem Score (Traffic Lights), GlaxoSmithKline’s (GSK’s) 4/400, Pfizer’s 3/75, the Veber rule and the Egan rule. Based on these scoring schemes, they found a subset of 960 phytochemicals of Indian medicinal plants, out of the library of 9596, that are potentially druggable. They also provide predicted interactions between phytochemicals in the database and human target proteins from the STITCH database.

Snapshot of the result of a standard query for phytochemicals of an Indian medicinal plant. In this example, we show the plant-phytochemical association for Ocimum tenuiflorum, commonly known as Tulsi, from the IMPPAT database.
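Two of the drug-likeness scoring schemes named in this section, Lipinski's rule of five and the Veber rule, can be expressed as simple threshold checks on precomputed descriptors. The sketch below is a generic illustration, not the IMPPAT pipeline: the Descriptors class, the function names and the example values are assumptions, the thresholds are the commonly cited ones for each rule, and the tolerance of one RO5 violation is a common convention rather than something stated in the source.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed properties of one molecule (illustrative container)."""
    mw: float    # molecular weight in g/mol
    logp: float  # octanol-water partition coefficient
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors
    tpsa: float  # topological polar surface area in square angstroms
    rotb: int    # rotatable bonds


def lipinski_violations(d: Descriptors) -> int:
    """Count violations of the commonly cited rule-of-five thresholds."""
    return sum([d.mw > 500, d.logp > 5, d.hbd > 5, d.hba > 10])


def passes_lipinski(d: Descriptors, max_violations: int = 1) -> bool:
    # Assumption: at most one violation is tolerated, a common convention.
    return lipinski_violations(d) <= max_violations


def passes_veber(d: Descriptors) -> bool:
    """Veber rule: at most 10 rotatable bonds and TPSA of at most 140."""
    return d.rotb <= 10 and d.tpsa <= 140


if __name__ == "__main__":
    # Made-up descriptor values, not taken from IMPPAT.
    small = Descriptors(mw=180.2, logp=1.2, hbd=1, hba=4, tpsa=63.6, rotb=3)
    large = Descriptors(mw=785.0, logp=6.4, hbd=7, hba=14, tpsa=221.0, rotb=12)
    for label, d in (("small", small), ("large", large)):
        print(label, lipinski_violations(d), passes_lipinski(d), passes_veber(d))
```

Combining several such rule checks, each returning a boolean, is one straightforward way to select a subset of compounds that satisfies multiple scoring schemes at once.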
Snapshot of the dedicated page containing detailed information on 2D and 3D chemical structure, physicochemical properties, druggability scores, predicted ADMET properties and predicted human target proteins for a chosen phytochemical. From the dedicated page for each phytochemical, users can download the 2D and 3D structure of the phytochemical as an SDF, MOL, MOL2, PDB or PDBQT file.

Snapshot of the result of a standard query for therapeutic uses of an Indian medicinal plant. In this example, we show the therapeutic uses of Ocimum tenuiflorum from the IMPPAT database.

Snapshot of the advanced search options, which enable users to filter phytochemicals based on their physicochemical properties, druggability scores or chemical similarity to a query compound.

(a) Pie chart shows the distribution of the 1742 Indian medicinal plants in the IMPPAT database across different taxonomic families.
(b) Pie chart shows the distribution of the 9596 IMPPAT phytochemicals across different chemical super-classes obtained from ClassyFire (50).
(c) Histogram of the number of Indian medicinal plants which produce a given phytochemical in our database.
(d) Histogram of the number of therapeutic uses per Indian medicinal plant in our database.
(e–j) Histograms of the molecular weight (in g/mol), logP, TPSA (in Å²), number of hydrogen bond donors, number of hydrogen bond acceptors and number of rotatable bonds of the phytochemicals in our database.

Comparison of the physicochemical properties of IMPPAT phytochemicals with other small molecule collections.
(a) Box plot shows the distribution of the stereochemical complexity of the small molecule collections CC, DC’, NP, IMPPAT phytochemicals and TCM-Mesh phytochemicals. The median, mean and standard deviation (SD) of the stereochemical complexity for each small molecule collection are shown below the box plot.
(b) Box plot shows the distribution of Fsp3 for the small molecule collections CC, DC’, NP, IMPPAT phytochemicals and TCM-Mesh phytochemicals. The median, mean and SD of Fsp3 for each small molecule collection are shown below the box plot. Note that in both box plots the lower end of the box shows the first quartile, the upper end of the box shows the third quartile, the brown line shows the median and the green line shows the mean of the distribution of stereochemical complexity or Fsp3.
(c) Median, mean and SD of six physicochemical properties, namely molecular weight, logP, TPSA, number of hydrogen bond donors, number of hydrogen bond acceptors and number of rotatable bonds, for the small molecule collections CC, DC’, NP, IMPPAT phytochemicals and TCM-Mesh phytochemicals.

ATDB: a uni-database platform for animal toxins

Venomous animals possess an arsenal of toxins for predation and defense. These toxins show great diversity in function, structure and evolution, and are therefore of value in both basic and applied research. Recently, toxinomics research using cDNA library sequencing and proteomics profiling has revealed a large number of new toxins. Although several previous groups have attempted to manage these data, most of their resources are restricted to certain taxonomic groups and/or lack effective systems for data query and access. In addition, the description of the function and the classification of toxins are rather inconsistent, which creates a barrier to exchanging and comparing the data. Here, we report the ATDB database and website, which contains more than 3235 animal toxins from UniProtKB/Swiss-Prot, TrEMBL and related toxin databases as well as published literature. A new ontology (Toxin Ontology, TO) was constructed to standardize the toxin annotations; it includes 745 distinct terms within four term spaces. Furthermore, more than 8423 TO terms have been manually assigned to 2132 toxins by trained biologists.
• Schematic overview of the pipeline of data integration in ATDB. All sequence data were downloaded by December 2006. Signal peptide sequences were extracted by an in-house Perl script. Taking these sequences as probes, we searched the NCBI RefSeq database with BLASTP and filtered the hits by the keyword ‘venom gland’ in tissue specificity annotations. Toxin ontology construction and annotation were mainly done manually by trained biologists.

steps
1) (1) Users can focus on and expand the branches of the tree by clicking leaves (the Serpentes suborder in this figure); detailed information about the taxonomic group is then shown in the table on the right.
2) (2) To get all toxins related to the term, click the ‘getSequence’ button to display the toxin list.
3) (3, 4 and 5) Users can select toxins manually and filter them by keyword.
4) (6) The selected sequences can be downloaded as an Excel file or a FASTA file by clicking the ‘Excel download’ or ‘Fasta download’ button, respectively.

Toxin Ontology term spaces
• Category: This term space describes toxin classification and has two top-level branches: functional categories and species categories. The first is based on molecular functions across species. The second follows species classification at the top level and then the structural or functional characterization accepted by the related communities.
• Bio-activity: This term space covers most of the mechanisms by which toxins take effect, such as cytolysis, membrane interaction, channel transport regulation and vesicle transport regulation.
• Targets: This term space has three branches describing the targets of a toxin. The Organism branch lists the species or tissues affected by a toxin. Assigning the ‘Mammal’ term (TX:0000075) to a toxin means that the toxin can act on mammals. The Cell branch describes the types of cells and organelles affected by a toxin.
The Molecule branch contains a detailed classification of the molecules that interact with toxins, such as enzymes, GPCRs (G protein-coupled receptors) or ion channels.
• Symptom: This term space has two branches. The first (individual symptom) describes the symptoms that appear in an individual animal; these effects are divided into local/regional effects and systemic effects. The other branch covers physiological model symptoms, recording the symptoms induced by a toxin in certain physiological preparations (such as nerve–muscle preparations).

TO terms were manually assigned to 2132 toxins based on Tox-Prot annotations, GO annotations and related publications. Each term assignment was independently reviewed by at least two biologists to avoid human errors. Additionally, we defined five TO evidence codes to describe how these annotations were assigned and what type of evidence supports each annotation.

Future updates
There is a major release of ATDB every 3 months, with incremental updates as appropriate. Current and future work includes populating the database with more data entries. The TO system will be further examined and optimized to accommodate the development of toxinomics research. Additionally, it is planned to integrate the multiple alignment tool ClustalW (14) into ATDB for fast sequence comparison.

An updated version of the Animal Toxin Database (ATDB 2.0) now provides a new bioinformatics resource for analyzing toxin-channel (T-C) interactions. Data on more than 54,000 T-C interactions, including 9193 high-confidence interactions, have been extracted, formatted and mapped to toxin and ion channel databases.
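The term-based browsing described for ATDB (collect all toxins related to an ontology term, filter them by keyword, export them as FASTA) can be sketched as follows. This is a minimal illustration under stated assumptions, not the ATDB implementation: the class and function names are hypothetical, "related to a term" is assumed to include annotations to any descendant term, and every identifier and sequence in the usage example is made up except the ‘Mammal’ term ID quoted in the text.

```python
from typing import Dict, List, Optional, Set, Tuple


class ToxinOntology:
    """Hypothetical in-memory term tree with toxin annotations."""

    def __init__(self) -> None:
        self.parents: Dict[str, Optional[str]] = {}   # term id -> parent id
        self.names: Dict[str, str] = {}               # term id -> term name
        self.annotations: Dict[str, Set[str]] = {}    # toxin id -> term ids

    def add_term(self, term_id: str, name: str,
                 parent: Optional[str] = None) -> None:
        self.parents[term_id] = parent
        self.names[term_id] = name

    def annotate(self, toxin_id: str, term_id: str) -> None:
        self.annotations.setdefault(toxin_id, set()).add(term_id)

    def is_a(self, term_id: str, ancestor_id: str) -> bool:
        """True if term_id equals ancestor_id or lies below it in the tree."""
        current: Optional[str] = term_id
        while current is not None:
            if current == ancestor_id:
                return True
            current = self.parents[current]
        return False

    def toxins_for_term(self, term_id: str) -> List[str]:
        """All toxins annotated with the term or any of its descendants."""
        return sorted(
            toxin for toxin, terms in self.annotations.items()
            if any(self.is_a(t, term_id) for t in terms)
        )


Record = Tuple[str, str]  # (description, amino acid sequence)


def filter_by_keyword(records: Dict[str, Record], toxin_ids: List[str],
                      keyword: str) -> List[str]:
    """Keep toxins whose description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [t for t in toxin_ids if needle in records[t][0].lower()]


def to_fasta(records: Dict[str, Record], toxin_ids: List[str],
             width: int = 60) -> str:
    """Format the selected toxins as FASTA text, wrapping sequence lines."""
    lines: List[str] = []
    for toxin_id in toxin_ids:
        description, sequence = records[toxin_id]
        lines.append(f">{toxin_id} {description}")
        lines.extend(sequence[i:i + width]
                     for i in range(0, len(sequence), width))
    return "".join(line + "\n" for line in lines)


# Usage with made-up data; only TX:0000075 ('Mammal') comes from the text.
onto = ToxinOntology()
onto.add_term("TX:H000001", "Organism")                      # hypothetical id
onto.add_term("TX:0000075", "Mammal", parent="TX:H000001")
onto.add_term("TX:H000002", "Rodent", parent="TX:0000075")   # hypothetical id
onto.annotate("toxA", "TX:H000002")
onto.annotate("toxB", "TX:H000001")
records = {
    "toxA": ("hypothetical neurotoxin", "MKTLLLTLVVVTIVCLDLGYT"),
    "toxB": ("hypothetical cytolysin", "MNSFVLLAVCLAVSAQ"),
}
selected = filter_by_keyword(records, onto.toxins_for_term("TX:0000075"), "neuro")
print(to_fasta(records, selected))
```

Walking up the parent chain keeps the sketch short; a production system with hundreds of terms and thousands of annotations would more likely precompute the descendants of each term.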