Web Interface to Dictionary of Natural Products© at AstraZeneca

Péter Várkonyi, DECS Cheminformatics
AstraZeneca R&D Mölndal, Sweden
email: peter.varkonyi@astrazeneca.com

Web-based application and database exploiting DNP data

The purpose of the application is to make DNP data easily accessible AstraZeneca-wide through a simple, straightforward query interface. Queries have to support chemical structure search, and the results have to be available in electronic form, including the chemical structures.

ABSTRACT

Dictionary of Natural Products (Chapman & Hall) is a collection of chemical substances from natural sources, updated twice a year. The data are catalogued, and the catalogue is the only way to retrieve information from the database. It was decided to make the database searchable and available to all researchers at the company. The data, including the chemical structures of the substances, were placed in an ORACLE database. The structures are converted and processed with JChem. The query interface is a web application built with Java Server Pages (JSP) technology. The most important fields (id, full chemical name, CAS number, molecular weight, use/importance, and pharmaceutical importance) and the chemical structure were selected to be searchable. In the query form the chemical structure is entered with the MarvinSketch applet. The alphanumeric search is carried out in ORACLE through JDBC, and the chemical structure search is carried out by JChem. The chemical structures in the query results are presented with the MarvinView applet. Besides the searchable fields and the chemical structure, the bibliographic references are displayed.

Dictionary of Natural Products of Chapman & Hall

Substances produced by natural organisms have always played a significant role in pharmaceutical research. Dictionary of Natural Products is one of the most extensive collections of these substances. It contains nearly 200 000 records corresponding to approximately 40 000 parent compounds.
Besides the name and chemical structure of each substance, DNP contains some of its most important properties, a description of its significance, information about the hazards associated with it, and bibliographic references to it.

The data are delivered in 2 files:
- a text file containing the compound identifiers and all the alphanumeric data. The "records" of the text file are meant to facilitate printing a record in publishing quality, not searching the content of the fields.
- an SDF file containing the compound identifiers and the chemical structures in MDL's SD format.

The fields available in the text file can be divided into 4 types:
- identifiers and catalogue numbers, like:
    UKEY  DNP record identifier
    CASM  Chemical Abstracts registry number
    ALDR  Aldrich catalogue number
- properties, like:
    MOLF  Molecular formula
    OPTR  Optical rotation
    LGPE  Partition co-efficient data (experimental)
- use or importance, like:
    UIMP  Use, importance
    DUIM  Pharmaceutical importance
    HAZD  Hazard
- bibliographic reference data, like:
    DATE  Date of reference
    INIT  Initials of author

The full list of available fields in DNP is shown in Table 1.
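To illustrate how the identifiers in the SDF file described above can be paired with their structures, here is a minimal Java sketch. It relies only on the general layout of MDL's SD format: a molfile block ending with "M  END", data items introduced by a "> <NAME>" header, and a "$$$$" line closing each record. The class name SdfRecordReader, the made-up sample records, and the assumption that the identifier is stored under a data item named UKEY are illustrative and are not taken from the DNP distribution; in the application itself the structures are converted and processed with JChem.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SdfRecordReader {

    /** One SD-file record: the raw molfile block plus its data items. */
    public static final class SdfRecord {
        public final String molBlock;
        public final Map<String, String> data;

        SdfRecord(String molBlock, Map<String, String> data) {
            this.molBlock = molBlock;
            this.data = data;
        }
    }

    /** Splits SD-file lines into records; each record ends with a "$$$$" line. */
    public static List<SdfRecord> parse(Iterable<String> lines) {
        List<SdfRecord> records = new ArrayList<>();
        StringBuilder mol = new StringBuilder();
        Map<String, String> data = new LinkedHashMap<>();
        boolean inMolBlock = true;   // every record starts with the molfile block
        String currentTag = null;    // data item whose value lines are being read

        for (String raw : lines) {
            // Tolerate Windows line endings.
            String line = raw.endsWith("\r") ? raw.substring(0, raw.length() - 1) : raw;

            if (line.startsWith("$$$$")) {
                records.add(new SdfRecord(mol.toString(), data));
                mol = new StringBuilder();
                data = new LinkedHashMap<>();
                inMolBlock = true;
                currentTag = null;
            } else if (inMolBlock) {
                mol.append(line).append('\n');
                if (line.startsWith("M  END")) {
                    inMolBlock = false;
                }
            } else if (line.startsWith(">")) {
                // Data header such as "> <UKEY>": the field name sits in angle brackets.
                int open = line.indexOf('<');
                int close = line.indexOf('>', open + 1);
                currentTag = (open >= 0 && close > open) ? line.substring(open + 1, close) : null;
                if (currentTag != null) {
                    data.put(currentTag, "");
                }
            } else if (currentTag != null) {
                if (line.isEmpty()) {
                    currentTag = null;   // a blank line closes the data item
                } else {
                    data.merge(currentTag, line, (a, b) -> a.isEmpty() ? b : a + "\n" + b);
                }
            }
        }
        return records;   // lines after the last "$$$$" are ignored
    }

    /** Two made-up records for demonstration only; these are not DNP data. */
    public static String sample() {
        String atomC = "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n";
        String atomO = "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n";
        String counts = "  1  0  0  0  0  0  0  0  0  0999 V2000\n";
        return "methane\n  sketch\n\n" + counts + atomC + "M  END\n"
             + "> <UKEY>\nEXAMPLE-0001\n\n$$$$\n"
             + "water\n  sketch\n\n" + counts + atomO + "M  END\n"
             + "> <UKEY>\nEXAMPLE-0002\n\n$$$$\n";
    }

    public static void main(String[] args) {
        List<SdfRecord> records = parse(Arrays.asList(sample().split("\n", -1)));
        for (SdfRecord r : records) {
            System.out.println(r.data.get("UKEY") + ": " + r.molBlock.split("\n").length + " molfile lines");
        }
    }
}
```

Keeping the molfile block as an opaque string leaves all chemistry to the structure toolkit; the sketch only recovers the identifier that links a structure to the alphanumeric record with the same key in the text file.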
Table 1 - Fields in Dictionary of Natural Products

    aldr  Aldrich catalogue number
    boil  Boiling point/sublimation point
    casm  CAS Registry number
    ctfl  Connection table status code
    dens  Relative density
    derd  Derivative descriptor
    devs  Development status
    diag  Diagram code
    docn  Entry number(s) in printed work(s)
    docl  Latest DOCN
    docs  Special DOCN
    dref  Derivative cross-reference
    entr  Control number
    exno  Exchange number (Chapman & Hall number)
    exnx  Old exchange number
    fluk  Fluka catalogue number
    gens  General statement
    hazd  Hazard
    hazf  Hazard flag
    indx  Index name
    lgpc  Partition co-efficient data (calculated)
    lgpe  Partition co-efficient data (experimental)
    melt  Melting point/freezing point
    misc  Miscellaneous information
    molf  Molecular formula
    molw  Molecular weight
    name  Entry name
    optr  Optical rotation
    phys  Physical description, solvent of recrystallization
    pkas  pKa value
    prog  Progress code
    rare  Rare Chemicals Library number
    rgrp  Linear diagram
    rtec  RTECS registry number
    sigm  Sigma catalogue number
    solp  Solubility
    sorc  Source, synthesis
    sref  Diagram cross-reference
    stra  Structure by analogy
    subs  Subset code
    supc  Supelco catalogue
    supp  Supplier data
    syns  Synonym
    tocn  Type of compound code
    uimp  Use, importance
    ukey  Unique key
    vard  Variant descriptor
    xcas  Additional entry specific CAS registry number

As a first step, we designed the application to search and display the most relevant information besides the chemical structure. The application is designed to be easily expandable with further fields to be either searched or displayed. The selected fields are as follows:
- CAS number
- DNP record id
- Molecular formula
- Molecular weight
- Use, importance
- Pharmaceutical importance
- Name

The database environment used to store and search the data is ORACLE. The chemical structure information is handled through ChemAxon's JChem JDBC link to the ORACLE database. The tables of the database and their relationships are shown in Fig. 1.

Fig. 1 The tables and their relationships in the ORACLE database

    REFERENCE:  UKEY, REFERENCE, ORDER
    DNPSTRUCT:  CD_ID, CD_STRUCTURE, CD_SMILES, CD_FORMULA, CD_MOLWEIGHT, CD_HASH, CD_FLAGS, CD_TIMESTAMP, CD_FP1 ... CD_FP16, UKEY
    SINGLE1:    UKEY, NAME, CASM, MOLF, MOLW, UIMP, DUIM
    Relationships shown in the figure: one to many; one to one

The application interface uses Java Server Pages (JSP) technology. The query form uses ChemAxon's MarvinSketch applet for entering the chemical structure. The results form incorporates ChemAxon's MarvinView applet to display the chemical structures. The bibliographic references are displayed on a separate form on demand. The chemical structure search and the export of the results to various electronic formats are done by JChem as well.

Screenshots: Query Form, Result Form, Reference Form

Bibliographic reference fields

    date  Date of reference
    etal  Multiple author indicator
    init  Initials of author
    page  Page number
    ptee  Patentee
    rkey  Reference unique key
    rsrt  Sort key for references
    rtag  References contents tag
    surn  Author's surname
    titl  Journal or book title
    voln  Volume number
    xtra  Extra information

The data are categorized by parent compound and sorted into chemical structure groups (e.g. carbohydrates, oxygen heterocycles, polyketides, etc.). A parent compound can have several derivatives and variants, and the derivatives can also have several variants.

References

Dictionary of Natural Products on CD-ROM. Chapman & Hall, 2006.
Chapman & Hall Export Format, Version 9.0, July 1997.

Acknowledgement

In developing this application, the JSP example application in the JChem manual was consulted. Szabolcs Csepregi and Szilárd Dóránt of ChemAxon helped me through the initial learning phase of using JChem.