UNIVERSITY OF FLORIDA
CODESSA PRO User's Manual
2005

Table of contents

CHAPTER 1. INSTALLING CODESSA PRO
  1.1 SYSTEM REQUIREMENTS
  1.2 INSTALLATION OF THE PROGRAM

CHAPTER 2. DESCRIPTION OF CODESSA PRO
  2.1 CONCEPTS AND DEFINITIONS
      Objects
      Storage
      Snapshot (Cache File)
      Workspace
      Folder
      Structure
      Descriptors
      Property
      Descriptor/Property Matrix
      The Targets of the Analysis
      Correlation
      List
  2.2 CODESSA PRO VISUAL INTERFACE
      2.2.2 WORK AREA
            2.2.2.1 Structure View Window
            2.2.2.2 Correlation View Window
      2.2.3 OBJECT WINDOW
      2.2.4 LOG WINDOW

CHAPTER 3. USING CODESSA PRO TO RUN A PROJECT
  3.1 STARTING A NEW PROJECT
  3.2 CREATING STRUCTURE FILES
      Molfiles
      SCF files
      Force Files
  3.3 CREATING A PROPERTY FILE
  3.4 CREATING A STORAGE
  3.5 PREPARING DATA IN A STORAGE
  3.6 CALCULATION OF MOLECULAR DESCRIPTORS
  3.7 CALCULATION OF THE QSAR/QSPR CORRELATIONS
      3.7.1 MLR
      3.7.2 BMLR
      3.7.3 HMPRO
      3.7.4 PLS
      3.7.5 PCR
  3.8 VIEWING CORRELATIONS
  3.9 PRINTING OUT RESULTS
  3.10 MANIPULATING LISTS

CHAPTER 4. CODESSA PRO DESCRIPTORS
  4.1 Constitutional Descriptors
  4.2 Topological Descriptors
  4.3 Geometrical Descriptors
  4.4 Electrostatic Descriptors
  4.5 CPSA Descriptors
  4.6 MO Related Descriptors
  4.7 Quantum Chemical Descriptors
  4.8 Thermodynamic Descriptors

CHAPTER 5. METHODS OF QSAR/QSPR MODEL DEVELOPMENT
  5.1 Multilinear Regression
  5.2 Selection of descriptors
  5.3 Multivariate methods
  5.4 Test of the results
  5.5 External validation set
  5.6 Leave-One-Out crossvalidation
  5.7 Leave-Many-Out crossvalidation
  5.8 Randomization test

INTRODUCTION

CODESSA (Comprehensive Descriptors for Structural and Statistical Analysis) PRO is a comprehensive program for developing quantitative structure-activity/property relationships (QSAR/QSPR). It integrates all the mathematical and computational tools needed to (i) calculate a large variety of molecular descriptors on the basis of the 3D geometrical structure and/or quantum-chemical wave function of chemical compounds; (ii) develop (multi)linear and non-linear QSPR models of the chemical and physical properties or biological activity of chemical compounds; (iii) carry out cluster analysis of the experimental data and molecular descriptors; (iv) interpret the developed models; (v) predict property values for any chemical compound with known molecular structure.

This manual provides further information on the features and methods available in the CODESSA PRO program, their purpose, and the interpretation of the results. In addition to the overall description of the program, this manual gives guidelines for the development of QSAR/QSPR models using the CODESSA PRO program.

The CODESSA PRO program is designed to operate in the following Microsoft Windows environments: Windows 2000, Windows XP, Windows NT and Windows 9x.

Chapter 1
Installing CODESSA PRO

1.1 System requirements

Hardware:

Processor: Pentium class system as a minimum. All later Intel processors are supported with assembly-level optimization. Current AMD processors are treated as a Pentium with a higher clock frequency (no special optimization).

Memory: 128 MB minimum; the default tuning assumes 256 MB.

CD-ROM: A CD-ROM or a compatible DVD device is required to install the CODESSA PRO software.

Other: A 3D graphics accelerator is highly recommended because OpenGL is used extensively for the presentation of molecules. A 2-button mouse is required.
Software:

Operating system: Windows XP Professional, Windows XP Home, Windows 2000, Windows NT (with limitations on the use of network drives) or later versions.

Operating system extensions: Internet Explorer 4.0 SP2 or a newer version, for authorization.

1.2 Installation of the program

1. Insert the CODESSA PRO CD in the CD-ROM drive of your computer.
2. If the Windows Autorun feature is turned on, the installation options will be displayed automatically. If the installation options do not appear, open the Windows Start Menu and choose Run. The Run dialog box opens.
3. Enter the following command in the Open text box (assuming D: is your CD-ROM drive): D:\setup.exe
4. Choose OK. The CODESSA PRO setup program initializes and the CODESSA PRO setup dialog box opens.
5. Follow the on-screen instructions for installation. The CODESSA PRO setup program will detect an existing CODESSA PRO version on your computer and offer the choice of uninstalling or keeping the older version of the program.

Chapter 2
Description of CODESSA PRO

2.1 Concepts and definitions

The use of CODESSA PRO requires an understanding of the following concepts:

Objects: These are the individual files or items used by CODESSA PRO. There are four types of objects: structures, descriptors, properties, and models. Using the CODESSA PRO program will frequently require manipulation of single objects or lists of objects. Most objects (e.g. descriptors and models) are named by CODESSA PRO explicitly.

Storage: All items that are connected with a project are stored in a single location within the file system, the storage. To change the storage location, click Option on the main menu-bar and select Storage. The dialogue box that opens allows the user to control the storage location of files related to chemical structures, QSAR/QSPR correlations, molecular descriptors, lists of objects, etc.

Snapshot (Cache File): The snapshot (CODESSA PRO cache file) is a binary (compressed) representation of all items in storage.
This file is located in RAM whenever work is being performed in the CODESSA PRO Visual Interface (CVI) module. The cache file reloads each time the CVI detects a change in the storage contents. The snapshot can be reloaded manually at any time by opening the pull-down menu Calculate on the menu-bar and selecting Load storage, or by using the F5 keyboard shortcut.

Workspace: The workspace is a window within the CVI that is related to the snapshot. It is displayed in the workspace area on the top left side of the CVI frame.

Folder: A folder is used to store lists of a certain type of object. CODESSA PRO uses four main folders, for the chemical structures, molecular descriptors, properties, and QSAR/QSPR models (correlations), respectively. Each of the first three folders has a system-generated list named All. These lists contain all structures, descriptors or properties, respectively, for the current session. The folder Descriptors also includes system-defined descriptor lists according to the descriptor group and type. A particular descriptor item may be present in several lists. The folder Correlations contains a list for each property item in the property folder. The list is generated automatically by the system when a property item is created. When correlations are calculated, the 50 (by default) correlation items with the best statistical characteristics are stored for any single property.

Structure: A structure is a representation of an individual chemical object having a precise chemical constitution. Examples of structures include a single molecule, a monomeric unit of a polymer, or a molecular cluster with a definite composition. The minimum information for a structure must include the types of atoms involved and their connectivity.
Each structure is linked to three files containing: (1) information about the 3-D molecular structure in the MDL molfile format, (2) information obtained from the quantum chemical self-consistent field molecular orbital (SCF MO) calculations using the MOPAC program package, (3) information obtained from the quantum chemical force calculations using the MOPAC program package. The molfile structures are stored in the 3D MOL (MOPAC) Subfolder, the SCF MO output files are stored in the SCF Subfolder, and the output files of the quantum chemical force calculations are stored in the THERMO Subfolder. SCF and Force structures are created and properly stored automatically by the CMol3D module. The name for a particular structure is exactly the same in each directory. Structure names are in the form Sddddddd.xxx and are generated automatically in sequential order by CODESSA PRO.

Descriptors: Descriptors are defined as numerical characteristics associated with chemical structures. They are derived from the chemical constitution, topology, geometry, wave function, potential energy surface or some combination of these items for a given chemical structure. The values of a particular descriptor can be provided externally by the user or calculated automatically by the CODESSA PRO program. Each descriptor value must be associated with a previously defined structure. Descriptors calculated by CODESSA PRO have specific pre-defined names; renaming the descriptors is not recommended.

Property: A property is a physical or chemical characteristic, biological activity, or other characteristic of interest. Each property value must be associated with a structure located in the 3D MOL (MOPAC) Subfolder.

Descriptor/Property Matrix: The descriptor/property matrix consists of descriptor (all columns except the last) and property (the last column) values.
The horizontal dimension of the matrix is the descriptor/property ID sequence and the vertical dimension is the structure ID sequence. The matrix has two presentations, binary and text. The binary presentation is used within the CODESSA PRO program, while the text presentation in the CSV format can be used to transfer data to other software packages.

The Targets of the Analysis: Any property, list of structures, or list of descriptors can be set as a target for a given workflow within CODESSA PRO. The default targets for a new snapshot file are:

Object                 Default value
Property               First selected
List of descriptors    All group
List of structures     All group

The selected targets are used in the formation of the descriptor/property matrix.

Correlation: A correlation represents the results of a (multi)linear regression between a property of interest ($y$) and one or more selected descriptors ($x_i$):

$$ y = a_0 + \sum_i a_i x_i $$

The information stored for each correlation includes the following, for the set of structures used in the derivation of the correlation:
- regression coefficients ($a_i$)
- correlation coefficient ($R^2$)
- squared standard deviation ($s^2$)
- Fisher criterion $F$ value

By default, each correlation is named by the number of chemical structures involved ($N$), the correlation coefficient ($R^2$), the crossvalidated correlation coefficient ($R^2_{CV}$), the Fisher criterion value ($F$), and the squared standard deviation ($s^2$).

List: A list is a collection of chemical structures, descriptors, physical, chemical or other properties of interest, or models (correlations). Lists can be either system type or user type. System lists cannot be deleted by the user.

2.2 CODESSA PRO Visual Interface

Figure 1. The CODESSA PRO Visual Interface

The CODESSA PRO Visual Interface has the following screen areas:

Menu Bar: Menus related to the different tasks performed by CODESSA PRO.
Tool Bar: Shortcuts to commonly performed tasks.
Workspace: An on-screen presentation of the current cache file.
Work Area: The screen space for various view windows.
Object Window: The object window depicts information on almost all objects (structures, properties, descriptors, correlations).
Log Window: Protocol of all operations on the storage.

2.2.1 Workspace

The CODESSA PRO workspace area (Fig. 2) contains information on the current cache file in the form of a directory tree. The folders for the different objects, i.e. chemical structures, molecular descriptors, properties, and QSAR/QSPR models, are included in this tree.

Figure 2. CODESSA PRO Workspace area

The Structures folder (Fig. 3) always includes a subfolder All. All structures defined in the given cache file are listed there. The name listed is the name given on the first line of the corresponding molfile. If this line is empty, the CODESSA PRO name of the molfile (e.g. S000000x) is used instead. Double clicking on the name of a compound in this list opens a structure view window in the work area, depicting the 3-D structure of this compound. A selection of compounds made by the user may be stored in a separate subfolder.

Figure 3. The Structures folder

The folder Descriptors (Fig. 4) always includes a subfolder All that lists all the descriptors defined within the given cache file. The descriptors supplied to CODESSA PRO externally by the user are listed in the subfolder External. The descriptors calculated by CODESSA PRO are listed in the respective subfolders (Constitutional, Topological, Geometrical, Charge-related, Semi-empirical, Thermodynamical, Molecular type, Atomic type, Bond type). The currently used subfolder is shown in boldface.

Figure 4. The Descriptors folder

The folder Properties (Fig. 5) always includes a subfolder All that lists all properties considered in the given cache file. The currently treated property is shown in boldface. The name of each property is given on the first line of the respective property file (P000000x.prp).
Figure 5. The Properties folder

The folder Models (Fig. 6) includes the list of all QSAR/QSPR models stored within the current cache file. A separate subfolder is created automatically for each property. The details related to each correlation are displayed when the pointer is held over the respective line. Double clicking on a correlation launches the correlation view in the work area.

Figure 6. The Models folder

2.2.2 Work Area

2.2.2.1 Structure View Window

Figure 7. The CODESSA PRO Structure View Window

Double clicking on the name of a structure in the Structures folder opens a structure view window (Fig. 7) in the work area. The 3-D structures can be viewed as
- wire-frame
- ball and stick
- CPK (Corey-Pauling-Koltun) surface
- solvent accessible surface (SASA)
- Zefirov's charges on SASA
- MOPAC charges on SASA

Right clicking inside the window opens the view and label context pop-up menu. Selecting view provides the six options mentioned above, and the label identifies the number or the type of each atom. When the pointer is positioned over the view window, it changes to a 4-sided arrow that can be dragged to rotate the structure around its mass center.

2.2.2.2 Correlation View Window

Figure 8. The CODESSA PRO Correlation View Window

Double clicking on a correlation object in the Models folder opens the correlation view window (Fig. 8) in the work area. The graph appearing there shows the observed (experimental) property values versus the predicted values for this QSAR/QSPR correlation model. When the window is first opened, all the points for outliers will be blue rather than yellow. A single click on a point displays its properties and their associated values in the properties window. Double clicking on a point opens a structure view window in the work area for the structure that corresponds to that point.
Right clicking on the window opens a context pop-up menu that makes it possible to show structures, show descriptors, create sublists, mark outliers and mark non-outliers as desired.

Double clicking on a descriptor object also opens a correlation view window. The window shows the linear relationship between the descriptor and the selected property. The values predicted by any chosen model may be viewed by selecting that model, clicking the view pull-down menu, and then choosing predicted properties. This displays the observed property values, the predicted property values and the errors.

2.2.3 Object window

The object window displays information about the objects, folders or lists selected in the workspace. If a folder or list is selected in the workspace, the number of objects it contains is displayed. If an object is selected, the information provided in the object window depends on the type of object.

Fig. 9 shows the information given in the object window when a structure object is selected in the workspace. The window also shows the experimental value and the calculated value of the property for the current structure, which are not shown in the figure.

Figure 9. Data on structure in object window

The information provided in the object window when a descriptor object is selected in the workspace is presented in Fig. 10.

Figure 10. Data on descriptor in object window

When a property object is selected in the workspace, information related to the respective .prp file is displayed in the object window (Fig. 11). This includes the name of the property, comments and the number of structures for which the given property values are available.

Figure 11. Data on property in object window

The object window for correlation objects lists the details of the correlation (Fig. 12). The details displayed include the properties used in the correlation.

Figure 12. Data on correlation in object window

2.2.4 Log Window

The log window keeps a protocol of each operation that has been performed on the storage. Double clicking on a correlation in the workspace also gives the full description of the correlation in this window (Fig. 13). The text can be cut and pasted directly into a word processing program.

Figure 13. The information on a correlation object in the log window

The edit pull-down menu has options to select all log, copy from log, paste to log and clear all log. The information for several correlations can be transferred to a word processing program by clearing all log, double clicking on each of the correlations, then clicking select all log, and finally copy from log.

Chapter 3
Using CODESSA PRO to run a project

3.1 Starting a new project

It is important to decide whether to use a general storage area for several projects or a separate storage area for each individual project. A general storage area allows a research group to avoid repeating the calculation of molecular descriptors and to economize on the amount of storage space used. If a general storage area is to be used, it will be necessary to keep a global index of the files and of the chemical structure and property names. The names of the chemical structures are defined in the name section of the respective molfile. The name of the property investigated is defined in the first line of the property file (.prp).

3.2 Creating structure files

Molfiles

Structures can be created using any chemical drawing program that converts 2-D structures to 3-D structures, e.g. HyperChem [www.hyper.com], ISIS Draw [www.mdl.com], ChemDraw [www.cambridgesoft.com], or PCModel [www.serenasoft.com]. The 3-D geometries must be optimized in a two-step process using MOPAC software [http://ccl.net/cca/software/SOURCES/FORTRAN/mopac7_sources/index.shtml].
The first optimization uses the keyword AM1, and the second optimization uses the keywords AM1 GNORM=0.01 PRECISE. The additional keyword MMOK should be used for any molecule with peptide linkages. If either optimization fails, it should be repeated using additional keywords until it succeeds. The following list gives the progression of additional keywords to add:

EF
XYZ
EF XYZ
EF XYZ GEO-OK

If none of the keywords or combinations listed above leads to a complete optimization of the structure, a different starting geometry, described by the respective Z-matrix, has to be used. Note that dummy atoms used for keeping the symmetry of the structure must be deleted prior to performing SCF and thermodynamic calculations.

The optimized structures should be converted to molfile format, given the extension .mol, and stored in the mol3dmop directory of the CODESSA PRO storage.

SCF files

The SCF file is the output file of the MOPAC calculation for a given chemical structure (molecule, molecular cluster) using the following keywords: AM1 VECTORS BONDS PI POLAR PRECISE ENPART EF. The files should have the extension .mno and be stored in the mopscf directory of the CODESSA PRO storage.

Force Files

The Force file is the output file of the MOPAC calculation for a given chemical structure (molecule, molecular cluster) using the following keywords: AM1 FORCE PRECISE THERMO ROT=1. The files should also have the extension .mno and be stored in the moptherm directory of the CODESSA PRO storage.

3.3 Creating a property file

Once the property values of interest have been collected and carefully checked for each structure, the data must be assembled into a property (.prp) file. This is essentially a text file. Spreadsheet programs may be used to simplify the process of assembling the data, but the file must then be saved as text. The line format of the file must be exactly as follows:

1st line    Name of property
2nd line    Comments
3rd line    Structure1    Value1
....        Structure2    Value2
....        Structure3    Value3

The individual lines have the following content:

Name of property: The name that is assigned to the property object. The object will appear within one or more lists in the property folder. A list with the same name will also be generated automatically by CODESSA PRO and stored in the correlations folder.

Comments: Any comments written in this location will be listed in the comments section of the object window whenever the property is selected.

Structure1: The number of the first structure (not the filename). Leading zeros in this number are ignored; the structure S0000023.mol can be listed as 00023, 023, or just 23.

Value1: The property value associated with Structure1. There must be a delimiter (space or tab) between the structure number and the property value. The type or number of delimiters is irrelevant.

Alternatively, a property can be entered from within the program. In the CODESSA PRO window, click the edit pull-down menu and choose add property. Then input the name, comment, and property values according to the format indicated in the notepad and, finally, use Load storage from the Calculate menu to load the data.

3.4 Creating a storage

First of all, the user has to create the main (root) folder for a given project using any of the Windows options for this procedure (e.g. Windows Explorer). To create storage for a new project, this folder has to be specified using the storage function in the option pull-down menu. After clicking OK in this window, the project folder will contain a list of subfolders, an example of which appears in Fig. 14. The user may rename all subfolders or keep their default names shown in the shaded area.

Figure 14. Definition of the project storage

All the storage subfolders should be empty before starting the project calculations.
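As a rough illustration of the property file format described in Section 3.3, the following Python sketch reads the text of a .prp file into memory. It is not part of CODESSA PRO; the function names parse_prp and structure_filename are hypothetical, and the handling of blank lines is an assumption that the manual does not cover.

```python
from typing import Dict, Tuple


def parse_prp(text: str) -> Tuple[str, str, Dict[int, float]]:
    """Parse .prp text into (name, comments, {structure number: value}).

    Hypothetical helper: line 1 is the property name, line 2 the comments,
    and every later line holds a structure number and a value.
    """
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("a property file needs a name line and a comments line")
    name = lines[0].strip()
    comments = lines[1].strip()
    values: Dict[int, float] = {}
    for line_no, line in enumerate(lines[2:], start=3):
        fields = line.split()  # any run of spaces or tabs acts as one delimiter
        if not fields:
            continue  # assumption: blank lines are skipped
        if len(fields) != 2:
            raise ValueError(f"line {line_no}: expected 'structure value'")
        # int() drops leading zeros, so 00023, 023 and 23 are the same structure
        values[int(fields[0])] = float(fields[1])
    return name, comments, values


def structure_filename(number: int) -> str:
    """Build the Sddddddd.mol name that corresponds to a structure number."""
    return f"S{number:07d}.mol"
```

For example, a data line containing 00023 and a value is stored under the key 23, and structure_filename(23) gives the molfile name S0000023.mol used in the manual's own example.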
In the first step of filling the storage, the previously created molfiles of all chemical structures related to the project should be stored either in the MOL subfolder or in another specified location. Regardless of the original naming of the molfiles, CODESSA PRO will automatically name the files Sddddddd.mol, where ddddddd is a sequential number.

3.5 Preparing data in a storage

After all molfiles for the chemical structures have been saved in the MOL Subfolder, selecting Load storage from the Calculate menu automatically prepares all the data that are necessary for carrying out the correlation analysis. If the molfiles are in a different location, they can be included by opening the edit pull-down menu and choosing the add structure function. The process through which CODESSA PRO prepares all the data, except property files, includes 11 stages. Each of these 11 stages is performed automatically, in sequence, by CODESSA PRO.

Stage 1
This is the first model building stage. The program applies a simple molecular mechanics preoptimization and converts the 2-dimensional structures from the MOL subfolder into 3-dimensional structures, which are stored in the 3D MOL (Molecular Mechanic) subfolder. Missing hydrogen atoms are added according to the formal valence rules. The 3D structure building is done deterministically for the first iteration and stochastically for subsequent iterations.

Stage 2
This stage converts the files in the 3D MOL (Molecular Mechanic) subfolder from the MDL molfile format into the MOPAC input file format for preliminary geometry optimization. The resulting files are stored in the MOPAC Optimization (step1) subfolder.

Stage 3
In this stage, the preliminary geometry optimization of the molecular structures is carried out using CMOPAC (a MOPAC Version 7 clone). If the final optimization gradient is less than 5.0, the program automatically proceeds to Stage 4.
If the gradient is more than 5.0, the program automatically goes back to Stage 1 and starts the structure building again, stochastically. This process is repeated up to 10 times until the optimization stage passes the criterion of a gradient value less than 5.0. If a satisfactory result is not achieved after 10 iterations, the program stops.

Stage 4
During this stage, the program converts the MOPAC output files in the MOPAC Optimization (step1) subfolder into MOPAC input files for precise geometry optimization, and stores them in the MOPAC Optimization (step2) subfolder.

Stage 5
CODESSA PRO performs the precise geometry optimization at this stage. When the geometry optimization gradient value falls below 0.5, the program proceeds to Stage 6. If the minimum gradient achieved exceeds 0.5, the program goes back to Stage 1 and repeats the structure building, stochastically. This process is repeated for up to 10 iterations until the gradient value test for the precise optimization is satisfied. If 10 iterations are unsuccessful, the calculation is stopped.

Stage 6
At this stage, the CODESSA PRO program converts the MOPAC output files in the MOPAC Optimization (step2) Subfolder into MOPAC input files for the calculation of molecular properties. This is done without explicit calculation of the Hessian matrix. The resulting MOPAC input files are stored in the SCF Subfolder.

Stage 7
At this stage, the CODESSA PRO program uses CMOPAC to produce a set of molecular data that are used subsequently as the molecular descriptors or as the initial data for the calculation of descriptors. The resulting output files are stored in the SCF Subfolder.

Stage 8
At this stage, the CODESSA PRO program takes the data on molecular geometries stored in the MOPAC Optimization (step2) Subfolder, prepares MOPAC input files for force calculations, and stores them in the THERMO Subfolder. At this stage the keyword ROT=1 is added, which is valid for the C1, Ci, and Cs symmetry point groups.
However, if the molecular symmetry differs from these point groups, the respective keyword should be edited manually in the corresponding input (.mni) file in the MOPAC Optimization (step1) Subfolder or the MOPAC Optimization (step2) Subfolder.

Stage 9
At this stage, CODESSA PRO calculates the molecular properties using the Hessian matrix (“force” calculation). If the molecular structure is in a transition state, the program returns to Stage 1 and restarts with stochastic structure generation. If the molecule is in its ground state, the program proceeds to Stage 10. The program terminates after 10 attempts to find a stable structure.

Stage 10
This stage is reached only when all the geometry tests described above are satisfied. At this stage, the molfiles in the 3D MOL (MOPAC) Subfolder are created from the formal connectivity information in the respective molfiles in the MOL Subfolder and the atomic coordinates in the MOPAC output file (.mno) in the MOPAC Optimization (step2) Subfolder.

Stage 11
At this stage, CODESSA PRO has all the information necessary to proceed to the calculation of molecular descriptors. This information is stored in the molfiles in the 3D MOL (MOPAC) Subfolder, in the MOPAC output files (.mno) in the SCF Subfolder, and in the MOPAC output files (.mno) in the THERMO Subfolder. All intermediate files are deleted from the storage at this stage. Only the MDL MOL files in the MOL Subfolder and the 3D MOL (MOPAC) Subfolder, and the MOPAC input and output files in the SCF Subfolder and THERMO Subfolder, remain.

If the calculation cycle is not finished at Stage 1, the files from the last successful stage can be changed manually (usually by editing MOPAC keywords) and the calculations restarted. If four files (the MDL MOL files in the MOL Subfolder and the 3D MOL (MOPAC) Subfolder, and the MOPAC output files in the SCF Subfolder and the THERMO Subfolder) are present and up-to-date, no further calculation will be done.
The up-to-date state is determined from the modification times of the files. If the files are not up to date, the calculation is resumed starting from the most recently corrected file. To invoke the recalculation, select Calculate/Load Storage (F5) or Calculate/Descriptors (F6) from the menu. In the latter case, only the problematic structures that appear in the MOPAC Optimization (step1) Subfolder and the MOPAC Optimization (step2) Subfolder will be recalculated. If the problems arise from an improperly formatted structure, you must redraw the structure in the MOL Subfolder.

Before proceeding further, it is advisable to check that all subfolders in the storage are empty except the following:
1. MOL Subfolder - this subfolder should contain the molfiles of all structures.
2. SCF Subfolder - this subfolder should contain only the .mni and .mno files of all structures.
3. THERMO Subfolder - this subfolder should contain only the .mni and .mno files of all structures.
4. 3D MOL (MOPAC) Subfolder - this subfolder should contain only the molfiles of all structures, which are different from the molfiles stored in the MOL Subfolder.

3.6 Calculation of molecular descriptors

One of the major advantages of CODESSA is its large pool of molecular descriptors, which are calculated for each chemical structure listed in the property (.prp) file. The descriptors are divided into several lists. The first list contains external descriptors, added manually by the user. The remaining lists fall into two categories. The first category groups descriptors according to the origin of their calculation and contains the following lists: constitutional, topological, geometrical, charge related, semi-empirical, and thermodynamic. The second category groups descriptors according to the (sub)molecular unit they refer to and contains the following lists: molecular type, atomic type, and bond type.
More detailed information on the descriptors available in the program is given in Chapter 4.

Descriptors are automatically calculated for all structures added to the storage. Calculation of quantum chemical or thermodynamic descriptors can be turned off by choosing Descriptor calculation… from the Option menu and selecting the respective options in the Descriptor calculation dialog panel (Fig. 15).

Figure 15. Descriptor calculation options

To recalculate the descriptors linked to the currently selected property, choose Descriptors from the Calculate menu.

3.7 Calculation of the QSAR/QSPR correlations

The structures, descriptors, and properties of interest must be properly selected (they should all appear in boldface font in the workspace). If they are not, right-click the name of a structure, descriptor, or property to open the respective pop-up menu, where the necessary items can be selected with the make current function and highlighted afterwards. The method used to develop the QSAR/QSPR models is chosen from the calculate pull-down menu, together with its limitation characteristics. CODESSA PRO then completes the correlation automatically. To calculate only descriptors, select the descriptors function in the calculate pull-down menu (or use the F7 keyboard shortcut). Selecting the form matrix function in this menu creates the property-descriptor data matrix.

3.7.1 MLR
MLR (multilinear regression) is the simplest method; it builds a single correlation using all selected descriptors. To use the method, select MLR from the calculate pull-down menu. The MLR method has no options.

3.7.2 BMLR
The BMLR command in the calculate pull-down menu starts a search for the multiparameter regression with the maximum predictive ability, using the best multilinear regression methodology.

Figure 16. BMLR options

The control parameters of the method can be adjusted in the Options dialog box, which is opened with the BMLR… command from the Option pull-down menu (Fig. 16). The following criteria can be altered to obtain the maximum effectiveness of the search:
- the minimum improvement of the linear correlation coefficient R² after adding a new descriptor (default value 0.01);
- the upper limit of the square of the linear correlation coefficient R² for two descriptor scales to be considered orthogonal (default value 0.1);
- the lower limit of the square of the linear correlation coefficient R² for two descriptor scales to be considered non-collinear (default value 0.6);
- the lower limit of the square of the linear correlation coefficient R² for a 3-parameter correlation (default value 0.6);
- the increment of the lower limit of the square of the linear correlation coefficient R² for higher order (more than 3) correlations (default value 0.02).

3.7.3 HMPRO
The HMPRO descriptor selection procedure can be started by using the BMLR command from the calculate pull-down menu. The HMPRO parameter selection dialog can be opened by selecting the HMPRO… command from the Option menu.

Figure 17. HMPRO weight parameters

The parameters in Figure 17 correspond to the exponents in the general fitness function (correlation weight), defined as follows:

$$w = (R^2)^{x_1}\, F^{x_2}\, n^{x_3}\, (s^2)^{x_4}\, N^{x_5}$$

where R² is the squared correlation coefficient, F is the Fisher criterion, n is the number of data points, s² is the standard error (non-normalized), and N is the number of descriptors in the correlation. The x_i denote the weight parameters and have the default values {1, 1, 1, -1, -1}.

Figure 18. HMPRO preselection options for 1-parameter correlations

Figure 19. HMPRO preselection options for 1-parameter correlations

In the windows of Figures 18 and 19, the minimum allowed values of the statistical characteristics for the single-parameter correlations can be established.
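As an illustration of the correlation weight defined above, the following minimal Python sketch evaluates w from the five statistics and the exponents x1 to x5. The function name `correlation_weight` and the sample numbers are hypothetical and are not taken from CODESSA PRO; only the formula and the default exponents {1, 1, 1, -1, -1} come from the text.

```python
def correlation_weight(r2, f, n, s2, n_desc, x=(1, 1, 1, -1, -1)):
    """Correlation weight w = (R^2)^x1 * F^x2 * n^x3 * (s^2)^x4 * N^x5.

    r2     - squared correlation coefficient R^2
    f      - Fisher criterion F
    n      - number of data points
    s2     - standard error s^2
    n_desc - number of descriptors N in the correlation
    x      - exponents x1..x5 (defaults as listed in the manual)
    """
    x1, x2, x3, x4, x5 = x
    return (r2 ** x1) * (f ** x2) * (n ** x3) * (s2 ** x4) * (n_desc ** x5)


# Hypothetical statistics for a 4-descriptor model on 40 data points.
w_default = correlation_weight(r2=0.8, f=50.0, n=40, s2=0.25, n_desc=4)

# Setting an exponent to 0 removes that statistic from the weight.
w_no_size_penalty = correlation_weight(0.8, 50.0, 40, 0.25, 4, x=(1, 1, 1, -1, 0))
```

With the default exponents, a larger R², F, or n raises the weight, while a larger s² or a larger number of descriptors lowers it.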
Zero values mean that the respective criterion is not applied.

Figure 20. HMPRO forming parameters for 2-descriptor correlations

In the window of Figure 20, the criterion for ordering the 2-parameter correlations is established. This can be one of the following statistical characteristics:
R² - squared correlation coefficient
F - Fisher criterion
s² - squared standard error
R²cvOO - leave-one-out cross-validated correlation coefficient
w - correlation weight
In addition, the upper limit of the intercorrelation coefficient between the two descriptors and the criterion for including the two-parameter correlations in the correlations pool are set up.

Figure 21. HMPRO expanding parameters

In the window of Figure 21, the following parameters for the development of the successive multiparameter correlations are set up:
- the size of the correlations pool;
- the maximum number of descriptors involved in the models developed;
- the maximum number of iterations for finding the best correlations;
- the time limit for the development of the correlations.

Figure 22. HMPRO models’ testing parameters

In the window of Figure 22, the parameters for the cross-validation testing of the correlations are determined. These include the size of the training set for leave-many-out cross-validation (as a fraction of the full set) and the maximum numbers of cross-validation and randomization tests.

Figure 23. HMPRO output parameters

In the window of Figure 23, the following parameters are determined for the correlations to be included in the output:
- the maximum intercorrelation coefficient between the descriptors involved in the given correlation;
- the maximum allowed difference between the multiple correlation coefficient and the leave-one-out correlation coefficient for a given correlation (as a percentage of the value of the squared multiple correlation coefficient);
- the maximum allowed difference between the multiple correlation coefficient and the leave-many-out correlation coefficient for a given correlation (as a percentage of the value of the squared multiple correlation coefficient);
- the maximum allowed fraction of the randomization tests;
- the maximum number of correlations of the given size (number of descriptors involved).

3.7.4 PLS
The PLS (partial least squares) options dialog can be opened with the PLS… command from the Option pull-down menu (Fig. 24), and the respective method is started with the PLS command from the Calculate pull-down menu.

Figure 24. PLS options

PLS is performed on centered and scaled variables by the NIPALS algorithm, which is stopped when one of the following is true:
- the residual sum of squares is less than its preset minimum value (default is 0.01);
- the maximum number of components has been reached (default is 5 components).

3.7.5 PCR
The PCR (principal components regression) options dialog can be opened with the PCR… command from the Option pull-down menu (Fig. 25), and the respective calculation is started with the PCR command from the Calculate pull-down menu.

Figure 25. PCR options

PCR analysis is conducted in the same way as BMLR, except that principal components are used instead of descriptors. The “PCA analysis” part of the option selection panel allows selection of the descriptor scaling method (default is centered and normalized), the PCA analysis method (default is “real error”), and the method’s convergence criterion (default is 0.01).
Options in the “Best regression” part correspond to the BMLR parameters described in section 3.7.2.

3.8 Viewing correlations

After the QSAR/QSPR model development is completed and the best correlations are found, they can be viewed in the correlation view window (section 2.2.2.2).

3.9 Printing out results

The standard print function from the file menu bar can be used to print the correlation plot from the work area.

3.10 Manipulating lists

CODESSA PRO automatically produces one or more lists in each folder, but for the analysis of a property or correlation it is sometimes helpful to create user-defined lists, e.g. a list of structures containing phenyl rings. The simplest way to create a list in CODESSA PRO is to select several objects from one or more existing lists (i.e. all) while holding down the CTRL key, right-click to open the context pop-up menu, and left-click create list.

A list can also be formed from a group of objects that are linked to a common object. If a descriptor is chosen as the common object, then all the structures that are defined for that descriptor can be selected for a list. To create a list from a common object, use the Select Linked function in the context pop-up menu, which enables selection of the type of objects for the list. The AND operation allows a list to be created from multiple common objects.

Figure 26. List logic of CODESSA PRO

The List Logic module (Fig. 26) in CODESSA PRO combines the objects from two lists. It can be opened by clicking any object, right-clicking to show the context pop-up menu, and choosing lists logic, or by using the F4 keyboard shortcut. Choose the type of object that the new list will contain by clicking the appropriate button in the Sub-List type box (descriptors is selected in the example above). Choose the two lists to combine and the appropriate Operation.
The objects from the two lists that meet the criteria of the logical operation will be displayed in the Results box. The results can then be copied to a new list (Create) or be selected (Select) for further comparison.

Chapter 4
CODESSA PRO Descriptors

CODESSA PRO molecular descriptors are divided into several classes, depending on their origin of calculation or on the structural item in the chemical structure (molecule, atom or bond). The respective classification is given in Table 1. It is followed by the formulation of each descriptor or descriptor class.

Table 1. CODESSA PRO Molecular Descriptors

| No. | Group | Type | Notation | Full name |
|---|---|---|---|---|
| 1 | Constitutional | M | NA | Total number of atoms in the molecule |
| 2 | Constitutional | A | NX, NX,r | Absolute and relative numbers of atoms of certain chemical identity (C, H, O, N, F, etc.) in the molecule |
| 3 | Constitutional | M | NY, NY,r | Absolute and relative numbers of certain chemical groups and functionalities in the molecule |
| 4 | Constitutional | M | NB | Total number of bonds in the molecule |
| 5 | Constitutional | M | NS, ND, NT, NS,r, ND,r, NT,r | Absolute and relative numbers of single, double, triple, aromatic or other bonds in the molecule |
| 6 | Constitutional | M | NR, NR,r | Total number of rings, number of rings divided by the total number of atoms |
| 7 | Constitutional | M | NR6, NR6,r | Total and relative number of 6-atom aromatic rings |
| 8 | Constitutional | M | M, Mr | Molecular weight and average atomic weight |
| 9 | Topological | M | W | Wiener index |
| 10 | Topological | M | χ | Randić's molecular connectivity index |
| 11 | Topological | M | ᵐχ | Randić indices of different orders |
| 12 | Topological | M | J | Balaban's J index |
| 13 | Topological | M | ᵐχᵛ | Kier and Hall valence connectivity indices |
| 14 | Topological | M | ᵐκ | Kier shape indices |
| 15 | Topological | M | Φ | Kier flexibility index |
| 16 | Topological | M | ᵏIC | Mean information content index |
| 17 | Topological | M | ᵏSIC | Structural information content index |
| 18 | Topological | M | ᵏCIC | Complementary information content index |
| 19 | Topological | M | ᵏBIC | Bonding information content index |
| 20 | Topological | M | TnE | Topological electronic indices |
| 21 | Geometrical | M | SM | Molecular surface area |
| 22 | Geometrical | M | SSA | Solvent-accessible molecular surface area |
| 23 | Geometrical | M | VM | Molecular volume |
| 24 | Geometrical | M | VM,SE | Solvent-excluded molecular volume |
| 25 | Geometrical | M | Gp, Gb | Gravitational indexes |
| 26 | Geometrical | M | IX, IY, IZ | Principal moments of inertia of a molecule |
| 27 | Geometrical | M | SXY, SYZ, SXZ | Shadow areas of a molecule |
| 28 | Geometrical | M | SXY,r, SYZ,r, SXZ,r | Relative shadow areas of a molecule |
| 29 | Electrostatic | A | Qi | Gasteiger-Marsili empirical atomic partial charges |
| 30 | Electrostatic | A | Qi | Zefirov's empirical atomic partial charges |
| 31 | Electrostatic | A | Qi | Mulliken atomic partial charges |
| 32 | Electrostatic | M | Qmax, Qmin | Minimum (most negative) and maximum (most positive) atomic partial charges |
| 33 | Electrostatic | M | P, P', P" | Polarity parameters |
| 34 | Electrostatic | M | μ | Dipole moment |
| 35 | Electrostatic | M | α | Molecular polarizability |
| 36 | Electrostatic | M | β | Molecular hyperpolarizability |
| 37 | CPSA | M | PPSA1 | Partial positively charged surface area |
| 38 | CPSA | M | PPSA2 | Total charge weighted partial positively charged surface area |
| 39 | CPSA | M | PPSA3 | Atomic charge weighted partial positively charged surface area |
| 40 | CPSA | M | PNSA1 | Partial negatively charged surface area |
| 41 | CPSA | M | PNSA2 | Total charge weighted partial negatively charged surface area |
| 42 | CPSA | M | PNSA3 | Atomic charge weighted partial negatively charged surface area |
| 43 | CPSA | M | DPSA1 | Difference between partial positively and negatively charged surface areas |
| 44 | CPSA | M | DPSA2 | Difference between total charge weighted partial positive and negative surface areas |
| 45 | CPSA | M | DPSA3 | Difference between atomic charge weighted partial positive and negative surface areas |
| 46 | CPSA | M | FPSA1 | Fractional partial positive surface area |
| 47 | CPSA | M | FPSA2 | Fractional total charge weighted partial positive surface area |
| 48 | CPSA | M | FPSA3 | Fractional atomic charge weighted partial positive surface area |
| 49 | CPSA | M | FNSA1 | Fractional partial negative surface area |
| 51 | CPSA | M | FNSA2 | Fractional total charge weighted partial negative surface area |
| 52 | CPSA | M | FNSA3 | Fractional atomic charge weighted partial negative surface area |
| 53 | CPSA | M | WPSA1 | Surface weighted charged partial positive charged surface area |
| 54 | CPSA | M | WPSA2 | Surface weighted charged partial positive charged surface area |
| 55 | CPSA | M | WPSA3 | Surface weighted charged partial positive charged surface area |
| 56 | CPSA | M | WNSA1 | Surface weighted charged partial negative charged surface area |
| 57 | CPSA | M | WNSA2 | Surface weighted charged partial negative charged surface area |
| 58 | CPSA | M | WNSA3 | Surface weighted charged partial negative charged surface area |
| 59 | CPSA | M | RPCG | Relative positive charge |
| 60 | CPSA | M | RNCG | Relative negative charge |
| 61 | CPSA | M | HDSA1 | Hydrogen bonding donor ability of the molecule |
| 62 | CPSA | M | HDSA2 | Area-weighted surface charge of hydrogen bonding donor atoms |
| 63 | CPSA | M | HASA1 | Hydrogen bonding acceptor ability of the molecule |
| 64 | CPSA | M | HASA2 | Area-weighted surface charge of hydrogen bonding acceptor atoms |
| 65 | CPSA | M | HDCA1 | Hydrogen bonding donor ability of the molecule |
| 66 | CPSA | M | HDCA2 | Area-weighted surface charge of hydrogen bonding donor atoms |
| 67 | CPSA | M | HACA1 | Hydrogen bonding acceptor ability of the molecule |
| 68 | CPSA | M | HACA2 | Area-weighted surface charge of hydrogen bonding acceptor atoms |
| 69 | CPSA | M | FHDSA1 | Fractional hydrogen bonding donor ability of the molecule |
| 70 | CPSA | M | FHDSA2 | Fractional area-weighted surface charge of hydrogen bonding donor atoms |
| 71 | CPSA | M | FHASA1 | Fractional hydrogen bonding acceptor ability of the molecule |
| 72 | CPSA | M | FHASA2 | Fractional area-weighted surface charge of hydrogen bonding acceptor atoms |
| 73 | CPSA | M | FHDCA1 | Fractional hydrogen bonding donor ability of the molecule |
| 74 | CPSA | M | FHDCA2 | Fractional area-weighted surface charge of hydrogen bonding donor atoms |
| 75 | CPSA | M | FHACA1 | Fractional hydrogen bonding acceptor ability of the molecule |
| 76 | CPSA | M | FHACA2 | Fractional area-weighted surface charge of hydrogen bonding acceptor atoms |
| 77 | MO related | M | HOMO | Highest occupied molecular orbital (HOMO) energy |
| 78 | MO related | M | LUMO | Lowest unoccupied molecular orbital (LUMO) energy |
| 79 | MO related | M | EA | Fukui index atomic nucleophilic reactivity |
| 80 | MO related | M | NA | Fukui index atomic electrophilic reactivity |
| 81 | MO related | M | RA | Fukui index atomic one-electron reactivity |
| 82 | MO related | B | PAB | Mulliken bond orders |
| 83 | MO related | A | Vf,A | Free valence |
| 84 | Quantum chemical | M | Etot | Total energy of the molecule |
| 85 | Quantum chemical | M | Eel | Total electronic energy of the molecule |
| 86 | Quantum chemical | M | H0f | Standard heat of formation |
| 87 | Quantum chemical | AT | Eee,A | Electron-electron repulsion energy for a given atomic species |
| 88 | Quantum chemical | AT | Ene,A | Nuclear-electron attraction energy for a given atomic species |
| 89 | Quantum chemical | B | Eee,AB | Electron-electron repulsion between two given atoms |
| 90 | Quantum chemical | B | Ene,AB | Nuclear-electron attraction between two given atoms |
| 91 | Quantum chemical | B | Enn,AB | Nuclear repulsion energy between two given atoms |
| 92 | Quantum chemical | B | Eexc,AB | Electronic exchange energy between two given atoms |
| 93 | Quantum chemical | BT | ER,AB | Resonance energy between two given atomic species |
| 94 | Quantum chemical | BT | EC,AB | Total electrostatic interaction energy between two given atomic species |
| 95 | Quantum chemical | BT | Etot,AB | Total interaction energy between two given atomic species |
| 96 | Quantum chemical | M | Eee,tot | Total molecular one-center electron-electron repulsion energy |
| 97 | Quantum chemical | M | Ene,tot | Total molecular one-center electron-nuclear attraction energy |
| 98 | Quantum chemical | M | EC,tot | Total intramolecular electrostatic interaction energy |
| 100 | Quantum chemical | M | K | Electron kinetic energy density |
| 101 | Quantum chemical | M | ΔHprot | Energy of protonation |
| 102 | Thermodynamic | M | HV | Vibrational enthalpy of the molecule |
| 103 | Thermodynamic | M | HT | Translational enthalpy of the molecule |
| 104 | Thermodynamic | M | SV | Vibrational entropy of the molecule |
| 105 | Thermodynamic | M | SR | Rotational entropy of the molecule |
| 106 | Thermodynamic | M | ST | Translational entropy of the molecule |
| 107 | Thermodynamic | M | CV | Vibrational heat capacity of the molecule |
4.1 Constitutional Descriptors

The constitutional descriptors depend on the atomic constitution of the chemical structure (molecule). They also include the descriptors related to the types of bonds and the presence of rings in the molecule.
- total number of atoms in the molecule
- absolute and relative numbers of atoms of certain chemical identity (C, H, O, N, F, etc.) in the molecule
- absolute and relative numbers of certain chemical groups and functionalities in the molecule
- total number of bonds in the molecule
- absolute and relative numbers of single, double, triple, aromatic or other bonds in the molecule
- total number of rings, number of rings divided by the total number of atoms
- total and relative number of 6-atom aromatic rings
- molecular weight and average atomic weight

4.2 Topological Descriptors

4.2.1 Wiener index
Definition:

$$W = \frac{1}{2}\sum_{(i,j)}^{N_{SA}} d_{ij}$$

$d_{ij}$ - the number of bonds in the shortest path connecting the pair of atoms i and j
$N_{SA}$ - the number of non-hydrogen atoms in the molecule
Reference:
H. Wiener, J. Am. Chem. Soc., 1947, 69, 17.

4.2.2 Randić's molecular connectivity index
Definition:

$$\chi = \sum_{\text{edges } ij} \left(D_i D_j\right)^{-1/2}$$

$D_i$ and $D_j$ - the edge degrees (atom connectivities) of the molecular graph.

4.2.3 Randić's indices of different orders
Definition:

$${}^{m}\chi = \sum_{\text{path}} \left(D_i D_j \cdots D_k\right)^{-1/2}$$

References:
1. M. Randić, J. Am. Chem. Soc., 1975, 97, 6609.
2. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986.

4.2.4 Balaban's J index
Definition:

$$J = \frac{q}{\mu + 1}\sum_{\text{edges } ij} \left(S_i S_j\right)^{-1/2}$$

$q$ - number of edges in the molecular graph
$\mu = (q - n + 1)$ - the cyclomatic number of the molecular graph
$n$ - number of atoms in the molecular graph
$S_i$ - distance sums calculated as the sums over the rows or columns of the topological distance matrix of the molecule, D.
References:
1. A. T. Balaban, Chem. Phys. Lett., 1981, 89, 399.
2. A. T. Balaban, Pure and Appl. Chem., 1983, 55, 199.
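The three indices above can be illustrated on a small hydrogen-suppressed graph. The following Python sketch is a minimal illustration under stated assumptions, not the CODESSA PRO implementation: the molecule is given as an adjacency list of a connected graph, the function names are hypothetical, and the cyclomatic number is taken as q - n + 1 as in section 4.2.4.

```python
from collections import deque
from math import sqrt


def distance_matrix(adj):
    """Topological distances (bond counts) by breadth-first search.

    adj[i] lists the neighbours of atom i; the graph is assumed connected.
    """
    n = len(adj)
    dist = [[0] * n for _ in range(n)]
    for start in range(n):
        seen = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen[v] = seen[u] + 1
                    queue.append(v)
        for v, d in seen.items():
            dist[start][v] = d
    return dist


def edge_list(adj):
    """Each bond once, as a pair (i, j) with i < j."""
    return [(i, j) for i, nbrs in enumerate(adj) for j in nbrs if i < j]


def wiener_index(adj):
    """W = 1/2 * sum of d_ij over all ordered pairs of atoms."""
    return sum(sum(row) for row in distance_matrix(adj)) // 2


def randic_index(adj):
    """chi = sum over edges of (D_i * D_j)^(-1/2), D = vertex degree."""
    return sum(1.0 / sqrt(len(adj[i]) * len(adj[j])) for i, j in edge_list(adj))


def balaban_j(adj):
    """J = q / (mu + 1) * sum over edges of (S_i * S_j)^(-1/2)."""
    s = [sum(row) for row in distance_matrix(adj)]  # distance sums S_i
    bonds = edge_list(adj)
    q, n = len(bonds), len(adj)
    mu = q - n + 1  # cyclomatic number
    return q / (mu + 1) * sum(1.0 / sqrt(s[i] * s[j]) for i, j in bonds)


# Hypothetical example: carbon skeleton of n-butane, C0-C1-C2-C3.
butane = [[1], [0, 2], [1, 3], [2]]
print(wiener_index(butane), randic_index(butane), balaban_j(butane))
```

For the four-atom chain, the distance sums are 6, 4, 4 and 6, so the Wiener index is 10; the other two values follow from the same distance matrix and the vertex degrees.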
4.2.5 Kier and Hall valence connectivity indices
Definition:

$${}^{m}\chi^{v} = \sum_{i=1}^{N_s} \prod_{k=1}^{m+1} \left(\frac{1}{\delta_k^{v}}\right)^{1/2}$$

$$\delta_k^{v} = \frac{Z_k^{v} - H_k}{Z_k - Z_k^{v} - 1}$$ - valence connectivity for the k-th atom in the molecular graph
$Z_k$ - the total number of electrons in the k-th atom
$Z_k^{v}$ - the number of valence electrons in the k-th atom
$H_k$ - the number of hydrogen atoms directly attached to the k-th non-hydrogen atom
m = 0 - atomic valence connectivity indices
m = 1 - one bond path valence connectivity indices
m = 2 - two bond fragment valence connectivity indices
m = 3 - three contiguous bond fragment valence connectivity indices, etc.
References:
1. L. B. Kier, L. H. Hall, Eur. J. Med. Chem., 1977, 12, 307.
2. L. B. Kier, L. H. Hall, J. Pharm. Sci., 1981, 70, 583.
3. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J. Wiley & Sons, New York, 1986.

4.2.6 Kier shape indices
Definition:

$${}^{1}\kappa = \frac{(N_{SA} + \alpha)(N_{SA} + \alpha - 1)^2}{({}^{1}P + \alpha)^2}$$

$${}^{2}\kappa = \frac{(N_{SA} + \alpha - 1)(N_{SA} + \alpha - 2)^2}{({}^{2}P + \alpha)^2}$$

$${}^{3}\kappa = \frac{(N_{SA} + \alpha - 1)(N_{SA} + \alpha - 3)^2}{({}^{3}P + \alpha)^2} \quad \text{if } N_{SA} \text{ is odd}$$

$${}^{3}\kappa = \frac{(N_{SA} + \alpha - 3)(N_{SA} + \alpha - 2)^2}{({}^{3}P + \alpha)^2} \quad \text{if } N_{SA} \text{ is even}$$

$N_{SA}$ - the number of non-hydrogen atoms in the molecule
${}^{n}P$ - the number of paths of length n in the molecular graph

$$\alpha = \sum_i \left(\frac{r_i}{r_{C_i}} - 1\right)$$

$r_i$ - atomic radius of a given atom
$r_{C_i}$ - atomic radius of the carbon atom in the sp3 hybridization state
References:
1. L. B. Kier, Quant. Struct.-Act. Relat., 1985, 4, 109.
2. L. B. Kier, in: Computational Chemical Graph Theory, D. H. Rouvray (Ed.), Nova Science Publishers, New York 1990.

4.2.7 Kier flexibility index
Definition:

$$\Phi = \frac{{}^{1}\kappa \; {}^{2}\kappa}{N_{SA}}$$

${}^{1}\kappa$ and ${}^{2}\kappa$ - Kier shape indices
$N_{SA}$ - the number of non-hydrogen atoms in the molecule
Reference:
1. L. B. Kier, in: Computational Chemical Graph Theory, D. H. Rouvray (Ed.), Nova Science Publishers, New York 1990.

4.2.8 Mean information content index
Definition:

$${}^{k}IC = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

$n_i$ - number of atoms in the i-th class
$n$ - the total number of atoms in the molecule
$k$ - number of atomic layers in the coordination sphere around a given atom that are accounted for
Reference:
1. L. B. Kier, J. Pharm. Sci., 1980, 69, 807.

4.2.9 Structural information content index
Definition:

$${}^{k}SIC = \frac{{}^{k}IC}{\log_2 n}$$

$${}^{k}IC = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

$n_i$ - number of atoms in the i-th class
$n$ - the total number of atoms in the molecule
$k$ - number of atomic layers in the coordination sphere around a given atom that are accounted for
Reference:
1. S. C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429.

4.2.10 Complementary information content index
Definition:

$${}^{k}CIC = \log_2 n - {}^{k}IC$$

$${}^{k}IC = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

$n_i$ - number of atoms in the i-th class
$n$ - the total number of atoms in the molecule
$k$ - number of atomic layers in the coordination sphere around a given atom that are accounted for
Reference:
1. S. C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429.

4.2.11 Bonding information content index
Definition:

$${}^{k}BIC = \frac{{}^{k}IC}{\log_2 q}$$

$${}^{k}IC = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$$

$n_i$ - number of atoms in the i-th class
$n$ - the total number of atoms in the molecule
$k$ - number of atomic layers in the coordination sphere around a given atom that are accounted for
$q$ - number of edges in the molecular graph
Reference:
1. S. C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429.

4.2.12 Topological electronic indices
Definition:

$$T_1^{E} = \sum_{(i<j)}^{N_{SA}} \frac{|q_i - q_j|}{r_{ij}^2}$$

$$T_2^{E} = \sum_{(i<j)}^{N_b} \frac{|q_i - q_j|}{r_{ij}^2}$$

$q_i$ - partial charge on the i-th atom
$r_{ij}$ - distance between the i-th and j-th atoms
$N_{SA}$ - number of non-hydrogen atoms in the molecule
$N_b$ - number of bonds between non-hydrogen atoms in the molecule
Reference:
1. K. Osmialowski, J. Halkiewicz, R. Kaliszan, J. Chromatogr., 1986, 63, 361.

4.3 Geometrical Descriptors

4.3.1 Molecular surface area
Definition:

$$S_M = \sum_i S_{VW}^{(i)} - S_{ov}$$

$S_{VW}^{(i)}$ - van der Waals area of the i-th constituent atom of a molecule
$S_{ov}$ - van der Waals area of atoms inside overlapping atomic envelopes
Reference:
1. M. Karelson, Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New York, 2000.
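The four information content indices of sections 4.2.8 to 4.2.11 share the same IC term and differ only in how it is normalized. The Python sketch below is a hedged illustration: it assumes the partition of atoms into equivalence classes has already been made (the class sizes n_i are supplied as input, not derived from a structure), and the function names and example numbers are hypothetical.

```python
from math import log2


def information_content(class_sizes):
    """IC = -sum_i (n_i / n) * log2(n_i / n) over the atom classes."""
    n = sum(class_sizes)
    return -sum((ni / n) * log2(ni / n) for ni in class_sizes)


def information_indices(class_sizes, n_edges):
    """Return IC, SIC, CIC and BIC for a given class partition.

    class_sizes - numbers of atoms n_i in each class (assumed given)
    n_edges     - number of edges q in the molecular graph
    """
    n = sum(class_sizes)
    ic = information_content(class_sizes)
    return {
        "IC": ic,
        "SIC": ic / log2(n),       # structural information content
        "CIC": log2(n) - ic,       # complementary information content
        "BIC": ic / log2(n_edges)  # bonding information content
    }


# Hypothetical partition: 8 atoms in two classes (2 and 6 atoms), 7 bonds.
indices = information_indices([2, 6], n_edges=7)
print(indices)
```

Note that IC and CIC always sum to log2 n, so a partition with many small classes has a high IC and a low CIC.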
4.3.2 Solvent-accessible molecular surface area
Definition:

$$S_{SA} = A^{+} + A^{s} + A^{-}$$

$A^{+}$ - convex areas of a molecule
$A^{s}$ - saddle areas of a molecule
$A^{-}$ - concave areas of a molecule
Reference:
M. L. Connolly, J. Appl. Crystallogr., 1983, 16, 548-558.

4.3.3 Molecular volume
Definition:

$$V_M = \sum_i V_{VW}^{(i)} - V_{ov}$$

$V_{VW}^{(i)}$ - van der Waals volume of the i-th constituent atom of a molecule
$V_{ov}$ - volume of overlapping van der Waals atomic envelopes
Reference:
1. F. M. Richards, Annu. Rev. Biophys. Bioeng., 1977, 6, 151-176.

4.3.4 Solvent-excluded molecular volume
Definition:

$$V_{mol}(SE) = V_p + V_{+} + V_s + V_{-} + V_{ac} + V_{nc}$$

$V_p$ - volume of the internal polyhedron
$V_{+}$ - volume pieces between the center of an atom and the convex face of the solvent-accessible surface
$V_s$ - volume of saddle pieces
$V_{-}$ - volume of concave pieces
$V_{ac}$, $V_{nc}$ - cusp volume pieces
Reference:
1. M. L. Connolly, J. Am. Chem. Soc., 1985, 107, 1118-1124.

4.3.5 Gravitational indexes
Definition:

$$G_p = \sum_{i<j}^{N_a} \frac{m_i m_j}{r_{ij}^2}$$

$$G_b = \sum_{i<j}^{N_b} \frac{m_i m_j}{r_{ij}^2}$$

$m_i$, $m_j$ - atomic masses of atoms i and j
$r_{ij}$ - interatomic distance of atoms i and j
$N_a$ - number of atoms in the molecule
$N_b$ - number of chemical bonds in the molecule
Reference:
1. A. R. Katritzky, L. Mu, V. S. Lobanov, M. Karelson, J. Phys. Chem., 1996, 100, 10400-10407.

4.3.6 Principal moments of inertia of a molecule
Definition:

$$I_k = \sum_i m_i r_{ik}^2$$

$m_i$ - atomic weights of the constituent atoms of a molecule
$r_{ik}$ - distance of the i-th atomic nucleus from the k-th main rotational axis (k = X, Y or Z)
Reference:
1. Handbook of Chemistry and Physics, CRC Press, Cleveland OH, 1974, p. F-112.

4.3.7 Shadow areas of a molecule
Definition:

$$S_k = \frac{1}{2}\oint_{(C)} \left(\xi\, d\eta - \eta\, d\xi\right)$$

C - contour of the projection of the molecule on the plane defined by two principal axes of the molecule (k = XY, XZ or YZ)
$\xi$ - x or y
$\eta$ - y or z
Reference:
1. R. H. Rohrbaugh, P. C. Jurs, Anal. Chim. Acta, 1987, 199, 99.
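The gravitational indexes of section 4.3.5 are simple pairwise sums over atomic masses and interatomic distances. The Python sketch below is an illustration only: it assumes that G_p runs over all atom pairs and G_b over bonded pairs, the function name is hypothetical, and the masses and coordinates are made-up numbers rather than real molecular data. It needs Python 3.8 or later for `math.dist`.

```python
from itertools import combinations
from math import dist


def gravitational_index(masses, coords, pairs=None):
    """Sum of m_i * m_j / r_ij^2 over the given atom pairs.

    masses - atomic masses m_i
    coords - Cartesian coordinates of the atoms
    pairs  - iterable of (i, j) index pairs; all pairs i < j if omitted
    """
    if pairs is None:
        pairs = combinations(range(len(masses)), 2)
    return sum(
        masses[i] * masses[j] / dist(coords[i], coords[j]) ** 2
        for i, j in pairs
    )


# Hypothetical linear triatomic with made-up masses and coordinates.
masses = [1.0, 2.0, 3.0]
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
bonds = [(0, 1), (1, 2)]

g_p = gravitational_index(masses, coords)         # all atom pairs
g_b = gravitational_index(masses, coords, bonds)  # bonded pairs only
```

Here the bonded pairs contribute 2/1 + 6/4, and the non-bonded pair (0, 2) adds 3/9 to the all-pairs sum.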
4.3.8 Relative shadow areas of a molecule
Definition:

$$S_k^{r} = \frac{\iint_{(C)} d\xi\, d\eta}{S^{(k)}}$$

C - contour of the projection of the molecule on the plane defined by two principal axes of the molecule (k = XY, XZ or YZ plane)
$\xi$ - x or y
$\eta$ - y or z

$$S^{(k)} = X \cdot Y;\; X \cdot Z \text{ or } Y \cdot Z$$

Reference:
1. M. Karelson, Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New York, 2000.

4.4 Electrostatic Descriptors

4.4.1 Gasteiger-Marsili empirical atomic partial charges
Definition:

$$Q_i = \sum_{\alpha} q_i^{(\alpha)}$$

$$q_i^{(\alpha)} = \frac{1}{2^{\alpha}}\left[\sum_j \frac{\chi_{j\nu} - \chi_{i\nu}}{\chi_{i\nu}^{+}} + \sum_k \frac{\chi_{k\nu} - \chi_{i\nu}}{\chi_{k\nu}^{+}}\right]$$ - the contribution to the atomic charge on the α-th step of iteration of charge

$$\chi_{i\nu} = a_{i\nu} + b_{i\nu} Q_i + c_{i\nu} Q_i^2$$ - electronegativity of the ν-th orbital on the i-th atom

$$a_{i\nu} = \frac{I_{i\nu}^{0} + E_{i\nu}^{0}}{2}$$

$$b_{i\nu} = \frac{I_{i\nu}^{+} + E_{i\nu}^{+} - E_{i\nu}^{0}}{4}$$

$$c_{i\nu} = \frac{I_{i\nu}^{+} - 2 I_{i\nu}^{0} + E_{i\nu}^{+} - E_{i\nu}^{0}}{4}$$

$I_{i\nu}^{0}$, $I_{i\nu}^{+}$, $E_{i\nu}^{0}$, and $E_{i\nu}^{+}$ - the ionization potentials and electron affinities of the neutral atom (superscript 0) and of the positive ion (superscript +), respectively.
References:
1. J. Gasteiger, M. Marsili, Tetrahedron Lett., 1978, 3181.
2. J. Gasteiger, M. Marsili, Tetrahedron, 1980, 36, 3219-3228.

4.4.2 Zefirov's empirical atomic partial charges
Definition:

$$Q_i = f(\chi_i)$$

$\chi_i$ - atomic electronegativities

$$\chi_i = \left(\chi_i^{0} \prod_{k=1}^{n} \chi_k\right)^{\frac{1}{n+1}}$$

$\chi_i^{0}$ - electronegativities of isolated atoms
$n$ - atoms in the first coordination sphere of a given atom
References:
1. N. S. Zefirov, M. A. Kirpichenok, F. F. Izmailov, M. I. Trofimov, Dokl. Akad. Nauk SSSR, 1987, 296, 883.
2. M. A. Kirpichenok, N. S. Zefirov, Zh. Org. Khim., 1987, 23, 4.

4.4.3 Mulliken atomic partial charges
Definition:

$$Q_A = Z_A - \sum_{k \in A}\left(P_{kk} + \frac{1}{2}\sum_{l \neq k} P_{kl} + \frac{1}{2}\sum_{l \neq k} P_{lk}\right) = Z_A - \sum_{k \in A}\sum_{l} P_{kl}$$

$Z_A$ - atomic nuclear charge
$P_{kl}$ - atomic population matrix elements
References:
1. R. S. Mulliken, J. Chem. Phys., 1955, 23, 1833-1840.
2. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.

4.4.4 Minimum (most negative) and maximum (most positive) atomic partial charges
Definition:

$$Q_{min} = \min(Q^{-}), \qquad Q_{max} = \max(Q^{+})$$

$Q^{-}$ - negative atomic partial charges
$Q^{+}$ - positive atomic partial charges
Reference:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.

4.4.5 Polarity parameters
Definitions:

$$P = Q_{max} - Q_{min}$$

$$P' = \frac{Q_{max} - Q_{min}}{R_{mm}}$$

$$P'' = \frac{Q_{max} - Q_{min}}{R_{mm}^2}$$

$Q_{max}$ - the most positive atomic partial charge in the molecule
$Q_{min}$ - the most negative atomic partial charge in the molecule
$R_{mm}$ - distance between the most positive and the most negative atomic partial charges in the molecule
Reference:
1. K. Osmialowski, J. Halkiewicz, A. Radecki, R. Kaliszan, J. Chromatogr., 1985, 346, 53.

4.4.6 Dipole moment
Definition:

$$\mu = -\sum_{i=1}^{occ} \int_{(V)} \phi_i\, \hat{r}\, \phi_i\, dv + \sum_{a=1}^{M} Z_a R_a$$

$\phi_i$ - molecular orbitals
$\hat{r}$ - electron position operator
$Z_a$ - a-th atomic nuclear charge
$R_a$ - position vector of the a-th atomic nucleus
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.

4.4.7 Molecular polarizability, α
Definition:

$$\mu' = \mu + \alpha E + \frac{1}{2}\beta E^2 + \ldots$$

$\mu$ - permanent dipole moment of the molecule
$\mu'$ - induced dipole moment of the molecule
$E$ - external electric field
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.

4.4.8 Molecular hyperpolarizability, β
Definition:

$$\mu' = \mu + \alpha E + \frac{1}{2}\beta E^2 + \ldots$$

$\mu$ - permanent dipole moment of the molecule
$\mu'$ - induced dipole moment of the molecule
$E$ - external electric field
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.

4.5 CPSA Descriptors

4.5.1 Partial positively charged surface area
Definition:

$$PPSA1 = \sum_{A} S_A, \qquad A \in \{\delta_A > 0\}$$

$S_A$ - positively charged solvent-accessible atomic surface area
Reference:
1. D. T. Stanton, P. C. Jurs, Anal. Chem., 1990, 62, 2323; D. T. Stanton, L. M. Egolf, P. C. Jurs, M. G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.2 Total charge weighted partial positively charged surface area
Definition:

$$PPSA2 = \sum_{A} q_A \cdot \sum_{A} S_A, \qquad A \in \{\delta_A > 0\}$$

$S_A$ - positively charged solvent-accessible atomic surface area
$q_A$ - atomic partial charge
Reference:
1. D. T. Stanton, P. C. Jurs, Anal. Chem., 1990, 62, 2323; D. T. Stanton, L. M. Egolf, P. C. Jurs, M. G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
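For the polarity parameters of section 4.4.5, the following Python sketch shows how P, P' and P'' follow from a set of atomic partial charges and coordinates. It is a minimal illustration with a hypothetical function name and made-up charges and coordinates; it assumes R_mm is the Euclidean distance between the atoms carrying Qmax and Qmin, and it needs Python 3.8 or later for `math.dist`.

```python
from math import dist


def polarity_parameters(charges, coords):
    """Return (P, P', P'') from atomic partial charges and coordinates.

    P   = Qmax - Qmin
    P'  = (Qmax - Qmin) / Rmm
    P'' = (Qmax - Qmin) / Rmm^2
    """
    i_max = max(range(len(charges)), key=lambda i: charges[i])
    i_min = min(range(len(charges)), key=lambda i: charges[i])
    p = charges[i_max] - charges[i_min]
    r_mm = dist(coords[i_max], coords[i_min])
    return p, p / r_mm, p / r_mm ** 2


# Made-up charges and coordinates for a three-atom example.
charges = [0.4, -0.6, 0.2]
coords = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
p, p_prime, p_double_prime = polarity_parameters(charges, coords)
```

In this example the extreme charges sit on atoms 0 and 1, which are 2.0 distance units apart, so P' is half of P and P'' a quarter of it.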
4.5.3 Atomic charge weighted partial positively charged surface area
Definition:
$$PPSA3 = \sum_{A} q_A S_A, \quad A \in \{\delta_A > 0\}$$
$S_A$ - positively charged solvent-accessible atomic surface area
$q_A$ - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.4 Partial negatively charged surface area
Definition:
$$PNSA1 = \sum_{A} S_A, \quad A \in \{\delta_A < 0\}$$
$S_A$ - negatively charged solvent-accessible atomic surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.5 Total charge weighted partial negatively charged surface area
Definition:
$$PNSA2 = \sum_{A} q_A \cdot \sum_{A} S_A, \quad A \in \{\delta_A < 0\}$$
$S_A$ - negatively charged solvent-accessible atomic surface area
$q_A$ - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.6 Atomic charge weighted partial negatively charged surface area
Definition:
$$PNSA3 = \sum_{A} q_A S_A, \quad A \in \{\delta_A < 0\}$$
$S_A$ - negatively charged solvent-accessible atomic surface area
$q_A$ - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.7 Difference between partial positively and negatively charged surface areas
Definition:
$$DPSA1 = PPSA1 - PNSA1$$
PPSA1 - positively charged solvent-accessible molecular surface area
PNSA1 - negatively charged solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.8 Difference between total charge weighted partial positive and negative surface areas
Definition:
$$DPSA2 = PPSA2 - PNSA2$$
PPSA2 - total charge weighted partial positively charged molecular surface area
PNSA2 - total charge weighted partial negatively charged molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.9 Difference between atomic charge weighted partial positive and negative surface areas
Definition:
$$DPSA3 = PPSA3 - PNSA3$$
PPSA3 - atomic charge weighted partial positively charged molecular surface area
PNSA3 - atomic charge weighted partial negatively charged molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.10 Fractional partial positive surface area
Definition:
$$FPSA1 = \frac{PPSA1}{TMSA}$$
PPSA1 - partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.11 Fractional total charge weighted partial positive surface area
Definition:
$$FPSA2 = \frac{PPSA2}{TMSA}$$
PPSA2 - total charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.12 Fractional atomic charge weighted partial positive surface area
Definition:
$$FPSA3 = \frac{PPSA3}{TMSA}$$
PPSA3 - atomic charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
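The difference and fractional descriptors are simple combinations of the PPSA and PNSA families. The sketch below shows one way to compute them together. It assumes that TMSA equals the sum of the supplied atomic surface areas, which is an assumption of this example rather than a statement about the program; `cpsa_triplet` and `dpsa_fpsa` are hypothetical helper names.

```python
def cpsa_triplet(charges, areas, positive=True):
    """Return (PxSA1, PxSA2, PxSA3) for the positive or the negative atoms."""
    selected = [(q, s) for q, s in zip(charges, areas)
                if (q > 0 if positive else q < 0)]
    area = sum(s for _, s in selected)        # type 1: plain area sum
    charge = sum(q for q, _ in selected)      # total charge of the subset
    weighted = sum(q * s for q, s in selected)  # type 3: atom-by-atom weighting
    return (area, charge * area, weighted)


def dpsa_fpsa(charges, areas):
    """Return (DPSA1..3) and (FPSA1..3) as two tuples."""
    ppsa = cpsa_triplet(charges, areas, positive=True)
    pnsa = cpsa_triplet(charges, areas, positive=False)
    tmsa = sum(areas)  # assumption: TMSA is the sum of the atomic areas
    dpsa = tuple(p - n for p, n in zip(ppsa, pnsa))
    fpsa = tuple(p / tmsa for p in ppsa)
    return dpsa, fpsa


# Hypothetical four-atom example.
dpsa, fpsa = dpsa_fpsa([0.5, -0.5, 0.25, -0.25], [10.0, 20.0, 30.0, 20.0])
print(dpsa, fpsa)
```

Because PNSA2 and PNSA3 are negative by construction, DPSA2 and DPSA3 come out larger than the corresponding PPSA values.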
4.5.13 Fractional partial negative surface area
Definition:
$$FNSA1 = \frac{PNSA1}{TMSA}$$
PNSA1 - partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.14 Fractional total charge weighted partial negative surface area
Definition:
$$FNSA2 = \frac{PNSA2}{TMSA}$$
PNSA2 - total charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.15 Fractional atomic charge weighted partial negative surface area
Definition:
$$FNSA3 = \frac{PNSA3}{TMSA}$$
PNSA3 - atomic charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.16 Surface weighted charged partial positive charged surface area WPSA1
Definition:
$$WPSA1 = \frac{PPSA1 \cdot TMSA}{1000}$$
PPSA1 - partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.17 Surface weighted charged partial positive charged surface area WPSA2
Definition:
$$WPSA2 = \frac{PPSA2 \cdot TMSA}{1000}$$
PPSA2 - total charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.18 Surface weighted charged partial positive charged surface area WPSA3
Definition:
$$WPSA3 = \frac{PPSA3 \cdot TMSA}{1000}$$
PPSA3 - atomic charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.19 Surface weighted charged partial negative charged surface area WNSA1
Definition:
$$WNSA1 = \frac{PNSA1 \cdot TMSA}{1000}$$
PNSA1 - partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.20 Surface weighted charged partial negative charged surface area WNSA2
Definition:
$$WNSA2 = \frac{PNSA2 \cdot TMSA}{1000}$$
PNSA2 - total charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.21 Surface weighted charged partial negative charged surface area WNSA3
Definition:
$$WNSA3 = \frac{PNSA3 \cdot TMSA}{1000}$$
PNSA3 - atomic charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.22 Relative positive charge
Definition:
$$RPCG = \frac{\delta_{max}}{\sum_{A} \delta_A}, \quad A \in \{\delta_A > 0\}$$
$\delta_{max}$ - maximum atomic positive charge in the molecule
$\delta_A$ - positive atomic charge in the molecule
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
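As a brief illustration of the surface-weighted and relative-charge definitions, the sketch below implements the WPSA1-type scaling (partial surface area times TMSA, divided by 1000) and the relative positive charge. The function names `wpsa` and `rpcg` are hypothetical, and the inputs are assumed to be precomputed values.

```python
def wpsa(ppsa_n, tmsa):
    """Surface-weighted descriptor: PPSAn * TMSA / 1000."""
    return ppsa_n * tmsa / 1000.0


def rpcg(charges):
    """Relative positive charge: largest positive charge over the sum of positive charges."""
    positive = [q for q in charges if q > 0]
    if not positive:
        raise ValueError("no positively charged atoms in the molecule")
    return max(positive) / sum(positive)


# Hypothetical values: PPSA1 = 40, TMSA = 250, and four atomic charges.
print(wpsa(40.0, 250.0), rpcg([0.5, 0.25, 0.25, -1.0]))
```

The explicit error for a molecule without positively charged atoms is a design choice of this sketch; the manual does not state how that case is handled.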
4.5.23 Relative negative charge
Definition:
$$RNCG = \frac{\delta_{max}}{\sum_{A} \delta_A}, \quad A \in \{\delta_A < 0\}$$
$\delta_{max}$ - maximum atomic negative charge in the molecule
$\delta_A$ - negative atomic charge in the molecule
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.24 Hydrogen bonding donor ability of the molecule HDSA1
Definition:
$$HDSA1 = \sum_{D} s_D, \quad D \in H_{H\text{-}donor}$$
$s_D$ - solvent-accessible surface area of H-bonding donor H atoms
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.25 Area-weighted surface charge of hydrogen bonding donor atoms HDSA2
Definition:
$$HDSA2 = \sum_{D} \frac{q_D \sqrt{s_D}}{\sqrt{S_{tot}}}, \quad D \in H_{H\text{-}donor}$$
$s_D$ - solvent-accessible surface area of H-bonding donor H atoms
$q_D$ - partial charge on H-bonding donor H atoms
$S_{tot}$ - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.26 Hydrogen bonding acceptor ability of the molecule HASA1
Definition:
$$HASA1 = \sum_{A} s_A, \quad A \in X_{H\text{-}acceptor}$$
$s_A$ - solvent-accessible surface area of H-bonding acceptor atoms
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.27 Area-weighted surface charge of hydrogen bonding acceptor atoms HASA2
Definition:
$$HASA2 = \sum_{A} \frac{q_A \sqrt{s_A}}{\sqrt{S_{tot}}}, \quad A \in X_{H\text{-}acceptor}$$
$s_A$ - solvent-accessible surface area of H-bonding acceptor atoms
$q_A$ - partial charge on H-bonding acceptor atoms
$S_{tot}$ - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.28 Hydrogen bonding donor ability of the molecule HDCA1
Definition:
$$HDCA1 = \sum_{D} s_D, \quad D \in H_{H\text{-}donor}$$
$s_D$ - solvent-accessible surface area of H-bonding donor H atoms, selected by threshold charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.29 Area-weighted surface charge of hydrogen bonding donor atoms HDCA2
Definition:
$$HDCA2 = \sum_{D} \frac{q_D \sqrt{s_D}}{\sqrt{S_{tot}}}, \quad D \in H_{H\text{-}donor}$$
$s_D$ - solvent-accessible surface area of H-bonding donor H atoms, selected by threshold charge
$q_D$ - partial charge on H-bonding donor H atoms, selected by threshold charge
$S_{tot}$ - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.30 Hydrogen bonding acceptor ability of the molecule HACA1
Definition:
$$HACA1 = \sum_{A} s_A, \quad A \in X_{H\text{-}acceptor}$$
$s_A$ - solvent-accessible surface area of H-bonding acceptor atoms, selected by threshold charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.31 Area-weighted surface charge of hydrogen bonding acceptor atoms HACA2
Definition:
$$HACA2 = \sum_{A} \frac{q_A \sqrt{s_A}}{\sqrt{S_{tot}}}, \quad A \in X_{H\text{-}acceptor}$$
$s_A$ - solvent-accessible surface area of H-bonding acceptor atoms, selected by threshold charge
$q_A$ - partial charge on H-bonding acceptor atoms, selected by threshold charge
$S_{tot}$ - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.32 Fractional hydrogen bonding donor ability of the molecule FHDSA1
Definition:
$$FHDSA1 = \frac{HDSA1}{TMSA}$$
HDSA1 - hydrogen bonding donor ability
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M.
Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.33 Fractional area-weighted surface charge of hydrogen bonding donor atoms FHDSA2
Definition:
$$FHDSA2 = \frac{HDSA2}{TMSA}$$
HDSA2 - area-weighted surface charge on hydrogen bonding donor atoms
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.34 Fractional hydrogen bonding acceptor ability of the molecule FHASA1
Definition:
$$FHASA1 = \frac{HASA1}{TMSA}$$
HASA1 - hydrogen bonding acceptor ability
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.35 Fractional area-weighted surface charge of hydrogen bonding acceptor atoms FHASA2
Definition:
$$FHASA2 = \frac{HASA2}{TMSA}$$
HASA2 - area-weighted surface charge on hydrogen bonding acceptor atoms
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.36 Fractional hydrogen bonding donor ability of the molecule FHDCA1
Definition:
$$FHDCA1 = \frac{HDCA1}{TMSA}$$
HDCA1 - hydrogen bonding donor ability of atoms, selected by threshold charge
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.37 Fractional area-weighted surface charge of hydrogen bonding donor atoms FHDCA2
Definition:
$$FHDCA2 = \frac{HDCA2}{TMSA}$$
HDCA2 - area-weighted surface charge on hydrogen bonding donor atoms, selected by threshold charge
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.38 Fractional hydrogen bonding acceptor ability of the molecule FHACA1
Definition:
$$FHACA1 = \frac{HACA1}{TMSA}$$
HACA1 - hydrogen bonding acceptor ability of atoms, selected by threshold charge
TMSA - total molecular surface area
References:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323.
2. D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.5.39 Fractional area-weighted surface charge of hydrogen bonding acceptor atoms FHACA2
Definition:
$$FHACA2 = \frac{HACA2}{TMSA}$$
HACA2 - area-weighted surface charge on hydrogen bonding acceptor atoms, selected by threshold charge
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.

4.6 MO Related Descriptors

4.6.1 Highest occupied molecular orbital (HOMO) energy
Definition:
$$\varepsilon_{HOMO} = \langle \varphi_{HOMO} | \hat{F} | \varphi_{HOMO} \rangle$$
$\varphi_{HOMO}$ - highest occupied molecular orbital
$\hat{F}$ - Fock operator
References:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.
2. B.W. Clare, Theoret. Chim. Acta, 1994, 87, 415-430.

4.6.2 Lowest unoccupied molecular orbital (LUMO) energy
Definition:
$$\varepsilon_{LUMO} = \langle \varphi_{LUMO} | \hat{F} | \varphi_{LUMO} \rangle$$
$\varphi_{LUMO}$ - lowest unoccupied molecular orbital
$\hat{F}$ - Fock operator
References:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.
2. B.W. Clare, Theoret. Chim. Acta, 1994, 87, 415-430.

4.6.3 Fukui atomic nucleophilic reactivity index
Definition:
$$N_A = \sum_{i \in A} c_{iHOMO}^{2} / (1 - \varepsilon_{HOMO})$$
or simplified
$$N'_A = \sum_{i \in A} c_{iHOMO}^{2}$$
$\varepsilon_{HOMO}$ - highest occupied molecular orbital energy
$c_{iHOMO}$ - highest occupied molecular orbital MO coefficients
Reference:
1. R. Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.

4.6.4 Fukui atomic electrophilic reactivity index
Definition:
$$E_A = \sum_{j \in A} c_{jLUMO}^{2} / (\varepsilon_{LUMO} + 10)$$
or simplified
$$E'_A = \sum_{j \in A} c_{jLUMO}^{2}$$
$\varepsilon_{LUMO}$ - lowest unoccupied molecular orbital energy
$c_{jLUMO}$ - lowest unoccupied molecular orbital MO coefficients
Reference:
1. R.
Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.

4.6.5 Fukui atomic one-electron reactivity index
Definition:
$$R_A = \sum_{i \in A} \sum_{j \in A} c_{iHOMO} \, c_{jLUMO} / (\varepsilon_{LUMO} - \varepsilon_{HOMO})$$
or simplified
$$R'_A = \sum_{i \in A} \sum_{j \in A} c_{iHOMO} \, c_{jLUMO}$$
$c_{iHOMO}$ - highest occupied molecular orbital MO coefficients
$c_{jLUMO}$ - lowest unoccupied molecular orbital MO coefficients
$\varepsilon_{LUMO}$ - lowest unoccupied molecular orbital energy
$\varepsilon_{HOMO}$ - highest occupied molecular orbital energy
Reference:
1. R. Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.

4.6.6 Mulliken bond orders
Definition:
$$P_{AB} = \sum_{i=1}^{occ} \sum_{\mu \in A} \sum_{\nu \in B} n_i \, c_{i\mu} \, c_{i\nu}$$
$n_i$ - the occupation number of the i-th MO
$c_{i\mu}$, $c_{i\nu}$ - MO coefficients for atomic orbitals $\mu$ and $\nu$
References:
1. I.G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.
2. A.B. Sannigrahi, Adv. Quant. Chem., 1992, 23, 301-351.

4.6.7 Free valence
Definition:
$$V_f^A = V_{max} - P_A$$
$V_{max}$ - maximum valence of atom A
$P_A$ - total electronic population on atom A
References:
1. I.G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.
2. A.B. Sannigrahi, Adv. Quant. Chem., 1992, 23, 301-351.

4.7 Quantum Chemical Descriptors

4.7.1 Total energy of the molecule
Definition:
$$E_{tot} = E_{el} + \sum_{A < B} Z_A Z_B / R_{AB}$$
$E_{el}$ - total electronic energy of the molecule
$Z_A$, $Z_B$ - nuclear charges of atoms A and B
$R_{AB}$ - distance between nuclei A and B
References:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.
2. N. Bodor, Z. Gabanyi, C.-K. Wong, J. Am. Chem. Soc., 1989, 111, 3783.

4.7.2 Total electronic energy of the molecule
Definition:
$$E_{el} = 2\,\mathrm{Tr}(RF) - \mathrm{Tr}(RG)$$
R - first order density matrix
F - matrix representation of the Hartree-Fock operator
G - matrix representation of the electron repulsion energy
Reference:
1. I. G.
Csizmadia, Theory and Practice of MO Calculations on Organic Molecules, Elsevier, Amsterdam, 1976.

4.7.3 Standard heat of formation
Definition:
$$\Delta H_f^{0} = \Delta H_f - \sum_{a} \Delta H_f^{a}$$
$\Delta H_f$ - quantum-chemically calculated total energy of the molecule
$\Delta H_f^{a}$ - quantum-chemically calculated energies of isolated atoms, a
Reference:
1. P. W. Atkins, Physical Chemistry, 3rd Edition, Oxford University Press, Oxford, 1988.

4.7.4 Electron-electron repulsion energy for a given atomic species
Definition:
$$E_{ee}(A) = \sum_{B \neq A} \sum_{\mu,\nu \in A} \sum_{\lambda,\sigma \in B} P_{\mu\nu} P_{\lambda\sigma} \langle \mu\nu | \lambda\sigma \rangle$$
A - given atomic species
B - other atoms
$P_{\mu\nu}$, $P_{\lambda\sigma}$ - density matrix elements over atomic basis $\{\mu\nu\lambda\sigma\}$
$\langle \mu\nu | \lambda\sigma \rangle$ - electron repulsion integrals on atomic basis $\{\mu\nu\lambda\sigma\}$
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.5 Nuclear-electron attraction energy for a given atomic species
Definition:
$$E_{ne}(A) = \sum_{B \neq A} \sum_{\mu,\nu \in A} P_{\mu\nu} \left\langle \mu \left| \frac{Z_B}{R_{iB}} \right| \nu \right\rangle$$
A - given atomic species
B - other atoms
$P_{\mu\nu}$ - density matrix elements over atomic basis $\{\mu\nu\}$
$Z_B$ - charge of atomic nucleus, B
$R_{iB}$ - distance between the electron and atomic nucleus, B
$\langle \mu | Z_B / R_{iB} | \nu \rangle$ - electron-nuclear attraction integrals on atomic basis $\{\mu\nu\}$
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.6 Electron-electron repulsion between two given atoms
Definition:
$$E_{ee}(AB) = \sum_{\mu,\nu \in A} \sum_{\lambda,\sigma \in B} P_{\mu\nu} P_{\lambda\sigma} \langle \mu\nu | \lambda\sigma \rangle$$
A - given atomic species
B - another atomic species
$P_{\mu\nu}$, $P_{\lambda\sigma}$ - density matrix elements over atomic basis $\{\mu\nu\lambda\sigma\}$
$\langle \mu\nu | \lambda\sigma \rangle$ - electron repulsion integrals on atomic basis $\{\mu\nu\lambda\sigma\}$
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.7 Nuclear-electron attraction energy between two given atoms
Definition:
$$E_{ne}(AB) = \sum_{\mu,\nu \in A} P_{\mu\nu} \left\langle \mu \left| \frac{Z_B}{R_{iB}} \right| \nu \right\rangle$$
A - given atomic species
B - another atomic species
$P_{\mu\nu}$ - density matrix elements over atomic basis $\{\mu\nu\}$
$Z_B$ - charge of atomic nucleus, B
$R_{iB}$ - distance between the electron and atomic nucleus, B
$\langle \mu | Z_B / R_{iB} | \nu \rangle$ - electron-nuclear attraction integrals on atomic basis $\{\mu\nu\}$
Reference:
1. E.
Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.8 Nuclear repulsion energy between two given atoms
Definition:
$$E_{nn}(AB) = \frac{Z_A Z_B}{R_{AB}}$$
A - given atomic species
B - another atomic species
$Z_A$ - charge of atomic nucleus, A
$Z_B$ - charge of atomic nucleus, B
$R_{AB}$ - distance between the atomic nuclei, A and B
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.9 Electronic exchange energy between two given atoms
Definition:
$$E_{exc}(AB) = \sum_{\mu,\nu \in A} \sum_{\lambda,\sigma \in B} P_{\mu\lambda} P_{\nu\sigma} \langle \mu\lambda | \nu\sigma \rangle$$
A - given atomic species
B - another atomic species
$P_{\mu\lambda}$, $P_{\nu\sigma}$ - density matrix elements over atomic basis $\{\mu\nu\lambda\sigma\}$
$\langle \mu\lambda | \nu\sigma \rangle$ - electron repulsion integrals on atomic basis $\{\mu\nu\lambda\sigma\}$
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.10 Resonance energy between given two atomic species
Definition:
$$E_R(AB) = \sum_{\mu \in A} \sum_{\nu \in B} P_{\mu\nu} \beta_{\mu\nu}$$
A - given atomic species
B - another atomic species
$P_{\mu\nu}$ - density matrix elements over atomic basis $\{\mu\nu\}$
$\beta_{\mu\nu}$ - resonance integrals on atomic basis $\{\mu\nu\}$
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.11 Total electrostatic interaction energy between two given atomic species
Definition:
$$E_C(AB) = E_{ee}(AB) + E_{ne}(AB) + E_{nn}(AB)$$
A - given atomic species
B - another atomic species
$E_{ee}(AB)$ - electronic repulsion energy between two atomic species
$E_{ne}(AB)$ - electron-nuclear attraction energy between two atomic species
$E_{nn}(AB)$ - nuclear repulsion energy between two atomic species
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980.

4.7.12 Total interaction energy between two given atomic species
Definition:
$$E_{tot}(AB) = E_C(AB) + E_{exc}(AB)$$
A - given atomic species
B - another atomic species
$E_C(AB)$ - electrostatic interaction energy between two atomic species
$E_{exc}(AB)$ - electronic exchange energy between two atomic species
Reference:
1. E.
Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980. 75 4.7.13 Total molecular one-center electron-electron repulsion energy Definition: Eee (tot) Eee ( A) A A - given atomic species Eee(A) - electron-electron repulsion energy for atom A Reference: 1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980. 4.7.14 Total molecular one-center electron-nuclear attraction energy Definition: A – given atomic species Ene(A) - electron-nuclear attraction energy for atom A Reference: 1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980. 4.7.15 Total intramolecular electrostatic interaction energy Definition: EC ( tot ) 1 EC ( A ) 2 A A - given atomic species EC(A) - electrostatic energy for atom A Reference: 1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag, New York, 1980. 76 4.8 Thermodynamic Descriptors 4.8.1 Vibrational enthalpy of the molecule Definition: H vib h j exp h j 2kT 1 a h j 2 j 1 1 exp h j 2kT j - frequencies of normal vibrations in the molecule h - Planck's constant k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press, Oxford, 1981. 4.8.2 Translational enthalpy of the molecule Definition: H tr 2 2 p p e 2mkT dp 2m p - momentum of the molecule m - mass of the molecule k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press, Oxford, 1981. 
77 4.8.3 Vibrational entropy of the molecule Definition: h j exp h j 2kT S vib ln 1 exp h j 2kT j 1 kT 1 exp h j 2kT j - frequencies of normal vibrations in the molecule h - Planck's constant k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press, Oxford, 1981. 4.8.4 Rotational entropy of the molecule Definition: S rot 1/ 2 Nk ln 8 2 I j kT 2 h j 1 3 1/ 2 Ij - principal moments of inertia of the molecule - symmetry number of the molecule h - Planck's constant k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 4.8.5 Translational entropy of the molecule 78 Definition: 2mkT Str ln 2 h 1/ 2 Ve5 / 2 NA V - volume of the system NA - Avogadro's number m - mass of the molecule h - Planck's constant k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press, Oxford, 1981. 4.8.6 Vibrational heat capacity of the molecule Definition: exp h j / 2kT h j k j 1 kT 1 exp h j / 2kT ) cv ,vib 2 j - frequencies of normal vibrations in the molecule h - Planck's constant k - Boltzmann's constant T - absolute temperature (K) References: 1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New York, 1973. 2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press, Oxford, 1981. 79 Chapter 5 Methods of QSAR/QSPR Model Development The QSAR/QSPR models developed by CODESSA PRO software are essentially based on the multilinear regression method. 
5.1 Multilinear Regression

The multilinear regression method can be described compactly and conveniently using matrix notation [1, 2, 3]. Suppose that there are n property values in Y and n associated calculated values for each of the k molecular descriptors in the columns of X. Then $Y_i$, $X_{ik}$, and $e_i$ represent the ith value of the Y variable (property), the ith value of each of the X descriptors, and the ith unknown residual value, respectively. Collecting these terms into matrices we have:
$$Y = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & X_{n1} & \cdots & X_{nk} \end{pmatrix}, \quad b = \begin{pmatrix} b_1 \\ \vdots \\ b_{k+1} \end{pmatrix}, \quad e = \begin{pmatrix} e_1 \\ \vdots \\ e_n \end{pmatrix}$$
The multiple regression model in matrix notation can then be expressed as
$$Y = Xb + e$$
where b is a column vector of coefficients ($b_1$ is the intercept) and k is the number of unknown regression coefficients for the descriptors. We recall that the goal of multiple regression is to minimize the sum of the squared residuals:
$$\sum_{i=1}^{n} e_i^{2} = e'e = (Y - Xb)'(Y - Xb) \rightarrow \min$$
Regression coefficients that satisfy this criterion are found by solving the system of linear equations (the normal equations, obtained by multiplying both sides by X' from the left)
$$X'Xb = X'Y$$
When the X variables are linearly independent (that is, the X'X matrix is of full rank), there is a unique solution to the system of linear equations. One way of solving the system above is to premultiply both sides of the normal equations by the inverse of the X'X matrix to give
$$b = (X'X)^{-1} X'Y$$
The other way is to solve the system above directly, using LQ factorization for an underdetermined system (n < k) or QR factorization for an overdetermined system (n > k). This method is more general and does not require time-consuming matrix inversion. Singular value decomposition methods can also be used, but such methods are usually significantly more time-consuming and are advantageous only when a strong linear dependence exists that would diminish the quality of the models. The third way, which addresses the problem of linear dependency of the variables (the determinant of the X'X matrix is zero or close to zero), is generalized matrix inversion, but this is usually outside the sphere of QSPR.
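The normal-equations route described above can be sketched in a few lines of plain Python. This is an illustration only, not the CODESSA PRO implementation (which, according to the manual, relies on optimized LAPACK routines): the sketch forms X'X and X'Y, then solves the system by Gaussian elimination with partial pivoting instead of an explicit inverse. The names `solve` and `fit_mlr`, the tolerance, and the sample data are hypothetical.

```python
def solve(a, rhs):
    """Solve a x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(a, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular: descriptors are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


def fit_mlr(descriptors, prop):
    """Return [intercept, b_1, ..., b_k] from the normal equations X'X b = X'Y."""
    x = [[1.0] + list(row) for row in descriptors]  # column of ones for the intercept
    xt = [list(col) for col in zip(*x)]
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in xt] for ci in xt]
    xty = [sum(p * y for p, y in zip(ci, prop)) for ci in xt]
    return solve(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2.
coefficients = fit_mlr([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], [1, 3, 4, 6, 8])
print(coefficients)
```

A singular X'X (for example, two identical descriptor columns) is reported as an error here; in practice this is the case where a factorization-based or generalized-inverse approach would be preferred.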
A fundamental principle of least squares methods, and of multiple linear regression in particular, is that the variance of the dependent variable can be partitioned (divided into parts) according to its source. Suppose that a dependent variable (property) is regressed on one or more descriptors and, for convenience, the dependent variable is scaled so that its mean is 0. Then a basic least squares identity holds, in which the total sum of squared values on the dependent variable equals the sum of squared predicted values plus the sum of squared residual values:
$$\sum Y^{2} = \sum \hat{Y}^{2} + \sum (Y - \hat{Y})^{2}$$
Stated more generally,
$$\sum (Y - \bar{Y})^{2} = \sum (\hat{Y} - \bar{Y})^{2} + \sum (Y - \hat{Y})^{2}$$
where the term on the left is the total sum of squared deviations of the observed values on the dependent variable from the dependent variable mean, and the terms on the right are: (i) the sum of the squared deviations of the predicted values for the dependent variable from the dependent variable mean and (ii) the sum of the squared deviations of the observed values on the dependent variable from the predicted values, that is, the sum of the squared residuals. Stated yet another way,
$$SS_{Total} = SS_{Model} + SS_{Error}$$
Note that $SS_{Total}$ is always the same for any particular data set, but $SS_{Model}$ and $SS_{Error}$ vary with the regression equation. Assuming again that the dependent variable is scaled so that its mean is 0, $SS_{Model}$ and $SS_{Error}$ can be computed using
$$SS_{Model} = b'X'Y, \qquad SS_{Error} = Y'Y - b'X'Y$$
Assuming that X'X is full-rank,
$$r^{2} = \frac{SS_{Model}}{SS_{Total}}, \qquad s^{2} = \frac{SS_{Error}}{n - k - 1}, \qquad F = \frac{SS_{Model} / k}{s^{2}}$$
where $r^2$ is the squared correlation coefficient, which measures how well the model fits the property, $s^2$ is an unbiased estimate of the residual or error variance, and F is the Fisher criterion with (k, n - k - 1) degrees of freedom. If X'X is not full rank, rank(X'X) + 1 is substituted for k.

References:
1. http://www.statsoft.com/textbook/stathome.html
2. Darlington, R. B. Regression and linear models. New York: McGraw-Hill, 1990.
3. Neter, J.; Wasserman, W.; Kutner, M. H. Applied linear regression models (2nd ed.). Homewood, IL: Irwin, 1989.
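The partition of the sum of squares and the three statistics derived from it can be checked numerically. The sketch below assumes that the predicted values come from a least squares fit that includes an intercept, so that the model sum of squares equals the total minus the error sum of squares. The function name `regression_statistics` and the sample data are hypothetical.

```python
def regression_statistics(observed, predicted, k):
    """Return (r2, s2, F) for a least squares model with k descriptors."""
    n = len(observed)
    mean = sum(observed) / n
    ss_total = sum((y - mean) ** 2 for y in observed)
    ss_error = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_model = ss_total - ss_error        # valid for a least squares fit with intercept
    r2 = ss_model / ss_total
    s2 = ss_error / (n - k - 1)           # unbiased estimate of the error variance
    f = (ss_model / k) / s2               # Fisher criterion, (k, n - k - 1) d.o.f.
    return r2, s2, f


# Hypothetical one-descriptor example: x = 0..4, least squares line y = 1.4 + 0.8 x.
observed = [1.0, 3.0, 2.0, 5.0, 4.0]
predicted = [1.4 + 0.8 * x for x in range(5)]
print(regression_statistics(observed, predicted, k=1))
```

For this small data set the total sum of squares is 10 and the residual sum of squares is 3.6, so about 64% of the variance is accounted for by the model.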
5.2 Selection of descriptors

A rigorously correct solution for descriptor selection requires a full search of the discrete descriptor space. Unfortunately, combinatorial explosion does not allow the application of a full search procedure to real tasks. For example, if we search for a 5-parameter correlation in a space of 1000 descriptors (these numbers are realistic for a typical search), we would have to test over $8 \times 10^{12}$ correlations for their ability to match some criterion (usually the squared correlation coefficient). Modern machines are fast enough to calculate one correlation every 0.0001-0.0002 seconds using highly optimized linear algebra libraries (CODESSA PRO uses Intel MKL, a LAPACK [3] clone) and highly optimized low-level code. Even with this level of optimization, the time required to solve the aforementioned task using a full search procedure is $8 \times 10^{12} \times 0.0001$ seconds (about 26 years).

Because it is impossible to solve typical tasks in a reasonable amount of time using a full search, methods for simplification have been developed. Such methods of descriptor selection can be categorized as either deterministic or stochastic. Over years of research, many algorithms for non-full searches have been developed. The best known deterministic algorithms [1] are forward entry and backward removal of effects (in our case, descriptors). The forward and backward stepwise search methods combine the entering and removal of effects at each step. Each of the methods mentioned above has many limitations [2], the majority of which concern the absence of a consistent set of correlations (models) representing the upper segment of the search space. The best-subset methods (proposed in this work) are the next alternative to the full-search procedure, but they share these limitations.
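The size of the full search can be verified directly. The short sketch below counts the 5-descriptor subsets of a 1000-descriptor pool and converts the count into computing time at the lower quoted rate of 0.0001 seconds per correlation. The function name `full_search_cost` is hypothetical, and `math.comb` requires Python 3.8 or later.

```python
import math


def full_search_cost(n_descriptors, model_size, seconds_per_correlation):
    """Return (number of candidate models, estimated years of computation)."""
    n_models = math.comb(n_descriptors, model_size)  # unordered descriptor subsets
    seconds = n_models * seconds_per_correlation
    years = seconds / (365.25 * 24 * 3600)
    return n_models, years


n_models, years = full_search_cost(1000, 5, 0.0001)
print(n_models, round(years, 1))
```

The count is 8,250,291,250,200 subsets, which at this rate corresponds to roughly 26 years, in agreement with the estimate in the text.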
Two methods for reducing the full search procedure were utilized by our group [4]: the heuristic method and the best multilinear regression method. The heuristic method for descriptor selection begins with a pre-selection in which descriptors are sequentially eliminated if they meet any of the following conditions: (i) a Fisher F-criterion below 1.0; (ii) an R2 value less than a value defined at the start; (iii) a Student's t-criterion less than a defined value; (iv) a squared intercorrelation coefficient with another descriptor that is higher than a predetermined level (of such a pair, the descriptor with the higher R2 with reference to the property is retained). The descriptors that remain are then listed in decreasing order of their correlation coefficients and used in a global search for 2-parameter correlations. Each 2-parameter correlation that is significant by the F-criterion is recursively expanded to an n-parameter correlation as long as the normalized F-criterion remains greater than the start-up value. The best N correlations by R2, as well as by F-criterion, are saved.

The best multilinear regression method is based on (i) the selection of orthogonal descriptor pairs and (ii) the extension of the correlations saved in the previous step by the addition of new descriptors until the F-criterion becomes less than that of the best 2-parameter correlation. The best N correlations (by R2) are saved.

Both methods successfully solve the initial selection problem by reducing the number of pairs of descriptors in the "starting set". The major limitations of these methods are the pairwise selection in the first step and the poor representation of the upper segment of the search space (according to the selected criteria), which results from the small number of correlations retained (N is 400 in both cases).
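Part of the pre-selection step of the heuristic method can be sketched as follows. The example covers only two of the four criteria, the minimum R2 with the property and the intercorrelation filter that keeps the descriptor with the higher R2; the F and t tests are omitted for brevity. The names `r_squared` and `preselect`, the threshold values, and the sample data are hypothetical and do not reflect CODESSA PRO defaults.

```python
def r_squared(x, y):
    """Squared Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy * sxy / (sxx * syy)


def preselect(descriptors, prop, r2_min=0.1, r2_inter_max=0.8):
    """Return descriptor names that survive the R2 and intercorrelation filters."""
    ranked = sorted(((r_squared(v, prop), name) for name, v in descriptors.items()),
                    reverse=True)
    kept = []
    for r2, name in ranked:  # best descriptors first, so the better of a pair is kept
        if r2 < r2_min:
            continue
        if all(r_squared(descriptors[name], descriptors[other]) <= r2_inter_max
               for other in kept):
            kept.append(name)
    return kept


# Hypothetical data: d1 and d2 are nearly collinear, const has no variance.
prop = [1.0, 2.0, 3.0, 4.0, 5.0]
descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.5],
    "d2": [1.0, 2.0, 3.0, 4.0, 6.0],
    "c": [2.0, 1.0, 4.0, 3.0, 5.0],
    "const": [1.0, 1.0, 1.0, 1.0, 1.0],
}
print(preselect(descriptors, prop))
```

Processing the descriptors in decreasing order of R2 is what guarantees that, of two intercorrelated descriptors, the one more strongly related to the property is retained.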
The HMPRO procedure applied in CODESSA PRO is a superset of the heuristic and best multilinear regression methods described above, as well as of the forward/backward selection methods, the forward/backward stepwise search methods, and the full search procedure. By altering the parameters of HMPRO, it is possible to emulate the behavior of all of the above-mentioned methods. The HMPRO algorithm consists of four major parts:

1) The 1-parameter descriptor selection. The selection is based on the squared correlation coefficient, the Fisher F-criterion, and the Student t-criterion. Highly intercorrelated descriptors and descriptors with insignificant variance are eliminated.
2) Pair-wise selection. This selection is based on the analysis of the squared correlation coefficient and the Fisher F-criterion.
3) Expanding/contracting stage. The obtained correlations with the best statistical characteristics are expanded by adding a previously unselected descriptor. The new correlations are validated by the values of the squared intercorrelation coefficient, the F-criterion (normalized or not), and the standard error. Correlations with the maximum allowed number of descriptors (a parameter pre-set by the user) are not expanded further. The process begins from the correlations with the largest values of the fitness function (correlation weight), defined as follows:

$w = \dfrac{R^2 F n}{N s^2}$

where $R^2$ is the squared correlation coefficient, F is the Fisher criterion, n is the number of data points, N is the number of descriptors in the model, and s is the standard deviation. This stage is repeated until a stop event occurs, which can be any or all of the following:
i. The correlation set in memory exceeds its capacity. The correlations with the lowest values of the fitness function are not stored. If a new correlation is found with a fitness function value higher than the current lowest one, it replaces the correlation with the lowest fitness function.
ii. The in-memory correlation set is overfilled.
(The set can be conditionally unlimited: after overfilling, a correlation whose fitness function value is lower than the minimum in the set is not stored at all; if a correlation is inserted into the set, the correlation with the worst value of the fitness function is eliminated.)
iii. The maximum number of iterations is reached.
iv. The time limit is reached.
v. The correlation set has no correlation left to expand/contract (the full search is finished).
4) The output stage. The given number of the “best” correlations is printed out. Each iteration, available for selecting printed correlations, begins with the “best” correlations. For all of the correlations in a cycle, a full set of statistical parameters is calculated, including the intercorrelation of the descriptors (one to all others), the cross-validated squared correlation coefficient, etc.

The parameters of the method are defined by the set of selection criteria. For any correlation, a full list of its predecessors can be printed in calculation order and on the basis of the best correlations with subsets of its descriptors, down to the 1-parameter correlation.

The general fitness function (correlation weight) mentioned above is defined as follows:

$w = (R^2)^{x_1} F^{x_2} n^{x_3} (s^2)^{x_4} N^{x_5}$

where $R^2$ is the squared correlation coefficient, F is the Fisher criterion, n is the number of data points, $s^2$ is the standard error (non-normalized), and N is the number of descriptors in the correlation. The $x_i$ denote the weight parameters and have the following default values: {1, 1, 1, -1, -1}.

The MDA module of CODESSA PRO provides two additional methods for correlation validation: Monte Carlo crossvalidation (MCCV) [5,6] and the Monte Carlo extension of the randomization test. [7] One of the best random number generators is used. [8]

References:
1. Darlington, R. B. Regression and linear models. New York: McGraw-Hill, 1990.
2. Neter, J.; Wasserman, W.; Kutner, M. H. Applied linear regression models (2nd ed.). Homewood, IL: Irwin, 1989.
3.
Anderson, E.; Bai, Z.; Bischof, C.; Demmel, J.; Dongarra, J.; Du Croz, J.; Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D. LAPACK: A portable linear algebra library for high-performance computers. Computer Science Dept. Technical Report CS-90-105, University of Tennessee: Knoxville, TN, 1990.
4. Katritzky, A.R.; Lobanov, V.; Karelson, M. CODESSA Reference Manual. University of Florida, Gainesville, 1996.
5. Xu, Q.-S.; Liang, Y.-Z. Monte Carlo cross validation. Chemom. Intell. Lab. Syst. 2001, 56, 1-11.
6. Martens, H.A.; Dardenne, P. Validation and verification of regression in small data sets. Chemom. Intell. Lab. Syst. 1998, 44, 99-121.
7. Edgington, E. S. Randomization tests. Marcel Dekker, Inc.: New York and Basel, 1980, pp. 195-216.
8. Matsumoto, M.; Nishimura, T. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. ACM Trans. on Modeling and Computer Simulation 1998, 8, 3-30.

5.3 Multivariate methods

Principal component analysis (PCA) is generally described as an ordination technique for describing the variation in a multivariate data set. [1, 2, 3] The first axis (the first principal component, or PC1) describes the maximum variation in the whole data set; the second describes the maximum variance remaining, and so forth, with each axis orthogonal to the preceding axes. Principal components are eigenvectors of a covariance X'X or correlation X'Y matrix. The number of principal components that can be extracted will typically exceed the maximum of the number of Y and X variables. Principal component analysis and factor analysis are based on the separation of the original matrix X into two matrices: the factor score matrix T and the loading matrix Q. In matrix form: $X = TQ$. The columns of the factor score matrix are linearly independent. Usually, the columns of the X and Y matrices are centered (by subtracting their means) and scaled (by dividing by their standard deviations).
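The separation of a centered and scaled matrix X into scores T and loadings Q can be sketched with a singular value decomposition. This is a minimal illustration using NumPy, not the routine used inside CODESSA PRO; the function name `pca_scores_loadings` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_scores_loadings(X, n_components):
    """Decompose autoscaled X into scores T and loadings Q so that Xs ~ T @ Q."""
    # Center and scale each column, as described in the text
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Thin SVD: Xs = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # factor scores
    Q = Vt[:n_components, :]                     # loadings, one row per component
    # Variance described by each component
    explained = s[:n_components] ** 2 / (X.shape[0] - 1)
    return Xs, T, Q, explained

# Synthetic data with one strongly correlated pair of columns
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=50)
Xs, T, Q, explained = pca_scores_loadings(X, n_components=4)
print(np.round(explained, 3))
```

With all components retained, T @ Q reproduces the autoscaled matrix exactly; keeping fewer components gives the best low-rank approximation in the least-squares sense. The score columns are mutually orthogonal, and the described variances appear in decreasing order.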
Suppose we have a data set with response variables Y (in matrix form) and a large number of predictor variables X (in matrix form), some of which are highly correlated. A regression using factor extraction for this type of data computes the factor score matrix

$T = XW$

for an appropriate weight matrix W, and then considers the linear regression model

$Y = TQ + E$

where Q is a matrix of regression coefficients (loadings) for T, and E is an error (noise) term. Once the loadings Q are computed, the above regression model is equivalent to

$Y = XB + E$, where $B = WQ$,

which can be used as a predictive regression model.

The factor scores and loadings can be obtained in many different ways. The NIPALS algorithm was developed in 1923 [4] and later modified in 1966, [5] and the SIMPLS algorithm [6] resulted from work by de Jong in 1993. Singular value decomposition is another commonly used method for calculating scores and loadings. [7]

Principal components regression (PCR) and partial least squares (PLS) regression differ in the methods used for extracting factor scores. [1] PCR produces the weight matrix W reflecting the covariance structure between the predictor variables, while PLS regression produces the weight matrix W reflecting the covariance structure between the predictor and response variables. In PLS regression, prediction functions are represented by factors extracted from the Y'XX'Y matrix.

For establishing the model, PLS regression produces a weight matrix W for X such that T=XW, i.e., the columns of W are weight vectors for the X columns producing the corresponding factor score matrix T. These weights are computed so that each of them maximizes the covariance between the responses and the corresponding factor scores. Ordinary least squares procedures for the regression of Y on T are then performed to produce Q, the loadings for Y (or weights for Y), such that Y=TQ+E. Once Q is computed, we have Y=XB+E, where B=WQ, and the prediction model is complete.
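The relations T=XW, Y=TQ+E and B=WQ can be made concrete with a small sketch of PLS regression for a single response (often called PLS1), written with a NIPALS-style loop in NumPy. It is an illustration under stated assumptions rather than the CODESSA PRO implementation: the function name `pls1`, the synthetic data, and the restriction to one response column are all choices made for the example. The per-factor weights are re-expressed relative to the undeflated X so that T=XW holds as written in the text.

```python
import numpy as np

def pls1(X, y, n_factors):
    """PLS1 by a NIPALS-style loop; returns T, W, Q, B for centered data."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    Xk = Xc.copy()
    weights, loadings, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yc                 # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                    # scores of the current factor
        tt = t @ t
        p = Xk.T @ t / tt             # X loading vector for this factor
        q.append(yc @ t / tt)         # least-squares regression of y on t
        Xk = Xk - np.outer(t, p)      # deflate X before the next factor
        weights.append(w)
        loadings.append(p)
    Wn = np.column_stack(weights)
    P = np.column_stack(loadings)     # Xc is approximated by T @ P.T
    Q = np.array(q)
    # Re-express the weights relative to the undeflated X, so that T = Xc @ W
    W = Wn @ np.linalg.inv(P.T @ Wn)
    T = Xc @ W
    B = W @ Q                         # regression coefficients, B = WQ
    return T, W, Q, B

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=40)
T, W, Q, B = pls1(X, y, n_factors=3)
print(np.round(B, 3))
```

When as many factors are extracted as there are independent predictor columns, the PLS coefficients coincide with ordinary least squares; the benefit of the method comes from stopping earlier when the predictors are highly correlated.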
One additional matrix necessary for a complete description of partial least squares regression procedures is the factor loading matrix P, which gives a factor model X=TP+F, where F is the unexplained part of the X scores.

References:
1. http://www.statsoft.com/textbook/stathome.html
2. Manly, B. F. J. Multivariate Statistical Methods. A Primer. London-NY: Chapman and Hall, 1986.
3. Nilsson, J. Multiway calibration in 3D QSAR: applications to dopamine receptor ligands; Groningen: University Library Groningen, 1998, Online Resource.
4. Fisher, R.; Mackenzie, W. Journal of Agricultural Science 1923, 13, 311-320.
5. Wold, H. In: Research papers in Statistics, ed. David, F., NY: Wiley & Sons, 1966, pp. 411-444.
6. de Jong, S. SIMPLS: An Alternative Approach to Partial Least Squares Regression. Chemometrics and Intelligent Laboratory Systems 1993, 18, 251-263.
7. Mandel, J. American Statistician 1982, 36, 15-24.

5.4 Test of the results

Most QSPR models are useful, but occasionally models are produced that are not at all reliable. [1] The problems of reliability can be classified as (i) overfitting and (ii) models by chance. The latter problem can only be solved by subjective human judgment based on the justifiability of any assessment of the uncertainty of a particular prediction. [2, 3] Overfitting is one of the more common problems in the development of any kind of model, and QSPR is no exception. The problem is usually solved using objective validation criteria. The most common method of validation in chemometrics is crossvalidation. [4]

References:
1. Eriksson, L.; Johansson, E.; Muller, M.; Wold, S. On the selection of the training set in environmental QSAR analysis when compounds are clustered. J. Chemometrics 2000, 14, 599-616.
2. Stone, M.; Jonathan, P. Statistical Thinking and Technique for QSAR and related studies. 1. General Theory. J. Chemometrics 1993, 7, 455-475.
3. Stone, M.; Jonathan, P.
Statistical Thinking and Technique for QSAR and related studies. 2. Specific Methods. J. Chemometrics 1994, 6, 1-20.
4. Xu, Q.-S.; Liang, Y.-Z. Monte Carlo cross validation. Chemom. Intell. Lab. Syst. 2001, 56, 1-11.

5.5 External validation set

The easiest method of correlation testing is the use of an external validation set. [1] In this method, the correlation is used to predict the property value for chemical structures that were not used in the creation of the correlation; test statistics are calculated for the external data set; and the difference between the test statistics of the training and validation data sets is a measure of the reliability of the correlation. A widely used measure is the prediction error sum of squares (PRESS), defined as

$\mathrm{PRESS} = \sum_{i=1}^{n} (y_{e,i} - y_{p,i})^2$

where $y_{e,i}$ are the experimental values of the property, $y_{p,i}$ are the predicted values for the external validation test, and n is the number of compounds in the external validation set. Often the RMSPE criterion is preferred:

$\mathrm{RMSPE} = \sqrt{\mathrm{PRESS}/n}$

because it gives the error on a ‘per compound’ basis. [1] The method is a particular case of the leave-many-out crossvalidation method.

References:
1. Stone, M.; Jonathan, P. Statistical Thinking and Technique for QSAR and related studies. 1. General Theory. J. Chemometrics 1993, 7, 455-475.

5.6 Leave-One-Out crossvalidation

The simplest, and a commonly used, method of crossvalidation in chemometrics is the “leave-one-out” method. The idea behind this method is to predict the property value for each compound in the data set in turn from the regression equation calculated from the data for all other compounds. For evaluation, the predicted values can be used to calculate the PRESS, RMSPE, and squared correlation coefficient ($r^2_{cv}$) criteria. The method tends to include unnecessary components in the model, and has been proved [2] to be asymptotically incorrect. Furthermore, the method does not work well for data with strong clusterization, [1] and underestimates the true predictive error. [3]

References:
1. Eriksson, L.; Johansson, E.; Muller, M.; Wold, S.
On the selection of the training set in environmental QSAR analysis when compounds are clustered. J. Chemometrics 2000, 14, 599-616.
2. Stone, M. An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. J. R. Stat. Soc., B 1977, 38, 44-47.
3. Martens, H.A.; Dardenne, P. Validation and verification of regression in small data sets. Chemom. Intell. Lab. Syst. 1998, 44, 99-121.

5.7 Leave-Many-Out crossvalidation

The “leave-many-out” crossvalidation method was first described in 1975. [3] Later, the asymptotic consistency of the method was proved. [2] Because the combinatorial complexity of the calculation results in low productivity, some simplifications of the method were developed. The results of “leave-many-out” crossvalidation can be evaluated using a Monte Carlo approach. [1]

References:
1. Xu, Q.-S.; Liang, Y.-Z. Monte Carlo cross validation. Chemom. Intell. Lab. Syst. 2001, 56, 1-11.
2. Stone, M. An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. J. R. Stat. Soc., B 1977, 38, 44-47.
3. Geisser, S. The Predictive Sample Reuse Method with Application. J. Amer. Stat. Ass. 1975, 70, 320-328.

5.8 Randomization test

Randomization tests [1] are statistical tests in which the data are repeatedly rearranged; a test statistic ($r^2$, t-criterion, F-criterion, etc.) is computed for each data permutation, and the proportion of the data permutations with test statistic values as large as the value for the obtained results determines the significance of the results. For testing a multilinear correlation, permutations of the vector Y are processed through the multilinear regression procedure with the columns of matrix X fixed. Because the time required grows factorially with the size of the vector Y, the Monte Carlo method is often used to carry out the randomization test.

Chapter 6
CODESSA Publications

The following is a selection of publications on research done using the CODESSA PRO (CODESSA) software (as of January 1st, 2005).
1. Quantitative structure-property relationship modeling of β-cyclodextrin complexation free energies (A.R. Katritzky, D.C. Fara, H.F. Yang, M. Karelson, T. Suzuki, V.P. Solov'ev, A. Varnek) J. Chem. Inf. Comput. Sci., 2004, 44, 529.
2. A quantitative structure-property relationship study of lithium cation basicities (K. Tämm, D.C. Fara, A.R. Katritzky, P. Burk, M. Karelson) J. Phys. Chem. A, 2004, 108, 4812.
3. Quantitative Aspects of Solvent Polarity (A.R. Katritzky, D. Fara, H. Yang, T. Tamm, M. Karelson) Chem. Rev., 2004, 104, 175.
4. "In silico" design of new uranyl extractants based on phosphoryl-containing podands: QSPR studies, generation and screening of virtual combinatorial library, and experimental tests (A. Varnek, D. Fourches, V.P. Solov'ev, V.E. Baulin, A.N. Turanov, V.K. Karandashev, D. Fara, A.R. Katritzky) J. Chem. Inf. Comput. Sci., 2004, 44, 1365.
5. QSPR of 3-aryloxazolidin-2-one antibacterials (A.R. Katritzky, D.C. Fara, M. Karelson) Bioorg. & Med. Chem. 2004, 12, 3027.
6. QSPR treatment of rat blood:air, saline:air and olive oil:air partition coefficients using theoretical molecular descriptors (A.R. Katritzky, M. Kuanar, D.C. Fara, M. Karelson, W.E. Acree) Bioorg. & Med. Chem. 2004, 12, 4735.
7. QSPR models for various physical properties of carbohydrates based on molecular mechanics and quantum chemical calculations (J.D. Dyekjaer, S.Ó. Jónsdóttir) Carbohydr. Res. 2004, 339, 269.
8. Quantum-connectivity descriptors in modeling solubility of environmentally important organic compounds (E. Estrada, E.J. Delgado, J.B. Alderete, G.A. Jana) J. Comput. Chem., 2004, 25, 1787.
9. Quantitative structure-activity relationships of alpha(1) adrenergic antagonists (S. Eric, T. Solmajer, J. Zupan, M. Novic, M. Oblak, D. Agbaba) J. Mol. Modeling, 2004, 10, 139.
10. Quantitative structure-activity relationships of Streptococcus pneumoniae MurD transition state analogue inhibitors (M. Kotnik, M. Oblak, J. Humljan, S. Gobec, U. Urleb, T.
Solmajer) QSAR & Comb. Sci. 2004, 23, 399.
11. Study of quantitative structure-mobility relationship of carboxylic and sulphonic acids in capillary electrophoresis (C.X. Xue, H.X. Liu, X.J. Yao, M.C. Liu, Z. Hu, B.T. Fan) J. Chromatogr. A 2004, 1048, 233.
12. QSAR for toxicities of polychlorodibenzofurans, polychlorodibenzo-1,4-dioxins, and polychlorobiphenyls (A. Beteringhe, A.T. Balaban) ARKIVOC 2004 (1) 163.
13. QSAR study using topological indices for inhibition of carbonic anhydrase II by sulfanilamides and Schiff bases (A.T. Balaban, S.C. Basak, A. Beteringhe, D. Mills, C.T. Supuran) Molecular Diversity, 2004, 8, 401.
14. Using Variable and Fixed Topological Indices for the Prediction of Reaction Rate Constants of Volatile Unsaturated Hydrocarbons with OH Radicals (M. Pompe, M. Veber, M. Randić, A.T. Balaban) Molecules, 2004, 9, 1160.
15. Application of Artificial Neural Networks to the QSPR Study (M. Novič, A. Roncaglioni) Kem. Ind. 2004, 53, 323.
16. Tuning Neural and Fuzzy-Neural Networks for Toxicity Modeling (P. Mazzatorta, E. Benfenati, C.-D. Neagu, G. Gini) J. Chem. Inf. Comput. Sci., 2003, 43, 513.
17. A general treatment of solubility. 1. The QSPR correlation of solvation free energies of single solutes in series of solvents (A.R. Katritzky, A.A. Oliferenko, P.V. Oliferenko, R. Petrukhin, D.B. Tatham, U. Maran, A. Lomaka, W.E. Acree) J. Chem. Inf. Comput. Sci., 2003, 43, 1794.
18. A general treatment of solubility. 2. QSPR prediction of free energies of solvation of specified solutes in ranges of solvents (A.R. Katritzky, A.A. Oliferenko, P.V. Oliferenko, R. Petrukhin, D.B. Tatham, U. Maran, A. Lomaka, W.E. Acree) J. Chem. Inf. Comput. Sci., 2003, 43, 1806.
19. QSPR models based on molecular mechanics and quantum chemical calculations. 2. Thermodynamic properties of alkanes, alcohols, polyols, and ethers (J.D. Dyekjaer, S.Ó. Jónsdóttir) Ind. & Eng. Chem. Res., 2003, 42, 4241.
20.
Chain melting temperature estimation for phosphatidyl cholines by quantum mechanically derived quantitative structure property relationships (A.J. Holder, D.M. Yourtee, D.A. White, A.G. Glaros, R. Smith) J. Comp.-Aid. Mol. Des., 2003, 17, 223.
21. Nitrobenzene toxicity: QSAR correlations and mechanistic interpretations (A.R. Katritzky, P. Oliferenko, A. Oliferenko, A. Lomaka, M. Karelson) J. Phys. Org. Chem., 2003, 16, 811.
22. Layer matrices and distance property descriptors (M.V. Diudea, O. Ursu) Ind. J. Chem A., 2003, 42, 1283.
23. Correlation between molecular structures and relative electrophoretic mobility in capillary electrophoresis: Alkylpyridines (X.J. Yao, B.T. Fan, J.P. Doucet, A. Panaye, M.C. Liu, R.S. Zhang, Z.D. Hu) Chin. J. Chem. 2003, 21, 1247.
24. Prediction of boiling points of ketones using a quantitative structure-property relationship treatment (A. Komasa) Pol. J. Chem. 2003, 77, 1491.
25. Thermodynamic Study of the Solubility of Some Sulfonamides in Cyclohexane (F. Martínez, C.M. Ávila, A. Gómez) J. Braz. Chem. Soc., 2003, 14, 803.
26. Mutagenicity of aminoazo dyes and their reductive-cleavage metabolites: a QSAR/QPAR investigation (L. Sztandera, A. Garg, S. Hayik, K.L. Bhat, C.W. Bock) Dyes and Pigments, 2003, 59, 117.
27. QSAR modeling of genotoxicity on non-congeneric sets of organic compounds (U. Maran, S. Sild) Artif. Intell. Rev., 2003, 20, 13.
28. Predicting melting points of quaternary ammonium ionic liquids (D.M. Eike, J.F. Brennecke, E.J. Maginn) Green Chem., 2003, 5, 323.
29. Six-membered cyclic ureas as HIV-1 protease inhibitors: A QSAR study based on CODESSA PRO approach (A.R. Katritzky, A. Oliferenko, A. Lomaka, M. Karelson) Bioorg. & Med. Chem. Lett. 2002, 12, 3453.
30. QSPR models based on molecular mechanics and quantum chemical calculations - 1. Construction of Boltzmann-averaged descriptors for alkanes, alcohols, diols, ethers and cyclic compounds (J.D. Dyekjaer, K. Rasmussen, S.Ó. Jónsdóttir) J. Mol.
Modeling, 2002, 8, 277.
31. Charge-transfer interactions in the inhibition of MAO-A by phenylisopropylamines - a QSAR study (G. Vallejos, M.C. Rezende, B.K. Cassels) J. Comput-Aid. Mol. Design, 2002, 16, 95.
32. Correlation of liquid viscosity with molecular structure for organic compounds using different variable selection methods (B. Lucic, I. Basic, D. Nadramija, A. Milicevic, N. Trinajstic, T. Suzuki, R. Petrukhin, M. Karelson, A.R. Katritzky) ARKIVOC, 2002, 4, 45.
33. QSPR models derived for the kinetic data of the gas-phase homolysis of the carbon-methyl bond (R. Hiob, M. Karelson) Comput. & Chem. 2002, 26, 237.
34. Constructing a useful tool for characterizing amino acid conformers by means of quantum chemical and graph theory indices (C. Cardenas, M. Obregon, E.J. Llanos, E. Machado, H.J. Bohorquez, J.L. Villaveces, M.E. Patarroyo) Comput. & Chem. 2002, 26, 667.
35. Novel potent 5-HT3 receptor Ligands based on the pyrrolidone structure: Synthesis, biological evaluation, and computational rationalization of the ligand-receptor interaction modalities (A. Cappelli, M. Anzini, S. Vomero, L. Mennuni, F. Makovec, E. Doucet, M. Hamon, M.C. Menziani, P.G. De Benedetti, G. Giorgi, C. Ghelardini, S. Collina) Bioorg. & Med. Chem. 2002, 10, 779.
36. The Present Utility and Future Potential to Medicinal Chemistry of QSAR/QSPR with Whole Molecule Descriptors (A.R. Katritzky, D. Tatham, D. Fara, U. Maran, A. Lomaka, M. Karelson) Current Topics in Medicinal Chemistry, 2002, 2, 1333.
37. QSPR correlation of the melting point for pyridinium bromides, potential ionic liquids (A.R. Katritzky, A. Lomaka, R. Petrukhin, R. Jain, M. Karelson, A.E. Visser, R.D. Rogers) J. Chem. Inf. Comput. Sci., 2002, 42, 71.
38. Correlation of the melting points of potential ionic liquids (imidazolium bromides and benzimidazolium bromides) using the CODESSA program (A.R. Katritzky, R. Jain, A. Lomaka, R. Petrukhin, M. Karelson, A.E. Visser, R.D. Rogers) J. Chem. Inf. Comput.
Sci., 2002, 42, 225.
39. A General QSPR Treatment for Dielectric Constants of Organic Compounds (S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 2002, 42, 360.
40. Prediction of Ultraviolet Spectral Absorbance using Quantitative Structure-Property Relationships (W.L. Fitch, M. McGregor, A.R. Katritzky, A. Lomaka, R. Petrukhin, M. Karelson) J. Chem. Inf. Comput. Sci., 2002, 42, 830.
41. General and class specific models for prediction of soil sorption using various physicochemical descriptors (P.L. Andersson, U. Maran, D. Fara, M. Karelson, J.L.M. Hermens) J. Chem. Inf. Comput. Sci., 2002, 42, 1450.
42. A QSPR study of sweetness potency using the CODESSA program (A.R. Katritzky, R. Petrukhin, S. Perumal, M. Karelson, I. Prakash, N. Desai) Croat. Chem. Acta, 2002, 75, 475.
43. A QSPR model for the prediction of the gas-phase free energies of activation of rotation around the N-C(O) bond (J. Leis, M. Karelson) Comput. & Chem. 2002, 26, 667.
44. Mutagenicity of aminoazobenzene dyes and related structures: a QSAR/QPAR investigation (A. Garg, K.L. Bhat, C.W. Bock) Dyes and Pigments, 2002, 55, 35.
45. QSAR of Flavonoids: 4. Differential Inhibition of Aldose Reductase and p56lck Protein Tyrosine Kinase (A. Štefanič-Petek, A. Krbačič, T. Šolmajer) Croat. Chem. Acta, 2002, 75, 517.
46. QSPR Correlation of Free-Radical Polymerization Chain-Transfer Constants for Styrene (A.R. Katritzky, F.H. Ignatz-Hoover, R. Petrukhin, M. Karelson) J. Chem. Inf. Comput. Sci., 2001, 41, 295.
47. Quantum mechanical quantitative structure activity relationships to avoid mutagenicity in dental monomers (D. Yourtee, A.J. Holder, R. Smith, J.A. Morrill, E. Kostoryz, W. Brockmann, A. Glaros, C. Chappelow, D. Eick) J. Biomater. Sci. – Polym. Sci. 2001, 12, 89.
48. Correlation of the Solubilities of Gases and Vapors in Methanol and Ethanol with Their Molecular Structures (A.R. Katritzky, D.B. Tatham, U. Maran) J. Chem. Inf. Comput. Sci., 2001, 41, 358.
49.
CODESSA-Based Theoretical QSPR Model for Hydantoin HPLC-RT Lipophilicities (A.R. Katritzky, S. Perumal, R. Petrukhin, E. Kleinpeter) J. Chem. Inf. Comput. Sci., 2001, 41, 569.
50. The variable connectivity index (1)chi(f) versus the traditional molecular descriptors: A comparative study of (1)chi(f) against descriptors of CODESSA (M. Randic, M. Pompe) J. Chem. Inf. Comput. Sci., 2001, 41, 631.
51. Factors influencing predictive models for toxicology (E. Benfenati, N. Piclin, A. Roncaglioni, M.R. Vari) SAR QSAR Environm. Res. 2001, 12, 593.
52. A QSRR-Treatment of Solvent Effects on the Decarboxylation of 6-Nitrobenzisoxazole-3-carboxylates Employing Molecular Descriptors (A.R. Katritzky, S. Perumal, R. Petrukhin) J. Org. Chem., 2001, 66, 4036.
53. Interpretation of Quantitative Structure - Property and -Activity Relationships (A.R. Katritzky, R. Petrukhin, D. Tatham, S. Basak, E. Benfenati, M. Karelson, U. Maran) J. Chem. Inf. Comput. Sci., 2001, 41, 679.
54. Theoretical Descriptors for the Correlation of Aquatic Toxicity of Environmental Pollutants by Quantitative Structure-Toxicity Relationships (A.R. Katritzky, D.B. Tatham, U. Maran) J. Chem. Inf. Comput. Sci., 2001, 41, 1162.
55. QSPR Analysis of Flash Points (A.R. Katritzky, R. Petrukhin, R. Jain, M. Karelson) J. Chem. Inf. Comput. Sci., 2001, 41, 1521.
56. Perspective on the Relationship between Melting Points and Chemical Structure (A.R. Katritzky, R. Jain, A. Lomaka, R. Petrukhin, U. Maran, M. Karelson) Crystal Growth & Design, 2001, 1, 261.
57. QSAR Correlations of the Algistatic Activity of 5-Amino-1-aryl-1H-tetrazoles (A.R. Katritzky, R. Jain, R. Petrukhin, S.N. Denisenko, T. Schelenz) S. Q. Env. Res., 2001, 12, 259.
58. QSPR Correlation and Predictions of GC Retention Indexes for Methyl-Branched Hydrocarbons Produced by Insects (A.R. Katritzky, K. Chen, U. Maran, D.A. Carlson) Anal. Chem., 2000, 72, 101.
59. Non-linear QSAR Treatment of Genotoxicity (M. Karelson, S. Sild, U. Maran) Mol.
Simulat., 2000, 24, 229.
60. Structurally Diverse Quantitative Structure-Property Relationship Correlations of Technologically Relevant Physical Properties (A.R. Katritzky, U. Maran, V.S. Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 2000, 40, 1.
61. Prediction of Liquid Viscosity for Organic Compounds by a Quantitative Structure-Property Relationship (A.R. Katritzky, K. Chen, Y. Wang, M. Karelson, B. Lucic, N. Trinajstic, T. Suzuki, G. Shuurmann) J. Phys. Org. Chem., 2000, 13, 80.
62. Quantitative relationship between rate constants of the gas-phase homolysis of C-X bonds and molecular descriptors (R. Hiob, M. Karelson) J. Chem. Inf. Comput. Sci., 2000, 40, 1062.
63. Quantitative structure-activity relationship of flavonoid analogues. 3. Inhibition of p56(lck) protein tyrosine kinase (M. Oblak, M. Randic, T. Solmajer) J. Chem. Inf. Comput. Sci., 2000, 40, 994.
64. QSAR equations for N-methyl-(substituted-phenyl)-urethanes by PRECLAV program (L. Tarko) Rev. Roum. Chim., 2000, 45, 809.
65. Improved QSARs for predictive toxicology of halogenated hydrocarbons (S. Trohalaki, E. Gifford, R. Pachter) Comput. & Chem. 2000, 24, 421.
66. A Comprehensive QSAR Treatment of the Genotoxicity of Heteroaromatic and Aromatic Amines (A.R. Katritzky, U. Maran, M. Karelson) Quant. Struct. Act. Relat. Pharmacol. Chem. Biol., 1999, 18, 3.
67. A New Efficient Approach for Variable Selection Based on Multiregression: Prediction of Gas Chromatographic Retention Times and Response Factors (A.R. Katritzky, B. Lucic, N. Trinajstic, S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1999, 39, 610.
68. Development of quantitative structure-property relationships using calculated descriptors for the prediction of the physicochemical properties (n(D), p, bp, epsilon, eta) of a series of organic solvents (M. Cocchi, P.G. De Benedetti, R. Seeber, L. Tassi, A. Ulrici) J. Chem. Inf. Comput. Sci., 1999, 39, 1190.
69. QSPR Treatment of Solvent Scales (A.R. Katritzky, T. Tamm, Y. Wang, S.
Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1999, 39, 684.
70. A Unified Treatment of Solvent Properties (A.R. Katritzky, T. Tamm, Y. Wang, M. Karelson) J. Chem. Inf. Comput. Sci., 1999, 39, 692.
71. QSPR prediction of densities of organic liquids (M. Karelson, A. Perkson) Comput. & Chem. 1999, 23, 49.
72. Estimation of the liquid viscosity of organic compounds with a quantitative structure-property model (O. Ivanciuc, T. Ivanciuc, P.A. Filip, D. Cabrol-Bass) J. Chem. Inf. Comput. Sci., 1999, 39, 515.
73. QSPR and QSAR Models Derived Using Large Molecular Descriptor Spaces. A Review of CODESSA Applications (M. Karelson, U. Maran, Y. Wang, A.R. Katritzky) Collect. Czech. Chem. Commun., 1999, 64, 1551.
74. Insights into Sulfur Vulcanization from QSPR Quantitative Structure-Property Relationship Studies (A.R. Katritzky, F. Ignatz-Hoover, V.S. Lobanov, M. Karelson) Rubber Chem. Technol., 1999, 72, 318.
75. Relevance of theoretical molecular descriptors in quantitative structure-activity relationship analysis of alpha 1-adrenergic receptor antagonists (M.C. Menziani, M. Montorsi, P.G. De Benedetti, M. Karelson) Bioorg. & Med. Chem. 1999, 7, 2437.
76. QSAR study in the N-aryl-anthranylic acids class (T. Laszlo) Revista de Chemie, 1999, 50, 864.
77. QSPR and QSAR Models Derived with CODESSA Multipurpose Statistical Analysis Software (M. Karelson, U. Maran, Y. Wang, A.R. Katritzky) AAAI Tech. Report, 1999, SS-99-01, 12.
78. Evaluation of quantitative structure-activity relationship methods for large-scale prediction of chemicals binding to the estrogen receptor (W. Tong, D.R. Lowis, R. Perkins, Y. Chen, W.J. Welsh, D.W. Goddette, T.W. Heritage, D.M. Sheehan) J. Chem. Inf. Comput. Sci., 1998, 38, 669.
79. Normal Boiling Points for Organic Compounds: Correlation and Prediction by a Quantitative Structure - Property Relationship (A.R. Katritzky, V.S. Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 28.
80.
Quantitative Structure - Property Relationship (QSPR) Correlation of Glass Transition Temperatures of High Molecular Weight Polymers (A.R. Katritzky, S. Sild, V. Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 300.
81. Correlation of the Aqueous Solubility of Hydrocarbons and Halogenated Hydrocarbons with Molecular Structure (A.R. Katritzky, P.D.T. Huibers) J. Chem. Inf. Comput. Sci., 1998, 38, 283.
82. Relationship of Critical Temperatures to Calculated Molecular Properties (A.R. Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 293.
83. Quantitative structure-property relationship study of normal boiling points for halogen/oxygen/sulfur-containing organic compounds using the CODESSA program (O. Ivanciuc, T. Ivanciuc, A.T. Balaban) Tetrahedron, 1998, 54, 9129.
84. QSPR Studies on Vapor Pressure, Aqueous Solubility, and the Prediction of Water-Air Partition Coefficients (A.R. Katritzky, Y. Wang, S. Sild, T. Tamm, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 720.
85. General Quantitative Structure-Property Relationship Treatment of the Refractive Index of Organic Compounds (A.R. Katritzky, S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 840.
86. Theoretical descriptors in quantitative structure - Affinity and selectivity relationship study of potent N4-substituted arylpiperazine 5-HT1(Alpha) receptor antagonists (M.C. Menziani, P.G. De Benedetti, M. Karelson) Bioorg. & Med. Chem. 1998, 6, 535.
87. Correlation and Prediction of the Refractive Indices of Polymers by QSPR (A.R. Katritzky, S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 1171.
88. Prediction of Critical Micelle Concentration Using a Quantitative Structure-Property Relationship Approach. 2. Anionic Surfactants (A.R. Katritzky, P.D.T. Huibers, V.S. Lobanov, D.O. Shah, M. Karelson) J. Colloid and Interface Sci., 1997, 187, 113.
89. QSPR as a Means of Predicting and Understanding Chemical and Physical Properties in Terms of Structure (A.R.
Katritzky, M. Karelson, V.S. Lobanov) Pure Appl. Chem., 1997, 69, 245.
90. Predicting Surfactant Cloud Point from Molecular Structure (A.R. Katritzky, P.D.T. Huibers, D.O. Shah) J. Colloid and Interface Sci., 1997, 193, 132.
91. Prediction of Melting Points for the Substituted Benzenes: A QSPR Approach (A.R. Katritzky, U. Maran, M. Karelson, V.S. Lobanov) J. Chem. Inf. Comput. Sci., 1997, 37, 913.
92. QSPR Treatment of the Unified Nonspecific Solvent Polarity Scale (A.R. Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1997, 37, 756.
93. QSAR calculation programme (T. Laszlo) Revista de Chemie, 1997, 48, 676.
94. CODESSA version 2.13 for Windows (O. Ivanciuc) J. Chem. Inf. Comput. Sci., 1997, 37, 405.
95. Prediction of Critical Micelle Concentration Using a Quantitative Structure-Property Relationship Approach. 1. Nonionic Surfactants (A.R. Katritzky, P.D.T. Huibers, V.S. Lobanov, D.O. Shah, M. Karelson) Langmuir, 1996, 12, 1462.
96. Correlation of Boiling Points with Molecular Structure. 1. A Training Set of 298 Diverse Organics and a Test Set of 9 Simple Inorganics (A.R. Katritzky, L. Mu, V.S. Lobanov, M. Karelson) J. Phys. Chem., 1996, 100, 10400.
97. Quantum-Chemical Descriptors in QSAR/QSPR Studies (M. Karelson, V.S. Lobanov, A.R. Katritzky) Chem. Rev., 1996, 96, 1027.
98. Prediction of Polymer Glass Transition Temperatures Using A General Quantitative Structure-Property Relationship Treatment (A.R. Katritzky, P. Rachwal, K.W. Law, M. Karelson, V.S. Lobanov) J. Chem. Inf. Comput. Sci., 1996, 36, 879.
99. A QSPR Study of the Solubility of Gases and Vapors in Water (A.R. Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1996, 36, 1162.
100. Comprehensive Descriptors for Structural and Statistical Analysis. 1 Correlations Between Structure and Physical Properties of Substituted Pyridines (A.R. Katritzky, V.S. Lobanov, M. Karelson, R. Murugan, M.P. Grendze, J.E. Toomey Jr.) Rev. Roum. Chim., 1996, 41, 851.
101.
QSPR: The Correlation and Quantitative Prediction of Chemical and Physical Properties from Structure (A.R. Katritzky, V.S. Lobanov, M. Karelson) Chem. Soc. Rev., 1995, 279.
102. Predicting Physical Properties from Molecular Structure (A.R. Katritzky, R. Murugan, M.P. Grendze, J.E. Toomey, Jr, M. Karelson, V. Lobanov, P. Rachwal) Chemtech., 1994, 24, 17.
103. Prediction of Gas Chromatographic Retention Times and Response Factors Using a General Quantitative Structure-Property Relationship Treatment (A.R. Katritzky, E.S. Ignatchenko, R.A. Barcock, V.S. Lobanov, M. Karelson) Anal. Chem., 1994, 66, 1799.