Comprehensive Descriptors

advertisement
UNIVERSITY OF FLORIDA
CODESSA PRO
User’s Manual
2005
Table of contents
CHAPTER 1 ...................................................................................................................... 6
INSTALLING CODESSA PRO ...................................................................................... 6
1.1 SYSTEM REQUIREMENTS: ................................................................................... 6
1.2 INSTALLATION OF THE PROGRAM .................................................................. 7
CHAPTER 2 ...................................................................................................................... 8
DESCRIPTION OF CODESSA PRO ............................................................................. 8
2.1 CONCEPTS AND DEFINITIONS ............................................................................ 8
Objects: ....................................................................................................................... 8
Storage: ....................................................................................................................... 8
Snapshot (Cache File): ................................................................................................ 8
Workspace: ................................................................................................................. 8
Folder: ......................................................................................................................... 8
Structure: ..................................................................................................................... 9
Descriptors: ................................................................................................................. 9
Property: ...................................................................................................................... 9
Descriptor/Property Matrix: ........................................................................................ 9
The Targets of the Analysis: ..................................................................................... 10
Correlation: ............................................................................................................... 10
List: ........................................................................................................................... 10
2.2 CODESSA PRO VISUAL INTERFACE ............................................................... 11
2.2.2 WORK AREA ........................................................................................................ 14
2.2.2.1 Structure View Window ................................................................................... 14
2.2.2.2 Correlation View Window ................................................................................ 15
2.2.3 OBJECT WINDOW .............................................................................................. 16
2.2.4 LOG WINDOW ..................................................................................................... 17
CHAPTER 3 .................................................................................................................... 18
USING CODESSA PRO TO RUN A PROJECT......................................................... 18
3.1 STARTING A NEW PROJECT ............................................................................. 18
3.2 CREATING STRUCTURE FILES ......................................................................... 18
Molfiles ..................................................................................................................... 18
SCF files.................................................................................................................... 19
Force Files ................................................................................................................. 19
2
3.3 CREATING A PROPERTY FILE .......................................................................... 19
3.4 CREATING A STORAGE....................................................................................... 20
3.5 PREPARING DATA IN A STORAGE .................................................................. 20
3.6 CALCULATION OF MOLECULAR DESCRIPTORS ....................................... 23
3.7 CALCULATION OF THE QSAR/QSPR CORRELATIONS ............................. 24
3.7.1 MLR ........................................................................................................................ 24
3.7.2 BMLR...................................................................................................................... 24
3.7.3 HMPRO .................................................................................................................. 25
3.7.4 PLS .......................................................................................................................... 29
3.7.5 PCR ......................................................................................................................... 30
3.8 VIEWING CORRELATIONS ................................................................................ 31
3.9 PRINTING OUT RESULTS.................................................................................... 31
3.10 MANIPULATING LISTS ...................................................................................... 31
CHAPTER 4 .................................................................................................................... 33
CODESSA PRO DESCRIPTORS ................................................................................. 33
4.1 Constitutional Descriptors ...................................................................................... 39
4.2 Topological Descriptors .......................................................................................... 40
4.3 Geometrical Descriptors ......................................................................................... 46
4.4 Electrostatic Descriptors ......................................................................................... 50
4.5 CPSA Descriptors ................................................................................................... 54
4.6 MO Related Descriptors ......................................................................................... 68
4.7 Quantum Chemical Descriptors .............................................................................. 71
4.8 Thermodynamic Descriptors ................................................................................... 77
CHAPTER 5 .................................................................................................................... 80
METHODS OF QSAR/QSPR MODEL DEVELOPMENT ....................................... 80
5.1 Multilinear Regression ............................................................................................ 80
5.2 Selection of descriptors ........................................................................................... 82
5.3 Multivariate methods .............................................................................................. 85
5.4 Test of the results .................................................................................................... 87
5.5 External validation set............................................................................................. 87
5.6 Leave-One-Out crossvalidation .............................................................................. 88
5.7 Leave-Many-Out crossvalidation............................................................................ 89
3
5.8 Randomization test.................................................................................................. 89
4
INTRODUCTION
CODESSA (Comprehensive Descriptors for Structural and Statistical
Analysis) PRO is a comprehensive program for developing quantitative
structure-activity/property relationships (QSAR/QSPR) by integrating all
necessary mathematical and computational tools to
(i) calculate a large variety of molecular descriptors on the basis of
the 3D geometrical structure and/or quantum-chemical wave function of
chemical compounds;
(ii) develop (multi)linear and non-linear QSPR models on the
chemical and physical properties or biological activity of chemical
compounds;
(iii) carry out cluster analysis of the experimental data and
molecular descriptors;
(iv) interpret the developed models;
(v) predict property values for any chemical compound with
known molecular structure.
This manual provides further information concerning the features and methods
available in the CODESSA PRO program, their purpose, and the interpretation of the
results. In addition to the overall description of the program, this manual gives the
guidelines for the development of a QSAR/QSPR models using the CODESSA PRO
program.
The CODESSA PRO program is designed to operate in the following
Microsoft Windows environments: Windows 2000, Windows XP, Windows
NT and Windows 9x.
5
Chapter 1
Installing CODESSA PRO
1.1 System requirements:
Hardware:
Processor: Pentium class systems - minimum. All processors developed hereafter
by Intel Corp. are supported on the assembly level optimization.
All AMD current processors work as old Pentium with higher clock
freqency (no special optimization).
Memory: 128MB minimum, 256MB default tuning.
CD-ROM: A CD-ROM or a compatible DVD device is required to install
CODESSA PRO software.
Other:
A 3D graphics accelerator is highly recommended because of extensive
use of OpenGL for presentation of molecules.
2-button mouse is required.
Software:
Operation system: Windows XP Professional, Windows XP Home, Windows 2000,
Windows NT (with limitation on using network drives) or later versions
Operation system extensions: Internet Explorer 4.0 SP2 or newer version for
authorization.
6
1.2 Installation of the program
1. Insert the CODESSA PRO CD in the CD-ROM drive of your computer.
2. If the Windows Autorun feature is turned on, the installation options will be
displayed automatically. If the installation options do not appear, open the
Windows Start Menu and choose Run. The RUN dialog box opens.
3. Enter the following command in the open text box (assuming D: is your CDROM:
D:\setup.exe
4. Choose OK. The CODESSA PRO setup program initializes and the CODESSA
PRO setup dialog box opens.
5. Follow the on screen instructions for installation.
The CODESSA PRO setup program will detect an existing CODESSA PRO
version in your computer and offers the choice of uninstalling or keeping the older
version of the program.
7
Chapter 2
Description of CODESSA PRO
2.1 Concepts and definitions
The use of CODESSA PRO requires understanding of the following concepts:
Objects:
These are the individual files or items used by CODESSA PRO. There are four
types of objects: structures, descriptors, properties, and models. Using the CODESSA
PRO program will frequently require manipulation of single objects or lists of objects.
Most objects (e.g. descriptors and models) are named by CODESSA PRO explicitly.
Storage:
All items that are connected with a project are stored in a single location within
the file system - in the storage. To change a storage location, click Option on the main
menu-bar and select Storage. The dialogue box that opens allows the user to control the
storage location of files related to chemical structures, QSAR/QSPR correlations,
molecular descriptors, lists of objects, etc.
Snapshot (Cache File):
The snapshot (CODESSA PRO cache file) is a binary (compressed)
representation of all items in storage. This file is located in RAM whenever work is being
performed on the CODESSA PRO Visual Interface (CVI) module. The cache file reloads
each time the CVI detects a change in the storage contents. The snapshot can be reloaded
manually at anytime by opening the pull-down menu Calculate on the menu-bar, then
selecting Load storage, or by using F5 keyboard shortcut.
Workspace:
The workspace is a window within the CVI that is related to the snapshot. It is
displayed in the workspace area on the top left side of the CVI frame.
Folder:
A folder is used to store lists of certain type of objects. CODESSA PRO uses four
main folders, for the chemical structures, molecular descriptors, properties, and
QSAR/QSPR models (correlations), respectively. Each of the first three folders has a
system-generated list named All. These lists will contain all structures, descriptors or
properties, respectively, for the current session. The folder Descriptors also includes
system-defined descriptor lists according to the descriptor group and type. A particular
descriptor item may be present in several lists. The folder Correlations contains lists for
8
each property item in the property folder. The list is generated automatically by the
system when a property item is created. When correlations are calculated, 50 (by default)
correlation items with the best statistical characteristics will be stored for any single
property.
Structure:
A structure is a representation of an individual chemical object having a precise
chemical constitution. Examples of structures include a single molecule, a monomeric
unit of a polymer, or a molecular cluster with a definite composition. The minimum
information for a structure must include the types of atoms involved and their
connectivity. Each structure is linked to three files containing:
(1) information about 3-D structure of the molecular structure in the format of
MDL molfile,
(2) information obtained from the quantum chemical self-consistent field
molecular orbital (SCF MO) calculations using MOPAC program package,
(3) information obtained from the quantum chemical force calculations using
MOPAC program package.
The molfile structures are stored in the 3D MOL(MOPAC) Subfolder, SCF MO output
files are stored in the SCF Subfolder, and the output files of the quantum chemical force
calculations are stored in the THERMO Subfolder. SCF and Force structures will be
created and properly stored automatically by the CMol3D module The name for a
particular structure is exactly the same in each directory. Structure names are in the form
Sddddddd.xxx and are generated automatically in sequential order by CODESSA PRO.
Descriptors:
Descriptors are defined as numerical characteristics associated with chemical
structures. These are derived proceeding from the chemical constitution, topology,
geometry, wave function, potential energy surface or some combination of these items for
a given chemical structure. The values of a particular descriptor can be provided
externally by the user or calculated automatically by the CODESSA PRO program. Each
descriptor value must be associated with a previously defined structure. Descriptors
calculated by CODESSA PRO have specific pre-defined names; renaming of the
descriptors is not recommended.
Property:
Property is a physical or chemical characteristic, biological activity, or other
characteristic of interest. Each property value must be associated with a structure located
in the 3D MOL(MOPAC) Subfolder.
Descriptor/Property Matrix:
The descriptor/property matrix consists of descriptor (all columns except the last)
and property (the last column) values. The horizontal dimension of the matrix is the
descriptor/property ID sequence and the vertical dimension is the structure ID sequence.
9
The matrix has two (binary and text) presentations. The binary presentation is used within
the CODESSA PRO program, while the text presentation in the CSV format can be used
for the data transport to for other software packages
The Targets of the Analysis:
Any property, list of structures or descriptors can be set up for a certain workflow
within the CODESSA PRO treatment. The default targets for a new snapshot file are:
Object
Property
List of descriptors
List of structures
Default value
First selected
All group
All group
The selected targets are used in the formation of the descriptor/property matrix.
Correlation:
A correlation represents the results of a (multi)linear regression between a property
of interest ( y ) and one or more selected descriptors ( xi ).
y  a0   ai xi
The information stored for each correlation includes
- regression coefficients ( ai )
- correlation coefficient (R2)
- squared standard deviation (s2)
- Fisher criterion F value
for the set of structures used in the derivation of the correlation. By default, each
correlation is named by its number of chemical structures involved (N), correlation
coefficient (R2), crossvalidated correlation coefficient (R2CV), Fisher criterion value
(F), and squared standard deviation (s2).
List:
A list is a collection of chemical structures, descriptors, physical and chemical or
other property of interest, or models (correlations). Lists can be either system type or user
type. System lists cannot be deleted by the user.
10
2.2 CODESSA PRO Visual Interface
O
Obbjjeecctt
Figure 1. The CODESSA PRO Visual Interface
The CODESSA PRO Visual Interface has the following screen areas:
Menu Bar: Menus related to different tasks performed by CODESSA PRO
Tool Bar: Shortcuts to commonly performed tasks.
Workspace: An on-screen presentation of the current cache file.
Work Area: The screen space for various view windows.
Object Window: The object window depicts information on almost all objects
(structures, properties, descriptors, correlations).
Log Window: Protocol of all operations on the storage.
11
2.2.1 Workspace
The CODESSA PRO workspace area (Fig. 2)
contains information on the current cache file in the
form of a directory tree. The folders for different
objects, i.e. chemical structures, molecular
descriptors, properties, and QSAR/QSPR models are
included in this tree.
Figure 2. CODESSA PRO Workspace area
The Structures folder (Fig. 3) includes
always a subfolder All. All structures defined in the
given cache file are listed there. The name that is
listed will be the same as the name given on the first
line of the corresponding molfile. If this line is empty,
the CODESSA PRO name of the molfile (e.g.
S000000x) will be used instead. Double clicking on
the name of a compound in this list opens a structure
view window in the work area, depicting the 3-D
structure of this compound.
A selection of compounds made by the user
may be stored in a separate subfolder.
Figure 3. The Structures folder
The folder Descriptors (Fig. 4) includes
always a subfolder All that lists all the descriptors
defined within the given cache file. The descriptors
transferred to the CODESSA PRO externally by user
are listed in the subfolder External. The descriptors
calculated by CODESSA PRO are listed in the
respective subfolders (Constitutional, Topological,
Geometrical, Charge-related, Semi-empirical,
Thermodynamical, Molecular type, Atomic type,
Bond type). The currently used subfolder is given in
boldface font.
12
Figure 4. The Descriptors folder
The folder Properties (Fig. 5) includes
always a subfolder All that gives the list of all
properties considered in the given cache file. The
currently treated property is given in the boldface
font. The name of each property is given on the first
line of the respective property file (P000000x.prp).
Figure 5. The Properties folder
The folder Models (Fig. 6)
includes the list of all QSAR/QSPR
models stored within the current cache
file. A separate subfolder is created
automatically for each property. The
details related to each correlation are
displayed when the pointer is held over
the respective line. Double clicking on
a correlation launches correlation
view in the work area.
Figure 6. The Models folder
13
2.2.2 Work Area
2.2.2.1 Structure View Window
Figure 7. The CODESSA PRO Structure View Window
Double clicking on the name of a structure in the Structures folder opens a
structure view window (Fig. 7) in the work area. The 3-D structures can be viewed as
- wire-frame
- ball and stick
- CPK (Corey-Pauling-Koltun) surface
- solvent accessible surface (SASA)
- Zefirov’s charges on SASA
- MOPAC charges on SASA
Right clicking inside the window opens the view and label context pop-up menu.
Selecting view provides the six options mentioned above, and the label identifies the
number or the type of each atom. When the pointer is positioned over the view window,
it changes to a 4-sided arrow that can be dragged to rotate the structure around its mass
center.
14
2.2.2.2 Correlation View Window
Figure 8. The CODESSA PRO Correlation View Window
Double clicking on a correlation object in the Models folder opens the
correlation view window (Fig. 8) in the work area. The graph appearing there shows the
observed (experimental) property values versus the predicted values for this
QSAR/QSPR correlation model. When the window is first opened, all the points for
outliers will be blue rather than yellow. Single click on a point will display its properties
and their associated values in the properties window. Double clicking on a point will
open in the work area a structure view window, for the structure that corresponds to
that point. Right clicking on the window will open a context pop-up menu that enables to
show structures, show descriptors, create Sublists, mark outliers and mark nonoutliers as desired.
Double clicking on a descriptor object will also open a correlation view window.
The window will show the linear relationship between the descriptor and the property
dimension selected. The value predicted according to any model chosen maybe viewed
by selecting that model, clicking the view pull-down menu, and then choosing predicted
properties. This will display the observed property values, predicted property value
and errors.
15
2.2.3 Object window
The object window displays the information about the objects, folders or lists
selected in workspace. If a folder or list is selected in the workspace, the number of
objects contained is displayed. If an object is selected, the object window will provide
varying information, depending on the type of object.
The picture on the left (Fig. 9) shows the
information given in the object window
when a structure object is selected in the
workspace. It will also show the
experimental value and calculated value of
the property of the current structure, which
are not shown in the picture now.
Figure 9. Data on structure in object window
The information provided in the object
window when a descriptor object is
selected in the workspace is presented
on Fig. 10.
Figure 10. Data on descriptor in object window
When a property object is selected in the
workspace, information related to the
respective .prp file is displayed in object
window (Fig. 11). This includes the name
of the property, comments and the
number of structures for which the given
property values are available.
Figure 11. Data on property in object window
The object window for correlation objects
lists the details of the correlation (Fig. 12).
Details displayed include the properties used
in the correlation.
16
Figure 12. Data on correlation in object window
2.2.4 Log Window
The log window keeps the protocol on each operation that has been performed on
the storage. Double clicking on a correlation in the workspace will also give the full
description of the correlation in this window (Fig. 13). The text can be cut and pasted
directly into a word processing program.
Figure 13. The information on a correlation object in the log window.
The edit pull-down menu has options to select all log, copy from log, paste to
log and clear all log. Transferring the information for several correlations to a word
processing program can be accomplished by clearing all log, double clicking on each of
the correlations, then clicking select all log, and finally copy from log.
17
Chapter 3
Using CODESSA PRO to run a project
3.1 Starting a new project
It is important to decide whether to use a general storage area for several projects
or separate storage areas for each individual project. A general storage area allows a
research group to avoid repeating the calculations of molecular descriptors as well as to
economize the amount of storage space used. If a general storage area is to be used, it
will be necessary to keep the global index of the files and chemical structure and property
names. The names of the chemical structures are defined in the name section of the
respective molfile. The name of the property investigated is defined in the first line of the
property file (.prp).
3.2 Creating structure files
Molfiles
Structures can be created using any chemical drawing program that will convert
2-D structures to 3-D structures, e.g. HyperChem [www.hyper.com], ISIS Draw
[www.mdl.com],
ChemDraw
[www.cambridgesoft.com],
or
PCModel
[www.serenasoft.com]. The 3-D geometries must be optimized in a two-step process
using
MOPAC
software
[http://ccl.net/cca/software/SOURCES/FORTRAN/mopac7_sources/index.shtml].
The
first optimization uses the keywords: AM1 and the second optimization uses the
keywords: AM1 GNORM=0.01 PRECISE. The additional keyword MMOK should be
used for any molecule with peptide linkages. If either optimization fails, it should be
repeated using additional keywords until successful. The following list contains the
progression of additional keywords to add.
EF
XYZ
EF XYZ
EF XYZ GEO-OK
If none of the keywords or combinations listed above lead to a complete
optimization of the structure, a different starting geometry described by the respective Zmatrix has to be used. Note that dummy atoms used for keeping the symmetry of the
structure must be deleted prior to performing SCF and thermodynamic calculations.
The optimized structures should be converted to molfile format, given the
extension .mol, and stored in the mol3dmop directory of the CODESSA PRO storage.
18
SCF files
The SCF file is the output file of the MOPAC calculation for a given chemical
structure (molecule, molecular cluster) using the following keywords: AM1 VECTORS
BONDS PI POLAR PRECISE ENPART EF.
The files should have an extension .mno and stored in the mopscf directory of the
CODESSA PRO storage.
Force Files
The Force file is the output file of the MOPAC calculation for a given chemical
structure (molecule, molecular cluster) using the following keywords: AM1 FORCE
PRECISE THERMO ROT=1.
The files should also have an extension .mno and stored in the moptherm
directory of the CODESSA PRO storage.
3.3 Creating a property file
Once the property values of interest have been collected and carefully checked for
each structure, the data must be assembled into a property (.prp) file. It is essentially a
text file. Spreadsheet programs may be used to simplify the process of assembling data,
but the file should be therefore saved as text. The line format of the file must be exactly
as follows:
1st line
2nd line
3rd line
….
….
Name of property
Comments
Structure1 Value1
Structure2 Value2
Structure3 Value3
The individual lines have the following content:
Name of property: The name that is assigned to the property object. The object will
appear within one or more lists in the property folder. A list with same name will also be
generated automatically by CODESSA PRO and stored in the correlations folder.
Comments: Any comments written in this location will be listed in the comments section
of the object window whenever the property is selected.
Structure1: The number of the first structure (not the filename). Zeros in this number are
ignored; the structure S0000023.mol can be listed as 00023, 023, or just 23.
Value1: The property value associated with Structure1. There must be a delimiter
(space or tab) between the structure number and the property value. The type or number
of delimiters is irrelevant.
19
On the CODESSA PRO window, click the edit pull-down menu and choose add
property. Then input the name, comment, and property values according to the format
which is indicated in the notepad, and, finally, use Load storage from Calculate menu to
load the data.
3.4 Creating a storage
First of all, the user has to create the main (root) folder for a given project using
any of Windows options for this procedure (e.g. Windows Explorer). To create storage
for a new project, this folder has to be specified using the storage function in the option
pull-down menu. After clicking OK in this window, the project folder will contain a list
of subfolders, an example of which appears in the following picture (Fig. 14). The user
may rename all subfolders or keep their default names shown in the shaded area.
Figure 14. Definition of the project storage
All the storage subfolders should be empty before starting the project calculations.
In the first step of filling the storage, the previously created molfiles of all chemical
structures related to the project should be stored either in the MOL subfolder or in
another specified location. Regardless of the original naming of molfiles, CODESSA
PRO, will automatically name the files as Sddddddd.mol, where d stays for sequential
numbers.
3.5 Preparing data in a storage
After saving all molfiles for the chemical structures in the MOL Subfolder,
selecting Load storage from Calculate menu automatically prepares all the data that are
necessary for carrying out the correlation analysis. If the location of molfiles is different,
20
then it can be included by the selection of the edit pull-down menu and choosing there
add structure function.
The process through which CODESSA PRO prepares all the data, except property
files, includes 11 stages. Each of these 11 stages is performed automatically in sequence
by CODESSA PRO.
Stage 1
This is the first model building stage. The program applies simple molecular
mechanics preoptimization and transfers the 2-dimensional structures from the MOL
subfolder to the 3-dimensional structures that will be stored in the 3D MOL (Molecular
Mechanic) subfolder. Missing hydrogen atoms are added according to the formal
valence rules. The 3D structure building is done deterministically for the first iteration
and stochastically for subsequent iterations.
Stage 2
This stage is concerned with the conversion of the file from MDL molfile format
in the 3D MOL (Molecular Mechanic) subfolder into the MOPAC input file format for
preliminary geometry optimization in the MOPAC Optimization (step1) subfolder.
Stage 3
In this stage, the preliminary geometry optimization of the molecular structures
using CMOPAC (MOPAC Version 7 clone) is carried out. If the final optimization
gradient is less than 5.0, the program automatically proceeds to Stage 4. If the gradient is
more than 5.0, then the program automatically go back to Stage 1 and starts the structure
building again, stochastically. This process is continued up to 10 times until the
optimization stage is passes on the criteria of the gradient value being less than 5.0. If a
satisfactory result is not achieved after 10 iterations, the program stops.
Stage 4
During this stage, the program converts the MOPAC output files in the MOPAC
Optimization (step1) subfolder into a MOPAC input files for precise geometry
optimization, and stores them in the MOPAC Optimization (step2) subfolder.
Stage 5
CODESSA PRO is performing precise geometry optimization at this stage. When
the geometry optimization gradient value falls below 0.5, the program proceeds to Stage
6. If the minimum gradient achieved exceeds 0.5, then the program goes back to Stage 1
and does structure building again, stochastically. This process is continued up to 10
iterations until the gradient value test for the precise optimization is satisfied. If 10
iterations are unsuccessful, the calculation is stopped.
Stage 6
At this stage, the CODESSA PRO program converts the MOPAC output files in
the MOPAC Optimization (step2) Subfolder into MOPAC input files for calculation of
molecular properties. This is done without explicit calculation of the Hessian matrix. The
resulting MOPAC input file is stored in SCF Subfolder.
21
Stage 7
At this stage, the CODESSA PRO program produces a set of molecular data
calculated using CMOPAC that are used subsequently as the molecular descriptors or as
the initial data for the calculation of descriptors. The resulting output files are stored in
the SCF Subfolder.
Stage 8
At this stage, CODESSA PRO program takes the data on molecular geometries
stored in the MOPAC Optimization (step2) Subfolder, prepares MOPAC input files for
force calculations, and stores them in the THERMO Subfolder. At this stage keyword
ROT=1 is added, which is valid for C1, CI, and CS groups of symmetry. However, if
molecular symmetry is different from these point groups, the respective keyword should
be edited manually in the corresponding input (.mni) file in MOPAC Optimization
(step1) Subfolder or MOPAC Optimization (step2) Subfolder.
Stage 9
At this stage, CODESSA PRO program calculates the molecular properties using
the Hessian matrix (“force” calculation). If a molecular structure is in a transition state,
then the program goes back to Stage 1 and restarts by using stochastic structure
generation. If the molecule is in its ground state, then the program proceeds to Stage 10.
The program terminates after 10 attempts to find a stable structure.
Stage 10
This stage will be reached only when all the geometry tests described above are
satisfied. At this stage, the molfiles in the 3D MOL (MOPAC) Subfolder are created,
based on formal connectivity information from the respective molfiles in the MOL
Subfolder, and atomic coordinates from the MOPAC output file (.mno) in the MOPAC
Optimization (step2) Subfolder.
Stage 11
At this stage, CODESSA PRO has all necessary information to proceed to the
calculation of molecular descriptors. This information is stored in molfiles in the 3D
MOL (MOPAC) Subfolder, in the MOPAC output files (.mno) in the SCF Subfolder
and in the MOPAC output files (.mno) in the THERMO Subfolder All intermediate files
are deleted from storage at this stage. Only MDL MOL files in the MOL Subfolder and
the 3D MOL (MOPAC) Subfolder, and MOPAC input and output files in the SCF
Subfolder and THERMO Subfolder remains.
If the calculation cycle is not finished at Stage 1, changes can be made to the files
at the last successful stage manually (usually it is editing of MOPAC keywords) and the
calculations restated. If four files (MDL MOL files in the MOL Subfolder and the 3D
MOL (MOPAC) Subfolder, and the MOPAC output files in the SCF Subfolder and the
THERMO Subfolder) are present and up-to-date, no further calculation will be done.
The up-to-date state is defined using the modification time of the files. In case it is not,
the calculations will be processed starting from the most recently corrected file. To
22
invoke the recalculation, select from the menu Calculate/Load Storage (F5) or
Calculate/Descriptors (F6). In the last case, only problematic structures which show in
MOPAC Optimization (step1) Subfolder and MOPAC Optimization (step2)
Subfolder will be recalculated. If the problems arise as a result of a structure’s improper
format, you must redraw the structure in the MOL Subfolder.
Before proceeding further, it is advisable to check that all subfolders in the
storage are empty except the following:
1. MOL Subfolder- this subfolder should contain molfiles of all structures.
2. SCF Subfolder- this subfolder should contain only .mni and .mno files of all
structures.
3. THERMO Subfolder- this subfolder should contain only .mni and .mno files of
all structures.
4. 3D MOL (MOPAC) Subfolder- this subfolder should contain only molfiles of all
structures which are different from the molfiles stored in the MOL Subfolder.
3.6 Calculation of molecular descriptors
One of the major advantages of CODESSA is its large pool of molecular
descriptors, which are calculated for each chemical structure listed in the property (.prp)
file. The descriptors are divided into several lists. The first list contains external
descriptors, added by the user manually. The remaining lists can be divided into two
categories. The first category includes the lists classified according to the origin of
calculation of descriptors. It contains therefore the following lists: constitutional,
topological, geometrical, charge related, semi-empirical, and thermodynamic. The
second category of descriptor lists is based on their classification according to the
sub(molecular) unit they refer to. Therefore, the second category contains the following
lists: molecular type, atomic type, and bond type. The more detailed information on
descriptors available in the program is given in Chapter 4.
Descriptors are automatically calculated for all structures added to the storage.
Calculation of quantum chemical or thermodynamical descriptors can be turned off by
choosing Descriptor calculation… from the Option menu and selecting the respective
options in the Descriptor calculation dialog panel (Fig. 15).
23
Figure 15. Descriptor calculation options
To recalculate the descriptors linked to the currently selected property, choose
Descriptors from the Calculate menu.
3.7 Calculation of the QSAR/QSPR correlations
The structures, descriptors, and properties of interest should be properly selected
(they should all appear in boldface font in the workspace). If not, the right click on the
name of structures, descriptors or property shows the respective pop-up menu, where the
necessary items can be selected by using the function make current and highlighting
them afterwards. The method used for the development of QSAR/QSPR models can be
chosen by activating the calculate pull-down menu and selecting the method and its
limitation characteristics. The correlation will be finished automatically by CODESSA
PRO. In order to calculate only descriptors the function descriptors has to be selected in
the calculate pull-down menu (or use F7 keyboard shortcut). The selection of the
function form matrix in this menu creates the property-descriptor data matrix.
3.7.1 MLR
MLR (multilinear regression) is the simplest method that builds a single correlation using
all selected descriptors. To use the method, select MLR from the calculate pull-down
menu. MLR method has no options.
3.7.2 BMLR
The BMLR command from the calculate pull-down menu starts search for the multiparameter regression with the maximum predicting ability using best multilinear
regression methodology.
Figure 16. BMLR options
24
Control parameters of the method can be adjusted in the Options dialog box that can be
opened by BMLR… command from the Option pull-down menu (Fig. 16). The
following criteria can be flexibly altered in order to obtain the maximum effectiveness of
the search:
- the minimum improvement of the linear correlation coefficient R2 after adding a
new descriptor (default value 0.01);
- the upper limit of the square of the linear correlation coefficient R 2 for two
descriptor scales to be considered orthogonal (default value 0.1);
- the lower limit of the square of the linear correlation coefficient R2 for two
descriptor scales to be considered non-collinear (default value 0.6);
- the lower limit of the square of the linear correlation coefficient R2 for 3parameter correlation (default value 0.6);
- increment of the lower limit of the square of the linear correlation coefficient R 2
for higher order (more than 3) correlations (default value 0.02).
3.7.3 HMPRO
HMPRO descriptor selection procedure can be started by using the BMLR command
from the calculate pull-down menu. HMPRO parameter selection dialog can be opened
by selecting the HMPRO… command from Option menu.
Figure 17. HMPRO weight parameters
The parameters in the Figure 17 correspond to the exponentials in the general fitness
function (correlation weight) defined as follows:
25
w  ( R 2 ) xa F x2 n x3 ( s 2 ) x4 N x5 ,
where R2 is a squared correlation coefficient, F is Fisher criteria, n is number of the data
points, s2 is standard error (non-normalized), and N is number of the descriptors in the
correlation. The xi denote the weight parameters and have the following default values
{1, 1, 1, -1, -1 }.
Figure 18. HMPRO preselection options for 1-parameters correlations
Figure 19. HMPRO preselection options for 1-parameter correlations
26
In the windows of Figures 18 and 19, the minimum allowed values of the statistical
characteristics for the single-parameter correlations can be established. Zero values mean
that the respective criterion is not applied.
Figure 20. HMPRO forming parameters for 2-descriptor correlations
In the window of Figure 20, the criteria for the ordering the 2-parameter correlations are
established. This can be one of the following statistical characteristics:
R2 – squared correlation coefficient
F – Fisher criterion
s2 – squared standard error
R2cvOO – one-leave-out cross-validated correlation coefficient
w – correlation weight
In addition, the upper limits of the intercorrelation coefficient between the two
descriptors and the criterion for including the two-parameter correlations into the
correlations pool are set up.
27
Figure 21. HMPRO expanding parameters
In the window of Figure 21, the following parameters for the development of the
successive multiparameter correlations are set up:
- the size of the correlations pool;
- the maximum number of descriptors involved in the models developed;
- the maximum number of iterations for finding the best correlations;
- the time limit for the development of the correlations.
Figure 22. HMPRO models’ testing parameters
28
In the window of Figure 22, the parameters for the cross-validation testing of the
correlations are determined. Those include the size of the training set for leave-many-out
crossvalidation (as a fraction of the full set), and the maximum numbers of the
crossvalidation and the randomization tests.
Figure 23. HMPRO output parameters
In the window of Figure 23, the following parameters are determined for the correlations
to be included in the output:
- maximum intercorrelation coefficient between the descriptors involved in the
given correlation;
- maximum allowed difference between the multiple correlation coefficient and the
one-leave-out correlation coefficient for a given correlation (as the percentage of
the value of the squared multiple correlation coefficient);
- maximum allowed difference between the multiple correlation coefficient and the
many-leave-out correlation coefficient for a given correlation (as the percentage
of the value of the squared multiple correlation coefficient);
- maximum allowed fraction of the randomization tests:
- maximum number of correlations of the given size (number of descriptors
involved).
3.7.4 PLS
PLS (partial least squares) options dialog can be opened by PLS… command from the
Option pull-down menu (Fig. 24) and the respective method started by PLS command
from the Calculate pull-down menu.
29
Figure 24. PLS options
PLS is performed on centered and scaled variables by NIPALS algorithm that is stopped
when one of the following is true:
- residual sum of squares is less than its preset minimum value (default is 0.01);
- maximum number of components has been reached (default is 5 components).
3.7.5 PCR
PCR (principal components regression) options dialog can be opened by PCR…
command from the Option pull-down menu (Fig. 25) and the respective calculation
started by PCR command from the Calculate pull-down menu.
Figure 25. PCR options
PCR analysis is conducted the same way as BMLR except that principal components are
used instead of descriptors. The “PCA analysis” part of the option selection panel allows
to select the descriptor scaling method (default is centered and normalized), PCA analysis
method (default is “real error”) and method’s convergence criterion (default is 0.01).
Options in “Best regression” part correspond to BMLR parameters described in section
3.7.2.
30
3.8 Viewing correlations
After the QSAR/QSPR model development is completed and the best correlations
found, those can be viewed in the correlation view window (section 2.2.2.2)
3.9 Printing out results.
The standard print function from the file menu bar is applicable to print the
correlation plot from the work area.
3.10 Manipulating lists
CODESSA PRO automatically produces one or more lists in each folder, but for
the analysis of a property or correlation, it is sometimes helpful to create user-defined
lists, e.g. a list of structures containing phenyl rings. The simplest method for creating a
list in CODESSA PRO is to select several objects from one or more existing lists (i.e. all)
while holding down the CTRL key, right clicking to open the context pop-up menu and
left clicking on create list. A list can also be formed from a group of objects that are
linked to a common object. If a descriptor is chosen to be the common object, then all the
structures that are defined for that descriptor could be selected for a list. To create the list
from a common object, the function Select Linked in the context pop-up menu enables
the selection of the type of objects for this list. The AND operation allows to create a list
from multiple common objects.
31
Figure 26. List logic of CODESSA PRO
The List Logic module (Fig. 26) in CODESSA PRO combines the objects from
two lists and can be opened by clicking any object, right clicking to show the context
pop-up menu, and choosing lists logic, or just using the F4 keyboard shortcut instead.
Choose the type of object that will be contained by the new list by clicking on the
appropriate button in the Sub-List type box (descriptors is selected in the example
above). Choose the two lists to combine and the appropriate Operation. The objects from
the two lists that meet the criteria of the logical operation will be displayed in the Results
box. The results can then be copied to a new list (Create) or be selected (Select) for
further comparison.
32
Chapter 4
CODESSA PRO Descriptors
CODESSA PRO molecular descriptors are divided into several classes, depending
on their origin of calculation or on the structural item in the chemical structure (molecule,
atom or bond). The respective classification is given in Table 1. It is followed by the
formulation of each descriptor or descriptor class.
Table 1. CODESSA PRO Molecular Descriptors
Group
Type
Notation
Full name
1
Constitutional
M
NA
2
Constitutional
A
NX,NX,r
absolute and relative numbers of atoms of
certain chemical identity (C, H, O, N, F,
etc.) in the molecule
3
Constitutional
M
NY,NY,r
absolute and relative numbers of certain
chemical groups and functionalities in the
molecule
4
Constitutional
M
NB
5
Constitutional
M
6
Constitutional
M
NR,NR,r
total number of rings, number of rings
divided by the total number of atoms
7
Constitutional
M
NR6,NR6,r
total and relative number of 6-atoms
aromatic rings
8
Constitutional
M
M, Mr
molecular weight and average atomic
weight
9
Topological
M
W
Wiener index
10
Topological
M

Randi 's molecular connectivity index
11
Topological
M

Randi indices of different orders
12
Topological
M
J
Balaban's J index
13
Topological
M
v
Kier
total number of atoms in the molecule
total number of bonds in the molecule
NS,ND,NT, absolute and relative numbers of single,
double, triple, aromatic or other bonds in
NS,r,ND,r,NT,r
the molecule
m
m
33
and
Hall
valence
connectivity
indices

Kier shape indices

Kier flexibility index
IC
Mean information content index
SIC
Structural information content index
k
CIC
Complementary
index
k
BIC
Bonding information content index
M
TnE
Topological electronic indices
Geometrical
M
SM
Molecular surface area
22
Geometrical
M
SSA
Solvent-accessible molecular surface area
23
Geometrical
M
VM
Molecular volume
24
Geometrical
M
VM,SE
Solvent-excluded molecular volume
25
Geometrical
M
Gp,Gb
Gravitational indexes
26
Geometrical
M
IX,IY,IZ
Principal moments
molecule
27
Geometrical
M
SXY,SYZ,SXZ Shadow areas of a molecule
28
Geometrical
M
SXY,r, SYZ,r,
Relative shadow areas of a molecule
SXZ,r
29
Electrostatic
A
Qi
Gasteiger-Marsili empirical atomic partial
charges
30
Electrostatic
A
Qi
Zefirov's empirical atomic partial charges
31
Electrostatic
A
Qi
Mulliken atomic partial charges
32
Electrostatic
M
Qmax, Qmin
33
Electrostatic
M
P, P', P"
34
Electrostatic
M

Dipole moment
35
Electrostatic
M

Molecular polarizability
m
14
Topological
M
15
Topological
M
16
Topological
M
17
Topological
M
k
18
Topological
M
19
Topological
M
20
Topological
21
k
information
of
content
inertia of
a
Minimum (most negative) and maximum
(most positive) atomic partial charges
Polarity parameters
34
36
Electrostatic
M

37
CPSA
M
PPSA1
Partial positively charged surface area
38
CPSA
M
PPSA2
Total charge weighted partial positively
charged surface area
39
CPSA
M
PPSA3
Atomic charge weighted partial positively
charged surface area
40
CPSA
M
PNSA1
Partial negatively charged surface area
41
CPSA
M
PNSA2
Total charge weighted partial negatively
charged surface area
42
CPSA
M
PNSA3
Atomic
charge
weighted
negatively charged surface area
43
CPSA
M
DPSA1
Difference between partial positively and
negatively charged surface areas
44
CPSA
M
DPSA2
Difference between total charge weighted
partial positive and negative surface areas
45
CPSA
M
DPSA3
Difference between atomic charge
weighted partial positive and negative
surface areas
46
CPSA
M
FPSA1
Fractional partial positive surface area
47
CPSA
M
FPSA2
Fractional total charge weighted partial
positive surface area
48
CPSA
M
FPSA3
Fractional atomic charge weighted partial
positive surface area
49
CPSA
M
FNSA1
Fractional partial negative surface area
51
CPSA
M
FNSA2
Fractional total charge weighted partial
negative surface area
52
CPSA
M
FNSA3
Fractional atomic charge weighted partial
negative surface area
53
CPSA
M
WPSA1
Surface weighted charged partial positive
charged surface area
54
CPSA
M
WPSA2
Surface weighted charged partial positive
charged surface area
Molecular hyperpolarizability
35
partial
55
CPSA
M
WPSA3
Surface weighted charged partial positive
charged surface area
56
CPSA
M
WNSA1
Surface weighted charged partial negative
charged surface area
57
CPSA
M
WNSA2
Surface weighted charged partial negative
charged surface area
58
CPSA
M
WNSA3
Surface weighted charged partial negative
charged surface area
59
CPSA
M
RPCG
Relative positive charge
60
CPSA
M
RNCG
Relative negative charge
61
CPSA
M
HDSA1
Hydrogen bonding donor ability of the
molecule
62
CPSA
M
HDSA2
Area-weighted
surface
charge
hydrogen bonding donor atoms
63
CPSA
M
HASA1
Hydrogen bonding acceptor ability of the
molecule
64
CPSA
M
HASA2
Area-weighted
surface
charge
hydrogen bonding acceptor atoms
65
CPSA
M
HDCA1
Hydrogen bonding donor ability of the
molecule
66
CPSA
M
HDCA2
Area-weighted
surface
charge
hydrogen bonding donor atoms
67
CPSA
M
HACA1
Hydrogen bonding acceptor ability of the
molecule
68
CPSA
M
HACA2
Area-weighted
surface
charge
hydrogen bonding acceptor atoms
69
CPSA
M
FHDSA1
Fractional hydrogen
ability of the molecule
70
CPSA
M
FHDSA2
Fractional area-weighted surface charge
of hydrogen bonding donor atoms
71
CPSA
M
FHASA1
Fractional hydrogen bonding acceptor
ability of the molecule
36
bonding
of
of
of
of
donor
72
CPSA
M
FHASA2
Fractional area-weighted surface charge
of hydrogen bonding acceptor atoms
73
CPSA
M
FHDCA1
Fractional hydrogen
ability of the molecule
74
CPSA
M
FHDCA2
Fractional area-weighted surface charge
of hydrogen bonding donor atoms
75
CPSA
M
FHACA1
Fractional hydrogen bonding acceptor
ability of the molecule
76
CPSA
M
FHACA2
Fractional area-weighted surface charge
of hydrogen bonding acceptor atoms
77
MO related
M
HOMO
Highest occupied
(HOMO) energy
78
MO related
M
LUMO
Lowest unoccupied molecular orbital
(LUMO) energy
79
MO related
M
EA
Fukui
index
atomic
nucleophilic
reactivity
80
MO related
M
NA
Fukui
index
atomic
electrophilic
reactivity
81
MO related
M
RA
Fukui
index
atomic
one-electron
reactivity
82
MO related
B
PAB
Mulliken bond orders
83
MO related
A
Vf,A
Free valence
M
Etot
Total energy of the molecule
M
Eel
Total electronic energy of the molecule
M
H0f
Standard heat of formation
84
85
86
Quantum
chemical
Quantum
chemical
Quantum
chemical
bonding
molecular
donor
orbital
87
Quantum
chemical
AT
Eee,A
Electron-electron repulsion energy for a
given atomic species
88
Quantum
chemical
AT
Ene,A
Nuclear-electron attraction energy for a
given atomic species
89
Quantum
chemical
B
Eee,AB
Electron-electron repulsion between two
given atoms
37
90
Quantum
chemical
B
Ene,AB
Nuclear-electron
attraction
between two given atoms
91
Quantum
chemical
B
Enn,AB
Nuclear repulsion energy between two
given atoms
92
Quantum
chemical
B
Eexc,AB
Electronic exchange energy between two
given atoms
93
Quantum
chemical
BT
ER,AB
Resonance energy between given two
atomic species
94
Quantum
chemical
BT
EC,AB
Total electrostatic interaction energy
between two given atomic species
95
Quantum
chemical
BT
Etot,AB
Total interaction energy between two
given two atomic species
96
Quantum
chemical
M
Eee,tot
Total molecular one-center
electron repulsion energy
electron-
97
Quantum
chemical
M
Ene,tot
Total molecular one-center
nuclear attraction energy
electron-
98
Quantum
chemical
M
EC,tot
Total
intramolecular
interaction energy
M
K
M
ΔHprot
102 Thermodynamic
M
HV
Vibrational enthalpy of the molecule
103 Thermodynamic
M
HT
Translational enthalpy of the molecule
104 Thermodynamic
M
SV
Vibrational entropy of the molecule
105 Thermodynamic
M
SR
Rotational entropy of the molecule
106 Thermodynamic
M
ST
Translational entropy of the molecule
107 Thermodynamic
M
CV
Vibrational heat capacity of the molecule
100
101
Quantum
chemical
Quantum
chemical
energy
electrostatic
Electron kinetic energy density
Energy of protonation
38
4.1 Constitutional Descriptors
The constitutional descriptors depend on atomic constitution of the chemical
structure (molecule). They also include the descriptors related to the types of bonds and
the presence of rings in the molecule.

total number of atoms in the molecule

absolute and relative numbers of atoms of certain chemical identity (C, H, O, N,
F, etc.) in the molecule

absolute and relative numbers of certain chemical groups and functionalities in
the molecule

total number of bonds in the molecule

absolute and relative numbers of single, double, triple, aromatic or other bonds in
the molecule

total number of rings, number of rings divided by the total number of atoms

total and relative number of 6-atoms aromatic rings

molecular weight and average atomic weight
39
4.2 Topological Descriptors
4.2.1 Wiener index
Definition:
1 NSA
W   dij
2 (i , j )
dij - the number of bonds in the shortest path connecting the pair of atoms i and j
NSA - the number of non-hydrogen atom in the molecule
Reference:
H. Wiener, J. Am. Chem. Soc., 1947, 69, 17.
4.2.2 Randić's molecular connectivity index
Definition:

 D D 
1 2
i
j
edges ij
Di and Dj - the edge degrees (atom connectivities) of the molecular graph.
4.2.3 Randić's indices of different orders
Definition:
m

 Di Dj...Dk 1 2
path
References:
1. M. Randić, J. Am. Chem. Soc., 1975, 97, 6609.
2. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J.
Wiley & Sons, New York, 1986.
4.2.4 Balaban's J index
Definition:
J 
q
Si S j 1 2

  1 edges ij
q - number of edges in the molecular graph
40
m = (q – n + 1) - the cyclomatic number of the molecular graph
n – number of atoms in the molecular graph
Si - distance sums calculated as the sums over the rows or columns of the topological
distance matrix of the molecule, D.
References:
1. A. T. Balaban, Chem. Phys. Lett., 1981, 89, 399.
2. A. T. Balaban, Pure and Appl. Chem., 1983, 55, 199.
4.2.5 Kier and Hall valence connectivity indices
Definition:
m
N s m 1

12
1 
    v 
i 1 k 1   k 
v
( Z kv  H k )
 
- valence connectivity for the k-th atom in the molecular graph
( Z k  Z kv  1)
v
k
Zk - the total number of electrons in the k-th atom
Z kv - the number of valence electrons in the k-th atom
Hk - the number of hydrogen atoms directly attached to the kth non-hydrogen atom
m = 0 - atomic valence connectivity indices
m = 1 - one bond path valence connectivity indices
m = 2 - two bond fragment valence connectivity indices
m = 3 three contiguous bond fragment valence connectivity indices etc.
References:
1. L. B. Kier, L. H. Hall, Eur. J. Med. Chem., 1977, 12, 307.
2. L. B. Kier, L. H. Hall, J. Pharm. Sci., 1981, 70, 583.
3. L. B. Kier, L. H. Hall, Molecular Connectivity in Structure-Activity Analysis, J.
Wiley & Sons, New York, 1986.
4.2.6 Kier shape indices
Definition:
  ( N SA   )( N SA    1) 2 (1P   ) 2
1
41
  ( N SA    1)( N SA    2) 2 ( 2P   ) 2
2
  ( N S A    1)( N S A    3)2 ( 3P   ) 2
if N SA is odd
  ( N SA    3)( N SA    2) 2 ( 3P   ) 2
if N SA is even
3
3
NSA - the number of non-hydrogen atom in the molecule
nP - the number of paths of the length nin the molecular graph

ri
1
rCi
ri - atomic radius of a given atom
rCi - atomic radius of the carbon atom in the sp3 hybridization state
References:
1. L. B. Kier, Quant. Struct.-Act. Relat., 1985, 4, 109.
2. L. B. Kier, in: Computational Chemical Graph Theory, D. H. Rouvray (Ed.),
Nova Science Publishers, New York 1990.
4.2.7 Kier flexibility index
Definition:
(1 2 )

N SA
1
and 2 - Kier shape indices
NSA - the number of non-hydrogen atom in the molecule
Reference:
1. L. B. Kier, in: Computational Chemical Graph Theory, D. H. Rouvray (Ed.),
Nova Science Publishers, New York 1990.
4.2.8 Mean information content index
42
Definition:
k
k
ni
n
log 2 i
n
i 1 n
IC  
ni - number of atoms in the ith class
n - the total number of atoms in the molecule
k - number of atomic layers in the coordination sphere around a given atom that are
accounted for
Reference:
1. L. B. Kier, J. Pharm. Sci., 1980, 69, 807.
4.2.9 Structural information content index
Definition:
k
k
SIC k IC / log 2 n
k
ni
n
log 2 i
n
i 1 n
IC  
ni - number of atoms in the ith class
n - the total number of atoms in the molecule
k – number of atomic layers in the coordination sphere around a given atom that are
accounted for
Reference:
1. S.C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429.
4.2.10 Complementary information content index
Definition:
CIC  log 2 n k IC
k
43
k
k
ni
n
log 2 i
n
i 1 n
IC  
ni - number of atoms in the ith class
n - the total number of atoms in the molecule
k - number of atomic layers in the coordination sphere around a given atom that are
accounted for
Reference:
1. S.C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429.
4.2.11 Bonding information content index
Definition:
k
k
BIC  kIC / log 2 q
k
ni
n
log 2 i
n
i 1 n
IC  
ni - number of atoms in the ith class
n - the total number of atoms in the molecule
k - number of atomic layers in the coordination sphere around a given atom that are
accounted for
q - number of edges in the molecular graph
Reference:
1. S.C. Basak, D. K. Harriss, V. R. Magnuson, J. Pharm. Sci., 1984, 73, 429 ()
4.2.12 Topological electronic indices
Definition:
E
T1 
N SA
| qi  q j |
( i j )
rij2

44
T2 
E
Nb
| qi  q j |
(i  j )
rij2

qi- partial charge on the i-th atom
rij – distance between i-th and j-th atoms
NSA – number of non-hydrogen atom in the molecule
Nb – number of bonds between non-hydrogen atom in the molecule
Reference:
1. K. Osmialowski, J. Halkiewicz, R. Kaliszan, J. Chromatogr., 1986, 63, 361.
45
4.3 Geometrical Descriptors
4.3.1 Molecular surface area
Definition:
(i )
S M   SVW
 Sov
i
(i )
SVW
- van der Waals area of the i-th constituent atom of a molecule
Sov – van der Waals area of atoms inside overlapping atomic envelopes
Reference:
1. M. Karelson, Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New
York, 2000.
4.3.2 Solvent-accessible molecular surface area
Definition:
S SA  A  As  A
A+ - convex areas of a molecule
As - saddle areas of a molecule
A- - concave areas of a molecule
Reference:
M. L. Connolly, J. Appl. Crystallogr., 1983, 16, 548-558.
4.3.3 Molecular volume
Definition:
(i )
VM  VVW
 Vov
i
(i )
VVW
- van der Waals volume of the i-th constituent atom of a molecule
Vov – volume of overlapping van der Waals atomic envelopes
46
Reference:
1. F. M. Richards, Annu. Rev. Biophys. Bioeng., 1977, 6, 151-176
4.3.4 Solvent-excluded molecular volume
Definition:
Vmol ( SE )  V p  V  Vs  V  Vac  Vnc
Vp - volume of internal polyhedron
V - volume pieces between the center of an atom and the convex face of the solventaccessible surface
Vs - volume of saddle pieces
V - volume of concave pieces
Vac, Vnc - cusp volume pieces
Reference:
1. M. L. Connolly, J. Am. Chem. Soc., 1985, 107, 1118-1124
4.3.5 Gravitational indexes
Definition:
Na
mi m j
i j
rij2
Gp  
Nb
Gb  
mi m j
i j
rij2
mi, mj - atomic masses of atoms i and j
rij - interatomic distance of atoms i and j
Na - number of atoms in the molecule
Nb - number of chemical bonds in the molecule
Reference:
1. A. R. Katritzky, L. Mu, V. S. Lobanov, M. Karelson, J. Phys. Chem., 1996, 100,
10400-10407.
47
4.3.6 Principal moments of inertia of a molecule
Definition:
I k   mi r
i
2
ik
mi - atomic weights of constituent atoms of a molecule
rik - distance of the i-th atomic nucleus from the k-th main rotational axes (k = X, Y or Z)
Reference:
1. Handbook of Chemistry and Physics, CRC Press, Cleveland OH, 1974, p. F-112.
4.3.7 Shadow areas of a molecule
Definition:
Sk 
1
 d  d 
2 (C)
C – contour of the projection of the molecule on the plane defined by two principal axes
of the molecule (k = XY, XZ or YZ)
- x or y
- y or z
Reference:
1. R. H. Rohrbaugh, P. C. Jurs, Anal. Chim. Acta, 1987, 199, 99.
4.3.8 Relative shadow areas of a molecule
Definition:
S kr 
 d  d 
(C)
S (k )
C – contour of the projection of the molecule on the plane defined by two principal axes
of the molecule (k = XY, XZ or YZ plane)
- x or y
- y or z
48
S ( k )  X  Y ; X  Z or Y  Z
Reference:
1. M. Karelson, Molecular Descriptors in QSAR/QSPR, J. Wiley & Sons, New
York, 2000.
49
4.4 Electrostatic Descriptors
4.4.1Gasteiger-Marsili empirical atomic partial charges
Definition:
Qi   qi 

qi 

1
 
2
  j    iv 
 k    iv  

 - the contribution to the
   
 
vi 


j


k

iv
k


atomic charge on the a-th step of iteration of charge
 iv  aiv  bivQi  civQi2 - electronegativity of n–th orbital on i-th atom
aiv 
I iv0  Eiv0
2
I iv0  Eiv  Eiv0
biv 
4
I iv  I iv0  Eiv  Eiv0
biv 
4
I iv0 , I iv , E iv0 , and Eiv - the ionization potentials and electron affinities of the neutral
atom (superscript 0) and of the positive ion (superscript +), respectively.
References:
1. J. Gasteiger, M. Marsili, Tetrahedron Lett., 1978, 3181
2. J. Gasteiger, M. Marsili, Tetrahedron, 1980, 36, 3219-3228
4.4.2 Zefirov's empirical atomic partial charges
Definition:
Qi  f  i 
i – atomic electronegativities
50
1  n 1
 0 n

 i    i   k 
 k 1 
 i0 - electronegativities of isolated atoms
n – atoms in the first coordination sphere of a given atom
References:
1. N. S. Zefirov, M. A. Kirpichenok, F. F. Izmailov, M. I. Trofimov., Dokl. Akad.
Nauk SSSR, 1987, 296, 883.
2. M. A. Kirpichenok, N. S. Zefirov, Zh. Org. Khim., 1987, 23, 4 .
4.4.3 Mulliken atomic partial charges
Definition:
1
1


Q A  Z A    Pkk   Pkl   Plk   Z A   Pkl
2 l k
2 l k 
 k A
ZA - atomic nuclear charge
Pkl - atomic population matrix elements
Reference:
1. R. S. Mulliken, J. Chem. Phys.,1955 23, 1833-1840.
2. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
4.4.4 Minimum (most negative) and maximum (most positive) atomic
partial charges
Definition:
Qmin =min (Q -)
Qmax=max (Q+)
Q -- negative atomic partial charges
Q+ - positive atomic partial charges
Reference:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
51
4.4.5 Polarity parameters
Definitions:
P = Qmax - Qmin
P 
Qmax  Qmin
Rmm
P 
Qmax  Qmin
2
Rmm
Qmax - the most positive atomic partial charge in the molecule
Qmin - the most negative atomic partial charge in the molecule
Rmm - distance between the most positive and the most negative atomic partial charges in
the molecule
Reference:
1. K. Osmialowski, J. Halkiewicz, A. Radecki, R. Kaliszan, J. Chromatogr., 1985,
346, 53.
4.4.6 Dipole moment
Definition:
occ
M
i 1 (V )
a 1

    i rˆi dv   Z a Ra
i - molecular orbitals
r̂ - electron position operator
Za - a-th atomic nuclear charge

Ra - position vector of a-th atomic nucleus
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.
4.4.7 Molecular polarizability, 
52
Definition:
     E 
1
E 2  ...
2
- permanent dipole moment of the molecule
’' - induced dipole moment of the molecule
E - external electric field
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.
4.4.8 Molecular hyperpolarizability, 
Definition:
     E 
1
E 2  ...
2
- permanent dipole moment of the molecule
’' - induced dipole moment of the molecule
E - external electric field
Reference:
1. P. W. Atkins, Quanta, Oxford University Press, Oxford, 1991.
53
4.5 CPSA Descriptors
4.5.1Partial positively charged surface area
Definition:
A   A  0
PPSA1   S A
A
SA- positively charged solvent-accessible atomic surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323 .; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306 .
4.5.2 Total charge weighted partial positively charged surface area
Definition:
PPSA2   q A   S A
A
A   A  0
A
SA - positively charged solvent-accessible atomic surface area
qA - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323 .; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306 .
4.5.3 Atomic charge weighted partial positively charged surface area
Definition:
PPSA3   q A  S A
A   A  0
A
SA - positively charged solvent-accessible atomic surface area
qA - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323 ; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
54
4.5.4 Partial negatively charged surface area
Definition:
A   A  0
PNSA1   S A
A
SA - negatively charged solvent-accessible atomic surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.5 Total charge weighted partial negatively charged surface area
Definition:
PNSA2   q A   S A
A
A   A  0
A
SA - negatively charged solvent-accessible atomic surface area
qA - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.6 Atomic charge weighted partial negatively charged surface area
Definition:
PNSA3   q A  S A
A   A  0
A
SA - negatively charged solvent-accessible atomic surface area
qA - atomic partial charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.7 Difference between partial positively and negatively charged
surface areas
55
Definition:
DPSA1 = PPSA1 – PNSA1
PPSA1 - positively charged solvent-accessible molecular surface area
PNSA1 - negatively charged solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.8 Difference between total charge weighted partial positive and
negative surface areas
Definition:
DPSA2 = PPSA2 – PNSA2
PPSA2 – total charge weighted partial positively charged molecular surface area
PNSA2 - total charge weighted partial negatively charged molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.9 Difference between atomic charge weighted partial positive and
negative surface areas
Definition:
DPSA2 = PPSA2 – PNSA2
PPSA2 – total charge weighted partial positively charged molecular surface area
PNSA2 - total charge weighted partial negatively charged molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.10 Fractional partial positive surface area
56
Definition:
FPSA1 
PPSA1
TMSA
PPSA1– partial positively charged molecular surface area
TMSA- total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.11 Fractional total charge weighted partial positive surface area
Definition:
FPSA2 
PPSA2
TMSA
PPSA2 – total charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.12 Fractional atomic charge weighted partial positive surface area
Definition:
FPSA3 
PPSA3
TMSA
PPSA3– total charge weighted partial positively charged molecular surface area
TMSA -total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.13 Fractional partial negative surface area
57
Definition:
FNSA1 
PNSA1
TMSA
PNSA1 – partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.14 Fractional total charge weighted partial negative surface area
Definition:
FNSA2 
PNSA2
TMSA
PNSA2 – total charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.15 Fractional atomic charge weighted partial negative surface area
Definition:
FNSA3 
PNSA3
TMSA
PNSA3 – total charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
58
4.5.16 Surface weighted charged partial positive charged surface area
WPSA1
Definition:
WPSA1 
PPSA1  TMSA
1000
PPSA1 – partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.17 Surface weighted charged partial positive charged surface area WPSA2
Definition:
WPSA2 
PPSA2  TMSA
1000
PPSA2 – total charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.18 Surface weighted charged partial positive charged surface area
WPSA3
Definition:
WPSA3 
PPSA3
TMSA
PPSA3 – total charge weighted partial positively charged molecular surface area
TMSA - total molecular surface area
59
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.19 Surface weighted charged partial negative charged surface area
WNSA1
Definition:
WNSA1 
PNSA1  TMSA
1000
PNSA1 – partial negatively charged molecular surface area
TMSA -total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.20 Surface weighted charged partial negative charged surface area
WNSA2
Definition:
WNSA2 
PPSA2  TMSA
1000
PNSA2 – total charge weighted partial negatively charged molecular surface area
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.21 Surface weighted charged partial negative charged surface area
WNSA3
Definition:
WNSA3 
PNSA3
TMSA
60
PNSA3 – total charge weighted partial negatively charged molecular surface area
TMSA -total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992 32, 306.
4.5.22 Relative positive charge
Definition:

 max
RPCG 
 A
A   A  0
A

 max
- maximum atomic positive charge in the molecule
A - positive atomic charge in the molecule
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.23 Relative negative charge
Definition:

 max
RNCG 
 A
A   A  0
A

 max
- maximum atomic negative charge in the molecule
A - negative atomic charge in the molecule
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.24 Hydrogen bonding donor ability of the molecule HDSA1
61
Definition:
HDSA1   sD
D  H H donor
D
S
D-
solvent-accessible surface area of H-bonding donor H atoms
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.25 Area-weighted surface charge of hydrogen bonding donor atoms
HDSA2
Definition:
HDSA 2  
D
qD s D
D  H H -donor
Stot
sD  solvent-accessible surface area of H-bonding donor H atoms
qD  partial charge on H-bonding donor H atoms
Stot  total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.26 Hydrogen bonding acceptor ability of the molecule HASA1
Definition:
HASA1   s A
A  X H acceptor
A
sD - solvent-accessible surface area of H-bonding acceptor atoms
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
62
4.5.27 Area-weighted surface charge of hydrogen bonding acceptor
atoms HASA2
Definition:
HASA2  
qA sA
A  X H -acceptor
S tot
A
sA -solvent-accessible surface area of H-bonding acceptor atoms
qA - partial charge on H-bonding acceptor atoms
Stot - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.28 Hydrogen bonding donor ability of the molecule HDCA1
Definition:
HDCA1   sD
D  H H  donor
D
sD - solvent-accessible surface area of H-bonding donor H atoms, selected by threshold
charge
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.29 Area-weighted surface charge of hydrogen bonding donor atoms
HDCA2
Definition:
HDCA2  
D
qD s D
Stot
D  H H -donor
sD - solvent-accessible surface area of H-bonding donor H atoms, selected by threshold
charge
63
qD - partial charge on H-bonding donor H atoms, selected by threshold charge
Stot -total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.30 Hydrogen bonding acceptor ability of the molecule HACA1
Definition:
HACA1   s A
A  X H  acceptor
A
sA - solvent-accessible surface area of H-bonding acceptor atoms, selected by threshold
charge
Reference:
D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf, P.C.
Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.31 Area-weighted surface charge of hydrogen bonding acceptor
atoms HACA2
Definition:
HACA2  
A
qA s A
Stot
A  X H -acceptor
sA - solvent-accessible surface area of H-bonding acceptor atoms, selected by threshold
charge
qA - partial charge on H-bonding acceptor atoms, selected by threshold charge
Stot - total solvent-accessible molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.32 Fractional hydrogen bonding donor ability of the molecule
FHDSA1
64
Definition:
FHDSA1 
HDSA1
TMSA
HDSA1 - hydrogen bonding donor ability
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990,62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.33 Fractional area-weighted surface charge of hydrogen bonding
donor atoms FHDSA2
Definition:
FHDSA 2 
HDSA 2
TMSA
HDSA2 - area-weighted surface charge on hydrogen bonding donor atoms
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.34 Fractional hydrogen bonding acceptor ability of the molecule
FHASA1
Definition:
FHASA1 
HASA1
TMSA
HASA1 - hydrogen bonding acceptor ability
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
65
4.5.35 Fractional area-weighted surface charge of hydrogen bonding
acceptor atoms FHASA2
Definition:
FHASA2 
HASA2
TMSA
HASA2 - area-weighted surface charge on hydrogen bonding acceptor atoms
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.36 Fractional hydrogen bonding donor ability of the molecule
FHDCA1
Definition:
FHDCA1 
HDCA1
TMSA
HDCA1 - hydrogen bonding donor ability of atoms, selected by threshold charge
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990,62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.37 Fractional area-weighted surface charge of hydrogen bonding
donor atoms FHDCA2
Definition:
FHDCA2 
HDCA 2
TMSA
HDCA2 - area-weighted surface charge on hydrogen bonding donor atoms, selected by
threshold charge
TMSA - total molecular surface area
66
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
4.5.38 Fractional hydrogen bonding acceptor ability of the molecule
FHACA1
Definition:
FHACA1 
HACA1
TMSA
HASA1  hydrogen bonding acceptor ability
TMSA  total molecular surface area
References:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323
2. D.T. Stanton, L.M. Egolf, P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci.,
1992, 32, 306
4.5.39 Fractional area-weighted surface charge of hydrogen bonding
acceptor atoms FHACA2
Definition:
FHACA2 
HACA2
TMSA
HACA2 - area-weighted surface charge on hydrogen bonding acceptor atoms, selected by
threshold charge
TMSA - total molecular surface area
Reference:
1. D.T. Stanton, P.C. Jurs, Anal. Chem., 1990, 62, 2323; D.T. Stanton, L.M. Egolf,
P.C. Jurs, M.G. Hicks, J. Chem. Inf. Comput. Sci., 1992, 32, 306.
67
4.6 MO Related Descriptors
4.6.1 Highest occupied molecular orbital (HOMO) energy
Definition:
 HOMO   HOMO F̂  HOMO
HOMO - highest occupied molecular orbital
F̂ - Fock operator
References:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
2. B.W. Clare, Theoret. Chim. Acta, 1994, 87, 415-430.
4.6.2 Lowest unoccupied molecular orbital (LUMO) energy
Definition:
 LUMO  LUMO F̂ LUMO
LUMO - lowest unoccupied molecular orbital
F̂ - Fock operator
References:
1. I.G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
2. B.W. Clare, Theoret. Chim. Acta, 1994, 87, 415-430.
4.6.3 Fukui atomic nucleophilic reactivity index
Definition:
2
N A   ciHOMO
/(1   HOMO )
iA
or simplified
68
2
N ' A   ciHOMO
iA
HOMO - highest occupied molecular orbital energy
ciHOMO - highest occupied molecular orbital MO coefficients
Reference:
1. R. Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.
4.6.4 Fukui atomic electrophilic reactivity index
Definition:
E A   c 2jLUMO /( LUMO  10)
jA
or simplified
E ' A   c 2jLUMO
jA
LUMO -
lowest unoccupied molecular orbital energy
c jLUMO - lowest unoccupied molecular orbital MO coefficients
Reference:
1. R. Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.
4.6.5 Fukui atomic one-electron reactivity index
Definition:
RA  
iA
c
jA
c
iHOMO jLUMO
/( LUMO   HOMO )
or simplified
R' A  
iA
c
jA
iHOMO
c jLUMO
ciHOMO - highest occupied molecular orbital MO coefficients
c jLUMO - lowest unoccupied molecular orbital MO coefficients
69
LUMO - lowest unoccupied molecular orbital energy
HOMO - highest occupied molecular orbital energy
Reference:
1. R. Franke, Theoretical Drug Design Methods. Elsevier, Amsterdam, 1984.
4.6.6 Mulliken bond orders
Definition:
occ
PAB     ni ci c j
i 1 A B
ni- the occupation number of the i-th MO
ci , c j - MO coefficients for atomic orbitals  and 
References:
1. I.G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
2. A.B. Sannigrahi, Adv. Quant. Chem., 1992, 23, 301-351.
4.6.7 Free valence
Definition:
V fA  Vmax  PA
Vmax - maximum valence of atom A
PA - total electronic population on atom A
References:
1. I.G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976.
2. A.B. Sanniraghi, Adv. Quant. Chem., 1992, 23, 301-351.
70
4.7 Quantum Chemical Descriptors
4.7.1 Total energy of the molecule
Definition:
Etot  Eel   Z A Z B / RAB
A B
Eel - total electronic energy of the molecule
ZA, ZB - nuclear charges of atoms A and B
RAB - distance between nuclei A and B
References:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976
2. M. Bodor, Z. Gabanyi, C.-K. Wong, J. Am. Chem. Soc., 1989, 111, 3783
4.7.2 Total electronic energy of the molecule
Definition:
Eel= 2Tr(RF) - Tr(RG)
R - first order density matrix
F - matrix representation of the Hartree-Fock operator
G - matrix representation of the electron repulsion energy
Reference:
1. I. G. Csizmadia, Theory and Practice of MO Calculations on Organic Molecules,
Elsevier, Amsterdam, 1976
4.7.3 Standard heat of formation
Definition:
H 0f  H f   H af
a
71
- quantum-chemically calculated total energy of the molecule
- quantum-chemically calculated energies of isolated atoms, a
Reference:
1. P. W. Atkins, Physical Chemistry, 3rd Edition, Oxford University Press, Oxford,
1988
4.7.4 Electron-electron repulsion energy for a given atomic species
Definition:
Eee ( A) 
   P P
 
B  A  , A  , B
A – given atomic species
B – other atoms
P, P,- density matrix elements over atomic basis {}
  - electron repulsion integrals on atomic basis {}
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980
4.7.5 Nuclear-electron attraction energy for a given atomic species
Definition:
Ene  AB   
B
P



, A
ZB

RiB
A - given atomic species
B - other atoms
P - density matrix elements over atomic basis {}
ZB - charge of atomic nucleus, B
RiB - distance between the electron and atomic nucleus, B

ZB
 - electron-nuclear attraction integrals on atomic basis {}
RiB
72
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980
4.7.6 Electron-electron repulsion between two given atoms
Definition:
Eee ( AB) 
  P P
 
 , A  , B
A – given atomic species
B – another atomic species
P, P,- density matrix elements over atomic basis {}
  - electron repulsion integrals on atomic basis {}
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980
4.7.7 Nuclear-electron attraction energy between two given atoms
Definition:
Ene  AB   
B
 P


, A
ZB

RiB
A – given atomic species
B – another atomic species
P - density matrix elements over atomic basis {}
ZB - charge of atomic nucleus, B
RiB - distance between the electron and atomic nucleus, B

ZB
 - electron-nuclear attraction integrals on atomic basis {}
RiB
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980
73
4.7.8 Nuclear repulsion energy between two given atoms
Definition:
E nn ( AB) 
Z AZ B
R AB
A – given atomic species
B – another atomic species
ZA - charge of atomic nucleus, A
ZB - charge of atomic nucleus, B
RiB - distance between the atomic nuclei, A and B
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
4.7.9 Electronic exchange energy between two given atoms
Definition:
Eexc ( AB) 
  P P
 
 , A  , B
A – given atomic species
B – another atomic species
P, P, - density matrix elements over atomic basis {}
  - electron repulsion integrals on atomic basis { }
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980
4.7.10 Resonance energy between given two atomic species
Definition:
E R ( AB) 
  P  
A B
A – given atomic species
74
B – another atomic species
P - density matrix elements over atomic basis {}
- resonance integrals on atomic basis {}
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
4.7.11 Total electrostatic interaction energy between two given atomic
species
Definition:
EC ( AB)  Eee ( AB)  Ene ( AB)  Enn ( AB)
A – given atomic species
B – another atomic species
Eee(AB) - electronic repulsion energy between two atomic species
Ene(AB) - electron-nuclear attraction energy between two atomic species
Enn(AB) - nuclear repulsion energy between two atomic species
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
4.7.12 Total interaction energy between two given two atomic species
Definition:
Etot ( AB)  EC ( AB)  Eexc ( AB)
A – given atomic species
B – another atomic species
EC(AB) - electrostatic interaction energy between two atomic species
Eexc(AB) - electronic exchange energy between two atomic species
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
75
4.7.13 Total molecular one-center electron-electron repulsion energy
Definition:
Eee (tot)   Eee ( A)
A
A - given atomic species
Eee(A) - electron-electron repulsion energy for atom A
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
4.7.14 Total molecular one-center electron-nuclear attraction energy
Definition:
A – given atomic species
Ene(A) - electron-nuclear attraction energy for atom A
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
4.7.15 Total intramolecular electrostatic interaction energy
Definition:
EC ( tot ) 
1
 EC ( A )
2 A
A - given atomic species
EC(A) - electrostatic energy for atom A
Reference:
1. E. Clementi, Computational Aspects of Large Chemical Systems, Springer Verlag,
New York, 1980.
76
4.8 Thermodynamic Descriptors
4.8.1 Vibrational enthalpy of the molecule
Definition:
H vib 
h j exp  h j 2kT 
1 a
h j 

2 j 1
1  exp  h j 2kT 
j - frequencies of normal vibrations in the molecule
h - Planck's constant
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press,
Oxford, 1981.
4.8.2 Translational enthalpy of the molecule
Definition:
H tr 



2
2  p
p
e 2mkT dp
2m
p - momentum of the molecule
m - mass of the molecule
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press,
Oxford, 1981.
77
4.8.3 Vibrational entropy of the molecule
Definition:
 h j exp  h j 2kT 

S vib   
 ln 1  exp  h j 2kT  

j 1 
 kT 1  exp  h j 2kT 





j - frequencies of normal vibrations in the molecule
h - Planck's constant
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press,
Oxford, 1981.
4.8.4 Rotational entropy of the molecule
Definition:
S rot
 1/ 2
 Nk ln 
 

 8 2 I j kT 



2


h
j 1 

3
1/ 2




Ij - principal moments of inertia of the molecule
 - symmetry number of the molecule
h - Planck's constant
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
4.8.5 Translational entropy of the molecule
78
Definition:
 2mkT 
Str  ln 

2
 h

1/ 2
Ve5 / 2
NA
V - volume of the system
NA - Avogadro's number
m - mass of the molecule
h - Planck's constant
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press,
Oxford, 1981.
4.8.6 Vibrational heat capacity of the molecule
Definition:
exp  h j / 2kT 
 h j 

 k  
j 1  kT  1  exp  h j / 2kT )

cv ,vib
2
j - frequencies of normal vibrations in the molecule
h - Planck's constant
k - Boltzmann's constant
T - absolute temperature (K)
References:
1. D.A. McQuarrie, Statistical Thermodynamics, Harper & Row Publishers, New
York, 1973.
2. A.I. Akhiezer, S.V. Peltminskii, Methods of Statistical Physics, Pergamon Press,
Oxford, 1981.
79
Chapter 5
Methods of QSAR/QSPR Model Development
The QSAR/QSPR models developed by CODESSA PRO software are essentially
based on the multilinear regression method.
5.1 Multilinear Regression
The multilinear regression method can be described compactly and conveniently
using matrix notation.[1, 2, 3] Suppose that there are n property values in Y and n
associated calculated values for each k molecular descriptor in X columns. Then Yi, Xik,
and ei can represent the ith value of the Y variable (property), the ith value of each of the
X descriptors, and the ith unknown residual value, respectively. Collecting these terms
into matrices we have:
The multiple regression model in matrix notation then can be expressed as
where b is a column vector of coefficients (b1 is for the intercept) and k is the number
unknown regression coefficients for the descriptors. We recall that the goal of multiple
regression is to minimize the sum of the squared residuals:
Regression coefficients that satisfy this criterion are found by solving the system of linear
equations (multiplying both sides by X’ from left)
When the X variables are linearly independent (an X'X matrix which is of full
rank), there is a unique solution to the system of linear equations. One of the ways for
solving the system above is to premultiply both sides of the matrix formula for the
normal equations by the inverse matrix X'X to give
80
The other way is to solve directly the system above using LS (underdetermined, n
< k) or QR factorization for the overdetermined (n > k) system. This method is more
general and does not require time-consuming matrix inversion. Singular value
decomposition methods can also be used, but usually such methods are significantly more
time-consuming and only advantageous when a strong linear dependence exists that
would diminish quality of models.
The third way to solve the problem of linear dependency of variables (determinant of the
X’X matrix is above zero) is by general matrix inversion, but this is usually outside the
sphere of QSPR.
A fundamental principle of least squares methods, the multiple linear regression
in particular, is that variance of the dependent variable can be partitioned (divided into
parts) according to the source. Suppose that a dependent variable (property) is regressed
on one or more descriptors and, for convenience, the dependent variable is scaled so that
its mean is 0. Next, a basic least squares identity is calculated in which the total sum of
squared values on the dependent variable equals the sum of squared predicted values plus
the sum of squared residual values. Stated more generally,
where the term on the left is the total sum of squared deviations of the observed values on
the dependent variable from the dependent variable mean, and the terms on the right are:
(i) the sum of the squared deviations of the predicted values for the dependent variable
from the dependent variable mean and
(ii) the sum of the squared deviations of the observed values on the dependent variable
from the predicted values, that is, the sum of the squared residuals.
Stated yet another way,
Note that the SSTotal is always the same for any particular data set, but SSModel and
the SSError vary with the regression equation. Assuming again that the dependent variable
is scaled so that its mean is 0, the SSModeland SSError can be computed using
Assuming that X'Xis full-rank,
81
where r2is squared correlation coefficient which is the measure of the quality of model
fitnessto the property, s2 is an unbiased estimate of the residual or error variance, and F is
Fisher criteria of (k, n - k - 1) degrees of freedom. If X'Xis not full rank, rank(X’X) + 1is
substituted for k.
References:
1. http://www.statsoft.com/textbook/stathome.html
2. Darlington, R. B. Regression and linear models. New York: McGraw-Hill, 1990.
3. Neter, J.; Wasserman, W.; Kutner, M. H. Applied linear regression models (2nd
ed.). Homewood, IL: Irwin, 1989.
5.2 Selection of descriptors
A rigorously correct solution for descriptor selection requires a full search
procedure of the discrete descriptor space. Unfortunately, combinatorial explosion does
not allow the application of a full search procedure to real tasks. For example, if we
search for a 5-parameter correlation on a space of 1000 descriptors (numbers are realistic
for a typical search), we would have to test over 8*1012 correlations for their ability to
match some criterion (usually the squared correlation coefficient). Modern machines
have achieved sufficient productivity to calculate one correlation each 0.0001-0.0002
seconds using highly optimized linear algebra libraries (CODESSA PRO uses an Intel
MKL – LAPACK [3] clone), and highly optimized low level code. Even with this high
level of optimization, the time required for the solution of the aforementioned task, using
a full search procedure, is 8*1012*0.0001 seconds (about 26 years).
Because it is impossible to solve typical tasks in a reasonable amount of time
using full search, methods for simplification have been developed. Such methods of
descriptor selection can be categorized as either deterministic or stochastic.
Throughout years of research, many algorithms for non-full searches were
developed. The best known deterministic algorithms [1] are forward entry/backward
removal of effects (in our case – descriptors). The methods of forward and backward
stepwise searches combine the entering and removal of the effects at each step. Each of
the methods mentioned above have many limitations, [2] the majority of which are
concerned with the absence of a consistent set of the correlations (models) which
represent the upper segment of the search space. The best-subset methods (proposed in
this work) are the next alternative to the full-search procedure, bur possess such
limitations.
82
Two methods for reducing full search procedure were utilized by our group: [4]
the heuristic method and the best multi-linear regression.
The heuristic method for descriptor selection proceeds with a pre-selection of
descriptors by sequentially eliminating descriptors that do not match any of the following
criteria: (i) Fisher F-criteria greater than 1.0; (ii) R2 value less than a value defined at the
start; (iii) Student’s t-criterion less than a defined value; (iv) duplicate descriptors having
a higher squared intercorrelation coefficient than a predetermined level (retaining the
descriptor with higher R2 with reference to the property). The descriptors that remain are
then listed in decreasing order of correlation coefficients when used in global search for
2-parameter correlations. Each significant 2-parameter correlation by F-criteria is
recursively expanded to an n-parameter correlation till the normalized F-criteria remains
greater than the startup value. The best N correlations by R2, as well as by F-criterion, are
saved.
The best multilinear regression method is based on the (i) selection of the
orthogonal descriptor pairs, (ii) extension of the correlation (saved on the previous step)
with the addition of new descriptors until the F-criteria becomes less than that of the best
2-parameter correlation. The best N correlations (by R2) are saved.
Both methods successfully solve the initial selection problem by reducing the
number of pairs of descriptors in the "starting set". The major limitations of these
methods are the pairwise selection on the first step and the low consistence of the
presentation of the upper (according to the selected criteria) segment of the search (N in
both cases is 400) due to the small size of the correlation selection.
The HMPRO procedure applied in CODESSA PRO is a superset of the abovedescribed heuristic and best multilinear regression methods, as well as forward/backward
selections methods, forward/backward stepwise search methods, and a full search
procedure. Through the alteration of parameters of the HMPRO, it is possible to emulate
the behavior of all above mentioned methods.
The HMPRO algorithm consists from 4 major parts:
1) The 1-parameter descriptor selection. The selection is based on squared
correlation coefficient, Fisher F-criteria, Student t-criteria. The highly
intercorrelated descriptors and descriptors with insignificant variance are
eliminated.
2) Pair-wise selection. This selection is based on the analysis of the squared
correlation coefficient and Fisher F-criteria.
3) Expanding/contracting stage. The obtained correlations with the best
statistical characteristics are expanded by adding a previously unselected
descriptor. The new correlations are validated by the values of squared
intercorrelation coefficient, F-criteria (normalized or not), and standard error.
The correlations with the maximum allowed number of descriptors (a
parameter pre-set by the user) will not be expanded further. The process
83
begins from the correlations with the largest values of the fitness function
(correlation weight) defined as follows:
w = (R2Fn)/(Ns2)
where R2 is the squared correlation coefficient, F is the Fisher criterion, n is
number of the data points, N is the number of descriptors in the model and s is
standard deviation. This stage is repeated until a stop event appears which can
be any/all of following:
i.
The correlation set in the memory exceeds the capacity. The
correlations with the lowest values of the fitness function are not
stored. If a new correlation with a fitness function value higher than
the current lowest fitness function is found, it is substituted for that
correlation with the lowest fitness function.
ii.
The in-memory correlation set is overfilled. (It can be
conditionally unlimited when the correlation value of fitness function
is less then minimal in the set, it is not stored at all after overfilling. In
this case, if correlation is inserted into the set, the correlation with the
worst value of the fitness function is eliminated).
iii.
Maximum number of iterations is achieved.
iv.
Time limit is achieved.
v.
The correlation set does not have any correlation to
expand/contract (full search is finished).
4) The output stage. The given number of the “best” correlations is printed out.
Each iteration, available for selecting printed correlations, begins with the
“best” correlations. For all of the correlations in a cycle, a full set of statistical
parameters is calculated including intercorrelation of the descriptors (one to
all others), cross-validated squared correlation coefficient, etc. The parameters
of the method are defined by the set of the selection criteria. For any
correlation, a full list of the predecessors, in calculation order, can be printed
and on the basis of the best correlations with subsets of the descriptors until
the 1-parameter correlation is printed.
A general fitness function (correlation weight) above is defined as follows:
w  ( R 2 ) xa F x2 n x3 ( s 2 ) x4 N x5 ,
where R2 is a squared correlation coefficient, F is Fisher criteria, n is number of the data
points, s2 is standard error (non-normalized), and N is number of the descriptors in the
correlation. The xi denote the weight parameters and have the following default values
{1, 1, 1, -1, -1 }.
The MDA module of CODESSA PRO provides two additional validation
methods for correlation validation: MCCV [5,6], and the Monte Carlo extension of the
randomization test.[7] One of the best random number generator is used.[8]
84
References:
1. Darlington, R. B. Regression and linear models. New York: McGraw-Hill, 1990.
2. Neter, J.; Wasserman, W.; Kutner, M. H. Applied linear regression models (2nd
ed.). Homewood, IL: Irwin, 1989.
3. Anderson, E; Bai, Z.; Bischof, C.; Demmel, J.; Dongarra, J.; Ducroz, J.;
Greenbaum, A.; Hammarling, S.; McKenney, A.; Sorensen, D. LAPACK: A
portable linear algebra library for high-performance computers. Computer
Science Dept. Technical Report CS-90-105, University of Tennessee: Knoxville,
TN, 1990
4. Katritzky, A.R.; Lobanov V.; Karelson, M. CODESSA Reference Manual.
University of Florida, Gainesville, 1996.
5. Xu Q.-S.; Liang Y.-Z. Monte Carlo cross validation Chemom. Intell. Lab. Syst.
2001, 56, 1-11.
6. Martens, H.A.; Dardenne, P. Validation and verification of regression in small
data sets 1998, 44, 99-121.
7. Edington E. S. Randomization tests: Marcel Dekker, Inc.: New York and Basel,
1980, pp.195-216.
8. Matsumoto, M.; Nishimura, T. Mersenne Twister: A 623-dimensionally
equidistributed uniform pseudorandom number generator, ACM Trans. on
Modeling and Computer Simulation 1998, 8, 3-30.
5.3 Multivariate methods
The principal component analysis (PCA) is generally described as an ordination
technique for describing the variation in a multivariate data set.[1, 2, 3] The first axis (the
first principal component, or PC1) describes the maximum variation in the whole data
set; the second describes the maximum variance remaining, and so forth, with each axis
orthogonal to the preceding axis. Principal components are eigenvectors of a covariance
X’X or correlation X’Y matrix. The number of principal components that can be
extracted will typically exceed the maximum of the number of Yand Xvariables.
The principal component analysis and factor analysis are based on the separation
of the original matrix X into two matrixes: factor scores matrix T and loading matrix Q.
In matrix form:
The columns in a factor score matrix are linear independent. Usually, the columns
in X and Y matrix are centered (by subtracting their means) and scaled (by dividing by
their standard deviations). Suppose we have a data set with response variables Y (in
matrix form) and a large number of predictor variables X (in matrix form), some of which
are highly correlated. A regression, using factor extraction for this type of data, computes
the factor score matrix
85
for an appropriate weight matrix W, and then considers the linear regression model
where Q is a matrix of regression coefficients (loadings) for T, and E is an error (noise)
term. Once the loadings Q are computed, the above regression model is equivalent to
which can be used as a predictive regression model.
The factor scores and loadings can be obtained in many different ways. NIPALS
algorithm was developed in 1923, [4] later modified in 1966, [5] and SIMPLS algorithm
[6] resulted from work by de Jong in 1993. Singular value decomposition is another
commonly used method for calculating scores and loading.[7]
Principal components regression (PCR) and partial least squares (PLS) regression
differ in the methods used for extracting factor scores.[1] PLR produces the weight
matrix W reflecting the covariance structure between the predictor variables, while PLS
regression produces the weight matrix W reflecting the covariance structure between the
predictor and response variables. In PLSregression, prediction functions are represented
by factors extracted from the Y'XX'Y matrix.
For establishing the model, PLS regression produces a weight matrix W for X
such that T=XW, i.e., the columns of W are weight vectors for the X columns producing
the corresponding factor score matrix T. These weights are computed so that each of
them maximizes the covariance between responses and the corresponding factor scores.
Ordinary least squares procedures for the regression of Y on T are then performed to
produce Q, the loadings for Y(or weights for Y) such that Y=TQ+E. Once Q is
computed, we have Y=XB+E, where B=WQ, and the prediction model is complete.
One additional matrix which is necessary for a complete description of partial
least squares regression procedures is the factor loading matrix P which gives a factor
model X=TP+F, where F is the unexplained part of the X scores.
References:
1. http://www.statsoft.com/textbook/stathome.html
2. Manly, B. F. J. Multivariate Statistical Methods. A Primer. London-NY:
Chapman and Hall, 1986.
3. Nilsson, J. Multiway calibration in 3D QSAR: applications to dopamine receptor
ligands; Groningen: University Library Groningen, 1998, Online Resource.
4. Fisher, R.; MacKensie, W. Journal of Agricaltural Science 1923, 13, 311-320.
86
5. Wold, H., In: Research papers in Statistics, ed. David, F., NY: Wiley & Sons,
1966, pp.411-444.
6. de Jong, S. SIMPLS: An Alternative Approach to Partial Least Squares
Regression, Chemometrics and Intelligent Laboratory Systems 1993, 18, 251-263
7. Mandel, J. American Statistician 1982, 36, 15-24 .
5.4 Test of the results
Most QSPR models are useful, but occasionally models are produced that not at
all reliable.[1] The problems of reliability can validly be classified as (i) overfitting and
(ii) models by chance. The latter problem can only be solved by subjective human
judgment based on the justifiability of any assessment of the uncertainty of a particular
prediction.[2, 3]
The problem of overfitting is one of the more common problems in the
development of the any kind of model and QSPR is no exception. The problem is usually
solved using objective validation criterions. The most common method of validation in
chemometrics is crossvalidation. [4]
References:
1. Eriksson, L.; Johansson, E.; Muller, M.; Wold, S. On the selection of the training
set in environmental QSAR analysis when compounds are clustered J.
Chemometrics 2000, 14, 599-616.
2. Stone M.; Jonathan P. Statistical Thinking and Technique for QSAR and related
studies. 1. Genaral Theory J. Chemometrics 1993, 7, 455-475.
3. Stone M.; Jonathan P. Statistical Thinking and Technique for QSAR and related
studies. 2. Specific Methods J. Chemometrics 1994, 6, 1-20.
4. Xu Q.-S.; Liang Y.-Z. Monte Carlo cross validation Chemom. Intell. Lab. Syst.
2001, 56, 1-11.
5.5 External validation set
The easiest method of the correlation testing is use of an external validation set.
[1] In this method, the correlation is used for predict a property value for a chemical
structure that was not used in the creation of the correlation; some test statistics are
calculated for the external dataset; the difference between the test statistics in the training
and validation datasets is a measure of the reliability of the correlation. The widely used
measure is the prediction error sum of squares (PRESS) and is defined as
87
where ye,i are experimental values of the property and yp,i are predicted values for
external validation test.
Often the RMSPE criterion is prefered:
because it gives error on a ‘per compound’ basis. [1]
The method is a particular case of the leave-many-out cross-validation method.
References:
1. Stone M.; Jonathan P. Statistical Thinking and Technique for QSAR and related
studies. 1. Genaral Theory J. Chemometrics 1993, 7, 455-475.
5.6 Leave-One-Out crossvalidation
The simplest, and a commonly used method of crossvalidation in chemometrics is
the “leave-one-out” method. The idea behind this method is to predict the property value
for a compound from the data set, which is in turn predicted from the regression equation
calculated from the data for all other compounds. For evaluation, predicted values can be
used for PRESS, RMSPE, and squared correlation coefficient criteria (r2cv).
The method tends to include unnecessary components in the model, and has been
provided [2] to be asymptotically incorrect. Furthermore, the method does not work well
for data with strong clusterization, [1] and underestimates the true predictive error. [3]
References:
1. Eriksson, L.; Johansson, E.; Muller, M.; Wold, S. On the selection of the training
set in environmental QSAR analysis when compounds are clustered J.
Chemometrics 2000, 14, 599-616.
2. Stone M. An Asymptotic Equivalence of Choice of Model by Cross-Validation
and Akaike’s Criterion J. R. Stat. Soc., B 1977, 38, 44-47.
3. Martens, H.A.; Dardenne, P. Validation and verification of regression in small
data sets 1998, 44, 99-121.
88
5.7 Leave-Many-Out crossvalidation
The “leave-many-out” crossvalidation method was first described in 1975. [3]
Later, the asymptotic consistence of the method was proved. [2] Because of
combinatorial complexity of the calculation resulted in low productivity, some
simplifications were developed for the method. The evaluation of the results of “leavemany-out” crossvalidation can be done using Monte Carlo approach. [1]
References:
1. Xu Q.-S.; Liang Y.-Z. Monte Carlo cross validation Chemom. Intell. Lab. Syst.
2001, 56, 1-11.
2. Stone M. An Asymptotic Equivalence of Choice of Model by Cross-Validation
and Akaike’s Criterion J. R. Stat. Soc., B 1977, 38, 44-47.
3. Geisser, S. The Predictive Sample Reuse Method with Application J. Amer.
Stat.Ass. 1975, 70, 320-328.
5.8 Randomization test
Randomization tests [1] are statistical tests in which the data are repeatedly
elaborated; a test statistic (r2, t-criteria, F-criteria, etc.) is computed for each data
permutation and the proportion of the data permutations, with test statistics values as
large as the value for the obtained results, determines the significance of the results. For
the testing of the multilinear correlation, the vector Y permutations are processed through
multilinear regression procedures with fixed columns of matrix X. Due to a factorial
increase in time spent from the size of the vector Y, Monte-Carlo method is often be used
for producing randomization test.
89
Chapter 6
CODESSA Publications
The following is a selection of publications on the research done by using the
CODESSA PRO (CODESSA) software (as of January 1st, 2005).
1. Quantitative structure-property relationship modeling of -cyclodextrin
complexation free energies (A.R. Katritzky, D.C Fara, H.F. Yang, M. Karelson,
T. Suzuki, V.P. Solov'ev, A. Varnek) J. Chem. Inf. Comput. Sci., 2004, 44, 529.
2. A quantitative structure-property relationship study of lithium cation
basicities (K. Tämm, D.C. Fara, A.R. Katritzky, P. Burk, M. Karelson, J. Phys.
Chem. A, 2004, 108, 4812.
3. Quantitative Aspects of Solvent Polarity, (A.R. Katritzky, D. Fara, H. Yang, T.
Tamm, M. Karelson) Chem. Rev., 2004, 104, 175.
4. "In silico" design of new uranyl extractants based on phosphoryl-containing
podands: QSPR studies, generation and screening of virtual combinatorial
library, and experimental tests (A. Varnek, D. Fourches, V.P. Solov'ev, V.E.
Baulin, A.N. Turanov, V.K. Karandashev, D. Fara, A.R. Katritzky) J. Chem. Inf.
Comput. Sci., 2004, 44, 1365.
5. QSPR of 3-aryloxazolidin-2-one antibacterials (A.R. Katritzky, D.C. Fara, M.
Karelson) Bioorg & Med. Chem 2004, 12, 3027.
6. QSPR treatment of rat blood : air, saline : air and olive oil: air partition
coefficients using theoretical molecular descriptors (A.R. Katritzky, M.
Kuanar, D.C. Fara, M. Karelson, W.E. Acree) Bioorg & Med. Chem 2004, 12,
4735.
7. QSPR models for various physical properties of carbohydrates based on
molecular mechanics and quantum chemical calculations. (J.D. Dyekjaer, S.Ó.
Jónsdóttir) Carbohydr. Res. 2004, 339, 269.
8. Quantum-connectivity descriptors in modeling solubility of environmentally
important organic compounds (E. Estrada, E.J. Delgado, J.B. Alderete, G.A.
Jana) J. Comput. Chem., 2004, 25, 1787.
9. Quantitative structure-activity relationships of alpha(1) adrenergic
antagonists (S. Eric, T. Solmajer, J. Zupan, M. Novic, M. Oblak, D. Agbaba) J.
Mol. Modeling, 2004, 10, 139.
10. Quantitative structure-activity relationships of Streptococcus pneumoniae
MurD transition state analogue inhibitors (M. Kotnik, M. Oblak, J. Humljan,
S. Gobec, U. Urleb, T. Solmajer) QSAR & Comb. Sci. 2004, 23, 399.
90
11. Study of quantitative structure-mobility relationship of carboxylic and
sulphonic acids in capillary electrophoresis (C.X. Xue, H.X. Liu, X.J. Yao,
M.C. Liu, Z. Hu, B.T. Fan, J. Chromatogr. A 2004, 1048, 233.
12. QSAR for toxicities of polychlorodibenzofurans, polychlorodibenzo-1,4dioxins, and polychlorobiphenyls (A. Beteringhe, A.T. Balaban) ARKIVOC
2004 (1) 163.
13. QSAR study using topological indices for inhibition of carbonic anhydrase II
by sulfanilamides and Schiff bases (A.T. Balaban, S.C. Basak, A. Beteringhe,
D. Mills, C.T. Supuran) Molecular Diversity, 2004, 8, 401.
14. Using Variable and Fixed Topological Indices for the Prediction of Reaction
Rate Constants of Volatile Unsaturated Hydrocarbons with OH Radicals
(M.Pompe, M. Veber, M Randić, A.T. Balaban) Molecules, 2004, 9, 1160.
15. Application of Artificial Neural Networks to the QSPR Study, (M. Novič, A.
Roncaglioni) Kem. Ind. 2004, 53, 323.
16. Tuning Neural and Fuzzy-Neural Networks for Toxicity Modeling (P.
Mazzatorta, E. Benfenati, C.-D. Neagu, G. Gini) J. Chem. Inf. Comput. Sci., 2003,
43, 513.
17. A general treatment of solubility. 1. The QSPR correlation of solvation free
energies of single solutes in series of solvents (A.R. Katritzky, A.A. Oliferenko,
P.V. Oliferenko, R. Petrukhin, D.B. Tatham, U. Maran, A. Lomaka, W.E. Acree)
J. Chem. Inf. Comput. Sci., 2003, 43, 1794.
18. A general treatment of solubility. 2. QSPR prediction of free energies of
solvation of specified solutes in ranges of solvents (A.R. Katritzky, A.A.
Oliferenko, P.V. Oliferenko, R. Petrukhin, D.B. Tatham, U. Maran, A. Lomaka,
W.E. Acree) J. Chem. Inf. Comput. Sci., 2003, 43, 1806.
19. QSPR models based on molecular mechanics and quantum chemical
calculations. 2. Thermodynamic properties of alkanes, alcohols, polyols, and
ethers. (J.D. Dyekjaer, S.Ó. Jónsdóttir) Ind. & Eng. Chem. Res., 2003, 42, 4241.
20. Chain melting temperature estimation for phosphatidyl cholines by quantum
mechanically derived quantitative structure property relationships
(A.J. Holder, D.M. Yourtee, D.A. White, A.G. Glaros, R. Smith) J. Comp.-Aid.
Mol. Des., 2003, 17, 223.
21. Nitrobenzene toxicity: QSAR correlations and mechanistic interpretations
(A.R. Katritzky, P. Oliferenko, A. Oliferenko, A. Lomaka, M. Karelson) J. Phys.
Org. Chem., 2003, 16, 811.
22. Layer matrices and distance property descriptors (M.V. Diudea, O. Ursu)
Ind. J. Chem A., 2003, 42, 1283.
91
23. Correlation between molecular structures and relative electrophoretic
mobility in capillary electrophoresis: Alkylpyridines, (X.J. Yao, B.T. Fan, J.P.
Doucet, A. Panaye, M.C. Liu, R.S. Zhang, Z.D. Hu, Chines. J. Chem. 2003, 21,
1247.
24. Prediction of boiling points of ketones using a quantitative structureproperty relationship treatment (A. Komasa) Pol. J. Chem. 2003, 77, 1491.
25. Thermodynamic Study of the Solubility of Some Sulfonamides in
Cyclohexane (F. Martínez, C.M. Ávila, A. Gómez), J. Braz. Chem. Soc., 2003,
14, 803.
26. Mutagenicity of aminoazo dyes and their reductive-cleavage metabolites: a
QSAR/QPAR investigation (L. Sztandera, A. Garg, S. Hayik, K.L. Bhat, C.W.
Bock) Dyes and Pigments, 2003, 59, 117.
27. QSAR modeling of genotoxicity on non-congeneric sets of organic
compounds (U. Maran, S. Sild) Artif. Intell. Rev., 2003, 20, 13.
28. Predicting melting points of quaternary ammonium ionic liquids (D.M. Eike,
J.F. Brennecke, E.J. Maginn) Green Chem., 2003, 5, 323.
29. Six-membered cyclic ureas as HIV-1 protease inhibitors: A QSAR study
based on CODESSA PRO approach (A.R. Katritzky, A. Oliferenko, A.
Lomaka, M. Karelson) Bioorg & Med. Chem. Lett. 2002, 12, 3453.
30. QSPR models based on molecular mechanics and quantum chemical
calculations - 1. Construction of Boltzmann-averaged descriptors for alkanes,
alcohols, diols, ethers and cyclic compounds. (J.D. Dyekjaer, K. Rasmussen,
S.Ó. Jónsdóttir), J. Mol. Modeling, 2002, 8, 277.
31. Charge-transfer interactions in the inhibition of MAO-A by
phenylisopropylamines - a QSAR study (G. Vallejos, M.C. Rezende, B.K.
Cassels) J. Comput-Aid. Mol. Design, 2002, 16, 95.
32. Correlation of liquid viscosity with molecular structure for organic
compounds using different variable selection methods (B. Lucic, I. Basic, D.
Nadramija, A. Milicevic, N. Trinajstic, T. Suzuki, R. Petrukhin, M. Karelson,
A.R. Katritzky) ARKIVOC, 2002, 4, 45.
33. QSPR models derived for the kinetic data of the gas-phase homolysis of the
carbon-methyl bond (R. Hiob, M. Karelson) Comput. & Chem. 2002, 26, 237.
34. Constructing a useful tool for characterizing amino acid conformers by
means of quantum chemical and graph theory indices (C. Cardenas, M.
Obregon, E.J. Llanos, E. Machado, H.J. Bohorquez, J.L. Villaveces, M. E.
Patarroyo) Comput. & Chem. 2002, 26, 667.
35. Novel potent 5-HT3 receptor Ligands based on the pyrrolidone structure:
Synthesis, biological evaluation, and computational rationalization of the
92
ligand-receptor interaction modalities (A. Cappelli, M. Anzini, S. Vomero, L.
Mennuni, F. Makovec, E. Doucet, M. Hamon, M.C. Menziani, P.G. De Benedetti,
G. Giorgi, C. Ghelardini, S. Collina) Bioorg & Med. Chem. 2002, 10, 779.
36. The Present Utility and Future Potential to Medicinal Chemistry of
QSAR/QSPR with Whole Molecule Descriptors, (A.R. Katritzky, D. Tatham,
D. Fara, U. Maran, A. Lomaka, M. Karelson) Current Topics in Medicinal
Chemistry, 2002, 2, 1333.
37. QSPR correlation of the melting point for pyridinium bromides, potential
ionic liquids (A.R. Katritzky, A. Lomaka, R. Petrukhin, R. Jain, M. Karelson,
A.E. Visser, R.D. Rogers) J. Chem. Inf. Comput. Sci., 2002, 42, 71.
38. Correlation of the melting points of potential ionic liquids (imidazolium
bromides and benzimidazolium bromides) using the CODESSA program
(A.R. Katritzky, R. Jain, A. Lomaka, R. Petrukhin, M. Karelson, A.E. Visser,
R.D. Rogers) J. Chem. Inf. Comput. Sci., 2002, 42, 225.
39. A General QSPR Treatment for Dielectric Constants of Organic Compounds
(S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 2002, 42, 360.
40. Prediction of Ultraviolet Spectral Absorbance using Quantitative StructureProperty Relationships, (W.L. Fitch, M. McGregor, A.R. Katritzky, A. Lomaka,
R. Petrukhin, M. Karelson) J. Chem. Inf. Comput. Sci., 2002, 42, 830.
41. General and class specific models for prediction of soil sorption using various
physicochemical descriptors (P.L. Andersson, U. Maran, D. Fara, M. Karelson,
J.L.M. Hermens) J. Chem. Inf. Comput. Sci., 2002, 42, 1450.
42. A QSPR study of sweetness potency using the CODESSA program (A.R.
Katritzky, R. Petrukhin, S. Perumal,, M. Karelson, I. Prakash, N. Desai) Croat.
Chem. Acta, 2002, 75, 475.
43. A QSPR model for the prediction of the gas-phase free energies of activation
of rotation around the N-C(O) bond (J. Leis, M. Karelson) Comput. & Chem.
2002, 26, 667.
44. Mutagenicity of aminoazobenzene dyes and related structures: a
QSAR/QPAR investigation (A. Garg, K.L. Bhat, C.W. Bock) Dyes and
Pigments, 2002, 55, 35.
45. QSAR of Flavonoids: 4. Differential Inhibition of Aldose Reductase and
p56lck Protein Tyrosine Kinase (A. Štefanič-Petek, A. Krbačič, T. Šolmajer)
Croat. Chem. Acta., 2002, 75, 517.
46. QSPR Correlation of Free-Radical Polymerization Chain-Transfer
Constants for Styrene (A.R. Katritzky, F.H. Ignatz-Hoover, R. Petrukhin, M.
Karelson) J. Chem. Inf. Comput. Sci., 2001, 41, 295.
93
47. Quantum mechanical quantitative structure activity relationships to avoid
mutagenicity in dental monomers (D. Yourtee, A.J. Holder, R. Smith, J.A.
Morrill, E. Kostoryz, W. Brockmann, A. Glaros, C. Chappelow, D. Eick) J.
Biomater. Sci. – Polym. Sci. 2001, 12, 89.
48. Correlation of the Solubilities of Gases and Vapors in Methanol and Ethanol
with Their Molecular Structures (A.R. Katritzky, D.B. Tatham, U. Maran) J.
Chem. Inf. Comput. Sci., 2001, 41, 358.
49. CODESSA-Based Theoretical QSPR Model for Hydantoin HPLC-RT
Lipophilicities (A.R. Katritzky, S. Perumal, R. Petrukhin, E. Kleinpeter) J.
Chem. Inf. Comput. Sci., 2001, 41, 569.
50. The variable connectivity index (1)chi(f) versus the traditional molecular
descriptors: A comparative study of (1)chi(f) against descriptors of
CODESSA (M. Randic, M. Pompe) J. Chem. Inf. Comput. Sci., 2001, 41, 631.
51. Factors influencing predictive models for toxicology (E. Benfenati, N. Piclin,
A. Roncaglioni, M.R. Vari) SAR QSAR Environm. Res. 2001, 12, 593.
52. A QSRR-Treatment of Solvent Effects on the Decarboxylation of 6Nitrobenzisoxazole-3-carboxylates Employing Molecular Descriptors (A.R.
Katritzky, S. Perumal, R. Petrukhin) J. Org. Chem., 2001, 66, 4036.
53. Interpretation of Quantitative Structure - Property and -Activity
Relationships (A.R. Katritzky, R. Petrukhin, D. Tatham, S. Basak, E. Benfenati,
M. Karelson, U. Maran) J. Chem. Inf. Comput. Sci., 2001, 41, 679.
54. Theoretical Descriptors for the Correlation of Aquatic Toxicity of
Environmental Pollutants by Quantitative Structure-Toxicity Relationships
(A.R. Katritzky, D.B. Tatham, U. Maran) J. Chem. Inf. Comput. Sci., 2001, 41,
1162.
55. QSPR Analysis of Flash Points (A.R. Katritzky, R. Petrukhin, R. Jain, M.
Karelson) J. Chem. Inf. Comput. Sci., 2001, 41, 1521.
56. Perspective on the Relationship between Melting Points and Chemical
Structure (A.R. Katritzky, R. Jain, A. Lomaka, R. Petrukhin, U. Maran, M.
Karelson) Crystal Growth & Design, 2001, 1, 261.
57. QSAR Correlations of the Algistatic Activity of 5-Amino-1-aryl-1Htetrazoles (A.R. Katritzky, R. Jain, R. Petrukhin, S.N. Denisenko, T. Schelenz)
S. Q. Env. Res., 2001, 12, 259.
58. QSPR Correlation and Predictions of GC Retention Indexes for MethylBranched Hydrocarbons Produced by Insects (A.R. Katritzky, K. Chen, U.
Maran, D.A. Carlson) Anal. Chem., 2000, 72, 101.
59. Non-linear QSAR Treatment of Genotoxicity (M. Karelson, S.Sild, U. Maran)
Mol. Simulat., 2000, 24, 229.
94
60. Structurally Diverse Quantitative Structure-Property Relationship
Correlations of Technologically Relevant Physical Properties (A.R. Katritzky,
U. Maran, V.S. Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 2000, 40, 1.
61. Prediction of Liquid Viscosity for Organic Compounds by a Quantitative
Structure-Property Relationship (A.R. Katritzky, K. Chen, Y. Wang, M.
Karelson, B. Lucic, N. Trinajstic, T. Suzuki, G. Shuurmann) J. Phys. Org. Chem.,
2000, 13, 80.
62. Quantitative relationship between rate constants of the gas-phase homolysis
of C-X bonds and molecular descriptors (R. Hiob, M. Karelson) J. Chem. Inf.
Comput. Sci., 2000, 40, 1062.
63. Quantitative structure-activity relationship of flavonoid analogues. 3.
Inhibition of p56(lck) protein tyrosine kinase (M. Oblak, M. Randic, T.
Solmajer) J. Chem. Inf. Comput. Sci., 2000, 40, 994.
64. QSAR equations for N-methyl-(substituted-phenyl)-urethanes by PRECLAV
program (L. Tarko) Rev. Roum. Chim., 2000, 45, 809.
65. Improved QSARs for predictive toxicology of halogenated hydrocarbons, (S.
Trohalaki, E. Gifford, R. Pachter) Comput. & Chem. 2000, 24, 421.
66. A Comprehensive QSAR Treatment of the Genotoxicity of Heteroaromatic
and Aromatic Amines (A.R. Katritzky, U. Maran, M. Karelson) Quant. Struct.
Act. Relat. Pharmacol. Chem. Biol., 1999, 18, 3.
67. A New Efficient Approach for Variable Selection Based on Multiregression:
Prediction of Gas Chromatographic Retention Times and Response Factors
(A.R. Katritzky, B. Lucic, N. Trinajstic, S. Sild, M. Karelson,) J. Chem. Inf.
Comput. Sci., 1999, 39, 610.
68. Development of quantitative structure-property relationships using
calculated descriptors for the prediction of the physicochemical properties
(n(D), p, bp, epsilon, eta) of a series of organic solvents (M. Cocchi, P.G. De
Benedetti, R. Seeber, L. Tassi, A. Ulrici) J. Chem. Inf. Comput. Sci., 1999, 39,
1190.
69. QSPR Treatment of Solvent Scales (A.R. Katritzky, T. Tamm, Y. Wang, S.
Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1999, 39, 684.
70. A Unified Treatment of Solvent Properties (A.R. Katritzky, T. Tamm, Y.
Wang, M. Karelson) J. Chem. Inf. Comput. Sci., 1999, 39, 692.
71. QSPR prediction of densities of organic liquids (M. Karelson, A. Perkson)
Comput. & Chem. 1999, 23, 49.
72. Estimation of the liquid viscosity of organic compounds with a quantitative
structure-property model (O. Ivanciuc, T. Ivanciuc, P.A. Filip, D. Cabrol-Bass)
J. Chem. Inf. Comput. Sci., 1999, 39, 515.
95
73. QSPR and QSAR Models Derived Using Large Molecular Descriptor Spaces.
A Review of CODESSA Applications (M. Karelson, U. Maran, Y. Wang, A. R.
Katritzky) Collect. Czech. Chem. Commun., 1999, 64, 1551.
74. Insights into Sulfur Vulcanization from QSPR Quantitative StructureProperty Relationship Studies (A.R. Katritzky, F. Ignatz-Hoover, V.S.
Lobanov, M. Karelson) Rubber Chem. Technol., 1999, 72, 318.
75. Relevance of theoretical molecular descriptors in quantitative structureactivity relationship analysis of alpha 1-adrenergic receptor antagonists
(M.C. Menziani, M. Montorsi, P.G. De Benedetti, M. Karelson) Bioorg & Med.
Chem. 1999, 7, 2437.
76. QSAR study in the N-aryl-anthranylic acids class (T. Laszlo) Revista de
Chemie, 1999, 50, 864.
77. QSPR and QSAR Models Derived with CODESSA Multipurpose Statistical
Analysis Software. (M. Karelson, U. Maran, Y. Wang, A.A. Katritzky), AAAI
Tech. Report, 1999, SS-99-01, 12.
78. Evaluation of quantitative structure-activity relationship methods for largescale prediction of chemicals binding to the estrogen receptor (W. Tong, D.R.
Lowis, R. Perkins, Y. Chen, W.J. Welsh, D.W. Goddette, T.W. Heritage, D.M.
Sheehan) J. Chem. Inf. Comput. Sci., 1998, 38, 669.
79. Normal Boiling Points for Organic Compounds: Correlation and Prediction
by a Quantitative Structure - Property Relationship (A.R. Katritzky, V.S.
Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 28.
80. Quantitative Structure - Property Relationship (QSPR) Correlation of Glass
Transition Temperatures of High Molecular Weight Polymers (A.R.
Katritzky, S. Sild, V. Lobanov, M. Karelson) J. Chem. Inf. Comput. Sci., 1998,
38, 300.
81. Correlation of the Aqueous Solubility of Hydrocarbons and Halogenated
Hydrocarbons with Molecular Structure (A.R. Katritzky, P.D.T. Huibers) J.
Chem. Inf. Comput. Sci., 1998, 38, 283.
82. Relationship of Critical Temperatures to Calculated Molecular Properties
(A.R. Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 293.
83. Quantitative structure-property relationship study of normal boiling points
for halogen/oxygen/sulfur-containing organic compounds using the
CODESSA program (O. Ivanciuc, T. Ivanciuc, A.T. Balaban), Tetrahedron,
1998, 54, 9129.
84. QSPR Studies on Vapor Pressure, Aqueous Solubility, and the Prediction of
Water-Air Partition Coefficients (A.R. Katritzky, Y. Wang, S. Sild, T. Tamm,
M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 720.
96
85. General Quantitative Structure-Property Relationship Treatment of the
Refractive Index of Organic Compounds (A.R. Katritzky, S. Sild, M.
Karelson) J. Chem. Inf. Comput. Sci., 1998, 38, 840.
86. Theoretical descriptors in quantitative structure - Affinity and selectivity
relationship study of potent N4-substituted arylpiperazine 5-HT1(Alpha)
receptor antagonists (M.C. Menziani, P.G. De Benedetti, M. Karelson) Bioorg
& Med. Chem. 1998, 6, 535.
87. Correlation and Prediction of the Refractive Indices of Polymers by QSPR
(A.R. Katritzky, S. Sild, M. Karelson) J. Chem. Inf. Comput. Sci., 1998, 38,
1171.
88. Prediction of Critical Micelle Concentration Using a Quantitative StructureProperty Relationship Approach. 2. Anionic Surfactants (A.R. Katritzky,
P.D.T. Huibers, V.S. Lobanov, D.O. Shah, M. Karelson) J. Colloid and Interface
Sci., 1997, 187, 113.
89. QSPR as a Means of Predicting and Understanding Chemical and Physical
Properties in Terms of Structure (A.R. Katritzky, M. Karelson, V.S. Lobanov)
Pure Appl. Chem., 1997, 69, 245.
90. Predicting Surfactant Cloud Point from Molecular Structure (A.R. Katritzky,
P.D.T. Huibers, D.O. Shah) J. Colloid and Interface Sci., 1997, 193, 132.
91. Prediction of Melting Points for the Substituted Benzenes: A QSPR
Approach (A.R. Katritzky, U. Maran, M. Karelson, V.S. Lobanov) J. Chem. Inf.
Comput. Sci., 1997, 37, 913.
92. QSPR Treatment of the Unified Nonspecific Solvent Polarity Scale (A.R.
Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1997, 37, 756.
93. QSAR calculation programme (T. Laszlo) Revista de Chemie, 1997, 48, 676.
94. CODESSA version 2.13 for Windows (O. Ivanciuc) J. Chem. Inf. Comput. Sci.,
1997, 37, 405.
95. Prediction of Critical Micelle Concentration Using a Quantitative StructureProperty Relationship Approach. 1. Nonionic Surfactants (A.R. Katritzky,
P.D.T. Huibers, V.S. Lobanov, D.O. Shah, M. Karelson) Langmuir, 1996, 12,
1462.
96. Correlation of Boiling Points with Molecular Structure. 1. A Training Set of
298 Diverse Organics and a Test Set of 9 Simple Inorganics (A.R. Katritzky,
L. Mu, V.S. Lobanov, M. Karelson) J. Phys. Chem., 1996, 100, 10400.
97. Quantum-Chemical Descriptors in QSAR/QSPR Studies (M. Karelson, V.S.
Lobanov, A.R. Katritzky) Chem. Rev., 1996, 96, 1027.
97
98. Prediction of Polymer Glass Transition Temperatures Using A General
Quantitative Structure-Property Relationship Treatment (A.R. Katritzky, P.
Rachwal, K.W. Law, M. Karelson, V.S. Lobanov) J. Chem. Inf. Comput. Sci.,
1996, 36, 879.
99. A QSPR Study of the Solubility of Gases and Vapors in Water (A.R.
Katritzky, L. Mu, M. Karelson) J. Chem. Inf. Comput. Sci., 1996, 36, 1162.
100.
Comprehensive Descriptors for Structural and Statistical Analysis. 1
Correlations Between Structure and Physical Properties of Substituted
Pyridines (A.R. Katritzky, V.S. Lobanov, M. Karelson, R. Murugan, M.P.
Grendze, J.E. Toomey Jr.) Rev. Roum. Chim., 1996, 41, 851.
101.
QSPR: The Correlation and Quantitative Prediction of Chemical and
Physical Properties from Structure (A.R. Katritzky, V.S. Lobanov, M.
Karelson) Chem. Soc. Rev., 1995, 279.
102.
Predicting Physical Properties from Molecular Structure (A.R.
Katritzky, R.Murugan, M.P. Grendze, J.E. Toomey, Jr, M. Karelson, V. Lobanov,
P. Rachwal) Chemtech., 1994, 24, 17.
103.
Prediction of Gas Chromatographic Retention Times and Response
Factors Using a General Quantitative Structure-Property Relationship
Treatment (A.R. Katritzky, E.S. Ignatchenko, R.A. Barcock, V.S. Lobanov, M.
Karelson) Anal. Chem., 1994, 66, 1799.
98
Download