Exercises on QSAR/QSPR

The goal is to present state-of-the-art QSAR/QSPR methodology. Most commercial software
solutions do not provide sufficiently rigorous methods; this is the case for MOE in particular.
For these exercises, CODESSA PRO or ISIDA are safer choices.
The first case study concerns the alkanes. Two datasets are provided: ALKAN.SDF and
ALKAN_15.SDF, the second being a subset of the first. The database contains several
fields:
• boiling point (bp) in °C
• melting point (mp) in °C
• molar volume (MV) in cm³/mol at 20 °C
• molar refractivity (MR) in cm³/mol at 20 °C
• heat capacity (HC) in J/(K·mol) at 300 K
• critical temperature (Tc) in K
• critical pressure (Pc) in atm
• surface tension (ST) in dyn/cm at 20 °C
The exercises focus on modeling first the boiling point and then the melting point of the
alkanes. Particular attention should be paid to:
• differences between the training set and the test set
• statistical parameters to assess the model utility
• cross-validation
• ensemble modeling
The boiling point is an “easy” property to model. The melting point is more difficult.
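The statistical parameters and the cross-validation mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the boiling points are approximate literature values for the linear alkanes C4 to C10 and are not read from ALKAN.SDF, the single descriptor (number of carbon atoms) is chosen for simplicity, and the helper names are hypothetical. The sketch compares the fitted R² with the leave-one-out cross-validated Q².

```python
import math

# Approximate literature boiling points (deg C) of linear alkanes C4..C10.
# Illustrative only: these values are NOT read from ALKAN.SDF.
carbons = [4, 5, 6, 7, 8, 9, 10]
bp = [-0.5, 36.1, 68.7, 98.4, 125.7, 150.8, 174.1]

def fit_line(x, y):
    """Ordinary least squares for y = a*x + b with a single descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / len(y_true))

# Fit on all compounds: R2 and RMSE measure the quality of the fit only.
a, b = fit_line(carbons, bp)
fitted = [a * x + b for x in carbons]
r2 = r_squared(bp, fitted)
rmse_fit = rmse(bp, fitted)

# Leave-one-out cross-validation: each compound is predicted by a model
# that was built without it, which gives Q2 and a cross-validated RMSE.
loo_pred = []
for i in range(len(carbons)):
    x_train = carbons[:i] + carbons[i + 1:]
    y_train = bp[:i] + bp[i + 1:]
    ai, bi = fit_line(x_train, y_train)
    loo_pred.append(ai * carbons[i] + bi)
q2 = r_squared(bp, loo_pred)
rmse_cv = rmse(bp, loo_pred)

print(f"R2 = {r2:.3f}, RMSE(fit) = {rmse_fit:.1f}")
print(f"Q2 = {q2:.3f}, RMSE(LOO) = {rmse_cv:.1f}")
```

The cross-validated figures are never better than the fitted ones for a least-squares model, which is why fit statistics alone overstate model utility.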
The next exercises cover model building for thrombin inhibition. The corresponding
dataset is thr_pKiStd.sdf. The file contains the structure of each compound and its pKi
inhibition value, in the field pKi. The name of each compound has also been replaced by this
value, which can be confusing. This is a real-life case: the dataset is typical of QSAR/QSPR
problems in its size, its diversity and the difficulty of finding “good” models. Because the
calculations are fairly lengthy, the modeling has been performed in advance and the results
will be displayed and discussed.
The last QSAR example is the TUB.SDF dataset, a small dataset of compounds that are active
or inactive against tuberculosis. It contains a field named activity whose value is either 1
(active) or 0 (inactive). The exercise proposes building a linear model to fit the binary
activity of the compounds, as an introduction to classification problems.
Note: linear regression on binary values, as used in this exercise, is an improper method for
classification. In this particular case, it gives an impression of “success”, but the models
are flawed by construction: for instance, they do not learn any statistical feature of the
actives or the inactives, since all compounds within a class are treated as equivalent. The
proper way to perform classification with linear regression methods is logistic regression.
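The difference between the two approaches can be illustrated with a minimal sketch. It assumes a toy dataset with a single descriptor (hypothetical values, not taken from TUB.SDF) and fits the logistic model by plain gradient descent; the function names and the learning-rate settings are choices made for this illustration only.

```python
import math

# Toy one-descriptor data set (hypothetical values, NOT taken from TUB.SDF).
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0, 0, 0, 1, 0, 1, 1, 1]  # 1 = active, 0 = inactive

def fit_linear(x, y):
    """Least-squares line y = a*x + b fitted directly to the 0/1 labels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=5000):
    """Logistic regression fitted by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        errors = [sigmoid(w * xi + b) - yi for xi, yi in zip(x, y)]
        w -= lr * sum(e * xi for e, xi in zip(errors, x)) / n
        b -= lr * sum(errors) / n
    return w, b

a, b_lin = fit_linear(x, y)
w, b_log = fit_logistic(x, y)

linear_out = [a * xi + b_lin for xi in x]
logistic_out = [sigmoid(w * xi + b_log) for xi in x]

# The linear model produces values below 0 and above 1 even on its own
# training data; the logistic outputs can be read as class probabilities.
print("linear  :", [round(v, 2) for v in linear_out])
print("logistic:", [round(v, 2) for v in logistic_out])
```

On this toy set the straight line already leaves the [0, 1] interval at both ends of the descriptor range, whereas the logistic outputs remain bounded and can be thresholded, for example at 0.5.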
The second part of the exercises focuses on Data Mining. In fact, most QSAR/QSPR
methods are borrowed from Data Mining, from the types of mathematical models to the
validation procedures and the assessment of success. The exercises use Weka, a Java-based
Data Mining software package from the University of Waikato in New Zealand. The
software is free and open source. Since it is not a chemoinformatics application, it cannot
work directly with chemical structures, nor can its results be analyzed in terms of chemical
structures. Nonetheless, it is very useful for illustrating typical situations and for building
models; all the chemoinformatics work, namely the generation of molecular descriptors and
the chemical analysis, has to be performed separately.
The Data Mining exercises use the following set of files:
• NBdist0.csv
• NBdist6.csv
• NBdist3.csv
• thrombinBig.arff
• thrombin.arff
• thrombin_all.sdf
• thrombin.hdr
The first three files are artificial 2D datasets, designed to illustrate the intuitive
concept of classification through clustering and supervised classification. Each represents
two sets of points drawn from distributions with different centers. The dataset NBdist6.csv
illustrates a difficult case, in which the two distributions largely overlap.
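The effect of overlap on classification can be sketched with synthetic data. The example below does not read the NBdist files: it assumes two unit-variance Gaussian clouds in 2D whose centers are a chosen distance apart, and uses a nearest-centroid rule as a stand-in for a simple supervised classifier. The separation values, sample sizes and function names are all hypothetical.

```python
import random

def make_two_clouds(separation, n_per_class, rng):
    """Two unit-variance 2D Gaussian clouds whose centers lie
    `separation` apart along the x axis. Returns (points, labels)."""
    points, labels = [], []
    for label, cx in ((0, 0.0), (1, separation)):
        for _ in range(n_per_class):
            points.append((rng.gauss(cx, 1.0), rng.gauss(0.0, 1.0)))
            labels.append(label)
    return points, labels

def centroids(points, labels):
    """Mean position of each class."""
    out = {}
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        out[lab] = (sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members))
    return out

def predict(point, cents):
    """Assign the point to the class with the nearest centroid."""
    return min(cents, key=lambda lab: (point[0] - cents[lab][0]) ** 2
                                      + (point[1] - cents[lab][1]) ** 2)

def accuracy(separation, seed=0):
    """Learn centroids on a training sample, score on a fresh test sample."""
    rng = random.Random(seed)
    train_x, train_y = make_two_clouds(separation, 200, rng)
    test_x, test_y = make_two_clouds(separation, 200, rng)
    cents = centroids(train_x, train_y)
    hits = sum(predict(p, cents) == lab for p, lab in zip(test_x, test_y))
    return hits / len(test_y)

acc_separated = accuracy(5.0)    # centers far apart: little overlap
acc_overlapping = accuracy(0.5)  # centers close together: heavy overlap

print("well separated:", acc_separated)
print("overlapping   :", acc_overlapping)
```

When the clouds overlap heavily, no decision rule can separate them well, so the test accuracy drops toward that of random guessing.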
The other files relate to the thrombin dataset. Here, all known active compounds are
labeled and a set of inactive compounds has been added. The SDF of the dataset is named
thrombin_all.sdf. ISIDA fragment descriptors have been computed for these compounds.
The nature of each fragment is given in the file thrombin.hdr, while the other two files,
thrombinBig.arff and thrombin.arff, contain the molecular descriptor values in
the native file format of the Weka software. The second file contains a selected subset of
the descriptors of the first.
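For readers unfamiliar with ARFF, the sketch below writes a minimal file in that format: a relation name, one attribute declaration per descriptor, a nominal class attribute, and a data section with one comma-separated row per compound. The relation name, attribute names and values are hypothetical; the actual layout of thrombin.arff is not reproduced here.

```python
def to_arff(relation, attribute_names, rows, class_values):
    """Build the text of a minimal ARFF file.

    rows is a list of (descriptor_values, class_label) pairs; all
    descriptors are declared numeric and the class is nominal.
    """
    lines = [f"@relation {relation}", ""]
    for name in attribute_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute activity {" + ",".join(class_values) + "}")
    lines.append("")
    lines.append("@data")
    for descriptors, label in rows:
        lines.append(",".join(str(v) for v in descriptors) + "," + label)
    return "\n".join(lines) + "\n"

# Hypothetical toy content: two fragment counts for two compounds.
arff_text = to_arff(
    "toy_fragments",
    ["frag_1", "frag_2"],
    [((2, 0), "active"), ((0, 1), "inactive")],
    ["active", "inactive"],
)
print(arff_text)
```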
The first part of the exercises consists of getting familiar with the Weka software while
illustrating clustering and classification on the artificial datasets.
The second part will use the Naïve Bayes algorithm to build and validate powerful predictive
models of the anti-thrombin activity.
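The principle of a Naïve Bayes classifier can be sketched from scratch. The example assumes that each fragment descriptor has been reduced to presence (1) or absence (0), which gives the Bernoulli variant with Laplace smoothing; it is not Weka's implementation, and the toy data and function names are hypothetical.

```python
import math

# Hypothetical toy data: each row flags presence (1) or absence (0) of
# three fragments; labels are 1 (active) and 0 (inactive).
X = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
]
y = [1, 1, 1, 0, 0, 0]

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log P(class) and P(fragment present | class) per class,
    with Laplace smoothing controlled by alpha."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == c]
        prior = len(rows) / len(X)
        probs = [(sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for j in range(n_features)]
        model[c] = (math.log(prior), probs)
    return model

def log_scores(model, row):
    """Unnormalized log posterior of each class. The fragments are treated
    as conditionally independent, which is the 'naive' assumption."""
    scores = {}
    for c, (log_prior, probs) in model.items():
        s = log_prior
        for xj, pj in zip(row, probs):
            s += math.log(pj if xj else 1.0 - pj)
        scores[c] = s
    return scores

def predict(model, row):
    scores = log_scores(model, row)
    return max(scores, key=scores.get)

model = train_bernoulli_nb(X, y)
pred_a = predict(model, [1, 1, 0])  # fragment pattern typical of the actives
pred_b = predict(model, [0, 0, 1])  # fragment pattern typical of the inactives
print(pred_a, pred_b)
```

In practice the model would be validated on compounds not used for training, for example by cross-validation, before any claim about predictive power is made.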