Theoretical characterization of McReynolds' constants

Analytica Chimica Acta 554 (2005) 163–171
Róbert Rajkó a,∗ , Tamás Körtvélyesi b,∗ , Krisztina Sebők-Nagy c , Miklós Görgényi b
a Department of Unit Operations and Environmental Engineering, College Faculty of Food Engineering, University of Szeged, H-6701 Szeged, P.O. Box 433, Hungary
b Department of Physical Chemistry, University of Szeged, H-6701 Szeged, P.O. Box 105, Hungary
c Chemical Research Center, Hungarian Academy of Sciences, H-1525 Budapest, P.O. Box 17, Hungary
Received 20 April 2005; received in revised form 5 August 2005; accepted 12 August 2005
Available online 21 September 2005
Abstract
The properties of McReynolds’ constants were studied by a detailed statistical/chemometric analysis. The electronic structure, geometry and hydrophobicity of the McReynolds’ test compounds (benzene, 1-butanol, 2-pentanone, 1-nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane) were calculated by the PM3 semiempirical quantum chemical method and by empirical formulas. The predominant pattern was revealed using cluster and principal component analyses (CA and PCA). The dependence of McReynolds’ constants on the calculated chemical descriptors was modeled by multiple linear regression (MLR) with stepwise selection, principal component regression (PCR) and partial least-square regression (PLSR). A novel statistical approach was developed for case-and-variable selection using the PCR and PLSR methods for characterizing and modeling the polarity of 25 gas chromatography (GC) stationary phases (phthalates, adipates, sebacates, phosphates, citrates and nitriles). Based on Q², the highest occupied molecular orbital energy, dipole moment, average isotropic polarizability and apolar solvent accessible surface area were suitable for describing the McReynolds’ constants; based on adjusted-Q², the energy of the lowest unoccupied molecular orbital and the total solvent accessible surface area were suitable. Six of the 10 test compounds were found to be sufficient for the description of the polarity of the columns studied.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Polarity; McReynolds’ constants; Stationary phases; Gas chromatography; Quantum chemical method; Principal component analysis; Principal component
regression; Partial least-square regression; Case and variable selection
1. Introduction
Chromatographers continually seek an easy-to-use method for characterizing the interaction between stationary phase and solute in order to forecast gas chromatographic retention behavior. Which stationary phase (column type) is suitable for separating all, or as many as possible, of the solutes in a complex mixture? To answer this we have to know the polarity and selectivity of a column. Selectivity is the ability of the stationary phase to participate in specific intermolecular interactions. Depending on the extent of these interactions, some solutes dissolve better than others in a given stationary phase, which ultimately results in separation [1].
∗
Corresponding authors.
E-mail addresses: rajko@sol.cc.u-szeged.hu (R. Rajkó),
kortve@chem.u-szeged.hu (T. Körtvélyesi).
0003-2670/$ – see front matter © 2005 Elsevier B.V. All rights reserved.
doi:10.1016/j.aca.2005.08.024
The polarity concept was intended for characterizing the interaction of the stationary phase and the solute on the basis of its structure. Basically, polarity means that the more polar a stationary phase is, the greater the retention of a polar solute compared with a non-polar solute such as an n-alkane (see, e.g., Ref. [2]). On this basis, polarity is the sum of various intermolecular interactions (inductive, dispersive, orientation and H-bonding). In gas chromatography, the interactions depend not only on the stationary phase but also on the solute and its functional group. Polarity is a term that is difficult to define: the dipole moment, for example, is often used as a symbol of polarity, but it cannot serve as a single measure of chromatographic interactions. Some empirical measures of the polarity and/or selectivity of stationary phases are available: McReynolds’ polarity (P) [3], the Kovats coefficient (KC) [4], retention polarity (RP) [5], Snyder’s selectivity parameters [6], Castello’s C [7] and GCH2 [8]. The polarity/selectivity properties of thirty stationary phases were characterized by Heberger [9] using principal component analysis (PCA). Two groups of polarity scales were found. The first group (P, KC, RP and C) and the second group (Snyder’s selectivity parameters, Castello’s C) characterize the column mainly by polarity and selectivity, respectively. The most influential properties are: (i) polarity, (ii) hydrogen donating and accepting ability and (iii) dipole interactions. The principal components of retention data for oxo compounds were correlated with physical properties (molar refractivity (RM), boiling point (TBP), molar volume (Vm)) [10]. A predictive model based on the partial least-square regression (PLSR) method was suggested [11].
According to the thermodynamic concept the reluctance of
the liquid phase to accept a hydrocarbon may be considered
as a measure of polarity. The measure of this behavior is the
partial molar Gibbs free energy of solution for a methylene group
[8,12,13].
According to the best-known and most widely used Rohrschneider–McReynolds concept, the difference between the Kovats retention index of a specific test compound p on the column studied ($I_p$) and on squalane ($I_{sq}$) provides a measure of polarity [3,14,15] (Eq. (1)). By definition the polarity of squalane is 0, because it is considered an apolar (reference) phase:

$$\Delta I_x = I_p - I_{sq} \qquad (1)$$
In the Rohrschneider concept the intermolecular forces are additive and are characterized by several factors, some characteristic of the solute (a, b, c, d, e) and some of the stationary phase (x, y, z, u, s):

$$\Delta I_{i,j}(\mathrm{calc.}) = a_i x_j + b_i y_j + c_i z_j + d_i u_j + e_i s_j \qquad (2)$$

$\Delta I_{i,j}(\mathrm{calc.})$ is the difference in Kovats indices between the phase of interest and squalane. $x_j$, $y_j$, $z_j$, $u_j$ and $s_j$ are calculated for each phase from the difference in Kovats indices of benzene, ethanol, methyl ethyl ketone, nitromethane and pyridine, respectively. $a_i$, $b_i$, $c_i$, $d_i$ and $e_i$ are empirical coefficients, which can be calculated from retention data for each solute using various liquid phases. In the simplest case $a_i$, $b_i$, $c_i$, $d_i$ and $e_i$ all equal 1 (or only one equals 1 and the others are 0); however, if $\Delta I_{i,j}(\mathrm{calc.})$ is known in advance, the profiles (a, b, c, ... and x, y, z, ...) can be estimated by factor analysis (FA) [16].
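The additive form of Eq. (2) can be illustrated with a short sketch. The function name, the phase constants and the "mixed" solute coefficients below are hypothetical and serve only to show the arithmetic; they are not values derived in this work.

```python
from typing import Sequence


def delta_i_calc(solute_coeffs: Sequence[float], phase_consts: Sequence[float]) -> float:
    """Additive Rohrschneider model, Eq. (2): a*x + b*y + c*z + d*u + e*s."""
    if len(solute_coeffs) != len(phase_consts):
        raise ValueError("need one solute coefficient per phase constant")
    return sum(c * p for c, p in zip(solute_coeffs, phase_consts))


# Hypothetical phase constants (x, y, z, u, s), for illustration only.
phase = (137.0, 278.0, 198.0, 300.0, 235.0)

# Simplest case from the text: one coefficient equals 1 and the others are 0,
# so the solute behaves exactly like the first probe compound.
first_probe_like = (1.0, 0.0, 0.0, 0.0, 0.0)

# Invented coefficients for a solute of mixed character.
mixed_solute = (0.5, 0.2, 0.1, 0.0, 0.3)

delta_first = delta_i_calc(first_probe_like, phase)
delta_mixed = delta_i_calc(mixed_solute, phase)
```

With a single unit coefficient the calculated index difference reduces to the corresponding phase constant, which is why the probe compounds themselves define x, y, z, u and s.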
Rohrschneider originally used five compounds; McReynolds later analyzed 68 compounds on 25 columns and selected the 10 compounds that characterize the columns best [3]: benzene, 1-butanol, 2-pentanone, 1-nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane. The most informative of these (benzene, 1-butanol, 2-pentanone, 1-nitropropane and pyridine) are either the same compounds Rohrschneider used or homologs of Rohrschneider’s compounds.
The criterion for selecting the test compounds was their ability to participate in various types of interactions with the different stationary phases through inductive forces, donor–acceptor forces or H-bonding (H+ donor and acceptor). While 2-methyl-2-pentanol and 1-iodobutane were found to increase the precision of prediction, the influence of 2-octyne, 1,4-dioxane and cis-hydrindane may be negligible. McReynolds’ relative polarity scale has been determined for more than 200 liquid phases.
Although polarity is often used for predicting retention data, several other factors may influence absorption [17]. A number of quantitative structure-retention relationship (QSRR) studies have been performed on different series of compounds, and good correlations were found between the Kovats retention index (IR) and theoretically calculated data for molecules with different functional groups (azo compounds [18], alkenes and azo compounds [19], dialkyl hydrazones [20], alkenes [21], alkylbenzenes [22], phenol derivatives [23], primary, secondary and tertiary amines [24], etc.). Generally, elution data from one or only a few columns were used. In the QSRR studies the correlation between the Kovats retention indices and molecular descriptors obtained by various methods (experimental, empirical or theoretical) was studied in order to obtain linear multivariate functions for predicting the retention properties of the compounds (see e.g. [26]). The use of quantum chemical descriptors has drawn some criticism [27], but their application is supported by their success [19–21,24,25].
In this study, we investigate the correlation between McReynolds’ polarity scale [3] and the structural/physical properties of the McReynolds’ test compounds used for characterizing the columns. We analyze which structural descriptors of the McReynolds’ test molecules have the greatest influence on the McReynolds’ numbers. The descriptors considered are: HOMO, the energy of the highest occupied molecular orbital {1}; LUMO, the energy of the lowest unoccupied molecular orbital {2}; the dipole moment (µ) {3}; the isotropic average polarizability at 0 eV electric field (α) {4}; the volume of the molecule (V) {5}; the logarithm of the octanol–water partition coefficient (log P) {6}; and the total, polar and apolar solvent accessible surface areas (SASA, pSASA and apSASA, respectively) {7,8,9}. The test molecules are [1] benzene, [2] 1-butanol, [3] 2-pentanone, [4] 1-nitropropane, [5] pyridine, [6] 2-methyl-2-pentanol, [7] 1-iodobutane, [8] 2-octyne, [9] 1,4-dioxane and [10] cis-hydrindane. The calculations were performed by the PM3 semiempirical quantum chemical method and by chemometric methods (cluster analysis (CA), principal component analysis (PCA), multiple linear regression (MLR), principal component regression (PCR) and partial least-square regression (PLSR)). A recently developed chemometric method for building descriptive models, case/variable selection by principal component and partial least-square regression (CVS–PCR and CVS–PLSR), was also applied.
2. Calculations
The structural descriptors HOMO, LUMO, µ and α were calculated for the 10 McReynolds’ test molecules with full geometry optimization by the PM3 semiempirical quantum chemical method implemented in MOPAC93 [28]. The gradient norms were always less than 0.01 kcal/mol/Å. The force matrix was positive definite for the small molecules, which confirms that conformational minima were found. For some simple molecules 2–5 conformers were calculated, and the thermodynamically most stable structure was always accepted. SASA, pSASA and apSASA (the radius of the probe solvent molecule was set to 0.14 nm), V and log P were calculated by VEGA [29].
Table 1
Molecular parameters calculated by the PM3 semiempirical quantum chemical method and empirical expressions

Test molecule            HOMO/eV  LUMO/eV  µ/D    α/a.u.  V/Å³   log P   SASA/Å²  pSASA/Å²  apSASA/Å²
                         {1}      {2}      {3}    {4}     {5}    {6}     {7}      {8}       {9}
[1] Benzene              −9.751   0.396    0      45.56   82.0   1.854   243.9    0         243.9
[2] 1-Butanol            −10.887  3.159    1.417  32.28   85.4   0.998   263.2    58.0      205.3
[3] 2-Pentanone          −10.680  0.797    2.719  38.62   97.5   0.970   276.9    43.7      233.2
[4] 1-Nitropropane       −12.091  0.033    4.166  34.29   82.8   1.413   259.1    96.4      162.7
[5] Pyridine             −10.104  −0.005   1.936  43.60   77.9   0.998   238.3    24.8      213.5
[6] 2-Methyl-2-pentanol  −11.139  3.116    1.451  45.46   118.8  1.514   304.2    38.8      265.4
[7] 1-Iodobutane         −9.449   −0.453   1.805  51.23   118.3  3.329   312.7    86.8      225.9
[8] 2-Octyne             −10.276  1.793    0.069  60.06   133.8  3.456   363.4    0         363.4
[9] 1,4-Dioxane          −10.448  2.840    0      35.15   84.8   −0.138  245.3    41.8      203.5
[10] cis-Hydrindane      −10.937  3.451    0.033  60.49   141.8  3.220   321.7    0         321.7

Structural descriptors: HOMO: energy of the highest occupied molecular orbital, LUMO: energy of the lowest unoccupied molecular orbital, µ: dipole moment in Debye, α: isotropic average polarizability in 0 eV electric field, V: molecular volume, log P: logarithm of the octanol–water partition coefficient, SASA: solvent accessible surface area, pSASA: polar solvent accessible surface area, apSASA: apolar solvent accessible surface area.
McReynolds’ data were collected from the literature [1].
The statistical evaluation (MLR, CA and PCA) of the data was performed with the PROSTAT [30] and STATISTICA [31] packages.
The PLSR [16,32–34] and PCR [16,32–34] algorithms implemented in PLS Toolbox V3.0 [35] for MatLab V6.1 R12 [36] were used with homemade MatLab code. Almost all possible cases were calculated based on both the nine descriptors and the 10 test molecules for McReynolds’ constants. The selection criterion was Q², i.e., the squared correlation coefficient for the leave-one-out cross-validated data.
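To give a feeling for the size of such a case/variable search, the sketch below enumerates candidate models as pairs of a test-molecule subset and a descriptor subset. It assumes subsets of at least six test molecules (the lower limit reported later in Section 3.3) and every non-empty descriptor subset; since only "almost all" cases were actually calculated, the exact candidate set used in this work may differ. All names are hypothetical.

```python
from itertools import combinations
from math import comb

N_MOLECULES = 10    # McReynolds' test molecules, numbered [1]..[10]
N_DESCRIPTORS = 9   # descriptors, numbered {1}..{9}
MIN_MOLECULES = 6   # assumption: smallest case subset that is evaluated


def candidate_models(n_mol=N_MOLECULES, n_desc=N_DESCRIPTORS, min_mol=MIN_MOLECULES):
    """Yield (molecule_subset, descriptor_subset) pairs of 1-based indices."""
    molecules = range(1, n_mol + 1)
    descriptors = range(1, n_desc + 1)
    for k in range(min_mol, n_mol + 1):
        for mol_subset in combinations(molecules, k):
            for d in range(1, n_desc + 1):
                for desc_subset in combinations(descriptors, d):
                    yield mol_subset, desc_subset


# Closed-form size of the search space under the same assumptions.
n_case_subsets = sum(comb(N_MOLECULES, k) for k in range(MIN_MOLECULES, N_MOLECULES + 1))
n_variable_subsets = 2 ** N_DESCRIPTORS - 1
n_candidates = n_case_subsets * n_variable_subsets
```

Each candidate would then be fitted and scored by its cross-validated Q²; with only 10 cases and 9 variables the exhaustive loop stays small enough to run in full.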
3. Results and discussion
The quantum chemical descriptors (independent variables) and log P of the test compounds are summarized in Table 1. Table 2 summarizes the experimentally obtained McReynolds’ numbers for 25 gas chromatographic columns of different polarities – phthalates (bis(2-butoxyethyl)phthalate (BBP), bis(2-ethylhexyl)phthalate (BEP), bis(2-ethoxyethyl)phthalate (BIP), bis(2-ethoxyethoxyethyl)phthalate (BEEP), butyloctylphthalate (BOF), dicyclohexyl phthalate (DIC), didecyl phthalate (DDP), dinonylphthalate (DNP), bis(2-ethylhexyl)tetrachlorophthalate
Table 2
McReynolds constants of different stationary phases studied

Abbrev.  X    Y    Z    U    S    H    I    K    L    M    Stationary phase
BBA      137  278  198  300  235  216  118  104  205  28   Bis(2-butoxyethyl) adipate
DAP      76   181  121  187  134  144  71   55   119  9    Bis(2-ethylhexyl) adipate
BBP      151  282  227  338  267  217  138  112  225  48   Bis(2-butoxyethyl)phthalate
BEP      92   186  150  236  167  143  92   66   140  26   Bis(2-ethylhexyl)phthalate
BIP      214  375  305  446  364  290  190  159  312  79   Bis(2-ethoxyethyl)phthalate
BEEP     233  408  317  470  389  309  207  170  337  92   Bis(2-ethoxyethoxyethyl)phthalate
BOF      97   194  157  246  174  149  96   69   147  27   Butyloctylphthalate
DIC      146  257  206  316  245  196  144  104  204  58   Dicyclohexyl phthalate
DDP      136  255  213  320  235  201  126  101  202  38   Didecyl phthalate
DNP      83   183  147  231  159  141  82   65   138  18   Dinonylphthalate
DIOC2    109  132  113  171  168  104  75   45   137  34   Bis(2-ethylhexyl)tetrachlorophthalate
BES      151  306  211  320  274  328  129  110  224  36   Bis(2-ethoxyethyl)sebacate
DOS      72   168  108  180  125  132  68   49   107  11   Bis(2-ethylhexyl)sebacate
DNS      66   166  107  178  118  130  62   50   106  8    Dinonyl sebacate
ODA      79   179  119  193  134  141  72   57   119  10   Octyldecyladipate
THEED    463  942  626  801  893  746  427  269  721  254  N,N,N′,N′-Tetrakis-(2-hydroxyethyl)-ethylenediamine
CDP      199  351  285  413  336  266  190  153  292  88   Cresyldiphenyl phosphate
TBP      141  373  209  341  274  285  126  104  204  31   Tributoxyethyl phosphate
TEHP     71   288  117  215  132  225  71   47   103  7    Tris(2-ethyl-hexyl) phosphate
TCP      176  321  250  374  299  242  169  131  254  76   Tricresyl phosphate
AC       135  268  202  314  233  214  112  102  207  26   Acetyltributyl citrate
SOR      88   263  158  200  258  201  82   55   180  37   Sorbitan monostearate
SORM     97   266  170  216  268  207  94   66   191  41   Sorbitan monooleate
TCEPE    526  782  677  920  837  621  444  333  766  237  Tetracyanoethylpentaerythritol
DGDS     64   193  106  143  191  147  57   41   121  20   Diethylene glycol distearate

X: I(benzene), Y: I(1-butanol), Z: I(2-pentanone), U: I(1-nitropropane), S: I(pyridine), H: I(2-methyl-2-pentanol), I: I(1-iodobutane), K: I(2-octyne), L: I(1,4-dioxane), M: I(cis-hydrindane). Data were found in Ref. [1].
(DIOC2)), adipates (bis(2-butoxyethyl) adipate (BBA), bis(2-ethylhexyl) adipate (DAP)), sebacates (bis(2-ethoxyethyl)sebacate (BES), bis(2-ethylhexyl)sebacate (DOS), dinonyl sebacate (DNS), octyldecyladipate (ODA)), phosphates (cresyldiphenyl phosphate (CDP), tributoxyethyl phosphate (TBP), tris(2-ethyl-hexyl) phosphate (TEHP), tricresyl phosphate (TCP)), citrates (acetyltributyl citrate (AC)), nitriles (tetracyanoethylpentaerythritol (TCEPE)), amines (N,N,N′,N′-tetrakis(2-hydroxyethyl)-ethylenediamine (THEED)), stearates (sorbitan monostearate (SOR), diethylene glycol distearate (DGDS)) and oleates (sorbitan monooleate (SORM)) – published in the literature [1] and used in the calculations. The full names and the abbreviations of the stationary phases are also given in Table 2.
The cross-correlation data of the chemical descriptors of the 10 test molecules (benzene, 1-butanol, 2-pentanone, 1-nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane) show high correlation in some cases: SASA and V (R = 0.938); apSASA and α, α and V, and α and log P (R > 0.85). This is important in multivariate regression because of multicollinearity. The pair correlations between HOMO, LUMO, pSASA, apSASA and SASA were found to be less than 0.4. Values less than 0.6 were obtained between HOMO, LUMO, µ and α. The correlation coefficient between apSASA and µ was 0.77. The correlations between the McReynolds’ numbers of the different stationary phases were also large (R > 0.9).
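The pair correlations quoted above can be checked directly from Table 1. The following minimal sketch computes the Pearson correlation coefficient for SASA and V; the function name is hypothetical and the statistical packages used in this work may report the values with different rounding.

```python
from math import sqrt

# Descriptor values from Table 1, in the order of test molecules [1]..[10].
V = [82.0, 85.4, 97.5, 82.8, 77.9, 118.8, 118.3, 133.8, 84.8, 141.8]           # molecular volume
SASA = [243.9, 263.2, 276.9, 259.1, 238.3, 304.2, 312.7, 363.4, 245.3, 321.7]  # total SASA


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r_sasa_v = pearson_r(SASA, V)
```

A coefficient this close to 1 means that SASA and V carry nearly the same information, which is the multicollinearity problem mentioned in the text.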
3.1. Results of cluster analysis
The variables were standardized before cluster analysis: the mean value of each matrix column was subtracted from all the elements of that column, and the resulting data were divided by the column standard deviation. This procedure ensures that differences in scale and units do not distort the cluster analysis.
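The standardization step can be sketched as follows for a single descriptor column. The sample standard deviation (n − 1 in the denominator) is an assumption here; the text does not state which definition the statistical package applied.

```python
from statistics import mean, stdev


def standardize(column):
    """Center a column to mean 0 and scale it to unit standard deviation."""
    m = mean(column)
    s = stdev(column)  # sample standard deviation (n - 1); an assumption
    if s == 0:
        raise ValueError("a constant column cannot be standardized")
    return [(value - m) / s for value in column]


# Dipole moments (in D) of the ten test molecules, taken from Table 1.
mu = [0.0, 1.417, 2.719, 4.166, 1.936, 1.451, 1.805, 0.069, 0.0, 0.033]
z_mu = standardize(mu)
```

After this transformation a descriptor measured in Debye and one measured in Å² contribute on the same footing to the distances used in clustering.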
CA was performed using Ward’s method, which uses an analysis of variance approach to evaluate the distances between the clusters. At each step it merges the two clusters whose union gives the smallest sum of squares. On clustering all the descriptors (Fig. 1), we obtain pSASA and µ, SASA and V, apSASA and α, and HOMO and LUMO as clusters with two members. log P separates from the α and apSASA cluster. In the analysis of the stationary phases (Fig. 2), two main clusters were obtained with 11 and 14 stationary phases. The first cluster was separated into two clusters: DGDS, SOR, SORM, THEED, and BES, TEHP, TBP, DNS, ODA, DES, DAP. The second cluster was also separated into two smaller ones: TCET, DIOC2, and BOF, DIC, TCP, CDP, BEEP, BIP, BBP, DDP, DIN, BEP, AC, BBA. Our conclusion is that it is difficult to classify the columns by polarity on the basis of cluster analysis of the McReynolds’ numbers. Although similarity in the polarity of columns could be determined (see, e.g., SOR and SORM, or AC and BBA), contradictions were found in some cases (e.g., THEED, which has large McReynolds numbers, was found to be similar to SOR, SORM and DGDS).
Fig. 1. Result of the cluster analysis (Ward’s method) for the descriptors.
Fig. 2. Result of the cluster analysis (Ward’s method) for the dependent variables.
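The merging criterion of Ward’s method can be made concrete with a small sketch. It uses the textbook form of the criterion (the increase in the total within-cluster sum of squares caused by a merge); the implementation in the statistical package used here may differ in detail, and the toy clusters are invented for illustration.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equally long tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def within_ss(points):
    """Sum of squared Euclidean distances of the points to their centroid."""
    c = centroid(points)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in points)


def ward_merge_cost(a, b):
    """Increase in total within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


# Invented two-dimensional toy clusters.
cluster_a = [(0.0, 0.0), (2.0, 0.0)]
cluster_b = [(4.0, 0.0)]
cost = ward_merge_cost(cluster_a, cluster_b)
increase = within_ss(cluster_a + cluster_b) - within_ss(cluster_a) - within_ss(cluster_b)
```

The closed-form cost and the directly computed increase agree, so at every step the algorithm only needs cluster sizes and centroids to pick the cheapest merge.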
3.2. Principal component analysis (PCA)
Basically, PCA decomposes the original matrix into the product of score (orthogonal) and loading (orthonormal) matrices. At least three variables are necessary to explain more than 90% of the total variance. The first factor explains 82.0% of the total variance, the first and the second together explain 88.3%, and the first three factors explain 94.1%. We may expect that three orthogonal variables describe the McReynolds’ constants with acceptable error (cf. [9]). The loadings correspond to the correlation coefficients between the 34 original variables and the factors. The first factor correlates with α, V, log P, apSASA and all the McReynolds’ constants of the stationary phases studied (the loadings are significant, >0.700). Fig. 3 shows the relationship between Factors 1 and 2. The correlation between the McReynolds’ numbers of the different stationary phases is very high. Factor 2 did not correlate significantly with any variable, and Factor 3 correlated significantly only with LUMO. The dependence of Factor 1 versus Factor 2 versus Factor 3 can be found in Fig. 4. The pattern of clusters for the stationary phases shows a distribution similar to that found in the cluster analysis (see Figs. 1 and 2).
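The choice of three factors follows from simple cumulative arithmetic, sketched below. The per-factor contributions are obtained by differencing the cumulative percentages quoted in this section (82.0, 88.3 and 94.1%); the function name is hypothetical.

```python
from itertools import accumulate


def components_needed(explained_percent, threshold):
    """Smallest number of leading factors whose cumulative share exceeds threshold."""
    for k, total in enumerate(accumulate(explained_percent), start=1):
        if total > threshold:
            return k
    return None  # the listed factors never exceed the threshold


# 82.0, 88.3 - 82.0 and 94.1 - 88.3 percent of the total variance.
factor_shares = [82.0, 6.3, 5.8]
n_factors = components_needed(factor_shares, 90.0)
```

Two factors stay just below the 90% mark, so the third one is needed even though it adds less than 6% of the variance.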
Fig. 3. Factor loadings, Factor 1 vs. Factor 2 (unrotated), extraction by principal components.
3.3. Case and variable selection by multiple linear regression (MLR), principal component regression (PCR) and partial least-square regression (PLSR) methods
Unfortunately, neither CA nor PCA could give an unambiguous and usable answer to the question of which variables are important and which are negligible in the model.
First, MLR with backward elimination or forward selection (stepwise mode) was performed for the McReynolds’ constants of the individual stationary phases in order to find the necessary descriptors. The variable selection criterion was p < 0.10 (p is the significance level, i.e., the probability that the effect occurred by chance). The results are summarized in Table 3. The best results were obtained with the LUMO, µ, V and SASA descriptors using all of the test molecules (Table 3). In some cases (CDP and TCEPE), LUMO, µ and α were found to be the necessary descriptors on the basis of the stepwise regression criteria. In some cases SASA was not significant in the model. By evaluating the equations individually we lose the information on the McReynolds’ stationary phase polarity system as a whole. The descriptors, i.e., the properties of the test molecules, obtained in the statistical evaluation support the parameters that are important in absorption: LUMO is a measure of electron affinity; µ, the dipole moment, reflects the polarity of the test molecule; V and SASA, the volume and the solvent (water) accessible surface area, are measures of the size of the molecule; α, the polarizability, is a measure of the flexibility of the electron system of the molecule.

Table 3
Results of MLR calculations

Phase   A         B       C       D         E      R²      F
BBA     176.39    28.14   36.19   −5.12     1.55   0.950   23.98
DAP     84.72     21.29   26.68   −3.33     1.05   0.936   18.32
BBP     237.23    24.04   39.72   −4.60     1.21   0.965   34.21
BEP     138.34    16.84   31.09   −3.03     0.83   0.972   42.90
BIP     345.42    29.99   48.43   −5.85     1.47   0.962   31.53
BEEP    546.31    28.60   51.31   −3.58     n.a.   0.930   26.55
BOF     146.02    17.49   32.18   −3.17     0.86   0.972   43.86
DIC     330.71    16.23   37.08   −2.13     n.a.   0.941   31.65
DDP     196.99    22.53   39.55   −4.31     1.20   0.970   40.41
DNP     114.56    18.86   31.67   −3.31     0.98   0.971   41.94
DIOC2   258.33    n.a.    9.73    −1.59     n.a.   0.915   37.85
BES     351.09    35.00   46.96   −2.53     n.a.   0.765   6.51
DOS     83.77     19.05   25.54   −2.97     0.91   0.934   17.68
DNS     64.36     20.43   26.05   −3.13     1.02   0.943   20.59
ODA     85.25     20.32   26.50   −3.34     1.06   0.941   19.96
THEED   1187.80   86.51   91.93   −8.11     n.a.   0.853   11.62
CDP     540.78    n.a.    24.60   −7.09 a   n.a.   0.896   30.14
TBP     374.60    39.97   53.62   −2.92     n.a.   0.835   10.14
TEHP    192.79    40.73   45.99   −1.85     n.a.   0.738   5.63
TCP     405.8     22.44   43.48   −2.64     n.a.   0.936   29.19
AC      179.14    27.91   38.75   −5.04     1.50   0.956   26.85
SOR     323.83    29.87   30.41   −2.52     n.a.   0.787   7.38
SORM    334.52    28.15   30.99   −2.52     n.a.   0.796   7.80
TCEPE   1526.07   n.a.    n.a.    −20.41 a  n.a.   0.841   42.21
DGDS    235.77    21.12   21.88   −1.85     n.a.   0.749   5.98

A: intercept, B: LUMO, C: µ, D: V, E: SASA. R²: square of the correlation coefficient, F: Fisher number.
a Descriptor: polarizability.
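As a plausibility check, a fitted MLR equation from Table 3 can be evaluated for individual test molecules. The sketch below does this for the BBA phase with the descriptor values of Table 1, assuming that the first coefficient column of Table 3 is the intercept A and that the descriptors enter in the units listed in Table 1; the function name is hypothetical.

```python
def mlr_predict(coeffs, lumo, mu, volume, sasa):
    """Evaluate A + B*LUMO + C*mu + D*V + E*SASA for one test molecule."""
    a, b, c, d, e = coeffs
    return a + b * lumo + c * mu + d * volume + e * sasa


# Coefficients A, B, C, D, E for the BBA phase (Table 3).
bba = (176.39, 28.14, 36.19, -5.12, 1.55)

# Descriptor values of two test molecules (Table 1).
benzene = dict(lumo=0.396, mu=0.0, volume=82.0, sasa=243.9)
butanol = dict(lumo=3.159, mu=1.417, volume=85.4, sasa=263.2)

pred_benzene = mlr_predict(bba, **benzene)  # Table 2 lists X = 137 for BBA
pred_butanol = mlr_predict(bba, **butanol)  # Table 2 lists Y = 278 for BBA
```

Both predictions land within roughly 10 index units of the tabulated McReynolds’ constants, in line with the R² of 0.950 reported for this phase.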
Because MLR with stepwise regression can operate on only one dependent variable at a time, an iterative method was developed to find both the test molecules (cases) and the descriptors (independent variables) necessary for explaining all the McReynolds’ numbers. The regression model used is:

$$\underset{10\times 25}{\mathbf{Y}} = \underset{10\times 9}{\mathbf{X}}\;\underset{9\times 25}{\mathbf{B}} \qquad (3)$$

Fig. 4. Factor loadings, Factor 1 vs. Factor 2 vs. Factor 3 (unrotated), extraction by principal components.
Fig. 5 shows the scree plots for the X-block (Panel a) and the Y-block (Panel b). For the X- and Y-blocks, 3 and 1 latent variables (LV) can be chosen, respectively, because for the X-block the 4th latent variable has the same small variance component as the remaining ones, and for the Y-block this already holds for the 2nd latent variable. A latent variable of PLS is similar to a factor or principal component of PCA, but in PLS both X and Y are included, so one common number of latent variables has to be selected. Three LVs were chosen for the further investigations, because they explain 91.66% of the covariance between X and Y.
Fig. 5. Scree plots for the X- (Panel a) and Y-blocks (Panel b) to determine the number of latent variables.
PCR and PLS were run for all possible descriptor combinations and almost all combinations of test molecules for McReynolds’ constants, with Q², i.e., the squared correlation coefficient for the leave-one-out cross-validated data, as the selection criterion. Because the numbers of test molecules (cases) and descriptors (variables) are small (10 and 9, respectively), the total case and variable selection procedure with cross-validation could be carried out in reasonable time. The process could only be started with six test molecules, because using fewer than six test molecules caused run-time errors in the PCR and PLS functions of PLS Toolbox. All models were validated by leave-one-out cross-validation using Q². Q² was calculated as the squared correlation coefficient between the original Y and the cross-validated prediction of Y ($Y_{CV}$) using 1, 2, ... and all latent variables:

$$Q^2 = \frac{\left[\sum_i (Y_i - \bar{Y})(Y_{CV,i} - \bar{Y}_{CV})\right]^2}{\sum_i (Y_i - \bar{Y})^2 \, \sum_i (Y_{CV,i} - \bar{Y}_{CV})^2} \qquad (4)$$
The best results were obtained with the HOMO, µ, α and apSASA descriptors using six test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane) (Tables 4 and 5). The cross-validated correlation coefficients (Q²) were 0.9832 and 0.9834 for PCR and PLS, respectively.
Similar results were obtained with the same molecules when the log P descriptor was neglected (Tables 4 and 5). The best results are practically the same for PCR and PLS, but they cannot be significantly distinguished from the second, third, etc. best results. The frequencies with which the molecules and the descriptors occur in the 50 best PCR and PLS models (ranked by Q²) are shown in Table 6.
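The frequency counts of Table 6 are a simple tally over the best-ranked models. The sketch below shows the tally for the five best PCR models without log P listed in Table 4 (panel B) rather than for the full set of 50, so its counts are not those of Table 6; the function name is hypothetical.

```python
from collections import Counter

# (test molecules, descriptors) of the five best PCR models without log P (Table 4, panel B).
top_models = [
    ((1, 3, 4, 7, 9, 10), (1, 3, 4, 9)),
    ((1, 3, 4, 7, 9, 10), (1, 2, 4, 9)),
    ((2, 3, 5, 6, 7, 8), (2, 7)),
    ((1, 2, 3, 4, 6, 8), (1, 2, 4, 5, 8, 9)),
    ((1, 2, 3, 4, 6, 8), (1, 2, 4, 5, 7, 8)),
]


def selection_frequencies(models):
    """Count how often each molecule and each descriptor occurs in the models."""
    molecule_counts = Counter()
    descriptor_counts = Counter()
    for molecules, descriptors in models:
        molecule_counts.update(molecules)
        descriptor_counts.update(descriptors)
    return molecule_counts, descriptor_counts


molecule_freq, descriptor_freq = selection_frequencies(top_models)
```

Molecules or descriptors that appear in nearly every well-performing model are the ones the selection procedure treats as indispensable.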
The similarity of the PCR- and PLS-based results is rather satisfying, since they were provided by the simplest and the most complicated procedures, respectively. Our conclusions can be considered relatively well established for the data used: the best results were obtained with the HOMO, µ, α and apSASA descriptors using six test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane).
The previous calculations were based on the assumption that the influence of the varying degrees of freedom (due to the reduced data) is negligible. However, we can calculate the adjusted-Q² (Q²a), analogous to the adjusted-R² [37]:

$$Q_a^2 = 1 - (1 - Q^2)\,\frac{m-1}{m-d} \qquad (5)$$

where m is the number of test molecules and d is the number of descriptors used in the case/variable selection procedure. Interestingly, while Q² cannot be negative, Q²a can (which means that X cannot explain Y):

$$1 - (1 - Q^2)\,\frac{m-1}{m-d} < 0 \;\Rightarrow\; Q^2 < \frac{d-1}{m-1} \qquad (6)$$
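Eqs. (4) and (5) translate into a few lines of code. The sketch below is a minimal illustration with hypothetical function names; the numerical example uses the Q² of the best PCR model reported above (0.9832 with m = 6 test molecules and d = 4 descriptors).

```python
def q2(y, y_cv):
    """Squared correlation between observed values and cross-validated predictions, Eq. (4)."""
    n = len(y)
    mean_y = sum(y) / n
    mean_cv = sum(y_cv) / n
    numerator = sum((a - mean_y) * (b - mean_cv) for a, b in zip(y, y_cv)) ** 2
    denominator = sum((a - mean_y) ** 2 for a in y) * sum((b - mean_cv) ** 2 for b in y_cv)
    return numerator / denominator


def adjusted_q2(q2_value, m, d):
    """Adjusted Q2 for m test molecules and d descriptors, Eq. (5)."""
    if m <= d:
        raise ValueError("need more test molecules than descriptors")
    return 1.0 - (1.0 - q2_value) * (m - 1) / (m - d)


best_pcr = adjusted_q2(0.9832, m=6, d=4)   # compare with Table 7
two_descriptor = adjusted_q2(0.9736, m=6, d=2)
below_threshold = adjusted_q2(0.5, m=6, d=4)  # Q2 < (d - 1)/(m - 1) = 0.6, cf. Eq. (6)
```

The penalty grows with the number of descriptors, which is why a two-descriptor model with a slightly lower Q² can overtake the four-descriptor model once the adjustment is applied.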
Table 4
Results of PCR calculations (first five best Q²)

A: with all nine descriptors
No. of test mols.  No. of descriptors   Q²      Latent variables
[1 3 4 7 9 10]     {1 3 4 9}            0.9832  3
[1 3 4 7 9 10]     {1 2 4 6 9}          0.9827  3
[1 3 4 7 9 10]     {1 3 4 5 6 7 8 9}    0.9818  4
[1 3 4 7 9 10]     {1 2 4 5 6 8}        0.9814  4
[1 3 4 7 9 10]     {1 2 4 5 6 8 9}      0.9804  4

B: without log P, eight descriptors
No. of test mols.  No. of descriptors   Q²      Latent variables
[1 3 4 7 9 10]     {1 3 4 9}            0.9832  3
[1 3 4 7 9 10]     {1 2 4 9}            0.9775  3
[2 3 5 6 7 8]      {2 7}                0.9736  2
[1 2 3 4 6 8]      {1 2 4 5 8 9}        0.9733  3
[1 2 3 4 6 8]      {1 2 4 5 7 8}        0.9726  4

Independent variables: descriptors of McReynolds test molecules, dependent variables: McReynolds numbers of GC columns studied. The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.
Table 5
Results of PLS calculations (first five best Q²)

A: with all nine descriptors
No. of test mols.  No. of descriptors   Q²      Latent variables
[1 3 4 7 9 10]     {1 3 4 9}            0.9834  3
[1 3 4 7 9 10]     {1 3 4 5 6 7 8 9}    0.9818  4
[1 3 4 7 9 10]     {1 2 4 5 6 8}        0.9814  4
[1 3 4 7 9 10]     {1 2 4 5 6 8 9}      0.9804  4
[1 3 4 7 9 10]     {1 3 4 6 7 8 9}      0.9798  2

B: without log P, eight descriptors
No. of test mols.  No. of descriptors   Q²      Latent variables
[1 3 4 7 9 10]     {1 3 4 9}            0.9834  3
[1 2 3 4 6 8]      {2 8 9}              0.9750  3
[2 3 4 5 7 10]     {2 3 5 7 8}          0.9741  4
[3 4 5 7 8 9]      {1 3 5}              0.9739  2
[1 2 3 4 6 8]      {1 2 4 5 8 9}        0.9736  3

Independent variables: descriptors of McReynolds test molecules, dependent variables: McReynolds numbers of GC columns studied. The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.
Table 6
Frequencies of the molecules and descriptors according to the results of PCR and PLS calculations based on the first 50 best Q²

PLS incl. log P
No. of molecules     4   3   1   7   9  10   8   2   6   5
Frequency           49  48  46  36  36  33  17  16  15   4
No. of descriptors   1   8   9   5   4   6   3   2   7
Frequency           43  37  34  32  31  31  29  29  25

PLS excl. log P
No. of molecules     3   4   1   8   2   6   7   9  10   5
Frequency           44  43  42  36  36  33  25  20  14   8
No. of descriptors   1   2   4   8   9   5   3   7   6
Frequency           39  36  33  31  31  31  30  23   0

PCR incl. log P
No. of molecules     3   4   1   7   9  10   8   2   6   5
Frequency           49  48  47  37  36  29  22  15  15   2
No. of descriptors   1   9   2   8   4   6   5   7   3
Frequency           41  35  34  33  31  31  27  22  20

PCR excl. log P
No. of molecules     3   4   1   8   2   6   7   9  10   5
Frequency           47  46  44  39  36  35  19  17  12   6
No. of descriptors   2   1   9   4   8   3   5   7   6
Frequency           41  38  36  32  26  25  24  19   0
Table 7
Results of PCR and PLS calculations including log P (first five best adjusted-Q² (Q²a))

PCR
No. of test mols.  No. of descriptors   Q²a     Latent variables
[2 3 5 6 7 8]      {2 7}                0.9671  2
[1 3 4 7 9 10]     {1 3 4 9}            0.9581  3
[1 2 5 6 8 9]      {2 8}                0.9539  2
[1 3 4 7 8 9]      {1 6 9}              0.9535  3
[1 3 4 7 9 10]     {1 4 9}              0.9533  2

PLS
No. of test mols.  No. of descriptors   Q²a     Latent variables
[1 3 4 7 9 10]     {1 3 4 9}            0.9585  3
[1 2 3 4 6 8]      {2 8 9}              0.9583  3
[3 4 5 7 8 9]      {1 3 5}              0.9566  2
[2 3 5 6 7 8]      {2 7}                0.9545  2
[1 2 5 6 8 9]      {2 8}                0.9539  2

Independent variables: descriptors of McReynolds test molecules, dependent variables: McReynolds numbers of GC columns studied. The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.
Table 8
Frequencies of the molecules and descriptors according to the results of PCR and PLS calculations based on the first 50 best adjusted-Q² (Q²a)

PLS
No. of molecules     1   2   9   6  10   4   7   8   3   5
Frequency           42  35  34  31  29  29  28  27  27  26
No. of descriptors   3   8   2   4   6   1   9   7   5
Frequency           25  21  19  17  14  10  10   8   7

PCR
No. of molecules     3   9   4   7   8   1  10   5   2   6
Frequency           40  36  35  35  33  30  30  26  25  17
No. of descriptors   1   4   5   2   8   9   3   6   7
Frequency           18  18  18  16  13  13  12  10   9
Tables 7 and 8 show the results obtained using adjusted-Q². We can conclude as before: the PCR- and PLS-based results are rather similar and indistinguishable. Following the principle of Occam’s razor, one can choose the best result of the simplest method, i.e., PCR (this result is the fourth best for PLS): the descriptors are LUMO and SASA, and the test molecules are 1-butanol, 2-pentanone, pyridine, 2-methyl-2-pentanol, 1-iodobutane and 2-octyne. LUMO and SASA of the test molecules must be important in absorption: they characterize the strength of test molecule binding to solutes with different polarities.
4. Conclusion
Unfortunately, neither CA nor PCA could give an unambiguous and usable answer to the question of which variables are important and which are negligible in the model. Because MLR with stepwise regression can operate on only one dependent variable at a time, PCR and PLS had to be used for building the regression model.
On the basis of the detailed statistical analysis, and accepting only the best results based on Q² (Tables 4 and 5), six McReynolds’ test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane) are adequate to characterize the polarity of a GC column. Four descriptors are needed in the model, in which the independent variables are these descriptors and the dependent variables are the McReynolds’ numbers. According to the PCR and PLS results for the 50 best regression models (these methods can handle cases with many more dependent variables than independent ones), it can be concluded that six McReynolds’ test molecules are indeed enough (Tables 6 and 8). The number and kind of the descriptors depend on the regression method and on whether log P is included or excluded. Seeking the simplest regression algorithm and model, the four descriptors are HOMO, µ, α and apSASA according to the result of PCR excluding log P.
On the other hand, the results obtained using
adjusted-Q2 lead to slightly different conclusions.
It remains true that six McReynolds' test molecules
are really enough. In this case, however, these molecules
are 1-butanol, 2-pentanone, pyridine, 2-methyl-2-pentanol,
1-iodobutane and 2-octyne (note that 2-pentanone and
1-iodobutane are common to both sets). Two descriptors, which characterize
the measure of absorption in solute, were found to be enough for
building the regression model, namely LUMO and SASA (note
that LUMO is common).
However, the results based on Q2 and adjusted-Q2 can be
considered together. Tables 4, 5 and 7 show that the descriptive
model formed with the six McReynolds'
test molecules (benzene, 2-pentanone, 1-nitropropane,
1-iodobutane, 1,4-dioxane and cis-hidrindane) and the four
descriptors (HOMO, µ, α and apSASA) placed first in five
cases out of six, and second in the remaining case
(Table 7).
These conclusions suggest that, according to the model built
with four descriptors, the six McReynolds' test molecules mentioned
can provide the same information on polarity as the original
10 McReynolds' test molecules.
In view of the promising results of building the descriptive model,
we are working on a predictive model using the novel
case/variable selection method introduced in this paper, which
uses PLS and PCR combined with Q2 and adjusted-Q2.
Acknowledgement
Károly Héberger and István Pálinkó are gratefully acknowledged
for helping to improve the manuscript version of this
paper. The authors would also like to thank the anonymous
referees for their helpful critical comments. This work was supported
by the Hungarian Scientific Research Fund (OTKA/T032966
and OTKA/T046484) and by I. Széchenyi Research Fellowships
(R.R. and T.K.).
References
[1] H. Rotzsche, Flüssige und chemisch gebundene stationare Phasen, in:
E. Leibnitz, H.G. Struppe (Eds.), Handbuch der Gaschromatographie,
Akademische Verlagsgesellshaft, Geest and Portig K.-G. Lepzig, Germany, 1984, pp. 442–506.
[2] T. Körtvélyesi, M. Görgényi, K. Héberger, Anal. Chim. Acta 428 (2001)
73–82.
[3] W.O. McReynolds, J. Chromatogr. Sci. 8 (1970) 685–691.
[4] G. Tarján, Á. Kiss, G. Kocsis, S. Mészáros, J.M. Takács, J. Chromatogr.
119 (1976) 327–332.
[5] E. Fernandez-Sanchez, A. Fernandez-Torres, J.A. Garcia-Dominguez,
J.M. Santiuste, Chromatographia 31 (1991) 75–79.
[6] L.R. Snyder, J. Chromatogr. 92 (1974) 223–230.
[7] G. Castello, G. D’Amato, S. Vezzani, J. Chromatogr. 646 (1993)
361–368.
[8] R.V. Golovnya, B.M. Polanuer, J. Chromatogr. 517 (1990) 51–66.
[9] K. Héberger, Chemom. Intell. Lab. Syst. 47 (1990) 41–49.
[10] K. Héberger, M. Görgényi, J. Chromatogr. A 845 (1999) 21–31.
[11] K. Héberger, M. Görgényi, M. Sjöström, Chromatographia 51 (2000)
595–600.
[12] R.V. Golovnya, T. Misharina, Chromatographia 10 (1977) 658–
660.
[13] R.V. Golovnya, T.A. Misharina, Chromatographia 190 (1980) 1–12.
[14] L. Rohrschneider, J. Chromatogr. 17 (1965) 1–12.
[15] L. Rohrschneider, J. Chromatogr. 22 (1966) 6–22.
[16] E.R. Malinowski, Factor Analysis in Chemistry, 3rd ed., Wiley, New
York, USA, 2002.
[17] H. Rotsche, Stationary Phases in Gas Chromatography, J. Chromatography Library, vol. 48, Elsevier, Amsterdam, 1991.
[18] M. Görgényi, Z. Fekete, L. Seres, Chromatographia 27 (1989) 581–
584.
[19] T. Körtvélyesi, M. Görgényi, L. Seres, Chromatographia 41 (1995)
282–286.
[20] Z. Király, T. Körtvélyesi, L. Seres, M. Görgényi, Chromatographia 42
(1996) 653–659.
[21] A. Garcia-Raso, F. Saura-Calixto, M. Raso, J. Chromatogr. 302 (1984)
107–117.
[22] N. Dimov, A. Osman, O.V. Mekanyan, D. Papazova, Anal. Chim. Acta
298 (1994) 303–317.
[23] R. Kaliszan, H.-D. Höltje, J. Chromatogr. 234 (1982) 303–311.
[24] K. Osmialowski, J. Halkiewicz, A. Radecki, R. Kaliszan, J. Chromatogr.
346 (1985) 53–60.
[25] A.R. Katritzky, E.S. Ignatchenko, R.A. Barcock, V.S. Lobanov, M.
Karelson, Anal. Chem. 66 (1994) 1799–1807.
[26] R.P.W. Scott, J. Chromatogr. 122 (1976) 35–53.
[27] V.S. Ong, R.A. Hites, Anal. Chem. 63 (1991) 2829–2834.
[28] J.J.P. Stewart, MOPAC93, Fujitsu Ltd., Tokyo, 1994.
[29] Pedretti A., Vistoli G., VEGA, Version 1.5., 2003.
R. Rajkó et al. / Analytica Chimica Acta 554 (2005) 163–171
[30] PROSTAT Ver. 3.0, PolySoftware, P.O. Box 60, Pearl River, NY 10965,
USA.
[31] STATISTICA 99, Statsoft 2300 East 14th St. Tulsa, Oklahoma 74104,
USA.
[32] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[33] H. Martens, T. Neas, Multivariate Calibration, Wiley, Chichester, UK,
1991.
171
[34] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and
Chemical Plant, Wiley, Chichester, UK, 2003.
[35] Eigenvector Research Inc., PLS-Toolbox® Version 3.0.3a., 2003.
[36] The Mathworks Inc., MATLAB®, Version 6.1. (R12.1) User’s Guide,
2000.
[37] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., Wiley,
New York, USA, 1981.