Analytica Chimica Acta 554 (2005) 163–171

Theoretical characterization of McReynolds’ constants

Róbert Rajkó a,∗, Tamás Körtvélyesi b,∗, Krisztina Sebők-Nagy c, Miklós Görgényi b

a Department of Unit Operations and Environmental Engineering, College Faculty of Food Engineering, University of Szeged, H-6701 Szeged, P.O. Box 433, Hungary
b Department of Physical Chemistry, University of Szeged, H-6701 Szeged, P.O. Box 105, Hungary
c Chemical Research Center, Hungarian Academy of Sciences, H-1525 Budapest, P.O. Box 17, Hungary

Received 20 April 2005; received in revised form 5 August 2005; accepted 12 August 2005
Available online 21 September 2005

Abstract

The properties of McReynolds’ constants were studied by a detailed statistical/chemometric analysis. The electronic structure, geometries and hydrophobicity of the McReynolds’ test compounds (benzene, 1-butanol, 2-pentanone, 1-nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane) were calculated with the PM3 semiempirical quantum chemical method and with empirical formulas. The predominant pattern was revealed using cluster and principal component analyses (CA and PCA). The dependence of McReynolds’ constants on the calculated chemical descriptors was modeled by multiple linear regression (MLR) with stepwise selection, principal component regression (PCR) and partial least-square regression (PLSR). A novel statistical approach was developed for case-and-variable selection using the PCR and PLSR methods for characterizing and modeling the polarity of 25 gas chromatography (GC) stationary phases (phthalates, adipates, sebacates, phosphates, citrates and nitriles).
The highest occupied molecular orbital energy, dipole moment, averaged isotropic polarizability and apolar solvent accessible surface area (based on Q²), and the energy of the lowest unoccupied molecular orbital and total solvent accessible surface area (based on adjusted-Q²), were suitable to describe the McReynolds’ constants. Six of the 10 test compounds were found to be sufficient for the description of the polarity of the columns studied.

© 2005 Elsevier B.V. All rights reserved.

Keywords: Polarity; McReynolds’ constants; Stationary phases; Gas chromatography; Quantum chemical method; Principal component analysis; Principal component regression; Partial least-square regression; Case and variable selection

∗ Corresponding authors. E-mail addresses: rajko@sol.cc.u-szeged.hu (R. Rajkó), kortve@chem.u-szeged.hu (T. Körtvélyesi).
doi:10.1016/j.aca.2005.08.024

1. Introduction

A recurring question for chromatographers is how to find an easy-to-use method that characterizes the interaction between the stationary phase and the solute and thereby forecasts gas chromatographic retention behavior. Which stationary phase (column type) is suitable for separating all, or as many as possible, of the solutes in a complex mixture? To answer this, we have to know the polarity and selectivity of a column. The selectivity is the ability of the stationary phase to participate in specific intermolecular interactions. Depending on the extent of the interactions, some solutes dissolve better and others worse in a given stationary phase, which finally results in separation [1].

The polarity concept was intended for characterizing the interaction of the stationary phase and the solute on the basis of its structure. Basically, polarity means that the more polar a stationary phase is, the greater is the retention of a polar solute compared to a non-polar solute such as an n-alkane; see, e.g.,
Ref. [2]. On this basis, the polarity is the sum of various intermolecular interactions (inductive, dispersive, orientation and H-bonding). In gas chromatography, the interactions depend not only on the stationary phase, but also on the solute and its functional groups. Polarity is a term that is difficult to define: e.g., the dipole moment is often used as a symbol of polarity, but in chromatographic interactions it cannot be used as a single measure. Some empirical measures for the polarity and/or selectivity parameters of the stationary phases are available: McReynolds’ polarity (P) [3], Kovats coefficient (K_C) [4], retention polarity (R_P) [5], Snyder’s selectivity parameters [6], Castello’s C [7] and GCH2 [8].

The polarity/selectivity properties of thirty stationary phases were characterized by Héberger [9] by principal component analysis (PCA). Two groups of polarity scales were found. The first group (P, K_C, R_P and C) and the second group (Snyder’s selectivity parameters, Castello’s C) of the polarity scales can characterize the column mainly by their polarity and selectivity, respectively. The most influential properties are: (i) polarity, (ii) hydrogen donating and accepting ability and (iii) dipole interactions. The principal components of retention data for oxo compounds were correlated with the physical properties (molar refractivity (R_M), boiling point (T_BP), molar volume (V_m)) [10]. A predictive model was suggested using the partial least-square regression (PLSR) method [11].

According to the thermodynamic concept, the reluctance of the liquid phase to accept a hydrocarbon may be considered as a measure of polarity. The measure of this behavior is the partial molar Gibbs free energy of solution for a methylene group [8,12,13].
According to the best-known and most widely used Rohrschneider–McReynolds concept, the difference between the Kovats retention index of a specific test compound p on the column studied (I_p) and on squalane (I_sq) provides a measure of polarity [3,14,15] (Eq. (1)). By definition the polarity of squalane is 0, because it is considered an apolar (reference) phase:

$$\Delta I_p = I_p - I_{sq} \qquad (1)$$

In the Rohrschneider concept the intermolecular forces are additive and are characterized by several factors, characteristic of the solute (a, b, c, d, e) and of the stationary phase (x, y, z, u, s):

$$\Delta I_{i,j}(\mathrm{calc.}) = a_i x_j + b_i y_j + c_i z_j + d_i u_j + e_i s_j \qquad (2)$$

ΔI_{i,j}(calc.) is the difference in Kovats indices between the phase of interest and squalane. x_j, y_j, z_j, u_j and s_j are calculated for each phase from the difference in Kovats indices of benzene, ethanol, methyl ethyl ketone, nitromethane and pyridine, respectively. a_i, b_i, c_i, d_i and e_i are empirical coefficients, which can be calculated from retention data for each solute using various liquid phases. In the simplest case a_i, b_i, c_i, d_i and e_i are all equal to 1 (or only one equals 1 and the others are 0); however, if ΔI_{i,j}(calc.) is known in advance, the profiles (a, b, c, ..., and x, y, z, ...) can be estimated by factor analysis (FA) [16].

Rohrschneider originally used five compounds, but later McReynolds analyzed 68 compounds on 25 columns and selected the 10 compounds that characterize the columns best [3]: benzene, 1-butanol, 2-pentanone, nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane. The most informative of these (benzene, 1-butanol, 2-pentanone, nitropropane and pyridine) are either the same compounds Rohrschneider used or homologs of Rohrschneider’s compounds.
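As a minimal illustration of Eqs. (1) and (2), the sketch below computes a polarity constant as a Kovats index difference and evaluates the additive Rohrschneider sum. The function names and the two Kovats indices in the first call are hypothetical; only the five constants for bis(2-butoxyethyl) adipate (137, 278, 198, 300, 235) are taken from Table 2, and the unit solute coefficients correspond to the simplest case described in the text.

```python
def delta_i(i_phase, i_squalane):
    """Eq. (1): Kovats index difference between the studied phase and squalane."""
    return i_phase - i_squalane


def rohrschneider_sum(solute_coeffs, phase_constants):
    """Eq. (2): additive model a_i*x_j + b_i*y_j + c_i*z_j + d_i*u_j + e_i*s_j."""
    if len(solute_coeffs) != len(phase_constants):
        raise ValueError("coefficient and constant lists must have equal length")
    return sum(c * p for c, p in zip(solute_coeffs, phase_constants))


# Hypothetical Kovats indices of one test compound on a phase and on squalane.
example_delta = delta_i(790, 653)

# X, Y, Z, U, S of bis(2-butoxyethyl) adipate (BBA) as listed in Table 2.
bba_constants = [137, 278, 198, 300, 235]

# Simplest case from the text: all solute coefficients equal to 1.
unit_coeffs = [1, 1, 1, 1, 1]
total = rohrschneider_sum(unit_coeffs, bba_constants)

print(example_delta, total)
```

Setting a single coefficient to 1 and the others to 0 returns the corresponding phase constant itself, which is the second simple case mentioned in the text.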
The criterion for selecting the test compounds was their ability to participate in various types of interactions with the different stationary phases through inductive and donor–acceptor forces or H-bonding (H+ donor and acceptor). While 2-methyl-2-pentanol and 1-iodobutane were found to increase the precision of prediction, the corresponding influence of 2-octyne, 1,4-dioxane and cis-hydrindane could be negligible. McReynolds’ relative polarity scale was characterized for more than 200 liquid phases. Although the polarity is often used for predicting retention data, several other factors may influence absorption [17].

A number of quantitative structure-retention relationship (QSRR) studies were performed on different series of compounds, and good correlations were found between I_R (Kovats retention index) and the theoretically calculated data for molecules with different functional groups (azo compounds [18], alkenes and azo compounds [19], dialkyl hydrazones [20], alkenes [21], alkylbenzenes [22], phenol derivatives [23], primary, secondary and tertiary amines [24], etc.). Generally, elution data related to one or only a few columns were used. In the QSRR studies the correlations between the Kovats retention indices and molecular descriptors obtained by various methods (experimental, empirical or theoretical) were studied in order to obtain linear multivariate functions for the prediction of the retention properties of the compounds (see, e.g., [26]). The use of quantum chemical descriptors has been criticized [27], but their application is supported by their success [19–21,24,25].

In this study, we investigate the correlation between the McReynolds’ polarity scale [3] and the structural/physical properties of the McReynolds’ test compounds used for characterizing the columns.
We analyze which structural descriptors of the McReynolds’ test molecules have the greatest influence on the McReynolds’ numbers. The descriptors are: the energy of the highest occupied molecular orbital (HOMO) {1}, the energy of the lowest unoccupied molecular orbital (LUMO) {2}, the dipole moment (µ) {3}, the isotropic average polarizability at 0 eV electric field (α) {4}, the volume of the molecule (V) {5}, the logarithm of the octanol–water partition coefficient (log P) {6}, and the total, polar and apolar solvent accessible surface areas (SASA, pSASA and apSASA, respectively) {7,8,9}. The test molecules are: [1] benzene, [2] 1-butanol, [3] 2-pentanone, [4] 1-nitropropane, [5] pyridine, [6] 2-methyl-2-pentanol, [7] 1-iodobutane, [8] 2-octyne, [9] 1,4-dioxane and [10] cis-hydrindane. The calculations were performed by the PM3 semiempirical quantum chemical method and by chemometric methods (cluster analysis (CA), principal component analysis (PCA), multiple linear regression (MLR), principal component regression (PCR) and partial least-square regression (PLSR)). A recently developed, novel chemometric method for building descriptive models, case/variable selection by principal component and partial least-square regression (CVS–PCR and CVS–PLSR), was also applied.

2. Calculations

The structural descriptors HOMO, LUMO, µ and α were calculated for the 10 McReynolds’ test molecules with full geometry optimization by the PM3 semiempirical quantum chemical method implemented in MOPAC93 [28]. The gradient norms were always less than 0.01 kcal/mol/Å. The force matrix was positive definite for the small molecules, which confirms that conformational minima were found. For some simple molecules 2–5 conformers were calculated and the thermodynamically most stable structure was always accepted. The SASA, pSASA, apSASA (the radius of the probe solvent molecule was set to 0.14 nm), V and log P were calculated by VEGA [29].

Table 1
Molecular parameters calculated by the PM3 semiempirical quantum chemical method and empirical expressions

Test molecule              HOMO/eV   LUMO/eV   µ/D     α/a.u.   V/Å³    log P    SASA/Å²   pSASA/Å²   apSASA/Å²
                           {1}       {2}       {3}     {4}      {5}     {6}      {7}       {8}        {9}
[1] Benzene                −9.751    0.396     0       45.56    82.0    1.854    243.9     0          243.9
[2] 1-Butanol              −10.887   3.159     1.417   32.28    85.4    0.998    263.2     58.0       205.3
[3] 2-Pentanone            −10.680   0.797     2.719   38.62    97.5    0.970    276.9     43.7       233.2
[4] 1-Nitropropane         −12.091   0.033     4.166   34.29    82.8    1.413    259.1     96.4       162.7
[5] Pyridine               −10.104   −0.005    1.936   43.60    77.9    0.998    238.3     24.8       213.5
[6] 2-Methyl-2-pentanol    −11.139   3.116     1.451   45.46    118.8   1.514    304.2     38.8       265.4
[7] 1-Iodobutane           −9.449    −0.453    1.805   51.23    118.3   3.329    312.7     86.8       225.9
[8] 2-Octyne               −10.276   1.793     0.069   60.06    133.8   3.456    363.4     0          363.4
[9] 1,4-Dioxane            −10.448   2.840     0       35.15    84.8    −0.138   245.3     41.8       203.5
[10] cis-Hydrindane        −10.937   3.451     0.033   60.49    141.8   3.220    321.7     0          321.7

Structural descriptors: HOMO: energy of the highest occupied molecular orbital, LUMO: energy of the lowest unoccupied molecular orbital, µ: dipole moment in Debye, α: isotropic average polarizability in 0 eV electric field, V: molecular volume, log P: logarithm of the octanol–water partition coefficient, SASA: solvent accessible surface area, pSASA: polar solvent accessible surface area, apSASA: apolar solvent accessible surface area.

McReynolds’ data were collected from the literature [1]. The statistical evaluation (MLR, CA and PCA) of the data was performed with the PROSTAT [30] and STATISTICA [31] packages. The PLSR [16,32–34] and PCR [16,32–34] algorithms implemented in PLS Toolbox V3.0 [35] for MATLAB V6.1 R12 [36] were used with home-made MATLAB code. Almost all possible cases were calculated based on both the nine descriptors and the 10 test molecules for McReynolds’ constants. The selection criterion was Q², i.e., the correlation coefficient for the leave-one-out cross-validated data.

3.
Results and discussion

Quantum chemical descriptors (independent variables) and log P of the test compounds are summarized in Table 1. Table 2 summarizes the experimentally obtained McReynolds’ numbers, published in the literature [1] and used in the calculations, for 25 gas chromatographic columns with different polarities: phthalates (bis(2-butoxyethyl)phthalate (BBP), bis(2-ethylhexyl)phthalate (BEP), bis(2-ethoxyethyl)phthalate (BIP), bis(2-ethoxyethoxyethyl)phthalate (BEEP), butyloctylphthalate (BOF), dicyclohexyl phthalate (DIC), didecyl phthalate (DDP), dinonylphthalate (DNP), bis(2-ethylhexyl)tetrachlorophthalate (DIOC2)), adipates (bis(2-butoxyethyl) adipate (BBA), bis(2-ethylhexyl) adipate (DAP)), sebacates (bis(2-ethoxyethyl)sebacate (BES), bis(2-ethylhexyl)sebacate (DOS), dinonyl sebacate (DNS), octyldecyladipate (ODA)), phosphates (cresyldiphenyl phosphate (CDP), tributoxyethyl phosphate (TBP), tris(2-ethyl-hexyl) phosphate (TEHP), tricresyl phosphate (TCP)), citrates (acetyltributyl citrate (AC)), nitriles (tetracyanoethylpentaerythritol (TCEPE)), amines (N,N,N′,N′-tetrakis(2-hydroxyethyl)-ethylenediamine (THEED)), stearates (sorbitan monostearate (SOR), diethylene glycol distearate (DGDS)) and oleates (sorbitan monooleate (SORM)). The full names and the abbreviations of the stationary phases are also given in Table 2.

Table 2
McReynolds’ constants of the stationary phases studied

Stationary phase                                        Abbreviation   X     Y     Z     U     S     H     I     K     L     M
Bis(2-butoxyethyl) adipate                              (BBA)          137   278   198   300   235   216   118   104   205   28
Bis(2-ethylhexyl) adipate                               (DAP)          76    181   121   187   134   144   71    55    119   9
Bis(2-butoxyethyl)phthalate                             (BBP)          151   282   227   338   267   217   138   112   225   48
Bis(2-ethylhexyl)phthalate                              (BEP)          92    186   150   236   167   143   92    66    140   26
Bis(2-ethoxyethyl)phthalate                             (BIP)          214   375   305   446   364   290   190   159   312   79
Bis(2-ethoxyethoxyethyl)phthalate                       (BEEP)         233   408   317   470   389   309   207   170   337   92
Butyloctylphthalate                                     (BOF)          97    194   157   246   174   149   96    69    147   27
Dicyclohexyl phthalate                                  (DIC)          146   257   206   316   245   196   144   104   204   58
Didecyl phthalate                                       (DDP)          136   255   213   320   235   201   126   101   202   38
Dinonylphthalate                                        (DNP)          83    183   147   231   159   141   82    65    138   18
Bis(2-ethylhexyl)tetrachlorophthalate                   (DIOC2)        109   132   113   171   168   104   75    45    137   34
Bis(2-ethoxyethyl)sebacate                              (BES)          151   306   211   320   274   328   129   110   224   36
Bis(2-ethylhexyl)sebacate                               (DOS)          72    168   108   180   125   132   68    49    107   11
Dinonyl sebacate                                        (DNS)          66    166   107   178   118   130   62    50    106   8
Octyldecyladipate                                       (ODA)          79    179   119   193   134   141   72    57    119   10
N,N,N′,N′-Tetrakis-(2-hydroxyethyl)-ethylenediamine     (THEED)        463   942   626   801   893   746   427   269   721   254
Cresyldiphenyl phosphate                                (CDP)          199   351   285   413   336   266   190   153   292   88
Tributoxyethyl phosphate                                (TBP)          141   373   209   341   274   285   126   104   204   31
Tris(2-ethyl-hexyl) phosphate                           (TEHP)         71    288   117   215   132   225   71    47    103   7
Tricresyl phosphate                                     (TCP)          176   321   250   374   299   242   169   131   254   76
Acetyltributyl citrate                                  (AC)           135   268   202   314   233   214   112   102   207   26
Sorbitan monostearate                                   (SOR)          88    263   158   200   258   201   82    55    180   37
Sorbitan monooleate                                     (SORM)         97    266   170   216   268   207   94    66    191   41
Tetracyanoethylpentaerythritol                          (TCEPE)        526   782   677   920   837   621   444   333   766   237
Diethylene glycol distearate                            (DGDS)         64    193   106   143   191   147   57    41    121   20

X: ΔI(benzene), Y: ΔI(1-butanol), Z: ΔI(2-pentanone), U: ΔI(1-nitropropane), S: ΔI(pyridine), H: ΔI(2-methyl-2-pentanol), I: ΔI(1-iodobutane), K: ΔI(2-octyne), L: ΔI(1,4-dioxane), M: ΔI(cis-hydrindane). Data were found in Ref. [1].

Cross-correlation data of the chemical descriptors of the 10 test molecules (benzene, 1-butanol, 2-pentanone, 1-nitropropane, pyridine, 2-methyl-2-pentanol, 1-iodobutane, 2-octyne, 1,4-dioxane and cis-hydrindane) show high correlation in some cases: SASA and V (R = 0.938); apSASA and α, α and V, and α and log P (R > 0.85). This is important in multivariate regression because of multicollinearity.
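The pair correlation quoted for SASA and V can be checked directly from the Table 1 columns. The following is a minimal sketch, assuming NumPy is available; the variable and function names are illustrative only.

```python
import numpy as np

# Table 1 values for test molecules [1]..[10].
volume = np.array([82.0, 85.4, 97.5, 82.8, 77.9,
                   118.8, 118.3, 133.8, 84.8, 141.8])      # V / A^3, descriptor {5}
sasa = np.array([243.9, 263.2, 276.9, 259.1, 238.3,
                 304.2, 312.7, 363.4, 245.3, 321.7])       # SASA / A^2, descriptor {7}


def pearson_r(x, y):
    """Pearson correlation coefficient between two descriptor columns."""
    return float(np.corrcoef(x, y)[0, 1])


r_sasa_v = pearson_r(sasa, volume)
print(round(r_sasa_v, 3))
```

A coefficient this close to 1 means that SASA and V carry largely the same information, which is why using both in one regression model can make the fitted coefficients unstable.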
The pair correlations between HOMO, LUMO, pSASA, apSASA and SASA were found to be less than 0.4. Values less than 0.6 were obtained between HOMO, LUMO, µ and α. The correlation coefficient between apSASA and µ was 0.77. The correlations between the McReynolds’ numbers of different stationary phases were also large (R > 0.9).

3.1. Results of cluster analysis

The variables were standardized before cluster analysis: the mean value of each matrix column was subtracted from all the elements of that column, and the results were divided by the column standard deviation. This procedure ensures that the different measures and units do not distort the cluster analysis. CA was performed using Ward’s method, which uses an analysis of variance approach to evaluate the distances between the clusters: it minimizes the sum of squares of any two clusters that can be formed at each step. On clustering all the descriptors (Fig. 1), we obtain pSASA and µ, SASA and V, apSASA and α, and HOMO and LUMO as clusters with two members; log P separates from the α and apSASA cluster.

Fig. 1. Result of the cluster analysis (Ward’s method) for the descriptors.

Fig. 2. Result of the cluster analysis (Ward’s method) for the dependent variables.

In the analysis of the stationary phases (Fig. 2), two main clusters were obtained with 11 and 14 stationary phases. The first cluster was separated into two clusters: DGDS, SOR, SORM, THEED; and BES, TEHP, TBP, DNS, ODA, DOS, DAP. The second cluster was also separated into two smaller ones: TCEPE, DIOC2; and BOF, DIC, TCP, CDP, BEEP, BIP, BBP, DDP, DNP, BEP, AC, BBA. Our conclusion is that it is difficult to classify the columns by polarity on the basis of cluster analysis of the McReynolds’ numbers. Although columns of similar polarity could be identified (see, e.g., SOR and SORM, or AC and BBA), contradictions were found in some cases (e.g., THEED, which has large McReynolds’ numbers, was found to be similar to SOR, SORM and DGDS).

3.2.
Principal component analysis (PCA)

Basically, PCA decomposes the original matrix into the product of score (orthogonal) and loading (orthonormal) matrices. At least three variables are necessary to explain more than 90% of the total variance: the first factor explains 82.0% of the total variance, the first two explain 88.3% and the first three explain 94.1%. We may therefore expect that three orthogonal variables describe the McReynolds’ constants with acceptable error (cf. Ref. [9]). The loadings correspond to the correlation coefficients between the 34 original variables (9 descriptors and 25 McReynolds’ constants) and the factors. The first factor correlates with α, V, log P, apSASA and all the McReynolds’ constants of the stationary phases studied (the loadings are significant, >0.700). Fig. 3 shows the relationship between Factors 1 and 2. The correlation between the McReynolds’ numbers of different stationary phases is very high. Factor 2 did not correlate significantly with any variable, and Factor 3 correlated significantly only with LUMO. The Factor 1 versus Factor 2 versus Factor 3 dependence can be found in Fig. 4. The pattern of clusters for the stationary phases shows a distribution similar to that found in the cluster analysis (see Figs. 1 and 2).

Fig. 3. Factor loadings, Factor 1 vs. Factor 2 (unrotated), extraction by principal components.

3.3. Case and variable selection by multiple linear regression (MLR), principal component regression (PCR) and partial least-square regression (PLSR) methods

Unfortunately, neither CA nor PCA could give an unambiguous and usable answer to the question of which variables are important and which are negligible in the model. First, MLR with backward elimination or forward selection (stepwise mode) was performed for the McReynolds’ constants of the individual stationary phases, in order to find the necessary descriptors. The variable selection criterion was p < 0.10 (p is the significance level, i.e., the probability that the effect occurred by chance). The results are summarized in Table 3. We found that the best results were obtained with the LUMO, µ, V and SASA descriptors using all of the test molecules (Table 3). In some cases (CDP and TCEPE), we found LUMO, µ and α to be the necessary descriptors on the basis of the stepwise regression criteria. In some cases SASA was not significant in the model.

Table 3
Results of MLR calculations

Phase    A          B        C        D          E       R²       F
BBA      176.39     28.14    36.19    −5.12      1.55    0.950    23.98
DAP      84.72      21.29    26.68    −3.33      1.05    0.936    18.32
BBP      237.23     24.04    39.72    −4.60      1.21    0.965    34.21
BEP      138.34     16.84    31.09    −3.03      0.83    0.972    42.90
BIP      345.42     29.99    48.43    −5.85      1.47    0.962    31.53
BEEP     546.31     28.60    51.31    −3.58      n.a.    0.930    26.55
BOF      146.02     17.49    32.18    −3.17      0.86    0.972    43.86
DIC      330.71     16.23    37.08    −2.13      n.a.    0.941    31.65
DDP      196.99     22.53    39.55    −4.31      1.20    0.970    40.41
DNP      114.56     18.86    31.67    −3.31      0.98    0.971    41.94
DIOC2    258.33     n.a.     9.73     −1.59      n.a.    0.915    37.85
BES      351.09     35.00    46.96    −2.53      n.a.    0.765    6.51
DOS      83.77      19.05    25.54    −2.97      0.91    0.934    17.68
DNS      64.36      20.43    26.05    −3.13      1.02    0.943    20.59
ODA      85.25      20.32    26.50    −3.34      1.06    0.941    19.96
THEED    1187.80    86.51    91.93    −8.11      n.a.    0.853    11.62
CDP      540.78     n.a.     24.60    −7.09 a    n.a.    0.896    30.14
TBP      374.60     39.97    53.62    −2.92      n.a.    0.835    10.14
TEHP     192.79     40.73    45.99    −1.85      n.a.    0.738    5.63
TCP      405.8      22.44    43.48    −2.64      n.a.    0.936    29.19
AC       179.14     27.91    38.75    −5.04      1.50    0.956    26.85
SOR      323.83     29.87    30.41    −2.52      n.a.    0.787    7.38
SORM     334.52     28.15    30.99    −2.52      n.a.    0.796    7.80
TCEPE    1526.07    n.a.     n.a.     −20.41 a   n.a.    0.841    42.21
DGDS     235.77     21.12    21.88    −1.85      n.a.    0.749    5.98

A: intercept, B: LUMO, C: µ, D: V, E: SASA. R²: square of the correlation coefficient, F: Fisher F value.
a Descriptor: polarizability.

With the individual evaluation of the equations we lose the information on the McReynolds’ stationary phase polarity system as a whole.
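To make the structure of the single-phase models in Table 3 concrete, the sketch below fits an ordinary least-squares model of the BBA constants (Table 2) on LUMO, µ, V and SASA (Table 1). It is only an assumed illustration: it uses NumPy's least-squares solver instead of the PROSTAT/STATISTICA stepwise procedure, performs no p < 0.10 variable selection, and is not claimed to reproduce the tabulated coefficients exactly. The function name is hypothetical.

```python
import numpy as np

# Descriptors of test molecules [1]..[10] from Table 1.
lumo = [0.396, 3.159, 0.797, 0.033, -0.005, 3.116, -0.453, 1.793, 2.840, 3.451]
mu = [0.0, 1.417, 2.719, 4.166, 1.936, 1.451, 1.805, 0.069, 0.0, 0.033]
volume = [82.0, 85.4, 97.5, 82.8, 77.9, 118.8, 118.3, 133.8, 84.8, 141.8]
sasa = [243.9, 263.2, 276.9, 259.1, 238.3, 304.2, 312.7, 363.4, 245.3, 321.7]

# McReynolds constants X..M of BBA from Table 2 (one value per test molecule).
bba = [137, 278, 198, 300, 235, 216, 118, 104, 205, 28]


def fit_mlr(descriptors, y):
    """Least-squares fit y = A + B*d1 + C*d2 + ...; returns (coefficients, R^2)."""
    y = np.asarray(y, dtype=float)
    columns = [np.ones(len(y))] + [np.asarray(d, dtype=float) for d in descriptors]
    design = np.column_stack(columns)          # first column is the intercept
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r2 = 1.0 - (residuals @ residuals) / np.sum((y - y.mean()) ** 2)
    return coef, float(r2)


coef, r2 = fit_mlr([lumo, mu, volume, sasa], bba)
print(np.round(coef, 2), round(r2, 3))
```

With 10 cases and 5 fitted parameters only 5 degrees of freedom remain, which is one reason the text moves on to cross-validated criteria for model selection.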
The descriptors (properties of the test molecules) selected in the statistical evaluation are consistent with the parameters that are important in absorption: LUMO is a measure of electron affinity; the dipole moment µ reflects the polarity of the test molecule; V and SASA, the volume and the solvent (water) accessible surface area, measure the size of the molecule; and the polarizability α measures the flexibility of the electron system of the molecule.

Because MLR with stepwise regression can operate on only one dependent variable at a time, an iterative method was developed to find both the cases (test molecules) and the independent variables (descriptors) necessary for explaining all the McReynolds’ numbers. Thus, the regression model used is:

$$\mathbf{Y}_{10\times 25} = \mathbf{X}_{10\times 9}\,\mathbf{B}_{9\times 25} \qquad (3)$$

Fig. 4. Factor loadings, Factor 1 vs. Factor 2 vs. Factor 3 (unrotated), extraction by principal components.

Fig. 5 shows the scree plots for the X- (Panel a) and Y-blocks (Panel b). For the X- and Y-blocks, 3 and 1 latent variables (LV) can be chosen, respectively, because for the X-block the 4th latent variable has the same small variance component as the remaining ones, and for the Y-block this already holds for the 2nd latent variable. The latent variable of PLS is similar to the factor or principal component of PCA, but in PLS both X and Y are included, so one common number of latent variables has to be selected. Three LVs were chosen for the further investigations, because they explain 91.66% of the covariance between X and Y.

Fig. 5. Scree plots for the X- (Panel a) and Y-blocks (Panel b) to determine the number of latent variables.

PCR and PLS were run for all possible descriptor combinations and almost all test-molecule combinations for McReynolds’ constants, with the selection criterion Q², i.e., the correlation coefficient for the leave-one-out cross-validated data. Because the number of test molecules (cases) and the number of descriptors (variables) are small in our case (10 and 9, respectively), the total case and variable selection procedure with cross-validation could be completed in reasonable time. The process could only be started with six test molecules, because using fewer than six test molecules caused run-time errors in the PCR and PLS functions of PLS Toolbox. All models were validated by leave-one-out cross-validation using Q². Q² was calculated as the correlation coefficient between the original Y and the cross-validated prediction of Y (Y_CV) using 1, 2, ... and all latent variables:

$$Q^2 = \frac{\left[\sum_i (Y_i - \bar{Y})(Y_{CV,i} - \bar{Y}_{CV})\right]^2}{\sum_i (Y_i - \bar{Y})^2 \,\sum_i (Y_{CV,i} - \bar{Y}_{CV})^2} \qquad (4)$$

We found that the best results were obtained with the HOMO, µ, α and apSASA descriptors using six test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane) (Tables 4 and 5). The cross-validated correlation coefficients (Q²) were 0.9832 and 0.9834 for PCR and PLS, respectively. Similar results were obtained when the log P descriptor was omitted, using the same molecules (Tables 4 and 5). The best results are nearly the same for PCR and PLS, but they cannot be significantly distinguished from the second, third, etc. best results. The frequencies of the molecules and the descriptors among the first 50 best Q² results of the PCR and PLS calculations are shown in Table 6. The similarity of the PCR- and PLS-based results is rather satisfying, since they were provided by the simplest and the most complicated procedures, respectively. Our conclusions can be considered relatively well established for the data used: the best results are obtained with the HOMO, µ, α and apSASA descriptors using six test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane).
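The leave-one-out Q² of Eq. (4) can be sketched as follows. This is an assumption-laden illustration, not the authors' MATLAB code: ordinary least squares stands in for the PCR and PLS models of PLS Toolbox, a single response is used, and the data in the example are hypothetical.

```python
import numpy as np


def loo_predictions(x, y):
    """Leave-one-out predictions from a least-squares fit with intercept."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    design = np.column_stack([np.ones(n), np.asarray(x, dtype=float)])
    y_cv = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i               # drop case i, refit, predict case i
        coef, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        y_cv[i] = design[i] @ coef
    return y_cv


def q2(y, y_cv):
    """Eq. (4): squared correlation between Y and its cross-validated prediction."""
    dy = np.asarray(y, dtype=float) - np.mean(y)
    dcv = np.asarray(y_cv, dtype=float) - np.mean(y_cv)
    return float((dy @ dcv) ** 2 / ((dy @ dy) * (dcv @ dcv)))


# Hypothetical, nearly linear data: one descriptor, eight cases.
x_demo = np.arange(8.0)
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.9, 15.2])
q2_demo = q2(y_demo, loo_predictions(x_demo, y_demo))
print(round(q2_demo, 4))
```

In the case/variable selection described in the text, such a Q² would be evaluated for every candidate subset of test molecules and descriptors, and the subsets would then be ranked by it.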
The previous calculations were based on the assumption that the effect of the varying degrees of freedom (due to the reduced data) is negligible. However, we can calculate the adjusted-Q² (Q²a), which is similar to the adjusted-R² [37]:

$$Q_a^2 = 1 - (1 - Q^2)\,\frac{m-1}{m-d} \qquad (5)$$

where m is the number of test molecules and d is the number of descriptors used in the case/variable selection procedure. It is interesting that, while Q² cannot be negative, Q²a can (which means that X cannot explain Y):

$$1 - (1 - Q^2)\,\frac{m-1}{m-d} < 0 \;\Rightarrow\; Q^2 < \frac{d-1}{m-1} \qquad (6)$$

Table 4
Results of PCR calculations (first five best Q²)

A
Test molecules     Descriptors            Q²       Latent variables
[1 3 4 7 9 10]     {1 3 4 9}              0.9832   3
[1 3 4 7 9 10]     {1 2 4 6 9}            0.9827   3
[1 3 4 7 9 10]     {1 3 4 5 6 7 8 9}      0.9818   4
[1 3 4 7 9 10]     {1 2 4 5 6 8}          0.9814   4
[1 3 4 7 9 10]     {1 2 4 5 6 8 9}        0.9804   4

B
Test molecules     Descriptors            Q²       Latent variables
[1 3 4 7 9 10]     {1 3 4 9}              0.9832   3
[1 3 4 7 9 10]     {1 2 4 9}              0.9775   3
[2 3 5 6 7 8]      {2 7}                  0.9736   2
[1 2 3 4 6 8]      {1 2 4 5 8 9}          0.9733   3
[1 2 3 4 6 8]      {1 2 4 5 7 8}          0.9726   4

Independent variables: descriptors of McReynolds’ test molecules, dependent variables: McReynolds’ numbers of the GC columns studied. A: with all nine descriptors, B: without log P (eight descriptors). The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.

Table 5
Results of PLS calculations (first five best Q²)

A
Test molecules     Descriptors            Q²       Latent variables
[1 3 4 7 9 10]     {1 3 4 9}              0.9834   3
[1 3 4 7 9 10]     {1 3 4 5 6 7 8 9}      0.9818   4
[1 3 4 7 9 10]     {1 2 4 5 6 8}          0.9814   4
[1 3 4 7 9 10]     {1 2 4 5 6 8 9}        0.9804   4
[1 3 4 7 9 10]     {1 3 4 6 7 8 9}        0.9798   2

B
Test molecules     Descriptors            Q²       Latent variables
[1 3 4 7 9 10]     {1 3 4 9}              0.9834   3
[1 2 3 4 6 8]      {2 8 9}                0.9750   3
[2 3 4 5 7 10]     {2 3 5 7 8}            0.9741   4
[3 4 5 7 8 9]      {1 3 5}                0.9739   2
[1 2 3 4 6 8]      {1 2 4 5 8 9}          0.9736   3

Independent variables: descriptors of McReynolds’ test molecules, dependent variables: McReynolds’ numbers of the GC columns studied. A: with all nine descriptors, B: without log P (eight descriptors). The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.

Table 6
Frequencies of the molecules and descriptors according to the results of PCR and PLS calculations based on the first 50 best Q²

PLS incl. log P
Molecule   Frequency   Descriptor   Frequency
4          49          1            43
3          48          8            37
1          46          9            34
7          36          5            32
9          36          4            31
10         33          6            31
8          17          3            29
2          16          2            29
6          15          7            25
5          4

PLS excl. log P
Molecule   Frequency   Descriptor   Frequency
3          44          1            39
4          43          2            36
1          42          4            33
8          36          8            31
2          36          9            31
6          33          5            31
7          25          3            30
9          20          7            23
10         14          6            0
5          8

PCR incl. log P
Molecule   Frequency   Descriptor   Frequency
3          49          1            41
4          48          9            35
1          47          2            34
7          37          8            33
9          36          4            31
10         29          6            31
8          22          5            27
2          15          7            22
6          15          3            20
5          2

PCR excl. log P
Molecule   Frequency   Descriptor   Frequency
3          47          2            41
4          46          1            38
1          44          9            36
8          39          4            32
2          36          8            26
6          35          3            25
7          19          5            24
9          17          7            19
10         12          6            0
5          6

Table 7
Results of PCR and PLS calculations including log P (first five best adjusted-Q² (Q²a))

PCR
Test molecules     Descriptors   Q²a      Latent variables
[2 3 5 6 7 8]      {2 7}         0.9671   2
[1 3 4 7 9 10]     {1 3 4 9}     0.9581   3
[1 2 5 6 8 9]      {2 8}         0.9539   2
[1 3 4 7 8 9]      {1 6 9}       0.9535   3
[1 3 4 7 9 10]     {1 4 9}       0.9533   2

PLS
Test molecules     Descriptors   Q²a      Latent variables
[1 3 4 7 9 10]     {1 3 4 9}     0.9585   3
[1 2 3 4 6 8]      {2 8 9}       0.9583   3
[3 4 5 7 8 9]      {1 3 5}       0.9566   2
[2 3 5 6 7 8]      {2 7}         0.9545   2
[1 2 5 6 8 9]      {2 8}         0.9539   2

Independent variables: descriptors of McReynolds’ test molecules, dependent variables: McReynolds’ numbers of the GC columns studied.
The numbers of the test molecules (square brackets) and of the descriptors (braces) are resolved in Table 1.

Table 8
Frequencies of the molecules and descriptors according to the results of PCR and PLS calculations based on the first 50 best adjusted-Q² (Q²a)

PLS
Molecule   Frequency   Descriptor   Frequency
1          42          3            25
2          35          8            21
9          34          2            19
6          31          4            17
10         29          6            14
4          29          1            10
7          28          9            10
8          27          7            8
3          27          5            7
5          26

PCR
Molecule   Frequency   Descriptor   Frequency
3          40          1            18
9          36          4            18
4          35          5            18
7          35          2            16
8          33          8            13
1          30          9            13
10         30          3            12
5          26          6            10
2          25          7            9
6          17

Tables 7 and 8 show the results obtained using adjusted-Q². We can conclude as before: the PCR- and PLS-based results are rather similar and indistinguishable. Using the principle of Occam’s razor, one can choose the best result of the simplest method, i.e., PCR (this result is the fourth best for PLS): the descriptors are LUMO and SASA, and the test molecules are 1-butanol, 2-pentanone, pyridine, 2-methyl-2-pentanol, 1-iodobutane and 2-octyne. LUMO and SASA of the test molecules must be important in absorption: they characterize the strength of test molecule binding to solutes with different polarities.

4. Conclusion

Unfortunately, neither CA nor PCA could give an unambiguous and usable answer to the question of which variables are important and which are negligible in the model. Because MLR with stepwise regression can operate on only one dependent variable at a time, PCR and PLS had to be used for building the regression model. On the basis of the detailed statistical analysis, and accepting only the best results based on Q² (Tables 4 and 5), six McReynolds’ test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hydrindane) are adequate to characterize the polarity of the GC column.
Four descriptors are needed in the regression model in which these descriptors are the independent variables and the McReynolds' numbers are the dependent variables. According to the first 50 best regression models obtained with PCR and PLS (methods that can handle cases with many more dependent variables than independent ones), six McReynolds' test molecules are indeed sufficient (Tables 6 and 8). The number and kind of descriptors depend on the regression method and on whether log P is included or excluded. Seeking the simplest regression algorithm and model, the four descriptors are HOMO, µ, α and apSASA according to the result of PCR excluding log P.

On the other hand, the results obtained with adjusted-Q2 lead to slightly different conclusions. Six McReynolds' test molecules are still sufficient; in this case, however, the molecules are 1-butanol, 2-pentanone, pyridine, 2-methyl-2-pentanol, 1-iodobutane and 2-octyne (note that 2-pentanone and 1-iodobutane are common). Two descriptors, which characterize the measure of absorption in solute, were found to be enough for building the regression model, namely LUMO and SASA (note that LUMO is common).

However, the results based on Q2 and adjusted-Q2 can also be considered together. Tables 4, 5 and 7 show that the descriptive model formed with the six McReynolds' test molecules (benzene, 2-pentanone, 1-nitropropane, 1-iodobutane, 1,4-dioxane and cis-hidrindane) and the four descriptors (HOMO, µ, α and apSASA) placed first in five of the six cases and second in the remaining one (Table 7). These conclusions suggest that, according to the model built with four descriptors, the six McReynolds' test molecules mentioned can provide the same information on polarity as the original 10 McReynolds' test molecules.
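The case-and-variable selection summarized in these conclusions can be illustrated with a minimal sketch. It assumes leave-one-out cross-validation and the common definition Q2 = 1 - PRESS/SS_total; the cross-validation scheme, the adjusted-Q2 formula and the descriptor values actually used in the paper are not reproduced here. All function names (`pcr_fit_predict`, `loo_q2`, `rank_subsets`, `frequencies`) are hypothetical, and the data are synthetic stand-ins, not McReynolds' constants.

```python
from collections import Counter
from itertools import combinations

import numpy as np


def pcr_fit_predict(X_train, Y_train, X_test, n_components):
    """Principal component regression: regress Y on the leading PCs of X."""
    x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
    U, s, Vt = U[:, :n_components], s[:n_components], Vt[:n_components]
    # Coefficients mapping centred descriptors to centred responses.
    B = Vt.T @ ((U.T @ (Y_train - y_mean)) / s[:, None])
    return (X_test - x_mean) @ B + y_mean


def loo_q2(X, Y, n_components):
    """Leave-one-out Q2 = 1 - PRESS / SS_total, pooled over all responses."""
    n = X.shape[0]
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        pred = pcr_fit_predict(X[keep], Y[keep], X[i:i + 1], n_components)
        press += np.sum((Y[i:i + 1] - pred) ** 2)
    ss_total = np.sum((Y - Y.mean(axis=0)) ** 2)
    return 1.0 - press / ss_total


def rank_subsets(X, Y, n_cases, n_vars, n_components):
    """Score every (case subset, variable subset) pair by LOO Q2, best first."""
    results = []
    for rows in combinations(range(X.shape[0]), n_cases):
        for cols in combinations(range(X.shape[1]), n_vars):
            q2 = loo_q2(X[np.ix_(rows, cols)], Y[list(rows)], n_components)
            results.append((q2, rows, cols))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


def frequencies(results, top):
    """Count how often each case and variable occurs in the `top` best models."""
    case_counts, var_counts = Counter(), Counter()
    for _, rows, cols in results[:top]:
        case_counts.update(rows)
        var_counts.update(cols)
    return case_counts, var_counts


# Synthetic stand-in data (NOT the paper's): 8 "molecules", 5 "descriptors",
# 4 "columns"; only descriptors 0 and 2 actually drive the responses.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))
Y = X[:, [0, 2]] @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(8, 4))

ranked = rank_subsets(X, Y, n_cases=6, n_vars=2, n_components=2)
case_freq, var_freq = frequencies(ranked, top=20)
print("best Q2:", round(ranked[0][0], 4), "cases:", ranked[0][1], "vars:", ranked[0][2])
print("descriptor frequencies:", var_freq.most_common())
```

In this sketch the rows of `X` play the role of test molecules (cases) and its columns the role of descriptors (variables), so the frequency counts correspond in spirit to Tables 6 and 8. The demonstration is deliberately smaller than the 10-molecule, nine-descriptor problem so that the exhaustive enumeration stays fast.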
Given the promising results obtained in building the descriptive model, we are now working on a predictive model using the novel case/variable selection method introduced in this paper, which uses PLS and PCR combined with Q2 and adjusted-Q2.

Acknowledgement

The authors thank Károly Héberger and István Pálinkó for their help in improving the manuscript, and gratefully acknowledge the helpful critical comments of the anonymous referees. This work was supported by the Hungarian Scientific Research Fund (OTKA/T032966 and OTKA/T046484) and by I. Széchenyi Research Fellowships (R.R. and T.K.).

References

[1] H. Rotzsche, Flüssige und chemisch gebundene stationäre Phasen, in: E. Leibnitz, H.G. Struppe (Eds.), Handbuch der Gaschromatographie, Akademische Verlagsgesellschaft Geest and Portig K.-G., Leipzig, Germany, 1984, pp. 442–506.
[2] T. Körtvélyesi, M. Görgényi, K. Héberger, Anal. Chim. Acta 428 (2001) 73–82.
[3] W.O. McReynolds, J. Chromatogr. Sci. 8 (1970) 685–691.
[4] G. Tarján, Á. Kiss, G. Kocsis, S. Mészáros, J.M. Takács, J. Chromatogr. 119 (1976) 327–332.
[5] E. Fernandez-Sanchez, A. Fernandez-Torres, J.A. Garcia-Dominguez, J.M. Santiuste, Chromatographia 31 (1991) 75–79.
[6] L.R. Snyder, J. Chromatogr. 92 (1974) 223–230.
[7] G. Castello, G. D'Amato, S. Vezzani, J. Chromatogr. 646 (1993) 361–368.
[8] R.V. Golovnya, B.M. Polanuer, J. Chromatogr. 517 (1990) 51–66.
[9] K. Héberger, Chemom. Intell. Lab. Syst. 47 (1999) 41–49.
[10] K. Héberger, M. Görgényi, J. Chromatogr. A 845 (1999) 21–31.
[11] K. Héberger, M. Görgényi, M. Sjöström, Chromatographia 51 (2000) 595–600.
[12] R.V. Golovnya, T. Misharina, Chromatographia 10 (1977) 658–660.
[13] R.V. Golovnya, T.A. Misharina, Chromatographia 190 (1980) 1–12.
[14] L. Rohrschneider, J. Chromatogr. 17 (1965) 1–12.
[15] L. Rohrschneider, J. Chromatogr. 22 (1966) 6–22.
[16] E.R. Malinowski, Factor Analysis in Chemistry, 3rd ed., Wiley, New York, USA, 2002.
[17] H. Rotzsche, Stationary Phases in Gas Chromatography, Journal of Chromatography Library, vol. 48, Elsevier, Amsterdam, 1991.
[18] M. Görgényi, Z. Fekete, L. Seres, Chromatographia 27 (1989) 581–584.
[19] T. Körtvélyesi, M. Görgényi, L. Seres, Chromatographia 41 (1995) 282–286.
[20] Z. Király, T. Körtvélyesi, L. Seres, M. Görgényi, Chromatographia 42 (1996) 653–659.
[21] A. Garcia-Raso, F. Saura-Calixto, M. Raso, J. Chromatogr. 302 (1984) 107–117.
[22] N. Dimov, A. Osman, O.V. Mekanyan, D. Papazova, Anal. Chim. Acta 298 (1994) 303–317.
[23] R. Kaliszan, H.-D. Höltje, J. Chromatogr. 234 (1982) 303–311.
[24] K. Osmialowski, J. Halkiewicz, A. Radecki, R. Kaliszan, J. Chromatogr. 346 (1985) 53–60.
[25] A.R. Katritzky, E.S. Ignatchenko, R.A. Barcock, V.S. Lobanov, M. Karelson, Anal. Chem. 66 (1994) 1799–1807.
[26] R.P.W. Scott, J. Chromatogr. 122 (1976) 35–53.
[27] V.S. Ong, R.A. Hites, Anal. Chem. 63 (1991) 2829–2834.
[28] J.J.P. Stewart, MOPAC93, Fujitsu Ltd., Tokyo, 1994.
[29] A. Pedretti, G. Vistoli, VEGA, Version 1.5, 2003.
[30] PROSTAT Ver. 3.0, PolySoftware, P.O. Box 60, Pearl River, NY 10965, USA.
[31] STATISTICA 99, Statsoft, 2300 East 14th St., Tulsa, Oklahoma 74104, USA.
[32] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[33] H. Martens, T. Næs, Multivariate Calibration, Wiley, Chichester, UK, 1991.
[34] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, Wiley, Chichester, UK, 2003.
[35] Eigenvector Research Inc., PLS-Toolbox® Version 3.0.3a, 2003.
[36] The Mathworks Inc., MATLAB®, Version 6.1 (R12.1) User's Guide, 2000.
[37] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., Wiley, New York, USA, 1981.