Application of Class-Modelling Techniques to Near Infrared Data for Food Authentication Purposes

P. Oliveri (1), V. Di Egidio (2), T. Woodcock (3) and G. Downey (3)

(1) University of Genoa, Department of Drug and Food Chemistry and Technology, Via Brigata Salerno, 13 - 16147 Genoa, Italy
(2) University of Milan, Department of Food Science and Technology, Via Celoria, 2 - 20133 Milan, Italy
(3) Teagasc, Ashtown Food Research Centre, Ashtown, Dublin 15, Ireland

Abstract

Following the introduction of legal identifiers of geographic origin within Europe, methods for confirming any such claims are required. Spectroscopic techniques provide a method for rapid and non-destructive data collection and a variety of chemometric approaches have been deployed for their interrogation. In the present study, class-modelling techniques (SIMCA, UNEQ and POTFUN) were deployed after data compression by principal component analysis to develop class models for sets of olive oil and honey samples. The number of principal components, the confidence level and spectral pre-treatments (1st and 2nd derivative, standard normal variate) were varied, and a strategy for variable selection was tried. Models were evaluated on a separate validation sample set. The outcomes are reported and criteria for selection of the most appropriate models for any given application are discussed.

Keywords: food authenticity; class-modelling; chemometrics; NIR; spectroscopy

1. Introduction

In recent years, there has been a growing interest among consumers in the safety and traceability of food products. In particular, there is an increasing focus on the geographical origin of raw materials and finished products, for several reasons including specific sensory properties, perceived health values, confidence in locally-produced products and, finally, media attention (Luykx & Van Ruth, 2008).
As a result of these factors, the European Union has recognised and supported the differentiation of quality products on a regional basis (Dimara & Skuras, 2003), introducing an integrated framework for the protection of geographical origin for agricultural products and foodstuffs by specific regulation (EU Regulation 510/2006). This regulation permits the application of the following geographical indications to a food product: protected designation of origin (PDO), protected geographical indication (PGI) and traditional speciality guaranteed (TSG).

Traditional analyses for food authentication, based on chemical and physical methods, have several drawbacks, the most significant of which are low speed, the necessity for sample pre-treatments, a requirement for highly-skilled personnel and destruction of the sample (Tomás-Barberán, Ferreres, García-Viguera & Tomás-Lorente, 1993; Stefanoudaki, Kotsifaki & Koutsaftakis, 1997; Anderson & Smith, 2002; Consolandi et al., 2008). Several fast and non-destructive instrumental methods have been proposed to overcome these hurdles. Among them, infrared spectroscopy has proven to be a successful analytical method for analyses of a variety of food products. In particular, the NIR region (between 750 and 2500 nm), in which overtones and combinations of the fundamental O-H, C-H and N-H bond vibrations are the main recordable phenomena (Williams & Norris, 2001), gives useful spectral fingerprints of food samples.

In the literature, there are several studies reporting the use of NIR spectroscopy to determine geographical and varietal origin of olive oil (Galtier et al., 2007; Casale, Casolino, Ferrari & Forina, 2008; Sinelli, Casiraghi, Tura & Downey, 2008), wine (Cozzolino, Smyth & Gishen, 2003; Liu et al., 2008), honey (Toher, Downey & Murphy, 2007; Woodcock, Downey, Kelly & O'Donnell, 2007) and other products (Reid, O'Donnell & Downey, 2006).
In all these cases, multivariate data analysis has been shown to be indispensable for the extraction of the maximum useful information from recorded signals.

In most of the applications found in the literature which report discrimination on the basis of geographical origin, classification techniques like linear discriminant analysis (LDA) and partial least squares discriminant analysis (PLS-DA) have been used. Given a number of samples belonging to some pre-defined classes, classification techniques build a delimiter between these classes and always assign each object to the category to which it most probably belongs, even in the case of objects which are actually extraneous to the classes studied. These methods are not the best approach for the verification of food geographical origin, not least because normally there is only a single class of interest, e.g. a single PDO. In such cases, class-modelling techniques can be properly and more appropriately applied (Forina, Casale & Oliveri, 2009; Marini, Bucci, Magrì & Magrì, 2010); in fact, they provide an answer to the general question: "Is sample X, stated to be of class A, really compatible with the class A model?". This is essentially the question to be answered in addressing problems of food geographical authenticity: if a product is sold with a specific claim on a label regarding provenance, it is important to be able to verify compatibility of its measured characteristics with those of authentic similar material from the declared origin (Forina, Oliveri, Lanteri & Casale, 2008).

A class model is characterised by two parameters: sensitivity and specificity. However, when a class-modelling technique such as SIMCA is applied in food authentication, attention is often focused only on its classification performance (e.g. correct classification rate).
Use of such a restricted focus under-utilises the significant characteristics of the class-modelling approach.

The aim of this work is the exploration of three different class-modelling techniques (SIMCA, UNEQ and POTFUN) to evaluate their abilities for verifying the declared geographical origin of two PDO food products: olive oil from Liguria and honey from Corsica. They represent economically-valuable food products traded extensively internationally and differing widely in composition.

These sample sets were collected as part of the EU-funded TRACE project (http://www.trace.eu.org) and their authenticity is guaranteed. In the case of the olive oil samples, classification techniques have previously been applied to the NIR spectral collection and the results published (Woodcock, Downey & O'Donnell, 2008).

2. Materials and Methods

2.1 Samples and NIR analysis

Olive oil samples (n = 913) were collected from a number of different areas in Europe over a period spanning three harvests: 2005 (316 samples: 63 samples from Liguria and 253 from other regions), 2006 (352 samples: 79 samples from Liguria and 273 from other regions) and 2007 (245 samples: 68 samples from Liguria and 177 from other regions). Oils were sourced from Mediterranean countries, i.e. Italy, Spain, France, Greece, Cyprus and Turkey (Table 1). All oils were transported to a single laboratory (Joint Research Centre, Ispra, Italy) for sub-sampling and delivery by air to Ashtown Food Research Centre. Uniquely, oils from harvest 2 (2006) were collected and distributed in two separate batches. All olive oils were stored in a refrigerated room (4 °C) in the dark between delivery and spectral acquisition (less than 2 weeks), minimising the chance of any significant chemical change occurring during this time period. Olive oil samples (50 ml approx.)
were placed in screw-capped vials in a water-bath maintained at 30 °C and allowed to equilibrate for 30 minutes prior to spectral acquisition. Transflectance spectra (1100-2498 nm) at 2 nm intervals (700 variables) of each sample were collected. Full experimental details have been previously described in Woodcock et al. (2008). Figure 1.a shows an example of NIR spectra of olive oil samples.

Artisanal unfiltered honey samples were collected directly from beekeepers during harvest 2006 (111 samples from Corsica and 71 from other regions). All honey samples collected were stored in the dark in screw-cap jars at room temperature (18-25 °C) between collection and spectral acquisition (less than 2 weeks). Immediately prior to spectral collection, honey samples were incubated at 40 °C overnight in an air oven and manually stirred to ensure homogeneity. The solids content of each sample was measured using a benchtop Abbé model 2WA (Kernco Instruments, Texas, USA) refractometer and each sample was adjusted to a standard solids content (70 ± 1 °Brix) with distilled water. This step was necessary to minimise spectral complications arising from naturally-occurring variations in sugar concentration and to avoid spurious classifications on the basis of solids content variations between honey samples.

Before being scanned, honey samples (50 ml approx.) were placed in screw-capped vials in a water-bath maintained at 30 °C for 30 minutes. Spectral data were collected in transflectance mode (1100-2498 nm) at 2 nm intervals (700 variables). Full experimental details have been previously described in Woodcock, Downey, Kelly & O'Donnell (2007).

2.2 Data pre-processing

Spectral data of olive oil and honey were exported from The Unscrambler binary files in ASCII format and imported into the chemometric package V-PARVUS (Forina, Lanteri, Armanino, Casolino, Casale & Oliveri, 2008).
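As an illustration of how exported ASCII spectra of this kind could be handled outside V-PARVUS, the following Python sketch rebuilds the wavelength axis implied by the acquisition settings (1100-2498 nm at 2 nm intervals, i.e. 700 variables) and applies the two families of pre-treatment examined in this section, the standard normal variate (SNV) transform and Savitzky-Golay derivatives. It is not the V-PARVUS implementation: the function names, the use of NumPy, the synthetic spectra and the decision to drop incomplete windows at the spectrum edges are all assumptions made for the example.

```python
import math
import numpy as np

# Wavelength axis implied by the acquisition settings: 1100-2498 nm, 2 nm step.
wavelengths = np.arange(1100, 2500, 2)  # 700 variables

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

def savgol_weights(window, polyorder, deriv, delta=1.0):
    """Savitzky-Golay weights obtained from a least-squares polynomial fit."""
    half = window // 2
    z = np.arange(-half, half + 1, dtype=float)
    A = np.vander(z, polyorder + 1, increasing=True)  # A[i, k] = z_i ** k
    # Row `deriv` of the pseudo-inverse yields the fitted coefficient of z**deriv;
    # the derivative at the window centre is deriv! times that coefficient.
    return math.factorial(deriv) * np.linalg.pinv(A)[deriv] / delta ** deriv

def savgol_derivative(spectra, window=5, polyorder=3, deriv=1, delta=2.0):
    """Filter each spectrum (row); incomplete windows at the edges are dropped."""
    w = savgol_weights(window, polyorder, deriv, delta)
    spectra = np.asarray(spectra, dtype=float)
    return np.array([np.correlate(row, w, mode="valid") for row in spectra])

# Hypothetical spectra: one smooth curve with a per-sample multiplicative
# factor and additive offset, mimicking scatter-like effects.
rng = np.random.default_rng(0)
base = np.sin(wavelengths / 200.0)
spectra = np.array([a * base + b for a, b in rng.uniform(0.5, 2.0, size=(4, 2))])

snv_spectra = snv(spectra)                           # scatter-corrected spectra
d1 = savgol_derivative(spectra, window=5, deriv=1)   # 1st derivative, 5 datapoints
d2 = savgol_derivative(spectra, window=21, deriv=2)  # 2nd derivative, 21 datapoints
```

In this toy case the four spectra differ only by a multiplicative factor and an offset, so SNV maps them onto the same curve; real scatter effects are only partly of this form.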
Olive oil spectra were structured in a data matrix with 913 rows (samples) and 700 columns (variables). A calibration sample set was constructed so as to include equal numbers of Ligurian and non-Ligurian oils, i.e. to be balanced with regard to sample type. Two-thirds (n = 140) of the Ligurian samples were selected at random from the 3-year harvest collection, as were an equal number of non-Ligurian oils, to form a calibration set of 280 samples. All the remaining samples were used as an external validation sample set.

Honey spectra were structured in a matrix with 182 rows (samples) and 700 columns (variables). The calibration sample set was constructed by selecting at random two-thirds of the total sample set; all of the remaining samples were used as an external validation sample set.

To remove or at least minimise any unwanted spectral contribution arising from e.g. light scatter (Blanco & Pagés, 2002), the effect of a number of mathematical pre-treatments was investigated for all models. These pre-treatments included 1st and 2nd derivative using the Savitzky-Golay method (Alciaturi, Escobar & De La Cruz, 1998; Vaiphasa, 2006) with cubic smoothing and segment sizes between 5 and 21 datapoints, and the standard normal variate (SNV) transform (Barnes, Dhanoa & Lister, 1989).

2.3 Class-modelling techniques

The modelling techniques used in this work were SIMCA (soft independent modelling of class analogy), UNEQ (unequal class spaces) and POTFUN (potential function techniques).

2.3.1 Soft Independent Modelling of Class Analogy (SIMCA)

SIMCA (Wold & Sjöström, 1977) was the first class-modelling technique used in chemometrics; the central feature of this method is the application of principal component analysis (PCA) to the sample category studied (e.g. a PDO food product), generally after within-class autoscaling or centering.
SIMCA models are defined by the range of the sample scores on a selected number of low-order principal components (PCs); models therefore correspond to rectangles (two PCs), parallelepipeds (three PCs) or hyper-parallelepipeds (more than three PCs), referred to as the SIMCA inner space. Conversely, the principal components not used to describe the model define the outer space, considered as uninformative space. The scores range can be enlarged or reduced, depending mainly on the number of samples, to avoid the possibility of under- or over-estimation (Forina & Lanteri, 1984). The standard deviation of the distance of the objects in the calibration set from a model is the class standard deviation. The boundaries of SIMCA space around the model are determined by a critical distance, which is obtained by means of Fisher statistics; there is no specific hypothesis other than that this distance should be normally distributed. However, the distribution of samples in the inner space should be more-or-less uniform, otherwise regions in the inner space lacking objects from the modelled class would be incorrectly considered as a part of the model. Figure 2 shows an example of non-uniform sample distribution in a space defined by the first two PCs: in such a case SIMCA is clearly not a satisfactory device (Fig. 2.a), as this model includes a large number of objects not belonging to the modelled class.

SIMCA is a very flexible technique since it allows variation in a large number of parameters such as scaling or weighting of the original variables, number of components, expanded or contracted scores range, different weights for the distances from the model in the inner space and in the outer space, and confidence level applied.

2.3.2 UNEQual class spaces (UNEQ)

UNEQ, originating in the work of Hotelling (1947), was introduced into chemometrics by Derde & Massart (1986).
This technique derives from QDA (quadratic discriminant analysis); it is based on the hypothesis of a multivariate normal distribution in each category studied and, consequently, on the use of the T² statistic to define a class space. The UNEQ model is the centroid, i.e. the vector of the mean values of the variables. UNEQ should be applied in cases when the ratio between the number of objects in a given category and the number of the variables measured is 3 or greater. In cases involving many variables (such as spectral data), it is possible to apply UNEQ following a preliminary reduction in variable number by PCA. The boundary of the class space around the centroid is an ellipse (two variables), an ellipsoid (three variables) or a hyper-ellipsoid (more than three variables). The dispersion of a class space is defined by the critical value of the T² statistic at a selected confidence level. The shape and the orientation of the ellipse depend on the correlation between the variables. As was stated for SIMCA, if sample distribution in variable space is not uniform, UNEQ may not provide satisfactory results (Fig. 2.b).

2.3.3 POTential FUNction techniques (POTFUN)

Potential function techniques were introduced to chemometrics by Coomans & Broeckaert (1986). These methods estimate a probability density distribution as the sum of the contributions of each single object in the calibration set. Here we used a Gaussian-like contribution, with a smoothing coefficient (formally analogous to the standard deviation of the Gaussian probability function and evaluated by means of a leave-one-out procedure) that is a function of the local density of objects: this strategy, known as normal variable-potential, is useful when the underlying multivariate distribution is very asymmetric, with wide regions characterised by a very low density of objects (Forina, Armanino, Leardi & Drava, 1991).
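The idea of summing one Gaussian-like contribution per calibration object, with a smoothing coefficient that follows the local density, can be sketched in Python as follows. This is a simplified illustration and not the normal variable-potential procedure of the cited work: the smoothing coefficient of each object is assumed here to be its distance to its k-th nearest neighbour (rather than being optimised by leave-one-out), the class boundary is drawn with a plain percentile rule on the calibration densities, and all names and the synthetic two-cluster data are hypothetical.

```python
import numpy as np

def variable_potential_density(calibration, points, k=3):
    """Density at `points` as the mean of one Gaussian kernel per calibration object."""
    calibration = np.asarray(calibration, dtype=float)
    points = np.asarray(points, dtype=float)
    n_dims = calibration.shape[1]
    # Pairwise distances between calibration objects.
    diff = calibration[:, None, :] - calibration[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    # Assumed rule: the smoothing coefficient of an object is the distance to its
    # k-th nearest neighbour (column 0 of the sorted distances is the object itself),
    # so kernels are narrow in dense regions and wide in sparse ones.
    sigma = np.sort(dist, axis=1)[:, k]
    # Distances from every query point to every calibration object.
    d = np.linalg.norm(points[:, None, :] - calibration[None, :, :], axis=2)
    norm = (2.0 * np.pi) ** (n_dims / 2.0) * sigma ** n_dims
    kernels = np.exp(-0.5 * (d / sigma) ** 2) / norm
    return kernels.mean(axis=1)

def accept(calibration, points, confidence=0.95, k=3):
    """Accept points whose density reaches the (1 - confidence) quantile of the
    calibration densities (a stand-in boundary rule, not leave-one-out)."""
    cal_density = variable_potential_density(calibration, calibration, k)
    threshold = np.quantile(cal_density, 1.0 - confidence)
    return variable_potential_density(calibration, points, k) >= threshold

# Hypothetical two-PC scores for one class forming two separate clusters.
rng = np.random.default_rng(1)
cluster_a = rng.normal([0.0, 0.0], 0.3, size=(30, 2))
cluster_b = rng.normal([5.0, 5.0], 0.3, size=(30, 2))
scores = np.vstack([cluster_a, cluster_b])

queries = np.array([[0.0, 0.0],   # inside one cluster
                    [2.5, 2.5]])  # in the empty region between the clusters
density = variable_potential_density(scores, queries)
accepted = accept(scores, queries)
```

The query point placed in the empty region between the clusters receives a far lower density than the one inside a cluster, which is the behaviour that a single rectangle or ellipse around both clusters cannot reproduce.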
The resulting estimated distribution can be very complex, capable of effectively describing non-uniform distributions of samples (Fig. 2.c). The boundary of the class space is obtained in this approach by the method of the equivalent determinant (Forina et al., 1991) at a selected confidence level.

2.4 Validation parameters

Sensitivity, specificity and efficiency were computed in this work to evaluate model performance. Sensitivity is defined as the percentage of objects in the external validation set belonging to the modelled class which are accepted by the model developed using objects in the calibration set. Specificity is the percentage of objects belonging to the other (un-modelled) category or categories in the external validation set which are rejected by the model developed using objects in the calibration set. A class-modelling technique builds a class space around a mathematical class model; this space is the confidence interval, at a pre-selected confidence level, for the class objects, and sensitivity is its experimental measure. A decrease in the confidence level for the modelled class decreases the sensitivity and increases the specificity of the model. Efficiency has been computed in this study as the geometric mean of the sensitivity and specificity values.

2.5 Variable selection

In the case of spectral data, not all wavelengths contribute unique or useful information. The elimination of predictors of limited or negligible utility can increase the efficiency of class models or make their interpretation simpler. Stepwise decorrelation of variables, introduced by Kowalski & Bender (1976) as an algorithm with the name SELECT, was applied to the spectral datasets in this study. This algorithm can be used for classification, class-modelling and multivariate quantitative calibration problems.
In the case of classification and class-modelling problems, SELECT functions by a sequential identification and decorrelation of individual variables in the initial variable set. SELECT first identifies the variable with the largest Fisher weight (Wold & Sjöström, 1977), a measure of the ratio of between-class variance to within-class variance. This first selected variable is decorrelated from all the others so that these latter become orthogonal to that selected. After decorrelation, the selected variable is set aside from the variable set and the process is repeated with another identification. This identification and decorrelation procedure continues until the decorrelated predictors do not have a significant Fisher weight. Decorrelation is especially important in the case of spectral data because contiguous variables are often highly inter-correlated.

3. Results and discussion

As is typical with NIR spectra, no difference between spectra of olive oils or honey from different regions was detectable by the naked eye (data not shown). However, some indication of variability in each spectral dataset may be obtained from a principal component scores plot; these are shown for olive oil and honey in Figures 1 (a) and (b) respectively. In the case of olive oils, the scores plot on PC1 (accounting for 82% of the variation) and PC2 (accounting for 16% of the variation) reveals some clustering behaviour that probably relates to harvest, storage time or scanning time. For honey, two phenomena may be seen. The first is a vertical separation between samples on the basis of harvest year while the second involves a large dispersion of honeys along PC1 in the case of harvest 1. The cause of this latter behaviour is unknown; the separation along PC2 may be related to factors relating to each harvest or instrument drift or both.
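A scores plot of this kind, together with the percentage of variation carried by each component, can be obtained from a column-centred spectral matrix by singular value decomposition. The Python sketch below shows the computation with NumPy; the function name is hypothetical and the synthetic matrix merely stands in for the real spectra, which are not reproduced here, so its explained-variance figures bear no relation to the 82% and 16% quoted above.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Return PCA scores and the fraction of variance explained by each component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # column centering
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)          # variance fractions, largest first
    return scores, explained[:n_components]

# Synthetic stand-in for a spectral matrix: 50 samples x 700 variables,
# generated from two latent factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 2)) * [5.0, 2.0]
loadings = rng.normal(size=(2, 700))
X = latent @ loadings + 0.1 * rng.normal(size=(50, 700))

scores, explained = pca_scores(X)
# scores[:, 0] and scores[:, 1] are the coordinates for a PC1 vs PC2 plot;
# 100 * explained gives the percentage of variation for the axis labels.
```

Colouring the plotted points by harvest or scanning date is one simple way to check whether clustering of the kind described above follows such factors.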
Tables 2 and 3 show the parameters used to evaluate class-model performance on the relevant external validation set for a number of Ligurian olive oil and Corsican honey models. In both cases, the number of principal components was varied over a very large range (from 2 to 20), a range which was much larger than the number of significant components. In addition, the confidence level of each model was varied between 95% and 80%. Two column pre-treatments (column centering and column autoscaling) were applied prior to PCA in order to try to balance the influence of the variables and improve model performance. The results reported are those which produced the highest percentage efficiency on the external validation set for each data pre-treatment and confidence level; they are the best according to the maximum efficiency criterion and generally correspond to relatively well-balanced values of specificity and sensitivity. Figure 3 is a graphical representation of these results: sensitivity and specificity show, as expected, an overall negative correlation. Models represented in the upper left or in the lower right corners of these plots are characterised by relatively unbalanced parameters.

In the case of Ligurian olive oil, as shown in Table 2, the three class-modelling techniques provided quite similar results although at different confidence levels. The sensitivity and specificity values associated with the maximum efficiency varied between 70.0-90.0% and 62.2-87.7%, respectively. The highest efficiency (84.5%) was obtained by a SIMCA model involving 9 PCs after column-centering and a 2nd derivative (21 datapoints) pre-treatment. Comparable values were produced by a number of POTFUN models while UNEQ efficiencies never exceeded 80%.
As regards selecting the best model, parameters to be considered in addition to sensitivity, specificity and efficiency include the number of components in each model and the confidence level selected. It should be noted that quite large numbers of principal components are included in all of the models reported in Table 2, varying from 8 or 9 in the case of SIMCA up to 13 for some POTFUN models. This, however, may be a reflection of the type of problem which is being addressed, namely trying to differentiate between olive oils from separate geographic areas. While there may be some expected differences in terms of varietal composition of oils, chemical differences which are detected by near infrared spectroscopy may be expected to be very small and therefore represented by sources of variation which require the involvement of high-order principal components for their description. It may be useful to examine the performance of models which are similar to that producing e.g. the highest efficiency value. Take, for example, the SIMCA model with the efficiency value equal to 84.5%. This is derived by a 2nd derivative (21 datapoints) pre-treatment, but efficiency values for related models produced by 2nd derivative (13 datapoints), 2nd derivative (9 datapoints) and 2nd derivative (5 datapoints) are considerably lower at 77.2, 78.4 and 77.0% respectively. By way of contrast, there are 3 POTFUN models producing percentage efficiency values around 83%, i.e. 1st derivative (5 datapoints), 1st derivative (9 datapoints) and 1st derivative (13 datapoints). These results were also achieved at a higher confidence level (95% vs 90%), which is preferable. POTFUN models can fit data of this type, which exhibit a complex sample distribution, better than SIMCA. For this reason, POTFUN results are quite constant when model parameters are varied, as is evident from an inspection of Figure 3-a.
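Because efficiency is the geometric mean of sensitivity and specificity (Section 2.4), it rewards balanced models more than an arithmetic mean would. The short Python sketch below computes the three validation parameters from accept/reject decisions on an external validation set; the function name and the counts are hypothetical and are not taken from Tables 2 or 3.

```python
import math

def class_model_metrics(target_accepted, other_accepted):
    """Sensitivity, specificity and efficiency (all in %) for one class model.

    target_accepted: accept (True) / reject (False) decisions for validation
                     samples that belong to the modelled class.
    other_accepted:  the same decisions for validation samples from other classes.
    """
    sensitivity = 100.0 * sum(target_accepted) / len(target_accepted)
    specificity = 100.0 * sum(not a for a in other_accepted) / len(other_accepted)
    efficiency = math.sqrt(sensitivity * specificity)  # geometric mean
    return sensitivity, specificity, efficiency

# Hypothetical balanced model: 9 of 10 class samples accepted,
# 8 of 10 extraneous samples rejected.
balanced = class_model_metrics([True] * 9 + [False], [False] * 8 + [True] * 2)

# Hypothetical unbalanced model: every class sample accepted,
# but only 4 of 10 extraneous samples rejected.
unbalanced = class_model_metrics([True] * 10, [False] * 4 + [True] * 6)
```

With these made-up counts the balanced model has sensitivity 90% and specificity 80%, while the unbalanced one has 100% and 40%; the geometric mean separates the two more sharply than the arithmetic mean (85 versus 70) would.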
Sensitivity and specificity percentages for any given model can be altered by varying the associated confidence level. A particular example of this is given by POTFUN which, at a 90% confidence level and using more than 12 PCs, produced high specificity values (more than 90%) but was associated with very low sensitivity figures of between 60 and 70% (unpublished data). Similarly, UNEQ models developed on this olive oil dataset achieved even higher sensitivity values (about 97%) but at the expense of unacceptably low specificity values (<40%) (results not shown).

In the case of Corsican honey, as shown in Table 3, models produced by UNEQ and POTFUN were characterised by similar sensitivity values although UNEQ produced models with the highest sensitivity (94.6%). It was notable that sensitivity values for all of the SIMCA models were much lower (59.5 to 70%). Regarding specificity values, SIMCA did produce models with very high values (between 87 and 100%) while POTFUN and UNEQ produced similar and lower ranges from 69.6 up to 87.0%. Overall, the highest efficiency (88.4%) was obtained by UNEQ and involved a model using 6 PCs, a 2nd derivative (9 datapoints) pre-treatment and an 85% confidence level. POTFUN also produced models with high efficiency, around 83%. Since the results of SIMCA are very unbalanced with regard to sensitivity and specificity values, as clearly shown in Figure 3-b, SIMCA models cannot be considered appropriate for this data set.

Applying the criteria discussed above, it may be that the model with the best performance figures at a high (95%) confidence level and a low (4) number of principal components involves POTFUN using a 1st derivative (5 or 9 datapoints) pre-treatment.

Efficiency values achieved by SIMCA and POTFUN varied between 75.4-82.2% and 75.1-84.0%, respectively.
Also in the case of honey, with a proper combination of number of PCs and confidence level, UNEQ and POTFUN can provide high-sensitivity models (close to 100%) but these are associated with very low specificity values (data not shown).

In food authentication problems, both sensitivity and specificity are important. Sensitivity near to 100% means that almost all samples produced and controlled by e.g. the PDO Corsican honey consortium are accepted by the model built for this class; associated high specificity values mean that almost all non-Corsican samples are rejected. Depending on the particular aim, therefore, it is possible and maybe even desirable to choose models characterised by elevated values of either or both of these parameters. For example, from the viewpoint of a national regulatory body, a model with 100% sensitivity and specificity is the goal since all compliant and non-compliant samples will be correctly identified. However, in the case of a marketing body, the most important criterion for a good model may be that it identifies all of the samples produced in e.g. a given PDO as being compliant with the model for samples from that PDO. Such a model does not mean that samples originating outside the region will, in all cases, be classified outside the model but that is not of prime importance in this case.

Variable selection using the SELECT algorithm was applied to both olive oil and honey spectral datasets with the aim of removing uninformative variables and thereby potentially improving model sensitivity and specificity. The results showed sensitivity and specificity values very close to those obtained using the full variable dataset. In this case, therefore, variable reduction did not significantly improve percentage efficiency results.
For example, the best POTFUN models for honey data after variable reduction achieved a maximum efficiency (1st derivative, 5 datapoints) of 85.8% versus 83.2% obtained previously when all variables were retained. The explanation for this may be that all models were developed using principal components, which extract the maximum useful information available in the spectral data. It may also be, of course, that the information used in model development is dispersed throughout the entire spectral range.

Classification results for these sample sets and issues have been previously published (Woodcock et al., 2008; Woodcock et al., 2009). With regard to olive oils, the best PLS-DA models correctly identified the origins of 92.8 and 81.2% of Ligurian and non-Ligurian samples in a prediction sample set respectively. In the case of honeys, correct classification rates of 90.4 and 86.3% for Corsican and non-Corsican prediction samples respectively were reported. It is difficult to make meaningful comparisons between these results and those of the class-models developed in the current study but it is fair to say that class-modelling methods produced correct classification rates similar to those achieved by PLS-DA and have a firmer theoretical basis. Given that the classification approach depends on the development of two models (e.g. Ligurian olive oil and non-Ligurian olive oil) and that the composition of the latter class cannot be well-defined, PLS-DA models will always be heavily influenced by the composition of this sample class.

4. Conclusions

Class-modelling methods are appropriate for addressing problems of the type that typically arise in food authentication studies. In this study, models have been developed for olive oil and honey, each from a single PDO region in Italy and France respectively.
Different options with regard to the identification of the most appropriate or best model have been discussed. Arising from this discussion, models with sensitivity percentages of up to 90% for Ligurian olive oil (UNEQ) and 94.6% for Corsican honey (UNEQ) may be selected. Using overall percentage efficiency as an indicator of success, best models for both Ligurian olive oil and Corsican honey were developed by a potential function technique (POTFUN) and produced values of around 83%.

Acknowledgements

Thanks are due to the EU-funded TRACE project for the samples and to T. Woodcock for collecting the spectral data. The information contained in this article reflects the authors' views; the European Commission is not liable for any use of the information contained therein.

References

Alciaturi, C.E., Escobar, M.E., & De La Cruz, C. (1998). A numerical procedure for curve fitting of noisy infrared spectra. Analytica Chimica Acta, 376, 169-181.

Anderson, K.A., & Smith, B.W. (2002). Chemical profiling to differentiate geographic growing origins of coffee. Journal of Agricultural and Food Chemistry, 50, 2068-2075.

Barnes, R.J., Dhanoa, M.S., & Lister, S.J. (1989). Standard Normal Variate transformation and de-trending of near infrared diffuse reflectance spectra. Applied Spectroscopy, 43, 772-777.

Blanco, M., & Pagés, J. (2002). Classification and quantification of finishing oils by near infrared spectroscopy. Analytica Chimica Acta, 463, 295-303.

Casale, M., Casolino, C., Ferrari, G., & Forina, M. (2008). Near infrared spectroscopy and class modelling techniques for the geographical authentication of Ligurian extra virgin olive oil. Journal of Near Infrared Spectroscopy, 16, 39-47.

Consolandi, C., Palmieri, L., Severgnini, M., Maestri, E., Marmiroli, N., Agrimonti, C., Baldoni, L., Donini, P., De Bellis, G., & Castiglioni, B. (2008).
A procedure for olive oil traceability and authenticity: DNA extraction, multiplex PCR and LDR-universal array analysis. European Food Research and Technology, 227, 1429-1438.

Coomans, D., & Broeckaert, I. (1986). Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth, England.

Council Regulation (EC) No 510/2006 of 20 March 2006 on the protection of geographical indications and designations of origin for agricultural products and foodstuffs.

Cozzolino, D., Smyth, H.E., & Gishen, M. (2003). Feasibility study on the use of visible and near infrared spectroscopy together with chemometrics to discriminate between commercial white wines of different varietal origins. Journal of Agricultural and Food Chemistry, 51, 7703-7708.

Derde, M.P., & Massart, D.L. (1986). UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Analytica Chimica Acta, 184, 33-51.

Dimara, E., & Skuras, D. (2003). Consumer evaluations of product certification, geographic association and traceability in Greece. European Journal of Marketing, 37, 690-705.

Forina, M., & Lanteri, S. (1984). Chemometrics: Mathematics and Statistics in Chemistry. In: B.R. Kowalski (Ed.), NATO ASI Series, Ser. C, vol. 138 (pp. 439-466). Reidel Publ. Co., Dordrecht.

Forina, M., Armanino, C., Leardi, R., & Drava, G. (1991). A class-modelling technique based on potential functions. Journal of Chemometrics, 5, 435-453.

Forina, M., Oliveri, P., Lanteri, S., & Casale, M. (2008). Class-modelling techniques, classic and new, for old and new problems. Chemometrics and Intelligent Laboratory Systems, 93, 132-148.

Forina, M., Lanteri, S., Armanino, C., Casolino, C., Casale, M., & Oliveri, P. (2008). V-PARVUS 2008, Dip.
Chimica e Tecnologie Farmaceutiche e Alimentari, University of Genova, available (free, with manual and examples) from authors or at http://www.parvus.unige.it.

Forina, M., Casale, M., & Oliveri, P. (2009). Application of Chemometrics to Food Chemistry. In: S. D. Brown, R. Tauler & B. Walczak (Eds.), Comprehensive Chemometrics, pp. 75–128. Elsevier, Amsterdam.

Galtier, O., Dupuy, N., Le Dréau, Y., Ollivier, D., Pinatel, C., Kister, J., & Artaud, J. (2007). Geographic origins and compositions of virgin olive oils determinated by chemometric analysis of NIR spectra. Analytica Chimica Acta, 595, 136–144.

Hotelling, H. (1947). Multivariate quality control. In: C. Eisenhart, M. W. Hastay & W. A. Wallis (Eds.), Techniques of Statistical Analysis, pp. 111–184. McGraw-Hill, N.Y.

Kowalski, B. R., & Bender, C. F. (1976). An orthogonal feature selection method. Pattern Recognition, 8, 1–4.

Liu, L., Cozzolino, D., Cynkar, W. U., Dambergs, R. G., Janik, L., O’Neill, B. K., Colby, C. B., & Gishen, M. (2008). Preliminary study on the application of visible-near infrared spectroscopy and chemometrics to classify Riesling wines from different countries. Food Chemistry, 106, 781–786.

Luykx, D. M. A. M., & Van Ruth, S. M. (2008). An overview of analytical methods for determining the geographical origin of food products. Food Chemistry, 107, 897–911.

Marini, F., Bucci, R., Magrì, A. L., & Magrì, A. D. (2010). An overview of the chemometric methods for the authentication of the geographical and varietal origin of olive oils. In: V. R. Preedy & R. R. Watson (Eds.), Olives and Olive Oil in Health and Disease Prevention, pp. 569–579. Elsevier, Amsterdam.

Reid, L. M., O’Donnell, C. P., & Downey, G. (2006). Recent technological advances for the determination of food authenticity. Trends in Food Science & Technology, 17, 344–353.

Sinelli, N., Casiraghi, E., Tura, D., & Downey, G. (2008).
Characterisation and classification of Italian virgin olive oils by near- and mid-infrared spectroscopy. Journal of Near Infrared Spectroscopy, 16, 335–342.

Stefanoudaki, E., Kotsifaki, F., & Koutsaftakis, A. (1997). The potential of HPLC triglyceride profiles for the classification of Cretan olive oils. Food Chemistry, 60, 425–432.

Toher, D., Downey, G., & Murphy, T. B. (2007). A comparison of model-based and regression classification techniques applied to near infrared spectroscopic data in food authentication studies. Chemometrics and Intelligent Laboratory Systems, 89, 102–115.

Tomás-Barberán, F. A., Ferreres, F., García-Viguera, C., & Tomás-Lorente, F. (1993). Flavonoids in honey of different geographical origin. Zeitschrift für Lebensmittel-Untersuchung und -Forschung, 196, 38–44.

Vaiphasa, C. (2006). Consideration of smoothing techniques for hyperspectral remote sensing. ISPRS Journal of Photogrammetry & Remote Sensing, 60, 91–99.

Williams, P., & Norris, K. (2001). Near Infrared Technology in the Agricultural and Food Industries (2nd ed.). American Association of Cereal Chemists, St. Paul, Minnesota, USA, 296 pp.

Wold, S., & Sjöström, M. (1977). Chemometrics: Theory and Applications. In: B. R. Kowalski (Ed.), ACS Symposium Series, American Chemical Society, vol. 52 (pp. 243–282).

Woodcock, T., Downey, G., Kelly, J. D., & O’Donnell, C. P. (2007). Geographical Classification of Honey Samples by Near-Infrared Spectroscopy: A Feasibility Study. Journal of Agricultural and Food Chemistry, 55, 9128–9134.

Woodcock, T., Downey, G., & O’Donnell, C. P. (2008). Confirmation of Declared Provenance of European Extra Virgin Olive Oil Samples by NIR Spectroscopy. Journal of Agricultural and Food Chemistry, 56, 11520–11525.

Woodcock, T., Downey, G., & O’Donnell, C. (2009).
Near infrared spectral fingerprinting for confirmation of claimed PDO provenance of honey. Food Chemistry, 114, 742–746.

Fig. 1. Principal component scores (PC1 vs PC2) plots of all samples of (a) olive oil and (b) honey.

Fig. 2. Lines represent SIMCA (a), UNEQ (b) and POTFUN (c) models based on the first two PCs in a simulated example for non-uniformly distributed objects (filled squares: objects belonging to the modelled class; open squares: objects extraneous to it).

Fig. 3. Representation of the results, in terms of sensitivity and specificity on the external validation set, for models of Ligurian olive oil (a) and Corsican honey (b), as reported in Tables 2 and 3 (S: SIMCA models, U: UNEQ models, P: POTFUN models).

Table 1. Sources of olive oil samples

Harvest  Liguria  Italy (non-Liguria)  Greece  Spain  France  Cyprus  Turkey  Total
2005          79                  173      46     38      10       6       0    352
2006          63                  163      25     42       9       0      14    316
2007          68                  116       7     34      20       0       0    245
Total        210                  452      78    114      39       6      14    913

Table 2. Ligurian olive oil: best model performances on the external validation set according to the maximum efficiency criterion

Class-modelling technique  Sensitivity (%)  Specificity (%)  Efficiency (%)
SIMCA                      75.7             87.7             81.5
SIMCA                      82.9             77.1             79.9
SIMCA                      80.0             76.6             78.3
SIMCA                      80.0             77.6             78.8
SIMCA                      71.4             78.9             75.1
SIMCA                      70.0             84.7             77.0
SIMCA                      72.9             84.4             78.4
SIMCA                      71.4             83.5             77.2
SIMCA                      81.4             87.7             84.5
SIMCA                      82.9             78.7             80.8
UNEQ                       85.7             67.0             75.8
UNEQ                       88.6             71.2             79.4
UNEQ                       87.1             71.8             79.1
UNEQ                       87.1             72.3             79.4
UNEQ                       90.0             70.2             79.5
UNEQ                       85.7             62.9             73.4
UNEQ                       87.1             66.6             76.2
UNEQ                       87.1             67.7             76.8
UNEQ                       88.6             70.2             78.8
UNEQ                       90.0             68.9             78.8
POTFUN                     84.2             62.2             72.4
POTFUN                     80.0             86.3             83.1
POTFUN                     80.0             86.2             83.0
POTFUN                     78.6             86.9             82.6
POTFUN                     77.1             82.6             79.8
POTFUN                     75.7             74.4             75.1
POTFUN                     78.6             82.1             80.3
POTFUN                     80.0             81.8             80.9
POTFUN                     72.9             84.9             78.7
POTFUN                     78.6             79.6             79.1

Table 3.
Corsican honey: best model performances on the external validation set according to the maximum efficiency criterion

Class-modelling technique  Sensitivity (%)  Specificity (%)  Efficiency (%)
SIMCA                      70.2             87.0             78.2
SIMCA                      62.2             100.0            78.8
SIMCA                      64.9             100.0            80.5
SIMCA                      64.9             100.0            80.5
SIMCA                      62.2             95.7             77.1
SIMCA                      64.9             95.7             78.8
SIMCA                      62.2             100.0            78.8
SIMCA                      59.5             100.0            77.1
SIMCA                      59.5             95.7             75.4
SIMCA                      67.6             100.0            82.2
UNEQ                       89.2             69.6             78.8
UNEQ                       81.1             82.6             81.8
UNEQ                       78.4             82.6             80.5
UNEQ                       78.4             82.6             80.5
UNEQ                       81.1             78.3             79.7
UNEQ                       94.6             73.9             83.6
UNEQ                       94.6             82.6             88.4
UNEQ                       86.5             73.9             80.0
UNEQ                       89.2             69.6             78.8
UNEQ                       86.5             73.9             80.0
POTFUN                     83.8             73.9             78.7
POTFUN                     83.8             82.6             83.2
POTFUN                     83.8             82.6             83.2
POTFUN                     81.1             87.0             84.0
POTFUN                     75.7             82.6             79.1
POTFUN                     83.8             69.6             76.3
POTFUN                     75.7             87.0             81.1
POTFUN                     73.0             82.6             77.6
POTFUN                     81.1             69.6             75.1
POTFUN                     86.5             78.3             82.3
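
The efficiency values in Tables 2 and 3 are numerically consistent with the geometric mean of sensitivity and specificity. That definition is assumed here, since this section does not state it, and the function name `efficiency` is purely illustrative. Under that assumption, a minimal Python sketch for checking a few rows of Table 3 could be:

```python
import math


def efficiency(sensitivity: float, specificity: float) -> float:
    """Assumed definition: geometric mean of sensitivity and specificity (both in %)."""
    return math.sqrt(sensitivity * specificity)


# (sensitivity %, specificity %, reported efficiency %) for one row per
# technique, copied from Table 3 (Corsican honey).
rows = {
    "SIMCA": (67.6, 100.0, 82.2),
    "UNEQ": (94.6, 82.6, 88.4),
    "POTFUN": (81.1, 87.0, 84.0),
}

for technique, (sens, spec, reported) in rows.items():
    computed = efficiency(sens, spec)
    print(f"{technique}: computed {computed:.1f}, reported {reported:.1f}")
```

Because the geometric mean penalises an imbalance between the two rates, a model with very high sensitivity but modest specificity scores lower than a more balanced model with the same arithmetic mean.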