Wavelet Transform-based Clustering of Spectra in Chemometrics

A. Ukil, J. Bernasconi, H. Braendle
ABB Corporate Research, Segelhofstrasse 1K, Baden-Daettwil, CH-5405, Switzerland
{abhisek.ukil, jakob.bernasconi, hubert.braendle}@ch.abb.com

Keywords: Chemometrics, Spectroscopy, Spectrum, Wavelet transform, Data clustering.

1 Introduction

Spectroscopy is the study of the interaction between radiation (electromagnetic radiation, or light, as well as particle radiation) and matter. Spectrometry is the measurement of these interactions, and an instrument that performs such measurements is a spectrometer or spectrograph. A plot of the interaction is referred to as a spectrum [1]. Such spectra are typically used in food and agrochemical quality control, pharmaceuticals, medical diagnostics, etc. Various statistical and mathematical data analysis techniques are applied to process these spectra, grouped under the umbrella term 'chemometrics' [2].

In this paper, we propose a novel algorithm that uses the wavelet transform to cluster spectra originating from different chemical applications and acquired by different spectrometers. The clustering technique, which is insensitive to the spectrometer type and the spectra acquisition method, can be applied to classify the spectra before more complex chemometrics operations that are expected to benefit from clustered data.

2 Background Information

2.1 Spectroscopy & Chemometrics

Infrared (IR) or near-infrared (NIR) spectroscopy is a method used to identify a compound or to analyze the composition of a material. This is done by studying the interaction of infrared light with matter. The resulting plot, called the IR/NIR spectrum, shows the absorption of the infrared light at different wavelengths. In IR spectroscopy the considered frequency usually lies between 14,000 and 10 cm$^{-1}$. Note that the frequency scale is expressed in wavenumbers (measured in reciprocal centimeters) rather than wavelengths (measured in microns).
On the other hand, the absorption of the material at the different frequencies is measured in percent.

Chemometrics is the application of mathematical or statistical methods to chemical data. It is used for various purposes such as multivariate calibration, signal processing and conditioning, pattern recognition, experimental design, and so forth [2].

2.2 Wavelet Transform

The Wavelet Transform (WT) is a mathematical tool for signal analysis, like the Fourier transform. Wavelet analysis breaks up a signal into shifted and scaled versions of the original (or mother) wavelet. The Continuous Wavelet Transform (CWT) is defined as the sum over all time of the signal multiplied by the scaled and shifted versions of the wavelet function $\psi$. The CWT of a signal $x(t)$ is defined as

$$CWT(a,b) = \int_{-\infty}^{\infty} x(t)\,\psi^{*}_{a,b}(t)\,dt, \tag{1}$$

$$\psi_{a,b}(t) = |a|^{-1/2}\,\psi\big((t-b)/a\big). \tag{2}$$

Here $\psi(t)$ is the mother wavelet, the asterisk in (1) denotes the complex conjugate, and $a, b \in R$, $a \neq 0$ ($R$ is the set of real numbers) are the scaling and shifting parameters, respectively. The Discrete Wavelet Transform (DWT) is obtained by choosing $a = a_0^{m}$, $b = n a_0^{m} b_0$, $t = kT$ in (1) and (2), where $T = 1.0$ and $k, m, n \in Z$ ($Z$ is the set of integers):

$$DWT(m,n) = a_0^{-m/2} \sum_{k} x[k]\,\psi^{*}\big[(k - n a_0^{m} b_0)/a_0^{m}\big]. \tag{3}$$

The Multiresolution Signal Decomposition (MSD) [4] technique decomposes a given signal into its detailed and smoothed versions. Let $x[n]$ be a discrete-time signal. The MSD technique decomposes the signal in the form of WT coefficients at scale 1 into $c_1[n]$ and $d_1[n]$, where $c_1[n]$ is the smoothed version of the original signal and $d_1[n]$ the detailed version:

$$c_1[n] = \sum_{k} h[k-2n]\,x[k], \tag{4}$$

$$d_1[n] = \sum_{k} g[k-2n]\,x[k], \tag{5}$$

where $h[n]$ and $g[n]$ are the associated filter coefficients that decompose $x[n]$ into $c_1[n]$ and $d_1[n]$, respectively. The next higher scale decomposition is based on $c_1[n]$.
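As an illustration of (4) and (5), the following minimal sketch performs one decomposition level with the Haar filters. The function name `haar_step`, the sign convention chosen for g, and the restriction to even-length input are assumptions made for this example and are not taken from the paper.

```python
import math

# Haar analysis filters. The sign convention for G is an assumption;
# some texts use the opposite sign for the high-pass filter.
H = (1 / math.sqrt(2), 1 / math.sqrt(2))   # low-pass  h[0], h[1]
G = (1 / math.sqrt(2), -1 / math.sqrt(2))  # high-pass g[0], g[1]


def haar_step(x):
    """One MSD level: return (c1, d1) for an even-length sequence x.

    c1[n] = sum_k h[k - 2n] x[k]  (smoothed version, eq. (4))
    d1[n] = sum_k g[k - 2n] x[k]  (detailed version, eq. (5))
    For Haar, only k = 2n and k = 2n + 1 contribute to each sum.
    """
    if len(x) % 2:
        raise ValueError("this sketch expects an even number of samples")
    half = len(x) // 2
    c1 = [H[0] * x[2 * n] + H[1] * x[2 * n + 1] for n in range(half)]
    d1 = [G[0] * x[2 * n] + G[1] * x[2 * n + 1] for n in range(half)]
    return c1, d1


c1, d1 = haar_step([4.0, 2.0, 5.0, 5.0])
```

With these orthonormal filters the signal energy is preserved: the sum of squares of `c1` and `d1` equals that of the input.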
Thus, the decomposition process can be iterated, with successive approximations being decomposed in turn, so that the original signal is broken down into many lower resolution components. This is called the wavelet decomposition tree [4], shown in Figure 1.

Figure 1 – Multiresolution signal decomposition and wavelet decomposition tree

3 Application of Wavelet Transform & Clustering

3.1 WT on spectra

WT can be effectively applied for processing IR or NIR spectra in chemometrics [5],[6]. Figure 2 shows one such example. In Figure 2, plot (i) shows a typical NIR chemometrics spectrum recorded over 220 wavenumbers; plot (ii) shows the smoothed version (see (4)) of the wavelet decomposition, and plot (iii) the detailed version (see (5)). A 4-scale decomposition using the Haar [3] mother wavelet has been performed on the 220-point spectrum, giving $220/2^{4} \approx 14$ wavelet coefficients (in plots ii and iii). The smoothed coefficients (Figure 2, plot ii) resemble the original spectrum while reducing the data from 220 wavenumbers to 14 wavelet coefficients. Therefore, the smoothed coefficients could be used for data reduction, which is necessary in spectra calibration because not all the wavenumbers (220 in Figure 2, plot i) can be used without risking overfitting. Besides WT, other popular data reduction techniques in chemometrics are partial least squares (PLS) [2], [7] and principal component analysis (PCA) [2], [7].

If we have a spectrum with $N$ wavenumbers and we require $m$ reduced data points ($m$ and $N$ are integers, $m < N$) using wavelet decomposition at scale $S$, then the relationship is given by

$$m = \mathrm{round}\left(\frac{N}{2^{S}}\right). \tag{6}$$

From (6), knowing $N$ and $m$, the optimum scale can be determined as

$$S = \mathrm{round}\left(\frac{\log(N/m)}{\log 2}\right), \tag{7}$$

where $S$ must be an integer (hence the round operation). In practice, to avoid under- and overfitting, $m$ is restricted to between 5 and 20 [5],[6].
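Relations (6) and (7) can be checked numerically with a short sketch. The function names `optimum_scale` and `reduced_points` are hypothetical, and Python's built-in `round` is assumed to be an acceptable stand-in for the round operation (it rounds exact halves to the nearest even integer).

```python
import math


def optimum_scale(n_points, m_target):
    """Scale S from eq. (7): S = round(log(N / m) / log(2))."""
    return round(math.log(n_points / m_target) / math.log(2))


def reduced_points(n_points, scale):
    """Number of coefficients m from eq. (6): m = round(N / 2**S)."""
    return round(n_points / 2 ** scale)


# Example from Figure 2: N = 220 wavenumbers, about 14 coefficients wanted.
S = optimum_scale(220, 14)
m = reduced_points(220, S)
```

For the 220-point spectrum of Figure 2 this gives S = 4 and m = 14, in agreement with the 4-scale decomposition described above.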
In Figure 2, the detailed coefficients (plot iii) show the changes in the frequency profile. For example, plot (iii) shows three peaks at coefficients 1, 5 and 12, corresponding to the wavenumbers 1, 70 and 180 in plot (i), respectively. These peaks mark changes in the frequency profile of the spectrum at those points, which occur due to changes in the constituent absorptions [4].

Figure 2 – Application of wavelet transform on chemometrics spectra

3.2 Proposed Clustering Algorithm

The clustering algorithm is depicted by the flowchart in Figure 3. For raw spectra of N wavenumbers, we determine the optimum scale using (7), assuming a realistic number of reduced data points m between 5 and 20. Wavelet decomposition using the Haar [3] mother wavelet up to this optimum scale gives the smoothed and detailed coefficients. The smoothed coefficients could be used for calibration. On the detailed coefficients, we search for the maximum coefficient (among the m coefficients). The maximum coefficient is the key parameter for clustering the spectra. That is, for different spectra we monitor their respective maximum detailed wavelet coefficients.

Figure 3 – Flowchart of the spectra clustering algorithm

The kind of clustering revealed depends on the application and may reflect, for example, the spectra acquisition method or different samples, as we shall see in the application results section. One key point in achieving effective clustering is to use the raw spectra before applying any preprocessing. In industrial chemometrics, standard preprocessing steps like multiplicative scatter correction (MSC) [2] (described briefly in the following section), mean centering [2], etc. are performed on the spectra to minimize the adverse effects of instrumental variations, changes in recording conditions, etc.
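The steps of the flowchart in Figure 3 can be sketched as follows. This is a minimal illustration under several assumptions that are not stated in the paper: odd-length sequences are padded by repeating the last sample, the high-pass filter uses the sign convention (x[2n] - x[2n+1]) / sqrt(2), the signed (not absolute) maximum is taken, and the hypothetical helper `split_by_largest_gap` separates two groups at the largest gap between sorted feature values. The paper itself identifies the clusters by monitoring the plotted maximum coefficients.

```python
import math


def haar_decompose(x, scale):
    """Return (smoothed, detailed) Haar coefficients at the given scale."""
    r = 1 / math.sqrt(2)
    c, d = list(x), []
    for _ in range(scale):
        if len(c) % 2:            # assumption: pad odd lengths with last sample
            c.append(c[-1])
        pairs = range(0, len(c), 2)
        d = [(c[i] - c[i + 1]) * r for i in pairs]   # detailed version
        c = [(c[i] + c[i + 1]) * r for i in pairs]   # smoothed version
    return c, d


def max_detail_coefficient(spectrum, m_target=14):
    """Clustering parameter: maximum detailed coefficient at scale (7)."""
    scale = round(math.log(len(spectrum) / m_target) / math.log(2))
    _, detail = haar_decompose(spectrum, scale)
    return max(detail)


def split_by_largest_gap(values):
    """Hypothetical two-group split at the largest gap in sorted values."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    cut = gaps.index(max(gaps))
    threshold = (ordered[cut] + ordered[cut + 1]) / 2
    return [0 if v < threshold else 1 for v in values]


# Synthetic raw "spectra": a falling step whose height differs per group.
heights = [1.0, 1.1, 0.9, 3.0, 3.2]
spectra = [[h] * 100 + [0.0] * 120 for h in heights]
features = [max_detail_coefficient(s) for s in spectra]
labels = split_by_largest_gap(features)
```

The synthetic step spectra only serve to exercise the code; they are not measured data and say nothing about the separation achievable on real spectra.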
3.3 Multiplicative Scatter Correction (MSC)

If $s_i(k)$ represents the spectral absorbance of sample $i$ ($i = 1, 2, \ldots, n$) at wavelength number $k$ ($k = 1, 2, \ldots, N$), then

$$s_i(k) = a_i\,\bar{s}(k) + b_i, \tag{8}$$

$$\bar{s}(k) = \frac{1}{n}\sum_{i=1}^{n} s_i(k). \tag{9}$$

For MSC, the coefficients $a_i$ and $b_i$ are obtained by solving the following optimization problem:

$$\arg\min_{a_i,\,b_i} \sum_{k=1}^{N} \big[s_i(k) - a_i\,\bar{s}(k) - b_i\big]^{2}. \tag{10}$$

This gives

$$a_i = \frac{\langle s_i \bar{s}\rangle - \langle s_i\rangle\langle \bar{s}\rangle}{\langle \bar{s}^{2}\rangle - \langle \bar{s}\rangle^{2}}, \tag{11}$$

$$b_i = \frac{\langle s_i\rangle\langle \bar{s}^{2}\rangle - \langle s_i \bar{s}\rangle\langle \bar{s}\rangle}{\langle \bar{s}^{2}\rangle - \langle \bar{s}\rangle^{2}}, \tag{12}$$

where $\langle\cdot\rangle$ denotes the average over the $N$ wavelength numbers, e.g.

$$\langle s_i\rangle = \frac{1}{N}\sum_{k=1}^{N} s_i(k). \tag{13}$$

Then, we have

$$s_i^{MSC}(k) = \frac{s_i(k) - b_i}{a_i}, \tag{14}$$

where the $a_i$ and $b_i$ are obtained using (11) and (12).

4 Application Results

Figure 4 shows the monitoring of the maximum detailed wavelet coefficients of various NIR spectra of chemical components. Two clusters are clearly visible (boxed and marked A and B in Figure 4). For this particular case, 487 different spectra were recorded using two methods, 'singlebeam' [2] and 'non-singlebeam' [2], which are reflected by the wavelet-based clusters.

Figure 4 – Clustering of NIR spectra of chemical components

We used the maximum detailed coefficient as the clustering parameter because, if other detailed wavelet coefficients are used, the classification gap between the clusters is reduced. This is shown in Figure 5, where we plot the first, second and third highest detailed wavelet coefficients for the same chemical spectra.

Figure 5 – 1st, 2nd, 3rd highest coefficients as clustering parameter

Figure 6 shows an example of how preprocessing destroys the clusters. As an effect of the MSC preprocessing on the spectra, the distinct clustering pattern of Figure 4 is lost.

Figure 6 – MSC results in unsuccessful clustering

5 Conclusion

In this paper, we have presented a novel algorithm based on wavelet decomposition for clustering chemometrics spectra. The maximum detailed wavelet coefficients of spectra decomposed to an optimum scale are used as the clustering parameter.
The maximum detailed coefficients are chosen because they reflect the changes in the frequency profile more strongly than the lower-ranked coefficients. The clustering step has to be applied before any preprocessing like multiplicative scatter correction, baseline correction, etc. The smoothed wavelet coefficients from the other branch of the wavelet decomposition can instead be used as a data reduction tool for calibration. It is suggested, however, that calibration accuracy improves for clustered spectra compared to all-mixed spectra [4], [5]. The kind of clusters revealed varies depending on the particular application.

6 References

[1] Wikipedia resources. Available: http://en.wikipedia.org
[2] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, John Wiley & Sons, Ltd, England, 2003.
[3] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, Philadelphia, 1992.
[4] S. Mallat, "A Theory for Multiresolution Signal Decomposition: The Wavelet Representation," IEEE Trans. Pattern Anal. Mach. Intelligence, vol. 11, no. 7, pp. 674-693, 1989.
[5] F.T. Chau, Y.Z. Liang, J. Gao, X.G. Shao, Chemometrics: From Basics to Wavelet Transform, John Wiley, NJ, 2004.
[6] B. Walczak (ed.), Wavelets in Chemistry, Elsevier, Amsterdam, 2000.
[7] K.H. Esbensen, D. Guyot, F. Westad, L.P. Houmoller, Multivariate Data Analysis – In Practice, 5th ed., Camo Process AS, Norway, 2002.