Wavelet Transform-based Clustering of Spectra in Chemometrics

A. Ukil
J. Bernasconi
H. Braendle
ABB Corporate Research, Segelhofstrasse 1K, Baden-Daettwil, CH-5405, Switzerland
{abhisek.ukil, jakob.bernasconi, hubert.braendle}@ch.abb.com
Keywords: Chemometrics, Spectroscopy, Spectrum, Wavelet transform, Data clustering.
1 Introduction
Spectroscopy is the study of the interaction between radiation (electromagnetic radiation, or light, as well as
particle radiation) and matter. Spectrometry is the measurement of these interactions, and an instrument
that performs such measurements is a spectrometer or spectrograph. A plot of the interaction is referred to
as a spectrum [1]. Such spectra are typically used in quality control for food and agrochemicals, in
pharmaceuticals, in medical diagnostics, and in similar applications. The statistical and mathematical data
analysis techniques applied to process these spectra are grouped under the umbrella term ‘chemometrics’ [2].
In this paper, we propose a novel algorithm that uses the wavelet transform to cluster spectra
originating from different chemical applications and acquired by different spectrometers. The clustering
technique is insensitive to the spectrometer type and the spectra acquisition method, and can be applied to
classify the spectra before more complex chemometrics operations that are expected to benefit from
clustered data.
2 Background Information
2.1 Spectroscopy & Chemometrics
Infrared (IR) or near-infrared (NIR) spectroscopy is a method used to identify a compound or to analyze the
composition of a material by studying the interaction of infrared light with matter. The resulting plot,
called the IR/NIR spectrum, shows the absorption of infrared light at different wavelengths. In IR
spectroscopy the considered frequency range usually lies between 14,000 and 10 cm−1. Note that the
frequency scale is expressed in wavenumbers (measured in reciprocal centimeters) rather than wavelengths
(measured in microns). The absorption of the material at each frequency is measured in percent.
Chemometrics is the application of mathematical or statistical methods to chemical data. It is
used for purposes such as multivariate calibration, signal processing and conditioning, pattern
recognition, and experimental design [2].
2.2 Wavelet Transform
The Wavelet Transform (WT) is a mathematical tool for signal analysis, like the Fourier transform. Wavelet
analysis breaks up a signal into shifted and scaled versions of the original (or mother) wavelet.
The Continuous Wavelet Transform (CWT) is defined as the sum over all time of the signal multiplied by
the scaled and shifted versions of the wavelet function ψ. The CWT of a signal x(t) is defined as
CWT(a, b) = \int_{-\infty}^{\infty} x(t) \, \psi^{*}_{a,b}(t) \, dt ,   (1)
\psi_{a,b}(t) = |a|^{-1/2} \, \psi((t - b)/a) .   (2)
Here ψ(t) is the mother wavelet, the asterisk in (1) denotes the complex conjugate, and
a, b ∈ R, a ≠ 0 (R is the set of real numbers) are the scaling and shifting parameters, respectively.
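As a rough numerical illustration of (1) and (2), the sketch below approximates the CWT integral at a single (a, b) by a Riemann sum on a uniform time grid. The function names `haar_mother` and `cwt_point`, the choice of the Haar wavelet on [0, 1), and the grid itself are assumptions made for this example only; they are not taken from the paper.

```python
import numpy as np

def haar_mother(t):
    """Haar mother wavelet (assumed support [0, 1)): +1 on [0, 0.5), -1 on [0.5, 1)."""
    t = np.asarray(t, dtype=float)
    return np.where((t >= 0.0) & (t < 0.5), 1.0,
                    np.where((t >= 0.5) & (t < 1.0), -1.0, 0.0))

def cwt_point(x, t, a, b, psi=haar_mother):
    """Approximate CWT(a, b) of samples x on the uniform grid t, following (1)-(2)."""
    if a == 0:
        raise ValueError("the scale a must be non-zero")
    dt = t[1] - t[0]                                # uniform grid spacing
    psi_ab = abs(a) ** -0.5 * psi((t - b) / a)      # scaled and shifted wavelet, eq. (2)
    return np.sum(x * np.conj(psi_ab)) * dt         # Riemann sum for the integral in (1)

# Usage: a signal equal to the scaled wavelet itself gives a value close to 1,
# while a constant signal gives a value close to 0.
t = np.linspace(-5.0, 5.0, 100001)
x = abs(2.0) ** -0.5 * haar_mother((t - 1.0) / 2.0)
print(cwt_point(x, t, a=2.0, b=1.0))
print(cwt_point(np.ones_like(t), t, a=2.0, b=1.0))
```

The accuracy of the sum depends on the grid spacing, so this is only meant to show how the scaling and shifting parameters enter the computation.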
The Discrete Wavelet Transform (DWT) is obtained by choosing a = a_0^m, b = n a_0^m b_0, t = kT in (1) and (2),
where T = 1.0 and k, m, n ∈ Z (Z is the set of integers).
DWT(m, n) = a_0^{-m/2} \left( \sum_{k} x[k] \, \psi^{*}[(k - n a_0^{m} b_0) / a_0^{m}] \right) .   (3)
The Multiresolution Signal Decomposition (MSD) [4] technique decomposes a given signal into its detailed
and smoothed versions. Let x[n] be a discrete-time signal. The MSD technique then decomposes the signal, in the
form of WT coefficients at scale 1, into c1[n] and d1[n], where c1[n] is the smoothed version of the original
signal and d1[n] the detailed version.
c_1[n] = \sum_{k} h[k - 2n] \, x[k] ,   (4)
d_1[n] = \sum_{k} g[k - 2n] \, x[k] ,   (5)
where h[n] and g[n] are the associated filter coefficients that decompose x[n] into c1[n] and d1[n]
respectively. The next higher scale decomposition will be based on c1[n]. Thus, the decomposition process
can be iterated, with successive approximations being decomposed in turn, so that the original signal is
broken down into many lower resolution components. This is called the wavelet decomposition tree [4],
shown in Figure 1.
Figure 1 – Multiresolution signal decomposition and wavelet decomposition tree
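The decomposition in (4) and (5) and its iteration on the smoothed branch can be sketched as follows for the Haar wavelet. This is a minimal sketch under stated assumptions: the orthonormal Haar filters h = [1, 1]/√2 and g = [1, −1]/√2 (the sign of g is a convention), repetition of the last sample when a level has odd length, and the helper names `msd_step` and `wavelet_tree` are all choices made for this example and are not specified in the paper.

```python
import numpy as np

H = np.array([1.0, 1.0]) / np.sqrt(2.0)    # assumed Haar low-pass filter h (smoothing)
G = np.array([1.0, -1.0]) / np.sqrt(2.0)   # assumed Haar high-pass filter g (detail)

def msd_step(x):
    """One MSD level: returns (c, d) as in (4) and (5) for the two-tap Haar filters."""
    x = np.asarray(x, dtype=float)
    if len(x) % 2:
        x = np.append(x, x[-1])            # assumption: pad odd lengths with the last sample
    pairs = x.reshape(-1, 2)               # rows are (x[2n], x[2n+1])
    return pairs @ H, pairs @ G

def wavelet_tree(x, scale):
    """Iterate msd_step on the smoothed branch; returns (c_scale, [d_1, ..., d_scale])."""
    c = np.asarray(x, dtype=float)
    details = []
    for _ in range(scale):
        c, d = msd_step(c)
        details.append(d)
    return c, details

# Usage on a toy signal: two levels of decomposition.
c2, details = wavelet_tree([4.0, 2.0, 6.0, 6.0], scale=2)
print(c2, details)
```

With this padding rule, a 220-point input decomposed to scale 4 yields 14 coefficients in each branch at the final scale (220 → 110 → 55, padded to 56 → 28 → 14).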
3 Application of Wavelet Transform & Clustering
3.1 WT on spectra
WT can be effectively applied for processing IR or NIR spectra in chemometrics [5],[6]. Figure 2 shows one
such example: plot (i) shows a typical NIR chemometrics spectrum recorded over
220 wavenumbers, plot (ii) shows the smoothed version (see (4)) of the wavelet decomposition, and plot (iii)
the detailed version (see (5)). A 4-scale decomposition using the Haar [3] mother wavelet has been
performed on the 220-point spectrum, giving 220/2^4 ≈ 14 wavelet coefficients (in plots (ii) and (iii)).
The smoothed coefficients (Figure 2, plot (ii)) resemble the original spectrum while reducing the data
from 220 wavenumbers to 14 wavelet coefficients. The smoothed coefficients can therefore be used
for data reduction, which is necessary in spectra calibration because using all the wavenumbers
(220 in Figure 2, plot (i)) risks overfitting. Besides WT, other popular data
reduction techniques in chemometrics are partial least squares (PLS) [2], [7] and principal component analysis
(PCA) [2], [7]. If we have a spectrum with N wavenumbers and we require m reduced data points (m
and N are integers, m < N) using wavelet decomposition at scale S, then the relationship is given by
m = \mathrm{round}\left( \frac{N}{2^{S}} \right) .   (6)
From (6), knowing N and m, the optimum scale can be determined as
S = \mathrm{round}\left( \frac{\log(N/m)}{\log(2)} \right) ,   (7)
where the round operation ensures that S is an integer. In practice, to avoid under- and
overfitting, m is restricted to between 5 and 20 [5],[6].
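Relations (6) and (7) are easy to check numerically. The short sketch below uses hypothetical helper names (`reduced_points`, `optimum_scale`) and assumes that ties in the round operation are rounded up, a detail the text does not specify.

```python
import math

def round_half_up(value):
    """Round to the nearest integer, with ties rounded up (an assumption for this sketch)."""
    return int(math.floor(value + 0.5))

def reduced_points(n_wavenumbers, scale):
    """Number of reduced data points m for a decomposition at the given scale, eq. (6)."""
    return round_half_up(n_wavenumbers / 2 ** scale)

def optimum_scale(n_wavenumbers, m_target):
    """Scale S needed to reduce N wavenumbers to about m_target points, eq. (7)."""
    return round_half_up(math.log(n_wavenumbers / m_target) / math.log(2))

# Usage with the numbers from the Figure 2 example (N = 220, m = 14).
print(optimum_scale(220, 14))   # 4
print(reduced_points(220, 4))   # 14
```

Because S is rounded, the m actually obtained from (6) can differ from the requested m, so it is worth checking that it still lies in the intended range.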
In Figure 2, the detailed coefficients (plot (iii)) show the changes in the frequency profile. For example,
plot (iii) has three peaks at coefficients 1, 5 and 12, corresponding to wavenumbers 1,
70 and 180 in plot (i), respectively. These peaks mark the changes in the frequency profile of the spectrum at those
points, which arise from changes in the constituent absorptions [4].
3.2 Proposed Clustering Algorithm
The clustering algorithm is depicted by the flowchart in Figure 3. For the
raw spectra of N wavenumbers, we determine the optimum scale using (7), assuming a realistic number of
reduced data points m between 5 and 20. Wavelet decomposition using the Haar [3] mother wavelet up to this
optimum scale gives the smoothed and detailed coefficients. The smoothed coefficients can be used for
calibration. Among the m detailed coefficients, we search for the maximum coefficient.
This maximum coefficient is the key parameter for clustering the spectra: for
different spectra, we monitor their respective maximum detailed wavelet coefficients.
Figure 2 – Application of wavelet transform on chemometrics spectra
Figure 3 – Flowchart of the spectra clustering algorithm
The kind of clustering revealed depends on the application; it may reflect the spectra acquisition method,
different samples, etc., as we shall see in the application results section. One key point in achieving effective
clustering is to use the raw spectra, before any preprocessing is applied. In industrial chemometrics, standard
preprocessing steps such as multiplicative scatter correction (MSC) [2] (described briefly in the following
section) and mean centering [2] are performed on the spectra to minimize the adverse effects of
instrumental variations, changes in recording conditions, etc.
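The steps of the clustering algorithm described in this section (optimum scale, Haar decomposition of the raw spectrum, search for the maximum detailed coefficient) can be sketched as below. This is an illustrative sketch, not the authors' implementation. The assumptions are: orthonormal Haar filters, padding of odd-length levels by repeating the last sample, ties in the round operation rounded up, the maximum taken as the signed maximum (the text does not say whether the magnitude is meant), and a default target of m = 14 borrowed from the Figure 2 example. All function names are hypothetical.

```python
import math
import numpy as np

def optimum_scale(n_points, m_target):
    """Scale S from eq. (7); ties rounded up (assumption)."""
    return int(math.floor(math.log(n_points / m_target) / math.log(2) + 0.5))

def haar_decompose(x, scale):
    """Return (smoothed, detailed) Haar coefficients at the final scale."""
    c = np.asarray(x, dtype=float)
    d = np.array([])
    for _ in range(scale):
        if len(c) % 2:
            c = np.append(c, c[-1])        # assumption: pad odd lengths with the last sample
        pairs = c.reshape(-1, 2)
        d = (pairs[:, 0] - pairs[:, 1]) / math.sqrt(2.0)   # detailed branch
        c = (pairs[:, 0] + pairs[:, 1]) / math.sqrt(2.0)   # smoothed branch
    return c, d

def clustering_parameter(raw_spectrum, m_target=14):
    """Maximum detailed coefficient of a raw (unpreprocessed) spectrum, plus the smoothed coefficients."""
    scale = optimum_scale(len(raw_spectrum), m_target)
    if scale < 1:
        raise ValueError("m_target is too close to the spectrum length")
    smoothed, detailed = haar_decompose(raw_spectrum, scale)
    return float(np.max(detailed)), smoothed

# Usage on a synthetic 220-point curve (not measured data).
demo = np.sin(np.linspace(0.0, 3.0, 220))
parameter, smoothed = clustering_parameter(demo)
print(parameter, len(smoothed))
```

Computing this parameter for every spectrum in a data set and plotting it against the spectrum index gives the kind of monitoring plot the section describes.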
3.3 Multiplicative Scatter Correction (MSC)
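If s_i(k) represents the spectral absorbance of sample i (i = 1, 2, …, n) at wavelength number k
(k = 1, 2, …, N), then
s_i(k) = a_i \bar{s}(k) + b_i ,   (8)
\bar{s}(k) = \frac{1}{n} \sum_{i=1}^{n} s_i(k) .   (9)
For MSC, the coefficients a_i and b_i are obtained by solving the following optimization problem
\arg\min_{a_i, b_i} \sum_{k=1}^{N} [ s_i(k) - a_i \bar{s}(k) - b_i ]^{2} .   (10)
This gives
a_i = \frac{ \langle s_i \bar{s} \rangle - \langle s_i \rangle \langle \bar{s} \rangle }{ \langle \bar{s}^{2} \rangle - \langle \bar{s} \rangle^{2} } ,   (11)
b_i = \frac{ \langle s_i \rangle \langle \bar{s}^{2} \rangle - \langle s_i \bar{s} \rangle \langle \bar{s} \rangle }{ \langle \bar{s}^{2} \rangle - \langle \bar{s} \rangle^{2} } ,   (12)
where \langle \cdot \rangle denotes the average over the N wavelength numbers, for example
\langle s_i \rangle = \frac{1}{N} \sum_{k=1}^{N} s_i(k) .   (13)
Then, we have
s_i^{MSC}(k) = \frac{ s_i(k) - b_i }{ a_i } ,   (14)
where the a_i and b_i are obtained using (11)-(12).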
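Equations (8)-(14) translate directly into a few array operations. The sketch below is one possible reading of them, assuming the spectra are stored as a NumPy array with one row per sample and one column per wavelength number; the function name `msc` and this layout are choices made for the example.

```python
import numpy as np

def msc(spectra):
    """Multiplicative scatter correction following (8)-(14).

    spectra: array of shape (n, N), one row per sample.
    Returns (corrected, a, b), with a and b the per-sample coefficients.
    """
    s = np.asarray(spectra, dtype=float)
    s_bar = s.mean(axis=0)                        # mean spectrum over samples, eq. (9)
    mean_sbar = s_bar.mean()                      # average of the mean spectrum over k
    mean_sbar_sq = (s_bar ** 2).mean()            # average of the squared mean spectrum over k
    mean_si = s.mean(axis=1)                      # per-sample average over k, eq. (13)
    mean_si_sbar = (s * s_bar).mean(axis=1)       # average of s_i(k) * s_bar(k) over k
    denom = mean_sbar_sq - mean_sbar ** 2
    a = (mean_si_sbar - mean_si * mean_sbar) / denom              # eq. (11)
    b = (mean_si * mean_sbar_sq - mean_si_sbar * mean_sbar) / denom  # eq. (12)
    corrected = (s - b[:, None]) / a[:, None]     # eq. (14)
    return corrected, a, b

# Usage on synthetic spectra that differ only by a gain and an offset (not measured data).
base = np.linspace(1.0, 2.0, 50) ** 2
spectra = np.vstack([1.0 * base, 2.0 * base + 0.5, 0.5 * base - 0.2])
corrected, a, b = msc(spectra)
print(np.allclose(corrected, spectra.mean(axis=0)))
```

In this synthetic case every corrected spectrum collapses onto the mean spectrum, which illustrates how MSC removes per-spectrum gain and offset differences.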
4 Application Results
Figure 4 shows the maximum detailed wavelet coefficients of various NIR spectra of chemical
components. Two clusters are clearly visible (boxed and marked A and B in Figure 4). For this particular
case, 487 different spectra were recorded using two methods, ‘singlebeam’ [2] and ‘non-singlebeam’ [2],
and the wavelet-based clusters reflect these two methods.
We used the maximum detailed coefficient as the clustering parameter because, with other detailed
wavelet coefficients, the classification gap between the clusters is reduced. This is shown in Figure 5,
where we plot the first, second and third highest detailed wavelet coefficients for the same chemical spectra.
Figure 6 shows an example of how preprocessing destroys the clusters: after MSC preprocessing
of the spectra, the distinct clustering pattern of Figure 4 is lost.
Figure 4 – Clustering of NIR spectra of chemical components
Figure 5 – 1st, 2nd, 3rd highest coefficients as clustering parameter
Figure 6 – MSC results in unsuccessful clustering
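The results above identify the two clusters visually from the plot of maximum detailed coefficients. If an automatic split is wanted, one simple possibility, offered here as an assumption and not as part of the published method, is to sort the values and cut at the widest gap, which also gives a number for the classification gap discussed for Figure 5. The function name and the sample values are hypothetical.

```python
import numpy as np

def split_at_largest_gap(values):
    """Split 1-D values into two groups at the widest gap between sorted neighbours.

    Returns (labels, threshold, gap); label 1 marks values above the threshold.
    """
    v = np.asarray(values, dtype=float)
    if v.size < 2:
        raise ValueError("need at least two values to split")
    ordered = np.sort(v)
    gaps = np.diff(ordered)                       # distances between sorted neighbours
    i = int(np.argmax(gaps))                      # position of the widest gap
    threshold = 0.5 * (ordered[i] + ordered[i + 1])
    labels = (v > threshold).astype(int)
    return labels, threshold, float(gaps[i])

# Usage with made-up maximum detailed coefficients (illustrative values only).
labels, threshold, gap = split_at_largest_gap([0.11, 0.10, 0.52, 0.12, 0.50, 0.49])
print(labels, threshold, gap)
```

Comparing the returned gap for the first, second and third highest coefficients would be one way to quantify which parameter separates the clusters best.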
5 Conclusion
In this paper, we have presented a novel algorithm based on wavelet decomposition for clustering
chemometrics spectra. The maximum detailed wavelet coefficients of spectra decomposed to an
optimum scale are used as the clustering parameter. The maximum detailed coefficients are chosen because
they reflect the frequency profile changes more strongly than the lower-ranked coefficients. The
clustering step has to be applied before any preprocessing such as multiplicative scatter correction or
baseline correction. The smoothed wavelet coefficients from the other branch of the
wavelet decomposition can be used as a data reduction tool for calibration. It has been
proposed that calibration accuracy improves for clustered spectra compared to all-mixed spectra [4],
[5]. The kind of clusters revealed depends on the particular application.
6 References
[1] Wikipedia resources. Available: http://en.wikipedia.org
[2] R.G. Brereton, Chemometrics: Data Analysis for the Laboratory and Chemical Plant, John Wiley & Sons, Ltd, England, 2003.
[3] I. Daubechies, Ten Lectures on Wavelets, Society for Industrial and Applied Mathematics, Philadelphia, 1992.
[4] S. Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation,” IEEE Trans. Pattern Anal. Mach. Intelligence, vol. 11, no. 7, pp. 674-693, 1989.
[5] F.T. Chau, Y.Z. Liang, J. Gao, X.G. Shao, Chemometrics From Basics to Wavelet Transform, John Wiley, NJ, 2004.
[6] B. Walczak (ed.), Wavelets in Chemistry, Elsevier, Amsterdam, 2000.
[7] K.H. Esbensen, D. Guyot, F. Westad, L.P. Houmoller, Multivariate Data Analysis – In Practice, 5th ed., Camo Process AS, Norway, 2002.