Study on Image Pattern Selection via Support Vector Machine for Improving Chinese Herb GC×GC Data Classification and Clustering Performance

Wu Zhili (Vincent@comp.hkbu.edu.hk), Department of Computer Science, Hong Kong Baptist University, and other authors ….

Abstract

Two-dimensional gas chromatography (2D-GC) is a highly powerful technique for the analysis of complex mixtures. However, although the highly informative 2D-GC intensity image is easily visualized by experts for manual interpretation, it imposes great complexity and difficulty on computational approaches that aim to process 2D-GC data precisely and automatically, in contrast to the already mature signal processing methods for 1D-GC data. Complemented by techniques used in image pre- and post-processing, this paper proposes a support vector machine (SVM) method for pattern selection from 2D-GC images. Experiments on Chinese herb data classification and clustering show the improvement obtained by adopting SVM feature selection.

Keywords: Chinese herb, 2D-GC, SVM, Feature Selection, Image Analysis, Classification, Clustering.

1 Introduction

1.1 The significance and importance of Chinese herb data analysis
….

1.2 The superiority of 2D-GC compared with 1D-GC, and its suitability for Chinese herb analysis
….

1.3 The difficulties of analyzing 2D-GC data: computational complexity and the intractability of pattern recognition
….

As described in the introduction to 2D-GC above, the captured data are saved in matrix form, with each column holding the intensities sampled within the retention time of the second column of the 2D-GC device, and the row length corresponding to the total duration of the experiment. Analyzing such large data matrices therefore carries a heavy computational overhead. One way to address this complexity is to reduce the matrix dimension, retaining only the significant and distinctive patterns in the image as meaningful features. For example, the ANOVA method has been adopted [Ref…] for feature selection from the 2D-GC data matrix: it uses a small subset of samples from each class and retains the matrix entries that have large inter-class variance and small intra-class variance.

This paper presents the linear SVM for feature selection. The linear SVM, as a linear classifier for each pair of classes, assigns a weight to each matrix entry. The weights are separable between classes (e.g. by opposite sign, or by an obvious threshold), and the absolute weights signify the importance of the corresponding entries. The entries with the largest absolute weights are therefore retained as meaningful features. For a dataset with more than two classes, the feature selection operates pair by pair, and the features selected by the multiple runs are unified into one combined feature set.

The 2D-GC data are analyzed as images, on which the patterns are regarded as stable and unique characteristics of a certain herb species or chemical component. However, the properties of these patterns, such as their areas, intensities and positions, always show some variation. The comparison algorithms are therefore required to be tolerant to such variation, while not losing sensitivity when analyzing patterns from two different species. For this reason, some image processing techniques are adopted for more accurate pattern extraction and matching.
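To make the pair-by-pair selection just described concrete, the following is a minimal sketch written with scikit-learn. It is an illustration under stated assumptions rather than the implementation used in this study: the arrays X and y, the choice of LinearSVC, and the fixed number of retained entries per class pair (top_k) are all assumptions introduced here.

```python
# Sketch only: rank matrix entries by the absolute linear-SVM weights for every
# pair of classes, then unify the per-pair selections into one feature set.
# Assumptions: X is an (n_samples, n_features) array of flattened 2D-GC matrices,
# y contains integer class labels, and top_k is an illustrative cut-off.
from itertools import combinations
import numpy as np
from sklearn.svm import LinearSVC

def pairwise_svm_feature_union(X, y, top_k=100, C=1.0):
    selected = set()
    for a, b in combinations(np.unique(y), 2):
        mask = np.isin(y, [a, b])
        clf = LinearSVC(C=C).fit(X[mask], y[mask])
        w = np.abs(clf.coef_).ravel()            # importance of each matrix entry
        selected.update(np.argsort(w)[-top_k:])  # entries with the largest |w|
    return np.array(sorted(selected))            # combined feature index set
```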
1.4 Why machine learning approaches such as SVM and the algorithm family of classification and clustering can help Chinese herb 2D-GC image data analysis
….

Machine learning is the study of computer algorithms that improve automatically through experience [1]. As remarked in [2], machine learning will be ….

2. The Feature Selection Methodology: Linear Support Vector Machine

The formulation of the SVM is introduced here ….

3. SVM Applied to Pattern Extraction from Chinese Herb 2D-GC Images

Part 1. Input data

[Please give the full names of element A and element B.]

    B : A (%)      Sample IDs
    0 : 100        1 2 3 4 5
    30 : 70        6 7 8 9 10
    50 : 50        11 12 13 14 15

Table 1. Composition of the 15 samples, given as the percentage of B against A (following the B:A notation used in Table 3).

Data format: a 400 × 510 matrix is formed for each sample observation. During each time segment of 4 seconds, 400 readings are sampled from column 2 of the GC device at a rate of 1 FID reading per 0.01 s, and a complete run of the experiment lasts 34 minutes ((510 × 4 s)/60 = 34). The FID intensities range from 21 to nearly 4500. In the following analysis, however, we discard the readings obtained in the first 8 minutes, while the compounds are still passing through the GC device, because of the severe noise produced when booting the machine. We therefore handle 15 data matrices of size 400 × 390 in total.

Part 2. 2D GC has a larger information capacity than 1D GC

It is claimed that the 1D GC signal can be obtained by accumulating the FID signal strengths in each column of the 2D GC data matrix (Fig. 1). If that is true, it follows directly that the 2D GC data have a larger information capacity than the corresponding 1D GC data (a 400 × 390 matrix vs. a 1 × 390 vector) (Fig. 2). It may be argued that such a reconstruction does not compare 1D GC with 2D GC under the same conditions (e.g. the 2D data are sampled at a frequency of 100 Hz = 1/(0.01 s), whereas the reconstructed 1D data correspond to 0.25 Hz = 1/(4 s)); however, our other experiments show that the reconstruction is credible. A set of 1D experiments was conducted at the high frequency (f = 100 Hz), also lasting 34 minutes. The FID readings were then sequentially folded into segments of length 400 and thereby transferred into data matrices of size 400 × 510. Shown as images (Fig. 3), they are nearly identical to those obtained by simply reconstructing the 1D signal from the 2D GC data and then replicating and tiling the mean readings into a matrix.

Fig. 1. (a) Image of a sample GC×GC chromatogram. (b) Reconstructed first-column chromatogram of the same GC×GC chromatogram.

Fig. 2. Extending the 1D GC data into the same matrix form, where each column can be supposed to be filled with the flat mean reading obtained from column 2 of the GC device. It is obvious that the 2D GC×GC signals are more distinct, owing to the varying strengths along each column.

Fig. 3. 1D GC data obtained by increasing the sampling frequency to f = 100 Hz.

Part 3. Further data preprocessing

3.1 Gaussian filtering with window size 3 × 7

It is generally believed that the same characteristic patterns can be observed in the images of two 2D GC experiments on the same compounds. Now assume a significant pattern centered at (x, y) is observed in the image, where x is the row-wise pixel position and y is the column-wise pixel position (which can also be read as time when referring to the 2D GC experiment). From experimental knowledge, such a pattern should not appear as a single pulse at an isolated position, but should be observed over an interval in both x and y (in the image representation, a rectangular box of some width and height). Thus, when comparing two 2D GC images, we should not simply note the difference in FID signal strength between each pair of pixels at the same position; rather, we should consider the pattern difference over nearly the same region of the two images. A simple way to enforce the influence of the neighboring pixels on a center pixel is to apply a local filter with a small window. Among the huge set of filters available in image analysis, the Gaussian smoothing filter is popularly used. In our GC image analysis we select a Gaussian filter with window size 3 × 7, which fits the requirement that the column-wise correlation of the 2D GC data be more accurately captured.
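As an illustration of this preprocessing step, the 3 × 7 Gaussian smoothing of one 2D GC matrix could be written as in the following sketch. The kernel construction, the standard deviations along the two axes, and the boundary mode are assumptions made for the example; the draft does not specify them.

```python
# Sketch only: smooth a 2D-GC intensity matrix with a 3 x 7 Gaussian window so that
# each pixel reflects its neighborhood rather than a single isolated reading.
# Assumptions: "image" is a 400 x 390 array; the standard deviations 0.5 (rows)
# and 1.5 (columns) are illustrative, not values taken from the paper.
import numpy as np
from scipy.ndimage import convolve
from scipy.signal.windows import gaussian

def gaussian_kernel(rows=3, cols=7, row_std=0.5, col_std=1.5):
    k = np.outer(gaussian(rows, row_std), gaussian(cols, col_std))
    return k / k.sum()                       # normalized so the overall intensity scale is preserved

def smooth(image):
    return convolve(image, gaussian_kernel(), mode="nearest")
```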
Part 4. Feature Selection Using the Linear Support Vector Classification Machine

4.1 Feature selection for 2D GC data using the linear SVM

It is necessary to reduce each 2D GC observation from a huge matrix to a more economical size by discarding insignificant values and keeping only the important features. This not only greatly reduces the computational burden placed on the classification and clustering algorithms used later; more importantly, it also sketches out the featured patterns in the 2D GC data for the chemist's inspection or further chemical analysis. A machine learning approach to feature selection, utilizing the state-of-the-art support vector machine, has recently been proposed and is adopted here.

Following the general setting, we reshape each 2D GC matrix into a one-dimensional vector by sequentially tiling its columns, and stack the vectors of all samples together to form an N × d data matrix, where N = 15 and d = 400 × 390 = 156,000. Since the support vector machine is a classification method, some training samples are used to guide the feature selection procedure. For instance, four samples (the 1st, 2nd, 11th and 12th) are taken as training samples, where the 1st and 2nd samples belong to the same class (purely composed of A) and the remaining two (the 11th and 12th) are grouped into the opposite class (contaminated by some B). Denote the four data vectors by x_i (i = 1, 2, 11, 12). The linear support vector machine constructs a separation function f(x) = w·x + b such that w·x_1 + b > 0, w·x_2 + b > 0, w·x_11 + b < 0 and w·x_12 + b < 0, subject to some constraints on w and b. After a systematic solving procedure, an explicit solution for w is obtained from the linear support vector machine. The vector w, which has the same dimension as x, expresses the importance of each dimension of x_i through the corresponding entry of w.

    B : A (%)      Sample IDs          Classification result
    0 : 100        1 2 3 4 5           1 1 1 1 1
    30 : 70        6 7 8 9 10          2 2 2 2 2
    50 : 50        11 12 13 14 15      2 2 2 2 2

Table 2. Classification of the 15 samples by the linear SVM, using parameters C = 1, tolerance = 0.001 and cache size = 100 MB. The training error is 0 and the testing error is also 0.

Constructing the linear SVM here mainly aims at feature selection, although its classification performance is already encouraging, as shown in Table 2. After training the linear SVM, the w obtained is illustrated in the following figures:

Fig. 4. Each FID signal (156,000 in total) is associated with a weight value.

Fig. 5. Reducing the number of features does not hurt the classification accuracy much.

Fig. 6. Fractional area of the selected features along the threshold value.
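A minimal sketch of the procedure in Section 4.1, written with NumPy and scikit-learn, is given below. The list name "matrices", the use of LinearSVC, and the 1% retention ratio are assumptions for illustration; the SVM parameters mirror those reported in Table 2.

```python
# Sketch only: reshape the 15 observation matrices, train a linear SVM on samples
# 1, 2 (pure A) vs. 11, 12 (50% B), and keep the entries with the largest |w|.
# Assumption: "matrices" is a list of 15 arrays of shape (400, 390).
import numpy as np
from sklearn.svm import LinearSVC

X = np.stack([m.reshape(-1, order="F") for m in matrices])  # column-wise tiling -> (15, 156000)
train_idx = [0, 1, 10, 11]                                  # samples 1, 2, 11, 12 (0-based)
y_train = np.array([+1, +1, -1, -1])                        # pure A vs. contaminated by B

clf = LinearSVC(C=1.0, tol=1e-3).fit(X[train_idx], y_train)
w = clf.coef_.ravel()                                       # one weight per FID entry

keep = np.argsort(np.abs(w))[-int(0.01 * w.size):]          # retain the top 1% of entries
X_reduced = X[:, keep]                                      # feature-selected vectors for all 15 samples
```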
4.2 Classification accuracy comparison between 2D and 1D data

A comparison has been carried out to validate that the 2D GC data have a larger information capacity than the reconstructed 1D GC data: the linear SVM is used to separate more classes of experimental data, and the classification accuracies are compared.

We produce 5 classes of compounds with various percentages of element B. In particular, the percentages of B are 0%, 10%, 20%, 30% and 40%, and each specific blend has been fed into the GC device 5 times to obtain a set of 5 replicated 2D measurements, from which the same number of reconstructed 1D data vectors is then obtained. This is in fact a multi-class classification task, and we report the separation rate per pair of classes using a training sample rate of 0.4 (2 training samples per class):

                 B:A=0:100   B:A=10:90   B:A=20:80   B:A=30:70   B:A=40:60
    B:A=0:100       -          0.77        1.00        1.00        1.00
    B:A=10:90      0.78         -          0.83        0.96        0.96
    B:A=20:80      1.00        0.92         -          0.75        0.99
    B:A=30:70      1.00        1.00        0.86         -          0.81
    B:A=40:60      1.00        1.00        0.93        0.93         -

Table 3. The overall (training plus testing) classification accuracies for 1D and 2D GC data using the linear SVM under the parameter settings C = 1, tolerance = 0.001, cache size = 100 MB. Each cell shows the accuracy of separating the sample type named in the column title from the type named in the row title. The upper triangular part shows the results for the 1D GC data, and the lower triangular part those for the 2D data. The better accuracy of each diagonally symmetric pair of values is highlighted. The results are averaged over 10 repeated experiments with different training samples.

From Table 3 we can see that most results for classifying the 2D data are better than those for classifying the 1D data. The only exception occurs when separating 40%-B from 20%-B (0.99 for the 1D data vs. 0.93 for the 2D data), but this accuracy difference is not large enough to disprove the superiority of using 2D data, taking into account the device noise and the limited number of samples used.

Part 5. Further optimizing the 2D GC features by image processing methods

The features selected by the linear support vector machine can effectively distinguish the samples without B mixed in from those contaminated by B. For example, keeping only 1 percent of the features still perfectly separates the two classes of data (Figs. 5 and 6). Although the classification results, as shown above, are insensitive to the number of features, deciding what percentage of features to keep is still critical from the viewpoint of chemistry domain experts, because it may be too dangerous to represent a sample that is originally high-dimensional by an extremely small set of features. We therefore have to determine an optimal threshold on the w obtained from the linear SVM, such that the number of retained features is not formidably large while the sample representation is not vulnerably oversimplified. If plenty of training samples were available, classical methods such as cross validation could guide the choice of this threshold; but in the chemometrics field, and in particular for our 2D GC experiments, obtaining more samples is very time-consuming and labor-demanding.

The w itself is a long vector with a one-to-one correspondence to the entries of the 2D GC data vectors. Recalling that many unsupervised feature selection methods also achieve good results by considering only the pixel intensities and the spatial layout of patterns on each single 2D image, we can apply the same methodology to the selection of w by transforming the w vector back into a 2D image of the same size as each 2D GC image. To extract more reasonable features from the images without too much supervision, we can employ image processing techniques for contour/boundary detection. We adopt a set of threshold values (twenty levels in total) to locate the contours in the image formed by w. Those threshold values are automatically selected for each sample image in an unsupervised manner. The following shows a set of important areas found in the image of w.

Figure 7. A set of important areas found in the image of w.
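To illustrate how such regions can be located from w without supervision, the following sketch reshapes the weight vector back into an image and marks candidate areas by multi-level thresholding with connected-component labeling, used here as a simple stand-in for contour detection. The variable w, the evenly spaced levels, and the use of SciPy's labeling routine are assumptions; the draft does not state how the twenty levels are chosen.

```python
# Sketch only: view the SVM weight vector as a 400 x 390 image and collect the
# connected regions that survive each of twenty threshold levels.
# Assumptions: "w" is the weight vector from the linear SVM (length 156,000);
# evenly spaced levels stand in for the automatic level selection described above.
import numpy as np
from scipy import ndimage

w_img = np.abs(w).reshape(400, 390, order="F")               # back to the 2D-GC image layout
levels = np.linspace(w_img.min(), w_img.max(), 22)[1:-1]     # twenty interior threshold levels

regions = []
for t in levels:
    labeled, n = ndimage.label(w_img >= t)                   # connected areas above this level
    regions.append(ndimage.find_objects(labeled))            # bounding slices of those areas
```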
Part 6. Clustering on the whole dataset using the new feature vectors

6.1 PCA analyses showing the improvement due to feature selection

To verify the effectiveness of the support vector machine feature selection approach, we report clustering results on the feature-reduced 2D GC data and compare them with results on the raw 2D GC data. The clustering algorithm used is PCA combined with K-means: the input data are first represented by a subset of principal components obtained through PCA, and then clustered into several groups by the K-means algorithm. For high-dimensional input vectors, such as the raw 2D GC vectors of length 156,000, PCA with the K-L transform trick (identical to kernel PCA with a linear kernel) is used to avoid operating directly on the covariance matrix, whose size is proportional to the square of the input dimension. Even for the lower-dimensional data, linear-kernel PCA is performed in addition to conventional PCA, and the better result is reported under the same (PCA + K-means) category in the following table.

    Training samples   Raw data:     Raw data:   Raw data:       Feature selection (a):   Feature selection (b):
                       Linear SVM    K-means     PCA + K-means   PCA + K-means            PCA + K-means
    2                  0.9356        0.8956      0.8874          0.9244                   0.8533
    4                  0.9711        -           -               0.9156                   0.9289
    6                  0.9911        -           -               0.9022                   0.9244

Table 4. Validation of the feature selection methods for 2D GC data. The K-means and PCA + K-means baselines on the raw data are unsupervised and do not use the training labels, so a single accuracy is reported for them.

As shown in Table 4, the linear SVM, as a supervised classification algorithm, always achieves the best results under the different numbers of training samples, even though it operates on the raw data, while the K-means clustering, as an unsupervised method, with or without PCA, does not achieve good results on the raw 2D data. After feature selection, the clustering algorithms obtain higher accuracies with smaller computational complexity. One can also observe that feature selection scheme (b) is positively correlated with the number of training samples used in the linear SVM stage: although scheme (b) performs worse when the training rate is small, it rises to a higher precision than scheme (a) as the training rate increases. This is also verifiable from Figures 8 and 9: the summed variance percentage of the first three principal components in Figure 9 reaches 80%, which is much higher than the total variance percentage of the first three principal components in Figure 8.

Figure 8. PCA and K-means on the data using feature selection scheme (a).

Figure 9. PCA and K-means on the data using feature selection scheme (b).

4. Conclusion

5. References

[1] T. Mitchell, Machine Learning, McGraw-Hill, 1997.