
Study on Image Pattern Selection via Support Vector
Machine for Improving Chinese Herb GC×GC Data
Classification and Clustering Performance
Wu Zhili
Vincent@comp.hkbu.edu.hk
Department of Computer Science, Hong Kong Baptist University (HKBU)
And other authors ….
Abstract
Two-dimensional gas chromatography (2D-GC) is a highly powerful technique for the analysis of
complex mixtures. However, although the information-rich 2D-GC intensity image is easily
visualized by experts for manual interpretation, it imposes great complexity and difficulty on
computational approaches that aim to process 2D-GC data precisely and automatically, compared
with the already mature signal processing methods for 1D-GC data.
Complemented by techniques used in the pre- and post-processing stages of image analysis, this
paper proposes a support vector machine (SVM) method for pattern selection from 2D-GC images.
Experiments on Chinese herb data classification and clustering show the improvement obtained by
adopting the SVM feature selection method.
Keywords: Chinese herb, 2D-GC, SVM, Feature Selection, Image Analysis, Classification,
Clustering.
1 Introduction
1.1 The Significance and Importance of Chinese Herb Data Analysis
….
1.2 The Superiority of 2D-GC Compared with 1D-GC, and Its Suitability for Chinese Herb Analysis
….
1.3 The Difficulties of Analyzing 2D-GC Data: Computational Complexity and the Intractability of Pattern Recognition
….
As described in the earlier introduction to 2D-GC, the captured data are saved in matrix form:
each column of the matrix holds the intensities sampled within the retention time of the second
column of the 2D-GC device, and the number of matrix columns corresponds to the total time the
experiment lasts. Analyzing such large data matrices therefore incurs a heavy computational overhead.
A way to address the computational complexity is to reduce the matrix dimension: only
significant and distinctive patterns in the image are retained as meaningful features. For
example, the ANOVA method has been adopted [Ref…] for feature selection from the 2D-GC data
matrix. It uses a small subset of samples from each class of data and retains the matrix
entries that have large inter-class variance and small intra-class variance.
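As an illustration of this variance-ratio idea (a minimal sketch, not the exact ANOVA procedure of the cited work), assume each sample has already been flattened into a row of an N x d array X with class labels y; the names below are hypothetical:

import numpy as np

def anova_feature_scores(X, y):
    # Score each flattened matrix entry by inter-class variance divided by
    # intra-class variance; larger scores indicate more distinctive entries.
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    inter = np.zeros(X.shape[1])
    intra = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        inter += len(Xc) * (Xc.mean(axis=0) - overall_mean) ** 2
        intra += ((Xc - Xc.mean(axis=0)) ** 2).sum(axis=0)
    return inter / (intra + 1e-12)   # small constant avoids division by zero

# Example: keep the 1% highest-scoring entries of a hypothetical X, y.
# scores = anova_feature_scores(X, y)
# keep_idx = np.argsort(scores)[-int(0.01 * X.shape[1]):]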
This paper presents the linear SVM for feature selection. The linear SVM, as a linear classifier
trained on each pair of classes, assigns a weight to each matrix entry. The weights separate the
two classes (e.g., by opposite signs, or by an obvious threshold), and the absolute weights
signify the importance of the corresponding entries. Entries with the largest absolute weights
are therefore retained as meaningful features. For a data set with more than two classes, the
feature selection operates in a pairwise manner, and the features selected by the multiple runs
are unified into one combined feature set.
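A minimal sketch of this pairwise scheme, assuming the flattened samples are stacked into an N x d array X with class labels y; scikit-learn's LinearSVC is used here only as a stand-in for the SVM solver actually employed, and the retained fraction is illustrative:

import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def pairwise_svm_features(X, y, keep_frac=0.01, C=1.0):
    # Train one linear SVM per pair of classes and unify the indices of the
    # largest-|w| entries from every run into one combined feature set.
    keep = max(1, int(keep_frac * X.shape[1]))
    selected = set()
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        svm = LinearSVC(C=C, tol=1e-3).fit(X[mask], y[mask])
        w = np.abs(svm.coef_.ravel())
        selected.update(np.argsort(w)[-keep:].tolist())
    return np.array(sorted(selected))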
The 2D-GC data are analyzed as images in which the patterns are regarded as stable and unique
characteristics of a certain herb species or chemical component. But pattern properties such as
area, intensity, and position always show some variation, so the comparison algorithms must
tolerate these variations while not losing sensitivity when analyzing patterns from two different
species. Some image processing techniques are therefore adopted for more accurate pattern
extraction and matching.
1.4 Why Machine Learning Approaches such as SVM and the Family of Classification and Clustering Algorithms Can Help Chinese Herb 2D-GC Image Data Analysis
….
Machine learning is the study of computer algorithms that improve automatically through
experience [1]. As remarked in [2], machine learning will be ….
2. The Feature Selection Methodology: Linear Support Vector
Machine
The formulation of the SVM is introduced here ….
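As a placeholder while that derivation is being written, the standard soft-margin formulation of the linear SVM (a textbook statement, not necessarily the exact variant solved in our experiments) is

\[
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;\frac{1}{2}\lVert\mathbf{w}\rVert^{2}+C\sum_{i=1}^{N}\xi_{i}
\qquad\text{subject to}\qquad
y_{i}\,(\mathbf{w}\cdot\mathbf{x}_{i}+b)\;\ge\;1-\xi_{i},\quad \xi_{i}\ge 0,
\]

where y_i in {+1, -1} is the class label of training vector x_i, C trades the margin width against the training error, and the learned weight vector w is what the later sections threshold for feature selection.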
3. SVM Applied to Pattern Extraction from Chinese Herb 2D-GC Images
Part 1. Input data:
Please give the full name of element A and element B.

A      B       Sample ID
0%     100%    1 2 3 4 5
30%    70%     6 7 8 9 10
50%    50%     11 12 13 14 15
Data format:
Each sample observation forms a 400 x 510 matrix. During each time segment of 4 seconds, 400
readings are sampled from column 2 of the GC device at a rate of 1 FID reading per 0.01 s (100 Hz),
and a complete experimental run lasts 34 minutes [(510 x 4 s)/60 = 34]. The FID intensities
range from 21 to nearly 4500.
In the following analysis, however, we discard the readings obtained in the first 8 minutes,
while the compounds are still passing through the GC device, because of the severe noise when
booting the machine. We therefore handle 15 data matrices of size 400 x 390 in total.
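For concreteness, a minimal sketch of this preprocessing step, assuming each run is stored as one flat stream of FID readings (the file format and loading routine are hypothetical):

import numpy as np

SEG_LEN = 400                    # FID readings per 4 s segment (100 Hz x 4 s)
N_SEGMENTS = 510                 # segments per run: 34 min x 60 s / 4 s
SKIP_SEGMENTS = 8 * 60 // 4      # first 8 minutes = 120 segments discarded

def load_sample(path):
    # Fold one run's flat FID stream into a 400 x 510 matrix (one column
    # per 4 s segment), then drop the first 8 minutes, leaving 400 x 390.
    fid = np.loadtxt(path)                                    # hypothetical flat text file
    mat = fid[:SEG_LEN * N_SEGMENTS].reshape(N_SEGMENTS, SEG_LEN).T
    return mat[:, SKIP_SEGMENTS:]                             # shape (400, 390)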
Part 2. 2D GC has larger information capacity than 1D GC:
It is claimed that the 1D-GC signal can be obtained by accumulating the FID signal strengths in
each column of the 2D-GC data matrix (Fig. 1). If that is true, it is straightforward that 2D-GC
has a larger information capacity than the corresponding 1D-GC (a 400 x 390 matrix vs. a 1 x 390
vector) (Fig. 2).
Although it could be argued that such a reconstruction does not compare 1D-GC with 2D-GC under
the same conditions (e.g., the 2D data are sampled at a frequency of 100 Hz = 1/(0.01 s), whereas
the reconstructed 1D data are effectively sampled at 0.25 Hz = 1/(4 s)), our other experiments
show that this reconstruction is credible. A set of 1D experiments was conducted at the high
frequency (f = 100 Hz), also lasting 34 minutes. The FID readings were then sequentially folded
into segments of length 400 and thus transformed into data matrices of size 400 x 510. Shown as
images (Fig. 3), they are nearly identical to those obtained by simply reconstructing the 1D
signal from the 2D-GC data and then replicating and tiling the mean readings into a matrix.
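Both directions of this comparison are simple array operations; a minimal sketch under the layout assumed above (400 rows per segment, one column per 4 s segment):

import numpy as np

def reconstruct_1d(mat2d):
    # Accumulate the FID strengths in each column of the 2D-GC matrix to
    # obtain the reconstructed first-column (1D-GC) chromatogram (Fig. 1b).
    return mat2d.sum(axis=0)

def fold_1d(fid_stream, seg_len=400):
    # Fold a high-frequency (100 Hz) 1D-GC stream into segments of length
    # 400, giving a matrix directly comparable to the 2D-GC data (Fig. 3).
    n_segs = len(fid_stream) // seg_len
    return fid_stream[:n_segs * seg_len].reshape(n_segs, seg_len).T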
Fig. 1. (a) Image of a sample GC x GC chromatogram. (b) Reconstructed first-column chromatogram
of the same GC x GC measurement.
Fig. 2. Extending the 1D-GC data into the same matrix form, where each column is filled with the
flat mean reading obtained from column 2 of the GC device. The 2D GC x GC signals are clearly
more distinct because of the varying strengths along each column.
Fig. 3. 1D-GC data obtained by increasing the sampling frequency to f = 100 Hz.
Part 3. Further Data preprocessing
3.1 Gaussian filtering with a 3 x 7 window
It is generally believed that the same characteristic patterns can be observed in the graphs of
two 2D-GC experiments on the same compounds. Now assume a significant pattern centered at (x, y)
is observed in the graph, where x is the row-wise pixel position and y is the column-wise pixel
position (or the time, when referring to the 2D-GC experiment). From the knowledge of the
experimentation, such a pattern should not appear as a single pulse at an isolated position, but
should be observed over an interval in both x and y (in the image representation, a rectangular
box around the center rather than a single pixel). Thus, when comparing two 2D-GC graphs, we
cannot simply look at the difference in FID signal strength between each pair of pixels at the
same (time/graph) position; instead we should consider the pattern difference over nearly the
same region of the graphs. A simple way to enforce the effect of the neighboring pixels on a
central pixel is to use a local filter with a small window size. Among the huge set of filters
available in the field of image analysis, the Gaussian smoothing filter is popular. In our GC
graph analysis, we select a Gaussian filter with window size 3 x 7, which fits the case that the
column-wise correlation of the 2D-GC data should be pinpointed more accurately.
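A minimal sketch of this smoothing step; only the 3 x 7 window size is fixed by the text, so the Gaussian widths (sigma values) below are illustrative assumptions:

import numpy as np
from scipy.ndimage import convolve

def gaussian_kernel(rows=3, cols=7, sigma_r=0.8, sigma_c=1.5):
    # Gaussian weights on a 3 x 7 window, normalized to sum to 1;
    # the sigma values are assumptions, not taken from the experiments.
    r = np.arange(rows) - (rows - 1) / 2.0
    c = np.arange(cols) - (cols - 1) / 2.0
    k = np.exp(-(r[:, None] ** 2) / (2 * sigma_r ** 2)
               - (c[None, :] ** 2) / (2 * sigma_c ** 2))
    return k / k.sum()

# smoothed = convolve(sample_matrix, gaussian_kernel(), mode="nearest")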
Part 4. Feature Selection Using Linear Support Vector Classification
4.1 Feature Selection for 2D GC Data by Using Linear SVM
It is necessary to reduce each 2D-GC data matrix from its huge size to a more economical one by
discarding insignificant values and keeping only the important features. This not only greatly
reduces the computational burden placed on the subsequently used classification or clustering
algorithms; more importantly, it is also essential for sketching out the featured patterns in the
2D-GC data for chemists' inspection or further chemical analysis.
A novel machine learning approach to feature selection, utilizing the state-of-the-art support
vector machine, has recently been proposed. Following the general setting, we reshape each 2D-GC
matrix into a one-dimensional vector by sequentially tiling its columns and stack the vectors of
all samples together, forming an N x d data matrix, where N = 15 and d = 400 x 390 = 156,000.
Since the support vector machine is a classification method, some training samples are used to
guide the feature selection procedure. For instance, four samples (the 1st, 2nd, 11th, and 12th)
are regarded as training samples, where the 1st and 2nd samples are in the same class (purely
composed of A) and the remaining two samples (the 11th and 12th) are grouped into the opposite
class (contaminated by some B). Denote the four data vectors as x_i (i = 1, 2, 11, 12).
The linear support vector machine tries to construct a separation function f(x) = w·x + b such
that w·x_1 + b > 0, w·x_2 + b > 0, w·x_11 + b < 0, and w·x_12 + b < 0, subject to some constraints
on w and b. After a systematic solving procedure, we obtain an explicit solution for w from the
linear support vector machine. The vector w, which has the same dimension as x, expresses the
importance of each dimension of x_i through the corresponding entry of w.
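A sketch of this training step, with scikit-learn's LinearSVC standing in for the SVM solver actually used (the parameters mirror Table 2); it assumes the 15 flattened, smoothed samples are stacked row by row in an array X:

import numpy as np
from sklearn.svm import LinearSVC

# Rows 0 and 1 of X hold samples 1 and 2 (one class); rows 10 and 11
# hold samples 11 and 12 (the opposite class).
train_idx = np.array([0, 1, 10, 11])
y_train = np.array([+1, +1, -1, -1])

svm = LinearSVC(C=1.0, tol=1e-3).fit(X[train_idx], y_train)
w = svm.coef_.ravel()        # one weight per FID position (length d)
labels = svm.predict(X)      # predicted class of all 15 samples (cf. Table 2)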
A      B       Sample ID          Classification Result
0%     100%    1 2 3 4 5          1 1 1 1 1
30%    70%     6 7 8 9 10         2 2 2 2 2
50%    50%     11 12 13 14 15     2 2 2 2 2
Table 2. Classification of all 15 samples using parameters C = 1, tolerance = 0.001, and cache
size = 100 MB. The training error is 0 and the testing error is also 0.
Constructing the linear SVM mainly aims at feature selection, although its classification
performance is already encouraging, as shown in Table 2. After training the linear SVM, the
obtained w is illustrated in the following graphs:
Fig 4. Each FID position (156,000 in total) is associated with a weight value.
Fig 5. Reducing the number of features does not hurt the classification accuracy much.
Fig 6. Fractional area of the features as a function of the threshold value.
4.2 Classification accuracy comparison between 2D and 1D data
A comparison has been carried out to validate that 2D-GC data have a larger information capacity
than the reconstructed 1D-GC data: we use the linear SVM to separate a larger number of classes
of experimental data and compare the classification accuracies.
We produce 5 classes of compounds with various percentages of element B. In particular, the
percentages of B are 0%, 10%, 20%, 30%, and 40%, and each specific blend has been fed into the GC
device 5 times to obtain a set of 5 replicated 2D measurements and the same number of
reconstructed 1D data vectors.
This is in fact a multi-class classification task. We report the separation rate per pair of
classes using a training sample rate of 0.4 (2 training samples per class):
              B:A=0:100   B:A=10:90   B:A=20:80   B:A=30:70   B:A=40:60
B:A=0:100        -          0.77        1.00        1.00        1.00
B:A=10:90       0.78         -          0.83        0.96        0.96
B:A=20:80       1.00        0.92         -          0.75        0.99
B:A=30:70       1.00        1.00        0.86         -          0.81
B:A=40:60       1.00        1.00        0.93        0.93         -
Table 3. The overall (training and testing) classification accuracies for 1D and 2D GC data using
the linear SVM under the parameter settings C = 1, tolerance = 0.001, cache size = 100 MB. Each
table cell shows the accuracy of separating the sample type named by the column title from the
type named by the row title. The upper-triangular part shows the results for 1D-GC data, and the
lower-triangular part those for 2D data. The better accuracy of each diagonally symmetric pair of
values is highlighted. The results are averaged over 10 repeated experiments with different
training samples.
From Table 3, we can see that most results for classifying 2D data are better than those for
classifying 1D data. The only exception occurs when separating 40%-B from 30%-B, but the accuracy
difference is not large enough to disprove the superiority of 2D data, taking into account the
device noise and the limited number of samples used.
Part 5: Further Optimizing the 2D-GC Features by Image Processing Methods
The features selected by the linear support vector machine can effectively distinguish the
samples without B mixed in from those contaminated by B. For example, keeping only 1 percent of
the features still classifies the two classes of data perfectly (Figs. 5 and 6).
Although the classification results shown above are insensitive to the number of features,
choosing what percentage of features to keep is still critical from the viewpoint of chemistry
domain experts, because it might be too risky to represent an originally high-dimensional sample
with an extremely small set of features. We therefore have to determine an optimal threshold on
the w obtained from the linear SVM such that the number of retained features is not formidably
large, yet the feature representation of each sample is not vulnerably oversimplified.
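One simple way to inspect this trade-off, mirroring the quantities plotted in Figs. 5 and 6, is to sweep a threshold on |w| and record how many entries survive and what fraction of the total weight mass ("fractional area") they retain; a minimal sketch with an illustrative number of levels:

import numpy as np

def threshold_curve(w, n_levels=50):
    # For a range of thresholds on |w|, report how many entries survive and
    # what fraction of the total |w| mass they keep.
    a = np.abs(w)
    thresholds = np.linspace(0.0, a.max(), n_levels)
    counts = np.array([(a > t).sum() for t in thresholds])
    fractions = np.array([a[a > t].sum() / a.sum() for t in thresholds])
    return thresholds, counts, fractions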
If we had plenty of training samples, classical methods such as cross-validation could be used to
guide the choice of this threshold. But in the chemometrics field, and in particular for our
2D-GC experiments, obtaining more samples is very time-consuming and labor-intensive.
The vector w is a long vector with a one-to-one correspondence to the entries of the 2D data
vectors. Recalling that many unsupervised feature selection methods achieve good results by
considering only the pixel intensities and pattern spatiality of a single 2D image, we can apply
the same methodology to the selection over w if we re-transform the w vector into a 2-D image
with the same size as each 2D-GC image.
To extract more reasonable features from the images without too much supervision, we can employ
image processing techniques for contour/boundary detection. We adopt a set of threshold values
(twenty levels in total) to locate the contours in the image formed by w. The threshold values
are selected automatically for each sample image in an unsupervised manner. The following shows a
set of important areas found in the image of w.
Figure 7.
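A sketch of this contour-location step; scikit-image's find_contours is used here as a stand-in for whatever contour routine was actually employed, the column-major reshaping mirrors the column-by-column flattening described earlier, and the evenly spaced levels are an assumption:

import numpy as np
from skimage.measure import find_contours

def w_contours(w, shape=(400, 390), n_levels=20):
    # Reshape w back into an image of the same size as a 2D-GC sample
    # (column-major order assumed), then locate contours at 20 threshold
    # levels spread over the range of |w|.
    img = np.abs(w).reshape(shape, order="F")
    levels = np.linspace(img.min(), img.max(), n_levels + 2)[1:-1]
    return {level: find_contours(img, level) for level in levels}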
Part 6: Clustering the Whole Dataset Using the New Feature Vectors
6.1 PCA Analysis Showing the Improvement Due to Feature Selection
To verify the effectiveness of the support vector machine feature selection approach, we report
clustering results on the feature-reduced 2D-GC data and compare them with results on the raw
2D-GC data.
The clustering algorithm used is PCA combined with K-means. The input data are first represented
by a subset of principal components through PCA and then clustered into several groups by the
K-means algorithm. For high-dimensional input vectors, such as the raw 2D-GC vectors of length
156,000, PCA with the K-L transform trick (identical to kernel PCA with a linear kernel) is used
to avoid operating directly on the covariance matrix, whose size is proportional to the square of
the input dimension. Even for the lower-dimensional data, linear kernel PCA is performed in
addition to conventional PCA, and the better of the two results is reported under the same
(PCA + K-means) category in the following table.
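A sketch of this clustering pipeline under those choices; scikit-learn's KernelPCA with a linear kernel works on the N x N Gram matrix rather than the d x d covariance matrix, and KMeans performs the grouping. The numbers of components and clusters below are illustrative, not the exact settings used:

from sklearn.cluster import KMeans
from sklearn.decomposition import KernelPCA

def pca_kmeans(X, n_components=3, n_clusters=3, seed=0):
    # Project the samples onto a few principal components via linear-kernel
    # PCA (the Gram-matrix trick), then group the projections with K-means.
    Z = KernelPCA(n_components=n_components, kernel="linear").fit_transform(X)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)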
Training    Linear SVM    K-means       PCA + K-means    Feature Selection (a)   Feature Selection (b)
samples     (raw data)    (raw data)    (raw data)       PCA + K-means           PCA + K-means
2           0.9356        0.8956        0.8874           0.9244                  0.8533
4           0.9711        0.8956        0.8874           0.9156                  0.9289
6           0.9911        0.8956        0.8874           0.9022                  0.9244
Table 4. Validation of the feature selection methods for 2D-GC data. The K-means and
PCA + K-means results on the raw data come from unsupervised runs and therefore do not vary with
the number of training samples.
As shown in Table 4, the linear SVM, as a supervised classification algorithm, always achieves
the best results under the different training sample rates, even though it operates on the raw
data, whereas K-means clustering, as an unsupervised method, with or without PCA, does not
achieve good results on the raw 2D data. After feature selection, the clustering algorithms
obtain higher accuracies with lower computational complexity.
One can also observe that feature selection scheme (b) is positively correlated with the training
sample rate used at the linear SVM stage. Although scheme (b) performs worse when the training
rate is small, it rises to a higher precision than scheme (a) as the training rate increases.
This is also verifiable from Figures 8 and 9: the summed variance percentage of the first 3
principal components in Figure 9 reaches 80%, which is much higher than the total variance
percentage of the first three principal components in Figure 8.
Figure 8. PCA and K-means on the data using feature selection scheme (a).
Figure 9. PCA and K-means on the data using feature selection scheme (b).
4. Conclusion
5. References
[1] T. Mitchell. Machine Learning. McGraw Hill, 1997.