UNSUPERVISED FEATURE LEARNING CLASSIFICATION USING AN EXTREME LEARNING MACHINE

Dao Lam
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO 65409
Abstract—This paper presents a new approach, which we call UFL-ELM, to classification using both unsupervised and supervised learning. Unlike traditional approaches, in which features are hand-crafted and the classifier is trained by time-consuming, iterative optimization, the proposed method leverages unsupervised feature learning to learn features from the data themselves and then trains the classifier with an extreme learning machine, which reaches an analytic solution. The result is a method that can be applied widely and quickly to universal data. Experiments on a large dataset of images confirm the ease of use and speed of training of this unsupervised feature learning approach. Furthermore, the paper discusses how to speed up training using massively parallel programming.
I. INTRODUCTION
Classification is one of the most important applications of machine learning. Traditional classifiers require a human operator to design features that capture high-level properties of the objects, such as the Scale-Invariant Feature Transform (SIFT) and Histograms of Oriented Gradients (HOG) [15], [3]. A good classifier also requires progressively more data for training, which in turn increases the time required for training. These requirements make classification a daunting task for machine vision systems.
To tackle the disadvantages of hand-crafted features, several methods of unsupervised feature learning (UFL) have been researched, such as sparse coding, deep belief nets, auto-encoders, and independent subspace analysis [14], [7]. These approaches also leverage the fact that unlabeled data are plentiful and do not require much effort to collect. The data that can be learned by UFL are universal, including audio, image, and video data.
Extracting features from the dataset is not the only difficult task in classification. Training the classifier also remains an open problem. A popular classifier is the Support Vector Machine [8] and its variants [13] for multiclass classification. Other methods discussed in the literature include boosting [5], but those methods are slow and require some parameters to be tuned.
In [10], Huang introduced the Extreme Learning Machine (ELM) to ease the task of classification. ELM is a fast-learning, feed-forward, single-hidden-layer neural network that can approximate any nonlinear function, outperform many other classifiers, and provide very accurate regression [11].
The method presented in this paper combines unsupervised feature learning and the extreme learning machine (UFL-ELM) for classification. This combination produces two-fold results: first, it eliminates the need for hand-crafted features; second, it provides a faster method of training the classifier than iterative supervised optimization.
The remainder of the paper is organized as follows: Section II discusses UFL, and Section III describes ELM. In Section IV we introduce the UFL-ELM framework, and in Section V we evaluate and discuss its performance. Section VI concludes the paper.
II. UFL
In computer vision and machine learning, features are the highest-level representations of the investigated object. Previous work has shown that hand-designed local features, such as SIFT and HOG [15], [3], perform well but are difficult to generalize across different kinds of datasets, such as video, text, and sound. The growing interest in learning features directly from data is addressed by UFL [2], which is implemented through encoders such as the sparse encoder [16] and k-means [17].
Unsupervised feature learning consists of two steps: 1) building the feature encoder, and 2) encoding the features. When building the feature encoder, unlabeled data are processed. Those unlabeled data can consist of the dataset itself, computer-generated data, or data collected from the internet or other related sources. This is an advantage of UFL over traditional methods. Processing the unlabeled data involves learning the feature encoder through the following steps:
1. Extract random patches from the unlabeled training data.
2. Apply pre-processing to enhance the contrast of the patches.
3. Learn the feature encoder by using an unsupervised learning algorithm, such as k-means or an auto-encoder.
Once the feature encoder is learned, given a piece of data, we can extract the UFL features by performing the following steps:
1. Extract patches from the data (the patches should adequately cover all of the data).
2. Encode each patch using the feature encoder learned in the previous step.
3. Pool the learned features together to reduce the number of features.
Figure 1 explains how to encode the features when the input data consists of images; a code sketch of the encoder-building step is given after the caption.

Figure 1. Learning the UFL features to represent an image. Images are divided into patches of w x w pixels. Each patch is encoded into a vector based on the feature encoder learned in the unsupervised step. Features of the patches that lie in the same quadrant of the image are pooled together to reduce the number of features used to represent an image. In this figure, there are four times more UFL features than elements K in the feature encoder. (Figure adapted from [2].)
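To make the encoder-building step concrete, the following sketch builds a k-means feature encoder from random image patches. It is a minimal illustration in Python with NumPy and scikit-learn (the paper itself follows the MATLAB implementation of [2]); the function name, default patch size, patch count, number of centroids, and normalization constants are assumptions chosen for the example, not values taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_feature_encoder(images, patch_w=6, n_patches=100_000, n_centroids=400, seed=0):
    """Learn a k-means feature encoder from random patches of unlabeled images.

    images: array of shape (N, H, W, C) holding unlabeled images.
    Returns a centroid matrix of shape (n_centroids, patch_w * patch_w * C).
    """
    rng = np.random.default_rng(seed)
    N, H, W, C = images.shape

    # 1. Extract random patches from the unlabeled data.
    patches = np.empty((n_patches, patch_w * patch_w * C))
    for i in range(n_patches):
        n = rng.integers(N)
        r = rng.integers(H - patch_w + 1)
        c = rng.integers(W - patch_w + 1)
        patches[i] = images[n, r:r + patch_w, c:c + patch_w, :].reshape(-1)

    # 2. Pre-process: per-patch contrast normalization, then ZCA whitening.
    patches = (patches - patches.mean(axis=1, keepdims=True)) / \
              np.sqrt(patches.var(axis=1, keepdims=True) + 10.0)
    cov = np.cov(patches, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    zca = eigvec @ np.diag(1.0 / np.sqrt(eigval + 0.1)) @ eigvec.T
    patches = (patches - patches.mean(axis=0)) @ zca

    # 3. Learn the encoder with an unsupervised algorithm (k-means here).
    km = KMeans(n_clusters=n_centroids, n_init=5, random_state=seed).fit(patches)
    return km.cluster_centers_
```

The returned centroid matrix plays the role of the feature encoder used in the encoding and pooling steps described above.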
III. ELM
The architecture of an ELM for classification is depicted
in Figure 2. The most advantageous feature of ELM is the
way it is trained. Unlike many other neural networks that take
hours or even days to train because of their slow convergence
in the optimization process [4], ELM input weights can be
initialized randomly, and the output weights can be determined analytically by a pseudo-inverse matrix operation.

Figure 2. ELM architecture [11]. ELM for classification is a feed-forward, single-hidden-layer neural network in which the number of input neurons equals the number of features in the dataset and the number of output neurons equals the number of classes to classify. The number of hidden neurons is usually comparable to the number of features.

An ELM classifier has three layers: 1) the input layer, whose number of neurons equals the number of features in the dataset; 2) the hidden layer, which has a non-linear activation function; and 3) the output layer, whose number of output neurons equals the number of classes.

Let $X \in \mathbb{R}^{n \times N} = [x_1, x_2, \ldots, x_N]$, where $n$ is the number of features and $N$ is the number of data pieces, be the data used to train the ELM. To include the bias value of the neurons, we transform $X$ into $\bar{X}$ by appending a row vector of all 1s, i.e. $\bar{X} = [X;\ \mathbf{1}]$.

Let $C \in \mathbb{R}^{k \times N} = [c_1, c_2, \ldots, c_N]$, where $k$ is the number of classes, and $c_i = [0, 0, \ldots, 1, \ldots, 0]^T$ is the vector of all zeros except for a 1 at the index of the correct class; this is the expected output of classification.

Denote $W_i \in \mathbb{R}^{N_H \times (n+1)}$ and $W_o \in \mathbb{R}^{k \times N_H}$ as the input weight matrix and output weight matrix of the ELM, where $N_H$ is the number of neurons in the hidden layer. Doing so yields

$H = g(W_i \bar{X})$  (1)

where $H \in \mathbb{R}^{N_H \times N}$ is the hidden layer output matrix of the ELM, and $g$ is the activation function of the neurons.

Once we obtain $H$, we can calculate the output of the output layer as

$O = W_o H$  (2)

Eq. (2) holds because the activation function of the output nodes is linear.

For training purposes, $O$ should be as close to $C$ as possible, i.e. $\|O - C\| = 0$. ELM theory states that to achieve $\|O - C\| = 0$ [11], we can initialize $W_i$ with random values and compute $W_o$ as

$W_o = C \, \mathrm{pinv}(H)$  (3)

where $\mathrm{pinv}(H)$ denotes the generalized (Moore-Penrose) inverse of $H$.

Though ELM can be modified to some extent to improve its performance [9] or reduce its complexity [12], a simple implementation is sufficient for many applications. Once training is complete, we can use the ELM to classify the testing set.
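As a concrete illustration of Eqs. (1)-(3), the following minimal sketch trains and applies an ELM in Python/NumPy; the paper's own implementation is in MATLAB and CUDA, so the function names and the sigmoid choice for g are illustrative assumptions.

```python
import numpy as np

def train_elm(X, C, n_hidden, seed=0):
    """Train an ELM.  X: (n, N) data matrix, C: (k, N) one-hot target matrix."""
    rng = np.random.default_rng(seed)
    n, N = X.shape

    # Append a row of ones so the input weights absorb the hidden-layer bias.
    X_bar = np.vstack([X, np.ones((1, N))])

    # Eq. (1): random input weights, hidden-layer output H = g(W_i X_bar).
    W_i = rng.standard_normal((n_hidden, n + 1))
    H = 1.0 / (1.0 + np.exp(-W_i @ X_bar))          # sigmoid activation g (an assumption)

    # Eq. (3): analytic output weights, W_o = C * pinv(H).
    W_o = C @ np.linalg.pinv(H)
    return W_i, W_o

def predict_elm(W_i, W_o, X):
    """Eq. (2): linear output layer; the largest output is the predicted class."""
    X_bar = np.vstack([X, np.ones((1, X.shape[1]))])
    H = 1.0 / (1.0 + np.exp(-W_i @ X_bar))
    O = W_o @ H
    return O.argmax(axis=0)
```

Because the only learned quantity, W_o, comes from a single pseudo-inverse, training reduces to one pass of matrix algebra rather than iterative optimization.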
IV. TECHNICAL APPROACH FRAMEWORK
In this section, we describe the new approach to classification using UFL-ELM. Given a dataset to classify, the framework requires two steps: first, the UFL phase collects additional unlabeled data and learns features from the data; second, the ELM trains the classifier using the labeled portion of the data.
Specifically, for the UFL learning phase, the additional data can be collected from any source in the same domain. For example, when working with image classification, good sources of data include Flickr, Google Images, and Bing Images. The additional data can even be generated by computer graphics programs to help enrich the feature learning.
The UFL phase uses the unlabeled data to build the feature encoder. It begins by densely sampling each piece of data into patches. Each patch is then vectorized and treated as an object in unsupervised learning. An unsupervised learning algorithm, such as k-means or an auto-encoder, is applied to those patches to learn the structure of the encoder. This encoder is then used to learn features in the labeled dataset.
At this point, for each labeled object, patches are sequentially extracted to represent the object at a low level. All of the patches are then compared against the feature encoder to form the features of the object at a higher level, such as edges or corners in the case of images. An object may need several patches to represent it at a low level, and even more to
represent it at a high level, so we need to pool those features to reduce the size of the object. Usually, the final number of features needed to represent the object is a few times larger than the number of elements in the feature encoder.

Figure 3. The UFL-ELM framework. The unlabeled data are used to build the feature encoder, which is later used to learn the features in the labeled data through a feature mapping step. The labeled data then are divided into training and testing sets to train the ELM and to test overall performance.

After the labeled data are learned, they are split again into two sets: the training set and the testing set. The training set is used to train the ELM. This training is straightforward and essentially involves initializing the input layer with random weights and then computing the output weights using a pseudo-inverse matrix calculation. This results in very fast training. Once this step is complete, the testing set is fed into the ELM, and the output neuron with the highest activation is chosen as the class of the input.

V. EXPERIMENT

In this section, we describe the experiment used to test UFL-ELM on an actual, large dataset.

The dataset we used to test our approach was CIFAR-10 [1], which consists of 60,000 32x32 color images in 10 mutually exclusive classes, with 6,000 images per class. 50,000 of these images were used for training, and the remaining 10,000 were used for testing. The dataset was organized into a 50,000 x 3072 (3072 = 32x32x3) matrix for training and a 10,000 x 3072 matrix for testing.

For learning the UFL features, we followed the implementation available in [2]. We used a window of w = 6x6 pixels to randomly sample the training dataset and collect 400,000 patches. Each patch was then vectorized into a column. Those patches were then normalized and whitened to increase the contrast.

We then applied k-means to the enhanced patch matrix to learn K = 800 centroids. Technically, labeled data are not required for learning the centroids or the features in the later stages. This is the advantage of UFL.

Figure 4, which is best viewed in color, depicts the feature encoder. We plotted a random 100 of the 800 centroids after rearranging them into an image format. Each of the small squares in the figure represents a centroid used to encode the features in the later feature mapping step. Each square appears as a horizontal, vertical, or diagonal edge in the dataset. This is proof of successful unsupervised feature learning. Other types of datasets, such as audio and video, have their own corresponding criteria to prove the success of UFL [14], [6].

Figure 4. Centroids of the feature encoder learned by k-means clustering. The features appear as vertical, horizontal, and diagonal edges; other features are represented by the colored stripes in the images.

For each pixel in each image in either the training or the testing set, we extracted a 6x6x3 window and stacked it into a vector. We computed the distances from this vector to the 800 centroids learned in the k-means step. Then, we formed a new vector from those 800 distances with the following rule: if the distance was larger than the mean distance, then we kept it; otherwise, we set that distance to 0.

The volume of the distance vectors was 27x27x800. We then pooled the features to reduce the size by summing up the features in the same quadrant. Finally, we concatenated the volume into a feature vector of 3,200 elements. This vector is the UFL representation of one image.
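The following sketch illustrates the encoding and pooling just described: every 6x6 window of an image is compared with the K learned centroids, distances are thresholded against their per-patch mean, and the 27x27xK volume is summed over the four quadrants to give a feature vector of 4K elements (3,200 for K = 800). It is a hedged Python/NumPy reading of the text (the authors used MATLAB); the helper name is an assumption, and the normalization and whitening of the test patches are omitted for brevity.

```python
import numpy as np

def encode_image(image, centroids, patch_w=6):
    """UFL representation of one image.

    image: (H, W, C) array; centroids: (K, patch_w*patch_w*C) k-means encoder.
    Returns a feature vector of length 4*K (2x2 quadrant pooling).
    """
    H, W, C = image.shape
    K = centroids.shape[0]
    rows, cols = H - patch_w + 1, W - patch_w + 1     # 27 x 27 for 32x32 images

    # Distance volume: one K-vector of centroid distances per patch location.
    # (In the full pipeline each patch would first be normalized and whitened
    # with the same transform used when building the encoder; omitted here.)
    volume = np.empty((rows, cols, K))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + patch_w, c:c + patch_w, :].reshape(-1)
            volume[r, c] = np.linalg.norm(centroids - patch, axis=1)

    # Keep only distances larger than the per-patch mean, as described above.
    mean = volume.mean(axis=2, keepdims=True)
    volume = np.where(volume > mean, volume, 0.0)

    # Pool by summing each quadrant, then concatenate: 2 x 2 x K -> 4K features.
    half_r, half_c = rows // 2, cols // 2
    quads = [volume[:half_r, :half_c], volume[:half_r, half_c:],
             volume[half_r:, :half_c], volume[half_r:, half_c:]]
    return np.concatenate([q.sum(axis=(0, 1)) for q in quads])
```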
For classification, we used an ELM with 3,200 inputs. The activation function of the hidden nodes was chosen as the sigmoid. The number of output neurons was 10, corresponding to the 10 classes of objects in the dataset.

The number of hidden neurons must be defined when using ELM. We ran the ELM with different numbers of hidden neurons, from 1,000 to 6,000. We ran the experiment using Matlab on a machine with an Intel Xeon E5645 2.4 GHz CPU and 12 GB of RAM. We attempted a run with 7,000 hidden neurons but encountered an out-of-memory error.
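A hedged sketch of how this setup might be wired together, reusing the hypothetical train_elm, predict_elm, and encode_image helpers sketched earlier; the assembly of the CIFAR-10 feature matrices is not shown, and all names are assumptions for illustration.

```python
import numpy as np

# Assume X_train (3200 x 50000) and X_test (3200 x 10000) hold the UFL features
# produced by encode_image for each image, and y_train / y_test hold labels 0..9.

def one_hot(y, k=10):
    """Build the (k, N) target matrix C with a single 1 per column."""
    C = np.zeros((k, y.size))
    C[y, np.arange(y.size)] = 1.0
    return C

def run_experiment(X_train, y_train, X_test, y_test,
                   hidden_sizes=(1000, 2000, 3000, 4000, 5000, 6000)):
    """Sweep the number of hidden neurons and report train/test accuracy."""
    for n_hidden in hidden_sizes:
        W_i, W_o = train_elm(X_train, one_hot(y_train), n_hidden)
        train_acc = np.mean(predict_elm(W_i, W_o, X_train) == y_train)
        test_acc = np.mean(predict_elm(W_i, W_o, X_test) == y_test)
        print(n_hidden, round(train_acc, 2), round(test_acc, 2))
```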
Table I reports the performance of the classifier.
Table I
UFL-ELM CLASSIFICATION USING CPU MATLAB IMPLEMENTATION

Hidden neurons   Train Accuracy   Test Accuracy   Train Time (s)   Test Time (s)
1,000            .62              .59             108              8
2,000            .68              .62             327              15
3,000            .72              .63             675              24
4,000            .75              .64             1205             33
5,000            .77              .64             1845             42
6,000            .80              .64             2766             49
There are two aspects of this table to consider: precision and time. Regarding the first aspect, the precision of the ELM increased with the number of hidden neurons. However, beyond 4,000 hidden neurons, the gain in precision was trivial. In fact, when we reduced the dataset size by half to overcome the memory problem, we found that an increased number of hidden neurons decreased the precision of the ELM classifier.

The second aspect of the ELM classifier is time. As the number of hidden neurons increased, so did the complexity, which in turn increased the time needed for training. With 6,000 hidden neurons, it took MATLAB, with its proprietary matrix manipulation optimization, more than 45 minutes to train the ELM.

To reduce the training time, we leveraged the parallel nature of the matrix addition and multiplication operations and implemented them using the CUDA library. We ran the ELM on a machine with an NVIDIA Tesla C2050 and measured the time required for training. The result is plotted in Fig. 5.

Figure 5. Speeding up the ELM classifier using CUDA. As the number of hidden neurons increased, the time required by the pseudo-inverse MATLAB implementation grew as a power of the number of hidden neurons, while the time required by the CUDA implementation remained linear.
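The authors implemented the matrix operations with a custom MATLAB/CUDA port; as a rough sketch of the same idea, the snippet below uses the CuPy library (an assumption, not something the paper uses) to move the dominant costs, the hidden-layer product and the output-weight solve, onto the GPU.

```python
import cupy as cp   # NumPy-compatible arrays on a CUDA GPU (assumed stand-in for the paper's CUDA code)

def train_elm_gpu(X, C, n_hidden, seed=0):
    """ELM training with the heavy matrix work done on the GPU."""
    cp.random.seed(seed)
    X = cp.asarray(X)                      # move data to GPU memory
    C = cp.asarray(C)
    n, N = X.shape

    X_bar = cp.vstack([X, cp.ones((1, N))])
    W_i = cp.random.randn(n_hidden, n + 1)              # random input weights
    H = 1.0 / (1.0 + cp.exp(-W_i @ X_bar))               # Eq. (1), sigmoid on the GPU

    # Eq. (3) via the normal equations: W_o = C H^T (H H^T)^{-1},
    # solved as a linear system instead of forming pinv(H) explicitly.
    G = H @ H.T + 1e-6 * cp.eye(n_hidden)                # small ridge for stability (an assumption)
    W_o = cp.linalg.solve(G, (C @ H.T).T).T              # G is symmetric, so G^T = G
    return cp.asnumpy(W_i), cp.asnumpy(W_o)
```

Solving the normal equations avoids materializing the N x N_H pseudo-inverse, which is one way to stay within GPU memory; the paper does not specify which formulation its CUDA code uses.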
Table II
UFL-ELM PERFORMANCE WITH CUDA SPEED-UP

Hidden neurons   Train Accuracy   Test Accuracy   Train Time (s)   Test Time (s)
1,000            .62              .59             6.1              .7
2,000            .68              .62             16.4             1.4
3,000            .72              .63             31               2.0
4,000            .75              .64             52               2.8
5,000            .77              .64             77               3.4
6,000            .80              .64             111              4.1

With the GPU implementation, the timing improved dramatically. In fact, with 6,000 hidden neurons, the GPU implementation was 20 times faster than Matlab in training and 10 times faster in testing.

The multiclass SVM classifier in [2] yields 80% training accuracy and 75% testing accuracy and requires 452 s of running time; our approach reaches 80% and 64%, respectively. Though the SVM's accuracy is higher than that of our ELM approach, it requires a great deal of work to tune and optimize, and the time needed to perform cross-validation for parameter setting in the SVM approach is substantial.
VI. CONCLUSION
In this paper, we introduced a combination of unsupervised and supervised learning for the problem of classification. For unsupervised learning, we used unlabeled data to build the feature encoder and then used that encoder to extract the features in the labeled data. Doing so leveraged the availability of data from many other sources and eliminated the daunting process of designing features suited to each specific type of data. Furthermore, for supervised learning, we exploited the ELM for faster classifier training. The ELM was sped up further using massively parallel programming because its training consists mainly of matrix manipulations. The results on a large image dataset confirmed the advantages of the approach.
Acknowledgements
We would like to thank Adam Coates and Guang-Bin Huang for making their code publicly available. Partial support from the National Science Foundation, the Missouri S&T Intelligent Systems Center, and the Mary K. Finley Missouri Endowment is gratefully acknowledged.
REFERENCES
[1] Cifar-10. http://www.cs.toronto.edu/~kriz/cifar.html.
[2] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. Ann Arbor, 1001:48109, 2010.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893, 2005.
[4] Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. CMU Technical Report, 1988.
[5] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pages 148–156. Morgan Kaufmann Publishers, Inc., 1996.
[6] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng. Measuring invariances in deep networks. In NIPS, 2010.
[7] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[8] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.
[9] Guang-Bin Huang, Xiaojian Ding, and Hongming Zhou. Optimization method based extreme learning machine for classification. Neurocomputing, 74(1):155–163, 2010.
[10] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: a new learning scheme of feedforward neural networks. In Neural Networks, 2004. Proceedings. 2004 IEEE International Joint Conference on, volume 2, pages 985–990. IEEE, 2004.
[11] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: theory and applications. Neurocomputing, 2006.
[12] Hieu Trung Huynh and Yonggwan Won. Evolutionary algorithm for training compact single hidden layer feedforward neural networks. In Neural Networks, 2008. IJCNN 2008. (IEEE World Congress on Computational Intelligence). IEEE International Joint Conference on, pages 3028–3033. IEEE, 2008.
[13] Takuya Inoue and Shigeo Abe. Fuzzy support vector machines for pattern classification. In Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on, volume 2, pages 1449–1454. IEEE, 2001.
[14] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), pages 3361–3368, 2011.
[15] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. Seventh IEEE International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.
[16] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning, volume 382, pages 873–880. ACM, 2009.
[17] R. Xu and D. Wunsch. Clustering. Wiley-IEEE Press, 2009.