UNSUPERVISED FEATURE LEARNING CLASSIFICATION USING AN EXTREME LEARNING MACHINE

Dao Lam
Department of Electrical and Computer Engineering
Missouri University of Science and Technology, Rolla, MO 65409

Abstract—This paper presents a new approach to classification, which we call UFL-ELM, that uses both unsupervised and supervised learning. Unlike traditional approaches in which features are hand-crafted and the classifier is trained with time-consuming, iterative optimization, the proposed method leverages unsupervised feature learning to learn features from the data themselves and then trains the classifier with an extreme learning machine, whose solution is analytic. The result is therefore widely and quickly applicable to universal data. Experiments on a large image dataset confirm the ease of use and speed of training of this unsupervised feature learning approach. Furthermore, the paper discusses how to speed up training using massively parallel programming.

I. INTRODUCTION

Classification is one of the most important applications of machine learning. Traditional classifiers require a human operator to design features that capture high-level properties of the objects, such as the Scale-Invariant Feature Transform (SIFT) and Histograms of Oriented Gradients (HOG) [15], [3]. A good classifier also requires progressively more data for training, which in turn increases the time required for training. These requirements make classification a daunting task for machine vision systems.

To tackle the disadvantages of hand-crafted features, several methods of unsupervised feature learning (UFL) have been researched, such as sparse coding, deep belief nets, auto-encoders, and independent subspace analysis [14], [7]. These approaches also leverage the fact that unlabeled data are plentiful and do not require much effort to collect. The data that can be learned by UFL are universal, including audio, image, and video data.

Extracting features from the dataset is not the only difficult task in classification. Training the classifier remains an open problem. A popular classifier is the Support Vector Machine [8] and its variants [13] for multiclass classification. Other methods discussed in the literature include boosting [5], but those methods are slow and require parameters to be tuned. In [10], Huang introduced the Extreme Learning Machine (ELM) to ease the task of classification. ELM is a fast-learning, feed-forward, single-hidden-layer neural network that can approximate any nonlinear function, outperforms many other classifiers, and provides very accurate regression [11].

The method presented in this paper combines unsupervised feature learning and the extreme learning machine (UFL-ELM) for classification. This combination produces two-fold results: first, it eliminates the requirement of hand-crafting features; second, it trains the classifier faster than time-consuming iterative optimization does.

The remainder of the paper is organized as follows: Section II discusses UFL, followed by ELM in Section III. Then, in Section IV, we introduce the UFL-ELM framework and discuss its performance.

II. UFL

In computer vision and machine learning, features are the highest-level representations of the investigated object. Previous work has shown that hand-designed local features, such as SIFT and HOG [15], [3], perform well but are difficult to generalize over different kinds of datasets, such as video, text, and sound.
A growing interest in learning features directly from data is met by UFL [2], which is implemented through encoders such as the sparse encoder [16] and k-means [17]. Unsupervised feature learning consists of two steps: 1) building the feature encoder, and 2) encoding the features.

When building the feature encoder, unlabeled data are processed. Those unlabeled data can consist of the dataset itself, computer-generated data, or data collected from the internet or other related sources. This is an advantage of UFL over traditional methods. Processing the unlabeled data involves learning the feature encoder through the following steps:

1. Extract random patches from the unlabeled training data.
2. Apply pre-processing to enhance the contrast of the patches.
3. Learn the feature encoder by using an unsupervised learning algorithm, such as k-means or an auto-encoder.

Once the feature encoder is learned, given a piece of data, we can extract the UFL features by performing the following steps:

1. Extract patches from the data (the patches should adequately cover all of the data).
2. Encode each patch using the feature encoder learned in the previous step.
3. Pool the learned features together to reduce the number of features.

Figure 1 explains how to encode the features when the input data consist of images, and a code sketch of the encoder-building steps above is given after the figure caption.

Figure 1. Learning the UFL features to represent an image. Images are divided into patches of w x w pixels. Each patch is encoded into a vector based on the feature encoder learned in the unsupervised step. Features of the patches that lie in the same quadrant of the image are pooled together to reduce the number of features used to represent an image. In this figure, there are four times more UFL features than elements K in the feature encoder. (Figure adapted from [2].)
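To make the encoder-building step concrete, the following is a minimal NumPy sketch rather than the authors' code: it extracts random patches from unlabeled images, normalizes and whitens them, and runs k-means to obtain the centroids that serve as the feature encoder. The function name, the plain Lloyd-style k-means loop, and the normalization and whitening constants are illustrative assumptions.

```python
import numpy as np

def learn_kmeans_encoder(images, patch_size=6, n_patches=400_000,
                         n_centroids=800, n_iters=10, seed=0):
    """Build a UFL feature encoder from unlabeled images.

    images: array of shape (N, H, W, C), e.g. unlabeled 32x32x3 images.
    Returns (centroids, mean, whitening) so the same pre-processing can be
    reused when encoding labeled data later.
    """
    rng = np.random.default_rng(seed)
    N, H, W, C = images.shape
    dim = patch_size * patch_size * C

    # 1) Extract random patches and vectorize each one.
    idx = rng.integers(0, N, n_patches)
    ys = rng.integers(0, H - patch_size + 1, n_patches)
    xs = rng.integers(0, W - patch_size + 1, n_patches)
    patches = np.stack([
        images[i, y:y + patch_size, x:x + patch_size, :].reshape(dim)
        for i, y, x in zip(idx, ys, xs)
    ]).astype(np.float64)

    # 2) Pre-process: per-patch contrast normalization, then ZCA whitening
    #    (the regularization constants here are illustrative).
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= np.sqrt(patches.var(axis=1, keepdims=True) + 10.0)
    mean = patches.mean(axis=0)
    cov = np.cov(patches - mean, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)
    whitening = eigvec @ np.diag(1.0 / np.sqrt(eigval + 0.1)) @ eigvec.T
    patches = (patches - mean) @ whitening

    # 3) Learn the encoder with k-means (plain Lloyd iterations).
    centroids = patches[rng.choice(n_patches, n_centroids, replace=False)]
    for _ in range(n_iters):
        # ||x - c||^2 expanded to avoid a huge 3-D array; a real run would
        # also chunk this (n_patches x n_centroids) distance matrix.
        d = ((patches ** 2).sum(1)[:, None]
             - 2.0 * patches @ centroids.T
             + (centroids ** 2).sum(1)[None, :])
        assign = d.argmin(axis=1)
        for k in range(n_centroids):
            members = patches[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids, mean, whitening
```

In the experiment of Section V, this step corresponds to 400,000 patches of 6x6x3 pixels and K = 800 centroids.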
III. ELM

The architecture of an ELM for classification is depicted in Figure 2. The most advantageous feature of ELM is the way it is trained. Unlike many other neural networks that take hours or even days to train because of their slow convergence in the optimization process [4], ELM input weights can be initialized randomly, and the output weights can be determined analytically by a pseudo-inverse matrix operation.

Figure 2. ELM architecture [11]. ELM for classification is a feed-forward single-hidden-layer neural network in which the number of input neurons equals the number of features in the dataset and the number of output neurons equals the number of classes to classify. The number of hidden neurons is usually comparable to the number of features.

An ELM classifier has 3 layers: 1) the input layer, whose number of neurons equals the number of features in the dataset; 2) the hidden layer, which has a non-linear activation function; and 3) the output layer, whose number of neurons equals the number of classes.

Let $X \in \mathbb{R}^{n \times N} = [x_1, x_2, \ldots, x_N]$, where $n$ is the number of features and $N$ is the number of data pieces, be the data used to train the ELM. To include the bias value of the neurons, we transform $X$ into $\bar{X}$ by appending a row vector of all 1s, i.e. $\bar{X} = [X; \mathbf{1}]$. Let $C \in \mathbb{R}^{k \times N} = [c_1, c_2, \ldots, c_N]$ be the expected output of classification, where $k$ is the number of classes and $c_i = [0, 0, \ldots, 1, \ldots, 0]^T$ is a vector of all zeros except for a 1 at the position of the correct class. Denote by $W_i \in \mathbb{R}^{N_H \times (n+1)}$ and $W_o \in \mathbb{R}^{k \times N_H}$ the input weight matrix and output weight matrix of the ELM, where $N_H$ is the number of neurons in the hidden layer. Doing so yields

$H = g(W_i \bar{X})$   (1)

where $H \in \mathbb{R}^{N_H \times N}$ is the hidden layer output matrix of the ELM and $g$ is the activation function of the neurons. Once we obtain $H$, we can calculate the output $O$ of the output layer:

$O = W_o H$   (2)

Eq. 2 holds because the output node activation function is linear. For training purposes, $O$ should be as close to $C$ as possible, i.e. $\|O - C\| = 0$. ELM theory states that to achieve $\|O - C\| = 0$ [11], we can initialize $W_i$ with random values and compute $W_o$ as

$W_o = C \, \mathrm{pinv}(H)$   (3)

where $\mathrm{pinv}(H)$ denotes the generalized (Moore–Penrose) inverse of $H$. Though ELM can be modified to some extent to improve its performance [9] or reduce its complexity [12], a simple implementation is sufficient for several applications. Once training is complete, we can use the ELM to classify the testing set. A minimal code sketch of this training and prediction procedure follows.
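The following is a minimal NumPy sketch of Eqs. (1)-(3), not the authors' implementation: the input weights are drawn at random, the hidden layer output H is computed with a sigmoid activation, and the output weights are obtained from the pseudo-inverse of H. The function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, C, n_hidden, seed=0):
    """Train an ELM classifier analytically.

    X: (n, N) training features, one column per sample.
    C: (k, N) one-hot target matrix, one column per sample.
    Returns the random input weights W_i and the solved output weights W_o.
    """
    rng = np.random.default_rng(seed)
    n, N = X.shape
    X_bar = np.vstack([X, np.ones((1, N))])          # append bias row: X_bar = [X; 1]
    W_i = rng.uniform(-1.0, 1.0, (n_hidden, n + 1))  # random input weights
    H = sigmoid(W_i @ X_bar)                         # hidden layer output, Eq. (1)
    W_o = C @ np.linalg.pinv(H)                      # analytic output weights, Eq. (3)
    return W_i, W_o

def elm_predict(X, W_i, W_o):
    """Classify samples: the output neuron with the highest activation wins."""
    X_bar = np.vstack([X, np.ones((1, X.shape[1]))])
    O = W_o @ sigmoid(W_i @ X_bar)                   # Eq. (2)
    return O.argmax(axis=0)
```

With the 3,200-dimensional UFL features of Section V, calling elm_train with n_hidden = 1000 would correspond to the smallest configuration reported in Table I (up to the random seed and memory constraints).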
IV. TECHNICAL APPROACH FRAMEWORK

In this section, we describe the new approach to classification using UFL-ELM. Given a dataset to classify, this framework requires two steps: first, the UFL phase collects more data and labels some of the data for learning and training; second, the ELM trains the classifier using the labeled data.

Specifically, for the UFL learning phase, the additional data can be obtained from any source in the same domain. For example, when working with image classification, good sources of data include Flickr, Google Images, and Bing Images. The additional data can even be generated by computer graphics programs to help enrich the feature learning.

The UFL phase uses unlabeled data to build the feature encoder. It begins by densely sampling each piece of data into patches. Each patch is then vectorized and treated as an object in unsupervised learning; an unsupervised learning algorithm, such as k-means or an auto-encoder, is applied to those patches to learn the structure of the encoder. This encoder is then used to learn features in the labeled dataset.

At this point, for each labeled object, patches are sequentially extracted to represent the object at a low level. All of the patches are then compared against the feature encoder to form the features of the object at a higher level, such as edges or corners in the case of images. An object may need several patches to represent it at a low level, and even more to represent it at a high level, so we need to pool those features to reduce the size of the representation. Usually, the final number of features needed to represent the object is a few times more than the number of elements in the feature encoder.

Figure 3. The UFL-ELM framework. The unlabeled data are used to build the feature encoder, which later learns the features in the labeled data through a feature mapping method. The labeled data are then divided into training and testing sets to train the ELM and test overall performance.

After the labeled data are learned, they are split again into two sets: the training set and the testing set. The training set is used to train the ELM. This training is straightforward and essentially involves initializing the input layer, which is a random weight matrix, and then computing the output weights using a pseudo-inverse matrix calculation. This results in very fast training. Once this step is complete, the testing set is fed into the ELM, and the output neuron that has the highest activation is chosen as the class of the input.

V. EXPERIMENT

In this section, we describe the experiment used to test UFL-ELM on an actual, large dataset.

The dataset we used to test our approach was CIFAR-10 [1], which consists of 60,000 32x32 color images in 10 mutually exclusive classes, with 6,000 images per class. 50,000 of these images were used for training, and the remaining 10,000 were used for testing. The dataset was organized into a 50,000 x 3072 (3072 = 32x32x3) matrix for training and a 10,000 x 3072 matrix for testing.

For learning the UFL features, we followed the implementation available in [2]. We used a window of w = 6x6 pixels to randomly sample the training dataset and collect 400,000 patches. Each patch was then vectorized into a column. Those patches were then normalized and whitened to increase the contrast. We then applied k-means to the enhanced patch matrix to learn K = 800 centroids. Technically, labeled data are not required for learning the centroids or the features in later stages. This is the advantage of UFL.

Figure 4, which is best viewed in color, depicts the feature encoder. We plotted a random 100 of the 800 centroids after rearranging them into an image format. Each of the small squares in the figure represents a centroid used to encode the features in the later feature mapping step. Each square appears as a horizontal, vertical, or diagonal edge in the dataset. This is proof of successful unsupervised feature learning. Other types of datasets, such as audio and video, have their own corresponding criteria to prove the success of UFL [14], [6].

Figure 4. 400 centroids of the feature encoder learned by k-means clustering. The features appear as vertical, horizontal and diagonal edges, and other features are represented by the colored stripes in the images.

For each pixel in each image in either the training or the testing set, we extracted a 6x6x3 window and stacked it into a vector. We computed the distances from this vector to the 800 centroids learned in the k-means step. Then, we formed a new vector from those 800 distances with the following rule: if a distance was larger than the mean distance, we kept it; otherwise, we set it to 0. The resulting volume of distance vectors was 27x27x800. We then pooled the features to reduce the size by summing the features in each quadrant. Finally, we concatenated the pooled volume into a feature vector of 3,200 elements. This vector is the UFL representation of one image; a code sketch of this mapping follows.
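As an illustration of this feature mapping, the following is a minimal NumPy sketch, not the authors' code, that turns one 32x32x3 image into a 3,200-element UFL vector: 6x6 windows at every pixel position (a 27x27 grid), distances to the 800 centroids, the mean-threshold rule quoted above, and sum-pooling over the four quadrants. The helper name, the interpretation of "quadrant" as a 2x2 split of the 27x27 grid, and the omission of the normalization/whitening used during encoder learning are assumptions made for brevity.

```python
import numpy as np

def ufl_encode_image(image, centroids, patch_size=6, pool_grid=2):
    """Map one (32, 32, 3) image to a pooled UFL feature vector.

    centroids: (K, patch_size*patch_size*3) feature encoder from k-means.
    Returns a vector of pool_grid*pool_grid*K elements (4 * 800 = 3,200 here).
    Any per-patch normalization/whitening used while learning the encoder
    would normally be applied to each window as well; it is omitted here.
    """
    H, W, C = image.shape
    K = centroids.shape[0]
    grid = H - patch_size + 1                       # 27 positions per axis for 32x32 images

    # Distances from every 6x6x3 window to every centroid.
    dist = np.empty((grid, grid, K))
    for y in range(grid):
        for x in range(grid):
            patch = image[y:y + patch_size, x:x + patch_size, :].reshape(-1)
            dist[y, x] = np.sqrt(((centroids - patch) ** 2).sum(axis=1))

    # Mean-threshold rule from the text: keep a distance only if it is
    # larger than that window's mean distance, otherwise set it to 0.
    mean = dist.mean(axis=2, keepdims=True)
    act = np.where(dist > mean, dist, 0.0)

    # Sum-pool the 27x27xK volume over a 2x2 grid of quadrants.
    pooled = np.zeros((pool_grid, pool_grid, K))
    ys = np.array_split(np.arange(grid), pool_grid)
    xs = np.array_split(np.arange(grid), pool_grid)
    for i, yi in enumerate(ys):
        for j, xj in enumerate(xs):
            pooled[i, j] = act[np.ix_(yi, xj)].sum(axis=(0, 1))

    return pooled.reshape(-1)                       # 3,200-element UFL feature
```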
For classification, we used an ELM with 3,200 inputs. The activation function of the hidden nodes was the sigmoid. The number of output neurons was 10, corresponding to the 10 classes of objects in the dataset. The number of hidden neurons must be defined when using an ELM, so we ran the ELM with different numbers of hidden neurons, from 1,000 to 6,000. We ran the experiment using Matlab on a machine with an Intel Xeon E5645 CPU at 2.4 GHz and 12 GB of RAM. We attempted a run with 7,000 hidden neurons but encountered an Out of Memory error. Table I reports the performance of the classifier.

Table I. UFL-ELM CLASSIFICATION USING CPU MATLAB IMPLEMENTATION

Hidden Neurons | Train Accuracy | Test Accuracy | Train Time (s) | Test Time (s)
1000           | .62            | .59           | 108            | 8
2000           | .68            | .62           | 327            | 15
3000           | .72            | .63           | 675            | 24
4000           | .75            | .64           | 1205           | 33
5000           | .77            | .64           | 1845           | 42
6000           | .80            | .64           | 2766           | 49

Table II. UFL-ELM PERFORMANCE WITH CUDA SPEED UP

Hidden Neurons | Train Accuracy | Test Accuracy | Train Time (s) | Test Time (s)
1000           | .62            | .59           | 6.1            | .7
2000           | .68            | .62           | 16.4           | 1.4
3000           | .72            | .63           | 31             | 2.0
4000           | .75            | .64           | 52             | 2.8
5000           | .77            | .64           | 77             | 3.4
6000           | .80            | .64           | 111            | 4.1

Figure 5. Speeding up the ELM classifier using CUDA. As the number of hidden neurons increased, the time required by the pseudo-inverse MATLAB implementation followed a power law in the number of hidden neurons, while the time required by the CUDA implementation remained linear.

There are two aspects of these results to consider: precision and time. Regarding the first aspect, the precision of the ELM increased with the number of neurons. However, beyond 4,000 hidden neurons, the gain in precision was trivial. In fact, when we reduced the dataset size by half to overcome the memory problem, we found that increasing the number of hidden neurons further decreased the precision of the ELM classifier.

The second aspect of the ELM classifier is time. As the number of hidden neurons increased, so did the complexity, which in turn increased the time needed for training. With 6,000 hidden neurons, it took MATLAB, with its proprietary matrix manipulation optimization, more than 45 minutes to train the ELM.

To reduce the training time, we leveraged the parallel nature of the matrix operations, addition and multiplication, and implemented them using the CUDA library. We ran the ELM on a machine with an NVIDIA Tesla C2050 and measured the time required for training. The results are reported in Table II and plotted in Fig. 5. With the GPU implementation, the timing improved dramatically: with 6,000 hidden neurons, the GPU implementation was 20 times faster than Matlab in training and 10 times faster in testing. A minimal sketch of this kind of GPU off-loading is given at the end of this section.

The multiclass SVM classifier in [2] yields 80% training accuracy and 75% testing accuracy and requires 452 s of running time; our approach reaches 80% and 64%, respectively. Though the SVM's accuracy is higher than that of our ELM approach, it requires a great deal of work to tune and optimize, and the time needed to perform cross-validation for parameter setting in the SVM approach is substantial.
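To illustrate the kind of GPU off-loading described above, here is a minimal sketch using CuPy rather than the authors' CUDA implementation. It assumes H has full row rank and computes the output weights from the normal equations, W_o = C H^T (H H^T)^{-1}, which is equivalent to the pseudo-inverse solution of Eq. (3) in that case and consists only of large dense matrix products plus one small solve, all of which run on the GPU. The library choice and function name are assumptions.

```python
import cupy as cp

def elm_solve_output_weights_gpu(H, C):
    """Solve W_o = C * pinv(H) on the GPU via the normal equations.

    H: (n_hidden, N) hidden-layer output matrix, C: (k, N) targets.
    For full-row-rank H, pinv(H) = H^T (H H^T)^{-1}, so
    W_o = C H^T (H H^T)^{-1}; only dense matrix products and one
    (n_hidden x n_hidden) linear solve are required.
    """
    H_gpu = cp.asarray(H, dtype=cp.float32)
    C_gpu = cp.asarray(C, dtype=cp.float32)
    A = H_gpu @ H_gpu.T                  # (n_hidden, n_hidden)
    B = H_gpu @ C_gpu.T                  # (n_hidden, k)
    W_o = cp.linalg.solve(A, B).T        # (k, n_hidden)
    return cp.asnumpy(W_o)
```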
VI. CONCLUSION

In this paper, we introduced a combination of unsupervised and supervised learning for the problem of classification. For unsupervised learning, we used unlabeled data to build the feature encoder and then used that encoder to extract the features in the labeled data. Doing so leveraged the availability of data from many other sources and eliminated the daunting process of designing features suitable for each specific type of data. Furthermore, for supervised learning, we exploited the ELM for faster classifier training. The ELM was sped up further using massively parallel programming because its training reduces to matrix manipulation. The results on a large image dataset confirmed the advantages of the approach.

Acknowledgements

We would like to thank Adam Coates and Guang-Bin Huang for their publicly available code. Partial support from the National Science Foundation, the Missouri S&T Intelligent Systems Center, and the Mary K. Finley Missouri Endowment is gratefully acknowledged.

REFERENCES

[1] CIFAR-10 dataset. http://www.cs.toronto.edu/~kriz/cifar.html.
[2] Adam Coates, Honglak Lee, and Andrew Y. Ng. An analysis of single-layer networks in unsupervised feature learning. In Proc. International Conference on Artificial Intelligence and Statistics (AISTATS), 2011.
[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR 2005), volume 1, pages 886–893, 2005.
[4] Scott E. Fahlman. An empirical study of learning speed in back-propagation networks. Technical report, Carnegie Mellon University, 1988.
[5] Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. Thirteenth International Conference on Machine Learning (ICML), pages 148–156. Morgan Kaufmann, 1996.
[6] I. Goodfellow, Q. Le, A. Saxe, H. Lee, and A. Ng. Measuring invariances in deep networks. In NIPS, 2010.
[7] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[8] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2):415–425, March 2002.
[9] Guang-Bin Huang, Xiaojian Ding, and Hongming Zhou. Optimization method based extreme learning machine for classification. Neurocomputing, 74(1):155–163, 2010.
[10] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: a new learning scheme of feedforward neural networks. In Proc. 2004 IEEE International Joint Conference on Neural Networks (IJCNN), volume 2, pages 985–990. IEEE, 2004.
[11] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory and applications. Neurocomputing, 2006.
[12] Hieu Trung Huynh and Yonggwan Won. Evolutionary algorithm for training compact single hidden layer feedforward neural networks. In Proc. 2008 IEEE International Joint Conference on Neural Networks (IJCNN, IEEE World Congress on Computational Intelligence), pages 3028–3033. IEEE, 2008.
[13] Takuya Inoue and Shigeo Abe. Fuzzy support vector machines for pattern classification. In Proc. 2001 IEEE International Joint Conference on Neural Networks (IJCNN), volume 2, pages 1449–1454. IEEE, 2001.
[14] Q. V. Le, W. Y. Zou, S. Y. Yeung, and A. Y. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 3361–3368, 2011.
[15] D. G. Lowe. Object recognition from local scale-invariant features. In Proc. Seventh IEEE International Conference on Computer Vision (ICCV), volume 2, pages 1150–1157, 1999.
[16] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873–880. ACM, 2009.
[17] R. Xu and D. Wunsch. Clustering. Wiley-IEEE Press, 2009.