Multilayer Perceptron for analyzing satellite data

Kamal Al-Rawi and Raed Shadfan
Department of Computer Science, Faculty of Information Technology, Petra University, P.O. Box 961343, Amman 11196, Jordan.
e-mail: kamalr@uop.edu.jo, k_alrawi@yahoo.com

Abstract: Different ANN architectures of the MLP have been trained by BP and used to analyze Landsat TM images. Two different approaches have been applied for training: an ordinary approach (one hidden layer M-H1-L and two hidden layers M-H1-H2-L) and a one-against-all strategy (one hidden layer (M-H1-1)xL and two hidden layers (M-H1-H2-1)xL). Classification accuracy of up to 90% has been achieved using the one-against-all strategy with the two-hidden-layer architecture. The performance of the one-against-all approach is slightly better than that of the ordinary approach.

Keywords: MLP, Landsat, Classification.

1. Introduction

The Back Propagation (BP) principle of Artificial Neural Networks (ANN) was developed by Werbos [1]. In the 1980's the algorithm was developed independently by several authors: Le Cun [2], Parker [3], and Rumelhart et al. [4].

Computer analysis of satellite images has been a very active area of research (Benediktsson et al. [5]). Conventional classification is usually deployed for classifying patterns in these images. However, work in the last decade has involved the use of neural networks in place of conventional classifiers. The main advantage of neural network classifiers over conventional classifiers such as the Maximum Likelihood Classifier (MLC) is that they are non-parametric. The Multi-Layer Perceptron (MLP) with BP learning is the most commonly used neural network in the literature for classifying remotely sensed data. While some authors have reported that conventional classifiers perform better than the MLP (Mulder & Spreeuwers [6], Solaiman & Mouchot [7]), many authors have reported that the MLP performed better than the MLC in classifying remotely sensed data (Hepner et al. [8], Heermann & Khazenie [9], Paola & Schowengerdt [10], Yoshida & Omatu [11]).

2. Objectives

The objective of this work is to construct an automated system for analyzing Landsat satellite data using the MLP. Different ANN architectures of the MLP class are tested. Two different approaches are applied for learning: ordinary training, using fully interconnected multi-output nets tagged as (M-H1-L and M-H1-H2-L), and one-against-all training, using a collection of single-output nets tagged as (M-H1-1 and M-H1-H2-1).

3. Data

The data for this study were obtained by the Landsat Thematic Mapper over the area around the Spanish city of "Talavera de la Reina". The data contain 6 features for around 65,000 exemplars; each feature corresponds to a different electromagnetic wavelength. The exemplars correspond to a total of thirteen different classes: mountains, three types of fallow land, two types of irrigated land, meadow, natural vegetation, wheat, wetland, alfalfa, forest, and river. Typically, these classes contain non-linear separations.

4. Multilayer Perceptron

4.1 Architecture:

The network topology used in this study is based on fully connected feed-forward ANNs. The number of nodes in the input layer is equal to the number of features presented by the data, while the number of nodes in the output layer (L) is equal to the number of classes that the data map to. At least one hidden layer must be added to the architecture in order to treat the non-linear separation among the classes. Several networks with one and two hidden layers, with different numbers of nodes in each hidden layer, have been used. The architecture for a multilayer perceptron with two hidden layers (M-H1-H2-L) is shown in figure-1, where M represents the number of nodes in the input layer, H1 the number of nodes in the first hidden layer, H2 the number of nodes in the second hidden layer, and L the number of nodes in the output layer.

Figure-1: The architecture of an ordinary MLP with two hidden layers. The numbers of nodes in the input layer, the first hidden layer, the second hidden layer, and the output layer are M, H1, H2, and L, respectively.
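To make this topology concrete, the following Python/NumPy sketch builds and evaluates a fully connected feed-forward net of the kind described above. It is a minimal illustration rather than the authors' implementation: the sigmoid activation, the random initialisation, and all function names are assumptions, since the paper does not specify them.

    import numpy as np

    def sigmoid(x):
        # Logistic activation -- an assumption; the paper does not
        # name its transfer function.
        return 1.0 / (1.0 + np.exp(-x))

    def init_net(sizes, rng=None):
        # One (weights, bias) pair per layer, e.g. sizes = [M, H1, H2, L].
        if rng is None:
            rng = np.random.default_rng(0)
        return [(rng.normal(0.0, 0.1, (m, n)), np.zeros(n))
                for m, n in zip(sizes[:-1], sizes[1:])]

    def forward(net, x):
        # Fully connected feed-forward pass; returns the output-layer scores.
        a = np.asarray(x, dtype=float)
        for W, b in net:
            a = sigmoid(a @ W + b)
        return a

    # Example: a (6-92-43-1) single-output net (a topology used in the
    # results below) -- 6 Landsat TM features in, one class score out.
    net = init_net([6, 92, 43, 1])
    score = forward(net, np.zeros(6))

The same helpers cover both families of architectures: an ordinary net is built with sizes ending in 13 outputs, a one-against-all net with a single output.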
4.2 Learning and testing:

Two training schemes have been adopted in this work. The first, termed ordinary training, is based on training fully connected feed-forward networks, each with 13 outputs corresponding to the 13 classes of the image. The network is trained to score 1 at the output that corresponds to the correct class and to score 0 at all the other outputs. The second, termed one-against-all, is based on training 13 different single-output networks, as shown in figure-2, where each network corresponds to one class. The problem presented to each network is a simple 2-class problem: the class associated with that network against all the other classes. Eventually, these single-output networks are used together to produce 13 outputs. In both training schemes, the final class is determined by choosing the output with the highest score.

Figure-2: The architecture of the one-against-all network. Thirteen of these ANNs have been trained, one for each class. During classification, the input is presented to each one, and the class is adopted from the network with the highest score.

The data were randomly mixed and split into training and testing sets containing 10% and 90% of the whole data, respectively. The training set includes around 6,500 exemplars evenly distributed amongst the 13 classes: 500 exemplars from each class. During the experiments, learning rates greater than 0.1 resulted in networks being caught in local minima. As a safety measure, the learning rate used in this study was set to 0.001. The number of epochs was chosen to be 50,000 iterations to compensate for the relatively low learning rate.
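The two schemes differ only in the targets each net is trained toward. The sketch below, reusing sigmoid and init_net from the architecture sketch, shows the two target encodings and one BP weight update at the paper's learning rate of 0.001. The squared-error loss and the plain gradient step are assumptions; the paper states only that standard BP was used.

    import numpy as np

    def one_hot(label, n_classes=13):
        # Ordinary scheme: one 13-output net, target 1 at the correct
        # class and 0 at all the other outputs.
        t = np.zeros(n_classes)
        t[label] = 1.0
        return t

    def one_against_all_target(label, k):
        # One-against-all scheme: the k-th single-output net sees a
        # 2-class problem, "class k" (target 1) against all others (0).
        return np.array([1.0 if label == k else 0.0])

    def backprop_step(net, x, t, lr=0.001):
        # One BP update of a net built by init_net, at the paper's
        # learning rate of 0.001.
        acts = [np.asarray(x, dtype=float)]
        for W, b in net:                      # forward pass, keep activations
            acts.append(sigmoid(acts[-1] @ W + b))
        # Output error for a squared-error loss (an assumption),
        # times the sigmoid derivative a * (1 - a).
        delta = (acts[-1] - t) * acts[-1] * (1.0 - acts[-1])
        for i in range(len(net) - 1, -1, -1): # backward pass
            W, b = net[i]
            grad_W, grad_b = np.outer(acts[i], delta), delta
            if i > 0:                         # propagate error downward
                delta = (delta @ W.T) * acts[i] * (1.0 - acts[i])
            net[i] = (W - lr * grad_W, b - lr * grad_b)

In the ordinary scheme one net is updated with one_hot targets; in the one-against-all scheme each of the 13 nets is updated with its own binary targets over the same training exemplars.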
5. Results

Many runs have been conducted for the two schemes using single-hidden-layer networks. The performance of the one-against-all scheme was a little better than that of the ordinary one. The results of these runs are shown in figure-3. The training and testing performance using the (6-208-13) and (6-16-1)x13 network topologies were (87%, 82%) and (87%, 83%), respectively.

Figure-3: Comparison of the one-against-all versus the ordinary architecture. The upper chart shows that the training time for one-against-all is slightly better than for the ordinary scheme, and the training and testing performance for one-against-all are a little better than for the ordinary one. Dark lines represent the one-against-all scheme, light lines the ordinary scheme. (The charts plot learning time in hours, learning performance %, and testing performance % against the number of nodes in the hidden layer x 13.)

While it was expected that the learning time for the one-against-all scheme would be lower than for the ordinary scheme, due to its fewer connections, the learning time was about 16 hours for both approaches. A possible explanation for this result is that the communication time of calling the 13 networks from the computer memory in the one-against-all scheme compensates for the computation time consumed by the processor to resolve the extra connections in the ordinary networks. However, further investigation must be conducted to reach a more solid conclusion. These runs were conducted on a computer with a 1.8 GHz processor, 128 MB RAM, and a 40 GB hard disk.

More runs have been conducted with two-hidden-layer networks, using the one-against-all scheme only. The results of some of these runs are listed in table-1. The training and testing performance using the (6-92-43-1)x13 topology was 93% and 90%, respectively. However, the learning time was more than 130 hours. The learning time was reduced to about 47 hours using the (6-42-10-1)x13 architecture, while the learning and testing performance were reduced by less than 2%. The classification performance at the class level was: class-1 91%, class-2 87%, class-3 96%, class-4 93%, class-5 93%, class-6 92%, class-7 96%, class-8 85%, class-9 85%, class-10 94%, class-11 88%, class-12 84%, and class-13 100%.

Table-1: Some runs for different architectures using two hidden layers with the one-against-all approach. The best classification performance is 90%; the training time for this run is 131 hours. A more practical run, with a classification performance of 88%, required 47 hours.

Input   H1      H2      Output  Learning        Testing         Learning
nodes   nodes   nodes   nodes   performance %   performance %   time/hr
6       8       50      1       87              84              52
6       12      62      1       90              86              70
6       16      82      1       91              87              90
6       21      19      1       90              87              39
6       25      26      1       91              88              49
6       29      43      1       91              88              68
6       33      64      1       92              89              94
6       37      85      1       92              89              119
6       42      10      1       92              88              47
6       46      25      1       92              88              68
6       50      45      1       92              89              90
6       58      81      1       93              89              134
6       63      4       1       92              89              61
6       67      21      1       93              89              84
6       71      42      1       93              89              109
6       79      81      1       93              90              158
6       84      5       1       93              89              80
6       88      24      1       93              89              105
6       92      43      1       93              90              131
6       96      62      1       93              90              157
6       100     82      1       93              89              186

The same data have been analyzed by Al-Rawi et al. [12] using the Supervised ART-II network constructed by Al-Rawi et al. [13]. A similar performance was obtained; however, the training time was in minutes rather than days, as in the case of the MLP. The training time using 9,000 exemplars (one epoch) was a few minutes, and the training and testing performance was 94% and 86%, respectively.
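As an illustration of how the per-class figures above could be computed, the following sketch classifies each test exemplar with the 13 single-output nets and tallies the accuracy per class. The function names are hypothetical, and forward is the pass from the architecture sketch.

    import numpy as np

    def classify(nets, x):
        # One-against-all decision rule: present the input to each of
        # the 13 single-output nets and adopt the class whose net
        # scores highest (see figure-2).
        return int(np.argmax([forward(net, x)[0] for net in nets]))

    def per_class_accuracy(nets, X_test, y_test, n_classes=13):
        # Testing performance broken down by class, as reported above
        # (e.g. class-3: 96%, class-12: 84%).
        hits = np.zeros(n_classes)
        totals = np.zeros(n_classes)
        for x, y in zip(X_test, y_test):
            totals[y] += 1.0
            hits[y] += float(classify(nets, x) == y)
        return hits / totals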
6. Conclusions

1- The MLP gives a good performance when analyzing satellite images. However, it cannot be implemented in real-time analysis due to its very long training time.

2- The one-against-all scheme is preferable over the ordinary approach. When classifying new data whose classes are a subset of the trained classes, one can use only the subset of the trained ANNs that corresponds to the needed classes.

3- The classification performance of the MLP is a little better than that of Supervised ART-II; however, Supervised ART-II can be implemented for real-time tasks.

7. Acknowledgement

We gratefully thank Professor Santiago Ormeno Villajos, Dept. of Topographic & Cartographic Engineering, University Polytechnic of Madrid, for the Landsat TM images. Our thanks go to Professor Consuelo Gonzalo Martin and her group, Faculty of Information Technology, University Polytechnic of Madrid, for their efforts in preparing the Landsat images.

8. References

[1] Werbos, P. J., Ph.D. thesis, Harvard University, Cambridge, MA, USA (1974).

[2] Le Cun, Y., in Disordered Systems and Biological Organization (E. Bienenstock, F. Fogelman Soulié, and G. Weisbuch, Eds.), Springer-Verlag, France (1986) 233-340.

[3] Parker, D., Technical Report TR-87, MIT, Cambridge, MA, USA (1986).

[4] Rumelhart, D. E., Hinton, G. E., and Williams, R. J., in Parallel Distributed Processing: Explorations in the Microstructure of Cognition (D. E. Rumelhart and J. L. McClelland, Eds.), MIT Press, Cambridge, Massachusetts (1986) 318-362.

[5] Benediktsson, J. A., Swain, P. H., and Ersoy, O. K., IEEE Transactions on Geoscience and Remote Sensing, 28 (1990) 540-552.

[6] Mulder, N. J., and Spreeuwers, L., International Geoscience and Remote Sensing Symposium (IGARSS'91), Espoo, Finland (1991) 2211-2213.

[7] Solaiman, B., and Mouchot, M. C., International Geoscience and Remote Sensing Symposium (IGARSS'94), Pasadena, CA, USA (1994) 1413-1415.

[8] Hepner, G. F., Logan, T., Ritter, N., and Bryant, N., Photogrammetric Engineering & Remote Sensing, 56 (1990) 469-473.

[9] Heermann, P. D., and Khazenie, N., IEEE Transactions on Geoscience and Remote Sensing, 30 (1992) 81-88.

[10] Paola, J. D., and Schowengerdt, R. A., Photogrammetric Engineering & Remote Sensing, 63 (1997) 535-544.

[11] Yoshida, T., and Omatu, S., IEEE Transactions on Geoscience and Remote Sensing, 32 (1994) 1103-1109.

[12] Al-Rawi, K. R., Gonzalo, C., and Martinez, E., in Remote Sensing in the 21st Century: Economic and Environmental Applications (Casanova, Ed.), Balkema, Rotterdam (2000) 229-235.

[13] Al-Rawi, K. R., Gonzalo, C., and Arquero, A., European Symposium on Artificial Neural Networks (ESANN'99), Bruges, Belgium (1999) 289-294.