Massive Parallelism in Neural Network Simulation

From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Andreas Zell, Niels Mache, Markus Hüttel, Michael Vogt
University of Stuttgart, Institute for Parallel and Distributed High Performance Systems (IPVR), Breitwiesenstr. 20-22, D-7000 Stuttgart 80, Fed. Rep. Germany
E-mail: zell@informatik.uni-stuttgart.de

Abstract: We here present and compare different massively parallel implementations of multilayer feedforward neural networks on a MasPar MP-1216, a parallel SIMD computer with 16,384 processors. For multilayer feedforward networks we have obtained sustained rates of up to 348 MCPS and 129 MCUPS with backpropagation, a high mark for this architecture and for general purpose SIMD computers. This paper focuses on the problems of mapping neural networks to parallel hardware, on implementation problems in obtaining high propagation rates on a SIMD machine and on problems with the resulting learning algorithms.

Keywords: connectionism, neural networks, neural network simulators, massive parallelism, simulation

1 Introduction and Motivation

In the last five to eight years, the focus of research in artificial intelligence (AI) has shifted from a purely symbolic view to a view that encompasses subsymbolic or connectionist systems as another powerful computing method for solving problems in artificial intelligence [Rumelhart, McClelland 86]. Connectionist systems or artificial neural networks (ANNs) consist of a large number of simple units (cells, artificial neurons) working in parallel and exchanging information via a network of directed, weighted links (connections). This information usually comprises only the activation of the neurons, a single numerical value for each cell, which is fed to the successor cells after being weighted by their connecting links (a minimal sketch of this update rule is given at the end of this section). This is motivated by a rough analogy with the synaptic coupling of nerve cells. The information that the network has learned after training is usually distributed over the network in the form of the matrix of all link weights.

Our research group wants to understand the advantages and the trade-offs of the various artificial neural network paradigms and learning algorithms, their efficiency and generalization capabilities and their suitability for massively parallel implementation. We have developed a neural network simulator, SNNS, which has proven well suited for research on learning algorithms, on issues of visualizing network topology, training and performance, and on parallel implementation of neural networks. It is also used in a number of other university research groups and with growing acceptance in industry as a neural network evaluation and prototyping tool. In this paper we describe the experiences we gained in developing a massively parallel simulator kernel for SNNS running on our 16 K processor MasPar MP-1216.
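As an illustration of the unit update just described, the following fragment is a minimal sketch in plain C, assuming a fully connected layer and the common logistic activation function. The names and the flat weight layout are illustrative assumptions; this is not the SNNS kernel's actual internal representation.

#include <math.h>

/* Minimal sketch of the basic unit update: each unit sums the
   weighted outputs of its predecessor units and passes the result
   through an activation function (here the logistic function).
   Names and data layout are illustrative only, not SNNS internals. */

static float logistic(float net)
{
    return 1.0f / (1.0f + expf(-net));
}

/* Compute the outputs of one layer with n_units units from the
   outputs of its n_prev predecessor units.
   weight[j * n_prev + i] is the weight of the link from
   predecessor i to unit j, bias[j] is the bias of unit j. */
void propagate_layer(int n_units, int n_prev,
                     const float *prev_out,
                     const float *weight,
                     const float *bias,
                     float *out)
{
    for (int j = 0; j < n_units; j++) {
        float net = bias[j];
        for (int i = 0; i < n_prev; i++)
            net += weight[j * n_prev + i] * prev_out[i];
        out[j] = logistic(net);
    }
}

Applying this routine layer by layer, from the input layer to the output layer, gives the recall (forward propagation) phase of a multilayer feedforward network.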
2 Stuttgart Neural Network Simulator

SNNS (Stuttgart Neural Network Simulator) is an efficient and portable neural network simulation environment for Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems, University of Stuttgart, Germany. It is a software tool to generate, train, test and visualize artificial neural networks. The whole network simulator has been developed in C on Unix workstations. The graphical user interface was implemented under X-Windows X11R4 (Athena widget set) for maximal portability.

2.1 Structure of SNNS

SNNS now consists of a sequential and a parallel simulator kernel and a graphical user interface. The simulator kernel operates on the internal representation of the neural networks and performs all operations of the learning and recall phase. It is closely coupled with the graphical user interface via an interface of function calls. The simulator kernel is written in C for efficiency and portability and has already been ported to a number of architectures.

Fig. 1 Structure of the SNNS network simulator consisting of sequential simulator kernel, parallel simulator kernel and graphical user interface

The graphical user interface, based on X-Windows, is used to construct the network topology and to visualize and modify small to medium sized nets interactively. It can also be used to generate and save test patterns for small networks. To economize on screen space the display elements are kept in separate windows and thus can be arbitrarily arranged or hidden if desired. The interface is described in more detail in the following section.

2.2 Graphical User Interface

The graphical user interface, based on X-Windows, is a powerful tool to construct the network topology and visualize and modify small to medium sized nets interactively. All display elements are kept in separate windows and thus can be arbitrarily arranged or hidden if desired. There are various ways to display or modify nodes and links or selected sets of them. An integrated help facility aids the novice with the interface. Networks can be modified through the user interface during simulation. Units can be introduced, removed, or have their activation values changed. Connections among the units can be inserted, deleted, redirected, or have their strengths modified. Fig. 2 gives an overview of the graphical user interface of SNNS.

Contrary to many other simulators, most modifications can be done in a very simple point and click manner directly from the visual representation of the network topology. The user can control the visual representation of units (activation, output, number, name) and the display of links (directed, undirected, weight). Connections and units can be displayed selectively, i.e. the user may choose to display only those units whose activations or outputs exceed a given display threshold or only those links whose weights are in a certain range. This allows watching the growth of units and the change of link weights during learning.

Fig. 2 Graphical user interface of SNNS with a small letter recognition network: info panel (top left), 3D control panel (top center) and 3D-display (top right), error graph (center left), SNNS banner (center), remote control panel (right), 2D-display (bottom left), Hinton diagram (bottom center), 2D-display (bottom right)

The graphical interface is used both to display a neural network and to generate and manipulate it. Therefore, the user has a powerful set of operations (insertion, deletion, copying, moving) at his disposal.
These operations may be applied to individual units or to selections of units and may affect links as well, like 'copy all selected units with their input links' or 'delete all links into the selected units'. These operations allow a quick and convenient generation of networks.

2.3 Connectionist Models supported by SNNS

From its design SNNS supports a wide variety of neural network models. Any network that can be specified as a directed graph with weighted links may be realized. The concept of sites, which has been adapted from RCS [Goddard et al. 89], even allows multiple links between two single units. Most users of SNNS use simple multilayer feedforward networks with one or two hidden layers with standard sigmoid activation functions (logistic, sine or tanh). However, recurrent networks have also been implemented. The following learning algorithms have been implemented in SNNS: "vanilla" backpropagation [Rumelhart, McClelland 86], backpropagation with momentum, weight decay and flat spot elimination, batch backpropagation, quickprop [Fahlman 88], counterpropagation [Hecht-Nielsen 88], backpercolation [Jurik 89], cascade correlation [Fahlman 90], radial basis function networks (RBF) [Poggio, Girosi 89], ART1, ART2 and ARTMAP [Carpenter, Grossberg 88], Time-Delay Networks [Waibel 89] and self-organizing feature maps. Not all of them are available in the public distribution, however.

2.4 Selected Applications of SNNS

SNNS is currently used in at least 300 installations worldwide, approximately one third of them each in Germany, the rest of Europe and the U.S. Its main use is in university research, but some commercial research projects use SNNS as a prototyping tool to find optimal learning procedures, network sizes and learning parameters for various neural network applications. Applications include rotation invariant pattern recognition, handwritten character recognition, stock price prediction, recognition and classification of exogenic and endogenic components of event correlated brain potentials, noise reduction in natural language communication in a telecom environment, prediction of secondary structure of proteins and texture analysis.

3 Massively parallel SNNS kernels on the MasPar MP-1

Two parallel implementations of the SNNS kernel and one prototype implementation have been developed on our 16 K processor MasPar MP-1216 for multilayer feedforward networks. The goal of the parallelization was to enable the simulation of large neural networks, mainly for the tasks of image processing, feature extraction and pattern and object recognition. The parallel simulator is integrated with the sequential simulator as an alternative simulator kernel. From the X-Windows based graphical user interface it is possible to switch between both kernels at runtime, provided the user restricts himself to multilayer feedforward networks.

3.1 Architecture of the MP-1

The MasPar MP-1216 is a SIMD machine with up to 16,384 four-bit processors. 32 processors are integrated on a single chip, 32 chips fit on a processor board. Our full scale model delivers a quoted peak performance of 30,000 MIPS (32 bit addition) and 1,500 resp. 600 MFLOPS (32 bit resp. 64 bit). There exist two separate communication architectures on the MasPar: one is a 3-stage global router which allows up to 1024 simultaneous connections between any two processors, the other is a toroidal two-dimensional 8-neighbour grid (X-net). Communication bandwidth is up to 1.5 GB/s peak for the global router and up to 24 GB/s peak for X-net communication. From these data it can be seen that it is advisable to use the local grid as much as possible, since its communication bandwidth is much larger than that of the router. Also, on our machine we experienced relatively frequent router hardware failures, which forced our implementations to avoid it if possible.
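To make the X-net topology concrete, the following small C sketch computes the eight grid neighbours of a PE on a two-dimensional torus. The 128 x 128 grid shape and the row-major PE numbering are illustrative assumptions for a 16,384-processor machine, not the MasPar's actual addressing scheme.

#include <stdio.h>

/* Illustrative sketch of toroidal 8-neighbour addressing as used
   conceptually by the X-net grid: a PE at (row, col) in a ROWS x COLS
   grid has eight neighbours, with coordinates wrapping around at the
   edges.  The concrete numbering is an assumption for illustration. */

#define ROWS 128
#define COLS 128

static int pe_index(int row, int col)
{
    /* wrap coordinates onto the torus */
    row = (row % ROWS + ROWS) % ROWS;
    col = (col % COLS + COLS) % COLS;
    return row * COLS + col;
}

static void xnet_neighbours(int row, int col, int neighbour[8])
{
    int n = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++)
            if (dr != 0 || dc != 0)            /* skip the PE itself */
                neighbour[n++] = pe_index(row + dr, col + dc);
}

int main(void)
{
    int nb[8];
    xnet_neighbours(0, 0, nb);                 /* corner PE: all wrap */
    for (int i = 0; i < 8; i++)
        printf("%d ", nb[i]);
    printf("\n");
    return 0;
}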
The MasPar can be programmed with parallel versions of C (AMPL) and Fortran. MPPE (MasPar parallel programming environment), an integrated graphical tool set based on X-Windows, facilitates program development and debugging. Having investigated the trade-offs of different approaches to the parallelization of neural networks, as given in [Singer 90], [Grajski et al. 90], [Chinn et al. 90] and [Zhang et al. 89], we decided on an implementation which combines unit parallelism with training vector parallelism. All implementations of our parallel simulator kernel were done in MPL, a parallel extension of C. Two of them have recently been converted to AMPL, the ANSI C extension of MPL.

3.2 Implementation with Unit-Parallelism and Training Pattern Parallelism

The implementation of [Mache 92] uses the following technique (Fig. 3): All hidden and output units of a vertical slice are mapped to a single processing element (PE) of the MasPar. The computation of unit activations is done in parallel for all units of a layer. Thus, a number of processors is needed which equals the largest number of units in a layer, i.e. the width of the network determines the number of processors needed. If the number of input units is greater than the number of units of the other layers (which is very often the case), an additional PE is needed to store the remaining components of the input pattern and to send them to its neighbour when they are needed. Each processor stores the weights of all of its input links. The processors are located in a logical ring communication structure which can easily be realized on the X-net grid (with possible copying at the fringes). During forward or backward propagation, the intermediate values for the net input or the accumulated error signal, respectively, are shifted cyclically to the left. The weights are stored with a skew factor of 1 in each processor. This allows all units of a layer to compute the sum of all weighted predecessor units' outputs in a number of steps equal to the size of the preceding layer. The algorithm is very similar to a systolic matrix-vector multiplication algorithm (a sequential sketch of this scheme is given after Fig. 3).

Since the width of a feedforward network is usually much smaller than the number of available processors on our MasPar system, multiple copies of the network with different input patterns are kept in the machine and are updated in parallel. Weight changes then have to be computed in each network copy individually without actually changing the weights. The sum of the weight changes is then computed and applied to all corresponding weights of the identical network copies. This results in a backpropagation algorithm that is a mixture between online and batch backpropagation, with the batch size at least equal to the number of network copies in the machine or an integer multiple of it.

Fig. 3 Parallel MasPar SNNS kernel with a 6-3-4 feedforward network: all hidden and output neurons of a column and their input weights are mapped onto a single processor, all neurons of a layer are trained in parallel (unit parallelism). The number of processors needed is the number of neurons of the biggest layer except the input layer, plus one. Multiple network copies with different input patterns are trained in parallel (training pattern parallelism).
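The ring-shift accumulation can be illustrated with a small sequential C emulation. It assumes, for simplicity, a square mapping in which the previous layer and the current layer both have size N (as in the 128-128-128 benchmark below) and one virtual PE per unit; the real kernel is written in MPL, handles unequal layer sizes and the extra input PE, and performs the shifts on the X-net grid.

#include <math.h>
#include <string.h>

#define N 128   /* layer width; one virtual PE per unit (illustrative) */

/* Sequential emulation of the unit-parallel forward pass:
   PE j owns unit j of the current layer and the row w[j][*] of its
   input weights, stored with a skew factor of 1.  Each PE holds one
   predecessor output in a shift register x[j]; in step s it multiplies
   x[j] by the weight of source (j+s) mod N, then the register contents
   are shifted cyclically by one position, exactly like a systolic
   matrix-vector multiplication. */
void forward_pass(const float w[N][N],   /* w[j][i]: weight of link i -> j */
                  const float prev_out[N],
                  float out[N])
{
    float x[N], tmp[N], net[N] = { 0.0f };

    memcpy(x, prev_out, sizeof x);        /* load the shift registers   */

    for (int s = 0; s < N; s++) {
        for (int j = 0; j < N; j++)       /* all "PEs" work in lockstep */
            net[j] += w[j][(j + s) % N] * x[j];
        for (int j = 0; j < N; j++)       /* cyclic shift to the left   */
            tmp[j] = x[(j + 1) % N];
        memcpy(x, tmp, sizeof x);
    }

    for (int j = 0; j < N; j++)           /* logistic activation        */
        out[j] = 1.0f / (1.0f + expf(-net[j]));
}

After N steps every PE j has accumulated the complete net input of its unit, since the skewed weight index (j+s) mod N visits every source unit exactly once.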
For an optimal 128-128-128 network, which fits into the machine without an additional PE and which does not need copying at the end of a cycle, we obtained with this implementation 176 MCPS (connections per second) and 67 MCUPS (connection updates per second) for backpropagation training. The NETtalk network [Sejnowski, Rosenberg 86], a 203-120-26 network, can be trained with 41 MCUPS and operated with 98 MCPS. These times did not include the time for the transfer of the input patterns from the frontend to the parallel machine.

One advantage of this approach is that the number of processors used is not determined by the size of the input layer, which is usually much larger than any hidden or output layer. So a large number of networks can be trained in parallel. A disadvantage is the fact that the one additional PE has to store many more pattern elements than the others. In a SIMD machine with identical memory allocation on all PEs this memory becomes a limiting factor for how many patterns can be stored in parallel on the machine. Since pattern I/O was the limiting factor of our parallel implementation, a second implementation was developed.

3.3 Second Implementation with Unit-Parallelism and Training Pattern Parallelism

Our second implementation was done to alleviate the pattern I/O bandwidth problem of the first implementation. Its main objective was to store as many patterns as possible in the parallel PE memory, even if the number of PEs needed to store the network is larger. This implementation uses a number of PEs which is equal to the size of the biggest layer, including the input layer. If the input layer is the biggest layer, all PEs store a similar number of pattern components, otherwise some PEs may store no components. For an optimal 128-128-128 network we obtain sustained 348 MCPS in recall mode and 129 MCUPS for backpropagation training. The NETtalk network can be recalled with 47 MCPS and trained with 17.6 MCUPS. These times include the time for the transfer of the input patterns and the results. Since the I/O times dominated the learning and recall times in the previous implementation, the speed improvement of the latter version was even greater than the figures tell. This speed improvement resulted from a new, better compiler installed in the meantime, and from extensive code optimization. The fact that fewer networks can be trained in parallel in this scheme can be seen in the NETtalk benchmarks, but it is far less important than the time saved by faster pattern loading.

Fig. 4 Second parallel MasPar SNNS kernel with a 6-3-4 feedforward network: all neurons of a column and their input links are mapped onto a single processor. The number of processors needed is as large as the maximum number of neurons of any layer. Multiple network copies with different input patterns are trained in parallel.
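The weight-change summation used by both training-pattern-parallel kernels (sections 3.2 and 3.3) can be sketched in sequential C as follows. Each network copy accumulates its own weight deltas without touching the weights; the deltas are then summed across copies (a reduction that the SIMD machine performs across PEs) and the summed change is applied once to every copy. The names, the flat weight vector and the assumption that the learning rate is already folded into the local deltas are illustrative, not the MPL kernel code.

#define COPIES 64    /* network copies trained in parallel (illustrative) */
#define NW     1024  /* number of weights per copy (illustrative)         */

/* One training step of the online/batch mixture described above:
   every copy c has computed its local weight changes dw[c][k] for its
   own input pattern, while the weights w[k] (identical in all copies)
   stay untouched.  The changes are then summed over all copies and
   applied once, so the effective batch size equals the number of
   copies in the machine. */
void batch_update(float w[NW], const float dw[COPIES][NW])
{
    for (int k = 0; k < NW; k++) {
        float sum = 0.0f;
        for (int c = 0; c < COPIES; c++)   /* reduction across copies     */
            sum += dw[c][k];
        w[k] += sum;                       /* same update in every copy   */
    }
}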
3.4 Link-Parallel Implementation

The last implementation [Hüttel 92] is not a full SNNS kernel but was intended as a prototype implementation. It lacks the support of all SNNS kernel functions but can read SNNS network files. It is displayed in Fig. 5. First the network is extended by one bias unit for each layer and by dummy units to make each layer of equal size n. All units of adjacent layers are connected. The weights to dummy units are initialized to zero and are prevented from being updated by masking them in the last step of the weight update. In our terminology weights from source i to j are denoted by wij. If the weight matrices connecting adjacent layers are denoted W1, ..., Wm, then, for the r-th weight matrix Wr: if r is odd, the outgoing weights wij of unit i are mapped to columns of the PE array, with the source unit of lowest index giving the leftmost column; if r is even, the outgoing weights wij of unit i are mapped to rows of the PE array, with the source unit of lowest index giving the bottom row. This parallel prototype implementation with link parallelism achieved 136 MCUPS for a fully connected 127-127-127 network and 160 MCUPS for a 127-127 network on our MasPar MP-1216.

4 Problems of the Parallel Simulator Kernels

All three parallel SNNS kernels on the MasPar yield impressive performance figures. However, these results have only been obtained after a lengthy period of optimization and after several complete rewrites of the parallel kernel. Our biggest hurdle was the slow communication of the training patterns from the Unix workstation to the parallel backend, which in first tests took minutes versus milliseconds for the actual training. This could be improved a little with the use of a parallel disk array or the parallel I/O RAM that are available now as expensive options of the machine. A lot of effort was therefore spent to load training patterns in large blocks and to keep as many of them as possible in the distributed parallel PE memory.

Another problem concerns the batch backpropagation algorithm necessary to run the training pattern parallel implementations: for many applications with a large number of similar input patterns this learning algorithm is slower than online backpropagation. We tested our simulator with character recognition problems. In one case we used 10,000 scanned digits "0" to "9". In this test the slower convergence of batch backpropagation offset most of the performance gain of the parallel architecture. However, some applications need batch backpropagation for convergence and others report better generalization results. Also, other batch learning algorithms like quickprop [Fahlman 88] may be used with better results.

Fig. 5 Link-parallel prototype implementation with a 5-3-4 feedforward network: each layer is filled up with dummy nodes to the size of the largest layer. There is a bias unit for each layer, printed grey. The weight matrices are mapped to the processor array directly and in transposed form, in alternating order (W1, W2', W3, W4', ...). Dummy weights, which are set to 0 and prevented from updating with a mask, are printed in grey. Patterns are mapped to the processor array in diagonal order. The directions of propagation change in each layer according to the mapping of the weight matrix.
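The dummy-unit padding and the weight-update mask of Fig. 5 can be sketched in a few lines of sequential C. The matrix layout and names are illustrative assumptions; the prototype itself works on the PE array in MPL.

#define NPAD 6   /* all layers padded to the size of the largest layer,
                    including the bias unit (illustrative value)         */

/* Masked weight update for the link-parallel scheme: links leading to
   dummy units exist only to make all layers the same size.  Their
   weights are initialized to zero and mask[i][j] == 0 keeps them at
   zero, so the dummy units never influence the real network.  Real
   links have mask[i][j] == 1 and are updated normally in the last
   step of the weight update. */
void masked_weight_update(float w[NPAD][NPAD],
                          const float dw[NPAD][NPAD],
                          const unsigned char mask[NPAD][NPAD])
{
    for (int i = 0; i < NPAD; i++)          /* source unit i */
        for (int j = 0; j < NPAD; j++)      /* target unit j */
            if (mask[i][j])                 /* apply the mask */
                w[i][j] += dw[i][j];
}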
5 Conclusions

We here have investigated different mappings of neural networks to a massively parallel SIMD computer. These different implementations have shown that it is possible, albeit not at all easy, to obtain impressive performance figures for neural network simulation on current SIMD computers. However, these high marks are only obtained for simple network architectures with a network size that fits well into the parallel machine. We have learned that propagation figures quoted for neural network algorithms are only meaningful if they take the communication time from disk or workstation to the parallel machine into account. Overcoming the I/O bottlenecks took most of the time of the implementations and forced several fundamental changes in the algorithms. Our results can be extended to VLSI neural network hardware in the sense that the time to load training patterns into the parallel hardware must match the speed of propagation.

Another lesson learned was that the speed advantage gained by a parallel implementation can be lost for certain applications because of the slower batch backpropagation algorithm. These results have been obtained with precise floating point computations. It would have been more difficult with fixed point arithmetic or special VLSI hardware with limited precision. The limited precision might itself offset the speed advantage of fast VLSI hardware.

Our last point is that these implementations are no natural mappings to parallel hardware, like e.g. each processor representing a neuron. Because of the limited communication bandwidth, rather special mappings have to be found to obtain high performance on current hardware.

6 Literature

[Carpenter, Grossberg 88] Carpenter, G.A., Grossberg, S.: The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network, IEEE Computer, March 1988, 77-88
[Chinn et al. 90] G. Chinn, K.A. Grajski, C. Chen, C. Kuszmaul, S. Tomboulian: Systolic Array Implementations of Neural Nets on the MasPar MP-1 Massively Parallel Processor, MasPar Corp. Int. Report
[Fahlman 88] Fahlman, S.E.: Faster Learning Variations on Backpropagation, in [Touretzky et al. 88]
[Fahlman 90] S.E. Fahlman, C. Lebiere: The Cascade Correlation Learning Architecture, Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, August 1991
[Goddard et al. 89] Goddard, N.H., Lynne, K.J., Mintz, T., Bukys, L.: The Rochester Connectionist Simulator: User Manual, Tech. Report 233 (revised), Univ. of Rochester, NY, 1989
[Grajski et al. 90] K.A. Grajski, G. Chinn, C. Chen, C. Kuszmaul, S. Tomboulian: Neural Network Simulation on the MasPar MP-1 Massively Parallel Processor, INNC, Paris, France, 1990
[Hecht-Nielsen 88] Hecht-Nielsen, R.: Neurocomputing, Addison-Wesley, 1990
[Hüttel 92] M. Hüttel: Parallele Implementierungen mehrstufiger feedforward-Netze auf einem SIMD-Parallelrechner, Studienarbeit Nr. 1124, Universität Stuttgart, Fakultät Informatik, Juli 92 (in German)
[Jurik 89] M. Jurik: Backpercolation, (unpublished) paper distributed by Jurik Research, PO 2379, Aptos, CA 95001, USA
[Mache 92] N. Mache: Entwicklung eines massiv parallelen Simulatorkerns für neuronale Netze auf der MasPar MP-1216, Diplomarbeit Nr. 845, Universität Stuttgart, Fakultät Informatik, Feb. 92 (in German)
[Poggio, Girosi 89] T. Poggio, F. Girosi: A Theory of Networks for Approximation and Learning, A.I. Memo No. 1140, A.I. Lab., M.I.T., 1989
[Rumelhart, McClelland 86] Rumelhart, D.E., McClelland, J.A., the PDP Research Group: Parallel Distributed Processing, Vol. 1, 2, MIT Press, Cambridge MA, 1986
[Sejnowski, Rosenberg 86] T.J. Sejnowski, C.R. Rosenberg: NETtalk: a parallel network that learns to read aloud, The Johns Hopkins Univ. EE and Comp. Science Technical Report JHU/EECS-86/01, 32 pp., also in: Anderson, Rosenfeld: Neurocomputing: Foundations of Research, ch. 40, pp. 661-672, MIT Press, 1988
[Singer 90] A. Singer: Implementations of Artificial Neural Networks on the Connection Machine, Thinking Machines Corp. Tech. Rep. RL90-2, Jan. 1990 (also in Parallel Computing, summer 1990)
[Touretzky 89] Touretzky, D.: Advances in Neural Information Processing Systems 1, Morgan Kaufmann, 1989
[Touretzky et al. 88] Touretzky, D., Hinton, G., Sejnowski, T.: Proc. of the 1988 Connectionist Models Summer School, June 17-26, Carnegie Mellon Univ., Morgan Kaufmann, 1988
[Vogt 92] M. Vogt: Implementierung und Anwendung von "Generalized Radial Basis Functions" in einem Simulator neuronaler Netze, Diplomarbeit Nr. 875, Univ. Stuttgart, Fakultät Informatik, Jan. 92 (in German)
[Waibel 89] A. Waibel: Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks, in Touretzky (Ed.): NIPS 1, pp. 215-223, Morgan Kaufmann, 1989
[Zhang et al. 89] X. Zhang, M. McKenna, J.P. Mesirov, D.L. Waltz: An efficient implementation of the Back-propagation algorithm on the Connection Machine CM-2, Thinking Machines Corp. TR
[Zell et al. 90] A. Zell, Th. Korb, T. Sommer, R. Bayer: A Neural Network Simulation Environment, Proc. Applications of Neural Networks Conf., SPIE Vol. 1294, pp. 535-544
[Zell et al. 91a] A. Zell, N. Mache, T. Sommer, T. Korb: Recent Developments of the SNNS Neural Network Simulator, Applic. of Neural Networks Conf., Proc. SPIE's 1991 Aerospace Sensing Intl. Symp., Vol. 1469, April 1991, Orlando, Florida, pp. 708-719
[Zell et al. 91b] A. Zell, N. Mache, T. Sommer, T. Korb: Design of the SNNS Neural Network Simulator, 7th Austrian Artificial Intelligence Conf., Sept. 91, Wien, Informatik-Fachberichte 287, Springer, pp. 93-102
[Zell et al. 92] A. Zell, N. Mache, R. Hübner, M. Schmalzl, T. Sommer, T. Korb: SNNS User Manual, Version 2.0, Universität Stuttgart, Fakultät Informatik, Report No. 3/92