Massive Parallelism in Neural Network Simulation

From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Andreas Zell, Niels Mache, Markus Hüttel, Michael Vogt
University of Stuttgart,
Institute for Parallel and Distributed High Performance Systems (IPVR),
Breitwiesenstr. 20-22, D-7000 Stuttgart 80, Fed. Rep. Germany
E-mail: zell@informatik.uni-stuttgart.de
Abstract: We here present and compare different massively parallel implementations of multilayer feedforward neural networks on a MasPar MP-1216, a parallel SIMD computer with 16,384 processors. For multilayer feedforward networks we have obtained sustained rates of up to 348 MCPS and 129 MCUPS with backpropagation, a high mark for this architecture and for general purpose SIMD computers. This paper focuses on the problems of mapping neural networks to parallel hardware, on implementation problems in obtaining high propagation rates on a SIMD machine and on problems with the resulting learning algorithms.
Keywords: connectionism, neural networks, neural network simulators, massive parallelism, simulation
1 Introduction and Motivation
In the last five to eight years, the focus of research in artificial intelligence (AI) has shifted from a purely symbolic view to a view that encompasses subsymbolic or connectionist systems as another powerful computing method for solving problems in artificial intelligence [Rumelhart, McClelland 86]. Connectionist systems or artificial neural networks (ANNs) consist of a large number of simple units (cells, artificial neurons) working in parallel and exchanging information via a network of directed, weighted links (connections). This information usually only comprises the activation of the neurons, a single numerical value for each cell, which is fed to the successor cells after being weighted by their connecting links. This is motivated by a rough analogy with the synaptic coupling of nerve cells. The information that the network has learned after training is usually distributed over the network in the form of the matrix of all link weights.
Our research group wants to understand the advantages and the trade-offs of the various artificial neural network paradigms and learning algorithms, their efficiency and generalization capabilities and their suitability for massively parallel implementation. We have developed a neural network simulator, SNNS, which has proven well suited for research on learning algorithms, on issues of visualizing network topology, training and performance, and on parallel implementation of neural networks. It is also used in a number of other university research groups and finds growing acceptance in industry as a neural network evaluation and prototyping tool. In this paper we describe the experiences we gained in developing a massively parallel simulator kernel for SNNS running on our 16 K processor MasPar MP-1216.
2 Stuttgart Neural Network Simulator
SNNS (Stuttgart Neural Network Simulator) is an efficient and portable neural network simulation environment for Unix workstations developed at the Institute for Parallel and Distributed High Performance Systems, University of Stuttgart, Germany. It is a software tool to generate, train, test and visualize artificial neural networks. The whole network simulator has been developed in C on Unix workstations. The graphical user interface was implemented under X-Windows X11R4 (Athena widget set), for maximal portability.
2.1 Structure of SNNS
SNNS
nowconsists of a sequential and a parallel simulator kernel and a graphical user interface. The simulator kernel operates on the internal representation of the neural networksand performsall operations of the learning and recall phase. It
is closely coupled with the graphical user interface via an interface of function calls. Thesimulator kernel is written in C
for efficiency and portability and has already been ported to a numberof architectures.
"
X-Window
graphicaluser
interface
XGUI
user
defined.
activation
internal
~
~it
,,,,ta
I
~
l~~
~
........
network
...........................
procedures
user dermed
activation
..
.....
data
~:’:~:::--
Fig. 1 Structure of the SNNSnetworksimulator consisting of sequential simulator kernel, parallel simulator kernel and
graphical user interface
2.2 Graphical User Interface
The graphical user interface, based on X-Windows, is a powerful tool to construct the network topology and to visualize and modify small to medium sized nets interactively. It can also be used to generate and save test patterns for small networks. To economize on screen space, all display elements are kept in separate windows and thus can be arbitrarily arranged or hidden if desired. There are various ways to display or modify nodes and links or selected sets of them. An integrated help facility aids the novice with the interface. Networks can be modified through the user interface during simulation. Units can be introduced, removed, or have their activation values changed. Connections among the units can be inserted, deleted, redirected, or have their strengths modified. Fig. 2 gives an overview of the graphical user interface of SNNS.
Contrary to many other simulators, most modifications can be done in a very simple point and click manner directly from the visual representation of the network topology. The user can control the visual representation of units (activation, output, number, name) and the display of links (directed, undirected, weight). Connections and units can be displayed selectively, i.e. the user may choose to display only those units whose activations or outputs exceed a given display threshold or only those links whose weights are in a certain range. This allows watching the growth of units and the change of link weights during learning.
Fig. 2 Graphical user interface of SNNS with a small letter recognition network: info panel (top left), 3D control panel (top center) and 3D display (top right), error graph (center left), SNNS banner (center), remote control panel (right), 2D display (bottom left), Hinton diagram (bottom center), 2D display (bottom right).
The graphical interface is used both to display a neural network and to generate and manipulate it. Therefore, the user has a powerful set of operations (insertion, deletion, copying, moving) at his disposal. These operations may be applied to individual units or to selections of units and may affect links as well, like 'copy all selected units with their input links' or 'delete all links into the selected units'. These operations allow a quick and convenient generation of networks.
2.3 Connectionist Models supported by SNNS
From its design SNNS supports a wide variety of neural network models. Any network that can be specified as a directed graph with weighted links may be realized. The concept of sites, which has been adapted from RCS [Goddard et al. 89], even allows multiple links between two single units. Most users of SNNS use simple multilayer feedforward networks with one or two hidden layers with standard sigmoid activation functions (logistic, sine or tanh). However, recurrent networks have also been implemented. The following learning algorithms have been implemented in SNNS: "vanilla" backpropagation [Rumelhart, McClelland 86], backpropagation with momentum and weight decay and flat spot elimination, batch backpropagation, quickprop [Fahlman 88], counterpropagation [Hecht-Nielsen 88], backpercolation [Jurik 89], cascade correlation [Fahlman 90], radial basis function networks (RBF) [Poggio, Girosi 89], ART1, ART2 and ARTMAP [Carpenter, Grossberg 88], Time-Delay Networks [Waibel 89] and self-organizing feature maps. Not all of them are available in the public distribution, however.
2.4 Selected Applications of SNNS
SNNS is currently used in at least 300 installations worldwide, approximately one third of them each in Germany, the rest of Europe and the U.S. Its main use is in university research, but some commercial research projects use SNNS as a prototyping tool to find optimal learning procedures, network sizes and learning parameters for various neural network applications. Applications include rotation invariant pattern recognition, handwritten character recognition, stock price prediction, recognition and classification of exogenic and endogenic components of event correlated brain potentials, noise reduction in natural language communication in a telecom environment, prediction of secondary structure of proteins, and texture analysis.
3 Massively parallel SNNS kernels on the MasPar MP-1
Two parallel implementations for the SNNS kernel and one prototype implementation have been developed on our 16 K processor MasPar MP-1216 for multilayer feedforward networks. The goal of the parallelization was to enable the simulation of large neural networks, mainly for the tasks of image processing, feature extraction and pattern and object recognition. The parallel simulator is integrated with the sequential simulator as an alternative simulator kernel. From the X-Windows based graphical user interface it is possible to switch between both kernels at runtime, provided the user restricts himself to multilayer feedforward networks.
3.1 Architecture of the MP-1
The MasPar MP-1216 is a SIMD machine with up to 16,384 four-bit processors. 32 processors are integrated on a single chip, and 32 chips fit on a processor board. Our full scale model delivers a quoted peak performance of 30,000 MIPS (32 bit addition) and 1,500 resp. 600 MFLOPS (32 bit resp. 64 bit). There exist two separate communication architectures on the MasPar: one is a 3-stage global router which allows up to 1024 simultaneous connections between any two processors, the other is a toroidal two-dimensional 8-neighbour grid (X-net). Communication bandwidth is up to 1.5 GB/s peak for the global router and up to 24 GB/s peak for X-net communication. From these data it can be seen that it is advisable to use the local grid as much as possible, since its communication bandwidth is much larger than that of the router. On our machine we also experienced a relatively high rate of router hardware failures, which forced our implementations to avoid the router where possible.
The MasPar can be programmed with parallel versions of C (AMPL) and Fortran. MPPE (MasPar parallel programming environment), an integrated graphical tool set based on X-Windows, facilitates program development and debugging.
Having investigated the trade-offs of different approaches to the parallelization of neural networks, as given in [Singer 90], [Grajski et al. 90], [Chinn et al. 90] and [Zhang et al. 89], we decided on an implementation which combines unit parallelism with training vector parallelism. All implementations of our parallel simulator kernel were done in MPL, a parallel extension of C. Two of them have recently been converted to AMPL, the ANSI C extension of MPL.
3.2 Implementation with Unit-Parallelism and Training Pattern Parallelism
The implementation of [Mache 92] uses the following technique (Fig. 3): All hidden and output units of a vertical slice are mapped to a single processing element (PE) of the MasPar. The computation of unit activations is done in parallel for all units of a layer. Thus, the number of processors needed equals the largest number of processing elements in a layer, i.e. the width of the network determines the number of processors needed. If the number of input units is greater than the number of units of the other layers (which is very often the case), an additional PE is needed to store the remaining components of the input pattern and to send them to its neighbor when they are needed. Each processor stores the weights of all of its input links. The processors are arranged in a logical ring communication structure which can easily be realized on the X-net grid (with possible copying at the fringes). During forward or backward propagation, the intermediate values for the net input or the accumulated error signal, respectively, are shifted cyclically to the left. The weights are stored with a skew factor of 1 in each processor. This allows all units of a layer to compute the sum of all weighted predecessor units' outputs in a number of steps equal to the size of the preceding layer. The algorithm is very similar to a systolic matrix-vector multiplication algorithm.
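The ring computation can be pictured with the following short sequential C sketch. It is our reconstruction for illustration only, not the MPL kernel: the names N, W, out_prev and ring_forward are invented, and the loop over p, which would run in lockstep on all PEs of the MasPar, is written as an ordinary loop. Each accumulator starts on the PE of its unit, picks up one weighted predecessor output per step thanks to the skewed weight storage, and returns to its home PE after N shifts.

    /* Sequential sketch of the systolic ring forward pass (illustrative
       reconstruction, not the original MPL code).  PE p holds the output of
       predecessor unit p and its weight column, stored with skew factor 1. */
    #include <stdio.h>

    #define N 4                       /* layer width = number of ring PEs */

    static double W[N][N];            /* W[i][j]: weight from predecessor i to unit j */
    static double out_prev[N];        /* outputs of the preceding layer */

    static void ring_forward(double net[N])
    {
        double acc[N], tmp[N];
        for (int j = 0; j < N; j++) acc[j] = 0.0;     /* accumulator of unit j starts on PE j */
        for (int s = 0; s < N; s++) {                 /* N systolic steps */
            for (int p = 0; p < N; p++)               /* executed "in parallel" on all PEs */
                acc[p] += W[p][(p + s) % N] * out_prev[p];   /* skewed weight access */
            for (int p = 0; p < N; p++)               /* shift partial sums one PE to the left */
                tmp[p] = acc[(p + 1) % N];
            for (int p = 0; p < N; p++) acc[p] = tmp[p];
        }
        for (int j = 0; j < N; j++) net[j] = acc[j];  /* accumulator j is back on PE j */
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) {
            out_prev[i] = 0.1 * (i + 1);
            for (int j = 0; j < N; j++) W[i][j] = i + 0.01 * j;
        }
        double net[N];
        ring_forward(net);
        for (int j = 0; j < N; j++) {                 /* check against a direct matrix-vector product */
            double ref = 0.0;
            for (int i = 0; i < N; i++) ref += W[i][j] * out_prev[i];
            printf("unit %d: ring %.4f  direct %.4f\n", j, net[j], ref);
        }
        return 0;
    }

After N steps each accumulator has visited every PE exactly once, so it holds the complete net input of its unit, just as in a systolic matrix-vector product.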
Since the width of a feedforward network is usually much smaller than the number of available processors on our MasPar system, multiple copies of the network with different input patterns are kept in the machine and are updated in parallel. In this scheme weight changes have to be computed in each network individually without actually changing the weights. The sum of the weight changes is then computed and applied to all corresponding weights of the identical network copies. This results in a backpropagation algorithm that is a mixture between online and batch backpropagation, with the batch size at least equal to the number of network copies in the machine or an integer multiple of it.
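A minimal sketch of this combined update is given below. It is illustrative C, not the MPL source; COPIES, NWEIGHTS, w and dw are made-up names, and the per-copy changes dw are assumed to have been filled by an ordinary backpropagation pass over each copy's own pattern.

    /* Batch-style update across identical network copies: sum the weight
       changes of all copies and apply the same total everywhere, so that
       all copies keep identical weights after the update. */
    #define COPIES   8                /* network copies held in the PE array */
    #define NWEIGHTS 16               /* number of weights of one network copy */

    static double w[COPIES][NWEIGHTS];   /* identical weights in every copy */
    static double dw[COPIES][NWEIGHTS];  /* per-copy weight changes, not yet applied */

    static void apply_batch_update(void)
    {
        for (int i = 0; i < NWEIGHTS; i++) {
            double sum = 0.0;
            for (int k = 0; k < COPIES; k++)      /* reduction over all copies */
                sum += dw[k][i];
            for (int k = 0; k < COPIES; k++) {    /* broadcast the common update */
                w[k][i] += sum;
                dw[k][i] = 0.0;
            }
        }
    }

The effective batch size of one such update is therefore the number of network copies, or an integer multiple of it if several pattern blocks are accumulated before the sum is applied.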
Fig. 3 Parallel MasPar SNNS kernel with a 6-3-4 feedforward network: all hidden and output neurons of a column and their input weights are mapped onto a single processor; all neurons of a layer are trained in parallel (unit parallelism). The number of processors needed is the number of neurons of the biggest layer except the input layer, plus one. Multiple network copies with different input patterns are trained in parallel (training pattern parallelism).
For an optimal 128-128-128 network, which fits into the machine without an additional PE and which does not need copying at the end of a cycle, we obtained with this implementation 176 MCPS (million connections per second) and 67 MCUPS (million connection updates per second) for backpropagation training. The NETtalk network [Sejnowski, Rosenberg 86], a 203-120-26 network, can be trained with 41 MCUPS and operated with 98 MCPS. These times did not include the time for the transfer of the input patterns from the frontend to the parallel machine.
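To put these rates into perspective (a back-of-the-envelope reading of the figures above, not an additional measurement): a 128-128-128 network has 2 · 128 · 128 = 32,768 connections, so 176 MCPS corresponds to a throughput of about 32,768 / (176 · 10^6) ≈ 0.19 ms of machine time per propagated pattern, and 67 MCUPS to about 0.49 ms per trained pattern, aggregated over all network copies that are processed concurrently.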
One advantage of this approach is that the number of processors used is not determined by the size of the input layer, which is usually much larger than any hidden or output layer. Thus a large number of networks can be trained in parallel. A disadvantage is the fact that the one additional PE has to store many more pattern elements than the others. In a SIMD machine with identical memory allocation on all PEs this memory becomes a limiting factor for how many patterns can be stored in parallel on the machine. Since pattern I/O was the limiting factor of our parallel implementation, a second implementation was developed.
3.3 Second Implementation with Unit-Parallelism and Training Pattern Parallelism
Our second implementation was done to alleviate the pattern I/O bandwidth problem of the first implementation. Its main objective was to store as many patterns as possible in the parallel PE memory, even if the number of PEs needed to store the network is larger. This implementation uses a number of PEs which is equal to the size of the biggest layer, including the input layer. If the input layer is the biggest layer, all PEs store a similar number of pattern components; otherwise some PEs may store no components.
For an optimal 128-128-128 network we obtain sustained 348 MCPS in recall mode and 129 MCUPS for backpropagation training. The NETtalk network can be recalled with 47 MCPS and trained with 17.6 MCUPS. These times include the time for the transfer of the input patterns and the results. Since the I/O times dominated the learning and recall times in the previous implementation, the speed improvement of the latter version was even greater than the figures tell. This speed improvement resulted from a new, better compiler installed in the meantime, and from extensive code optimization. The fact that fewer networks can be trained in parallel in this scheme can be seen in the NETtalk benchmarks, but it is far less important than the time saved by faster pattern loading.
Fig. 4 Second parallel MasPar SNNS kernel with a 6-3-4 feedforward network: all neurons of a column and their input links are mapped onto a single processor. The number of processors needed is as large as the maximum number of neurons of any layer. Multiple network copies with different input patterns are trained in parallel.
3.4 Link-Parallel Implementation
The last implementation [Hüttel 92] is not a full SNNS kernel but was intended as a prototype implementation. It lacks the support of all SNNS kernel functions but can read SNNS network files. It is displayed in Fig. 5.
First the network is extended by one bias unit for each layer and by dummy units that make each layer of equal size n. All units of adjacent layers are connected. The weights to dummy units are initialized to zero and are prevented from being updated by masking them in the last step of the weight update. In our terminology, weights from source unit i to target unit j are denoted by wij. If the weight matrices connecting adjacent layers are denoted W1, ..., Wm, then for odd r the outgoing weights wij of unit i in Wr are mapped to columns of the PE array, with the source unit of lowest index giving the leftmost column; for even r the outgoing weights wij of unit i are mapped to rows of the PE array, with the source unit of lowest index giving the bottom row.
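The following C fragment sketches the effect of this alternating mapping. It is a simplified, sequential picture for illustration only: the diagonal mapping of the patterns and the actual MPL communication code of the prototype are omitted, and pe, mask, layer_forward and masked_update are invented names. Matrices stored directly are reduced along one dimension of the PE array, transposed ones along the other, and the mask keeps the dummy weights at zero.

    #define N 4                          /* common (padded) layer width */

    static double pe[N][N];              /* weights held by processor (row, col) */
    static int    mask[N][N];            /* 0 marks a dummy weight */

    /* Forward pass through the weight matrix currently loaded into the PE
       array.  direct != 0: matrix stored directly (source index = row);
       direct == 0: matrix stored transposed (source index = column), so the
       direction of the reduction alternates from layer to layer. */
    static void layer_forward(const double in[N], double out[N], int direct)
    {
        for (int j = 0; j < N; j++) {
            double net = 0.0;
            for (int i = 0; i < N; i++)
                net += (direct ? pe[i][j] : pe[j][i]) * in[i];
            out[j] = net;                /* activation function omitted */
        }
    }

    /* Masked weight update: dummy weights are excluded and stay at zero. */
    static void masked_update(const double dw[N][N])
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (mask[i][j]) pe[i][j] += dw[i][j];
    }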
This parallel prototype implementation with link parallelism achieved 136 MCUPS for a fully connected 127-127-127 network and 160 MCUPS for a 127-127 network on our MasPar MP-1216.
4 Problems of the Parallel Simulator Kernels
All three parallel SNNS kernels on the MasPar yield impressive performance figures. However, these results have only been obtained after a lengthy period of optimization and after several complete rewrites of the parallel kernel. Our biggest hurdle was the slow communication of the training patterns from the Unix workstation to the parallel backend, which in first tests took minutes, versus milliseconds for the actual training. This could be improved a little with the use of a parallel disk array or the parallel I/O RAM that are available now as expensive options of the machine. A lot of effort was therefore spent to load training patterns in large blocks and to keep as many of them as possible in the distributed parallel PE memory.
Another problem concerns the batch backpropagation algorithm necessary to run the training pattern parallel implementations: for many applications with a large number of similar input patterns this learning algorithm is slower than online backpropagation. We tested our simulator with character recognition problems. In one case we used 10,000 scanned digits "0" to "9". In this test the slower convergence of batch backpropagation offset most of the performance gain of the parallel architecture. However, some applications need batch backpropagation for convergence and others report better generalization results. Also, other batch learning algorithms like quickprop [Fahlman 88] may be used with better results.
Fig. 5 Link-parallel prototype implementation with a 5-3-4 feedforward network: Each layer is filled up with dummy nodes to the size of the largest layer. There is a bias unit for each layer, printed grey. The weight matrices are mapped to the processor array directly and in transposed form, in alternating order (W1, W2', W3, W4', ...). Dummy weights, which are set to 0 and prevented from updating with a mask, are printed in grey. Patterns are mapped to the processor array in diagonal order. The directions of propagation change in each layer according to the mapping of the weight matrix.
5 Conclusions
We have investigated different mappings of neural networks to a massively parallel SIMD computer. These different implementations have shown that it is possible, albeit not at all easy, to obtain impressive performance figures for neural network simulation on current SIMD computers. However, these high marks are only obtained for simple network architectures with a network size that fits well into the parallel machine. We have learned that propagation figures quoted for neural network algorithms are only meaningful if they take the communication time from disk or workstation to the parallel machine into account. Overcoming the I/O bottlenecks took most of the time of the implementations and forced several fundamental changes in the algorithms. Our results can be extended to VLSI neural network hardware in the sense that the time to load training patterns into the parallel hardware must match the speed of propagation.
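A rough balance calculation illustrates this point (our own estimate under the assumption of 32-bit pattern values, not a figure from the measurements): at 348 MCPS a 128-128-128 network with 32,768 connections processes about 10,600 patterns per second, so 128 input values of 4 bytes each already require a sustained pattern-loading bandwidth of roughly 5.4 MB/s before target values and results are counted; any loading path slower than this becomes the bottleneck.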
Another lesson learned was that the speed advantage gained by a parallel implementation can be lost for certain applications because of the slower batch backpropagation algorithm. These results have been obtained with precise floating point computations. It would have been more difficult with fixed point arithmetic or special VLSI hardware with limited precision. The limited precision might itself offset the speed advantage of fast VLSI hardware.
Our last point is that these implementations are not natural mappings to parallel hardware, like e.g. each processor representing a neuron. Because of the limited communication bandwidth, rather special mappings have to be found to obtain high performance on current hardware.
6 Literature
[Carpenter, Grossberg 88] Carpenter, G.A., Grossberg, S.: The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network, IEEE Computer, March 1988, 77-88
[Chinn et al. 90] G. Chinn, K.A. Grajski, C. Chen, C. Kuszmaul, S. Tomboulian: Systolic Array Implementations of Neural Nets on the MasPar MP-1 Massively Parallel Processor, MasPar Corp. Int. Report
[Fahlman 88] Fahlman, S.E.: Faster Learning Variations on Backpropagation, in [Touretzky et al. 88]
[Fahlman 90] S.E. Fahlman, C. Lebiere: The Cascade Correlation Learning Architecture, Report CMU-CS-90-100, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, August 1991
[Goddard et al. 89] Goddard, N.H., Lynne, K.J., Mintz, T., Bukys, L.: The Rochester Connectionist Simulator: User Manual, Tech. Report 233 (revised), Univ. of Rochester, NY, 1989
[Grajski et al. 90] K.A. Grajski, G. Chinn, C. Chen, C. Kuszmaul, S. Tomboulian: Neural Network Simulation on the MasPar MP-1 Massively Parallel Processor, INNC, Paris, France, 1990
[Hecht-Nielsen 88] Hecht-Nielsen, R.: Neurocomputing, Addison-Wesley, 1990
[Hüttel 92] M. Hüttel: Parallele Implementierungen mehrstufiger feedforward-Netze auf einem SIMD-Parallelrechner, Studienarbeit Nr. 1124, Universität Stuttgart, Fakultät Informatik, Juli 92 (in German)
[Jurik 89] M. Jurik: Backpercolation, (unpublished) paper distributed by Jurik Research, PO 2379, Aptos, CA 95001, USA
[Mache 92] N. Mache: Entwicklung eines massiv parallelen Simulatorkerns für neuronale Netze auf der MasPar MP-1216, Diplomarbeit Nr. 845, Universität Stuttgart, Fakultät Informatik, Feb. 92 (in German)
[Poggio, Girosi 89] T. Poggio, F. Girosi: A Theory of Networks for Approximation and Learning, A.I. Memo No. 1140, A.I. Lab., M.I.T., 1989
[Rumelhart, McClelland 86] Rumelhart, D.E., McClelland, J.L., the PDP Research Group: Parallel Distributed Processing, Vol. 1, 2, MIT Press, Cambridge MA, 1986
[Sejnowski, Rosenberg 86] T.J. Sejnowski, C.R. Rosenberg: NETtalk: a parallel network that learns to read aloud, The Johns Hopkins Univ. EE and Comp. Science Technical Report JHU/EECS-86/01, 32 pp., also in: Anderson, Rosenfeld: Neurocomputing: Foundations of Research, ch. 40, pp. 661-672, MIT Press, 1988
[Singer 90] A. Singer: Implementations of Artificial Neural Networks on the Connection Machine, Thinking Machines Corp. Tech. Rep. RL90-2, Jan. 1990 (also in Parallel Computing, summer 1990)
[Touretzky 89] Touretzky, D.: Advances in Neural Information Processing Systems 1, Morgan Kaufmann, 1989
[Touretzky et al. 88] Touretzky, D., Hinton, G., Sejnowski, T.: Proc. of the 1988 Connectionist Models Summer School, June 17-26, Carnegie Mellon Univ., Morgan Kaufmann, 1988
[Vogt 92] M. Vogt: Implementierung und Anwendung von "Generalized Radial Basis Functions" in einem Simulator neuronaler Netze, Diplomarbeit Nr. 875, Univ. Stuttgart, Fakultät Informatik, Jan. 92 (in German)
[Waibel 89] A. Waibel: Consonant Recognition by Modular Construction of Large Phonemic Time-Delay Neural Networks, in Touretzky (Ed.): NIPS 1, pp. 215-223, Morgan Kaufmann, 1989
[Zhang et al. 89] X. Zhang, M. McKenna, J.P. Mesirov, D.L. Waltz: An efficient implementation of the Back-propagation algorithm on the Connection Machine CM-2, Thinking Machines Corp. TR
[Zell et al. 90] A. Zell, Th. Korb, T. Sommer, R. Bayer: A Neural Network Simulation Environment, Proc. Applications of Neural Networks Conf., SPIE Vol. 1294, pp. 535-544
[Zell et al. 91a] A. Zell, N. Mache, T. Sommer, T. Korb: Recent Developments of the SNNS Neural Network Simulator, Applic. of Neural Networks Conf., Proc. SPIE's 1991 Aerospace Sensing Intl. Symp., Vol. 1469, April 1991, Orlando, Florida, pp. 708-719
[Zell et al. 91b] A. Zell, N. Mache, T. Sommer, T. Korb: Design of the SNNS Neural Network Simulator, 7th Austrian Artificial Intelligence Conf., Sept. 91, Wien, Informatik-Fachberichte 287, Springer, pp. 93-102
[Zell et al. 92] A. Zell, N. Mache, R. Hübner, M. Schmalzl, T. Sommer, T. Korb: SNNS User Manual, Version 2.0, Universität Stuttgart, Fakultät Informatik, Report No. 3/92