From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

Parallel Neural Network Training

Ed P. Andert Jr. and Thomas J. Bartolac
Conceptual Software Systems Inc.
P.O. Box 727, Yorba Linda, CA 92686-0727, U.S.A.
Phone: (714) 996-2935, Fax: (714) 572-1950
email: andert@orion.oac.uci.edu

Introduction

Connectionist approaches in artificial intelligence can benefit from implementations that take advantage of massive parallelism. The network topologies encountered in connectionist work often involve such massive parallelism. This paper discusses additional parallelism utilized prior to the actual network implementation. Parallelism is used to facilitate aggressive neural network training algorithms that enhance the accuracy of the resulting network. The computational expense of this training approach is minimized through the use of massive parallelism. The training algorithms are being implemented in a prototype for a challenging sensor processing application.

Our neural network training approach produces networks that are superior in accuracy to those of typical connectionist applications. The approach addresses function complexity, network capacity, and training algorithm aggressiveness. Symbolic computing tools are used to develop algebraic representations for the gradient and Hessian of the least-squares cost functions for feed-forward network topologies with fully-connected and locally-connected layers. These representations are then transformed into block-structured form, integrated with a full-Newton network training algorithm, and executed on vector/parallel computers.

Gradient-based Training

Most researchers train with the "back-propagation" algorithm [McClelland 86], a simplified version of gradient descent. The name comes from the concept that as a given input flows forward through the network, the resulting output is compared with the desired output, and that error is propagated backward through the network, generating the changes to the network's weights as it progresses. One advantage of back-prop is that it only requires the calculation of the training cost function gradient. The layered structure of the network allows the gradient to be expanded by the chain rule, so that each node can compute "its own" partial derivatives and weight changes based solely on local information, such as the error propagated to it from the previous layer.

Unfortunately, back-prop is not efficient. It is notoriously slow, especially as the solution is approached, because the gradient vanishes at the solution. Superior methods exist, such as conjugate gradient, which still only require the cost function gradient (and a few additional global values) to be computed. These methods converge more quickly, but inevitably slow down as the solution is approached.

This slow-down is more than a matter of patience, however. It also indicates a lack of "aggressiveness". In attempting to learn a complicated mapping, an unaggressive training algorithm may cause the network to learn only the more obvious features of the mapping. In this case, the network will never achieve an arbitrarily high accuracy in approximating the mapping, no matter how complicated the network may be. This is similar to Simpson's rule for numerical integration, in which decreasing the step size increases accuracy only up to a certain limit, after which further step-size decreases only cause round-off error to corrupt the solution.
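To make the gradient-based baseline concrete, the sketch below shows plain gradient-descent training driven only by a cost-function gradient. The fixed learning rate and the generic grad_fn interface are illustrative assumptions, not details taken from the paper; the point is that only gradient information is used, which is exactly why such methods stall as the gradient vanishes near the solution.

```python
import numpy as np

def gradient_descent_train(w, grad_fn, learning_rate=0.1, epochs=1000):
    """Plain gradient descent on a training cost function.

    w          -- initial weight vector of the network
    grad_fn(w) -- returns the gradient of the training cost at w
                  (back-propagation is one way to compute it)

    The fixed learning rate is an illustrative assumption; conjugate
    gradient would instead choose its step by a line search along
    conjugate directions, but it still uses only gradient information.
    """
    for _ in range(epochs):
        w = w - learning_rate * grad_fn(w)   # step opposite the gradient
    return w
```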
Hessian-based Training

More aggressive training algorithms rely on the Hessian of the cost function. This allows the network to learn the more subtle features of a complicated mapping, since the Hessian is able to represent the more subtle relationships between pairs of weights in the network. The full-Newton method directly calculates the Hessian, while quasi-Newton methods approximate the Hessian, or its inverse. These algorithms converge more quickly, especially as the solution is approached, since the Hessian does not vanish at the solution. Far from the solution, the Hessian is not always positive-definite, and the Newton step cannot be assumed to be a descent direction. For this reason it is necessary to modify the full-Newton method by adjusting the Hessian until it is positive-definite. Different methods for doing this have been developed [Dennis 83], all with the goal of minimizing the amount by which the Hessian is perturbed. Using a technique that we have developed, the Hessian can be perturbed by a very small amount, which varies as the training progresses.

In exchange for the aggressiveness and superior convergence rates, these algorithms are more computationally expensive per iteration. Furthermore, the expense grows with network size at a rate faster than that for gradient-based techniques, typically by a factor of N for a network having N weights. A further hindrance for full-Newton methods is the need to explicitly calculate the Hessian. Since approximating it numerically is risky for complicated cost functions, researchers must derive the analytic expressions for each of the terms in the Hessian. There are of order 2L² algebraically unique terms in the Hessian for a network with L adjacently-connected hidden layers. Finally, computing a Hessian requires global information, since each element refers to a different pair of weights. Therefore a Hessian cannot be mapped easily onto the existing topology of its network, as can be done for a gradient, and so typically it is calculated "off-line". These disadvantages have deterred many from Hessian-based approaches, except for the simplest and smallest of networks. This, in turn, has limited the application of neural networks as solution estimators for difficult function evaluations or inverse problems.

Two technologies exist that can remove these limitations to Hessian-based learning algorithms: vector/parallel computer architectures and block-structured algorithms. Together they make Hessian-based training algorithms a more practical consideration. The increase in computational expense with network size is reduced by the commensurate increase in efficiency with which the computation can be carried out on vector/parallel architectures. The overall growth, of order N³ for a network of N weights, must still be paid, but now the researcher can choose to have a factor of N, or even N², of the cost paid in computer hardware, with the remainder paid in execution time.
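The sketch below illustrates the generic modified full-Newton step described above. Adding a multiple of the identity until a Cholesky factorization succeeds is a standard textbook device [Dennis 83]; it stands in for, and is not, our own minimal-perturbation technique, which is not detailed here.

```python
import numpy as np

def modified_newton_step(w, grad_fn, hess_fn, tau0=1e-3):
    """One full-Newton step with the Hessian perturbed until positive-definite.

    grad_fn(w) -- gradient of the least-squares cost at w
    hess_fn(w) -- explicitly computed Hessian of the cost at w
    The add-tau*I rule is a generic stand-in for the minimal-perturbation
    technique mentioned in the text.
    """
    g = grad_fn(w)
    H = hess_fn(w)
    n = H.shape[0]
    tau = 0.0
    while True:
        try:
            # Cholesky succeeds only if H + tau*I is positive-definite.
            np.linalg.cholesky(H + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)      # enlarge the perturbation and retry
    dw = np.linalg.solve(H + tau * np.eye(n), -g)   # Newton step on the modified Hessian
    return w + dw
```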
Performance Advantages

High-performance computers that offer vector-based architectures or parallel computing units promise faster execution for problems that are highly vectorizable or parallelizable. The evaluation of a Hessian is a good example of such a problem. Ideally, by providing an N-fold increase in the computational hardware, a problem could be solved N times faster. In reality, however, most problems cannot be completely vectorized or parallelized, and the remaining serial portion degrades the overall performance, leading to a case of diminishing returns, as described by Amdahl's Law.

Furthermore, computer technology is such that the computing units can typically perform their operations on two operands faster than the memory and bus can supply the operands or digest the result. This has led to a change in strategy in the design of compute-bound algorithms. Now the goal is to perform many operations on a small set of data, repeating across the data set, rather than the previous method of performing a small number of operations across the entire data set, repeating for the required list of operations. In this way the overhead cost of moving the data can be amortized over many operations and, with clever cache architectures, can be hidden entirely. This approach leads to the data set being partitioned into blocks, and hence the name "block-structured algorithms". Preliminary results indicate that block-structured algorithms more closely achieve the ideal of a linear speed-up [Gallivan 90].

Recently LAPACK was released [Anderson 92], a block-structured version of the LINPACK and EISPACK libraries. These portable routines make calls to BLAS3 (the level-3 Basic Linear Algebra Subprograms) and take into account the effects of high-performance architectures, such as vector registers, parallel computing units, and cached memory structures. The BLAS3 routines are matrix-matrix operations, which offer the largest ratio of computation to data movement among the fundamental operations. They complement the earlier BLAS1 and BLAS2 (vector-vector and matrix-vector operations), and all are written in forms optimized for each specific computer architecture, often in the native assembly language.

The combination of these two technologies makes Hessian-based training algorithms a more practical consideration. The increase in their computational expense with network size can be efficiently offset by an increase in the amount of computer hardware hosting the training algorithm. But in addition to simply deriving the Hessians, they must be cast into block-structured form to take maximum advantage of the vector/parallel architectures on which they will be evaluated. Whereas the standard approach to defining the size of a block is to match it to the vector length of the computer's vector registers or the number of units that can compute in parallel, here the layered nature of the network suggests choosing block sizes according to the sizes of the network's layers. This leads to each block being, in general, rectangular in shape and of a different size (see Figure 1).

Figure 1a: A typical 3-layer feed-forward network.

Figure 1b: The block structure of the Hessian for the above network, showing relative block sizes.

When these block sizes are smaller than the computer's vector length (actually pipeline depth) or number of parallel units, there will be some loss of efficiency. However, for large networks, with many nodes per layer, the situation will be reversed, and each block in the Hessian will be partitioned at execution time into a number of smaller, uniformly-sized sub-blocks, as dictated by the "size" of the computer. For improving the training of large networks, the latter case will prevail.
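As a sketch of the bookkeeping implied by Figure 1b, the code below groups the weight vector layer by layer and assembles the Hessian from the resulting rectangular blocks. The evaluate_block routine is hypothetical; it stands in for the symbolically derived per-block expressions, which are not reproduced here, and in practice each block would be evaluated with matrix-matrix (BLAS3-style) operations.

```python
import numpy as np

def hessian_block_layout(layer_weight_counts):
    """Index ranges of the rectangular Hessian blocks obtained by grouping
    the weights layer by layer (cf. Figure 1b).

    layer_weight_counts -- number of weights feeding each layer, e.g.
    [n_in * n_hid, n_hid * n_out] for the network of Figure 1a.
    """
    offsets = np.concatenate(([0], np.cumsum(layer_weight_counts)))
    layout = {}
    for i in range(len(layer_weight_counts)):
        for j in range(len(layer_weight_counts)):
            layout[(i, j)] = (slice(offsets[i], offsets[i + 1]),
                              slice(offsets[j], offsets[j + 1]))
    return layout

def assemble_hessian(layer_weight_counts, evaluate_block):
    """Assemble the full Hessian block by block.

    evaluate_block(i, j) is a hypothetical routine returning the dense
    (n_i x n_j) block of second derivatives for layer pair (i, j).
    """
    n = int(sum(layer_weight_counts))
    H = np.zeros((n, n))
    for (i, j), (rows, cols) in hessian_block_layout(layer_weight_counts).items():
        H[rows, cols] = evaluate_block(i, j)
    return H
```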
Note that since a gradient is a one-dimensional object, it does not lend itself to a block-structured form, and at best can be represented in terms of vector-vector and matrix-vector operations. Therefore, gradient-based training algorithms will enjoy some degree of speed-up when executed on vector or parallel architectures, but since algorithms based on BLAS1 and BLAS2 routines cannot completely hide the cost of moving data, their speed-up will still be less than that of block-structured Hessian-based algorithms.

Preliminary Network Training Performance Results for a Sensor Processing Application

The approach for aggressive network training was applied to a sensor processing application, in a prototype that has achieved encouraging preliminary results. The application involves separating closely-spaced objects in focal-plane sensor data processing for a missile defense system. A target ideally produces a single point of illumination, but in reality the point is blurred somewhat by the optics of the sensor, causing it to illuminate several detectors (the blur circle). As a result, the subsequent data processing must reconstitute the original point's amplitude and location on the basis of the signals actually received from the detectors illuminated by the point's blur circle. Occasionally, two point-targets are in close proximity and their blur circles overlap on the focal plane. For this "closely-spaced objects" (CSO) problem, the task of reconstituting the two original points' locations and intensities is much more challenging. It is difficult to develop accurate estimators based on simple linear combinations of the detector returns, due to the complexity of the patterns possible with overlapping blur circles.

The CSO problem is an example of an inverse function approximation problem. It is possible to train a feed-forward neural network to map from detector return space to blur circle parameter space. Its input is the actual detector return values, and its output is the corresponding blur circle parameters.

Figure 2 shows a typical network training session, plotting training set and testing set cost function values vs. the number of epochs in the training exercise. This network has 8 input units, 15 hidden units, and 6 output units. The training and testing sets had 378 and 7497 members, respectively. Note the sudden drops in the cost function values at certain iterations. This is an example of how the aggressive full-Newton training algorithm quickly takes advantage of encountered opportunities to drastically improve the network's level of training.

Figure 2: Network training session with 15 hidden units, sampled at a density of 3 (cost function value vs. training epochs).

Figure 3 shows the results of training networks with different numbers of hidden units on the one-parameter version of the CSO problem. For a given network size, as the number of examples of the function used in the training set grows, the training set cost function increases (the problem becomes harder to learn) and the testing set cost function values decrease (the function is more carefully specified), the two asymptotically approaching each other as the function specification becomes sufficiently dense. For different network sizes, the cost function value at which the training and testing curves meet is different, with larger nets reaching lower values. Furthermore, with increased network size, the function to be learned must be sampled to a higher density before the training and testing set cost function values meet. Both of these effects would be expected, since a network with a larger hidden layer has a greater capacity for learning a complicated mapping to a better accuracy.
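For concreteness, before turning to the multi-parameter results, the sketch below shows the kind of mapping these networks compute, using the 8-15-6 architecture of Figure 2 and a least-squares cost over the training set. The sigmoid hidden units and linear output units are assumptions made for illustration; the activation functions actually used are not stated here.

```python
import numpy as np

def forward(detector_returns, W1, b1, W2, b2):
    """Two-layer feed-forward map from 8 detector returns to 6 blur-circle
    parameters (the Figure 2 network).  Sigmoid hidden units and linear
    outputs are illustrative assumptions."""
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ detector_returns + b1)))  # 15 hidden units
    return W2 @ hidden + b2                                       # 6 output units

def least_squares_cost(inputs, targets, W1, b1, W2, b2):
    """Least-squares cost summed over a training set of
    (detector returns, blur-circle parameters) pairs."""
    residuals = [forward(x, W1, b1, W2, b2) - t for x, t in zip(inputs, targets)]
    return 0.5 * sum(float(r @ r) for r in residuals)
```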
Figure 4 shows a similar plot for networks trained simultaneously on all six CSO parameters. These networks had 5, 10, and 15 hidden units. Note the same basic shape of these curves as in the previous figure. The smallest network reaches its limit at about 0.41, for a sampling density of 4. The medium-sized network achieves about 0.37 for a density of 5, and the largest network, although not yet in its asymptotic limit, does suggest it will continue the trend in asymptotic cost function value and sampling density.

Figure 3: The results of training networks with differing numbers of hidden units on the one-parameter version of the CSO problem (training and testing set cost function values vs. number of members in the training set).

These results were accomplished through the development of a full-Newton training algorithm for the two-layer feed-forward neural network. It was written in double-precision FORTRAN and was developed to execute on an Intel i860XR pipelined processor (11 LINPACK double-precision megaFLOPS), implemented as an add-in board to a 486-based PC. The two major components of the execution time taken by the training program were the calculation of the Hessian and several matrix-manipulation algorithms. The training algorithm was able to utilize the i860 pipeline and cache by applying the blocking techniques discussed earlier when calculating the Hessian, and it utilized several LAPACK routines for the matrix calculations.

The training algorithm readily lends itself to parallelization. The calculation of the Hessian, which is performed over the entire training set, can be spread across several processors, each dedicated to a portion of the training set. The auxiliary matrix calculations implemented in LAPACK routines can be parallelized at the DO-loop level, given a compiler (e.g., Alliant) that can recognize such structures. We anticipate investigating these opportunities in the near future.

Figure 4: Network performance vs. problem size (cost function value vs. training set sampling density).

References

[Anderson 92] Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., and Sorensen, D. (1992). LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia.

[Dennis 83] Dennis, J. Jr., and Schnabel, R. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall.

[Gallivan 90] Gallivan, K., et al. (1990). Parallel Algorithms for Matrix Computations. Society for Industrial and Applied Mathematics.

[McClelland 86] McClelland, J., Rumelhart, D., and the PDP Research Group. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press.