
From: AAAI Technical Report SS-93-04. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.
Parallel Neural Network Training
Ed P. Andert Jr. and Thomas J. Bartolac
Conceptual Software Systems Inc.
P.O. Box 727, Yorba Linda, CA 92686-0727, U.S.A.
Phone: (714) 996-2935, Fax: (714) 572-1950
email: andert@orion.oac.uci.edu
Introduction
Connectionist approaches in artificial intelligence can benefit from implementations that take advantage of massive parallelism. The network topologies encountered in connectionist work often involve such massive parallelism. This paper discusses additional parallelism utilized prior to the actual network implementation. Parallelism is used to facilitate aggressive neural network training algorithms that enhance the accuracy of the resulting network. The computational expense of this training approach is minimized through the use of massive parallelism. The training algorithms are being implemented in a prototype for a challenging sensor processing application.
Our neural network training approach produces networks whose accuracy is superior to that of typical connectionist applications. The approach addresses function complexity, network capacity, and training algorithm aggressiveness. Symbolic computing tools are used to develop algebraic representations for the gradient and Hessian of the least-squares cost functions for feed-forward network topologies with fully-connected and locally-connected layers. These representations are then transformed into block-structured form. They are then integrated with a full-Newton network training algorithm and executed on vector/parallel computers.
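The symbolic tools themselves are not described here; the following toy sketch, which uses SymPy as a stand-in and assumes a tanh activation, illustrates the kind of derivation involved: the gradient and Hessian of the least-squares cost for a one-input, one-hidden-unit, one-output network.

import sympy as sp

# Symbols for one training example (input x, target t) and two weights.
x, t, w1, w2 = sp.symbols('x t w1 w2')
output = w2 * sp.tanh(w1 * x)                   # assumed tanh activation
cost = sp.Rational(1, 2) * (output - t) ** 2    # least-squares cost, one example

weights = [w1, w2]
gradient = [sp.diff(cost, w) for w in weights]  # algebraic gradient terms
hessian = sp.hessian(cost, weights)             # 2 x 2 matrix of second derivatives

For real network topologies the same procedure yields the algebraic terms that are then cast into block-structured form.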
Gradient-based Training
Most researchers train with the "back-propagation" algorithm [McClelland 86], a simplified version of gradient descent. The name comes from the concept that as a given input flows forward through the network, the resulting output is compared with the desired output, and that error is propagated backward through the network, generating the changes to the network's weights as it progresses.
One advantage of back-prop is that it only requires the calculation of the training cost function gradient. The layered structure of the network allows the gradient to be expanded by the chain rule, resulting in each node being able to compute "its own" partial derivatives and weight changes based solely on local information, such as the error propagated to it from the previous layer.
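A minimal sketch of that local computation, assuming tanh units (the activation function is not fixed by this paper), is given below; each layer forms its own weight gradient and the error signal it passes on using only locally available quantities.

import numpy as np

def layer_backward(delta_out, W, h_in, h_out):
    # delta_out : error signal arriving from the layer above
    # W         : this layer's weight matrix
    # h_in      : this layer's input activations, saved from the forward pass
    # h_out     : this layer's output activations, h_out = tanh(W @ h_in)
    delta = delta_out * (1.0 - h_out ** 2)   # tanh'(a) = 1 - tanh(a)**2
    grad_W = np.outer(delta, h_in)           # partial derivatives for local weights
    delta_prev = W.T @ delta                 # error propagated to the layer below
    return grad_W, delta_prev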
Unfortunately, back-prop is not efficient. It is notoriously slow, especially as the solution is approached, because the gradient vanishes at the solution. Superior methods exist, such as conjugate gradient, which still only require the cost function gradient (and a few additional global values) to be computed. These methods converge more quickly, but inevitably slow down as the solution is approached.
This slow-down is more than a matter of patience, however. It also indicates a lack of "aggressiveness". In attempting to learn a complicated mapping, an unaggressive training algorithm may cause the network to learn only the more obvious features of the mapping. In this case, the network will never achieve an arbitrarily high accuracy in approximating the mapping, no matter how complicated the
network may be. This is similar to Simpson's rule for numerical integration, in which decreasing the step size increases accuracy only up to a certain limit, after which further step-size decreases only cause round-off error to corrupt the solution.
Hessian-based Training
More aggressive training algorithms rely on the Hessian of the cost function. This allows the network to learn the more subtle features of a complicated mapping, since the Hessian is able to represent the more subtle relationships between pairs of weights in the network. The full-Newton method directly calculates the Hessian, while quasi-Newton methods approximate the Hessian or its inverse. These algorithms converge more quickly, especially as the solution is approached, since the Hessian does not vanish at the solution.
Far from the solution, the Hessian is not always positive-definite, and the Newton step cannot be assumed to be a descent direction. For this reason it is necessary to modify the full-Newton method by adjusting the Hessian until it is positive-definite. Different methods for doing this have been developed [Dennis 83], all with the goal of minimizing the amount by which the Hessian is perturbed. Using a technique we have developed, the Hessian can be perturbed by a very small amount, which varies as the training progresses.
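Our perturbation schedule is not detailed here, but the generic idea can be sketched as follows (Python/NumPy for illustration only; our implementation is in FORTRAN): grow a small multiple of the identity until a Cholesky factorization succeeds, then take the resulting modified Newton step.

import numpy as np

def modified_newton_step(grad, hess, tau0=1e-3):
    # Perturb the Hessian with tau * I, increasing tau until the matrix is
    # positive-definite (i.e. until the Cholesky factorization succeeds).
    n = hess.shape[0]
    tau = 0.0
    while True:
        try:
            L = np.linalg.cholesky(hess + tau * np.eye(n))
            break
        except np.linalg.LinAlgError:
            tau = max(2.0 * tau, tau0)       # grow the perturbation
    # Solve (H + tau * I) step = -grad using the triangular factors.
    y = np.linalg.solve(L, -grad)
    return np.linalg.solve(L.T, y)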
In exchange for the aggressiveness and superior convergence rates, these algorithms are more computationally expensive per iteration. Furthermore, the expense grows with network size at a rate faster than that for gradient-based techniques, typically by a factor of N for a network having N weights.
A further hindrance for full-Newton methods is the need to explicitly calculate the Hessian. Since approximating it numerically is risky for complicated cost functions, researchers must derive the analytic expressions for each of the terms in the Hessian. There are of order 2L^2 algebraically unique terms in the Hessian for a network with L adjacently-connected hidden layers.
Finally, computing a Hessian requires global information, since each element refers to a different pair of weights. Therefore a Hessian cannot be mapped easily onto the existing topology of its network, as can be done for a gradient, and so typically it is calculated "off-line".
These disadvantages have deterred many from Hessian-based approaches, except for the simplest and smallest of networks. This, in turn, has limited the application of neural networks as solution estimators for difficult function evaluations or inverse problems.
Two technologies exist that can remove the limitations to Hessian-based learning algorithms. These are vector/parallel computer architectures and block-structured algorithms.
These technologies make Hessian-based training algorithms a more practical consideration. The increase in their computational expense with network size is reduced by the commensurate increase in efficiency with which they can be calculated on vector/parallel architectures. The overall growth, of order N^3 for a network of N weights, must still be paid, but now the researcher can choose to have a factor of N, or even N^2, of the cost paid in computer hardware, with the remainder paid in execution time.
Performance Advantages
High-performance computers that offer vector-based architectures or parallel computing units promise faster execution for problems that are highly vectorizable or parallelizable. The evaluation of a Hessian is a good example of such a problem. Ideally, by providing an N-fold increase in the computational hardware, a problem could be solved N times faster.
In reality, however, most problems cannot be completely vectorized or parallelized, and the remaining serial portion degrades the overall performance, leading to a case of diminishing returns, as described by Amdahl's Law.
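For reference, if a fraction p of the work can be parallelized across N computing units, Amdahl's Law bounds the overall speed-up at

    S(N) = 1 / ((1 - p) + p / N),

which approaches 1 / (1 - p) no matter how large N becomes.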
Furthermore, computer technology is such that the computing units can typically perform their operations on two operands faster than the memory and bus can supply the operands or digest the result. This has led to a change in strategy in the design of compute-bound algorithms. Now the goal is to perform many operations on a small set of data, repeating across the data set, rather than the previous method of performing a small number of operations across the entire data set, repeating for the required list of operations.
In this way the overhead cost of moving the data can be amortized over many operations and, with clever cache architectures, can be hidden entirely. This approach leads to the data set being partitioned into blocks, and hence the name "block-structured algorithms". Preliminary results indicate that block-structured algorithms more closely achieve the ideal of a linear speed-up [Gallivan 90].
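The pattern is easiest to see in a blocked matrix product (an illustrative Python/NumPy sketch, not our implementation): each pair of tiles brought into fast memory is reused for many multiply-adds before new data must be fetched.

import numpy as np

def blocked_matmul(A, B, block=64):
    # Tiled computation of C = A @ B; the block size would be chosen to fit
    # the machine's fast memory (64 here is an arbitrary example).
    m, k = A.shape
    _, n = B.shape
    C = np.zeros((m, n))
    for i in range(0, m, block):
        for j in range(0, n, block):
            for p in range(0, k, block):
                # Many operations are performed on each pair of tiles before
                # they are evicted from cache or the vector registers.
                C[i:i+block, j:j+block] += A[i:i+block, p:p+block] @ B[p:p+block, j:j+block]
    return C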
Recently LAPACK was released [Anderson 92], a block-structured version of the LINPACK and EISPACK libraries. These portable routines make calls to BLAS3 (Level 3 of the Basic Linear Algebra Subprograms), and take into account the effects of high-performance architectures, such as vector registers, parallel computing units, and cached memory structures.
The BLAS3 routines are matrix-matrix operations, which have the largest ratio of computation to data movement among the fundamental operations. They complement the earlier BLAS1 and BLAS2 routines (vector-vector and matrix-vector operations), and all are written in forms optimized for each specific computer architecture, often in the native assembly language.
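As a small illustration (using SciPy's Python wrappers for the BLAS rather than the FORTRAN interface, with a hypothetical stacked Jacobian J of residuals with respect to the weights), a single BLAS3 call forms the Gauss-Newton portion of a least-squares Hessian:

import numpy as np
from scipy.linalg.blas import dgemm

# Hypothetical stacked per-example Jacobian: rows are training examples,
# columns are network weights.
J = np.random.rand(378, 150)

# One matrix-matrix (BLAS3) call computes J^T J; such routines perform
# O(m * n^2) arithmetic while touching only O(m * n) data.
H_gn = dgemm(alpha=1.0, a=J, b=J, trans_a=1)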
The combination of these two technologies makes Hessian-based training algorithms a more practical consideration. The increase in their computational expense with network size can be efficiently offset by an increase in the amount of computer hardware hosting the training algorithm.
But in addition to simply deriving the Hessians, they must be cast into block-structured form to take maximum advantage of the vector/parallel architectures they will be evaluated on. Whereas the standard approach to defining the size of a block is to match it with the vector length of the computer's vector registers or the number of units that can compute in parallel, here the layered nature of the network suggests choosing block sizes according to the sizes of the network's layers. This leads to blocks that are, in general, rectangular in shape and of different sizes (see figure 1).
Figure 1a: A typical 3-layer feed-forward network.

Figure 1b: The block structure of the Hessian for the above network, showing relative block sizes.
When these block sizes are smaller than the computer's vector length (actually pipeline depth) or number of parallel units, there will be some loss of efficiency. However, for large networks, with many nodes per layer, the situation will be reversed, and each block in the Hessian will be partitioned at execution time into a number of smaller, uniformly-sized sub-blocks, as dictated by the "size" of the computer. For improving the training of large networks, the latter case will prevail.
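A small sketch of this partitioning, assuming the 8-15-6 network described later, ignoring bias weights, and taking a vector length of 32 purely as an example:

layers = [8, 15, 6]                    # assumed layer widths (8-15-6 network)

# Weights feeding each non-input layer define the Hessian's block boundaries.
weights_per_layer = [layers[i] * layers[i + 1] for i in range(len(layers) - 1)]
# -> [120, 90]

# Block (a, b) of the Hessian couples the weights of layers a and b, so the
# blocks are rectangular and of different sizes.
block_shapes = [(na, nb) for na in weights_per_layer for nb in weights_per_layer]
# -> [(120, 120), (120, 90), (90, 120), (90, 90)]

def sub_block_sizes(n, vec=32):
    # At execution time each block dimension is tiled to the machine's vector length.
    return [min(vec, n - i) for i in range(0, n, vec)]
# sub_block_sizes(120) -> [32, 32, 32, 24]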
Note that since a gradient is a one-dimensional object, it does not lend itself to a block-structured form, and at best can be represented in terms of vector-vector and matrix-vector operations. Therefore, gradient-based training algorithms will enjoy some degree of speed-up when executed on vector or parallel architectures, but since algorithms based on BLAS1 and BLAS2 routines cannot completely hide the cost of moving data, their speed-up will still be less than that of block-structured Hessian-based algorithms.
Preliminary Network Training Performance Results for a Sensor Processing Application
The approach for aggressive network training was applied to a sensor processing application. This application was a prototype achieving encouraging preliminary results. The application involves separating closely-spaced objects for missile defense system focal-plane sensor data processing. The target ideally produces a single point of illumination, but in reality the point is blurred somewhat by the optics of the sensor. This causes the point to illuminate several detectors (the blur circle). As a result, the subsequent data processing must reconstitute the original point's amplitude and location on the basis of the signals actually received from the detectors illuminated by the point's blur circle. Occasionally, two point-targets are in close proximity, and their blur circles overlap on the focal plane. For this "closely-spaced objects" (CSO) problem, the task of reconstituting the original two points' locations and intensities is much more challenging. It is difficult to develop accurate estimators based on simple linear combinations of the detector returns, due to the complexity of the patterns possible with overlapping blur circles.
The CSO problem is an example of an inverse function approximation problem. It is possible to train a feed-forward neural network to map from detector return space to blur circle parameter space. Its input is the actual detector return values, and its output is the corresponding blur circle parameters.
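A sketch of the estimator's form (Python/NumPy, with a tanh hidden layer and linear outputs assumed, since the activation functions are not specified here):

import numpy as np

def forward(x, W1, b1, W2, b2):
    # x: detector return values (8 inputs in the network described below);
    # returns the estimated blur-circle parameters (6 outputs).
    h = np.tanh(W1 @ x + b1)          # hidden layer (15 units in the example)
    return W2 @ h + b2                # linear output layer

def training_cost(examples, W1, b1, W2, b2):
    # Least-squares cost over a set of (detector returns, parameters) pairs.
    return 0.5 * sum(
        np.sum((forward(x, W1, b1, W2, b2) - y) ** 2) for x, y in examples
    )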
Figure 2 shows a typical network training session. It shows training set and testing set cost function values vs. the number of epochs in the training exercise. This network has 8 input units, 15 hidden units, and 6 output units. The training and testing sets had 378 and 7497 members, respectively. Note the sudden drop in the cost function values at certain iterations. This is an example of how the aggressive
full-Newton training algorithm quickly takes advantage of encountered opportunities to drastically improve the network's level of training.
Figure 2: Network training session with 15 hidden units sampled at a density of 3 (training and testing set cost function values vs. training epochs).
Figure 3 shows the results of training networks with different numbers of hidden units on the one-parameter version of the CSO problem. For a given network size, as more examples of the function are used in the training set, the training set cost function increases (the problem is getting harder to learn) and the testing set cost function values decrease (there is a more careful specification of the function), asymptotically approaching each other as the function specification becomes sufficiently dense.
For different network sizes, the cost function value at which the training and testing sets meet is different, with larger nets resulting in lower values. Furthermore, with increased network size, the function to be learned must be sampled to a higher density before the training and testing set cost function values meet. Both of these effects would be expected, since a network with a larger hidden layer has a greater capacity for learning a complicated mapping to a better accuracy.
Figure 4 shows a similar plot for networks simultaneously trained on the six CSO parameters. These networks had 5, 10, and 15 hidden units. Note the same basic shape to these curves as in the previous figure. The smallest network reaches its limit at about 0.41, for a sampling density of 4. The medium-sized network achieves about 0.37 for a density of 5, and the largest network, although not yet at its asymptotic limit, does suggest it will continue the trend in asymptotic cost function value and sampling density.
Figure 3: The results of training networks with differing numbers of hidden units on the one-parameter version of the CSO problem (cost function values vs. number of members in the training set).
These results were accomplished through the development of a full-Newton training algorithm for the two-layer feed-forward neural network. It was written in double-precision FORTRAN, and was developed to execute on an Intel i860XR pipelined processor (11 LINPACK DP megaFLOPS), implemented as an add-in board to a 486-based PC. The two major components of the execution time taken by the training program were the calculation of the Hessian and several matrix-manipulation algorithms. The training algorithm was able to utilize the i860 pipeline and cache by applying the blocking techniques discussed earlier when calculating the Hessian, and it utilized several LAPACK routines for the matrix calculations.
The training algorithm readily lends itself to parallelization. The calculation of the Hessian, which is performed over the entire training set, can be spread across several processors, each dedicated to a portion of the training set. The auxiliary matrix calculations implemented in LAPACK routines can be parallelized at the DO-loop level, given a compiler (e.g., Alliant) that can recognize such structures. We anticipate investigating these opportunities in the near future.
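A sketch of the first of these opportunities (Python for illustration only, with example_jacobian standing in for the network-specific derivative code, and the Gauss-Newton sum J^T J used in place of the full-Newton terms): each worker accumulates the Hessian contribution of its portion of the training set, and the partial Hessians are then summed.

import numpy as np
from multiprocessing import Pool

def example_jacobian(x, y):
    # Hypothetical stand-in for the network-specific per-example derivatives.
    raise NotImplementedError

def hessian_chunk(chunk):
    # Hessian contribution of one portion of the training set.
    H = None
    for x, y in chunk:
        J = example_jacobian(x, y)
        H = J.T @ J if H is None else H + J.T @ J
    return H

def parallel_hessian(training_set, n_workers=4):
    # Each worker handles its own portion of the training set; the total
    # Hessian is the sum of the partial results.
    chunks = [training_set[i::n_workers] for i in range(n_workers)]
    with Pool(n_workers) as pool:
        partials = pool.map(hessian_chunk, chunks)
    return sum(partials)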
Figure 4: Network performance vs. problem size (cost function value vs. training set sampling density).
References
[Anderson 92]
Anderson, E., Bai, Z., Bischof, C., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Ostrouchov, S., and Sorensen, D. (1992). LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia.

[Dennis 83]
Dennis, J. Jr., and Schnabel, R. (1983). Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Prentice-Hall.

[Gallivan 90]
Gallivan, K., et al. (1990). Parallel Algorithms for Matrix Computations. Society for Industrial and Applied Mathematics.

[McClelland 86]
McClelland, J., Rumelhart, D., and the PDP Research Group. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press.