Complex Systems 2 (1988) 39-44
Scaling Relationships in Back-propagation Learning
Gerald Tesauro
Bob Janssens
Center for Complex Systems Research, University of Illinois at Urbana-Champaign
508 South Sixth Street, Champaign, IL 61820, USA
Abstract. We present an empirical study of the required training time for neural networks to learn to compute the parity function using the back-propagation learning algorithm, as a function of the number of inputs. The parity function is a Boolean predicate whose order is equal to the number of inputs. We find that the training time behaves roughly as 4^n, where n is the number of inputs, for values of n between 2 and 8. This is consistent with recent theoretical analyses of similar algorithms. As a part of this study we searched for optimal parameter tunings for each value of n. We suggest that the learning rate should decrease faster than 1/n, the momentum coefficient should approach 1 exponentially, and the initial random weight scale should remain approximately constant.
One of the most important current issues in the theory of supervised learning in neural networks is the subject of scaling properties [4,13], i.e., how well the network behaves (as measured by, for example, the training time or the maximum generalization performance) as the computational task becomes larger and more complicated. There are many possible ways of measuring the size or complexity of a task; however, there are strong reasons to believe that the most useful and most important indicator is the predicate order k, as defined by Minsky and Papert [6]. The exact technical definition of the predicate order of a Boolean function is somewhat obscure; however, an intuitive way of understanding it is as the minimal number of input units which can provide at least partial information as to the correct output state. While analytic results for scaling dependence on k have not been obtained for back-propagation or Boltzmann machine learning, analysis of related algorithms [1,2,14,15] suggests that the minimum training time, and the minimum number of distinct training patterns, should increase exponentially with the order k, probably as c^k, where c is some numerical constant, and possibly as badly as ~ n^k, where n is the number of input units.
In this brief technical note, we report an empirical measurement of the scaling of training time with predicate order for the back-propagation learning paradigm. The task chosen for training the networks was the parity
function, i.e., the function which returns 1 if an odd number of input units are on, and 0 if an even number of input units are on. We chose the parity function because of its simple definition and symmetry properties, because an analytic solution for the network weights is known [9], and because n-bit parity is known to have predicate order k = n. This can be seen from the intuitive definition of predicate order. For n-bit parity, no subset of input units of size less than n provides any information as to the correct answer; this must be determined by observing all n input values. Thus, the parity function has the maximum possible predicate order, and in that sense should be representative of the hardest possible functions to learn using standard back-propagation learning.
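For concreteness, the target function and the full set of 2^n training patterns can be written down in a few lines. The sketch below is our own illustration, not part of the original paper; the function names are arbitrary.

```python
from itertools import product

def parity(bits):
    """Return 1 if an odd number of inputs are on, 0 if an even number are on."""
    return sum(bits) % 2

def parity_patterns(n):
    """All 2^n input patterns for n-bit parity, paired with their target outputs."""
    return [(bits, parity(bits)) for bits in product((0, 1), repeat=n)]

# Example: the 8 training patterns for 3-bit parity.
for inputs, target in parity_patterns(3):
    print(inputs, "->", target)
```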
Our empirical study was carried out as follows. For each value of n examined, we set up a back-propagation network with n input units, 2n hidden units, and 1 output unit, with full connectivity from inputs to hiddens and from hiddens to output. We chose to use 2n hidden units, because the minimum required number of hidden units can be shown to be n [9], and with more than the minimal number the problem of trapping in local minima should be reduced [8]. The initial weights were set to random values on a scale r, which was varied as a part of the study. The 2^n training patterns were then cycled through in order, and the weights were adjusted after each pattern. The learning rate and momentum parameters ε and α, as defined in [12], were also varied in this study. The training continued until the network performed correctly on each of the 2^n possible patterns, with correct performance being defined as coming within a margin of 0.1 of the correct answer. If the network had not reached perfect performance by a certain cutoff time, the training run was abandoned, and the network was reported to be stuck in a local minimum.
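The procedure just described translates into a short training loop. The sketch below is a minimal reconstruction, not the authors' simulator: it assumes standard logistic units, adds bias terms (not discussed in the paper), and uses the usual momentum update rule (as in [7,12]); the function name and the default cutoff of 10,000 epochs are our own choices.

```python
import numpy as np
from itertools import product

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_parity(n, eps, alpha, r, max_epochs=10000, seed=0):
    """Online back-propagation on n-bit parity: n inputs, 2n hidden units, 1 output.

    Returns the number of epochs needed to bring every pattern within 0.1 of its
    target, or None if the run is abandoned (reported as stuck in a local minimum).
    """
    rng = np.random.default_rng(seed)

    # Training set: all 2^n patterns; target is 1 for odd parity, 0 for even.
    X = np.array(list(product((0.0, 1.0), repeat=n)))
    T = X.sum(axis=1) % 2

    # Weights (and bias terms, an assumption) initialized uniformly on the scale [-r, r].
    W1 = rng.uniform(-r, r, (2 * n, n))
    b1 = rng.uniform(-r, r, 2 * n)
    W2 = rng.uniform(-r, r, 2 * n)
    b2 = rng.uniform(-r, r)
    dW1, db1 = np.zeros_like(W1), np.zeros_like(b1)
    dW2, db2 = np.zeros_like(W2), 0.0

    for epoch in range(1, max_epochs + 1):
        for x, t in zip(X, T):                     # cycle through patterns in order
            h = sigmoid(W1 @ x + b1)               # hidden activations
            y = sigmoid(W2 @ h + b2)               # output activation
            d_out = (t - y) * y * (1.0 - y)        # output error signal
            d_hid = d_out * W2 * h * (1.0 - h)     # back-propagated hidden error
            # Momentum update: new step = eps * gradient + alpha * previous step.
            dW2 = eps * d_out * h + alpha * dW2
            db2 = eps * d_out + alpha * db2
            dW1 = eps * np.outer(d_hid, x) + alpha * dW1
            db1 = eps * d_hid + alpha * db1
            W2 += dW2; b2 += db2; W1 += dW1; b1 += db1
        # Correct performance: every pattern within a margin of 0.1 of its target.
        out = sigmoid(sigmoid(X @ W1.T + b1) @ W2 + b2)
        if np.all(np.abs(out - T) < 0.1):
            return epoch
    return None
```

For example, `train_parity(2, eps=30, alpha=0.94, r=2.5)` runs a single n = 2 trial near the reported optimal settings; absolute epoch counts will differ in detail from the published averages, since the details of the simulator differ.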
The question of parameter tuning is highly non-trivial in this study. When comparing results for two different networks of size n and n', it is not at all clear a priori what setting of parameter values constitutes a fair comparison. We decided to search for, at each value of n, the values of ε, α, and r which gave the optimal average training time. This necessitated trying several different combinations of parameter settings empirically, but it also produced interesting results as to how the optimal parameter values scale with increasing n.
Early in the study, we found that the learning of the parity function is an extremely noisy stochastic process, with wide variations in the training time from run to run over many orders of magnitude. In order to achieve good statistics on the average training time, it was necessary to do 10,000 learning runs for each data point (n, ε, α, r). This produced average training times with at least two significant figures. There is an additional problem in that for the learning runs which become stuck in local minima, the training time is infinite, and thus a direct averaging of training times will not work. We defined the average training time to be the inverse of the average training rate, where individual training rates were defined as the inverse of individual training times. With this definition, there is no problem in computing the average, since the training rate for a run stuck in a local minimum is zero.
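In code, this definition amounts to averaging rates rather than times, with a stuck run contributing a rate of zero. A minimal sketch follows; representing stuck runs by an infinite training time is our own encoding.

```python
import math

def average_training_time(times):
    """Inverse of the mean training rate; runs stuck in local minima have
    infinite training time and therefore contribute a rate of zero."""
    rates = [0.0 if math.isinf(t) else 1.0 / t for t in times]
    return 1.0 / (sum(rates) / len(rates))

# Example: three successful runs and one run stuck in a local minimum.
print(average_training_time([80.0, 120.0, 100.0, math.inf]))
```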
n    ε_opt      α_opt        r_opt       T_opt           T_opt/2^n   T_opt/4^n
2    30 ± 2     .94 ± .01    2.5 ± .2    95 ± 2          24          5.9
3    20 ± 2     .96 ± .01    2.5 ± .5    265 ± 5         33          4.1
4    12 ± 2     .97 ± .02    2.8 ± .5    1200 ± 60       75          4.5
5    7 ± 1      .98 ± .01    2.5 ± .5    4100 ± 300      130         4.0
6    3 ± 1      .985 ± .01   [2.5]       20000 ± 2500    310         4.9
7    [2]        [.99]        [2.5]       [100000]        800         6.1
8    [1.5]      [.99]        [2.0]       [500000]        2000        7.6

Table 1: Measured optimal average training times and optimal parameter settings for n-bit parity as a function of the number of inputs n. The optimized learning parameters were: learning rate ε, momentum constant α, and initial random weight scale r. For 2 ≤ n ≤ 5, 10,000 runs at each combination of parameter settings were used to compute the average. For 6 ≤ n ≤ 8, only a few hundred runs were made. Quantities in brackets were not optimized.
Our results are presented in table 1. Note that the average training time increases extremely rapidly with increasing n. Since there is a further increase in simulation time proportional to the number of weights in the network (~ n^2), it became extremely difficult to achieve well-optimized, accurate training times beyond n = 5. (We estimate that ~ 10^14 logic operations were used in the entire study, taking several weeks of CPU time on a Sun workstation.) For the final two values of n, no attempt was made to find the optimal parameter values, and the reported optimal training times are probably only accurate to within 50%. Also note that when the average training times are divided by 2^n (this gives the required number of repetitions of each training pattern), the resulting figures are still definitely increasing with increasing n; thus the training time increases faster than 2^n. Dividing by 4^n, however, produces figures which are roughly constant. This provides suggestive evidence (which is by no means conclusive) that the training time increases roughly as ~ 4^n. This would be consistent with analysis of related algorithms. Volper and Hampson [1,2] discuss several related learning algorithms which utilize a mechanism for adding new hidden nodes to the network, and which are capable of learning arbitrary Boolean functions. For hard problems such as parity, these algorithms generally require training times ~ 2^n, with ~ 2^n hidden nodes; thus training times ~ 4^n with a fixed polynomial number of hidden nodes are reasonable.
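The comparison described above is a simple arithmetic check on the central values of T_opt from table 1; the following sketch merely reproduces the last two columns of that table.

```python
# Measured optimal average training times from table 1 (central values only).
T_opt = {2: 95, 3: 265, 4: 1200, 5: 4100, 6: 20000, 7: 100000, 8: 500000}

for n, T in T_opt.items():
    # T/2^n keeps growing with n, while T/4^n stays roughly constant (about 4-8),
    # suggesting a training time of roughly 4^n.
    print(f"n={n}: T/2^n = {T / 2**n:7.1f}   T/4^n = {T / 4**n:5.1f}")
```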
Regarding the issue of optimal parameter values, we discuss each in turn. The optimal value of ε appears to fall off somewhat faster than 1/n, possibly as n^(-3/2) or n^(-2). Hinton [3] suggests on the basis of fan-in arguments that the learning rate should decrease as 1/n. Our results support the notion that the learning rate should decrease with increasing fan-in, although the exact rate of decrease is undetermined.
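This falloff can be checked directly against the measured values of ε_opt in table 1. The sketch below compares them with candidate power laws, each normalized (our choice of normalization) to match the measurement at n = 2.

```python
# Optimal learning rates from table 1 (central values; n = 7, 8 were not optimized).
eps_opt = {2: 30, 3: 20, 4: 12, 5: 7, 6: 3, 7: 2, 8: 1.5}

for n, eps in eps_opt.items():
    # Candidate power laws, each normalized to agree with the measurement at n = 2.
    print(f"n={n}: measured {eps:5.1f}   1/n: {60 / n:5.1f}   "
          f"n^-3/2: {30 * (2 / n) ** 1.5:5.1f}   n^-2: {120 / n**2:5.1f}")
```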
The optimal momentum coefficient appears to approach 1 as n increases. We cannot be sure whether the rate of approach is exponential or polynomial;
α \ ε        16        20        24        28        32        36        40        44
.98       118.0     103.3     104.0     102.7
         (.0106)   (.0205)   (.0672)   (.0988)
.96       139.8     101.2      97.6      95.9     101.9     111.9
         (.0122)   (.0142)   (.0418)   (.0509)   (.1047)   (.1733)
.94                 102.0      95.1      95.0      95.0      98.7     106.1
                   (.0097)   (.0194)   (.0255)   (.0425)   (.0759)   (.1291)
.92                 108.3      99.2     105.8     101.6      99.5      99.6     102.6
                   (.0105)   (.0117)   (.0136)   (.0179)   (.0287)   (.0482)   (.0698)
.90                 116.6     112.2     129.7                         124.4
                   (.0080)   (.0096)   (.0235)                       (.0389)
.88                 116.6     152.0
                   (.0109)   (.0219)

Table 2: Average training times, and in parentheses, fraction of runs stuck in local minima, for 2-bit parity (XOR) as a function of learning rate ε and momentum coefficient α. In each case, the initial random weight scale was r = 2.5.
however, an exponential approach would be consistent with the following simple argument. The quantity (1 - α) is essentially a decay term which indicates how long the weight change due to a particular pattern persists. In other words, the use of a momentum term is in some sense equivalent to computing a weight update by averaging over a number of patterns. If the decay rate decreases exponentially, say as 2^(-n), this would indicate that the optimal number of patterns to average over increases as 2^n, or that the fraction of patterns, in comparison to all possible patterns, remains constant.
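The equivalence invoked here can be made concrete by unrolling the momentum recursion Δw(t) = ε g(t) + α Δw(t-1): each update is an exponentially weighted sum of past per-pattern gradients, with an effective window of roughly 1/(1 - α) patterns. A small numerical sketch follows; the gradient sequence is arbitrary illustrative data, not taken from the simulations.

```python
import numpy as np

def momentum_updates(grads, eps, alpha):
    """Unroll dW(t) = eps * g(t) + alpha * dW(t-1) over a gradient sequence."""
    dW, out = 0.0, []
    for g in grads:
        dW = eps * g + alpha * dW
        out.append(dW)
    return np.array(out)

rng = np.random.default_rng(1)
grads = rng.normal(size=200)          # stand-in per-pattern gradients
eps, alpha = 0.1, 0.95

updates = momentum_updates(grads, eps, alpha)

# The same update written explicitly as an exponentially weighted sum over the past:
# dW(t) = eps * sum_k alpha^k g(t-k), i.e. an average over ~1/(1-alpha) patterns.
t = 150
weights = alpha ** np.arange(t + 1)
explicit = eps * np.sum(weights * grads[t::-1])
print(updates[t], explicit)           # identical up to round-off
print("effective window ~", 1.0 / (1.0 - alpha), "patterns")
```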
Finally, regarding the optimal scale of the initial random weights, we were surprised to find that this parameter remained approximately constant as n increased. We would have expected some sort of decrease in the random weight scale. For example, one could argue that the average level of activation of a hidden unit due to n random weights from the input units would be of order n^(1/2); thus the weight scale should decrease as n^(-1/2) to keep the input constant. However, this does not appear to be the case empirically.
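The n^(1/2) estimate in this argument is the usual random-walk scaling of a sum of n independent terms. A quick numerical illustration, entirely our own, with random 0/1 inputs and weights uniform on [-r, r]:

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2.5  # fixed weight scale, as found empirically optimal

for n in (2, 4, 8, 16, 32):
    # Net input to one hidden unit: n weights uniform on [-r, r], random 0/1 inputs.
    w = rng.uniform(-r, r, size=(100000, n))
    x = rng.integers(0, 2, size=(100000, n))
    net = (w * x).sum(axis=1)
    print(f"n={n:2d}: std of net input = {net.std():5.2f}  (grows like r * sqrt(n))")
```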
It is also interesting to note what happens as the parameters are varied from their optimal values. For all three parameters we find that as the parameter is increased, the learning goes faster for runs which do not get stuck in local minima. However, once the parameter goes beyond the optimal value, there is a rapid increase in the fraction of learning runs which do become stuck. Thus the parameter values we report are essentially the largest possible values at which one can operate without getting stuck in local minima an unacceptable percentage of the time. Table 2 provides an illustration of how the average training time and the fraction of runs stuck in local minima behave as the parameters ε and α are varied, for the case n = 2.
The apparent scaling behaviors we have found for the training times and
parameter values are somewhat counter-intuitive, and motivate the search for an analytic explanation. We also point out that the requirement that the network learn the parity function perfectly is a severe one, which may have contributed to the extremely poor scaling behavior. It would be interesting to measure the training time needed to reach some fixed level of performance less than 100%, although we conjecture that such training times would still increase exponentially.
In conclusion, while it is difficult to extrapolate to the large-n limit on the basis of values of n between 2 and 8, we do have suggestive evidence that the required training time for the back-propagation algorithm increases exponentially with the predicate order of the computation being trained. This would indicate that earlier enthusiastic projections of the use of back-propagation to learn arbitrarily complicated functions were overly optimistic. The learning of such functions, while no longer impossible as with perceptron networks, still remains effectively intractable. Thus back-propagation, and probably all known supervised learning algorithms for connectionist networks, should be used only for learning of low-order computations. For high-order problems such as connectedness and parity, alternative procedures should be investigated.
Acknowledgements
We thank Terry Sejnowski for providing the back-propagation simulator code and for helpful discussions. This work was partially supported by the National Center for Supercomputing Applications.
References
[1] S. E. Hampson and D. J. Volper, "Linear function neurons: structure and training," Biological Cybernetics, (1986) 203-217.
[2] S. E. Hampson and D. J. Volper, "Disjunctive models of Boolean category learning," Biological Cybernetics, (1987) 121-137.
[3] D. C. Plaut, S. J. Nowlan, and G. E. Hinton, "Experiments on learning by back propagation," Carnegie-Mellon University, Department of Computer Science Technical Report CMU-CS-86-126 (1986).
[4] G. E. Hinton, "Connectionist learning procedures," Carnegie-Mellon University, Department of Computer Science Technical Report CMU-CS-87-115 (1987).
[5] Y. Le Cun, "A learning procedure for asymmetric network," Proceedings of Cognitiva (Paris), 85 (1985) 599-604.
[6] M. Minsky and S. Papert, Perceptrons, (MIT Press, Cambridge, MA, 1969).
[7] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, D. E. Rumelhart and J. L. McClelland, eds., (MIT Press, 1986) 318-362.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, 323 (1986) 533-536.
[9] D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations, (MIT Press, 1986).
[10] D. E. Rumelhart and J. L. McClelland (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 2: Psychological and Biological Models, (MIT Press, 1986).
[11] T. J. Sejnowski, P. K. Kienker, and G. E. Hinton, "Learning symmetry groups with hidden units: beyond the perceptron," in Evolution, Games and Learning: Models for Adaptation in Machines and Nature, D. Farmer et al., eds., (North-Holland, 1986) 260-275.
[12] T. J. Sejnowski and C. R. Rosenberg, "Parallel Networks that Learn to Pronounce English Text," Complex Systems, 1 (1987) 145-168.
[13] G. Tesauro, "Scaling relationships in back-propagation learning: dependence on training set size," Complex Systems, 1 (1987) 367-372.
[14] L. G. Valiant, "A theory of the learnable," Communications of the ACM, 27:11 (1984) 1134-1142.
[15] L. G. Valiant, "Learning disjunctions of conjunctions," Proceedings of the International Joint Conference on Artificial Intelligence, (1985) 560-566.