Complex Systems 2 (1988) 39-44

Scaling Relationships in Back-propagation Learning

Gerald Tesauro
Bob Janssens
Center for Complex Systems Research, University of Illinois at Urbana-Champaign,
508 South Sixth Street, Champaign, IL 61820, USA

Abstract. We present an empirical study of the required training time for neural networks to learn to compute the parity function using the back-propagation learning algorithm, as a function of the number of inputs. The parity function is a Boolean predicate whose order is equal to the number of inputs. We find that the training time behaves roughly as 4^n, where n is the number of inputs, for values of n between 2 and 8. This is consistent with recent theoretical analyses of similar algorithms. As part of this study we searched for optimal parameter tunings for each value of n. We suggest that the learning rate should decrease faster than 1/n, the momentum coefficient should approach 1 exponentially, and the initial random weight scale should remain approximately constant.

One of the most important current issues in the theory of supervised learning in neural networks is the subject of scaling properties [4,13], i.e., how well the network behaves (as measured by, for example, the training time or the maximum generalization performance) as the computational task becomes larger and more complicated. There are many possible ways of measuring the size or complexity of a task; however, there are strong reasons to believe that the most useful and most important indicator is the predicate order k, as defined by Minsky and Papert [6]. The exact technical definition of the predicate order of a Boolean function is somewhat obscure; however, an intuitive way of understanding it is as the minimal number of input units which can provide at least partial information as to the correct output state. While analytic results for the scaling dependence on k have not been obtained for back-propagation or Boltzmann machine learning, analysis of related algorithms [1,2,14,15] suggests that the minimum training time, and the minimum number of distinct training patterns, should increase exponentially with the order k, probably as c^k, where c is some numerical constant, and possibly as bad as ~ n^k, where n is the number of input units.

In this brief technical note, we report an empirical measurement of the scaling of training time with predicate order for the back-propagation learning paradigm. The task chosen for training the networks was the parity function, i.e., the function which returns 1 if an odd number of input units are on, and 0 if an even number of input units are on. We chose the parity function because of its simple definition and symmetry properties, because an analytic solution for the network weights is known [9], and because n-bit parity is known to have predicate order k = n. This can be seen from the intuitive definition of predicate order: for n-bit parity, no subset of input units of size less than n provides any information as to the correct answer; the answer must be determined by observing all n input values. Thus, the parity function has the maximum possible predicate order, and in that sense should be representative of the hardest possible functions to learn using standard back-propagation learning.
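The intuitive argument is easy to check mechanically. The following short Python sketch (an illustration added here, not part of the original study) verifies for n = 4 that fixing the values of any proper subset of the inputs leaves the two parity classes exactly equally likely, so that no subset of fewer than n inputs provides even partial information about the output.

    # Illustration: n-bit parity has predicate order n, because conditioning on
    # any proper subset of the inputs leaves the output exactly 50/50.
    from itertools import combinations, product

    def parity(bits):
        return sum(bits) % 2

    n = 4
    patterns = list(product([0, 1], repeat=n))
    for k in range(1, n):                       # every proper subset size
        for subset in combinations(range(n), k):
            for values in product([0, 1], repeat=k):
                # all patterns consistent with fixing this subset of inputs
                consistent = [p for p in patterns
                              if all(p[i] == v for i, v in zip(subset, values))]
                outputs = [parity(p) for p in consistent]
                assert sum(outputs) * 2 == len(outputs)   # exactly half are 1
    print("no proper subset of inputs gives partial information about parity")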
Our empirical study was carried out as follows. For each value of n examined, we set up a back-propagation network with n input units, 2n hidden units, and 1 output unit, with full connectivity from inputs to hiddens and from hiddens to output. We chose to use 2n hidden units because the minimum required number of hidden units can be shown to be n [9], and with more than the minimal number the problem of trapping in local minima should be reduced [8]. The initial weights were set to random values on a scale r, which was varied as a part of the study. The 2^n training patterns were then cycled through in order, and the weights were adjusted after each pattern. The learning rate and momentum parameters ε and α, as defined in [12], were also varied in this study. The training continued until the network performed correctly on each of the 2^n possible patterns, with correct performance being defined as coming within a margin of 0.1 of the correct answer. If the network had not reached perfect performance by a certain cutoff time, the training run was abandoned, and the network was reported to be stuck in a local minimum.

The question of parameter tuning is highly non-trivial in this study. When comparing results for two different networks of size n and n', it is not at all clear a priori what settings of the parameter values constitute a fair comparison. We decided to search, at each value of n, for the values of ε, α, and r which gave the optimal average training time. This necessitated trying several different combinations of parameter settings empirically, but it produced additional interesting results as to how the optimal parameter values scale with increasing n.

Early in the study, we found that the learning of the parity function is an extremely noisy stochastic process, with wide variations in the training time from run to run over many orders of magnitude. In order to achieve good statistics on the average training time, it was necessary to do 10,000 learning runs for each data point (n, ε, α, r). This produced average training times with at least two significant figures. There is an additional problem in that for the learning runs which become stuck in local minima, the training time is infinite, and thus a direct averaging of training times will not work. We defined the average training time to be the inverse of the average training rate, where individual training rates were defined as the inverse of individual training times. With this definition, there is no problem in computing the average, since the training rate for a run stuck in a local minimum is zero.

     n   ε_opt      α_opt        r_opt      T_opt          T_opt/2^n   T_opt/4^n
     2   30 ± 2     .94 ± .01    2.5 ± .2   95 ± 2         24          5.9
     3   20 ± 2     .96 ± .01    2.5 ± .5   265 ± 5        33          4.1
     4   12 ± 2     .97 ± .02    2.8 ± .5   1200 ± 60      75          4.5
     5   7 ± 1      .98 ± .01    2.5 ± .5   4100 ± 300     130         4.0
     6   3 ± 1      .985 ± .01   [2.5]      20000 ± 2500   310         4.9
     7   [2]        [.99]        [2.5]      [100000]       800         6.1
     8   [1.5]      [.99]        [2.0]      [500000]       2000        7.6

Table 1: Measured optimal average training times and optimal parameter settings for n-bit parity as a function of the number of inputs n. The optimized learning parameters were: learning rate ε, momentum constant α, and initial random weight scale r. For 2 ≤ n ≤ 5, 10,000 runs at each combination of parameter settings were used to compute the average. For 6 ≤ n ≤ 8, only a few hundred runs were made. Quantities in brackets were not optimized.
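As a concrete illustration, the following Python sketch summarizes the training procedure described above. It is an illustrative reconstruction, not the simulator used in the study: it assumes standard sigmoid units and the per-pattern update rule of [7] with learning rate ε and momentum α, and the cutoff value, the point at which performance is checked, and all function names are arbitrary choices made for this sketch.

    # A minimal sketch of one training run on n-bit parity: n inputs -> 2n hidden
    # -> 1 output, initial weights uniform on [-r, r], patterns cycled in order,
    # weights updated after each pattern with learning rate eps and momentum
    # alpha, success = every output within 0.1 of its target.  The cutoff of
    # 10,000 cycles is an arbitrary choice; the paper does not state its value.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def parity_patterns(n):
        """All 2^n binary input patterns and their parity targets."""
        X = np.array([[(i >> b) & 1 for b in range(n)] for i in range(2 ** n)],
                     dtype=float)
        return X, X.sum(axis=1) % 2

    def train_once(n, eps, alpha, r, cutoff=10000, rng=None):
        """One run; returns the training time in pattern presentations,
        or None if the run is abandoned as stuck in a local minimum."""
        rng = np.random.default_rng() if rng is None else rng
        X, y = parity_patterns(n)
        W1 = rng.uniform(-r, r, size=(2 * n, n + 1))   # hidden weights (+ bias)
        W2 = rng.uniform(-r, r, size=(1, 2 * n + 1))   # output weights (+ bias)
        dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)
        for cycle in range(1, cutoff + 1):
            for x, t in zip(X, y):                     # cycle patterns in order
                xb = np.append(x, 1.0)
                h = sigmoid(W1 @ xb)
                hb = np.append(h, 1.0)
                o = sigmoid(W2 @ hb)[0]
                delta_o = (t - o) * o * (1.0 - o)      # back-propagated errors
                delta_h = (W2[0, :-1] * delta_o) * h * (1.0 - h)
                dW2 = eps * delta_o * hb + alpha * dW2
                dW1 = eps * np.outer(delta_h, xb) + alpha * dW1
                W2 += dW2
                W1 += dW1
            # check performance once per cycle through the training set
            correct = True
            for x, t in zip(X, y):
                h = sigmoid(W1 @ np.append(x, 1.0))
                o = sigmoid(W2 @ np.append(h, 1.0))[0]
                if abs(o - t) >= 0.1:
                    correct = False
                    break
            if correct:
                return cycle * len(X)                  # presentations, as in Table 1
        return None                                    # stuck run

Under these assumptions, averaging many calls such as train_once(2, 30.0, 0.94, 2.5) with the rate-based average described in the text should behave qualitatively like the n = 2 row of Table 1, although nothing in this sketch is tuned to reproduce the published numbers exactly.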
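The averaging convention described above, the average training time taken as the inverse of the average training rate with stuck runs contributing a rate of zero, fits in a few lines. This too is only an illustrative sketch:

    # Average training time = inverse of the average training rate; a run that
    # was abandoned as stuck (recorded here as None) contributes a rate of zero.
    def average_training_time(times):
        rates = [0.0 if t is None else 1.0 / t for t in times]
        return len(rates) / sum(rates)

    # For example, four runs with one stuck run:
    print(average_training_time([80.0, 100.0, 120.0, None]))   # about 129.7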
Our results are presented in Table 1. Note that the average training time increases extremely rapidly with increasing n. Since there is a further increase in simulation time proportional to the number of weights in the network (~ n^2), it became extremely difficult to achieve well-optimized, accurate training times beyond n = 5. (We estimate that ~ 10^14 logic operations were used in the entire study, taking several weeks of CPU time on a Sun workstation.) For the final two values of n, no attempt was made to find the optimal parameter values, and the reported optimal training times are probably only accurate to within 50%.

Also note that when the average training times are divided by 2^n (this gives the required number of repetitions of each training pattern), the resulting figures are still definitely increasing with increasing n; thus the training time increases faster than 2^n. Dividing by 4^n, however, produces figures which are roughly constant. This provides suggestive evidence (which is by no means conclusive) that the training time increases roughly as ~ 4^n. This would be consistent with analysis of related algorithms. Volper and Hampson [1,2] discuss several related learning algorithms which utilize a mechanism for adding new hidden nodes to the network, and which are capable of learning arbitrary Boolean functions. For hard problems such as parity, these algorithms generally require training times ~ 2^n, with ~ 2^n hidden nodes; thus training times ~ 4^n with a fixed polynomial number of hidden nodes are reasonable.

Regarding the issue of optimal parameter values, we discuss each in turn. The optimal value of ε appears to fall off somewhat faster than 1/n, possibly as n^{-3/2} or n^{-2}. Hinton [3] suggests, on the basis of fan-in arguments, that the learning rate should decrease as 1/n. Our results support the notion that the learning rate should decrease with increasing fan-in, although the exact rate of decrease is undetermined.

The optimal momentum coefficient appears to approach 1 as n increases. We cannot be sure whether the rate of approach is exponential or polynomial; however, an exponential approach would be consistent with the following simple argument. The quantity (1 - α) is essentially a decay term which indicates how long the weight change due to a particular pattern persists. In other words, the use of a momentum term is in some sense equivalent to computing a weight update by averaging over a number of patterns. If the decay rate decreases exponentially, say as 2^{-n}, this would indicate that the optimal number of patterns to average over increases as 2^n, or that the fraction of all possible patterns being averaged over remains constant.
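To make the decay argument concrete, the following short sketch (a numerical check added here, not taken from the original study) verifies that the momentum recurrence of [7] accumulates an exponentially weighted sum of past per-pattern gradients, with weight α^k on a gradient contributed k presentations ago and hence an effective averaging window of roughly 1/(1 - α) presentations.

    # Momentum as an exponentially weighted average over past patterns.
    # The value alpha = 0.94 is simply the optimal n = 2 setting from Table 1.
    import numpy as np

    rng = np.random.default_rng(0)
    eps, alpha = 1.0, 0.94
    grads = rng.normal(size=200)        # stand-in per-pattern gradient values

    dw = 0.0                            # recurrence: dw <- eps*g + alpha*dw
    for g in grads:
        dw = eps * g + alpha * dw

    # closed form: eps * sum_k alpha^k * g_{t-k}, identical up to round-off
    weights = alpha ** np.arange(len(grads))[::-1]
    assert np.isclose(dw, eps * np.sum(weights * grads))

    print("effective averaging window ~ 1/(1 - alpha) =", 1.0 / (1.0 - alpha))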
Finally, regarding the optimal scale of the initial random weights, we were surprised to find that this parameter remained approximately constant as n increased. We would have expected some sort of decrease in the random weight scale. For example, one could argue that the average level of activation of a hidden unit due to n random weights from the input units would be of order n^{1/2}; thus the weight scale should decrease as n^{-1/2} to keep the input to the hidden units constant. However, this does not appear to be the case empirically.

It is also interesting to note what happens as the parameters are varied from their optimal values. For all three parameters we find that, as the parameter is increased, the learning goes faster for runs which do not get stuck in local minima. However, once the parameter goes beyond its optimal value, there is a rapid increase in the fraction of learning runs which do become stuck. Thus the parameter values we report are essentially the largest possible values at which one can operate without getting stuck in local minima an unacceptable percentage of the time. Table 2 provides an illustration of how the average training time and the fraction of runs stuck in local minima behave as the parameters ε and α are varied, for the case n = 2.

            α = .88        α = .90        α = .92        α = .94        α = .96        α = .98
    ε = 16  139.8 (.0122)  118.0 (.0106)
    ε = 20  116.6 (.0109)  116.6 (.0080)  108.3 (.0105)  102.0 (.0097)  101.2 (.0142)  103.3 (.0205)
    ε = 24  152.0 (.0219)  112.2 (.0096)   99.2 (.0117)   95.1 (.0194)   97.6 (.0418)  104.0 (.0672)
    ε = 28  129.7 (.0235)  105.8 (.0136)   95.0 (.0255)   95.9 (.0509)  102.7 (.0988)
    ε = 32  101.6 (.0179)   95.0 (.0425)  101.9 (.1047)
    ε = 36   99.5 (.0287)   98.7 (.0759)  111.9 (.1733)
    ε = 40  124.4 (.0389)   99.6 (.0482)  106.1 (.1291)
    ε = 44  102.6 (.0698)

Table 2: Average training times and, in parentheses, the fraction of runs stuck in local minima, for 2-bit parity (XOR) as a function of the learning rate ε and the momentum coefficient α. In each case the initial random weight scale was r = 2.5. Blank entries correspond to parameter combinations that were not measured.

The apparent scaling behaviors we have found for the training times and parameter values are somewhat counter-intuitive, and they motivate the search for an analytic explanation. We also point out that the requirement that the network learn the parity function perfectly is a severe one, which may have contributed to the extremely poor scaling behavior. It would be interesting to measure the training time needed to reach some fixed level of performance less than 100%, although we conjecture that such training times would still increase exponentially.

In conclusion, while it is difficult to extrapolate to the large-n limit on the basis of values of n between 2 and 8, we do have suggestive evidence that the required training time for the back-propagation algorithm increases exponentially with the predicate order of the computation being trained. This would indicate that earlier enthusiastic projections of the use of back-propagation to learn arbitrarily complicated functions were overly optimistic. The learning of such functions, while no longer impossible as with perceptron networks, still remains effectively intractable. Thus back-propagation, and probably all known supervised learning algorithms for connectionist networks, should be used only for learning of low-order computations. For high-order problems such as connectedness and parity, alternative procedures should be investigated.

Acknowledgements

We thank Terry Sejnowski for providing the back-propagation simulator code and for helpful discussions. This work was partially supported by the National Center for Supercomputing Applications.

References
Volper, "Linear functio n neurons: st ructu re and training," Biological Cy bernetics, (1986) 203-217. [2] S. E. Hampson and D. J . Volper, "Disju ncti ve models of boolean catego ry learnin g," Biological Cy bernetics, (1987) 121-137. [3] D. C. Plau t , S. J . Nowlan and G. E. Hinton , "Experiments on learn ing by back propagatio n," Carnegie-Mellon University, Departm ent of Computer Science Technical Repor t ( 1986) CMU-CS-86-126. [4] G. E. Hinton , "Connectionist learning procedures," Carnegie-Mellon University, Department of Compute r Science Technical Report (1987) CMU·CS-87115. [5J Y. Le Cun , II. A learni ng procedure for asymme tric network ," P roceedings of Cognitiva (Paris), 85 (1985) 599-604. [61 M. Minsky and S. Papert , Perceptcon s, (MIT P ress, Cambridge, MA, 1969). [7] D. E. Rumelhar t , G. E. Hinton , an d R. J . Williams, "Learn ing intern al represent ations by error prop aga tion," in Parallel Dist ributed Processing: 44 Gerald Tesauro and Bob Jans sens Explorations in the Microstru cture of Cognition . Vol. 1: Foundat ions, D. E. Rum elh art and J. 1. McClelland eds., (MIT Press, 1986) 318-362. [8J D. E. Rum elhart, G. E. Hinton , and R. J. Williams, "Learn ing representa tions by back-p rop agating errors ," Nat ure, 323 (1986) 533-536. [9] D. E. Rumelhart an d J. 1. McClellan d (eds.), Parallel Distributed Pro cessing: Explorations in th e Microstru cture of Cognition. Vol. 1: Found ations, (MIT P ress, 1986). [10] D. E . Rumelha rt an d J. L. McClellan d (ed s.) , Parallel Distribu ted Pro cessing: Explorations in the Mic rost r uctur e of Cognition . Vol. 2: P sychological and Biological Models, (M IT P ress, 1986). [11] T. J. Sejnow ski, P. K . Kienk er and G . E. Hinton , "Lea rning symme try groups with hidd en units: beyond the percept ron," in Ev olut ion, Games and Learning : Models for A daptat ion in Machines and Nat ure, D. Fa.rmer et al. eds., (Nort h-Hollan d, 1986) 260-275. [12] T . J. Sejnow ski and C. R. Rosenberg, "Par allel Net works that Learn to Pronou nce En glish Text ," Comp lex Syst ems, 1 (1987) 145-1 68. [13] G. Tesauro, "Scaling relationshi ps in back-propagatio n learning: dependence on t raining set size," Complex Sy stems, 1 (1987) 367-372. [14] 1. G . Valiant , (I A theory of the learnabl e," Communications of tbe A CM , 27 :11 (1984) 1134- 1142. [15] L. G. Valiant , "Learning disju ncti ons of conjunction s," Proceedings of the Int ernat ional Joint Conference on Ar tificial In telligence, (1985) 560-566.