BEBR Faculty Working Paper No. 91-0114

Learning in a Partially Hard-Wired Recurrent Network

C.-M. Kuan
Department of Economics
University of Illinois at Urbana-Champaign

K. Hornik
Institut für Statistik und Wahrscheinlichkeitstheorie
Technische Universität Wien, Vienna, Austria

College of Commerce and Business Administration
Bureau of Economic and Business Research
University of Illinois at Urbana-Champaign
February 1991

Abstract

In this paper we propose a partially hard-wired Elman network. A distinct feature of our approach is that only minor modifications of existing on-line and off-line learning algorithms are necessary in order to implement the proposed network. This allows researchers to adapt easily to trainable recurrent networks. Given this network architecture, we show that in a general dynamic environment the standard back-propagation estimates for the learnable connection weights can converge to a mean squared error minimizer with probability one and are asymptotically normally distributed.

1 Introduction

Neural network models have been successfully applied in a wide variety of disciplines. Typically, applications of networks with at least partially modifiable interconnection strengths are based on the so-called multilayer feedforward architecture, in which all signals are transmitted in one direction without feedbacks.
In a dynamic context, however, a feedforward network may have difficulties in representing certain sequential behavior when its inputs are not sufficient to characterize temporal features of target sequences (Jordon, 1985). From the cognitive point of view, a feedforward network can perform only passive cognition, in that outputs cannot be adjusted by an internal mechanism when static inputs are present (Norrod, O'Neill, & Gat, 1987). These deficiencies thus restrict the applicability of feedforward neural network models in dynamic environments. In view of these problems, researchers have recently been studying recurrent networks, i.e., networks with feedback connections; see, e.g., Jordon (1986), Elman (1988), Williams & Zipser (1988), and Kuan (1989). In a recurrent network, recurrent variables compactly summarize the past information and, together with other input variables, jointly determine the network outputs. Because recurrent variables are generated by the network, they are functions of the network connection weights. Owing to this parameter dependence, the standard back-propagation (BP) algorithm for feedforward networks cannot be applied, because it fails to take the correct gradient search direction (cf. Rumelhart, Hinton, & Williams, 1986). Kuan, Hornik, & White (1990) propose a recurrent BP algorithm generalizing the standard BP algorithm to various recurrent networks. However, this algorithm has quite complex updating equations and restrictions, and therefore cannot be used straightforwardly by recurrent network practitioners. In this paper we suggest an easier way to implement recurrent networks. We focus on a variant of the Elman (1988) network, in which only a subset of hidden unit activations serve as recurrent variables. We propose to hard-wire the connections between the recurrent units and their inputs. This approach has the following advantages.
First, the resulting network avoids the aforementioned problem of parameter dependence. Second, the necessary constraints on recurrent connections suggested by Kuan, Hornik, & White (1990) can easily be imposed by hard-wiring. Third, off-line learning is made possible for the proposed network. Consequently, only minor modifications of existing on-line and off-line learning algorithms are needed. This is very convenient for neural network practitioners. Given this hard-wired network, we show that in general dynamic environments the resulting BP estimates converge to a mean squared error minimizer with probability one and are asymptotically normally distributed. Our convergence results extend the results of Kuan, Hornik, & White (1990) for general recurrent networks and are analogous to the results of Kuan & White (1990) for feedforward networks.

This paper proceeds as follows. In section 2 we briefly review recurrent networks. In section 3 we discuss a variant of the Elman network and its learning algorithms. We establish strong consistency and asymptotic normality of the learning estimates in section 4. Section 5 concludes the paper. Proofs are deferred to the appendix.
2 Recurrent Networks

A three layer recurrent network with k input units, l hidden units with common activation function ψ, and m output units with common activation function φ can be written in the following generic form:

    o_t = Φ(W a_t + v),
    a_t = Ψ(C x_t + D r_{t-1} + b),
    r_t = G(x_{t-1}, r_{t-1}, θ),

where the subscript t indexes time, x_t is the k x 1 vector of network inputs, a_t is the l x 1 vector of hidden unit activations, o_t is the m x 1 vector of network outputs, Φ and Ψ compactly denote the unitwise activation rules in the output respectively hidden layer, and r_t is the n x 1 vector of recurrent variables, which is computed through some generic function G from the previous input x_{t-1}, the previous recurrent variable r_{t-1}, and θ = [vec(C)', vec(D)', vec(W)', b', v']', the vector of all network connection weights. (In what follows, ' denotes transpose, the vec operator stacks the columns of a matrix one underneath the other, and |v| is the euclidean length of a vector v.) More compactly, the above network can be written as

    o_t = Φ(W Ψ(C x_t + D r_{t-1} + b) + v),                      (1)
    r_t = G(x_{t-1}, r_{t-1}, θ).                                 (2)

That is, the recurrent variables r_t and the network output o_t are jointly determined by the external inputs x_t. Clearly, different choices of G yield different recurrent networks. When r_t = o_{t-1} (output feedback),

    G(x_{t-1}, r_{t-1}, θ) = Φ(W Ψ(C x_{t-1} + D r_{t-1} + b) + v),

and we obtain the Jordon (1986) network. When r_t = a_{t-1} (hidden unit activation feedback),

    G(x_{t-1}, r_{t-1}, θ) = Ψ(C x_{t-1} + D r_{t-1} + b),

and we have the Elman (1988) network. By recursive substitution, (2) becomes

    r_t = G(x_{t-1}, r_{t-1}, θ) = G(x_{t-1}, G(x_{t-2}, r_{t-2}, θ), θ) = ... =: R(z^{t-1}, θ),

where z^{t-1} = (x_{t-1}, x_{t-2}, ..., x_0) is the collection of past inputs. Hence, r_t is a complex nonlinear function of θ and the entire past of x_t. In contrast with the external input x_t, we may interpret r_t as "internal" input, in the sense that it is generated by the network.
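As an illustration, the generic form (1)-(2) and the two choices of G above can be sketched in a few lines of code. The sketch below is our own, not part of the original analysis; it assumes logistic squashers for Φ and Ψ and randomly drawn weights, and simply iterates the network over an input sequence.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def recurrent_forward(xs, C, D, W, b, v, feedback="elman"):
    """Iterate the generic three-layer recurrent network (1)-(2).

    feedback="elman":  r_t = a_{t-1} (hidden unit activation feedback, n = l)
    feedback="jordan": r_t = o_{t-1} (output feedback, n = m)
    """
    n = D.shape[1]                  # dimension of the recurrent variable r_t
    r = np.zeros(n)                 # initial recurrent state r_0
    outputs = []
    for x in xs:
        a = sigmoid(C @ x + D @ r + b)        # hidden activations a_t
        o = sigmoid(W @ a + v)                # network outputs o_t
        r = a if feedback == "elman" else o   # the choice of G
        outputs.append(o)
    return np.array(outputs)

# toy dimensions: k=2 inputs, l=3 hidden units, m=1 output (Elman: n = l)
rng = np.random.default_rng(0)
k, l, m = 2, 3, 1
C, D = rng.normal(size=(l, k)), rng.normal(size=(l, l))
W, b, v = rng.normal(size=(m, l)), rng.normal(size=l), rng.normal(size=m)
xs = rng.normal(size=(10, k))
os_ = recurrent_forward(xs, C, D, W, b, v, feedback="elman")
print(os_.shape)   # (10, 1)
```

Note that because r_t feeds back through G, every output depends on the whole input history, which is exactly the parameter dependence discussed next.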
Given a recurrent network, the standard BP algorithm for feedforward networks does not perform correct gradient search over the parameter space, because it fails to take the dependence of r_t on the learnable network weights into account. Consequently, meaningful convergence cannot be guaranteed (Kuan, 1989). Kuan, Hornik, & White (1990) propose a recurrent BP algorithm which, by carefully calculating the correct gradients and including additional derivative updating equations, maintains the desired gradient search property. To ensure proper convergence behavior, their results also suggest some restrictions on the network connection weights. That is, parameter estimates are projected into some "stability" region whenever they violate the imposed constraints. Thus, much more effort is needed in programming appropriate learning algorithms for recurrent networks. Moreover, some of their conditions to ensure convergence of the recurrent BP algorithm are rather stringent.

3 A Partially Hard-Wired Elman Network

In this section we suggest an easier way to implement a variant of the Elman (1988) network. As we have discussed in section 2, improper convergence of the learning algorithms is mainly due to the dependence of the internal inputs r_t on the modifiable network parameters. To circumvent this problem, we propose to modify the Elman network as depicted in figure 1.

[Figure 1: inputs and previous feedback unit activations feed into the feedforward and feedback hidden units, which in turn feed into the outputs.]

Figure 1. The proposed partially hard-wired recurrent network. Modifiable and hard-wired connections are represented by -> and =>, respectively.

The l hidden units are partitioned into two groups containing l_f and l_r = l - l_f units, respectively, and only the units in the second group serve as recurrent units. Intuitively, the units in the first
group play the standard role in artificial neural networks, whereas the task of the recurrent units is to "index" information on previous inputs. Furthermore, the connections between the recurrent units and their inputs are hard-wired. Hence, a_t is partitioned as a_t = [a_{f,t}', a_{r,t}']', where a_{f,t} is the l_f x 1 vector of activations of the (purely feedforward) hidden units in the first group, and a_{r,t} is the l_r x 1 vector of activations of the feedback (recurrent) hidden units in the second group, such that r_t = a_{r,t-1}. The connection matrices C and D and the bias vector b are partitioned conformably as C = [C_f', C_r']', D = [D_f', D_r']', and b = [b_f', b_r']', so that

    a_{f,t} = Ψ(C_f x_t + D_f a_{r,t-1} + b_f),
    a_{r,t} = Ψ(C_r x_t + D_r a_{r,t-1} + b_r),

where now C_r, D_r, and b_r are fixed due to hard-wiring. Different choices of C_r, D_r, and b_r determine how the past information should be represented; hence they are problem-dependent and should be left to researchers.

Writing the proposed network in a nonlinear functional form, we have

    o_t = Φ(W Ψ(C x_t + D a_{r,t-1} + b) + v) =: F(x_t, a_{r,t-1}, θ),             (3)

and

    r_t = a_{r,t-1} = Ψ(C_r x_{t-1} + D_r a_{r,t-2} + b_r) =: G(x_{t-1}, a_{r,t-2}, θ̄),   (4)

where θ = [vec(W)', vec(C_f)', vec(D_f)', b_f', v']' is the p x 1 vector which contains all the learnable network weights, with p := m(l+1) + l_f(k + l_r + 1), and θ̄ = [vec(C_r)', vec(D_r)', b_r']' contains all the hard-wired weights. By recursive substitution, (4) becomes

    r_t = G(x_{t-1}, r_{t-1}, θ̄) = G(x_{t-1}, G(x_{t-2}, r_{t-2}, θ̄), θ̄) = ... =: R(z^{t-1}, θ̄);

cf. equation (2). Thus, r_t is a function of the entire past of x_t and the hard-wired weights θ̄. Because r_t = a_{r,t-1} is not a function of the learnable weights θ, the aforementioned problem of parameter dependence is avoided. It follows that the standard BP algorithm for feedforward networks is applicable to the proposed network. Letting y_t denote the target pattern presented at time t, the BP algorithm at time t becomes

    θ_{t+1} = θ_t + η_t ∇_θ F(x_t, r_t, θ_t) (y_t − F(x_t, r_t, θ_t)),            (5)

where η_t is the learning rate employed at time t, and ∇_θ F is the matrix of partial derivatives of F with respect to the components of θ. However, in both theory and practice it is necessary to keep the BP estimates θ_t in some compact subset Θ of ℝ^p, thus preventing the entries from becoming extremely large. This, being a typical requirement in the convergence analysis of BP type algorithms (see, e.g., Kuan & White (1990) and Kuan, Hornik, & White (1990)), can, if not automatically guaranteed by the algorithm, be accomplished by applying to the BP estimates a projection operator which maps ℝ^p onto Θ. Usually, a truncation device is convenient for this purpose. This requirement entails little loss because it is usually inactive when very large truncation bounds are imposed.

In light of (5), we only have to modify the existing BP algorithm slightly to incorporate the internal inputs a_r into the algorithm. Furthermore, if a fixed training data set is given, the internal inputs a_{r,t} can be calculated first, and off-line learning methods such as nonlinear least squares can then be applied to estimate the learnable weights θ. These advantages allow researchers to adapt to recurrent networks quite easily. It is then interesting to know the properties of the BP algorithm (5) applied to the proposed network given by (3) and (4). This is the topic to which we now turn.

4 Asymptotic Properties of the BP Algorithm

Let {V_t} be some sequence of random variables defined on a probability space (Ω, ℱ, P), let ℱ_τ^t be the σ-algebra generated by V_τ, V_{τ+1}, ..., V_t, and let {Z_t} be a sequence of square integrable random variables on that probability space. We write E_{t-m}^{t+m}(Z_t) for the conditional expectation E(Z_t | ℱ_{t-m}^{t+m}), and ||Z|| = (E|Z|²)^{1/2} for the L₂ norm.

Definition 4.1. Let

    ν_m := sup_t ||Z_t − E_{t-m}^{t+m}(Z_t)||.
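To make the update (5) concrete, the following sketch (ours, under simplifying assumptions: logistic squashers, scalar output, a small random choice of the hard-wired weights C_r, D_r, b_r, and decaying learning rates with Σ η_t = ∞ and Σ η_t² < ∞) trains the partially hard-wired network on a toy sequence. Because a_{r,t} does not depend on θ, the gradient is the plain feedforward BP gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
k, lf, lr, m = 2, 4, 3, 1          # inputs, feedforward units, recurrent units, outputs

# hard-wired weights theta_bar = (C_r, D_r, b_r): fixed, never updated.
# D_r is small, so |vec(D_r)| stays well below 1/M_psi (cf. lemma 4.2 below).
Cr = 0.1 * rng.normal(size=(lr, k))
Dr = 0.1 * rng.normal(size=(lr, lr))
br = np.zeros(lr)

# learnable weights theta = (W, v, C_f, D_f, b_f)
W  = 0.5 * rng.normal(size=(m, lf + lr))
v  = np.zeros(m)
Cf = 0.5 * rng.normal(size=(lf, k))
Df = 0.5 * rng.normal(size=(lf, lr))
bf = np.zeros(lf)

def bp_step(x, y, r, eta):
    """One step of the BP update (5); r = a_{r,t-1} is treated as a fixed input."""
    global W, v, Cf, Df, bf
    af = sigmoid(Cf @ x + Df @ r + bf)            # feedforward hidden units
    ar = sigmoid(Cr @ x + Dr @ r + br)            # hard-wired recurrent units
    a  = np.concatenate([af, ar])
    o  = sigmoid(W @ a + v)                       # output F(x_t, r_t, theta_t)
    err = y - o
    d_o = err * o * (1 - o)                       # output-layer delta
    d_f = (W[:, :lf].T @ d_o) * af * (1 - af)     # back-propagate to feedforward units only
    W  += eta * np.outer(d_o, a)
    v  += eta * d_o
    Cf += eta * np.outer(d_f, x)
    Df += eta * np.outer(d_f, r)
    bf += eta * d_f
    return ar, float(err @ err)

# toy target: a function of the current first input and the lagged second input
T = 2000
xs = rng.normal(size=(T, k))
ys = sigmoid(xs[:, :1] + np.roll(xs[:, 1:2], 1, axis=0))
r = np.zeros(lr)
errs = []
for t in range(T):
    r, e2 = bp_step(xs[t], ys[t], r, eta=1.0 / (t + 1) ** 0.6)
    errs.append(e2)
print(np.mean(errs[:500]), np.mean(errs[-500:]))   # squared error typically decreases
```

Note that since the internal inputs a_{r,t} never depend on θ, they could equally be computed once for the whole fixed data set and θ then fitted off-line by nonlinear least squares, as discussed above.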
Then {Z_t} is near epoch dependent (NED) on {V_t} of size −a if ν_m = O(m^λ) for some λ < −a, as m → ∞.

This definition conveys the idea that a random variable depends essentially on the information generated by "more or less current" V_t and does not depend too much on the information contained in the distant future or past. The larger the magnitude of the size of ν_m, the faster the dependence on the remote information dies out. More details on near epoch dependence can be found in Billingsley (1968), McLeish (1975), and Gallant & White (1988). The lemma below ensures that the recurrent variables are well behaved and do not have too long a memory.

Lemma 4.2. Let the common hidden unit activation function ψ be bounded and continuously differentiable with bounded first derivative, and let the random variables {r_t} be generated by (4), where the input data {x_t} are NED on a bounded sequence {V_t} of size −a. If |vec(D_r)| < M_ψ^{-1}, where M_ψ := sup_σ |ψ'(σ)|, then {r_t} is NED on {V_t} of size −a.

Remark 1. Notice that if the {x_t} form a sequence of independent variables (which is a special case of an NED sequence), then {r_t} need not necessarily be mixing but is NED of arbitrarily large size; see Gallant & White (1988, pp. 27-31). Hence, introducing the concept of near epoch dependence is not a technical triviality, but a necessity when dealing with feedback networks in stochastic input environments.

In what follows we compactly write the algorithm (5) as

    θ_{t+1} = θ_t + η_t h_t(θ_t),

where z_t = (y_t', x_t')' and h_t(θ) = ∇_θ F(x_t, r_t, θ)(y_t − F(x_t, r_t, θ)). Our consistency result is based on the ordinary differential equation (ODE) method; cf., e.g., Ljung (1977), Kushner & Clark (1978), and Kuan & White (1990). This approach is now well-known in analyzing neural network learning algorithms; cf., e.g., Oja (1982), Oja & Karhunen (1985), Sanger (1989), Kuan & White (1990), Kuan, Hornik, & White (1990), and Hornik & Kuan (1990).
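The condition of lemma 4.2 is easy to check and to visualize numerically. The following sketch (our illustration; logistic squasher, so M_ψ = 1/4) rescales a candidate D_r so that M_ψ |vec(D_r)| = 0.9 < 1 and confirms that two recurrent trajectories started from different initial states contract together, i.e., the recurrent variables forget remote inputs geometrically.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
lr, k = 3, 2
M_psi = 0.25                               # sup |psi'| for the logistic squasher

Dr = rng.normal(size=(lr, lr))
Dr *= 0.9 / (M_psi * np.linalg.norm(Dr))   # rescale: M_psi * |vec(Dr)| = 0.9 < 1
Cr = rng.normal(size=(lr, k))
br = np.zeros(lr)
assert M_psi * np.linalg.norm(Dr) < 1.0    # lemma 4.2's condition (Frobenius = |vec(.)|)

xs = rng.normal(size=(200, k))
r1, r2 = np.zeros(lr), np.ones(lr)         # two different initial recurrent states
gaps = []
for x in xs:
    r1 = sigmoid(Cr @ x + Dr @ r1 + br)    # same inputs, same hard-wired weights
    r2 = sigmoid(Cr @ x + Dr @ r2 + br)
    gaps.append(np.linalg.norm(r1 - r2))
print(gaps[0], gaps[-1])                   # the gap shrinks geometrically
```

The per-step contraction factor is at most M_ψ |vec(D_r)| = 0.9, which is exactly the bound ρ used in the proof of lemma 4.2 in the appendix.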
We need the following notation. Let τ₀ := 0 and, for t ≥ 1, let τ_t := Σ_{i=1}^t η_i. The piecewise linear interpolation of {θ_t} with interpolation intervals {η_t} is

    θ⁰(τ) = θ_t + ((τ − τ_t)/(τ_{t+1} − τ_t))(θ_{t+1} − θ_t),    τ ∈ [τ_t, τ_{t+1}),

and, for each t, its "left shift" is θ^t(τ) := θ⁰(τ_t + τ), τ ≥ 0. Observe in particular that θ^t(0) = θ⁰(τ_t) = θ_t.

We impose the following conditions.

A.1. {V_t} and {z_t} are defined on a complete probability space (Ω, ℱ, P) such that (i) {V_t} is a mixing sequence with mixing coefficients φ_m of size −r/(2(r−1)) or α_m of size −r/(r−2), r > 2, and (ii) the sequence {z_t} is NED on {V_t} of size −1 with sup_t |x_t| ≤ M_x < ∞ and sup_t E(|y_t|⁴) < ∞.

A.2. For the network architecture as specified in (3) and (4), (i) φ and ψ are continuously differentiable of order 3, and ψ is bounded and has bounded first order derivative; (ii) |vec(D_r)| < M_ψ^{-1}, where M_ψ = sup_σ |ψ'(σ)|.

A.3. {η_t} is a sequence of positive real numbers such that Σ_t η_t = ∞ and Σ_t η_t² < ∞.

A.4. For each θ ∈ Θ, h̄(θ) = lim_t E(h_t(θ)) exists.

A.1 allows the data to exhibit a considerable amount of dependence, in the sense that they are functions of the (possibly infinite) history of an underlying mixing sequence. For more details on α- and φ-mixing sequences we refer to White (1984). Assuming that the external inputs x_t are uniformly bounded simplifies some technicalities needed to establish convergence and causes no loss of generality, as pointed out by Kuan & White (1990). Desired generality is assured by allowing the y_t sequence to be unbounded. Note that typical choices for ψ, such as the logistic squasher and the hyperbolic tangent squasher, satisfy A.2(i). A.2(ii) is needed in lemma 4.2 and is the constraint suggested by Kuan, Hornik & White (1990) for general recurrent networks. A.3 is a typical restriction on the learning rates for BP types of algorithms; for example, learning rates of order 1/t satisfy this condition. A.4 is needed to define the associated ODE whose trajectory is the limiting path of the interpolated processes {θ^t(·)}.
The result below follows from corollary 3.5 of Kuan & White (1990).

Theorem 4.3. For the network given by (3) and (4) and the algorithm (5), suppose that assumptions A.1-A.4 hold. Then (a) {θ^t(·)} is bounded and equicontinuous on bounded intervals with probability one, and (b) all limits of convergent subsequences satisfy the ODE θ̇ = h̄(θ). Let Θ* be the set of all (locally) asymptotically stable equilibria of this ODE, and let D(Θ*) ⊆ ℝ^p be the domain of attraction of Θ*. If θ_t enters a compact subset of D(Θ*) infinitely often with probability one, then with probability one, θ_t → Θ* as t → ∞; in particular, if Θ ⊆ D(Θ*), then with probability one, θ_t → Θ*.

Remark 2. Because the elements θ* of Θ* solve the equation lim_t E(h_t(θ)) = 0, they (locally) minimize

    lim_t E|y_t − F(x_t, r_t, θ)|².                               (6)

Theorem 4.3 thus shows that the BP estimates can converge to a mean squared error minimizer with probability one. Note, however, that this minimization is conditional on the hard-wired weights θ̄.

Remark 3. By the Toeplitz lemma, lim_T T^{-1} Σ_{t=1}^T E|y_t − F(x_t, r_t, θ)|² is the same as (6). Therefore, the (on-line) BP estimates converge to the same limit as the (off-line) nonlinear least squares estimator.

Remark 4. As y_t is not required to be bounded, our strong consistency result holds under less stringent conditions than those of Kuan, Hornik & White (1990) for the fully recurrent BP algorithm.

To establish asymptotic normality we consider the algorithm (5) with the specific choice η_t = (t + 1)^{-1}. (Note that no limiting distribution results for BP estimators in recurrent networks have been published thus far; in particular, Kuan, Hornik, & White (1990) give only a consistency result for their recurrent BP algorithm.) Let U_t := (t + 1)^{1/2}(θ_t − θ*) be the sequence of normalized estimates.
The piecewise constant interpolation of {U_t} on [0, ∞) with interpolation intervals {η_t} is defined as U⁰(τ) = U_t, τ ∈ [τ_t, τ_{t+1}), and again, for each t, its "left shift" is defined as U^t(τ) = U⁰(τ_t + τ), τ ≥ 0. Finally, let

    H(θ) := lim_t E[∇_θ h_t(θ)] + I_p/2,

where I_p is the p-dimensional identity matrix. Our result follows from the stochastic differential equation (SDE) approach of Kushner & Huang (1979). In contrast with the ODE approach, the interpolated processes are now shown to converge weakly to the solution paths of a corresponding SDE with respect to the Skorohod topology. For more details on weak convergence we refer to Billingsley (1968). The following conditions suffice for the asymptotic normality result.

B.1. A.1(i) holds, and {z_t} is a stationary sequence NED on {V_t} of size −8 with sup_t |x_t| ≤ M_x < ∞ and sup_t E(|y_t|⁸) < ∞.

B.2. A.2 holds with φ and ψ continuously differentiable of order 4.

B.3. θ* ∈ int(Θ) is such that h̄(θ*) = 0 and all eigenvalues of H(θ*) have negative real parts.

The result below follows from corollary 3.6 of Kuan & White (1990).

Theorem 4.4. Consider the network given by (3) and (4) and the algorithm (5) with η_t = (t + 1)^{-1}; suppose that assumptions B.1-B.3 hold and that with probability one, θ_t → θ* as t → ∞. Then {U^t(·)} converges weakly to the stationary solution of the stochastic differential equation

    dU(τ) = H(θ*) U(τ) dτ + Σ(θ*)^{1/2} dW(τ),

where W denotes the standard p-variate Wiener process and

    Σ(θ*) := lim_t Σ_{j=-∞}^{∞} E[h_t(θ*) h_{t+j}(θ*)'].

In particular,

    (t + 1)^{1/2}(θ_t − θ*) →ᴰ N(0, S(θ*)),

where →ᴰ signifies convergence in distribution and

    S(θ*) := ∫₀^∞ exp(H(θ*)s) Σ(θ*) exp(H(θ*)'s) ds

is the unique solution to the matrix equation H(θ*)S + S H(θ*)' = −Σ(θ*).

Remark 5. If η_t = (t + 1)^{-1} R, where R is a nonsingular p x p matrix, the SDE in theorem 4.4 becomes

    dU(τ) = R H(θ*) U(τ) dτ + R Σ(θ*)^{1/2} dW(τ),

and the covariance matrix of the asymptotic distribution of θ_t becomes R S(θ*) R'.

Remark 6.
If the probability that θ_t converges to θ* is positive but less than one, the above theorem provides the limiting distribution conditional on convergence to θ*. Hence, if Θ* contains only finitely many points, assumption B.3 is satisfied for each θ* ∈ Θ*, and θ_t converges with probability one to one of the elements of Θ*, then the asymptotic distribution of θ_t is a mixture of N(θ*, S(θ*)) distributions, weighted relative to the convergence probabilities.

5 Conclusions

In this paper we propose a partially hard-wired Elman network, in which only a subset of hidden-unit activations is allowed to feed back into the network, and the connections between these hidden units and the input layer are hard-wired. A distinct feature of our approach is that existing on-line and off-line learning algorithms can be slightly modified to implement the proposed network. (Note that off-line learning is not possible for a fully learnable recurrent network.) This is particularly convenient for researchers. Our results also show that the estimates from the standard BP algorithm adapted to this network can converge to a mean squared error minimizer with probability one and are asymptotically normally distributed. These asymptotic properties are analogous to those of the standard and recurrent BP algorithms. As the convergence results in this paper are conditional on the hard-wired connection weights θ̄, the resulting weight estimates are not fully optimal, in contrast with fully learnable recurrent networks. To improve the performance of the proposed network, one can train the network with various hard-wired connection weights and search for the best performing architecture.

Appendix

Lemma A. Let {x_t} be NED on {V_t} of size −a and let the square integrable sequence {r_t} be generated by the recursion

    r_t = G(x_{t-1}, r_{t-1}, θ̄).

Suppose that G(·, r, θ̄) satisfies a Lipschitz condition uniformly in r, i.e., there exists a finite constant L such that for all x₁, x₂,

    |G(x₁, r, θ̄) − G(x₂, r, θ̄)| ≤ L|x₁ − x₂|,

and that G(x, ·, θ̄) is a contraction mapping uniformly in x, i.e., there exists some ρ < 1 such that for all r₁, r₂,

    |G(x, r₁, θ̄) − G(x, r₂, θ̄)| ≤ ρ|r₁ − r₂|.

Then {r_t} is NED on {V_t} of size −a.

Proof.
We first observe that

    ||r_t − E_{t-m}^{t+m}(r_t)||
      = ||G(x_{t-1}, r_{t-1}, θ̄) − E_{t-m}^{t+m}(G(x_{t-1}, r_{t-1}, θ̄))||
      ≤ ||G(x_{t-1}, r_{t-1}, θ̄) − G(E_{t-m}^{t+m-2}(x_{t-1}), E_{t-m}^{t+m-2}(r_{t-1}), θ̄)||
      ≤ ||G(x_{t-1}, r_{t-1}, θ̄) − G(E_{t-m}^{t+m-2}(x_{t-1}), r_{t-1}, θ̄)||
        + ||G(E_{t-m}^{t+m-2}(x_{t-1}), r_{t-1}, θ̄) − G(E_{t-m}^{t+m-2}(x_{t-1}), E_{t-m}^{t+m-2}(r_{t-1}), θ̄)||
      ≤ L ||x_{t-1} − E_{t-m}^{t+m-2}(x_{t-1})|| + ρ ||r_{t-1} − E_{t-m}^{t+m-2}(r_{t-1})||,

where the first inequality follows from the fact that E_{t-m}^{t+m}(G(x_{t-1}, r_{t-1}, θ̄)) is the best mean square predictor of G(x_{t-1}, r_{t-1}, θ̄) among all ℱ_{t-m}^{t+m}-measurable functions, and the second inequality follows from the triangle inequality. Hence, we obtain

    ν_{r,m} ≤ L ν_{x,m-1} + ρ ν_{r,m-1},                          (a1)

where ν_{x,m} and ν_{r,m} are the NED coefficients for {x_t} and {r_t}, respectively. Because {x_t} is NED on {V_t} of size −a, we can find a finite constant C₀ and some λ₀ < −a such that ν_{x,m} ≤ C₀ m^{λ₀}; we must show that ν_{r,m} = O(m^{λ₀}) as m → ∞. By the fact that ρ < 1, we can find m₀ and some σ > 1 such that ρσ < 1 and, for all m ≥ m₀, (m/(m+1))^{λ₀} ≤ σ. Let

    D₀ := max{ C₀ L σ / (1 − ρσ), max_{m ≤ m₀} ν_{r,m} m^{−λ₀} }.

We now prove by induction that ν_{r,m} ≤ D₀ m^{λ₀} for all m ≥ m₀. For m = m₀, this is trivially true by the definition of D₀. Suppose we have already shown that ν_{r,m} ≤ D₀ m^{λ₀}. Then, using (a1),

    ν_{r,m+1} ≤ L C₀ m^{λ₀} + ρ D₀ m^{λ₀}
             = (L C₀ + ρ D₀)(m+1)^{λ₀} (m/(m+1))^{λ₀}
             ≤ (L C₀ + ρ D₀) σ (m+1)^{λ₀}
             ≤ D₀ (m+1)^{λ₀},

completing the induction step and thus the proof of the lemma.

Proof of Lemma 4.2. By boundedness of ψ, the sequence {r_t} generated by (4) is bounded and thus trivially square integrable. Hence, in view of the above lemma A, it suffices to show that G is Lipschitz continuous in x and a contraction mapping in r. As by assumption the first derivative of ψ is uniformly bounded, G is clearly Lipschitz continuous in x with Lipschitz constant L = M_ψ |C_r|. (If A is a matrix, then |A| := max{|Ax| : |x| = 1}.) Similarly, let ∇_r G denote the matrix of partial derivatives of G with respect to r. Note that |∇_r G(x, r, θ̄)| is the square root of the maximal singular value of ∇_r G, and thus by a well-known result from linear algebra,

    |∇_r G(x, r, θ̄)| ≤ (trace(∇_r G(x, r, θ̄) ∇_r G(x, r, θ̄)'))^{1/2} ≤ M_ψ (trace(D_r D_r'))^{1/2} = M_ψ |vec(D_r)| =: ρ.

By assumption, ρ < 1. As clearly

    |G(x, r₁, θ̄) − G(x, r₂, θ̄)| ≤ sup_r |∇_r G(x, r, θ̄)| |r₁ − r₂| ≤ ρ |r₁ − r₂|,

G is a contraction mapping in r, thereby completing the proof of lemma 4.2.

Proof of theorem 4.3. We verify the conditions of corollary 3.5 of Kuan & White (1990), which we shall briefly refer to as [KW]. Their conditions A.4 and C.3 are explicitly assumed (our assumptions A.3 and A.4). It follows from lemma 4.2 that {r_t}, and thus also {ξ_t}, where ξ_t = [x_t', r_t']', are NED bounded sequences on {V_t} of size −1, which establishes condition C.1 of [KW]. Let M_ξ be an upper bound for the bounded sequence {ξ_t} and let K_ξ := {ξ : |ξ| ≤ M_ξ}. Condition C.2 requires that both F(ξ, ·) and ∇_θ F(ξ, ·) satisfy a Lipschitz condition with Lipschitz constants L₁(ξ) and L₂(ξ), respectively, where L₁ and L₂ are Lipschitz continuous in ξ, and that both F(·, θ) and ∇_θ F(·, θ) satisfy a Lipschitz condition. It is straightforward to show that the continuous differentiability of A.2(i) ensures these Lipschitz conditions. See also corollary 4.1 of Kuan & White (1990).

Proof of theorem 4.4. We verify the conditions of corollary 3.6 of [KW]. Lemma 4.2 ensures that {r_t} is NED on {V_t} of size −8.
Stationarity of {x_t} implies that {r_t} is also stationary. Hence, {ξ_t} is a stationary sequence NED on {V_t} of size −8, which establishes condition D.1 of [KW]. Condition D.2 of [KW] follows from B.3 and the moment condition of B.1. Finally, as in the preceding proof, four times continuous differentiability of B.2 ensures the Lipschitz conditions imposed in condition D.3 of [KW]. See also corollary 4.2 of Kuan & White (1990).

References

Billingsley, P. (1968). Convergence of probability measures. New York: Wiley.

Elman, J. L. (1988). Finding structure in time. CRL Report 8801, Center for Research in Language, University of California, San Diego.

Gallant, A. R., & White, H. (1988). A unified theory of estimation and inference for nonlinear dynamic models. Oxford: Basil Blackwell.

Hornik, K., & Kuan, C.-M. (1990). Convergence analysis of local feature extraction algorithms. BEBR Discussion Paper 90-1717, College of Commerce, University of Illinois, Urbana-Champaign.

Jordon, M. I. (1985). The learning of representations for sequential performance. Ph.D. Dissertation, University of California, San Diego.

Jordon, M. I. (1986). Serial order: a parallel distributed processing approach. ICS Report 8604, Institute for Cognitive Science, University of California, San Diego.

Kuan, C.-M. (1989). Estimation of neural network models. Ph.D. thesis, Department of Economics, University of California, San Diego.

Kuan, C.-M., Hornik, K., & White, H. (1990). Some convergence results for learning in recurrent neural networks. Proceedings of the Sixth Yale Workshop on Adaptive and Learning Systems, Ed. K. S. Narendra, New Haven: Yale University, 103-109.

Kuan, C.-M., & White, H. (1990). Recursive M-estimation, nonlinear regression and neural network learning with dependent observations. BEBR Working Paper 90-1703, College of Commerce, University of Illinois, Urbana-Champaign.

Kushner, H. J., & Clark, D. S. (1978). Stochastic approximation methods for constrained and unconstrained systems. New York: Springer Verlag.
Kushner, H. J., & Huang, H. (1979). Rates of convergence for stochastic approximation type algorithms. SIAM Journal on Control and Optimization, 17, 607-617.

Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, AC-22, 551-575.

McLeish, D. (1975). A maximal inequality and dependent strong laws. Annals of Probability, 3, 829-839.

Norrod, F. E., O'Neill, M. D., & Gat, E. (1987). Feedback-induced sequentiality in neural networks. In Proceedings of the IEEE First International Conference on Neural Networks (pp. II: 251-258). San Diego: SOS Printing.

Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267-273.

Oja, E., & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and the eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106, 69-84.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, & The PDP Research Group, Parallel distributed processing: Explorations in the microstructures of cognition, I (pp. 318-362). Cambridge, MA: MIT Press.

Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473.

Williams, R. J., & Zipser, D. (1988). A learning algorithm for continually running fully recurrent neural networks. ICS Report 8805, Institute for Cognitive Science, University of California, San Diego.