Training Feed-Forward Neural Networks with Monotonicity Requirements Antoine Mahul 1 Alexandre Aussem 2 Research Report LIMOS/RR-04-11 June 2004 1 antoine.mahul@isima.fr 2 alex.aussem@isima.fr Abstract In this paper, we adapt the classical learning algorithm for feed-forward neural networks when monotonicity is require in the input-output mapping. Such requirements arise, for instance, when prior knowledge of the process being observed is available. Monotonicity can be imposed by the addition of suitable penalization terms to the error function. The objective function, however, depends nonlinearly on the first-order derivatives of the network mapping. We show that these derivatives can easily be obtained by an extension of the standard back-propagation algorithm procedure. This yields a computationally efficient algorithm with little overhead compared to back-propagation. Keywords: Neural networks, Monotonicity, Non-linear optimization, Penalty methods. Résumé Dans cet article, nous adaptons l’algorithme d’apprentissage classique des réseaux de neurones feedforward lorsque la monotonie est exigée dans la relation à apprendre. De telles exigences apparaissent, par exemple, lorsqu’une connaissance a priori sur le processus observé est disponible. La monotonie peut être imposée en ajoutant des termes de pénalisation adéquats à la fonction d’erreur. Cependant, la fonction objectif dépend alors de façon non-linéaire des dérivées de la fonction représentée par le réseau. Nous montrons que ces dérivées peuvent être aisément obtenues par une extension de l’algorithme standard de rétro-propagation. Cela amène à un algorithme efficace avec un surcoût de calcul faible par rapport à la rétro-propagation. Mots clés : Réseaux de neurones, Monotonie, Optimisation non-linéaire, Méthodes de pénalité. LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 Introduction By virtue of their universal approximation capabilities, feed-forward neural networks are good candidates for the approximation of continuous non-linear mappings. However, inclusion of prior knowledge into the network training can lead to significant improvements in the network performance, especially when the amount of training data is limited. For interpolation problems, monotonicity requirements can easily be imposed by the addition of suitable penalization terms to the error function in much the same way as smoothness is imposed by the adjunction of suitable regularization terms. The new error function, however depends non-linearly on the first-order derivatives of the network mapping, and so the standard back-propagation algorithm cannot be applied. In this paper, we derive a computationally efficient learning algorithm, for a feedforward network of arbitrary topology, which can be used to minimize such penalized error functions. The derivatives with respect to the weights for a multi-layer perceptron are obtained by an extension of the back propagation algorithm procedure. As an example [5], consider the optimization problems arising typically in Traffic Engineering where the overall operational performance of the communication network is to be maximized while respecting some predefined Quality of Service (QoS) requirements. Unfortunately, the QoS, usually expressed in terms of average response time and/or loss rate, is difficult to express analytically in terms of the incoming traffic characteristics. Also, it is particularly appealing to train a MLP as a ”black-box” on simulation data for the evaluation of the QoS values. As delay, loss and jitter are monotonically increasing with respect to the incoming traffic rates, inclusion of this prior knowledge into the training procedure is important. It may also be a stringent requirement for the optimization algorithms to converge. 1 Learning with monotonicity requirements We consider in this paper feed-forward network of arbitrary topology. We first introduce our notations. 1.1 Definitions and notations Let N be the set of neurons. Nin ⊂ N is the subset of input neurons and N out ⊂ N is the subset of output neurons. A is the set of arcs, and w c = wij the weight of an arc c = (i, j) ∈ A. For given input vector x and weight vector w, s i (x, w) and ai (x, w) are the input value and the activity of a neuron i ∈ N . fi is the activation function of a neuron i ∈ N . The value of the input neurons are set to x = {si , i ∈ Nin }. Inputs and outputs are determined by the relations: X si (x, w) = wki ak (x, w) ∀i ∈ / Nin (1) k∈N ,(k,i)∈A ai (x, w) = fi (si (x, w)) ∀i ∈ N (2) We also note a0i (x, w) = fi0 (si (x; )) and a00i (x, w) = fi00 (si (x, w)). Definition. For a given input x and a given weight vector w, the jacobian matrix J(x, w) of a neural 2 LIMOS / Blaise Pascal University (FRANCE) network is defined by: J(x, w) = Jij (x, w) i∈Nout ,j∈N with Research Report RR-04-11 in ∂yi ∂ai Jij (x, w) = = (x, w) ∂xj ∂sj ∀i ∈ Nout , ∀j ∈ Nin 1.2 Learning problem subject to monotonicity constraints An output value yi (i ∈ Nout ) of the neural network is increasing (resp. decreasing), on a compact set ∂yi K ⊂ R|Nin | , according to an input value xj (j ∈ Nin ) if the corresponding element Jij (x, w) = ∂x j of the jacobian matrix is positive, for all x ∈ K. So, a monotonic increase constraint on the regression function is: Jij (x, w) ≥ 0 i ∈ Nout , j ∈ Nin , ∀x ∈ K (C) The learning is classically achieved by a gradient descent method. Generally, we seek to minimize the quadratic error of estimation Eq on a given base B of examples (w ∈ R|A| is the weight vector of the neural network): X X 2 min E (w) = a (x, w) − y q i i (P0 ) w∈R|A| (x,y) ∈ B i ∈ Nout With the monotonic constraints, the learning problem becomes: X X 2 ai (x, w) − yi min E(w) = (x,y) ∈ B i ∈ Nout (P1 ) ∀x ∈ K, ∀i ∈ Nout , ∀j ∈ Nin s.t. Jij (x, w) ≥ 0 |A| w∈R (C1 ) The monotonicity requirement is difficult to enforce on the whole domain, K. So we consider the restricted problem: X X 2 min E(w) = ai (x, w) − yi (x,y) ∈ B i ∈ Nout (P2 ) s.t. Jij (x, w) ≥ 0 ∀(x, ·) ∈ B, ∀i ∈ Nout , ∀j ∈ Nin (C2 ) |A| w∈R 1.3 Penalty method The principles of the penalty method (see [1, 6] for details) to solve problem (P2 ) are briefly recalled in this section. Let (P ) be an non-linear optimization problem with constraints: min f (x) (P ) s.t. gi (x) ≥ 0 ∀i ∈ [1..m] n x∈R 3 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 Penalty methods solve a sequence of unconstrained subproblems which approach iteratively the infinite penalty function (σ(x) = 0 if all constraints are valid and σ(x) = ∞ otherwise). The subproblems (SPk ) are : ( m X ϕ(gi (x)) (SPk ) minn Φ(x, µ) = f (x) + µ x∈R i=1 where ϕ is a function from R to R. For instance, • if ϕ = ϕ1 : x → 21 min(0, x) 2 , Φ matches the quadratic penalty function, • if ϕ = ϕ2 : x → − log(x), Φ matches the logarithmic barrier function. The penalty function Φ(w, µ) associated to problem (P 2 ) is then (w ∈ R|A| is the vector of the weights of the neural network): X X X ϕ Jij (x, w) Φ(w, µ) = E(w) + µ (x,y) ∈ B i ∈ Nout j ∈ Nin = X X ai (x, w) − yi i ∈ Nout (x,y) ∈ B = X | {z E(x,w) E(x, w) + µP (x, w) (x,y) ∈ B 2 } +µ X X ϕ Jij (x, w) i ∈ Nout j ∈ Nin | {z P (x,w) ! } At each iteration of the penalty method, we have to solve the following subproblems: X (Pk ) min Φ(w, µk ) = E(x, w) + µk P (x, w) w (x,y) ∈ B The sequence µk has to meet lim µk = ∞ in the case of penalty functions and lim µk = 0 in the k→∞ k→∞ case of barrier functions. In order to apply a gradient descent method for solving these subproblems, we must be able to compute the gradient ∇Φ(w, µ) : X ∇E(x, w) + µ ∇P (x, w) ∀c ∈ A (3) ∇Φ(w, µ) = (x,y) ∈ B The term ∇E can be computed by the back-propagation algorithm. We focus our attention on the penalty term ∇P in the next section, whose components are: X X ∂Jij (x, w) ∂P (x, w) = ϕ0 Jij (x, w) ∀c ∈ A, ∀x ∈ B (4) ∂wc ∂wc i ∈ Nout j ∈ Nin 2 Forward-backward algorithm for gradient computation In this section, the derivation of the gradient ∇P is presented in details, in the same way as was done in [3]. Let x be an input vector of the neural network and w a weight vector. In order to simplify the notations, we will omit x and w in the sequel. 4 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 2.1 The jacobian matrix The calculation of Jij can be performed in a forwards or backwards as shown here. Proposition 1 (Forward relation for jacobian elements). We have, for all (i, j) ∈ N × N in : Jij = a0i i δj (δji is the Kronecker symbol) if i ∈ Nin X wki Jkj (5) otherwise. k,(k,i)∈A Proof. Equation (5) can be demonstrated by using (2) in the definition of Jij : Jij = ∂ai ∂si ∂ai ∂si = = a0i ∂sj ∂sj ∂sj ∂sj Then, according to (1), we have ∂si = ∂sj X ∂ak ∂sj |{z} wki k,(k,i)∈A ∀(i, j) ∈ N × Nin Jkj Proposition 2 (Backward relation for jacobian elements). For all (i, j) ∈ N out × N Jij = 0 i aj δj 0 aj if j ∈ Nout , X wjk Jik (6) otherwise. k,(j,k)∈A Proof. Let (i, j) ∈ Nout × N . Jij = ∂ai ∂ai ∂aj ∂ai = = a0j ∂sj ∂aj ∂sj ∂aj So if j ∈ / Nout , by using the classical formulation of the back-propagation: ∂ai = ∂aj X ∂ai ∂sk ∂ai = wjk ∂sk ∂aj ∂sk k,(j,k)∈A |{z} k,(j,k)∈A X wjk Finally, if j ∈ Nout , then ∂ai = δij . ∂aj 2.2 Differentiation of the jacobian matrix For i ∈ Nout , j ∈ Nin and c ∈ A, we calculate Remark. Let κij = ∂Jij . ∂wc ∂si , defined for (i, j) ∈ N × Nin . ∂sj 5 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 Then we can write, according to the demonstration of proposition 1: Jij = a0i κij and i δj X κij = wki Jkj (7) if i ∈ Nin , (8) otherwise. k,(k,i)∈A k Proposition 3. Note νij = ∂Jij , ∀i ∈ Nout , ∀j ∈ Nin et ∀k ∈ N . We have the following back∂sk propagation relation: 00 i aj κjk δj k X νij = 00 0 k w a κ J + a ν jl jk il j j il if k ∈ Nout , otherwise. (9) l,(jl)∈A Proof. This can be easily demonstrated by using the back-propagation formula (6). Let i ∈ Nout , j ∈ Nin and k ∈ N . If k ∈ / Nout , then X X ∂a0j ∂Jij ∂ 0 0 ∂Jil wjl Jil wjl Jil = aj = + aj ∂sk ∂sk ∂sk ∂sk l,(j,l)∈A l,(j,l)∈A X ∂Jil ∂sj = wjl a00j Jil + a0j ∂sk ∂sk l,(j,l)∈A X = wjl a00j Jil κjk + a0j νilk l,(j,l)∈A Then, if j ∈ Nout : ∂a0j ∂a0j ∂sj ∂Jij ∂ 0 i aj δj = δji = = δji = δji a00j κjk ∂sk ∂sk ∂sk ∂sj ∂sk Proposition 4. For i ∈ Nout , j ∈ Nin and c = (k, l) ∈ A, we have ∂Jij = Jil Jkj + ak νilj ∂wkl (10) Proof. As j ∈ Nin , we can write ∂Jij ∂ ∂ai ∂ ∂ai = = ∂wkl ∂wkl ∂sj ∂sj ∂wkl a Now ∂ai ∂wkl z }|k { ∂ai ∂sl = = ak Jil ∂sl ∂wkl |{z} Jil So ∂ak ∂Jil ∂ ∂Jij [ak Jil ] = Jil + ak = ∂wkl ∂sj ∂sj ∂sj ∂Jil = Jil Jkj + ak ∂sj 6 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 2.3 Differentiation of the penalty term We now set our attention on Eq. (4): X ∂Jij ∂P ϕ0 Jij = ∂wc ∂wc (i,j) ∈ Nout ×Nin Remark. For (i, j) ∈ N × Nin , define Jij+ as X Jki ϕ0 Jkj Jij+ = k ∈ Nout From (6), Jij+ can be calculated by back-propagation: 0 0 if i ∈ Nout , ai ϕ (Jij ) X + Jij = + 0 wik Jkj otherwise. ai (11) k,(i,k)∈A Remark. If we note, for (i, j) ∈ Nin × N , X j + νki ϕ0 Jkj νij = k ∈ Nout + From (10), we obtain a chain rule for the computation of ν ij , 00 0 if i ∈ Nout , ai κij ϕ (Jij ) + X νij = + + + a0i νkj otherwise. wik a00i κij Jkj (12) Proposition 5. For all (k, l) ∈ A, X ∂P + + = Jkj Jlj + ak νlj ∂wkl (13) k,(i,k)∈A Using Eq. (4) and Eq. (10), we finally obtain: j ∈ Nin 2.4 Algorithm We can now establish a forward-backward algorithm (a detailed algorithm is given in appendix) to compute the gradient ∇P (x), for an given input vector x: Algorithme 1 (Principle algorithm for computing the penalty term gradient). 1. (forward propagation) Compute, in the topological order (from inputs to outputs), κ ij and Jij using (8) and (7), for all (i, j) ∈ N × Nin , 2. (backward propagation) Compute, in the reverse topological order (from outputs to inputs), + Jij+ and νij using (11) and (12), for all (i, j) ∈ N × Nin , 3. (final step) Use (13) to compute the overall gradient. Like the standard back-propagation algorithm for error gradient computation, the complexity of this algorithm is linear according to the number of synaptic weights. 7 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 3 Heuristic for a feasible initialization of feed-forward neural networks In barrier methods, the initial solution should satisfy the constraints. We propose in this section an heuristic method to construct an initial weight vector w meeting all monotonicity constraints (all terms of the jacobian matrix will be positive). Let consider again the relation (10)): Jij = ∂ai = a0i ∂sj X wki Jkj k,(k,i)∈A Assuming a monotonic increasing activation function, a 0i is positive. For all hidden neuron i and for all j ∈ Nout , we have wki Jkj ≥ 0, ∀k | (k, i) ∈ A =⇒ Jij ≥ 0. So, we can use this relation to define a simple heuristic to initialize weights that satisfies the monotonicity constraints. For all nodes i ∈ N , define u i by −1 if Jij ≤ 0, ∀j ∈ Nout , ui = 1 if Jij ≥ 0, ∀j ∈ Nout , 0 otherwise. The sign of the initial weights may be heuristically defined as follows: Algorithme 2 (Weight initialization (see Figure 1)). step 1. Set the following values for u i : if i ∈ Nin or i ∈ Nout , then ui = 1, if i is a bias neuron, ui = 0, if i is a hidden neuron, then ui ∈ {−1, 1}. step2. Determine the sign of a weight w ij , (i, j) ∈ A, with these rules: if ui uj > 0, then wij > 0, if ui uj < 0, then wij < 0, if ui uj = 0, then the sign of wij does not matter. 1 0 x1 +1 −1 +1 y1 −1 +1 x2 +1 Figure 1: Heuristic initialization of weights in a feed-forward neural network. 8 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 4 Numerical results To illustrate the learning with monotonicity requirements, we consider a toy regression problem of an increasing unidimensional function perturbed by a simple wavelet. We use a batch gradient algorithm to solve unconstrained optimization problems. We also use the Armijo rule for the line search at each iteration (or ”epoch”). In the case of the barrier function, it is fundamental not to overcome the barrier during the line search. The Armijo rule, which preserves the convergence of descent methods ([2])), is: for scalar s > 0, σ ∈ [0, 21 ] and β ∈ [0, 1], we choose ηt = β m s where m is the lower natural integer such as: Φ(wt , µ) − Φ(wt + β m s dt , µ) ≥ −σ β m s ∇Φ(wt , µ)T dt We consider here a non-linear unidimensional regression problem which target data are generated by the function y = x3 − 5 x exp −100 x2 + ε, where ε is a uniform random variable in the range [−0.05, 0.05]. We have considered a Multi-Layer Perceptron with 5 hidden units (so we have 16 parameters). The MLP is trained with a learning base composed by 250 patterns and we use a second base of 1000 patterns for the validation. We have made three different learnings with the same initialization (using our heuristic): the first one without increasing requirements (classical learning), the second one considering a penalty term for increasing requirements, and the last one a barrier term. Quadratic Error (MSE) learning validation classic penalty barrier f (x) = x3 7.47 · 10−4 4.18 · 10−3 4.19 · 10−3 4.22 · 10−3 9.05 · 10−4 5.40 · 10−2 5.47 · 10−2 5.27 · 10−3 Monotonicity Error (MSME) learning validation 0.967 0 0 - Non Increasing Pattern (%) learning validation 1.139 9.67 · 10−16 0 - 6% 0% 0% - 8.3% 0.2% 0% - Epochs 3377412 3983 130759 - Table 1: Learning results with 250 patterns in the learning base. Table 1 summarize our experiment. For each learning process, we give, for the learning and validation bases: • the Mean Square Error (MSE): MSE = 1 |B| X X ai (x, w) − yi (x,y) ∈ B i ∈ Nout 2 • the Mean Square Monotonicity Error (MSME): MME = 1 |B| X X X min 0, Jij (x, w) (x,y) ∈ B i ∈ Nout j ∈ Nin 2 • the rate of patterns where the neural evaluation is non increasing, • and the number of epochs of each learning (we stop when 9 k wk+1 −wk k∞ k w k k∞ < 10−8 ). LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 Classical learning 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 f (x) = x3 learning patterns neural estimations -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0.8 1 0.8 1 Learning with penalty function 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 f (x) = x3 learning patterns neural estimations -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Learning barrier function 1 0.8 0.6 0.4 0.2 0 -0.2 -0.4 -0.6 -0.8 -1 f (x) = x3 learning patterns neural estimations -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Figure 2: Learning results of a 5-hidden units MLP with a the learning base of 250 patterns. 10 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 We also include in the table the MSE obtained by the function f (x) = x 3 on the learning and validation bases. We can see that the MLP trained with monotonicity requirements is really closed to the underlying function (see also Figure 2). Monotonicity requirements are respected on the learning base for training with penalty and barrier functions, and violations are negligible on the unknown patterns of the validation base. These violations can be reduced by considering a bigger learning base. The learning with monotonicity constraints converges faster than the non-constrained learning, particularly when we use the quadratic penalty function. Learning results with a barrier term or with a penalty term are closed. However the use of the barrier function allow to ensure the monotonicity for all examples of the validation base. Conclusion A learning algorithm satisfying monotonicity requirements for feed-forward neural networks is presented here and has been successfully applied on a simple regression problem. A rigorous optimization scheme based on penalty methods is used to solve the constrained learning problem. Other optimization techniques can also be considered like the augmented lagrangian method or trust region algorithms. We also propose a heuristic to initialize a neural network satisfying the monotonicity constraints. We show with a simple example that introducing constraints on an a priori knowledge on the underlying function can drastically enhance the generalization capacity of the neural network and speed up the learning process. We use such a learning approach to solve a routing problem with delay constraints in telecommunication networks (see [9, 5]). The neural network was trained to estimate delays induced by the load of the network, which is an unknown increasing function of the traffic rates. This neural estimator was then used in a routing optimization scheme for which the increase condition of the delay function is fundamental condition. References [1] Dimitri P. Bertsekas. Constrained Optimization and Lagrange Multiplier Methods. Academic Press, 1982. [2] Dimitri P. Bertsekas. Nonlinear Programming: Second Edition. Athena Scientific, 1999. [3] Christopher M. Bishop. Curvature-driven smoothing: a learning algorithm for feed-forward networks. IEEE Transactions on Neural Networks, 4(5), September 1993. [4] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995. [5] Christophe Duhamel, Antoine Mahul, and Alexandre Aussem. Routing with neural-based QoS constraints. In Proceedings of the first International Network Optimization Conference (INOC’2003), pages 201–206, Evry-Paris, France, October 2003. [6] Roger Fletcher. Penalty functions. In A. Bachem, M. Grötschel, and B. Korte, editors, Mathematical Programming, The State of the Art, Bonn 1982, pages 87–114. Springer-Verlag, 1983. 11 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 [7] Jouko Lampinen and Arto Selonen. Using background knowledge in multilayer perceptron learning. In M. Frydrych, J. Parkkinen, and A.Visa, editors, Proceedings of the 10th Scandinavian Conference on Image Analysis SCIA’97, volume 2, pages 545–549, 1997. [8] Antoine Mahul and Alexandre Aussem. Distributed neural networks for QoS estimation in communication network. International Journal of Computational Intelligence and Applications, 3(3):297–308, 2003. [9] Antoine Mahul and Alexandre Aussem. Neural-based quality of service estimation in MPLS routers. In Supplementary Proceedings of the International Conference on Artificial Neural Networks (ICANN’2003), pages 390–393, Instanbul, Turkey, June 2003. [10] Joseph Sill and Yaser S. Abu-Mostafa. Monotonic hints. Advances in Neural Information Processing Systems, 9:634, 1997. Appendix We give here a detailed version of the algorithm for computation of the penalty term gradient. /* Computation of the values to propagate */ for all i ∈ Nin , do ai ← xi Jii ← 1 for all j ∈ Nin such as j 6= i, do Jij ← 0 end for /* Forward propagation */ for all P i∈ / Nin in topological order, do si ← k,(k,i) ∈ A wki ak ai ← σi (si ), a0i ← σi0 (si ) et a00i ← σi00 (si ) for all jP ∈ Nin , do κij ← k,(k,i) ∈ A wki Jkj Jij ← a0i κij end for end for /* Computation of the values to back-propagate */ for all i ∈ Nout , do for all j ∈ Nin , do + ← a0i ϕ0 (Jij ) Jij + νij ← a00i κij ϕ0 (Jij ) end for end for /* Backward propagation */ for all i ∈ / Nout in reverse topological order, do for all P j ∈ Nin , do + x ← k,(i,k) ∈ A wik Jkj P + y ← k,(i,k) ∈ A νkj + Jij ← a0i x + νij ← a00i κij x + a0i y end for end for 12 LIMOS / Blaise Pascal University (FRANCE) Research Report RR-04-11 /* Final step: gradient computation */ for all c = (k, l) ∈ A, do x←0 for all j ∈ Nin , do + x ← x + Jkj Jlj+ + ak νlj end for ∂P/∂wkl ← x end for 13