A STUDY TO IMPROVE A LEARNING ALGORITHM OF NEURAL NETWORKS

Cong Huu Nguyen*1, Thanh Nga Thi Nguyen2, Ngoc Van Dong3
1 Thai Nguyen University, 2 College of Technology – TNU, 3 Ha Noi Vocational College of Electrical Mechanical
* Tel: 0913 589758, Email: conghn@tnu.edu.vn

ABSTRACT
Since the middle of the twentieth century, the study of optimization algorithms, especially alongside the development of digital computers, has increasingly become an important branch of mathematics. Nowadays these mathematical tools are applied in practice to neural network training. In the search for an optimal algorithm that minimizes the convergence time of the solution or avoids weak minima and local minima, the starting point is the study of the characteristics of the error surface. For a complex error surface such as a cleft error surface, whose contours are stretched and bent so that they form a cleft and a cleft axis, the old algorithms cannot cope. This paper proposes an algorithm that improves the convergence of the solution and the ability to escape from undesired areas on the error surface.
Keywords: neural networks, special error surface, local minima, optimization, algorithms

BACKGROUND
In the search for an optimal algorithm that minimizes the convergence time of the solution or avoids weak minima and local minima, the starting point is the study of the characteristics of the error surface, which is then used as the basis for improving an existing training algorithm or proposing a new one.
When neural networks are discussed, the quality of the trained network is usually considered (supervised learning). This leads to a quality function and, from it, to the concept of the network quality surface. The quality surface is also referred to by other terms: the error surface or the performance surface. Figure 1 shows such an error surface. It has some special features worth noting, in particular a slope that changes drastically over the parameter space. For this reason it is difficult to choose an appropriate step size for learning algorithms such as steepest descent or conjugate gradient. Some regions of the error surface are very flat and allow a large learning rate, while other regions are steep and require a small learning rate. Other methods, such as the momentum rule or the adaptive learning rate VLBP (Variable Learning Rate Backpropagation) algorithm, are not effective for this problem [5].
Thus, complex quality surfaces make the search for the optimal weights more difficult, and the search can still get stuck at the cleft axis before reaching the minimum point if the quality surface has the cleft form. A possible strategy for this problem is the following: after reaching the neighbourhood of the cleft axis by a gradient method with a step computed by line minimization (or with a prescribed learning step), we move along the bottom of the narrow cleft by gradually approximating its geometry, under the assumption that this geometry is a straight line or approximately a quadratic curve. The objective of this paper is to study and apply the cleft algorithm to calculate the learning step used to find the optimal weights of a neural network for solving the control problem.
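The step-size sensitivity described above can be made concrete with a small, purely illustrative experiment that is not part of the paper: fixed-step steepest descent on the ill-conditioned quadratic J(u) = 0.5(u1^2 + 100 u2^2), whose elongated contours behave like a cleft. The C++ sketch below is written only for this illustration; any step small enough to keep the steep u2 direction stable (below 2/100) makes progress along the flat u1 direction very slow.

// Illustrative only (not from the paper): fixed-step steepest descent on a
// cleft-like quadratic J(u) = 0.5*(u1^2 + 100*u2^2).
#include <cstdio>

int main() {
    const double alpha = 0.019;       // just below the stability limit 2/100 of the steep direction
    double u1 = 10.0, u2 = 1.0;
    for (int k = 0; k < 100; ++k) {
        const double g1 = u1;          // dJ/du1
        const double g2 = 100.0 * u2;  // dJ/du2
        u1 -= alpha * g1;
        u2 -= alpha * g2;
    }
    // After 100 iterations u2 has essentially converged, while u1 is still far from 0.
    std::printf("u1 = %g, u2 = %g, J = %g\n", u1, u2, 0.5 * (u1 * u1 + 100.0 * u2 * u2));
    return 0;
}

Raising the step above 0.02 makes the u2 component oscillate and diverge, which is exactly the dilemma that the cleft-overstep step discussed below is meant to resolve.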
CLEFT-OVERSTEP ALGORITHM FOR NEURAL NETWORK TRAINING
Cleft-overstep principle
Consider the unconstrained minimization problem:
$J(u) \to \min, \quad u \in E^n$    (1)
where u is the minimizing vector in an n-dimensional Euclidean space and J(u) is the target function, which satisfies
$\lim_{\|u\| \to \infty} J(u) = b$    (2)
The optimization algorithm for problem (1) has the iteration equation
$u^{k+1} = u^k + \alpha_k s^k, \quad k = 0, 1, \ldots$    (3)
where $u^k$ and $u^{k+1}$ are the starting point and the ending point of the k-th iteration step, $s^k$ is the vector that defines the direction in which the variables change in the n-dimensional space, and $\alpha_k$ is the step length. $\alpha_k$ is determined according to the cleft-overstep principle and is called a "cleft-overstep" step; equation (3) is then called the cleft-overstep algorithm.
The basic difference between the cleft-overstep method and other methods lies in the principle used to adjust the step. According to this principle, the step length of the searching point at each iteration is not smaller than the smallest step length at which the target function reaches its (local) minimum value along the moving direction of that iteration. The search trajectory of the cleft-overstep principle therefore creates a geometric picture in which the searching point "oversteps" the cleft bottom at every iteration step.
To specify the cleft-overstep principle, at each iteration step we examine a function of one variable [4]:
$h(\alpha) = J(u^k + \alpha s^k)$    (4)
Suppose that $s^k$ is a descent direction of the target function at the point $u^k$. According to condition (2), there is a smallest value $\alpha^* > 0$ at which $h(\alpha)$ reaches a minimum:
$\alpha^* = \arg\min_{\alpha > 0} h(\alpha)$    (5)
If $J(u^k)$, and therefore $h(\alpha)$, is continuously differentiable, we can define the cleft-overstep step $\alpha_v$ as follows:
$h'(\alpha)\big|_{\alpha = \alpha_v} > 0, \qquad h(\alpha_v) \le h(0)$    (6)
($\alpha_v$ is the overstep step, meaning that it steps over the cleft.)
The variation of $h(\alpha)$ as the optimization trajectory moves from the starting point $u^k$ to the ending point $u^{k+1}$ is illustrated in Figure 2. As α increases from 0, passes the minimum point $\alpha^*$ of $h(\alpha)$ and reaches the value $\alpha_v$, the corresponding point moves parallel to $s^k$ according to $u^{k+1} = u^k + \alpha_k s^k$ and takes a step of length $\alpha = \alpha_v \ge \alpha^*$. The graph also shows that, along the moving direction, the target function decreases from the point $u^k$ but, by the time the point $u^{k+1}$ is reached, it has started to increase again. If we used steps chosen by condition (5), we could be trapped at the cleft axis and the corresponding optimization algorithm would be stuck at that point. If, instead, the optimization process follows condition (6), the searching point is never allowed to settle at the cleft bottom before the optimal solution is obtained, and its trajectory always steps over the cleft bottom. In order to obtain an effective and stable iteration process, condition (6) is replaced by condition (7).
$\alpha_v \ge \alpha^* = \arg\min_{\alpha > 0} h(\alpha), \qquad h(\alpha_v) - h^* \le \lambda\, (h^0 - h^*)$    (7)
where $0 < \lambda < 1$ is called the overstep coefficient, $h^* = h(\alpha^*)$ and $h^0 = h(\alpha^0)$.
Figure 1: Cleft-similar error surface
Figure 2: Determining the cleft-overstep step
Determining the cleft-overstep step
Choosing the length of the learning step in the cleft problem is of great importance. If the step is too short, the computing time becomes long; if it is too long, the search runs into difficulties because it is hard to follow the curvature of the cleft. An adaptive learning step is therefore essential when searching for the optimal solution of the cleft problem. In this section we propose a simple yet effective way to find the cleft-overstep step.
Suppose that J(u) is continuous and satisfies the condition $J(u) \to \infty$ as $\|u\| \to \infty$, and that at iteration k the point $u^{k-1}$ and the moving vector $s^{k-1}$ have already been determined. We need to determine the step length $\alpha_k$ that satisfies condition (7). If $h^*$ in (7) is replaced by an estimate that approximates it from above, we still obtain a cleft-overstep step by definition. Therefore, to simplify the programming, we take the smallest value of h computed so far in each iteration instead of determining $h^*$ exactly; this also significantly reduces the number of objective function evaluations. We obtain the following algorithm, in which the overstep is recognized through the sign change of the directional derivative:
$h_k'(\alpha_k)\, h_k'(0) = \left[J'(u^{k-1} + \alpha_k s^{k-1})\right]^T s^{k-1} \cdot \left[J'(u^{k-1})\right]^T s^{k-1} < 0$    (8)
The flow chart in Figure 3 starts from a trial step $\alpha = a = 0.5$ with relaxation coefficient $\gamma = 0.1$ and accuracy $\varepsilon > 0$, enlarges the trial step by a factor of 1.5 while the objective along the search direction still decreases, and then refines the bracket with $\theta = \alpha + \gamma(\beta - \alpha)$ until $|\alpha - \beta| \le \varepsilon$.
Figure 3: Diagram of the algorithm that determines the cleft-overstep learning step
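For readers who prefer code to the flow chart, the following C++ fragment is a minimal sketch of one plausible reading of Figure 3; it is not the authors' implementation, and the function name cleftOverstepStep, the std::function interface and the default tolerance are assumptions made here for illustration. Only the constants a = 0.5, γ = 0.1 and the expansion factor 1.5 are taken from the figure.

// Hypothetical sketch of the cleft-overstep step search (one reading of Figure 3),
// not the authors' code. h(alpha) = J(u^{k-1} + alpha * s^{k-1}) is supplied by the caller.
#include <cmath>
#include <functional>

double cleftOverstepStep(const std::function<double(double)>& h,
                         double a = 0.5,      // initial trial step, as in Figure 3
                         double gamma = 0.1,  // relaxation coefficient, as in Figure 3
                         double eps = 1e-3)   // assumed accuracy on the bracket width
{
    const double h0 = h(0.0);

    // Expansion: enlarge the step by a factor of 1.5 while the objective keeps
    // decreasing, so that beta ends up beyond the cleft bottom (the overstep).
    double alpha = 0.0, hAlpha = h0;
    double beta  = a,   hBeta  = h(beta);
    while (hBeta <= hAlpha) {
        alpha = beta;  hAlpha = hBeta;
        beta *= 1.5;   hBeta  = h(beta);
    }

    // Refinement: shrink the bracket [alpha, beta] with theta = alpha + gamma*(beta - alpha)
    // until |alpha - beta| <= eps, keeping the overstep property h(step) <= h(0).
    while (std::fabs(beta - alpha) > eps) {
        const double theta  = alpha + gamma * (beta - alpha);
        const double hTheta = h(theta);
        if (hTheta <= hAlpha) { alpha = theta; hAlpha = hTheta; }
        else                  { beta  = theta; hBeta  = hTheta; }
    }
    return (hBeta <= h0) ? beta : alpha;  // a step past the minimum that still satisfies (6)
}

In the network training below, h(α) would simply be the network error evaluated at the weights shifted by α along the current backpropagation direction, so each trial step costs one additional evaluation of the error over the training set.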
PROGRAM AND RESULTS
To illustrate the above remarks, we present a neural network training method that combines the backpropagation procedure with a learning step calculated according to the cleft-overstep principle; the cleft-overstep algorithm itself was presented in the previous section. The example: for a given input vector, the neural network has to answer what it is. The software implements a network with 35 input neurons, 5 hidden-layer neurons and 10 output-layer neurons. The sigmoid is used as the activation function; a characteristic of this function is that it easily gives rise to a cleft error surface [1].
Figure 4: Structure of the neural network for recognition
Since the steepest descent method is used to update the weights of the network, we need the partial derivative of the error function with respect to each weight, that is, we must determine the update formulas and the update algorithm for the weights of the hidden layer and the output layer. For a given sample set, the derivative of the error function is calculated by summing the derivatives over every sample in the set. The analysis and the derivatives are based on the chain rule: the slope of the tangent to the error curve in the cross-section along a weight axis is the partial derivative of the error function J with respect to that weight, denoted $\partial J / \partial w$, and by the chain rule
$\frac{\partial J}{\partial w} = \frac{\partial J}{\partial w_1} \cdot \frac{\partial w_1}{\partial w_2} \cdots \frac{\partial w_n}{\partial w}$
Adjusting the weights of the output layer
Define: b – a weight of the output layer; z – the output of the output layer; t – the desired target value; $y_j$ – the output of neuron j in the hidden layer; $v = \sum_{j=0}^{M-1} b_j y_j$ – the weighted sum (net input) of the output neuron, so $\partial v / \partial b_j = y_j$ (ignoring the index of the neurons in the output layer). We use $J = 0.5\,(z - t)^2$, so $\partial J / \partial z = z - t$. The activation function of the output-layer neuron is the sigmoid $z = g(v)$, with $\partial z / \partial v = z(1 - z)$. We have:
$\frac{\partial J}{\partial b} = \frac{\partial J}{\partial z} \cdot \frac{\partial z}{\partial v} \cdot \frac{\partial v}{\partial b} = (z - t)\, z (1 - z)\, y$    (9)
From this, the update formula for an output-layer weight is (ignoring the indices):
$\Delta b = \alpha\, (z - t)\, z (1 - z)\, y$    (10)
We use formula (10) [4] in the procedure DIEUCHINHTRONGSO() to adjust the output-layer weights; the learning rate α is calculated according to the cleft-overstep principle.
Adjusting the weights of the hidden layer
The derivative of the objective function with respect to a hidden-layer weight is calculated by the chain rule: $\partial J / \partial a = (\partial J / \partial y)(\partial y / \partial u)(\partial u / \partial a)$. Define: a – a weight of the hidden layer; y – the output of the neuron in the hidden layer; $x_i$ – the components of the input vector; $u = \sum_{i=0}^{N-1} a_i x_i$ – the weighted sum (net input) of the hidden neuron, so $\partial u / \partial a_i = x_i$; k – the index of a neuron in the output layer. The derivative of the objective with respect to a hidden-layer weight is
$\frac{\partial J}{\partial a_i} = \sum_{k=0}^{K-1} (z_k - t_k)\, z_k (1 - z_k)\, b_k\, y (1 - y)\, x_i$    (11)
and the corresponding update formula for a hidden-layer weight is
$\Delta a_i = \alpha \sum_{k=0}^{K-1} (z_k - t_k)\, z_k (1 - z_k)\, b_k\, y (1 - y)\, x_i$    (12)
In these formulas, the index i denotes the i-th neuron of the input layer and the index k denotes the k-th neuron of the output layer. We use formula (12) [4] in the procedure DIEUCHINHTRONGSO() to adjust the hidden-layer weights; the learning rate α is again calculated according to the cleft-overstep principle.
The network structure
The sigmoid activation function, which is prone to producing a narrow cleft in the network quality surface, is
$f(x) = \frac{1}{1 + \exp(-x)}$    (13)
Example: recognizing the characters that are the digits 0, 1, ..., 9 [1], and comparing the convergence of three learning-step methods: the cleft-overstep principle, a fixed step and a gradually decreasing step. Each character is encoded by a 5 × 7 = 35 matrix, so each input vector x has size 35 × 1, with components taking the value 0 or 1. We therefore select an input layer with 35 inputs; to distinguish ten characters, the output layer has 10 neurons; for the hidden layer, five neurons are selected. Thus:
Hidden-layer weight matrix W1,1, size 35 × 5
Output-layer weight matrix W2,1, size 5 × 10
Input vector x, size 35 × 1
Hidden-layer output vector y, size 5 × 1
Output-layer output vector z, size 10 × 1
After compiling with Visual C++ and running the program, the network is trained in turn by the three methods: fixed learning step, gradually decreasing step and cleft-overstep. With each method we train the network 20 times.
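Before turning to the results, the update rules (10) and (12) can be collected into a single routine. The sketch below is hypothetical C++ written for this description, not the paper's DIEUCHINHTRONGSO() procedure: the 35-5-10 sizes, the sigmoid (13) and the Δ formulas come from the text above, while the data layout and names are assumptions. The learning rate α is taken as given, on the assumption that it has been produced by the cleft-overstep step search; the Δ values of (10) and (12) are subtracted from the weights so that the error decreases.

// Hypothetical sketch (not the paper's DIEUCHINHTRONGSO()): one weight update of the
// 35-5-10 sigmoid network using formulas (10) and (12) for a single sample (x, t).
#include <array>
#include <cmath>

constexpr int NIN = 35, NHID = 5, NOUT = 10;   // network sizes from the text

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }   // formula (13)

struct Net {
    double a[NHID][NIN];    // hidden-layer weights (W1,1 of size 35 x 5, stored one row per neuron)
    double b[NOUT][NHID];   // output-layer weights (W2,1 of size 5 x 10, stored one row per neuron)
};

void adjustWeights(Net& net, const std::array<double, NIN>& x,
                   const std::array<double, NOUT>& t, double alpha)
{
    // Forward pass: u = sum_i a_i x_i, y = g(u); v = sum_j b_j y_j, z = g(v).
    std::array<double, NHID> y{};
    for (int j = 0; j < NHID; ++j) {
        double u = 0.0;
        for (int i = 0; i < NIN; ++i) u += net.a[j][i] * x[i];
        y[j] = sigmoid(u);
    }
    std::array<double, NOUT> z{};
    for (int k = 0; k < NOUT; ++k) {
        double v = 0.0;
        for (int j = 0; j < NHID; ++j) v += net.b[k][j] * y[j];
        z[k] = sigmoid(v);
    }

    // Common factor (z_k - t_k) z_k (1 - z_k) shared by (10) and (12).
    std::array<double, NOUT> d{};
    for (int k = 0; k < NOUT; ++k) d[k] = (z[k] - t[k]) * z[k] * (1.0 - z[k]);

    // Hidden-layer update, formula (12), using the not-yet-updated output weights b_k.
    for (int j = 0; j < NHID; ++j) {
        double s = 0.0;
        for (int k = 0; k < NOUT; ++k) s += d[k] * net.b[k][j];
        const double dj = s * y[j] * (1.0 - y[j]);
        for (int i = 0; i < NIN; ++i) net.a[j][i] -= alpha * dj * x[i];   // a_i -= delta_a_i
    }

    // Output-layer update, formula (10): delta_b_j = alpha (z - t) z (1 - z) y_j.
    for (int k = 0; k < NOUT; ++k)
        for (int j = 0; j < NHID; ++j) net.b[k][j] -= alpha * d[k] * y[j];
}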
The results are given in Table 1. Entries are the number of training iterations; "Fail" denotes an unsuccessful run.
Table 1: Input sample file records {0 1 2 3 4 5 6 7 8 9}
No.   Fixed step 0.2   Gradually decreasing step from 1   Cleft-overstep
1     Fail             Fail                               Fail
2     7902             3634                               23
3     7213             2416                               50
4     Fail             2908                               34
5     12570            2748                               31
6     9709             3169                               42
7     Fail             2315                               43
8     9173             2375                               33
9     Fail             Fail                               34
10    8410             2820                               33
11    10333            2618                               32
12    12467            2327                               39
13    Fail             3238                               44
14    9631             2653                               Fail
15    12930            2652                               31
16    10607            Fail                               53
17    Fail             2792                               31
18    7965             2322                               42
19    11139            2913                               42
20    Fail             2689                               33
Summary: fixed step – average 10003 iterations, 7 fails out of 20; gradually decreasing step – average 2740 iterations, 3 fails out of 20; cleft-overstep – average 35 iterations, 2 fails out of 20.
Comments: We trained the network with the three different methods and found that learning by the cleft-overstep principle converges much faster, and the number of failed runs is also reduced. One drawback of the cleft-overstep principle is that the computation time of each iteration is long, because the constant FD = 1e-4 that we chose is small. Overall, however, the total network training time is still more favourable.
CONCLUSION
In this paper the authors have successfully proposed the use of the "cleft-overstep" algorithm to improve the training of neural networks that have a special error surface, and have illustrated it concretely with a handwriting (digit) recognition application. The research and the experiments show that, for a neural network whose error surface has a deep cleft shape, keeping the backpropagation algorithm but applying the "cleft-overstep" principle to train the network gives higher accuracy and faster convergence than the gradient method. The "cleft-overstep" algorithm can be applied to train any neural network structure with such a special error surface, so the results of this study can be applied to many other problems in the fields of telecommunications, control and information technology. Further research is needed on the choice of the search direction vector in the "cleft-overstep" algorithm and on changing the way the quality function is assessed, in order to reduce the computational complexity on the computer [6]. Nevertheless, the results obtained so far already reflect the correctness of the proposed algorithm and reveal possibilities for practical application.
REFERENCES
[1]. Cong Huu Nguyen, Thanh Nga Thi Nguyen, Phuong Huy Nguyen (2011), "Research on the application of genetic algorithm combined with the 'cleft-overstep' algorithm for improving learning process of MLP neural network with special error surface", Natural Computation (ICNC), 2011 Seventh International Conference on, 26-28 July 2011, pp. 222-227.
[2]. Maciej Lawrynczuk (2010), "Training of neural models for predictive control", Institute of Control and Computation Engineering, Faculty of Electronics and Information Technology, Warsaw University of Technology, ul. Nowowiejska 15/19, 00-665 Warsaw, Poland, Neurocomputing 73.
[3]. Thuc Nguyen Dinh & Hai Hoang Duc, Artificial Intelligence – Neural Network, Method and Application, Educational Publisher, Ha Noi.
[4]. Nguyen Van Manh and Bui Minh Tri (1990), "Method of 'cleft-overstep' by perpendicular direction for solving the unconstrained nonlinear optimization problem", Acta Mathematica Vietnamica, vol. 15, no. 2.
[5]. Hagan, M.T., H.B. Demuth and M.H. Beale (1996), Neural Networks Design, PWS Publishing Company, Boston.
[6]. R.K. Al Seyab, Y. Cao (2007), "Nonlinear system identification for predictive control using continuous time recurrent neural networks and automatic differentiation", School of Engineering, Cranfield University, College Road, Cranfield, Bedford MK43 0AL, UK, ScienceDirect.

SUMMARY
A STUDY TO IMPROVE A LEARNING ALGORITHM OF NEURAL NETWORKS
Nguyễn Hữu Công*1, Nguyễn Thị Thanh Nga2, Đồng Văn Ngọc3
1 Thai Nguyen University, 2 College of Technology – TNU, 3 Ha Noi Vocational College of Electrical Mechanical
Since the middle of the twentieth century, research on optimization algorithms, especially alongside the development of digital computers, has increasingly become an important field of mathematics. Nowadays these mathematical tools are applied to the training of neural networks. In the search for an optimal algorithm that minimizes the convergence time or avoids weak minima and local minima, one always starts from studying the characteristics of the error surface. For complex error surfaces such as the cleft error surface, whose contours are stretched and bent so that they form a cleft and a cleft axis, the old algorithms cannot cope. This paper proposes an algorithm to improve the convergence and the ability to escape from undesired areas on such special error surfaces.
Keywords: neural networks, special error surface, local minima, optimization, learning algorithms
Received: ; reviewed: ; accepted: .