Neural Networks for Solving Systems of Linear Equations; Minimax, Least Absolute Value and Least Square Problems in Real Time
Andrzej Cichocki
Presented by: Yasaman Farahani, Maryam Khordad, Leila Pakravan Nejad

Introduction

Solving systems of linear equations is one of the basic problems widely encountered in science and engineering, since it appears in a great many applications. Every linear parameter estimation problem gives rise to a set of linear equations Ax = b. This problem arises in a broad class of scientific disciplines such as signal processing, robotics, automatic control, system theory, statistics, and physics. In many applications a real-time solution of a set of linear equations (or, equivalently, an online inversion of matrices) is desired. We employ artificial neural networks (ANNs), which can be considered specialized analog computers relying on strongly simplified models of neurons. This approach has led to new theoretical results, and advances in VLSI technology make it possible to fabricate microelectronic networks of high complexity.

Formulation of the Basic Problem

Consider the linear parameter estimation model Ax = b, where A is the m x n data matrix, b is the m-dimensional observation (measurement) vector, and x is the n-dimensional vector of unknown parameters. It is desired to find in real time a solution x* if an exact (error-free) solution exists at all, or otherwise to find an approximate solution that comes as close as possible to the true solution (the best estimate of the solution vector x*). The key step is to construct an appropriate energy function (Lyapunov function) E(x) so that its lowest energy state corresponds to the desired solution x*. The derivation of the energy function transforms the minimization problem into a set of ordinary differential or difference equations realized by ANN architectures with appropriate synaptic weights, input excitations, and nonlinear activation functions.

The task is to find the vector x that minimizes the energy function

    E_p(x) = (1/p) Σ_{i=1}^{m} |r_i(x)|^p,   where r_i(x) = Σ_{j=1}^{n} a_ij x_j - b_i.

The following cases have special importance: p = 1 (least absolute values), p = 2 (standard least squares), and p → ∞ (minimax).

In order to reduce the influence of outliers (large errors), the more robust iteratively reweighted least squares technique can be used. In the presence of outliers an alternative approach is to use the least absolute value criterion. The proper choice of criterion depends on the specific application and, to a great extent, on the distribution of the errors in the measurement vector b. The standard least squares criterion is optimal for a Gaussian distribution of the noise; however, this assumption is frequently unrealistic due to different sources of errors such as instrument errors, modeling errors, sampling errors, and human errors.

Neuron-Like Architectures for Solving Systems of Linear Equations

Standard least squares criterion:

    E(x) = (1/2) ||Ax - b||_2^2.

Applying the gradient approach we obtain the system of differential equations

    dx/dt = -μ(t, x) A^T (Ax - b),

where μ(t, x) is an n x n positive-definite matrix that is often diagonal; its entries may depend on the time t and on the vector x.

The basic idea is to compute a trajectory x(t), starting at the initial point x(0), that has the solution x* as a limit point. The specific choice of the coefficients must ensure the stability of the differential equations and an appropriate speed of convergence to the stationary (equilibrium) state. The above system of differential equations is stable (i.e., it always has an asymptotically stable solution) under the condition that the matrix μ(t, x) is positive-definite for all values of x and t, and in the absence of round-off errors in the matrix A.

Fig. 1. Schematic architecture of an artificial neural network for solving a system of linear equations Ax = b.
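The gradient flow above can be illustrated with a short numerical sketch (not part of the original slides). It uses a forward-Euler discretization and replaces the matrix-valued, possibly time-varying learning rate μ(t, x) by a single constant scalar mu; the matrix A and vector b are arbitrary illustrative data.

```python
import numpy as np

def ls_gradient_flow(A, b, mu=0.5, dt=0.01, steps=5000):
    """Forward-Euler simulation of dx/dt = -mu * A^T (A x - b)."""
    x = np.zeros(A.shape[1])                # initial point x(0) = 0
    for _ in range(steps):
        r = A @ x - b                       # residual r(x) = Ax - b
        x -= dt * mu * (A.T @ r)            # Euler step of the gradient flow
    return x

# Small overdetermined example: the trajectory settles near the LS solution.
A = np.array([[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 1.0])
print(ls_gradient_flow(A, b))
print(np.linalg.lstsq(A, b, rcond=None)[0])  # reference least-squares solution
```

The analog network of Fig. 1 realizes the same flow in continuous time with integrators instead of the Euler loop.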
Iteratively Reweighted Least Squares Criterion

In order to diminish the influence of outliers we employ the iteratively reweighted least squares criterion. Applying the gradient approach to the minimization of the corresponding energy function, we obtain a system of differential equations of the same structure as before, in which the residuals r_i(x) are first passed through sigmoid-type nonlinearities before being fed back through the connection weights a_ij.

Adaptive selection of the learning-rate coefficients μ_j can greatly increase the convergence rate without causing the stability problems that could arise from the use of higher (fixed) constant values of μ_j. The use of sigmoid nonlinearities in the first layer of "neurons" is essential for overdetermined linear systems of equations, since it enables us to obtain more robust solutions that are less sensitive to outliers (in comparison with the standard linear implementation), by compressing large residuals and preventing their absolute values from exceeding the prescribed cutoff parameter.

Special Cases with Simpler Architectures

An important class of Ax = b problems are the well-scaled and well-conditioned problems, for which the eigenvalues of the matrix A are clustered in a set of eigenvalues of similar magnitude. In such a case the matrix differential equation takes the form

    dx/dt = -μ W (Ax - b),

where μ > 0 is a positive scalar coefficient and W is an arbitrary n x n nonsingular matrix chosen such that the matrix WA is positive stable. The stable equilibrium point x* (for dx/dt = 0) does not depend on the value of μ or on the coefficients of the matrix W. In particular, W can be a diagonal matrix, in which case the set of differential equations can take the form

    dx_j/dt = -μ_j g( Σ_{k=1}^{n} a_jk x_k - b_j ),   j = 1, 2, ..., n,

where g(·) is a nonlinear sigmoid function.

For some well-conditioned problems, instead of minimizing one global energy function E(x), it is possible to minimize simultaneously n local energy functions defined by

    E_i(x) = (1/2) r_i^2(x) = (1/2) ( Σ_{j=1}^{n} a_ij x_j - b_i )^2,

applying a general gradient method to each energy function, with the i-th unit updating its own variable x_i:

    dx_i/dt = -μ_i a_ii ( Σ_{j=1}^{n} a_ij x_j - b_i ).

Fig. 2. Simplified architecture of an ANN for the solution of a system of linear equations with a diagonally dominant matrix.

To find sufficient conditions for the stability of such a circuit we can use the Lyapunov method. From it we obtain a sufficient condition for the stability of the above system: the matrix A must be diagonally dominant, i.e.,

    |a_ii| > Σ_{j ≠ i} |a_ij|   for all i.
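As a rough numerical illustration of this simplified local-gradient scheme, the sketch below (not from the slides) replaces the sigmoid activation g(·) by the identity and collapses the per-unit gains μ_i a_ii into one scalar mu; the 3 x 3 matrix is an assumed diagonally dominant example with positive diagonal entries, for which the Gershgorin argument guarantees stability.

```python
import numpy as np

def local_gradient_flow(A, b, mu=0.5, dt=0.01, steps=4000):
    """Simplified scheme dx/dt = -mu (A x - b): each unit i only uses its own
    residual r_i(x); no product A^T A is ever formed.  Stable here because A
    is diagonally dominant with positive diagonal entries."""
    x = np.zeros(len(b))
    for _ in range(steps):
        x -= dt * mu * (A @ x - b)
    return x

# Assumed diagonally dominant example; the flow converges to A^{-1} b.
A = np.array([[4.0, 1.0, 0.5],
              [1.0, 5.0, 1.0],
              [0.5, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
print(local_gradient_flow(A, b))
print(np.linalg.solve(A, b))     # reference solution
```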
Positive Connections

In some practical implementations of ANNs it is convenient to have all gains (connection weights) a_ij positive. This can easily be achieved by extending the original system of linear equations, whose entries a_ij have different signs, to an augmented system in which all entries are positive. Since both systems of linear equations must be equivalent with respect to the original variables, a simple relation between the original and the auxiliary entries must be satisfied; from this relation it is evident that the auxiliary entries can always be chosen so that all entries (connection weights) of the augmented system are positive. Thus, instead of solving the original problem, it is possible to solve an equivalent problem with all connection weights positive.

Improved Circuit Structures for Ill-Conditioned Problems

For ill-conditioned problems the schemes proposed so far may be prohibitively slow; they may even fail to find an appropriate solution, or they may find a solution with a large error. This can be explained by the fact that for an ill-conditioned problem we may obtain a system of stiff differential equations. A system of stiff differential equations is one that is stable but exhibits a wide difference in the behavior of the individual components of the solution. The essence of a stiff system is that it has a very slowly varying solution (trajectory) such that perturbations to it are rapidly damped. For a linear system of differential equations this happens when the time constants of the system, i.e., the reciprocals of the eigenvalues of the matrix A, are widely different.

Augmented Lagrangian with Regularization

Motivated by the desire to alleviate the stiffness of the differential equations and simultaneously to improve the convergence properties and the accuracy of the networks, we develop a new ANN architecture with improved performance. For this purpose we construct an energy function of the augmented-Lagrangian type for the linear parameter estimation problem. The augmented Lagrangian is obtained from the ordinary (common) Lagrangian by adding penalty terms. Since an augmented Lagrangian can itself be ill-conditioned, a regularization term with coefficient α is introduced to eliminate the instabilities associated with the penalty terms. The minimization of the energy function so defined can be transformed into a set of differential equations, which can be written in compact matrix form.

Fig. 3. General architecture of an ANN for matrix inversion.

In comparison with the architecture given in Fig. 1, the circuit contains extra damped integrators and amplifiers (gains k_i). The addition of these extra gains and integrators does not change the stationary point x*, but, as shown by computer simulation experiments, it helps to damp parasitic oscillations, improves the final accuracy, and increases the convergence speed (decreases the settling time). Analogously to our previous considerations, auxiliary sigmoid nonlinearities can be incorporated in the first layer of computing units (i.e., adders) in order to reduce the influence of outliers.

Preconditioning

Preconditioning techniques form a class of linear transformations of the matrix A or the vector x that improve the eigenvalue structure of the specified energy function and alleviate the stiffness of the associated system of differential equations. The simplest technique that enables us to incorporate preconditioning in an ANN implementation is to apply a linear transformation x = My, where M is an appropriate nonsingular matrix; instead of minimizing the energy function E(x) = (1/2)||Ax - b||_2^2 we minimize the modified energy function

    E~(y) = (1/2) ||A M y - b||_2^2.

This problem can be solved by simulating the system of differential equations

    dy/dt = -μ (AM)^T (A M y - b),

where μ > 0 is a positive scalar coefficient. Multiplying this equation by the matrix M, we get

    dx/dt = M dy/dt = -μ M M^T A^T (Ax - b).

Setting μ(t, x) = μ M M^T, we obtain a system of differential equations already considered for the standard least squares criterion. Thus the realization of a suitable symmetric positive-definite matrix μ(t, x), instead of a simple scalar, enables us to perform preconditioning, which may considerably improve the convergence properties of the system.
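A small sketch of the preconditioning idea, under assumptions not stated in the slides: the preconditioner P = M M^T is chosen here as a simple diagonal column-scaling matrix, and the ill-conditioned example system is invented. The point is only that the same gradient flow with a matrix-valued learning rate tolerates a much larger step and converges, whereas the plain flow is stiff and slow.

```python
import numpy as np

def gradient_flow(A, b, P, mu, dt, steps):
    """dx/dt = -mu * P @ A^T (A x - b), where P = M M^T plays the role of a
    symmetric positive-definite (preconditioning) learning-rate matrix."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= dt * mu * P @ (A.T @ (A @ x - b))
    return x

# Ill-conditioned example: the two columns of A have very different scales.
A = np.array([[1.0, 0.0], [0.0, 100.0], [1.0, 100.0]])
b = np.array([1.0, 200.0, 201.0])            # consistent system, x* = [1, 2]

P_plain = np.eye(2)                           # no preconditioning
P_diag  = np.diag(1.0 / np.sum(A * A, axis=0))  # column-scaling choice of M M^T

# Plain flow: the stiff system forces a tiny step and it stays far from x*.
x_plain = gradient_flow(A, b, P_plain, mu=0.5, dt=1e-4, steps=4000)
# Preconditioned flow: a 100x larger step is stable and it reaches x*.
x_prec  = gradient_flow(A, b, P_diag, mu=0.5, dt=1e-2, steps=4000)
print(x_plain, x_prec, np.linalg.lstsq(A, b, rcond=None)[0])
```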
Artificial Neural Network with Processing Time Independent of the Size of the Problem

The systems of differential equations considered above cause the trajectory x(t) to converge to the desired solution x* only as t → ∞, although the convergence speed can be very high. In some real-time applications it is required to ensure that the specified energy function E(x) reaches its minimum within a prescribed finite period of time, say T_s, or that E(x) becomes close to the minimum within a specified error δ (where δ is an arbitrarily chosen, very small positive number). In other words, we can define the reachability time as the settling time T_s after which the energy function E(x) enters a δ-neighborhood of the minimum and remains there for all t ≥ T_s. Such a problem can be solved by making the coefficients of the learning-rate matrix μ adaptive during the minimization process, under the assumption that the initial value E(x(0)) and the minimum (final) value E(x*) of the energy function E(x(t)) are known or can be estimated.

Consider the Ax = b problem with a nonsingular matrix A, which can be mapped to the system of differential equations

    dx/dt = -μ(t) A^T (Ax - b).

The adaptive parameter μ(t) can be defined as

    μ(t) = [E(x(0)) - E(x*)] / ( T_s ||A^T (Ax - b)||^2 );

for this problem E(x*) = 0. Hence it follows that the energy function decreases linearly in time during the minimization process,

    dE/dt = -[E(x(0)) - E(x*)] / T_s,   i.e.,   E(x(t)) = E(x(0)) - t [E(x(0)) - E(x*)] / T_s,

and reaches the value E(x*) (very close to the minimum) after the time T_s. By choosing μ(t) in this way, we find that the system of differential equations reaches the equilibrium (stationary point) in the prescribed time T_s, independent of the size of the problem. This system of differential equations can (approximately) be implemented by the ANN shown in Fig. 4, employing auxiliary analog multipliers and dividers.
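A discrete-time sketch of this adaptive-gain idea is given below; it is a minimal illustration, assuming E(x*) = 0 (solvable system) and the adaptive gain μ(t) written above, with a small numerical guard against division by a vanishing gradient.

```python
import numpy as np

def prescribed_time_flow(A, b, T=1.0, dt=1e-4, eps=1e-12):
    """Gradient flow dx/dt = -mu(t) grad E(x) with the adaptive gain
    mu(t) = E(x(0)) / (T * ||grad E(x)||^2), which makes E(x(t)) decrease
    (approximately) linearly and become very small at the prescribed time T."""
    x = np.zeros(A.shape[1])
    E0 = 0.5 * np.linalg.norm(A @ x - b) ** 2      # initial energy E(x(0))
    t = 0.0
    while t < T:
        g = A.T @ (A @ x - b)                      # gradient of E(x)
        gnorm2 = g @ g
        if gnorm2 < eps:                           # already at the minimum
            break
        mu = E0 / (T * gnorm2)                     # adaptive learning rate
        x -= dt * mu * g
        t += dt
    return x, 0.5 * np.linalg.norm(A @ x - b) ** 2

A = np.array([[2.0, 1.0], [1.0, 3.0]])             # nonsingular system Ax = b
b = np.array([3.0, 5.0])
x_T, E_T = prescribed_time_flow(A, b)
print(x_T, E_T)   # E_T is small compared with E(x(0)), reached at t = T
```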
Neural Networks for Linear Programming

The ANN architectures considered in the previous sections can easily be employed for the solution of a linear programming (LP) problem, which can be stated in standard form as follows: minimize the scalar cost function

    f(x) = c^T x

subject to the linear constraints

    Ax = b,   x ≥ 0.

By use of the modified Lagrange-multiplier approach we can construct a computational energy function containing a regularization parameter. The minimization of this energy function E(x) can then be transformed into a set of differential equations.

The corresponding coefficients denote the integration time constants of the integrators. The circuit consists of adders (summing amplifiers) and integrators. Diodes used in the feedback from the integrators ensure that the output voltages x_j are nonnegative (i.e., x_j ≥ 0). Regularization in this circuit is performed by using local feedback with gain around the appropriate integrators.

Fig. 4. A conceptual ANN implementation of linear programming.

Minimax and Least Absolute Value Problems

Goal: to extend the proposed class of networks to new ANNs which are capable of finding, in real time, estimates of the solution vector x* and of the residual vector r(x*) = Ax* - b for the linear model Ax ≈ b, using the minimax and least absolute values criteria.

Lp-NORMED MINIMIZATION

Lp-normed error function:

    E_p(x) = (1/p) Σ_{i=1}^{m} |r_i(x)|^p,   r_i(x) = Σ_{j=1}^{n} a_ij x_j - b_i   (i = 1, 2, ..., m).

Steepest-descent method, with learning rates μ_j > 0 (j = 1, 2, ..., n):

    dx_j/dt = -μ_j ∂E_p(x)/∂x_j,

which gives

    dx_j/dt = -μ_j Σ_{i=1}^{m} a_ij g[r_i(x)]   (j = 1, 2, ..., n),

where g[r_i(x)] = |r_i(x)|^{p-1} sign[r_i(x)] and

    sign[r_i(x)] = +1 if r_i(x) > 0,   -1 if r_i(x) < 0.

L1 norm (p = 1): g[r_i(x)] = sign[r_i(x)].

L∞ norm: E_∞(x) = max_{1≤i≤m} |r_i(x)|, and

    g[r_i(x)] = sign[r_i(x)] if |r_i(x)| = max_{1≤k≤m} |r_k(x)|,   0 otherwise.

E_p(x) for p = 1 and p = ∞ has discontinuous first-order partial derivatives. E_1(x) is piecewise differentiable, with a possible derivative discontinuity at x if r_i(x) = 0 for some i; E_∞(x) has a derivative discontinuity at x if |r_i(x)| = |r_k(x)| = E_∞(x) for some i ≠ k. The presence of discontinuities in the derivatives is often responsible for various anomalous results, and the direct implementation of these activation functions is difficult and impractical.

MINIMAX (L∞-Norm)

We transform the minimax problem

    min_{x ∈ R^n} max_{1≤i≤m} |r_i(x)|

into the equivalent one: minimize ε subject to the constraints

    |r_i(x)| ≤ ε   (i = 1, 2, ..., m),   ε ≥ 0.

Thus the problem can be viewed as finding the smallest nonnegative value ε* = E_∞(x*) ≥ 0 for which the constraints are satisfied, where x* is the vector of the optimal values of the parameters.

NN Architecture Using Quadratic Penalty Function Terms

    E(ε, x) = ν ε + (κ/2) Σ_{i=1}^{m} { [ε - r_i(x)]_-^2 + [ε + r_i(x)]_-^2 },

where ν ≥ 0 and κ > 0 are penalty coefficients and [y]_- := min{0, y}. Applying the steepest-descent method gives

    dε/dt = -μ_0 ( ν - κ Σ_{i=1}^{m} [ (r_i(x) - ε) S_i1 - (r_i(x) + ε) S_i2 ] ),
    dx_j/dt = -μ_j κ Σ_{i=1}^{m} a_ij [ (r_i(x) - ε) S_i1 + (r_i(x) + ε) S_i2 ]   (j = 1, 2, ..., n),

where S_i1 = 1 if r_i(x) > ε and 0 otherwise, S_i2 = 1 if r_i(x) < -ε and 0 otherwise, with initial conditions x_j(0) = x_j^(0) (j = 1, 2, ..., n), ε(0) = ε^(0) ≥ 0, and μ_0 > 0, μ_j > 0.

• The system of differential equations can be simplified by incorporating adaptive nonlinear building blocks.

NN Architecture Using Exact Penalty Method

Alternatively, the minimax problem can be modified by using an exact penalty function, which leads to a new set of equations and a corresponding network architecture. One advantage of the proposed circuit is that it does not require the use of precision signum activation functions and absolute-value function generators.

LEAST ABSOLUTE VALUES (L1-NORM)

Find the design vector x* that minimizes the energy function

    E_1(x) = Σ_{i=1}^{m} |r_i(x)|.

Neural Network Model by Using the Inhibition Principle

The function of the inhibition subnetwork is to suppress some signals while allowing the other signals to be transmitted for further processing.

Theorem: there is a minimizer x* of the energy function E_1(x) for which the residuals satisfy r_i(x*) = 0 for at least n values of i, where n denotes the rank of the matrix A.
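The Lp steepest-descent rule above can be sketched numerically as follows (forward Euler, with sign(r) used as the subgradient for p = 1; the 4 x 2 system with one gross outlier is invented for illustration). It shows the robustness property discussed earlier: the p = 1 flow essentially ignores the outlier, and at the L1 minimizer [1, 1] three of the four residuals are zero, consistent with the theorem above.

```python
import numpy as np

def lp_flow(A, b, p=2.0, mu=0.05, dt=0.01, steps=20000):
    """Steepest-descent flow dx_j/dt = -mu * sum_i a_ij g[r_i(x)],
    with g[r] = |r|^(p-1) * sign(r)  (for p = 1 this is the subgradient sign(r))."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        r = A @ x - b
        g = np.abs(r) ** (p - 1.0) * np.sign(r)
        x -= dt * mu * (A.T @ g)
    return x

# Overdetermined system with one gross outlier in b (last entry).
A = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
b = np.array([2.0, 3.0, 4.0, 50.0])

print("p = 2 (least squares):       ", lp_flow(A, b, p=2.0))  # pulled off by the outlier
print("p = 1 (least absolute value):", lp_flow(A, b, p=1.0))  # close to [1, 1]
```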
Simplified NN for Solving Linear Least Squares and Total Least Squares Problems

Objective:
• design an analog neural-network circuit for implementing such adaptive algorithms;
• propose some extensions and modifications of the existing adaptive algorithms;
• demonstrate the validity and high performance of the proposed neural network models by computer simulation experiments.

Problem Formulation

In the least squares (LS) approach, the matrix A is assumed to be free from error and all errors are confined to the observation vector b. The cost (error) function is defined as

    E(x) = (1/2) ||Ax - b||_2^2.

By using a standard gradient approach for the minimization of this cost function, the problem can be mapped to the system of linear differential equations

    dx/dt = -μ A^T (Ax - b).

A direct network implementation of this system requires extra precalculations (of the products A^T A and A^T b) and is inconvenient for large matrices, especially when the entries a_ij and/or b_i are time-variable.

Motivation

The ordinary LS problem is optimal only if all errors are confined to the observation vector b and have a Gaussian distribution; the measurements in the data matrix A are assumed to be free from errors. However, such an assumption is often unrealistic (e.g., in image recognition and computer vision), since sampling errors, modeling errors, and instrument errors may introduce inaccuracies into the data matrix A. The total least squares (TLS) problem has been devised as a more global and often more reliable fitting method than the standard LS problem for solving an overdetermined set of linear equations when the measurements in b as well as in A are subject to errors.

A Simplified Neuron for the Least Squares Problem

In the design of an algorithm for neural networks the key step is to construct an appropriate cost (computational energy) function E(x) so that the lowest energy state corresponds to the desired solution x*. The formulation of the cost function enables us to transform the minimization problem into a system of differential equations, on the basis of which we design an appropriate neural network with an associated learning algorithm. For our purpose we have developed an instantaneous error function based on the actual error e(t), which can be written as

    e(t) = Σ_{i=1}^{m} s_i(t) [ Σ_{j=1}^{n} a_ij x_j(t) - b_i ],

where the s_i(t) are independent source (excitation) signals. For the so-formulated error e(t) we can construct the instantaneous estimate of the energy (cost) function at time t as

    E(x, t) = (1/2) e^2(t).

The minimization of this cost (computational energy) function leads to the set of differential equations

    dx_j/dt = -μ_j e(t) Σ_{i=1}^{m} a_ij s_i(t)   (j = 1, 2, ..., n),

which can be written in the compact matrix form

    dx/dt = -μ e(t) A^T s(t).

This system of differential equations constitutes the basic adaptive learning algorithm of a single artificial neuron (processing unit).

Loss Functions

There are many possible loss functions ρ(e) which can be employed as the cost function: the logistic function, Talvar's function, Huber's function, and the absolute value function.

Standard Regularized Least Squares (LS) Problem

Find the vector x*_LS which minimizes the cost function consisting of the LS term plus a regularization term,

    E(x) = (1/2) ||Ax - b||_2^2 + (α/2) ||x||_2^2.

The minimization of this cost function according to the gradient-descent rule leads to the learning algorithm

    dx/dt = -μ [ A^T (Ax - b) + α x ].

Neural Network Implementations

The network consists of analog integrators, summers, and analog multipliers. The network is driven by the independent source signals s_i(t) multiplied by the incoming data a_ij and b_i (i = 1, 2, ..., m; j = 1, 2, ..., n). The artificial neuron (processing unit) with an on-chip adaptive learning algorithm shown in the figure allows processing of the input information (contained in the available input data a_ij, b_i) fully simultaneously, i.e., all m equations are acted upon simultaneously in time. This is an important feature of the proposed neural network.
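A discrete-time sketch of this single-neuron learning rule follows. The slides do not specify the waveforms of the source signals s_i(t), so i.i.d. zero-mean Gaussian samples are assumed here; under that assumption the expected update equals -mu A^T (Ax - b), so the iterate drifts toward the LS solution up to small stochastic fluctuations.

```python
import numpy as np

def single_neuron_ls(A, b, mu=0.01, steps=100000, rng=np.random.default_rng(0)):
    """Discrete-time sketch of the single-neuron LS rule:
        e(t)  = sum_i s_i(t) * (a_i^T x - b_i)
        x_j  += -mu * e(t) * sum_i a_ij * s_i(t)
    With zero-mean, unit-variance, independent s_i(t) (an assumption), the
    expected step is -mu * A^T (A x - b), i.e., the ordinary LS gradient."""
    m, n = A.shape
    x = np.zeros(n)
    for _ in range(steps):
        s = rng.standard_normal(m)      # source signals s_i(t)
        e = s @ (A @ x - b)             # instantaneous error e(t)
        x -= mu * e * (A.T @ s)         # one learning step for all x_j at once
    return x

A = np.array([[1.0, 2.0], [3.0, 1.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 1.0])
print(single_neuron_ls(A, b))                       # near the LS solution
print(np.linalg.lstsq(A, b, rcond=None)[0])         # reference
```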
Adaptive Learning Algorithms for the TLS Problem

For the TLS problem formulated above we can construct an analogous instantaneous energy function, whose minimization leads to a set of differential equations. This set of differential equations constitutes a basic adaptive parallel learning algorithm for solving the TLS problem for overdetermined linear systems.

Analog (continuous-time) implementation of the algorithm.

Extensions and Generalizations of Neural Network Models

It is interesting that the neural network models shown in the previous figures can be employed not only to solve LS or TLS problems; they can also easily be modified and/or extended to related problems. By changing the value of the parameter β, more or less emphasis can be given to errors in the matrix A relative to errors in the vector b. For large β (say, β = 100) it can be assumed that the vector b is almost free of error and the error lies in the data matrix A only. Such a case is referred to as the DLS (data least squares) problem (since the error occurs in A but not in b). The DLS problem can be solved by simulating the corresponding system of differential equations. For complex-valued elements (signals) the algorithm can be further generalized. In summary:

    β = 0 for the LS problem,
    β = 1 for the TLS problem,
    β >> 1 for the DLS problem.

Computer Simulation Results (LS)

Example 1: Consider the problem of finding the minimal L2-norm solution of an underdetermined system of linear equations. Such a set of equations has infinitely many solutions, but there is a unique minimum-norm solution, which we want to find. The final solution (equilibrium point) was x* = [0.0882, 0.1083, 0.2733, 0.5047, 0.3828, -0.3097]^T, which is in excellent agreement with the exact minimum L2-norm solution obtained by using MATLAB.

Example 2: Consider a linear parameter estimation problem described by a set of linear equations.

Simulation Results (LS, TLS, DLS)
Time: less than 400 ns.

Simulation Results (Minimax, Least Absolute Value)
• Minimax problem, last proposed NN: time 300 ns.
• Least absolute value: first proposed NN, time 60 ns; last proposed NN (solution obtained in two phases), time 100 ns.

Simulation Results (Iteratively Reweighted LS)
Iteratively reweighted least squares criterion compared with the standard least squares criterion.

Simulation Results (Minimax, Least Absolute Value)
• Example 3: results of the last NN and of the first NN.

Simulation Results (Iteratively Reweighted LS and Related Architectures)
• Iteratively reweighted least squares criterion: time 750 ns.
• Augmented Lagrangian with regularization: time 52 ns.
• ANN providing a linearly decreasing energy function in time with a prescribed speed of convergence: time 10 ns.
• Example 4: finding the inverse of a matrix. In order to find the inverse matrix we make the source vector b successively equal to [1, 0, 0]^T, [0, 1, 0]^T, and [0, 0, 1]^T. Time: 50 ns.
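Example 4 can be sketched as follows: each column of A^{-1} is obtained by running the least-squares flow with the source vector b set to a unit vector. The 3 x 3 matrix of Example 4 is not reproduced in the slides, so an arbitrary well-conditioned matrix is used here purely for illustration.

```python
import numpy as np

def ls_flow(A, b, mu=0.5, dt=0.01, steps=4000):
    """Forward-Euler simulation of the gradient flow dx/dt = -mu A^T (Ax - b)."""
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x -= dt * mu * (A.T @ (A @ x - b))
    return x

def invert_by_columns(A):
    """Find A^{-1} column by column: the k-th column solves A x = e_k,
    obtained by driving the network with the unit source vector b = e_k."""
    n = A.shape[0]
    return np.column_stack([ls_flow(A, np.eye(n)[:, k]) for k in range(n)])

# Illustrative 3x3 matrix (not the one from the slides).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
print(invert_by_columns(A))
print(np.linalg.inv(A))     # reference inverse
```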
Conclusion

• Very simple and low-cost analog neural networks for solving least squares and TLS problems, using only one single, highly simplified artificial neuron with an on-chip learning capability.
• Able to estimate the unknown parameters in real time (hundreds or thousands of nanoseconds).
• Suitable for currently available VLSI implementations.
• Attractive for real-time and/or high-throughput-rate applications in which the observation vector and the model matrix change in time.
• Universal and flexible: the approach allows either processing of all equations fully simultaneously, or processing of groups of equations (blocks) in iterative steps, or processing of only one equation per block, i.e., in each iterative step only one single equation is processed.