LECTURE #9: FUZZY LOGIC & NEURAL NETS

COURSE ANNOUNCEMENT -- Spring 2004
DSES-4810-01 Intro to COMPUTATIONAL INTELLIGENCE
& SOFT COMPUTING
With ever-increasing computer power readily available, novel engineering methods based
on “soft computing” are emerging at a rapid rate. This course provides the students with a
working knowledge of computational intelligence (CI) covering the basics of fuzzy logic,
neural networks, genetic algorithms and evolutionary computing, simulated annealing,
wavelet analysis, artificial life and chaos. Applications in control, forecasting,
optimization, data mining, fractal image compression and time series analysis are
illustrated with engineering case studies.
This course provides a hands-on introduction to the fascinating discipline of
computational intelligence (i.e. the synergistic interplay of fuzzy logic, genetic
algorithms, neural networks, and other soft computing techniques). The students will
develop the skills to solve engineering problems with computational intelligence
paradigms. The course requires a CI-related project in the student’s area of interest.
Instructor:      Prof. Mark J. Embrechts (x 4009, embrem@rpi.edu)
Office Hours:    Thursday 10-11 am (CII 5217), or by appointment
Class Time:      Monday/Thursday: 8:30 – 9:50 am (Amos Eaton Hall 216)
Text (optional): J. S. Jang, C. T. Sun, E. Mizutani, “Neuro-Fuzzy and Soft Computing,”
                 Prentice Hall, 1996 (1998). ISBN 0-13-261066-3
Course is open to graduate students and seniors of all disciplines.
GRADING:
Tests                  10%
5 Homework Projects    35%
Course Project         40%
Presentation           15%
ATTENDANCE POLICY
Course attendance is mandatory; a make-up project is required for each missed class. A
missed class without a make-up results in the loss of half a grade point.
ACADEMIC HONESTY
Homework Projects are individual exercises. You may discuss assignments with your
peers, but you may not copy. The course project may be done in groups of two.
COMPUTATIONAL INTELLIGENCE - COURSE OUTLINE
1. INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS (ANNs)
   1.1 History
   1.2 Philosophy of neural nets
   1.3 Overview of neural nets
2. INTRODUCTION TO FUZZY LOGIC
   2.1 History
   2.2 Philosophy of Fuzzy Logic
   2.3 Terminology and definitions
3. INTRODUCTION TO EVOLUTIONARY COMPUTING
   3.1 Introduction to Genetic Algorithms
   3.2 Evolutionary Computing / Evolutionary Programming / Genetic Programming
   3.3 Terminology and definitions
4. NEURAL NETWORK APPLICATIONS / DATA MINING WITH ANNs
   4.1 Case study: time series forecasting (population forecasting)
   4.2 Case study: automated discovery of novel pharmaceuticals (Part I)
   4.3 Data mining with neural networks
5. FUZZY LOGIC APPLICATIONS / FUZZY EXPERT SYSTEMS
   5.1 Fuzzy logic case study: tipping
   5.2 Fuzzy expert systems
6. SIMULATED ANNEALING / GENETIC ALGORITHM APPLICATIONS
   6.1 Simulated annealing
   6.2 Supervised clustering with GAs
   6.3 Case study: automated discovery of novel pharmaceuticals (Part II)
7. DATA VISUALIZATION WITH SELF-ORGANIZING MAPS
   7.1 The Kohonen feature map
   7.2 Case study: visual explorations for novel pharmaceuticals (Part III)
8. ARTIFICIAL LIFE
   8.1 Cellular automata
   8.2 Self-organized criticality
   8.3 Case study: highway traffic jam simulation
9. FRACTALS and CHAOS
   9.1 Fractal Dimension
   9.2 Introduction to Chaos
   9.3 Iterated Function Systems
10. WAVELETS
Monday January 12, 2004
DSES-4810-01 Intro to COMPUTATIONAL INTELLIGENCE
& SOFT COMPUTING
Instructor:      Prof. Mark J. Embrechts (x 4009 or 371-4562, embrem@rpi.edu)
Office Hours:    Tuesday 10-12 (CII 5217), or by appointment
Class Time:      Monday/Thursday: 8:30-9:50 am (Amos Eaton Hall 216)
TEXT (optional): J. S. Jang, C. T. Sun, E. Mizutani, “Neuro-Fuzzy and Soft Computing,”
                 Prentice Hall, 1996 (1998). ISBN 0-13-261066-3
LECTURES #1-3: INTRO to Neural Networks
The purpose of the first two lectures is to present an overview of the philosophy of
artificial neural networks. Today's lecture will provide a brief history of neural network
development and motivate the idea of training a neural network. We will introduce a neural
network as a framework to generate a map from an input space to an output space. Three
basic premises will be discussed to explain artificial neural networks:
(1) A problem can be formulated and represented as a map from an m-dimensional space R^m
to an n-dimensional space R^n, or R^m -> R^n.
(2) Such a map can be realized by setting up an equivalent artificial framework of basic
building blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons
forms an artificial neural network or ANN.
(3) The neural net can be trained to conform to the map based on samples of the map and will
reasonably generalize to new cases it has not encountered before.
Handouts:
1. Mark J. Embrechts, "Problem Solving with Artificial Neural Networks."
2. Course outline and policies.
Tasks:
Start thinking about project topic, meet with me during office hours or by appointment.
PROJECT DEADLINES:
January 22    Homework Project #0 (web page summary)
January 29    Project proposal (2 typed pages: title, references, motivation,
              deliverable, evaluation criteria)
WHAT IS EXPECTED FROM THE CLASS PROJECT?

- Prepare a monologue about a course-related subject (15 to 20 written pages and
  supporting material in appendices).
- Prepare a 20-minute lecture about your project and give the presentation. Hand in a hard
  copy of your slides.
- A project starts in the library. Prepare to spend at least a full day in the library over
  the course of the project. Meticulously write down all the relevant references, and
  attach a copy of the most important references to your report.
- The idea for the lecture and the monologue is that you spend the maximum amount of
  effort to allow a third party to present that same material, based on your preparation,
  with a minimal amount of effort.
- The project should be a finished and self-consistent document where you
  meticulously digest the prerequisite material, give a brief introduction to your work,
  and motivate the relevance of the material. Hands-on program development and
  personal expansions of and reflections on the literature are strongly encouraged. If
  your project involves programming, hand in a working version of the program (with
  source code) and document the program with a user’s manual and sample problems.
- It is expected that you spend on average 6 hours/week on the class project.
PROJECT PROPOSAL


- A project proposal should be a fluent text of at least 2 full pages, where you are trying
  to sell the idea for a research project in a professional way. Therefore the proposal
  should contain a clear background and motivation.
- The proposal should define a clear set of goals, deliverables, and a timetable.
- Identify how you would consider your project successful and address the evaluation
  criteria.
- Make sure you select a title (acronyms and logos are suggested as well), and add a list
  of references to your proposal.
PROBLEM SOLVING WITH ARTIFICIAL NEURAL NETWORKS
Mark J. Embrechts
1. INTRODUCTION TO NEURAL NETWORKS
1.1 Artificial neural networks in a nutshell
This introduction to artificial neural networks explains as briefly as possible what is commonly
understood by an artificial neural network and how such networks can be applied to solve data mining
problems. Only the most popular type of neural network will be discussed here: the
feedforward neural network (usually trained with the popular backpropagation algorithm).
Neural nets emerged from psychology as a learning paradigm that mimics how the brain
learns. There are many different types of neural networks, training algorithms, and different
ways to interpret how and why a neural network operates. A neural network problem is
viewed in this write-up as a model-free implementation of a map, and it is silently assumed
that most data mining problems can be framed as a map. This is a very limited view, which
does not fully cover the power of artificial neural networks. However, this view leads to an
intuitive basic understanding of the neural network approach for problem solving with a
minimum of otherwise necessary introductory material.
Three basic premises will be discussed in order to explain artificial neural networks:
(1) A problem can be formulated and represented as a map from an m-dimensional space R^m
to an n-dimensional space R^n, or R^m -> R^n.
(2) Such a map can be implemented by constructing an artificial framework of basic building
blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons forms an
artificial neural network (ANN).
(3) The neural net can be trained to conform to the map based on samples of the map and will
reasonably generalize to new cases it has not encountered before.
The next sections expand on these premises and explain the notions of a map, the McCulloch-Pitts
neuron, the artificial neural network (ANN), training, and generalization.
1.2 Framing an equivalent map for a problem
Let us start by considering a token problem and reformulate this problem as a map. The token
problem involves deciding whether a seven-bit binary number is odd or even. To restate this
problem as a map two spaces are considered: a seven-dimensional input space containing all
the seven-bit binary numbers, and a one-dimensional output space with just two elements (or
classes): odd or even, which will be symbolically represented by a one or a zero. Such a map
can be interpreted as a transformation from R^m to R^n, or R^m -> R^n (with m = 7 and n = 1). A
map for the seven-bit parity problem is illustrated in figure 1.1.
[Figure 1.1: The seven-bit parity problem posed as a mapping problem -- the seven-bit binary
strings 0000000, 0000001, ..., 1111111 in R^7 are mapped to the outputs 1 (odd) or 0 (even) in R^1.]
The seven-bit parity problem was just framed as a formal mapping problem. The specific
details of the map are yet to be determined: all we have so far is the hope that a precise
function can be formulated that transforms the seven-bit binary input space into a one-dimensional,
one-bit output space and thereby solves the seven-bit parity problem. We hope that eventually we can
specify a green box that could formally be implemented as a subroutine in C code, where the
subroutine would have a header of the type:
void Parity_Mapping(VECTOR sample, int *decision) {
    /* ... body still to be determined ... */
    *decision = ... ;
} // end of subroutine
In other words: given a seven-bit binary vector as an input to this subroutine (e.g. {1, 0,
1, 1, 0, 0, 1}), we expect the subroutine to return an integer nicknamed "decision.” The
value for decision will turn out to be unity or zero, depending on whether the seven-bit
input vector is odd or even.
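For reference, the target map itself is easy to state directly. The sketch below is a hypothetical, non-neural implementation of that map; it reads "odd or even" as the parity of the bit pattern (an odd or even number of 1-bits), which is the usual meaning of the parity problem, and it replaces the unspecified VECTOR type with a plain array of 0/1 integers. It is only meant to pin down what the green box must eventually reproduce.

#include <stdio.h>

/* Hypothetical direct implementation of the parity map (not a neural net):
   the seven-bit input is assumed to be stored as an array of 0/1 integers. */
void Parity_Mapping(const int sample[7], int *decision)
{
    int i, ones = 0;
    for (i = 0; i < 7; i++)
        ones += sample[i];        /* count the 1-bits                        */
    *decision = ones % 2;         /* 1 = odd number of 1-bits, 0 = even;     */
                                  /* for the arithmetic odd/even reading the */
                                  /* decision would simply be the last bit   */
}

int main(void)
{
    int bits[7] = {1, 0, 1, 1, 0, 0, 1};   /* the example vector from the text */
    int decision;
    Parity_Mapping(bits, &decision);
    printf("decision = %d\n", decision);   /* four 1-bits, so this prints 0    */
    return 0;
}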
We call this methodology a green-box approach to problem solving to imply that we only hope
that such a function can eventually be realized, but that so far, we are clueless about how
exactly we are going to fill the body of that green box. Of course, you probably guessed by
now that somehow artificial neural networks will be applied to do this job for us. Before
elaborating on neural networks we still have to discuss a subtle but important point related to
our way of solving the seven-bit parity problem. Implicitly it is assumed for this problem that
all seven-bit binary numbers are available and that the parity of each seven-bit binary number
is known.
Let us complicate the seven-bit parity problem by specifying that we know for the time being
the correct parity for only 120 of the 128 possible seven-bit binary numbers. We want to
specify a map for these 120 seven-bit binary numbers such that the map will correctly identify
the eight remaining binary numbers. This is a much more difficult problem than mapping the
seven-bit parity problem based on all the possible samples, and whether an answer exists and
can be found for this type of problem is often not clear at all from the outset. The methodology
for learning what has to go in the green box for this problem will divide the available samples
for this map into a training set -- a subset of the known samples -- and a test set. The test set will
be used only for evaluating the goodness of the green box implementation of the map.
Let us introduce a second example to illustrate how a regression problem can be reformulated
as a mapping problem. Consider a collection of images of circles: all 64x64 black-and-white
(B&W) pixel images. The problem here is to infer the radii of these circles based on the pixel
values. Figure 1.2 illustrates how to formulate this problem as a formal map. A 64x64 image
could be scanned row by row and be represented by a string of zeros and ones depending on
whether each pixel is white or black. This input space has 64x64 or 4096 binary elements and
can therefore be considered as a space with 4096 dimensions. The output space is a one-dimensional
number, namely the radius of the circle in the appropriate units.
For this problem we generally would not expect to have access to all possible 64x64 B&W
images of circles to determine the mapping function. We therefore would only consider a
representative sample of circle images, somehow use a neural network to fill out the green box
to specify the map, and hope that it will give the correct circle radius within a certain tolerance
for future out-of-sample 64x64 B&W images of circles. It actually turns out that the formal
mapping procedure as described so far would yield lousy estimates for the radius. Some
ingenious form of preprocessing on the image data (e.g., considering selected frequencies of a
2-D Fourier transform) will be necessary to reduce the dimensionality of the input space.
Most problems can be formulated in multiple ways as a map of the type: Rm -> Rn. However,
not all problems can be elegantly transformed into a map, and some formal mapping
representations might be better than others for a particular problem. Often ingenuity,
experimentation, and common sense are called for to frame an appropriate map that can
adequately be represented by artificial neural networks.
[Figure 1.2: Determining the radius of a 64x64 B&W image of a circle, posed as a formal mapping
problem -- a map from R^4096 to R^1 that sends each image to a radius estimate.]
1.3 The McCulloch-Pitts neuron and artificial neural networks
The first neural network premise states that most problems can be formulated as an equivalent
formal mapping problem. The second premise states that such a map can be represented by an
artificial neural network (or ANN): i.e., a framework of basic building blocks, the so-called
McCulloch-Pitts artificial neurons.
The McCulloch-Pitts neuron was first proposed in 1943 by Warren McCulloch and Walter
Pitts, a neurophysiologist and a mathematician, in a paper illustrating how simple artificial
representations of neurons could in principle represent any arithmetic function. How to actually
implement such a function was first addressed by the psychologist Donald Hebb in 1949 in his
book "The Organization of Behavior." The McCulloch-Pitts neuron can easily be understood as
a simple mathematical operator. This operator has several inputs and one output and performs
two elementary operations on the inputs: first it makes a weighted sum of all the inputs, and
then it applies a functional transform to that sum, which will be sent to the output. Assume that
there are N inputs {x_1, x_2, ..., x_N}, or an input vector x, and consider the output y. The output
y can be expressed as a function of its inputs according to the following equations:
    sum = \sum_{i=1}^{N} x_i                                    (1)

and

    y = f(sum)                                                  (2)

So far we have not yet specified the transfer function f(.). In its simplest form it is just a
threshold function giving an output of unity when the sum exceeds a certain value, and zero
when the sum is below this value. It is common practice in neural networks to use the
sigmoid function as the transfer function, which can be expressed as:

    f(sum) = \frac{1}{1 + e^{-sum}}                             (3)
Figure 1.3 illustrates the basic operations of a McCulloch-Pitts neuron. It is common practice
to apply an appropriate scaling to the inputs (usually such that either 0 < x_i < 1 or -1 < x_i < 1).
[Figure 1.3: The McCulloch-Pitts artificial neuron as a mathematical operator -- the inputs
x_1, ..., x_N are multiplied by the weights w_1, ..., w_N, summed, and passed through the transfer
function f(.) to produce the output y.]
One more enhancement has to be clarified for the basics of the McCulloch-Pitts neuron: before
summing the inputs, they actually have to be modified by multiplying them with a weight
vector, {w_1, w_2, ..., w_N}, so that instead of using equation (1) we make a weighted sum of the
inputs according to equation (4):

    sum = \sum_{i=1}^{N} w_i x_i                                (4)
A collection of these basic operators can be stacked in a structure -- an artificial neural network
-- that can have any number of inputs and any number of outputs. The neural network shown in
figure 1.4 represents a map with two inputs and one output. There are two fan-out input elements
and a total of six neurons. There are three layers of neurons: the first layer is called the first
hidden layer, the second layer is the second hidden layer, and the output layer consists of one
neuron. There are 14 weights. The layers are fully connected. In this example there are no
backward connections, and this type of neural net is therefore called a feedforward network.
The type of neural net of figure 1.4 is the most commonly encountered type of artificial neural
network, the feedforward net:
(1) There are no connections skipping layers.
(2) The layers are fully connected.
(3) There is usually at least one hidden layer.
It is not hard to envision now that any map can be translated into an artificial neural network
structure -- at least formally. We have not yet addressed how to determine the right weight set
or how many neurons to place in the hidden layers; this is the subject of the next section.
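To make the structure concrete, here is a minimal sketch, assuming figure 1.4 shows a fully connected 2-3-2-1 network with no bias nodes (six neurons and 6 + 6 + 2 = 14 weights, as described above), of how such a feedforward net evaluates an input vector. The all-zero weights in main() are placeholders, so every neuron simply outputs 0.5.

#include <math.h>
#include <stdio.h>

#define MAX_UNITS 8

static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* Forward pass through a fully connected feedforward net with layer sizes
   {2, 3, 2, 1}. weights[l][j][i] connects unit i of layer l to unit j of
   layer l+1; each unit is a McCulloch-Pitts neuron with a sigmoid. */
double forward(const int sizes[], int nlayers,
               double weights[][MAX_UNITS][MAX_UNITS], const double input[])
{
    double act[MAX_UNITS], next[MAX_UNITS];
    int l, i, j;

    for (i = 0; i < sizes[0]; i++)
        act[i] = input[i];                       /* fan-out input layer  */

    for (l = 0; l < nlayers - 1; l++) {
        for (j = 0; j < sizes[l + 1]; j++) {
            double sum = 0.0;
            for (i = 0; i < sizes[l]; i++)
                sum += weights[l][j][i] * act[i];
            next[j] = sigmoid(sum);              /* weighted sum + sigmoid */
        }
        for (j = 0; j < sizes[l + 1]; j++)
            act[j] = next[j];
    }
    return act[0];                               /* single output neuron   */
}

int main(void)
{
    int sizes[4] = {2, 3, 2, 1};
    static double w[3][MAX_UNITS][MAX_UNITS];    /* zero-initialized weights */
    double x[2] = {0.3, 0.8};
    printf("y = %f\n", forward(sizes, 4, w, x)); /* 0.5 with all-zero weights */
    return 0;
}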
[Figure 1.4: A typical artificial feedforward neural network -- two fan-out input elements, a first
hidden layer, a second hidden layer, and a single output neuron, fully connected by the weights w_ij.]
1.4 Artificial neural networks
An artificial neural network is a collection of connected McCulloch-Pitts neurons. Neural
networks can formally represent almost any functional map provided that:
(1) a proper number of basic neurons are appropriately connected, and
(2) appropriate weights are selected.
Specifying an artificial neural network to conform with a particular map means determining the
neural network structure and its weights. How to connect the neurons and how to select the
weights is the subject of the discipline of artificial neural networks. Even when a neural
network can represent in principle any function or map, it is not necessarily clear that one can
ever specify such a neural network with the existing algorithms. This section will briefly
address how to set up a neural network, and give at least a conceptual idea about determining
an appropriate weight set.
The feedforward neural network of figure 1.4 is the most commonly encountered type of
artificial neural net. For most functional maps at least one hidden layer of neurons, and
sometimes two hidden layers of neurons are required. The structural layout of a feedforward
neural network can now be determined. For a feedforward layered neural network two points
have to be addressed to determine the layout:
(1) How many hidden layers to use?
(2) How many neurons to choose in each hidden layer?
Different experts in the field often have different answers to these questions. A general
guideline that works surprisingly well is to try one hidden layer first, and to choose as few
neurons in the hidden layer(s) as one can get away with.
The most intriguing question still remains, and it addresses the third premise of neural networks:
is it actually possible to come up with algorithms that allow us to specify a good weight set?
How do we determine the weights of the network from samples of the map? Can we expect a
reasonable answer from such a network for new cases that were not encountered before?
It is straightforward to devise algorithms that will determine a weight set for neural networks
that contain just an input layer and an output layer -- and no hidden layer(s) of neurons.
However, such networks do not generalize well at all. Neural networks with good
generalization capabilities require at least one hidden layer of neurons. For many applications
such neural nets generalize surprisingly well. The need for hidden layers in artificial neural
networks was already realized in the late fifties. However, in their 1969 book "Perceptrons" the
MIT professors Marvin Minsky and Seymour Papert argued that it might not be possible at all to
come up with any algorithm to determine a suitable weight set if hidden layers are present in the
network structure. Such an algorithm only became widely known in 1986: the backpropagation
algorithm, popularized by Rumelhart and McClelland in a very clearly written chapter in their book
"Parallel Distributed Processing." The backpropagation algorithm was actually invented and
reinvented several times, and its original formulation is generally credited to Paul Werbos. He
described the backpropagation algorithm in his Harvard Ph.D. dissertation in 1974, but this
algorithm was not widely noted at that time. The majority of today’s neural network
applications relies in one form or another on the backpropagation algorithm.
1.5 Training neural networks
The result of a neural network is its weight set. Determining an appropriate weight set is called
training or learning, based on the metaphor that learning takes place in the human brain which
can be viewed as a collection of connected biological neurons. The learning rule proposed by
Hebb was the first mechanism for determining the weights of a neural network. The Canadian
psychologist Donald Hebb postulated this learning strategy in the late forties as one of the basic
mechanisms by which humans and animals learn. Later on it turned out that he had hit the nail
on the head with his formulation. Hebb's rule is surprisingly simple, and while in principle it can
be used to train multi-layered neural networks, we will not elaborate further on this rule. Let us
just point out here that there are now many different neural network paradigms and many
algorithms for determining the weights of a neural network. Most of these algorithms work
iteratively: i.e., one starts out with a randomly selected weight set, applies one or more samples
of the mapping, and gradually upgrades the weights. This iterative search for a proper weight
set is called the learning or training phase.
Before explaining the workings of the backpropagation algorithm we will present a simple
alternative, the random search. The most naive answer to determine a weight set -- which
rather surprisingly in hindsight did not emerge before the backpropagation principle was
formulated -- is just to try randomly generated weight sets, and to keep trying with new
randomly generated weight sets until one hits it just right. The random search would, at least in
principle, be a way to determine a suitable weight set were it not for its excessive demands on
computing time. While this method sounds too naive to be given even serious thought, smart
random search paradigms (such as genetic algorithms and simulated annealing) are nowadays
actually legitimate and widely used training mechanisms for neural networks. However, random
search methods come with many bells and whistles to tune, and are extremely demanding on
computing time. Only the wide availability of ever faster computers allowed this method to be
practical at all.
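A minimal sketch of such a random search is shown below: one weight at a time is given a small random perturbation, and the change is kept only when the error decreases. The error function here is a toy placeholder (it just measures the distance to a fixed target weight vector); in a real application it would run all training samples through the network and return the summed squared output error.

#include <stdio.h>
#include <stdlib.h>

#define NWEIGHTS 7

/* Toy stand-in for the network error; a real version would evaluate the
   net on all training samples for the current weight set. */
static double network_error(const double w[])
{
    static const double target[NWEIGHTS] = {0.5, -1.2, 0.8, 0.1, -0.4, 1.5, -0.9};
    double err = 0.0;
    for (int i = 0; i < NWEIGHTS; i++)
        err += (w[i] - target[i]) * (w[i] - target[i]);
    return err;
}

static double uniform(double lo, double hi)
{
    return lo + (hi - lo) * rand() / (double)RAND_MAX;
}

int main(void)
{
    double w[NWEIGHTS], best;
    int i, iter;

    for (i = 0; i < NWEIGHTS; i++)            /* random initial weight set    */
        w[i] = uniform(-1.0, 1.0);
    best = network_error(w);

    for (iter = 0; iter < 100000; iter++) {
        int k = rand() % NWEIGHTS;
        double old = w[k];
        w[k] += uniform(-0.1, 0.1);           /* perturb one weight at a time */
        double err = network_error(w);
        if (err < best)
            best = err;                       /* keep the improvement         */
        else
            w[k] = old;                       /* otherwise undo the change    */
    }
    printf("final error = %g\n", best);
    return 0;
}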
The process for determining the weights of a neural net proceeds in two separate stages. In the
first stage, the training phase, one applies an algorithm to determine a -- hopefully good --
weight set with about 2/3 of the available mapping samples. The generalization performance of
the just-trained neural net is subsequently evaluated in the testing phase based on the remaining
samples of the map.
1. 6 The backpropagation algorithm
An error measure can be defined to quantify the performance of a neural net. This error
function depends on the weight values and the mapping samples. Determining the weights of a
neural network can therefore be interpreted as an optimization problem, where the performance
error of the network structure is minimized for a representative sample of the mappings. All
paradigms applicable to general optimization problems apply therefore to neural nets as well.
The backpropagation algorithm is elegant and simple, and is used in roughly eighty percent of
neural network applications. It consistently gives at least reasonably acceptable answers for the
weight set. The backpropagation algorithm cannot be applied to just any optimization
problem, but it is specifically tailored to multi-layer feedforward neural networks.
There are many ways to define the performance error of a neural network. The most commonly
applied error measure is the mean square error. This error, E, is determined by showing every
sample to the net and tallying the squared differences between the actual outputs, o_i, and the
desired target outputs, t_i, according to equation (5):

    E = \sum_{i=1}^{n_{outputs}} (o_i - t_i)^2                  (5)
Training a neural network starts out with a randomly selected weight set. A batch of samples is
shown to the network, and an improved weight set is obtained by iterating according to equations
(6) and (7). The new value of a particular weight (labeled ij) at iteration (n+1) is an
improvement over the value from iteration (n), obtained by moving a small amount along the
gradient of the error surface in the direction of the minimum:

    w_{ij}^{(n+1)} = w_{ij}^{(n)} + \Delta w_{ij}               (6)

    \Delta w_{ij} = -\eta \frac{dE}{dw_{ij}}                    (7)
Equations (6) and (7) represent an iterative steepest descent algorithm, which will always
converge to a local minimum of the error function provided that the learning parameter, \eta, is
small. The ingenuity of the backpropagation algorithm was to come up with a simple analytical
expression for the gradient of the error in multi-layered nets by a clever application of the
chain rule. While it was for a while commonly believed that the backpropagation algorithm
was the only practical algorithm to implement equation (7), it is worth pointing out that the
derivative of E with respect to the weights can easily be estimated numerically by tweaking the
weights a little bit. This approach is perfectly valid, but is significantly slower than the elegant
backpropagation formulation. The details for deriving the backpropagation algorithm can be
found in the literature.
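The remark about tweaking the weights can be made concrete with a short sketch: the gradient dE/dw_k in equation (7) is estimated by central finite differences and then used for the steepest descent update of equation (6). The error function below is a toy placeholder standing in for the mean square error of equation (5), so the numbers are illustrative only.

#include <stdio.h>

#define NWEIGHTS 4

/* Toy error function standing in for equation (5); a real implementation
   would sum (o_i - t_i)^2 over the network outputs for all samples. */
static double E(const double w[])
{
    double e = 0.0;
    for (int i = 0; i < NWEIGHTS; i++)
        e += (w[i] - 1.0) * (w[i] - 1.0);   /* minimum at w_i = 1 */
    return e;
}

int main(void)
{
    double w[NWEIGHTS] = {0.3, -0.7, 2.0, 0.0};
    const double eta = 0.1;                  /* learning parameter (eq. 7) */
    const double h = 1.0e-5;                 /* size of the weight tweak   */

    for (int iter = 0; iter < 200; iter++) {
        double grad[NWEIGHTS];
        for (int k = 0; k < NWEIGHTS; k++) {
            double wp[NWEIGHTS], wm[NWEIGHTS];
            for (int i = 0; i < NWEIGHTS; i++) { wp[i] = w[i]; wm[i] = w[i]; }
            wp[k] += h;  wm[k] -= h;
            grad[k] = (E(wp) - E(wm)) / (2.0 * h);  /* dE/dw_k, numerically */
        }
        for (int k = 0; k < NWEIGHTS; k++)
            w[k] += -eta * grad[k];          /* equation (6): w <- w + dw   */
    }
    printf("E after training: %g\n", E(w));  /* should be close to zero     */
    return 0;
}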
1.7 More neural network paradigms
So far, we have briefly described how feedforward neural nets can solve problems by recasting the
problem as a formal map. The workings of the backpropagation algorithm to train a neural
network were explained. While the views and algorithms presented here conform with
the mainstream approach to neural network problem solving, there are literally hundreds of
different neural network types and training algorithms. Recasting the problem as a formal map
is just one part and one view of neural nets. For a broader view on neural networks we refer to
the literature.
At least two more paradigms revolutionized and popularized neural networks in the eighties:
the Hopfield net and the Kohonen net. The physicist John Hopfield gained attention for neural
networks in 1982 when he wrote a paper in the Proceedings of the National Academy of
Sciences indicating how neural networks form an ideal framework to simulate and explain the
statistical mechanics of phase transitions. The Hopfield net can also be viewed as a recurrent
content-addressable memory that can be applied to image recognition and traveling-salesman
type optimization problems. For several specialized applications, this type of network is far
superior to any other neural network approach. The Kohonen network proposed by the Finnish
professor Teuvo Kohonen on the other hand is a one-layer feedforward network that can be
viewed as a self-learning implementation of the K-means clustering algorithm for vector
quantization with powerful self-organizing properties and biological relevance.
Other popular, powerful and clever neural network paradigms are the radial basis function
network, the Boltzmann machine, the counterpropagation network and the ART (adaptive
resonance theory) networks. Radial basis functions can be viewed as a powerful general
regression technique for multi-dimensional function approximation which employ Gaussian
transfer functions with different standard deviations. The Boltzmann machine is a recursive
simulated annealing type of network with arbitrary network configuration. Hecht-Nielsen's
counterpropagation network cleverly combines a feedforward neural network structure with a
Kohonen layer. Grossberg's ART networks use a similar idea but can be elegantly implemented
in hardware and retain a high level of biological plausibility.
There is room as well for more specialized networks such as Oja's rules for principal
component analysis, wavelet networks, cellular automata networks and Fukushima's
neocognitron. Wavelet networks utilize the powerful wavelet transform and generally combine
elements of the Kohonen layer with radial basis function techniques. Cellular automata
networks are a neural network implementation of the cellular automata paradigm, popularized
by Mathematica's inventor, Stephen Wolfram. Fukushima's neocognitron is a multi-layered
network with weight sharing and feature extraction properties that has shown some of the best
performance for handwriting recognition and OCR applications.
A variety of higher-order methods improve the speed of the backpropagation approach. Most
widely applied are conjugate gradient methods and the Levenberg-Marquardt algorithm.
Recursive networks with feedback connections are applied more and more, especially in neuro-control
problems. For control applications, specialized and powerful neural network paradigms
have been developed and it is worthwhile noting that a one-to-one equivalence can be derived
between feedforward neural nets of the backpropagation type and Kalman filters. Fuzzy logic
and neural networks are often combined for control problems.
There is no shortage of neural network tools and most paradigms can be applied to a wide
range of problems. Most neural network implementations rely on the backpropagation
algorithm. However, which neural network paradigm to use is often a secondary question and
whatever the user feels comfortable with is fair game.
1.8 Literature
The domain of artificial neural networks is vast and its literature is expanding at a fast rate.
Knowing that this list is far from complete, let me briefly discuss my favorite neural network
references in this section. Note also that an excellent comprehensive introduction to neural
networks can be found in the frequently-asked-questions (FAQ) files on neural networks at various
WWW sites (e.g., search for "FAQ neural networks" in Alta Vista).
Jose Principe
Probably the standard textbook now for teaching neural networks. Comes with a demo version
of Neurosolutions.
Neural and Adaptive Systems: Fundamentals Through Simulations, Jose Principe, Neil R.
Euliano, and W. Curt Lefebvre, John Wiley (2000).
Hagan, Demuth, and Beale
An excellent book for basic, comprehensive undergraduate teaching, going back to basics with
lots of linear algebra and good MATLAB illustration files:
Neural Network Design, Hagan, Demuth, and Beale, PWS Publishing Company (1996).
Joseph P. Bigus
Bigus wrote an excellent introduction to neural networks for data mining for the non-technical
reader. The book makes a good case for why neural networks are an important data
mining tool and discusses the power and limitations of neural networks for data mining. Some
conceptual case studies are discussed. The book does not really discuss the theory of
neural networks, or how exactly to apply neural networks to a data mining problem, but it
nevertheless gives many practical hints and tips.
Data Mining with Neural Networks: Solving Business Problems – from Application
Development to Decision Support, McGraw-Hill (1997).
Maureen Caudill
Maureen Caudill has published several books that aim at the beginner's market and provide
valuable insight into the workings of neural nets. More than her books, I would recommend a
series of articles that appeared in the popular monthly magazine AI EXPERT. Collections of
Caudill's articles are bundled as separate special editions of AI EXPERT.
Phillip D. Wasserman
Wasserman published two very readable books explaining neural networks. He has a knack for
explaining difficult paradigms efficiently and understandably with a minimum of mathematical
diversions.
Neural Computing, Van Nostrand Reinhold (1990).
Advanced Methods in Neural Computing, Van Nostrand Reinhold (1993).
Jacek M. Zurada
Zurada published one of the first books on neural networks that can be considered a textbook. It is
written for an introductory-level graduate engineering course with an electrical engineering bias and
comes with a wealth of homework problems and software.
Introduction to Artificial Neural Systems, West Publishing Company (1992).
Laurene Fausett
An excellent introductory textbook at the advanced undergraduate level with a wealth of
homework problems.
Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Prentice Hall
(1994).
Simon Haykin
Nicknamed the bible of neural networks by my students, this 700-page work can be considered
both a desktop reference and an advanced graduate-level text on neural networks with
challenging homework problems.
Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company
(1995).
Mohammed H. Hassoun
Excellent graduate level textbook with clear explanations and a collection of very appropriate
homework problems.
Fundamentals of Artificial Neural Networks, MIT Press (1995).
John Hertz, Anders Krogh, and Richard G. Palmer
This book is one of the earlier and better books on neural networks and provides a thorough
understanding of the various neural paradigms and how and why neural networks work. This
book is excellent for its references and has an extremely high information density. Even though
this book is heavy on the Hopfield network and the statistical mechanics interpretation, I
probably consult this book more than any other. It does not lend itself well as a textbook, but
for a while it was one of the few good books available. Highly recommended.
Introduction to the Theory of Neural Computation, Addison Wesley Publishing Company
(1991).
Timothy Masters
Masters wrote a series of three books in short succession, and I would call his collection of
books the user's guide to neural networks. If you program your own networks, the wealth of
information is invaluable. If you use neural networks, the wealth of information is invaluable.
The books come with software and all source code is included. The software is very powerful,
but is geared toward the serious C++ user and lacks a decent user's interface for the non-C++
initiated. A must for the beginner and the advanced user.
Practical Neural Network Recipes in C++, Academic Press, Inc. (1993).
Signal and Image Processing with Neural Networks, John Wiley (1994).
Advanced Algorithms for Neural Networks: A C++ Sourcebook, John Wiley (1995).
Bart Kosko
Advanced electrical engineering graduate level textbook. Excellent for fuzzy logic and neural
network control applications. Not recommended for general introduction or advanced
reference.
Neural Networks and Fuzzy Systems, Prentice Hall (1992).
Guido J. DeBoeck
If you are serious about applying neural networks for stock market speculation this book is a
good starting point. No theory, just applications.
Trading on the Edge: Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, John
Wiley & Sons (1994).
2. NEURAL NETWORK CASE STUDY – POPULATION FORECASTING
2.1 Introduction
The purpose of this case study is to present an overview of the philosophy of artificial neural
networks. This case study will motivate the view of neural networks as a model-free regression
technique. The study presented here describes how to estimate the world's population for the
year 2025 based on traditional regression techniques and based on an artificial neural network.
In the previous section an artificial neural network was explained as a biologically inspired
model that can implement a map. This model is based on an interconnection of elementary
McCulloch-Pitts neurons. It was postulated that:
(a) Most real-world problems can be formulated as a map.
(b) Such a map can be formally represented by an artificial neural network, where the
so-called "weights" are the free parameters to be determined.
(c) Neural networks can "train" their weights to conform with a map using powerful
computational algorithms. This model for the map not only represents the "training
samples" quite reasonably, but generally extrapolates well to "test samples" that were not
used to train the neural network.
The most popular algorithm for training a neural network is the backpropagation algorithm
which has been rediscovered in various fields over and over again and is generally credited to
Dr. Paul Werbos.[1] The backpropagation algorithm was widely popularized in 1986 by
Rumelhart and McClelland[2] explaining why the surge in popularity of artificial neural
networks is a relatively recent phenomenon. For the derivation and implementation details of
the backpropagation algorithm we refer to the literature.
2.2 Population forecasting
The reverend Thomas Malthus identified in 1798 in his seminal work "An essay on the
principle of population"[3] that the world's population grows exponentially while agricultural
output grows linearly, predicting gloom and doom for future generations. Indeed, the rapidly
expanding population on our planet reminds us daily that the resources of our planet have to be
carefully managed if we are to survive gracefully during the next few decades. The data for the
world's population from 1650 through 1996 are summarized in Table I and figure 2.1.[4]
TABLE I. Estimates for the world population (1650 – 1996)

YEAR    POPULATION (in millions)
1650     470
1750     694
1850    1091
1900    1571
1950    2513
1960    3027
1970    3678
1980    4478
1990    5292
1995    5734
1996    5772
In order to build a model for population forecasting we will normalize the data points (Table
II). The year 1650 is re-scaled as 0.0 and 2025 as 1.0 and we interpolate linearly in between for
all the other years. The reason for doing such a normalization is that it is customary (and often
required) for neural networks to scale the data between zero and unity. Since our largest
considered year will be 2025 it will be re-scaled as unity. The reader can easily verify that a
linear re-normalization of a variable x between a maximum value (max) and a minimal value
(min) will lead to a re-normalized value (xnor) according to:
    x_{nor} = \frac{x - min}{max - min}
Because the population increases so rapidly with time we will work with the natural logarithm
of the population (in million) and then re-normalize these data according to the above formula,
where (anticipating the possibility for a large forecast for the world's population in 2025) we
used 12 as the maximum possible value for the re-normalized logarithm of the population in
2025 and 6.153 as the minimum value. In other words: max in the above formula was
arbitrarily assigned a value of 12 to assure that the neural net predictions can accommodate
large values. Table II illustrates these transforms for the world population data.
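A short sketch of these transformations is given below; it applies the re-normalization formula to the years (min 1650, max 2025) and to the natural logarithm of the population (min 6.153, max 12) and reproduces the normalized columns of Table II up to rounding.

#include <math.h>
#include <stdio.h>

/* Linear re-normalization x_nor = (x - min) / (max - min), applied to the
   years and to ln(population), as used to build Table II. */
static double normalize(double x, double min, double max)
{
    return (x - min) / (max - min);
}

int main(void)
{
    double year[11] = {1650, 1750, 1850, 1900, 1950, 1960,
                       1970, 1980, 1990, 1995, 1996};
    double pop[11]  = {470, 694, 1091, 1571, 2513, 3027,
                       3678, 4478, 5292, 5734, 5772};   /* in millions */

    printf("YEARnor  ln(POP)  POPnor\n");
    for (int i = 0; i < 11; i++) {
        double ynor = normalize(year[i], 1650.0, 2025.0);
        double lpop = log(pop[i]);
        double pnor = normalize(lpop, 6.153, 12.0);
        printf("%6.3f   %6.3f   %6.3f\n", ynor, lpop, pnor);
    }
    return 0;
}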
[Figure 2.1: Estimates of the world population between 1650 and 1996.]
TABLE II. Estimates of the world population and corresponding normalizations

YEAR    POP     YEARnor    ln(POP)    POPnor
1650     470    0.000      6.153      0.000
1750     694    0.267      6.542      0.067
1850    1091    0.533      6.995      0.144
1900    1571    0.667      7.359      0.206
1950    2513    0.800      7.829      0.287
1960    3027    0.827      8.015      0.318
1970    3678    0.853      8.210      0.352
1980    4478    0.880      8.407      0.385
1990    5292    0.907      8.574      0.414
1995    5734    0.920      8.654      0.428
1996    5772    0.923      8.661      0.429
2.3 Traditional regression model for population forecasting
First we will apply traditional regression techniques to population forecasting. The classical
Malthusian model assumes that the population grows as an exponential curve. This is equivalent
to stating that the natural logarithm of the population will grow linearly with time. Because the
re-normalization in the previous paragraph re-scaled the population numbers first into their
natural logarithms, we should be able to get by with a linear regression model for the re-scaled
values. In other words, we are trying to determine the unknown coefficients a and b in the
following population model:

    POP_{nor} = a \cdot YEAR_{nor} + b

or, using the traditional symbols Y and X for the dependent and the independent variables,

    Y = aX + b

It is customary in regression analysis to determine the coefficients a and b such that the sum of
the squares of the errors (E) between the modeled values and the actual values is minimized.
In other words, the following function needs to be minimized:

    E = \sum_{i=1}^{N} (y_i - Y)^2 = \sum_{i=1}^{N} (y_i - a x_i - b)^2
There are N data points, x_i and y_i are the actual data points, and the Y values are the estimates
according to the model. The values for the coefficients a and b for which this error is minimal
can be found by setting the partial derivatives of the error with respect to the unknown
coefficients a and b equal to zero and solving this set of two equations for these unknown
coefficients. This leads to the following:
    \frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0

or

    \frac{\partial E}{\partial a} = -2 \sum_i (y_i - a x_i - b) x_i = 0

    \frac{\partial E}{\partial b} = -2 \sum_i (y_i - a x_i - b) = 0

It is left as an exercise to the reader to verify that this yields for a and b

    a = \frac{N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}

    b = \bar{y} - a \bar{x}

where

    \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
Table III illustrates the numerical calculation of a and b, where only the first ten data entries were
used (in other words, we do not consider the 1996 data point).
TABLE III. Intermediate sums for the calculation of the regression coefficients a and b

x_nor    y_nor    x·y      x^2
0.000    0.000    0.000    0.000
0.267    0.067    0.018    0.071
0.533    0.144    0.077    0.284
0.667    0.206    0.137    0.445
0.800    0.287    0.230    0.640
0.827    0.318    0.263    0.684
0.853    0.352    0.300    0.728
0.880    0.385    0.339    0.774
0.907    0.414    0.375    0.823
0.920    0.428    0.394    0.846
-----    -----    -----    -----
6.654    2.601    2.133    5.295    (column sums)
The expressions for a and b can now be evaluated based on the data in Table III:

    a = \frac{10 \times 2.133 - 6.654 \times 2.601}{10 \times 5.295 - (6.654)^2} = 0.464

    b = 0.260 - 0.464 \times 0.665 = -0.0486

Forecasting for the year 2025 according to the regression model yields the following
normalized value for the population:

    y_{2025} = a \times 1.0 + b = 0.464 - 0.0486 = 0.415

When re-scaling back into the natural logarithm of the actual population we obtain:

    \ln POP_{2025} = (max - min) \, y_{2025} + min = (12 - 6.153) \times 0.415 + 6.153 = 8.580
The actual population estimate for the year 2025 is the exponential of this value, leading to an
estimate of 5321 million people. Obviously this value is not what we would expect or accept as
a forecast. What happened is that over the considered time period (1650 - 1996) the
population has actually been exploding faster than exponentially, and the postulated exponential
model is not a very good one. The flaws in this simple regression approach become obvious
when we plot the data and their approximations in the re-normalized frame according to figure
2.2. Our model has an obvious flaw, but the approach we took here is a typical regression
implementation. Only by plotting our data and predictions, often after the fact, does the reason
for the poor or invalid estimate become obvious. More seasoned statisticians would suggest
that we try an approximation of the type:

    y = a + b e^{c x^2 + d x + e}

or use ARMA models and/or other state-of-the-art time series forecasting tools. All these
methods are fair game for forecasting and can yield reliable estimates in the hands of the
experienced analyst. Nevertheless, from this simple case study we can conclude so far that
forecasting the world's population seems to be a challenging forecasting problem indeed.
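A minimal sketch of this regression forecast is given below; it recomputes a and b from the ten normalized data points and then undoes the normalization, so the small differences from the hand calculation above come only from rounding the tabulated sums.

#include <math.h>
#include <stdio.h>

/* Least-squares fit y = a*x + b on the ten normalized points of Table III,
   followed by the re-scaled population forecast for the year 2025. */
int main(void)
{
    double x[10] = {0.000, 0.267, 0.533, 0.667, 0.800,
                    0.827, 0.853, 0.880, 0.907, 0.920};
    double y[10] = {0.000, 0.067, 0.144, 0.206, 0.287,
                    0.318, 0.352, 0.385, 0.414, 0.428};
    int N = 10;
    double sx = 0, sy = 0, sxy = 0, sxx = 0;

    for (int i = 0; i < N; i++) {
        sx += x[i];  sy += y[i];  sxy += x[i] * y[i];  sxx += x[i] * x[i];
    }
    double a = (N * sxy - sx * sy) / (N * sxx - sx * sx);
    double b = sy / N - a * sx / N;

    double y2025   = a * 1.0 + b;                    /* year 2025 -> x = 1.0  */
    double lnpop   = (12.0 - 6.153) * y2025 + 6.153; /* undo the normalization */
    double pop2025 = exp(lnpop);                     /* back to millions       */

    printf("a = %.3f, b = %.4f\n", a, b);            /* about 0.464 and -0.048 */
    printf("forecast for 2025: %.0f million\n", pop2025);  /* roughly 5300;    */
                                                     /* the text's hand
                                                        calculation gives 5321 */
    return 0;
}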
2.4 Simple neural network model for population forecasting
In this section we will develop the neural network approach for building a population
forecasting model. We will define a very simple network with one input element, two neurons
in the hidden layer and one output neuron. We will however include two bias nodes (dummy
nodes with input unity) which is standard practice for most neural network applications. The
network has common sigmoid transfer functions and the bias is just an elegant way to allow
some shifts in the transfer functions as well. The sigmoid transfer function can be viewed as a
crude approximation for the threshold function. Remember that an artificial neuron can be
viewed as a mathematical operator with the following two functions:

a) Make a weighted sum of the input signals, resulting in a signal z.
b) Apply a transfer function f(z) to the signal z, which in the case of a sigmoid corresponds to:

    f(z) = \frac{1}{1 + e^{-z}}

as illustrated in figure 2.3.

[Figure 2.2: Results from the regression analysis on the logarithmically normalized data entries.]

[Figure 2.3: The sigmoid function f(z) = 1/(1 + e^{-z}) as a crude approximation to the threshold
function.]

Note that the introduction of bias nodes (i.e., dummy nodes with input unity, as shown in
figure 2.4) allows horizontal shifts of the sigmoid (and/or threshold function), allowing more
powerful and more flexible approximations.
Figure 2.4 is a representation of our simple neural network. Note that there are three neurons
and two bias nodes. There are three layers: an input layer, one hidden layer and an output layer.
Only the hidden layer and the output layer contain neurons: such a network is referred to as a
1x2x1 net. The two operations of a neuron (weighted sum and transfer function) are
symbolically represented on the figure for each neuron (by the symbols Σ and f). In order for a
neural network to be a robust function approximator at least one hidden layer of neurons and
generally at most two hidden layers of neurons are required. The neural network represented in
figure 2.4 is the most common neural network of the feedforward type and is fully connected.
The unknown weights are indicated on the figure by the symbols w1, w2 ,...,w7.
The weights can be considered as being the neural network equivalent for the unknown
regression coefficients from our regression model. The algorithm for finding these coefficients
that was applied here is the standard backpropagation algorithm, which minimizes the sum of
the squares of the errors, similar to the way it was done for regression analysis. However,
contrary to regression analysis, an iterative numerical minimization procedure rather than an
analytical derivation was applied to estimate the weights in order to minimize the least-squares
error measure. The backpropagation algorithm uses a clever trick to solve this problem when a
hidden layer of neurons is present in the model. By all means think of a neural network as a
more sophisticated regression model. It is different from a regression model in the sense that
we do not specify linear or higher-order models for the regression analysis. We specify only a
neural network frame (number of layers of neurons, and number of neurons in each layer) and
let the neural network algorithm work out what the proper choice for the weights will be. This
approach is often referred to as a model-free approximation method, because we really do not
specify whether we are dealing with a linear, quadratic or exponential model. The neural
network was trained with MetaNeural™, a general-purpose neural network program that uses
the backpropagation algorithm and runs on most computer platforms. The neural network was
trained on the same 10 patterns that were used for the regression analysis and the screen
response is illustrated in figure 2.5.
[Figure 2.4: Neural network approximation for the population forecasting problem -- a 1-2-1
feedforward network with one input, two hidden neurons (neurons 1 and 2), one output neuron
(neuron 3), two bias nodes, and the weights w1 through w7.]

[Figure 2.5: Screen response from MetaNeural™ for training and testing the population forecasting
model.]
Hands-on details for the network training will be left for lecture 3, where we will gain hands-on
exposure to artificial neural network programs. The files that were used for the
MetaNeural™ program are reproduced in the appendix. The program gave 0.48118 as the
prediction for the normalized population forecast in 2025. After re-scaling, this would
correspond to 7836 million people. Probably a rather underestimated forecast, but definitely
better than the regression model. The weights corresponding to this forecast model are
reproduced in Table IV. The problem with this neural network model is that a 1-2-1 net is a rather
simplistic network and that, the way we represented the patterns, too much emphasis is placed
on the earlier years (1650 - 1850), which are really not all that relevant. By over-sampling (i.e.,
presenting the data from 1950 onward, let's say, three times as often as the other data) and
choosing a 1-3-4-1 network, the way a more seasoned practitioner might approach this
problem, we actually obtained a forecast of 11.02 billion people for the world's population in
2025. This answer seems to be a lot more reasonable than the one obtained from the 1-2-1
network. Changing to the 1-3-4-1 model is just a matter of changing a few numbers in the input
file for MetaNeural™ and can be done in a matter of seconds. The results for the predictions
with the 1-3-4-1 network with over-sampling are shown in figure 2.6.
[Figure 2.6: World population prediction with a 1-3-4-1 artificial neural network with over-sampling.]
TABLE IV. Weight values corresponding to the neural network in figure 2.4

WEIGHT    VALUE
w1        -2.6378
w2         2.4415
w3         1.6161
w4        -1.3550
w5        -3.6308
w6         3.0321
w7        -1.3795
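As a check, the sketch below evaluates this 1-2-1 network in C with the Table IV weights. The wiring is an assumption read off figure 2.4 (w1 and w2 from the input to hidden neurons 1 and 2; w3 and w4 from the first bias node to hidden neurons 1 and 2; w5, w6, and w7 from hidden neuron 1, hidden neuron 2, and the second bias node to the output neuron); under this reading the normalized forecast for 2025 comes out close to the 0.481 quoted above.

#include <math.h>
#include <stdio.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Evaluate the 1-2-1 network of figure 2.4 with the Table IV weights.
   Assumed wiring: w1, w2 input -> hidden 1, 2; w3, w4 bias -> hidden 1, 2;
   w5, w6 hidden 1, 2 -> output; w7 bias -> output. */
static double net_121(double x, const double w[7])
{
    double h1 = sigmoid(w[0] * x + w[2]);           /* hidden neuron 1 */
    double h2 = sigmoid(w[1] * x + w[3]);           /* hidden neuron 2 */
    return sigmoid(w[4] * h1 + w[5] * h2 + w[6]);   /* output neuron 3 */
}

int main(void)
{
    double w[7] = {-2.6378, 2.4415, 1.6161, -1.3550, -3.6308, 3.0321, -1.3795};
    double ynor = net_121(1.0, w);                  /* year 2025 -> x = 1.0 */
    double pop  = exp((12.0 - 6.153) * ynor + 6.153);
    printf("normalized forecast = %.4f\n", ynor);   /* close to the 0.481 in the text */
    printf("population forecast = %.0f million\n", pop);
    return 0;
}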
2.6 Conclusions
A neural network can be viewed as a least-squares, model-free, regression-like approximator
that can implement almost any map. Building a forecasting model for the world's
population with a simple neural network proceeds similarly to regression analysis and is relatively
straightforward. The fact that neural networks are model-free approximators is often an
advantage over traditional statistical forecasting methods and standard time series analysis
techniques. Where neural networks differ from standard regression techniques is in the way the
least-squares error minimization procedure is implemented: while regression techniques
rely on closed-form, one-step analytical formulas, the neural network approach employs an
iterative numerical backpropagation algorithm.
2.7 Exercises for the brave
1. Derive the expressions for the parameters a, b, c, d, and e for the following regression
   model:

       y = a + b e^{c x^2 + d x + e}

   and forecast the world's population for the year 2025 based on this model.

2. Write a MATLAB program that implements the evaluation of the network shown in
   figure 2.4 and verify the population forecast for the year 2025 based on this 1-2-1 neural
   network model and the weights shown in TABLE IV.

3. Expand the MATLAB program of exercise 2 into a program that can train the weights of a
   neural network based on a random search model. I.e., start with an initial random
   collection of weights (let's say all chosen from a uniform random distribution
   between -1.0 and +1.0). Then iteratively adjust the weights by making small random
   perturbations (one weight at a time), evaluate the new error after showing all the training
   samples, and retain the perturbed weight if the new error is smaller. Repeat this process
   until the network has a reasonably small error.
2.8 References
[1] P. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral
    sciences," Ph.D. thesis, Harvard University (1974).
[2] D. E. Rumelhart, G. Hinton, and R. J. Williams, "Learning internal representations by
    error propagation," in Parallel Distributed Processing: Explorations in the Microstructure
    of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, Eds., Chapter 8, pp. 318-362,
    MIT Press, Cambridge, MA (1986).
[3] T. Malthus, "An Essay on the Principle of Population," 1798. Republished in the Pelican
    Classics series, Penguin Books, England (1976).
[4] Otto Johnson, Ed., "1997 Information Please Almanac," Houghton Mifflin Company,
    Boston & New York (1996).
APPENDIX: INPUT FILES FOR 1-2-1 NETWORK FOR MetaNeural™
ANNOTATED MetaNeural™ INPUT FILE: POP

3         Three-layered network
1         One input node
2         2 neurons in the hidden layer
1         One output neuron
1         Show all samples and then update weights
0.1       Learning parameter, first layer of weights
0.1       Learning parameter, second layer of weights
0.5       Momentum, first layer of weights
0.5       Momentum, second layer of weights
1000      Do a thousand iterations (for all patterns)
500       Show intermediate results every 500 iterations on the screen
1         Standard [0, 1] sigmoid transfer function
1         Temperature one for sigmoid (i.e., standard sigmoid)
pop.pat   Name of training pattern file
0         Ignored
100       Ignored
0.01      Stop training when error is less than 0.01
1
0.6       Initial weights are drawn from a uniform random distribution between [-0.6, 0.6]
POP.PAT: The pattern file

10                        10 training patterns
0.000   0.000   0         first training pattern
0.267   0.067   1         second training pattern
0.533   0.144   2
0.667   0.206   3
0.800   0.287   4
0.827   0.318   5
0.853   0.352   6
0.880   0.385   7
0.907   0.414   8
0.920   0.428   9