LECTURE #9: FUZZY LOGIC & NEURAL NETS

COURSE ANNOUNCEMENT -- Spring 2004
DSES-4810-01 Intro to COMPUTATIONAL INTELLIGENCE
& SOFT COMPUTING
With ever-increasing computer power readily available, novel engineering methods based
on “soft computing” are emerging at a rapid rate. This course provides the students with a
working knowledge of computational intelligence (CI) covering the basics of fuzzy logic,
neural networks, genetic algorithms and evolutionary computing, simulated annealing,
wavelet analysis, artificial life and chaos. Applications in control, forecasting,
optimization, data mining, fractal image compression and time series analysis are
illustrated with engineering case studies.
This course provides a hands-on introduction to the fascinating discipline of
computational intelligence (i.e. the synergistic interplay of fuzzy logic, genetic
algorithms, neural networks, and other soft computing techniques). The students will
develop the skills to solve engineering problems with computational intelligence
paradigms. The course requires a CI-related project in the student’s area of interest.
Instructor:      Prof. Mark J. Embrechts (x 4009, embrem@rpi.edu)
Office Hours:    Thursday 10-11 am (CII 5217), or by appointment
Class Time:      Monday/Thursday: 8:30 – 9:50 am (Amos Eaton Hall 216)
Text (optional): J. S. Jang, C. T. Sun, E. Mizutani, “Neuro-Fuzzy and Soft Computing,”
                 Prentice Hall, 1996 (1998). ISBN 0-13-261066-3
Course is open to graduate students and seniors of all disciplines.
GRADING:
Tests                  10%
5 Homework Projects    35%
Course Project         40%
Presentation           15%
ATTENDANCE POLICY
Course attendance is mandatory; a make-up project is required for each missed class. A
missed class without a make-up results in the loss of half a grade point.
ACADEMIC HONESTY
Homework Projects are individual exercises. You may discuss assignments with your
peers, but you may not copy. The course project may be done in groups of two.
COMPUTATIONAL INTELLIGENCE - COURSE OUTLINE
1. INTRODUCTION TO ARTIFICIAL NEURAL NETWORKS (ANNs)
   1.1 History
   1.2 Philosophy of neural nets
   1.3 Overview of neural nets
2. INTRODUCTION TO FUZZY LOGIC
   2.1 History
   2.2 Philosophy of Fuzzy Logic
   2.3 Terminology and definitions
3. INTRODUCTION TO EVOLUTIONARY COMPUTING
   3.1 Introduction to Genetic Algorithms
   3.2 Evolutionary Computing / Evolutionary Programming / Genetic Programming
   3.3 Terminology and definitions
4. NEURAL NETWORK APPLICATIONS / DATA MINING WITH ANNs
   4.1 Case study: time series forecasting (population forecasting)
   4.2 Case study: automated discovery of novel pharmaceuticals (Part I)
   4.3 Data mining with neural networks
5. FUZZY LOGIC APPLICATIONS / FUZZY EXPERT SYSTEMS
   5.1 Fuzzy logic case study: tipping
   5.2 Fuzzy expert systems
6. SIMULATED ANNEALING / GENETIC ALGORITHM APPLICATIONS
   6.1 Simulated annealing
   6.2 Supervised clustering with GAs
   6.3 Case study: automated discovery of novel pharmaceuticals (Part II)
7. DATA VISUALIZATION WITH SELF-ORGANIZING MAPS
   7.1 The Kohonen feature map
   7.2 Case study: visual explorations for novel pharmaceuticals (Part III)
8. ARTIFICIAL LIFE
   8.1 Cellular automata
   8.2 Self-organized criticality
   8.3 Case study: highway traffic jam simulation
9. FRACTALS and CHAOS
   9.1 Fractal Dimension
   9.2 Introduction to Chaos
   9.3 Iterated Function Systems
10. WAVELETS
Monday January 12, 2004
DSES-4810-01 Intro to COMPUTATIONAL INTELLIGENCE
& SOFT COMPUTING
Instructor:      Prof. Mark J. Embrechts (x 4009 or 371-4562, embrem@rpi.edu)
Office Hours:    Tuesday 10-12 (CII 5217), or by appointment
Class Time:      Monday/Thursday: 8:30-9:50 am (Amos Eaton Hall 216)
TEXT (optional): J. S. Jang, C. T. Sun, E. Mizutani, “Neuro-Fuzzy and Soft Computing,”
                 Prentice Hall, 1996 (1998). ISBN 0-13-261066-3
LECTURES #1-3: INTRO to Neural Networks
The purpose of the first two lectures is to present an overview of the philosophy of
artificial neural networks. Today's lecture will provide a brief history of neural network
development and motivate the idea of training a neural network. We will introduce a neural
network as a framework to generate a map from an input space to an output space. Three
basic premises will be discussed to explain artificial neural networks:
(1) A problem can be formulated and represented as a map from an m-dimensional space R^m
to an n-dimensional space R^n, or R^m -> R^n.
(2) Such a map can be realized by setting up an equivalent artificial framework of basic
building blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons
forms an artificial neural network or ANN.
(3) The neural net can be trained to conform to the map based on samples of the map and will
reasonably generalize to new cases it has not encountered before.
Handouts:
1. Mark J. Embrechts, "Problem Solving with Artificial Neural Networks."
2. Course outline and policies.
Tasks:
Start thinking about project topic, meet with me during office hours or by appointment.
PROJECT DEADLINES:
January 22    Homework Project #0 (web page summary)
January 29    Project proposal (2 typed pages: title, references, motivation,
              deliverable, evaluation criteria)
WHAT IS EXPECTED FROM THE CLASS PROJECT?

- Prepare a monologue about a course-related subject (15 to 20 written pages and
  supporting material in appendices).
- Prepare a 20-minute lecture about your project and give the presentation. Hand in a hard
  copy of your slides.
- A project starts in the library. Prepare to spend at least a full day in the library over
  the course of the project. Meticulously write down all the relevant references, and
  attach a copy of the most important references to your report.
- The idea for the lecture and the monologue is that you spend the maximum amount of
  effort to allow a third party to present that same material, based on your preparation,
  with a minimal amount of effort.
- The project should be a finished and self-consistent document where you
  meticulously digest the prerequisite material, give a brief introduction to your work,
  and motivate the relevance of the material. Hands-on program development and
  personal expansions of and reflections on the literature are strongly encouraged. If
  your project involves programming, hand in a working version of the program (with
  source code) and document the program with a user’s manual and sample problems.
- It is expected that you spend on average 6 hours/week on the class project.
PROJECT PROPOSAL


- A project proposal should be a fluent text of at least 2 full pages, where you are trying
  to sell the idea for a research project in a professional way. Therefore the proposal
  should contain a clear background and motivation.
- The proposal should define a clear set of goals, deliverables, and a timetable.
- Identify how you would consider your project successful and address the evaluation
  criteria.
- Make sure you select a title (acronyms and logos are suggested as well), and add a list
  of references to your proposal.
PROBLEM SOLVING WITH ARTIFICIAL NEURAL NETWORKS
Mark J. Embrechts
1. INTRODUCTION TO NEURAL NETWORKS
1.1 Artificial neural networks in a nutshell
This introduction to artificial neural networks explains as briefly as possible what is commonly
understood by an artificial neural network and how such networks can be applied to solve data mining
problems. Only the most popular type of neural network will be discussed here: the
feedforward neural network (usually trained with the popular backpropagation algorithm).
Neural nets emerged from psychology as a learning paradigm that mimics how the brain
learns. There are many different types of neural networks, training algorithms, and different
ways to interpret how and why a neural network operates. A neural network problem is
viewed in this write-up as a model-free implementation of a map, and it is silently assumed
that most data mining problems can be framed as a map. This is a very limited view, which
does not fully cover the power of artificial neural networks. However, this view leads to an
intuitive basic understanding of the neural network approach for problem solving with a
minimum of otherwise necessary introductory material.
Three basic premises will be discussed in order to explain artificial neural networks:
(1) A problem can be formulated and represented as a map from an m-dimensional space R^m
to an n-dimensional space R^n, or R^m -> R^n.
(2) Such a map can be implemented by constructing an artificial framework of basic building
blocks of McCulloch-Pitts artificial neurons. This collection of artificial neurons forms an
artificial neural network (ANN).
(3) The neural net can be trained to conform to the map based on samples of the map and will
reasonably generalize to new cases it has not encountered before.
The next sections expand on these premises and explain the notions of a map, the McCulloch-Pitts
neuron, the artificial neural network (ANN), training, and generalization.
1.2 Framing an equivalent map for a problem
Let us start by considering a token problem and reformulate this problem as a map. The token
problem involves deciding whether a seven-bit binary number is odd or even. To restate this
problem as a map two spaces are considered: a seven-dimensional input space containing all
the seven-bit binary numbers, and a one-dimensional output space with just two elements (or
classes): odd or even, which will be symbolically represented by a one or a zero. Such a map
can be interpreted as a transformation from R^m to R^n, or R^m -> R^n (with m = 7 and n = 1). A
map for the seven-bit parity problem is illustrated in figure 1.1.
[Figure 1.1: The seven-bit parity problem posed as a mapping problem -- the seven-bit binary
strings 0000000, 0000001, ..., 1111111 in R^7 are mapped to the outputs 1 (odd) or 0 (even) in R^1.]
The seven-bit parity problem was just framed as a formal mapping problem. The specific
details of the map are yet to be determined: all we have so far is the hope that a precise
function can be formulated that transforms the seven-bit binary input space into a one-dimensional,
one-bit output space and thereby solves the seven-bit parity problem. We hope that eventually we can
specify a green box that could formally be implemented as a subroutine in C code, where the
subroutine would have a header of the type:
void Parity_Mapping(VECTOR sample, int *decision) {
    /* ... body still to be determined ... */
    *decision = ... ;
} // end of subroutine
In other words: given a seven-bit binary vector as an input to this subroutine (e.g. {1, 0,
1, 1, 0, 0, 1}), we expect the subroutine to return an integer nicknamed "decision.” The
value for decision will turn out to be unity or zero, depending on whether the seven-bit
input vector is odd or even.
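For reference, the target map itself is easy to state directly. The sketch below is a hypothetical, non-neural implementation of that map; it reads "odd or even" as the parity of the bit pattern (an odd or even number of 1-bits), which is the usual meaning of the parity problem, and it replaces the unspecified VECTOR type with a plain array of 0/1 integers. It is only meant to pin down what the green box must eventually reproduce.

#include <stdio.h>

/* Hypothetical direct implementation of the parity map (not a neural net):
   the seven-bit input is assumed to be stored as an array of 0/1 integers. */
void Parity_Mapping(const int sample[7], int *decision)
{
    int i, ones = 0;
    for (i = 0; i < 7; i++)
        ones += sample[i];        /* count the 1-bits                        */
    *decision = ones % 2;         /* 1 = odd number of 1-bits, 0 = even;     */
                                  /* for the arithmetic odd/even reading the */
                                  /* decision would simply be the last bit   */
}

int main(void)
{
    int bits[7] = {1, 0, 1, 1, 0, 0, 1};   /* the example vector from the text */
    int decision;
    Parity_Mapping(bits, &decision);
    printf("decision = %d\n", decision);   /* four 1-bits, so this prints 0    */
    return 0;
}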
We call this methodology a green-box approach to problem solving to imply that we only hope
that such a function can eventually be realized, but that so far, we are clueless about how
exactly we are going to fill the body of that green box. Of course, you probably guessed by
now that somehow artificial neural networks will be applied to do this job for us. Before
elaborating on neural networks we still have to discuss a subtle but important point related to
our way of solving the seven-bit parity problem. Implicitly it is assumed for this problem that
all seven-bit binary numbers are available and that the parity of each seven-bit binary number
is known.
Let us complicate the seven-bit parity problem by specifying that we know for the time being
the correct parity for only 120 of the 128 possible seven-bit binary numbers. We want to
specify a map for these 120 seven-bit binary numbers such that the map will correctly identify
the eight remaining binary numbers. This is a much more difficult problem than mapping the
seven-bit parity problem based on all the possible samples, and whether an answer exists and
can be found for this type of problem is often not clear at all from the outset. The methodology
for learning what has to go in the green box for this problem will divide the available samples
for this map into a training set -- a subset of the known samples -- and a test set. The test set will
be used only for evaluating the goodness of the green box implementation of the map.
Let us introduce a second example to illustrate how a regression problem can be reformulated
as a mapping problem. Consider a collection of images of circles: all 64x64 black-and-white
(B&W) pixel images. The problem here is to infer the radii of these circles based on the pixel
values. Figure 1.2 illustrates how to formulate this problem as a formal map. A 64x64 image
could be scanned row by row and be represented by a string of zeros and ones depending on
whether each pixel is white or black. This input space has 64x64 or 4096 binary elements and
can therefore be considered as a space with 4096 dimensions. The output space is a one-dimensional
number, namely the radius of the circle in the appropriate units.
For this problem we generally would not expect to have access to all possible 64x64 B&W
images of circles to determine the mapping function. We therefore would only consider a
representative sample of circle images, somehow use a neural network to fill out the green box
to specify the map, and hope that it will give the correct circle radius within a certain tolerance
for future out-of-sample 64x64 B&W images of circles. It actually turns out that the formal
mapping procedure as described so far would yield lousy estimates for the radius. Some
ingenious form of preprocessing on the image data (e.g., considering selected frequencies of a
2-D Fourier transform) will be necessary to reduce the dimensionality of the input space.
Most problems can be formulated in multiple ways as a map of the type: Rm -> Rn. However,
not all problems can be elegantly transformed into a map, and some formal mapping
representations might be better than others for a particular problem. Often ingenuity,
experimentation, and common sense are called for to frame an appropriate map that can
adequately be represented by artificial neural networks.
[Figure 1.2: Determining the radius of a 64x64 B&W image of a circle, posed as a formal mapping
problem -- a map from R^4096 to R^1 that sends each image to a radius estimate.]
1.3 The McCulloch-Pitts neuron and artificial neural networks
The first neural network premise states that most problems can be formulated as an equivalent
formal mapping problem. The second premise states that such a map can be represented by an
artificial neural network (or ANN): i.e., a framework of basic building blocks, the so-called
McCulloch-Pitts artificial neurons.
The McCulloch-Pitts neuron was first proposed in 1943 by Warren McCulloch and Walter
Pitts, a neurophysiologist and a mathematician, in a paper illustrating how simple artificial
representations of neurons could in principle represent any arithmetic function. How to actually
implement such a function was first addressed by the psychologist Donald Hebb in 1949 in his
book "The Organization of Behavior." The McCulloch-Pitts neuron can easily be understood as
a simple mathematical operator. This operator has several inputs and one output and performs
two elementary operations on the inputs: first it makes a weighted sum of all the inputs, and
then it applies a functional transform to that sum, which will be sent to the output. Assume that
there are N inputs {x_1, x_2, ..., x_N}, or an input vector x, and consider the output y. The output
y can be expressed as a function of its inputs according to the following equations:
    sum = \sum_{i=1}^{N} x_i                                    (1)

and

    y = f(sum)                                                  (2)

So far we have not yet specified the transfer function f(.). In its simplest form it is just a
threshold function giving an output of unity when the sum exceeds a certain value, and zero
when the sum is below this value. It is common practice in neural networks to use the
sigmoid function as the transfer function, which can be expressed as:

    f(sum) = \frac{1}{1 + e^{-sum}}                             (3)
Figure 1.3 illustrates the basic operations of a McCulloch-Pitts neuron. It is common practice
to apply an appropriate scaling to the inputs (usually such that either 0 < x_i < 1 or -1 < x_i < 1).
[Figure 1.3: The McCulloch-Pitts artificial neuron as a mathematical operator -- the inputs
x_1, ..., x_N are multiplied by the weights w_1, ..., w_N, summed, and passed through the transfer
function f(.) to produce the output y.]
One more enhancement has to be clarified for the basics of the McCulloch-Pitts neuron: before
summing the inputs, they actually have to be modified by multiplying them with a weight
vector, {w_1, w_2, ..., w_N}, so that instead of using equation (1) we make a weighted sum of the
inputs according to equation (4):

    sum = \sum_{i=1}^{N} w_i x_i                                (4)
A collection of these basic operators can be stacked in a structure -- an artificial neural network
-- that can have any number of inputs and any number of outputs. The neural network shown in
figure 1.4 represents a map with two inputs and one output. There are two fan-out input elements
and a total of six neurons. There are three layers of neurons: the first layer is called the first
hidden layer, the second layer is the second hidden layer, and the output layer consists of one
neuron. There are 14 weights. The layers are fully connected. In this example there are no
backward connections, and this type of neural net is therefore called a feedforward network.
The type of neural net of figure 1.4 is the most commonly encountered type of artificial neural
network, the feedforward net:
(1) There are no connections skipping layers.
(2) The layers are fully connected.
(3) There is usually at least one hidden layer.
It is not hard to envision now that any map can be translated into an artificial neural network
structure -- at least formally. We have not yet addressed how to determine the right weight set
or how many neurons to place in the hidden layers; this is the subject of the next section.
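To make the structure concrete, here is a minimal sketch, assuming figure 1.4 shows a fully connected 2-3-2-1 network with no bias nodes (six neurons and 6 + 6 + 2 = 14 weights, as described above), of how such a feedforward net evaluates an input vector. The all-zero weights in main() are placeholders, so every neuron simply outputs 0.5.

#include <math.h>
#include <stdio.h>

#define MAX_UNITS 8

static double sigmoid(double s) { return 1.0 / (1.0 + exp(-s)); }

/* Forward pass through a fully connected feedforward net with layer sizes
   {2, 3, 2, 1}. weights[l][j][i] connects unit i of layer l to unit j of
   layer l+1; each unit is a McCulloch-Pitts neuron with a sigmoid. */
double forward(const int sizes[], int nlayers,
               double weights[][MAX_UNITS][MAX_UNITS], const double input[])
{
    double act[MAX_UNITS], next[MAX_UNITS];
    int l, i, j;

    for (i = 0; i < sizes[0]; i++)
        act[i] = input[i];                       /* fan-out input layer  */

    for (l = 0; l < nlayers - 1; l++) {
        for (j = 0; j < sizes[l + 1]; j++) {
            double sum = 0.0;
            for (i = 0; i < sizes[l]; i++)
                sum += weights[l][j][i] * act[i];
            next[j] = sigmoid(sum);              /* weighted sum + sigmoid */
        }
        for (j = 0; j < sizes[l + 1]; j++)
            act[j] = next[j];
    }
    return act[0];                               /* single output neuron   */
}

int main(void)
{
    int sizes[4] = {2, 3, 2, 1};
    static double w[3][MAX_UNITS][MAX_UNITS];    /* zero-initialized weights */
    double x[2] = {0.3, 0.8};
    printf("y = %f\n", forward(sizes, 4, w, x)); /* 0.5 with all-zero weights */
    return 0;
}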
[Figure 1.4: A typical artificial feedforward neural network -- two fan-out input elements, a first
hidden layer, a second hidden layer, and a single output neuron, fully connected by the weights w_ij.]
1.4 Artificial neural networks
An artificial neural network is a collection of connected McCulloch-Pitts neurons. Neural
networks can formally represent almost any functional map provided that:
(1) a proper number of basic neurons are appropriately connected, and
(2) appropriate weights are selected.
Specifying an artificial neural network to conform with a particular map means determining the
neural network structure and its weights. How to connect the neurons and how to select the
weights is the subject of the discipline of artificial neural networks. Even when a neural
network can represent in principle any function or map, it is not necessarily clear that one can
ever specify such a neural network with the existing algorithms. This section will briefly
address how to set up a neural network, and give at least a conceptual idea about determining
an appropriate weight set.
The feedforward neural network of figure 1.4 is the most commonly encountered type of
artificial neural net. For most functional maps at least one hidden layer of neurons, and
sometimes two hidden layers of neurons are required. The structural layout of a feedforward
neural network can now be determined. For a feedforward layered neural network two points
have to be addressed to determine the layout:
(1) How many hidden layers to use?
(2) How many neurons to choose in each hidden layer?
Different experts in the field often have different answers to these questions. A general
guideline that works surprisingly well is to try one hidden layer first, and to choose as few
neurons in the hidden layer(s) as one can get away with.
The most intriguing question still remains, and it addresses the third premise of neural networks:
is it actually possible to come up with algorithms that allow us to specify a good weight set?
How do we determine the weights of the network from samples of the map? Can we expect a
reasonable answer from such a network for new cases that were not encountered before?
It is straightforward to devise algorithms that will determine a weight set for neural networks
that contain just an input layer and an output layer -- and no hidden layer(s) of neurons.
However, such networks do not generalize well at all. Neural networks with good
generalization capabilities require at least one hidden layer of neurons. For many applications
such neural nets generalize surprisingly well. The need for hidden layers in artificial neural
networks was already realized in the late fifties. However, in their 1969 book "Perceptrons" the
MIT professors Marvin Minsky and Seymour Papert argued that it might not be possible at all to
come up with any algorithm to determine a suitable weight set if hidden layers are present in the
network structure. Such an algorithm only became widely known in 1986: the backpropagation
algorithm, popularized by Rumelhart and McClelland in a very clearly written chapter in their book
"Parallel Distributed Processing." The backpropagation algorithm was actually invented and
reinvented several times, and its original formulation is generally credited to Paul Werbos. He
described the backpropagation algorithm in his Harvard Ph.D. dissertation in 1974, but this
algorithm was not widely noted at that time. The majority of today’s neural network
applications relies in one form or another on the backpropagation algorithm.
1.5 Training neural networks
The result of a neural network is its weight set. Determining an appropriate weight set is called
training or learning, based on the metaphor that learning takes place in the human brain which
can be viewed as a collection of connected biological neurons. The learning rule proposed by
Hebb was the first mechanism for determining the weights of a neural network. The Canadian
psychologist Donald Hebb postulated this learning strategy in the late forties as one of the basic
mechanisms by which humans and animals learn. Later on it turned out that he had hit the nail
on the head with his formulation. Hebb's rule is surprisingly simple, and while in principle it can
be used to train multi-layered neural networks, we will not elaborate further on this rule. Let us
just point out here that there are now many different neural network paradigms and many
algorithms for determining the weights of a neural network. Most of these algorithms work
iteratively: i.e., one starts out with a randomly selected weight set, applies one or more samples
of the mapping, and gradually upgrades the weights. This iterative search for a proper weight
set is called the learning or training phase.
Before explaining the workings of the backpropagation algorithm we will present a simple
alternative, the random search. The most naive answer to determine a weight set -- which
rather surprisingly in hindsight did not emerge before the backpropagation principle was
formulated -- is just to try randomly generated weight sets, and to keep trying with new
randomly generated weight sets until one hits it just right. The random search would, at least in
principle, be a way to determine a suitable weight set were it not for its excessive demands on
computing time. While this method sounds too naive to be given even serious thought, smart
random search paradigms (such as genetic algorithms and simulated annealing) are nowadays
actually legitimate and widely used training mechanisms for neural networks. However, random
search methods come with many bells and whistles to tune, and are extremely demanding on
computing time. Only the wide availability of ever faster computers allowed this method to be
practical at all.
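A minimal sketch of such a random search is shown below: one weight at a time is given a small random perturbation, and the change is kept only when the error decreases. The error function here is a toy placeholder (it just measures the distance to a fixed target weight vector); in a real application it would run all training samples through the network and return the summed squared output error.

#include <stdio.h>
#include <stdlib.h>

#define NWEIGHTS 7

/* Toy stand-in for the network error; a real version would evaluate the
   net on all training samples for the current weight set. */
static double network_error(const double w[])
{
    static const double target[NWEIGHTS] = {0.5, -1.2, 0.8, 0.1, -0.4, 1.5, -0.9};
    double err = 0.0;
    for (int i = 0; i < NWEIGHTS; i++)
        err += (w[i] - target[i]) * (w[i] - target[i]);
    return err;
}

static double uniform(double lo, double hi)
{
    return lo + (hi - lo) * rand() / (double)RAND_MAX;
}

int main(void)
{
    double w[NWEIGHTS], best;
    int i, iter;

    for (i = 0; i < NWEIGHTS; i++)            /* random initial weight set    */
        w[i] = uniform(-1.0, 1.0);
    best = network_error(w);

    for (iter = 0; iter < 100000; iter++) {
        int k = rand() % NWEIGHTS;
        double old = w[k];
        w[k] += uniform(-0.1, 0.1);           /* perturb one weight at a time */
        double err = network_error(w);
        if (err < best)
            best = err;                       /* keep the improvement         */
        else
            w[k] = old;                       /* otherwise undo the change    */
    }
    printf("final error = %g\n", best);
    return 0;
}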
The process for determining the weights of a neural net proceeds in two separate stages. In the
first stage, the training phase, one applies an algorithm to determine a -- hopefully good --
weight set with about 2/3 of the available mapping samples. The generalization performance of
the just-trained neural net is subsequently evaluated in the testing phase based on the remaining
samples of the map.
1. 6 The backpropagation algorithm
An error measure can be defined to quantify the performance of a neural net. This error
function depends on the weight values and the mapping samples. Determining the weights of a
neural network can therefore be interpreted as an optimization problem, where the performance
error of the network structure is minimized for a representative sample of the mappings. All
paradigms applicable to general optimization problems apply therefore to neural nets as well.
The backpropagation algorithm is elegant and simple, and is used in roughly eighty percent of
neural network applications. It consistently gives at least reasonably acceptable answers for the
weight set. The backpropagation algorithm cannot be applied to just any optimization
problem, but it is specifically tailored to multi-layer feedforward neural networks.
There are many ways to define the performance error of a neural network. The most commonly
applied error measure is the mean square error. This error, E, is determined by showing every
sample to the net and tallying the squared differences between the actual outputs, o_i, and the
desired target outputs, t_i, according to equation (5):

    E = \sum_{i=1}^{n_{outputs}} (o_i - t_i)^2                  (5)
Training a neural network starts out with a randomly selected weight set. A batch of samples is
shown to the network, and an improved weight set is obtained by iterating according to equations
(6) and (7). The new value of a particular weight (labeled ij) at iteration (n+1) is an
improvement over the value from iteration (n), obtained by moving a small amount along the
gradient of the error surface in the direction of the minimum:

    w_{ij}^{(n+1)} = w_{ij}^{(n)} + \Delta w_{ij}               (6)

    \Delta w_{ij} = -\eta \frac{dE}{dw_{ij}}                    (7)
Equations (6) and (7) represent an iterative steepest descent algorithm, which will always
converge to a local minimum of the error function provided that the learning parameter, \eta, is
small. The ingenuity of the backpropagation algorithm was to come up with a simple analytical
expression for the gradient of the error in multi-layered nets by a clever application of the
chain rule. While it was for a while commonly believed that the backpropagation algorithm
was the only practical algorithm to implement equation (7), it is worth pointing out that the
derivative of E with respect to the weights can easily be estimated numerically by tweaking the
weights a little bit. This approach is perfectly valid, but is significantly slower than the elegant
backpropagation formulation. The details for deriving the backpropagation algorithm can be
found in the literature.
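The remark about tweaking the weights can be made concrete with a short sketch: the gradient dE/dw_k in equation (7) is estimated by central finite differences and then used for the steepest descent update of equation (6). The error function below is a toy placeholder standing in for the mean square error of equation (5), so the numbers are illustrative only.

#include <stdio.h>

#define NWEIGHTS 4

/* Toy error function standing in for equation (5); a real implementation
   would sum (o_i - t_i)^2 over the network outputs for all samples. */
static double E(const double w[])
{
    double e = 0.0;
    for (int i = 0; i < NWEIGHTS; i++)
        e += (w[i] - 1.0) * (w[i] - 1.0);   /* minimum at w_i = 1 */
    return e;
}

int main(void)
{
    double w[NWEIGHTS] = {0.3, -0.7, 2.0, 0.0};
    const double eta = 0.1;                  /* learning parameter (eq. 7) */
    const double h = 1.0e-5;                 /* size of the weight tweak   */

    for (int iter = 0; iter < 200; iter++) {
        double grad[NWEIGHTS];
        for (int k = 0; k < NWEIGHTS; k++) {
            double wp[NWEIGHTS], wm[NWEIGHTS];
            for (int i = 0; i < NWEIGHTS; i++) { wp[i] = w[i]; wm[i] = w[i]; }
            wp[k] += h;  wm[k] -= h;
            grad[k] = (E(wp) - E(wm)) / (2.0 * h);  /* dE/dw_k, numerically */
        }
        for (int k = 0; k < NWEIGHTS; k++)
            w[k] += -eta * grad[k];          /* equation (6): w <- w + dw   */
    }
    printf("E after training: %g\n", E(w));  /* should be close to zero     */
    return 0;
}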
1.7 More neural network paradigms
So far, we have briefly described how feedforward neural nets can solve problems by recasting the
problem as a formal map. The workings of the backpropagation algorithm to train a neural
network were explained. While the views and algorithms presented here conform with
the mainstream approach to neural network problem solving, there are literally hundreds of
different neural network types and training algorithms. Recasting the problem as a formal map
is just one part and one view of neural nets. For a broader view on neural networks we refer to
the literature.
At least two more paradigms revolutionized and popularized neural networks in the eighties:
the Hopfield net and the Kohonen net. The physicist John Hopfield gained attention for neural
networks in 1982 when he wrote a paper in the Proceedings of the National Academy of
Sciences indicating how neural networks form an ideal framework to simulate and explain the
statistical mechanics of phase transitions. The Hopfield net can also be viewed as a recurrent
content-addressable memory that can be applied to image recognition and traveling-salesman
type optimization problems. For several specialized applications, this type of network is far
superior to any other neural network approach. The Kohonen network proposed by the Finnish
professor Teuvo Kohonen on the other hand is a one-layer feedforward network that can be
viewed as a self-learning implementation of the K-means clustering algorithm for vector
quantization with powerful self-organizing properties and biological relevance.
Other popular, powerful and clever neural network paradigms are the radial basis function
network, the Boltzmann machine, the counterpropagation network and the ART (adaptive
resonance theory) networks. Radial basis functions can be viewed as a powerful general
regression technique for multi-dimensional function approximation which employ Gaussian
transfer functions with different standard deviations. The Boltzmann machine is a recursive
simulated annealing type of network with arbitrary network configuration. Hecht-Nielsen's
counterpropagation network cleverly combines a feedforward neural network structure with a
Kohonen layer. Grossberg's ART networks use a similar idea but can be elegantly implemented
in hardware and retain a high level of biological plausibility.
There is room as well for more specialized networks such as Oja's rules for principal
component analysis, wavelet networks, cellular automata networks and Fukushima's
neocognitron. Wavelet networks utilize the powerful wavelet transform and generally combine
elements of the Kohonen layer with radial basis function techniques. Cellular automata
networks are a neural network implementation of the cellular automata paradigm, popularized
by Mathematica's inventor, Stephen Wolfram. Fukushima's neocognitron is a multi-layered
network with weight sharing and feature extraction properties that has shown some of the best
performance for handwriting recognition and OCR applications.
A variety of higher-order methods improve the speed of the backpropagation approach. Most
widely applied are conjugate gradient methods and the Levenberg-Marquardt algorithm.
Recursive networks with feedback connections are applied more and more, especially in neuro-control
problems. For control applications, specialized and powerful neural network paradigms
have been developed and it is worthwhile noting that a one-to-one equivalence can be derived
between feedforward neural nets of the backpropagation type and Kalman filters. Fuzzy logic
and neural networks are often combined for control problems.
There is no shortage of neural network tools and most paradigms can be applied to a wide
range of problems. Most neural network implementations rely on the backpropagation
algorithm. However, which neural network paradigm to use is often a secondary question and
whatever the user feels comfortable with is fair game.
1.8 Literature
The domain of artificial neural networks is vast and its literature is expanding at a fast rate.
Knowing that this list is far from complete, let me briefly discuss my favorite neural network
references in this section. Note also that an excellent comprehensive introduction to neural
networks can be found in the frequently-asked-questions (FAQ) files on neural networks at various
WWW sites (e.g., search for "FAQ neural networks" in Alta Vista).
Jose Principe
Probably the standard textbook now for teaching neural networks. Comes with a demo version
of Neurosolutions.
Neural and Adaptive Systems: Fundamentals Through Simulations, Jose Principe, Neil R.
Euliano, and W. Curt Lefebvre, John Wiley (2000).
Hagan, Demuth, and Beale
An excellent book for basic, comprehensive undergraduate teaching, going back to basics with
lots of linear algebra and good MATLAB illustration files:
Neural Network Design, Hagan, Demuth, and Beale, PWS Publishing Company (1996).
Joseph P. Bigus
Bigus wrote an excellent introduction to neural networks for data mining for the non-technical
reader. The book makes a good case for why neural networks are an important data
mining tool and discusses the power and limitations of neural networks for data mining. Some
conceptual case studies are discussed. The book does not really discuss the theory of
neural networks, or how exactly to apply neural networks to a data mining problem, but it
nevertheless gives many practical hints and tips.
Data Mining with Neural Networks: Solving Business Problems – from Application
Development to Decision Support, McGraw-Hill (1997).
Maureen Caudill
Maureen Caudill has published several books that aim at the beginner's market and provide
valuable insight into the workings of neural nets. More than her books, I would recommend a
series of articles that appeared in the popular monthly magazine AI EXPERT. Collections of
Caudill's articles are bundled as separate special editions of AI EXPERT.
Phillip D. Wasserman
Wasserman published two very readable books explaining neural networks. He has a knack for
explaining difficult paradigms efficiently and understandably with a minimum of mathematical
diversions.
Neural Computing, Van Nostrand Reinhold (1990).
Advanced Methods in Neural Computing, Van Nostrand Reinhold (1993).
Jacek M. Zurada
Zurada published one of the first books on neural networks that can be considered a textbook. It is
written for an introductory-level graduate engineering course with an electrical engineering bias and
comes with a wealth of homework problems and software.
Introduction to Artificial Neural Systems, West Publishing Company (1992).
Laurene Fausett
An excellent introductory textbook at the advanced undergraduate level with a wealth of
homework problems.
Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Prentice Hall
(1994).
Simon Haykin
Nicknamed the bible of neural networks by my students, this 700-page work can be considered
both a desktop reference and an advanced graduate-level text on neural networks with
challenging homework problems.
Neural Networks: A Comprehensive Foundation, Macmillan College Publishing Company
(1995).
Mohammed H. Hassoun
Excellent graduate level textbook with clear explanations and a collection of very appropriate
homework problems.
Fundamentals of Artificial Neural Networks, MIT Press (1995).
John Hertz, Anders Krogh, and Richard G. Palmer
This book is one of the earlier and better books on neural networks and provides a thorough
understanding of the various neural paradigms and how and why neural networks work. This
book is excellent for its references and has an extremely high information density. Even though
this book is heavy on the Hopfield network and the statistical mechanics interpretation, I
probably consult this book more than any other. It does not lend itself well as a textbook, but
for a while it was one of the few good books available. Highly recommended.
Introduction to the Theory of Neural Computation, Addison Wesley Publishing Company
(1991).
Timothy Masters
Masters wrote a series of three books in short succession, and I would call his collection of
books the user's guide to neural networks. If you program your own networks, the wealth of
information is invaluable. If you use neural networks, the wealth of information is invaluable.
The books come with software and all source code is included. The software is very powerful,
but is geared toward the serious C++ user and lacks a decent user's interface for the non-C++
initiated. A must for the beginner and the advanced user.
Practical Neural Network Recipes in C++, Academic Press, Inc. (1993).
Signal and Image Processing with Neural Networks, John Wiley (1994).
Advanced Algorithms for Neural Networks: A C++ Sourcebook, John Wiley (1995).
Bart Kosko
Advanced electrical engineering graduate level textbook. Excellent for fuzzy logic and neural
network control applications. Not recommended for general introduction or advanced
reference.
Neural Networks and Fuzzy Systems, Prentice Hall (1992).
Guido J. DeBoeck
If you are serious about applying neural networks for stock market speculation this book is a
good starting point. No theory, just applications.
Trading on the Edge: Neural, Genetic, and Fuzzy Systems for Chaotic Financial Markets, John
Wiley & Sons (1994).
2. NEURAL NETWORK CASE STUDY – POPULATION FORECASTING
2.1 Introduction
The purpose of this case study is to present an overview of the philosophy of artificial neural
networks. This case study will motivate the view of neural networks as a model-free regression
technique. The study presented here describes how to estimate the world's population for the
year 2025 based on traditional regression techniques and based on an artificial neural network.
In the previous section an artificial neural network was explained as a biologically inspired
model that can implement a map. This model is based on an interconnection of elementary
McCulloch-Pitts neurons. It was postulated that:
(a) Most real-world problems can be formulated as a map.
(b) Such a map can be formally represented by an artificial neural network, where the
so-called "weights" are the free parameters to be determined.
(c) Neural networks can "train" their weights to conform with a map using powerful
computational algorithms. This model for the map not only represents the "training
samples" quite reasonably, but generally extrapolates well to "test samples" that were not
used to train the neural network.
The most popular algorithm for training a neural network is the backpropagation algorithm
which has been rediscovered in various fields over and over again and is generally credited to
Dr. Paul Werbos.[1] The backpropagation algorithm was widely popularized in 1986 by
Rumelhart and McClelland[2] explaining why the surge in popularity of artificial neural
networks is a relatively recent phenomenon. For the derivation and implementation details of
the backpropagation algorithm we refer to the literature.
2.2 Population forecasting
The reverend Thomas Malthus identified in 1798 in his seminal work "An essay on the
principle of population"[3] that the world's population grows exponentially while agricultural
output grows linearly, predicting gloom and doom for future generations. Indeed, the rapidly
expanding population on our planet reminds us daily that the resources of our planet have to be
carefully managed if we are to survive gracefully during the next few decades. The data for the
world's population from 1650 through 1996 are summarized in Table I and figure 2.1.[4]
TABLE I. Estimates for the world population (1650 – 1996)

YEAR    POPULATION (in millions)
1650     470
1750     694
1850    1091
1900    1571
1950    2513
1960    3027
1970    3678
1980    4478
1990    5292
1995    5734
1996    5772
In order to build a model for population forecasting we will normalize the data points (Table
II). The year 1650 is re-scaled as 0.0 and 2025 as 1.0 and we interpolate linearly in between for
all the other years. The reason for doing such a normalization is that it is customary (and often
required) for neural networks to scale the data between zero and unity. Since our largest
considered year will be 2025 it will be re-scaled as unity. The reader can easily verify that a
linear re-normalization of a variable x between a maximum value (max) and a minimal value
(min) will lead to a re-normalized value (xnor) according to:
    x_{nor} = \frac{x - min}{max - min}
Because the population increases so rapidly with time we will work with the natural logarithm
of the population (in million) and then re-normalize these data according to the above formula,
where (anticipating the possibility for a large forecast for the world's population in 2025) we
used 12 as the maximum possible value for the re-normalized logarithm of the population in
2025 and 6.153 as the minimum value. In other words: max in the above formula was
arbitrarily assigned a value of 12 to assure that the neural net predictions can accommodate
large values. Table II illustrates these transforms for the world population data.
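A short sketch of these transformations is given below; it applies the re-normalization formula to the years (min 1650, max 2025) and to the natural logarithm of the population (min 6.153, max 12) and reproduces the normalized columns of Table II up to rounding.

#include <math.h>
#include <stdio.h>

/* Linear re-normalization x_nor = (x - min) / (max - min), applied to the
   years and to ln(population), as used to build Table II. */
static double normalize(double x, double min, double max)
{
    return (x - min) / (max - min);
}

int main(void)
{
    double year[11] = {1650, 1750, 1850, 1900, 1950, 1960,
                       1970, 1980, 1990, 1995, 1996};
    double pop[11]  = {470, 694, 1091, 1571, 2513, 3027,
                       3678, 4478, 5292, 5734, 5772};   /* in millions */

    printf("YEARnor  ln(POP)  POPnor\n");
    for (int i = 0; i < 11; i++) {
        double ynor = normalize(year[i], 1650.0, 2025.0);
        double lpop = log(pop[i]);
        double pnor = normalize(lpop, 6.153, 12.0);
        printf("%6.3f   %6.3f   %6.3f\n", ynor, lpop, pnor);
    }
    return 0;
}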
[Figure 2.1: Estimates of the world population between 1650 and 1996.]
TABLE II. Estimates of the world population and corresponding normalizations

YEAR    POP     YEARnor    ln(POP)    POPnor
1650     470    0.000      6.153      0.000
1750     694    0.267      6.542      0.067
1850    1091    0.533      6.995      0.144
1900    1571    0.667      7.359      0.206
1950    2513    0.800      7.829      0.287
1960    3027    0.827      8.015      0.318
1970    3678    0.853      8.210      0.352
1980    4478    0.880      8.407      0.385
1990    5292    0.907      8.574      0.414
1995    5734    0.920      8.654      0.428
1996    5772    0.923      8.661      0.429
2.3 Traditional regression model for population forecasting
First we will apply traditional regression techniques to population forecasting. The classical
Malthusian model assumes that the population grows as an exponential curve. This is equivalent
to stating that the natural logarithm of the population will grow linearly with time. Because the
re-normalization in the previous paragraph re-scaled the population numbers first into their
natural logarithms, we should be able to get by with a linear regression model for the re-scaled
values. In other words, we are trying to determine the unknown coefficients a and b in the
following population model:

    POP_{nor} = a \cdot YEAR_{nor} + b

or, using the traditional symbols Y and X for the dependent and the independent variables,

    Y = aX + b

It is customary in regression analysis to determine the coefficients a and b such that the sum of
the squares of the errors (E) between the modeled values and the actual values is minimized.
In other words, the following function needs to be minimized:

    E = \sum_{i=1}^{N} (y_i - Y)^2 = \sum_{i=1}^{N} (y_i - a x_i - b)^2
There are N data points, x_i and y_i are the actual data points, and the Y values are the estimates
according to the model. The values for the coefficients a and b for which this error is minimal
can be found by setting the partial derivatives of the error with respect to the unknown
coefficients a and b equal to zero and solving this set of two equations for these unknown
coefficients. This leads to the following:
    \frac{\partial E}{\partial a} = 0, \qquad \frac{\partial E}{\partial b} = 0

or

    \frac{\partial E}{\partial a} = -2 \sum_i (y_i - a x_i - b) x_i = 0

    \frac{\partial E}{\partial b} = -2 \sum_i (y_i - a x_i - b) = 0

It is left as an exercise to the reader to verify that this yields for a and b

    a = \frac{N \sum_{i=1}^{N} x_i y_i - \sum_{i=1}^{N} x_i \sum_{i=1}^{N} y_i}{N \sum_i x_i^2 - \left(\sum_i x_i\right)^2}

    b = \bar{y} - a \bar{x}

where

    \bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i, \qquad \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
Table III illustrates the numerical calculation of a and b, where only the first ten data entries were
used (in other words, we do not consider the 1996 data point).
TABLE III. Intermediate sums for the calculation of the regression coefficients a and b

x_nor    y_nor    x·y      x^2
0.000    0.000    0.000    0.000
0.267    0.067    0.018    0.071
0.533    0.144    0.077    0.284
0.667    0.206    0.137    0.445
0.800    0.287    0.230    0.640
0.827    0.318    0.263    0.684
0.853    0.352    0.300    0.728
0.880    0.385    0.339    0.774
0.907    0.414    0.375    0.823
0.920    0.428    0.394    0.846
-----    -----    -----    -----
6.654    2.601    2.133    5.295    (column sums)
The expressions for a and b can now be evaluated based on the data in Table III:

    a = \frac{10 \times 2.133 - 6.654 \times 2.601}{10 \times 5.295 - (6.654)^2} = 0.464

    b = 0.260 - 0.464 \times 0.665 = -0.0486

Forecasting for the year 2025 according to the regression model yields the following
normalized value for the population:

    y_{2025} = a \times 1.0 + b = 0.464 - 0.0486 = 0.415

When re-scaling back into the natural logarithm of the actual population we obtain:

    \ln POP_{2025} = (max - min) \, y_{2025} + min = (12 - 6.153) \times 0.415 + 6.153 = 8.580
The actual population estimate for the year 2025 is the exponential of this value, leading to an
estimate of 5321 million people. Obviously this value is not what we would expect or accept as
a forecast. What happened is that over the considered time period (1650 - 1996) the
population has actually been exploding faster than exponentially, and the postulated exponential
model is not a very good one. The flaws in this simple regression approach become obvious
when we plot the data and their approximations in the re-normalized frame according to figure
2.2. Our model has an obvious flaw, but the approach we took here is a typical regression
implementation. Only by plotting our data and predictions, often after the fact, does the reason
for the poor or invalid estimate become obvious. More seasoned statisticians would suggest
that we try an approximation of the type:

    y = a + b e^{c x^2 + d x + e}

or use ARMA models and/or other state-of-the-art time series forecasting tools. All these
methods are fair game for forecasting and can yield reliable estimates in the hands of the
experienced analyst. Nevertheless, from this simple case study we can conclude so far that
forecasting the world's population seems to be a challenging forecasting problem indeed.
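A minimal sketch of this regression forecast is given below; it recomputes a and b from the ten normalized data points and then undoes the normalization, so the small differences from the hand calculation above come only from rounding the tabulated sums.

#include <math.h>
#include <stdio.h>

/* Least-squares fit y = a*x + b on the ten normalized points of Table III,
   followed by the re-scaled population forecast for the year 2025. */
int main(void)
{
    double x[10] = {0.000, 0.267, 0.533, 0.667, 0.800,
                    0.827, 0.853, 0.880, 0.907, 0.920};
    double y[10] = {0.000, 0.067, 0.144, 0.206, 0.287,
                    0.318, 0.352, 0.385, 0.414, 0.428};
    int N = 10;
    double sx = 0, sy = 0, sxy = 0, sxx = 0;

    for (int i = 0; i < N; i++) {
        sx += x[i];  sy += y[i];  sxy += x[i] * y[i];  sxx += x[i] * x[i];
    }
    double a = (N * sxy - sx * sy) / (N * sxx - sx * sx);
    double b = sy / N - a * sx / N;

    double y2025   = a * 1.0 + b;                    /* year 2025 -> x = 1.0  */
    double lnpop   = (12.0 - 6.153) * y2025 + 6.153; /* undo the normalization */
    double pop2025 = exp(lnpop);                     /* back to millions       */

    printf("a = %.3f, b = %.4f\n", a, b);            /* about 0.464 and -0.048 */
    printf("forecast for 2025: %.0f million\n", pop2025);  /* roughly 5300;    */
                                                     /* the text's hand
                                                        calculation gives 5321 */
    return 0;
}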
2.4 Simple neural network model for population forecasting
In this section we will develop the neural network approach for building a population
forecasting model. We will define a very simple network with one input element, two neurons
in the hidden layer and one output neuron. We will however include two bias nodes (dummy
nodes with input unity) which is standard practice for most neural network applications. The
network has common sigmoid transfer functions and the bias is just an elegant way to allow
some shifts in the transfer functions as well. The sigmoid transfer function can be viewed as a
crude approximation for the threshold function. Remember that an artificial neuron can be
viewed as a mathematical operator with the following two functions:

a) Make a weighted sum of the input signals, resulting in a signal z.
b) Apply a transfer function f(z) to the signal z, which in the case of a sigmoid corresponds to:

    f(z) = \frac{1}{1 + e^{-z}}

as illustrated in figure 2.3.

[Figure 2.2: Results from the regression analysis on the logarithmically normalized data entries.]

[Figure 2.3: The sigmoid function f(z) = 1/(1 + e^{-z}) as a crude approximation to the threshold
function.]

Note that the introduction of bias nodes (i.e., dummy nodes with input unity, as shown in
figure 2.4) allows horizontal shifts of the sigmoid (and/or threshold function), allowing more
powerful and more flexible approximations.
Figure 2.4 is a representation of our simple neural network. Note that there are three neurons
and two bias nodes. There are three layers: an input layer, one hidden layer and an output layer.
Only the hidden layer and the output layer contain neurons: such a network is referred to as a
1x2x1 net. The two operations of a neuron (weighted sum and transfer function) are
symbolically represented on the figure for each neuron (by the symbols Σ and f). In order for a
neural network to be a robust function approximator at least one hidden layer of neurons and
generally at most two hidden layers of neurons are required. The neural network represented in
figure 2.4 is the most common neural network of the feedforward type and is fully connected.
The unknown weights are indicated on the figure by the symbols w1, w2 ,...,w7.
The weights can be considered as being the neural network equivalent for the unknown
regression coefficients from our regression model. The algorithm for finding these coefficients
that was applied here is the standard backpropagation algorithm, which minimizes the sum of
the squares of the errors, similar to the way it was done for regression analysis. However,
contrary to regression analysis, an iterative numerical minimization procedure rather than an
analytical derivation was applied to estimate the weights in order to minimize the least-squares
error measure. The backpropagation algorithm uses a clever trick to solve this problem when a
hidden layer of neurons is present in the model. By all means think of a neural network as a
more sophisticated regression model. It is different from a regression model in the sense that
we do not specify linear or higher-order models for the regression analysis. We specify only a
neural network frame (number of layers of neurons, and number of neurons in each layer) and
let the neural network algorithm work out what the proper choice for the weights will be. This
approach is often referred to as a model-free approximation method, because we really do not
specify whether we are dealing with a linear, quadratic or exponential model. The neural
network was trained with MetaNeural™, a general-purpose neural network program that uses
the backpropagation algorithm and runs on most computer platforms. The neural network was
trained on the same 10 patterns that were used for the regression analysis and the screen
response is illustrated in figure 2.5.
[Figure 2.4: Neural network approximation for the population forecasting problem -- a 1-2-1
feedforward network with one input, two hidden neurons (neurons 1 and 2), one output neuron
(neuron 3), two bias nodes, and the weights w1 through w7.]

[Figure 2.5: Screen response from MetaNeural™ for training and testing the population forecasting
model.]
Hands-on details for the network training will be left for lecture 3, where we will gain hands-on
exposure to artificial neural network programs. The files that were used for the
MetaNeural™ program are reproduced in the appendix. The program gave 0.48118 as the
prediction for the normalized population forecast in 2025. After re-scaling, this would
correspond to 7836 million people. Probably a rather underestimated forecast, but definitely
better than the regression model. The weights corresponding to this forecast model are
reproduced in Table IV. The problem with this neural network model is that a 1-2-1 net is a rather
simplistic network and that, the way we represented the patterns, too much emphasis is placed
on the earlier years (1650 - 1850), which are really not all that relevant. By over-sampling (i.e.,
presenting the data from 1950 onward, let's say, three times as often as the other data) and
choosing a 1-3-4-1 network, the way a more seasoned practitioner might approach this
problem, we actually obtained a forecast of 11.02 billion people for the world's population in
2025. This answer seems to be a lot more reasonable than the one obtained from the 1-2-1
network. Changing to the 1-3-4-1 model is just a matter of changing a few numbers in the input
file for MetaNeural™ and can be done in a matter of seconds. The results for the predictions
with the 1-3-4-1 network with over-sampling are shown in figure 2.6.
[Figure 2.6: World population prediction with a 1-3-4-1 artificial neural network with over-sampling.]
TABLE IV. Weight values corresponding to the neural network in figure 2.4

WEIGHT    VALUE
w1        -2.6378
w2         2.4415
w3         1.6161
w4        -1.3550
w5        -3.6308
w6         3.0321
w7        -1.3795
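As a check, the sketch below evaluates this 1-2-1 network in C with the Table IV weights. The wiring is an assumption read off figure 2.4 (w1 and w2 from the input to hidden neurons 1 and 2; w3 and w4 from the first bias node to hidden neurons 1 and 2; w5, w6, and w7 from hidden neuron 1, hidden neuron 2, and the second bias node to the output neuron); under this reading the normalized forecast for 2025 comes out close to the 0.481 quoted above.

#include <math.h>
#include <stdio.h>

static double sigmoid(double z) { return 1.0 / (1.0 + exp(-z)); }

/* Evaluate the 1-2-1 network of figure 2.4 with the Table IV weights.
   Assumed wiring: w1, w2 input -> hidden 1, 2; w3, w4 bias -> hidden 1, 2;
   w5, w6 hidden 1, 2 -> output; w7 bias -> output. */
static double net_121(double x, const double w[7])
{
    double h1 = sigmoid(w[0] * x + w[2]);           /* hidden neuron 1 */
    double h2 = sigmoid(w[1] * x + w[3]);           /* hidden neuron 2 */
    return sigmoid(w[4] * h1 + w[5] * h2 + w[6]);   /* output neuron 3 */
}

int main(void)
{
    double w[7] = {-2.6378, 2.4415, 1.6161, -1.3550, -3.6308, 3.0321, -1.3795};
    double ynor = net_121(1.0, w);                  /* year 2025 -> x = 1.0 */
    double pop  = exp((12.0 - 6.153) * ynor + 6.153);
    printf("normalized forecast = %.4f\n", ynor);   /* close to the 0.481 in the text */
    printf("population forecast = %.0f million\n", pop);
    return 0;
}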
2.6 Conclusions
A neural network can be viewed as a least-squares, model-free, regression-like approximator
that can implement almost any map. Building a forecasting model for the world's
population with a simple neural network proceeds similarly to regression analysis and is relatively
straightforward. The fact that neural networks are model-free approximators is often an
advantage over traditional statistical forecasting methods and standard time series analysis
techniques. Where neural networks differ from standard regression techniques is in the way the
least-squares error minimization procedure is implemented: while regression techniques
rely on closed-form, one-step analytical formulas, the neural network approach employs an
iterative numerical backpropagation algorithm.
2.7 Exercises for the brave
1. Derive the expressions for the parameters a, b, c, d, and e for the following regression
   model:

       y = a + b e^{c x^2 + d x + e}

   and forecast the world's population for the year 2025 based on this model.

2. Write a MATLAB program that implements the evaluation of the network shown in
   figure 2.4 and verify the population forecast for the year 2025 based on this 1-2-1 neural
   network model and the weights shown in TABLE IV.

3. Expand the MATLAB program of exercise 2 into a program that can train the weights of a
   neural network based on a random search model. I.e., start with an initial random
   collection of weights (let's say all chosen from a uniform random distribution
   between -1.0 and +1.0). Then iteratively adjust the weights by making small random
   perturbations (one weight at a time), evaluate the new error after showing all the training
   samples, and retain the perturbed weight if the new error is smaller. Repeat this process
   until the network has a reasonably small error.
2.8 References
[1] P. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral
    sciences," Ph.D. thesis, Harvard University (1974).
[2] D. E. Rumelhart, G. Hinton, and R. J. Williams, "Learning internal representations by
    error propagation," in Parallel Distributed Processing: Explorations in the Microstructure
    of Cognition, Vol. 1, D. E. Rumelhart and J. L. McClelland, Eds., Chapter 8, pp. 318-362,
    MIT Press, Cambridge, MA (1986).
[3] T. Malthus, "An Essay on the Principle of Population," 1798. Republished in the Pelican
    Classics series, Penguin Books, England (1976).
[4] Otto Johnson, Ed., "1997 Information Please Almanac," Houghton Mifflin Company,
    Boston & New York (1996).
APPENDIX: INPUT FILES FOR 1-2-1 NETWORK FOR MetaNeural™
ANNOTATED MetaNeural™ INPUT FILE: POP

3         Three-layered network
1         One input node
2         2 neurons in the hidden layer
1         One output neuron
1         Show all samples and then update weights
0.1       Learning parameter, first layer of weights
0.1       Learning parameter, second layer of weights
0.5       Momentum, first layer of weights
0.5       Momentum, second layer of weights
1000      Do a thousand iterations (for all patterns)
500       Show intermediate results every 500 iterations on the screen
1         Standard [0, 1] sigmoid transfer function
1         Temperature one for sigmoid (i.e., standard sigmoid)
pop.pat   Name of training pattern file
0         Ignored
100       Ignored
0.01      Stop training when error is less than 0.01
1
0.6       Initial weights are drawn from a uniform random distribution between [-0.6, 0.6]
POP.PAT: The pattern file

10                        10 training patterns
0.000   0.000   0         first training pattern
0.267   0.067   1         second training pattern
0.533   0.144   2
0.667   0.206   3
0.800   0.287   4
0.827   0.318   5
0.853   0.352   6
0.880   0.385   7
0.907   0.414   8
0.920   0.428   9