Neural Networks
Learning Objectives
Characteristics of neural nets
Supervised learning – Back-propagation
Probabilistic nets
What is a Neural Network?
According to the DARPA Neural Network Study
(1988, AFCEA International Press, p. 60):
... a neural network is a system composed of many simple
processing elements operating in parallel whose function is
determined by network structure, connection strengths, and
the processing performed at computing elements or nodes.
Characteristics of Neural Nets
The good news: they exhibit some brain-like
behaviors that are difficult to program directly, such as:
learning
association
categorization
generalization
feature extraction
optimization
noise immunity
The bad news: neural nets are
black boxes
difficult to train in some cases
There is a wide range of neural network architectures:
Multi-Layer Perceptron (Back-Prop Nets) 1974-85
Neocognitron 1978-84
Adaptive Resonance Theory (ART) 1976-86
Self-Organizing Map 1982
Hopfield 1982
Bi-directional Associative Memory 1985
Boltzmann/Cauchy Machine 1985
Counterpropagation 1986
Radial Basis Function 1988
Probabilistic Neural Network 1988
General Regression Neural Network 1991
Support Vector Machine 1995
Our single "neuron" model
[Figure: inputs x1, …, xn with weights w1, …, wn, plus a bias input of −1 with weight b, feeding a summing junction Σ followed by a hard threshold.]

$$D = \sum_{i=1}^{n+1} w_i x_i, \qquad O = \begin{cases} \ \ 1 & D \ge 0 \\ -1 & D < 0 \end{cases}$$

where the bias is folded in as $x_{n+1} = -1$ and $w_{n+1} = b$. The output $O$ assigns the input to class c1 ($O = 1$) or class c2 ($O = -1$).
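As a minimal sketch of this thresholded sum (NumPy assumed; the function name and signature are illustrative, not from the original slides):

```python
import numpy as np

def threshold_neuron(x, w, b):
    """Single threshold neuron; the bias enters via x_{n+1} = -1, w_{n+1} = b."""
    D = np.dot(w, x) - b          # D = sum_i w_i x_i  plus the (-1) * b bias term
    return 1 if D >= 0 else -1    # O = 1 -> class c1, O = -1 -> class c2
```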
Basic Neuron Model
[Figure: the jth neuron. The input features and bias x1, …, xn enter through weights w_j1, …, w_jn; the neuron forms the sum D_j and applies a threshold activation function f to produce its output h_j.]

$$D_j = \sum_i w_{ji} x_i, \qquad h_j = f(D_j)$$

[Figures: feed-forward networks built from such neurons: an input layer feeding one or two hidden layers, and an output layer producing class outputs C1, C2, C3.]
Most neural nets use a smooth activation function
[Figure: the same jth neuron, with the hard threshold replaced by a smooth (sigmoidal) activation applied to D_j = Σ_i w_ji x_i.]

$$f(z) = \frac{1}{1 + e^{-z}}, \qquad f'(z) = f(z)\bigl(1 - f(z)\bigr)$$
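A small sketch of the sigmoid and the identity f' = f(1 − f) used below (NumPy assumed; names are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Smooth activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    """Derivative computed through the identity f'(z) = f(z) * (1 - f(z))."""
    f = sigmoid(z)
    return f * (1.0 - f)
```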
Major question – How do we adjust the weights to learn the
mapping between inputs and outputs?
Answer: Use the back propagation algorithm, which is just an
application of the chain rule of differential calculus.
Consider this simple example: a single input x feeds a hidden unit h through weight w, and h feeds the output y through weight u:

$$h = f(wx), \qquad y = f(uh) = f\bigl(u\,f(wx)\bigr) = F(u, w, x)$$
To learn the weights we try to minimize the output error. That is, we start with an initial guess for the weights and then present a training example with known input x and desired output d. We form the error

$$E = \tfrac{1}{2}\,(d - y)^2$$

and adjust the weights to reduce this error. Since

$$\Delta E = \frac{\partial E}{\partial u}\,\Delta u + \frac{\partial E}{\partial w}\,\Delta w$$

we can make $\Delta E \le 0$ by choosing

$$\Delta u = -\lambda\,\frac{\partial E}{\partial u}, \qquad \Delta w = -\lambda\,\frac{\partial E}{\partial w}, \qquad \lambda \ \ldots \ \text{a positive constant}$$
This leads to the adjustment rule
$$u(m+1) = u(m) - \mu\,\frac{\partial E}{\partial u(m)}, \qquad w(m+1) = w(m) - \mu\,\frac{\partial E}{\partial w(m)}, \qquad \mu \ \ldots \ \text{learning rate}$$
So we now need to find these derivatives of E with respect to the
weights.
[Figure repeated: x feeds h through weight w, and h feeds y through weight u, with h = f(wx) and y = f(uh) = f(u f(wx)) = F(u, w, x).]
For the derivatives involving the weight between the hidden layer and
output layer
$$\frac{\partial E}{\partial u(m)} = \frac{\partial}{\partial u}\left[\tfrac{1}{2}(d - y)^2\right] = \frac{\partial}{\partial y}\left[\tfrac{1}{2}(d - y)^2\right]\frac{\partial y}{\partial u} = (y - d)\,f'(uh)\,h = (y - d)\,y(1 - y)\,h\,\Big|_{u(m),\,w(m)}$$
Similarly for the weight between the hidden layer and input layer
$$\frac{\partial E}{\partial w(m)} = \frac{\partial}{\partial y}\left[\tfrac{1}{2}(d - y)^2\right]\frac{\partial y}{\partial w} = (y - d)\,f'(uh)\,\frac{\partial (uh)}{\partial w} = (y - d)\,y(1 - y)\,u\,f'(wx)\,x = (y - d)\,y(1 - y)\,h(1 - h)\,u x\,\Big|_{u(m),\,w(m)}$$
Thus, the training algorithm is:
1. Initialize weights to small random values
2. Using a training set of known pairs of inputs and outputs (x, d) change the
weights according to
$$w(m+1) = w(m) - \mu\,(y - d)\,y(1 - y)\,h(1 - h)\,u x\,\Big|_{u(m),\,w(m)}$$

$$u(m+1) = u(m) - \mu\,(y - d)\,y(1 - y)\,h\,\Big|_{u(m),\,w(m)}$$

with

$$y = f(uh), \qquad h = f(wx)$$

until

$$E = \tfrac{1}{2}\,(d - y)^2$$

becomes sufficiently small.

[Figure: error E versus iteration, decreasing toward the point where training is stopped.]
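A minimal sketch of this training loop for the two-weight example above (pure Python; the function names, stopping tolerance, and initial weight range are assumptions, not part of the original algorithm statement):

```python
import math
import random

def f(z):
    """Sigmoidal activation."""
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, mu=0.5, tol=1e-4, max_iter=10000):
    """Learn y = f(u * f(w * x)) from (x, d) training pairs by gradient descent."""
    w = random.uniform(-0.5, 0.5)           # step 1: small random initial weights
    u = random.uniform(-0.5, 0.5)
    for _ in range(max_iter):
        E = 0.0
        for x, d in samples:                # step 2: present known (x, d) pairs
            h = f(w * x)
            y = f(u * h)
            dE_du = (y - d) * y * (1 - y) * h                     # chain rule
            dE_dw = (y - d) * y * (1 - y) * u * h * (1 - h) * x   # chain rule
            u -= mu * dE_du                 # both gradients use the current u(m), w(m)
            w -= mu * dE_dw
            E += 0.5 * (d - y) ** 2
        if E < tol:                         # stop when E is sufficiently small
            break
    return w, u
```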
This is an example of supervised learning
One of the most popular neural nets is a feed forward net with
one hidden layer trained by the back propagation algorithm
[Figure: feed-forward network with one hidden layer and output nodes.]
It can be shown that in principle this type of network can
represent an arbitrary input-output mapping or solve an arbitrary
classification problem
Back propagation algorithm (three layer feed forward network)
[Figure: three-layer network with P input nodes x_i, M hidden nodes h_j connected by weights w_ji, and K output nodes y_k connected by weights u_kj.]

$$u_{kj}^{\,new} = u_{kj}^{\,old} - \mu\,(y_k - d_k)\,y_k(1 - y_k)\,h_j \qquad (k = 1, \ldots, K;\ j = 1, \ldots, M)$$

$$w_{ji}^{\,new} = w_{ji}^{\,old} - \mu \sum_{k=1}^{K} (y_k - d_k)\,y_k(1 - y_k)\,u_{kj}^{\,old}\,h_j(1 - h_j)\,x_i \qquad (j = 1, \ldots, M;\ i = 1, \ldots, P)$$

with

$$y_k = \frac{1}{1 + \exp\!\left(-\sum_{j=1}^{M} u_{kj}^{\,old}\,h_j\right)}, \qquad h_j = \frac{1}{1 + \exp\!\left(-\sum_{i=1}^{P} w_{ji}^{\,old}\,x_i\right)}$$
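As a hedged sketch, the same update can be written in vector form with NumPy (W is M×P, U is K×M; the function name and the single-pattern update are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(W, U, x, d, mu=0.1):
    """One weight update of the three-layer net for a single pair (x, d).

    W : (M, P) input-to-hidden weights w_ji
    U : (K, M) hidden-to-output weights u_kj
    x : (P,) input vector;  d : (K,) desired output vector
    """
    h = sigmoid(W @ x)                              # hidden outputs h_j
    y = sigmoid(U @ h)                              # network outputs y_k
    delta_out = (y - d) * y * (1.0 - y)             # (y_k - d_k) y_k (1 - y_k)
    delta_hid = (U.T @ delta_out) * h * (1.0 - h)   # uses the old u_kj, as required
    U_new = U - mu * np.outer(delta_out, h)         # u_kj update
    W_new = W - mu * np.outer(delta_hid, x)         # w_ji update
    return W_new, U_new
```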
Some issues associated with this "backprop" network
1. design of training, testing and validation data sets
2. determination of the network structure
3. selection of the learning rate (μ)
4. problems with under- or over-training
[Figure: error E for the training set and the testing set versus iterations; the training-set error keeps decreasing while the testing-set error eventually rises once the net becomes over-trained, marking the point to stop learning.]
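One common way to handle issue 4 is to monitor the testing-set error and stop once it stops improving. A rough sketch, reusing the sigmoid and backprop_step sketches above (the patience parameter and error measure are assumptions):

```python
import numpy as np

def train_with_early_stopping(W, U, train_set, test_set,
                              mu=0.1, max_epochs=1000, patience=10):
    """Stop training when the testing-set error no longer decreases."""
    best_E, best_weights, stalled = float("inf"), (W.copy(), U.copy()), 0
    for _ in range(max_epochs):
        for x, d in train_set:                          # one pass over the training set
            W, U = backprop_step(W, U, x, d, mu)
        E_test = sum(0.5 * np.sum((d - sigmoid(U @ sigmoid(W @ x))) ** 2)
                     for x, d in test_set)              # error on the testing set
        if E_test < best_E:                             # still generalizing
            best_E, best_weights, stalled = E_test, (W.copy(), U.copy()), 0
        else:                                           # over-training setting in
            stalled += 1
            if stalled >= patience:
                break
    return best_weights
```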
Some important issues for neural networks:
Pre-processing the data to provide:
• reduction of data dimensionality
• noise filtering or suppression
• enhancement
    strengthening of relevant features
    centering data within a sensory aperture or window
    scanning a window over the data
• invariance in the measurement space to:
    translations
    rotations
    scale changes
    distortion
• data preparation (see the sketch after the list of examples below)
    analog to digital conversion
    data scaling
    data normalization
    thresholding
Some examples of pre-processing include
1-D and 2-D FFTs
Filtering
Convolution Kernels
Correlation Masks or Template Matching
Autocorrelation
Edge Detection and Enhancement
Morphological Image Processing
Fourier Descriptors
Walsh, Hadamard, Cosine, Hartley, Hotelling, Hough Transforms
Higher order spectra
Homomorphic Transformations (e.g. Cepstrums)
Time-Frequency Transforms (Wavelet, Wigner-Ville, Zak)
Linear Predictive Coding
Principal Component Analysis
Independent Component Analysis
Geometric Moments
Thresholding
Data Sampling
Scanning
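For the data scaling and data normalization steps above, a small sketch (NumPy assumed; function names are illustrative). Unit-length normalization is also the form assumed by the PNN decision function in the next section:

```python
import numpy as np

def scale_features(X):
    """Zero-mean, unit-variance scaling of each feature (column) of X."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def normalize_to_unit_length(X):
    """Scale each pattern (row) of X to unit Euclidean length."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```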
Probabilistic Neural Network (PNN)
Basic idea:
Use training samples themselves to obtain a representation of the
probability distributions for each class and then use Bayes decision
rule to make a classification
Basis functions usually chosen are Gaussians:

$$f_i(x) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}\,M_i} \sum_{j=1}^{M_i} \exp\!\left[\frac{-(x - x_{ij})^T (x - x_{ij})}{2\sigma^2}\right]$$

i … class number (i = 1, 2, …, N)
j … training pattern number
x_ij … jth training pattern from the ith class
M_i … number of training vectors in class i
p … dimension of the vector x
f_i(x) … sum of Gaussians centered at each training pattern from the ith class, representing the probability density of that class
σ … smoothing factor (standard deviation, width of the Gaussians)

[Figure: probability density function for class i, a sum of Gaussian bumps centered at the training patterns x_ij.]
If we normalize the vectors x and x_ij to unit length and assume the numbers of training samples from each class are in proportion to their a priori probabilities of occurrence, then we can take as our decision function

$$g_i(x) = M_i\,f_i(x) = \frac{1}{(2\pi)^{p/2}\,\sigma^{p}} \sum_{j=1}^{M_i} \exp\!\left[\frac{(x \cdot x_{ij}) - 1}{\sigma^2}\right]$$
Since we decide for a given class k based on
$$g_k(x) > g_i(x) \qquad \text{for all } i = 1, 2, \ldots, N \ (i \ne k)$$
the common constant outside the sum makes no difference and we can
take
$$g_i(x) = \sum_{j=1}^{M_i} \exp\!\left[\frac{(x \cdot x_{ij}) - 1}{\sigma^2}\right]$$
This can now be easily implemented in a neural network form
[Figure: Probabilistic Neural Network. The input layer presents x_1, …, x_p; the pattern-layer weights are just the elements of the training patterns x_ij; one summation unit per class (fed by that class's M_1, …, M_N pattern units) forms the outputs g_1(x), …, g_N(x).]
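A minimal NumPy sketch of this network form (the function name and data layout are assumptions; training patterns are taken to be unit-length rows, as required above):

```python
import numpy as np

def pnn_classify(x, class_patterns, sigma):
    """PNN decision for one input pattern x.

    class_patterns : list of arrays, one per class; array i is (M_i, p) and
                     holds the unit-length training patterns x_ij as rows.
    Returns the index of the class with the largest g_i(x).
    """
    x = x / np.linalg.norm(x)                            # normalize the input to unit length
    scores = [np.exp((X_i @ x - 1.0) / sigma**2).sum()   # g_i(x): one sum per class
              for X_i in class_patterns]
    return int(np.argmax(scores))
```

With this form, the smoothing factor σ is the only quantity left to tune, consistent with characteristic 2 below.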
Characteristics of the PNN
1. no training, weights are just the training vectors themselves
2. only parameter that needs to be found is the smoothing factor, σ
3. outputs are representative of probabilities of each class directly
4. the decision surfaces are guaranteed to approach the Bayes optimal
boundaries as the number of training samples grows
5. "outliers" are tolerated
6. sparse samples are adequate for good network performance
7. can update the network as new training samples become available
8. needs to store all the training samples, requiring a large memory
9. testing can be slower than with other nets
References
Specht, D.F., "Probabilistic neural networks," Neural Networks, 3, 109-118, 1990.
Zaknich, A., Neural Networks for Intelligent Signal Processing, World Scientific, 2003.
Haykin, S., Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice-Hall, 1999.
Bishop, C.M., Neural Networks for Pattern Recognition, Clarendon Press, 1995.
Resources
There are many, many neural network resources and tools available on
the web.
Some software packages:
MATLAB Neural Network Toolbox (www.mathworks.com)
Neuroshell Classifier (www.wardsystems.com)
ClassifierXL (www.analyzerxl.com)
Brainmaker (www.calsci.com)
Neurosolutions (www.nd.com)
Neuroxl (www.neuroxl.com)