Multi-Layer Feedforward
Neural Networks
CAP5615 Intro. to Neural Networks
Xingquan (Hill) Zhu
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
Multi-layer NN
• Between the input and output layers there are hidden layers, as
illustrated below.
– Hidden nodes do not directly send outputs to the external environment.
• Multi-layer NNs overcome the limitation of a single-layer NN
– they can handle non-linearly separable learning tasks.
[Figure: a network with an input layer, one hidden layer, and an output layer.]
XOR problem
The two classes, green and red, cannot be separated by a single line; two lines
are needed. The NN below, with two hidden nodes, realizes this non-linear
separation: each hidden node represents one of the two blue lines.
[Figure: the XOR points in the (x1, x2) plane with the two separating lines, and a
2-2-1 network with inputs x1, x2, a constant bias input (+1), hidden nodes y1, y2
and output node z, whose weights implement the two lines.]
Types of decision regions
• A network with a single node realizes a half plane: it separates the region
  $w_0 + w_1 x_1 + w_2 x_2 \ge 0$ from the region $w_0 + w_1 x_1 + w_2 x_2 < 0$.
• A one-hidden-layer network can realize a convex region: each hidden node
  realizes one of the lines (L1, L2, L3, L4) bounding the convex region.
• A two-hidden-layer network can realize the union of three convex regions:
  each box represents a one-hidden-layer network realizing one convex region
  (P1, P2, P3).
[Figures: the single-node network, the one-hidden-layer network bounding a convex
region with four lines, and the two-hidden-layer network combining three convex regions.]
Different Non-Linearly Separable Problems

Structure    | Types of Decision Regions
Single-Layer | Half plane bounded by a hyperplane
Two-Layer    | Convex open or closed regions
Three-Layer  | Arbitrary (complexity limited by the number of nodes)

[Figure: for each structure, example decision regions separating classes A and B on
the Exclusive-OR problem, on a class-separation problem, and for the most general
region shapes.]
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
FFNN NEURON MODEL
• The classical learning algorithm of FFNN is based on
the gradient descent method.
• The activation functions used in an FFNN are continuous and differentiable
everywhere, so the network error is a differentiable function of the weights.
– A typical activation function is the Sigmoid Function
FFNN NEURON MODEL
• A typical activation function is the Sigmoid Function:

  $\varphi(v_j) = \frac{1}{1 + e^{-a v_j}}$, with $a > 0$ and $v_j = \sum_i w_{ji} y_i$,

  where $w_{ji}$ is the weight of the link from node i to node j and $y_i$ is the
  output of node i.
  [Figure: plot of $\varphi(v_j)$ for $v_j$ from -10 to 10; increasing a makes the curve steeper.]
• When a approaches 0, $\varphi$ tends to a linear function
• When a tends to infinity, $\varphi$ tends to the step function
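A minimal Python sketch (an illustration, not part of the original slides) of this activation function, showing how the slope parameter a changes its shape:

import math

def sigmoid(v, a=1.0):
    # Sigmoid activation: 1 / (1 + exp(-a*v)), with slope parameter a > 0
    return 1.0 / (1.0 + math.exp(-a * v))

for a in (0.5, 1.0, 5.0):   # larger a -> closer to a step function
    print(a, [round(sigmoid(v, a), 3) for v in (-2, -1, 0, 1, 2)])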
FFNN MODEL
• x_ij : the input from node i to node j
• w_ij : the weight of the link from node i to node j
  – Δw_ij : the weight update amount for the link from node i to node j
• o_k : the output of node k
The objective of multi-layer NN
• The error of output neuron j after the activation of the network on the
  n-th training example (x(n), d(n)) is:
  $e_j(n) = d_j(n) - o_j(n)$
• The network error is the sum of the squared errors of the output neurons:
  $E(n) = \frac{1}{2} \sum_{j \in \text{output nodes}} e_j^2(n)$
• The total mean squared error is the average of the network errors over the
  N training examples:
  $E(W) = \frac{1}{N} \sum_{n=1}^{N} E(n) = \frac{1}{2N} \sum_{n} \sum_{j} \left( d_j(n) - o_j(n) \right)^2$
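As a small illustration (not from the slides), both error quantities can be computed directly from targets and network outputs; the numbers below are arbitrary example values:

targets = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # d_j(n) for 3 examples, 2 output nodes
outputs = [[0.8, 0.1], [0.3, 0.7], [0.9, 0.4]]   # o_j(n) produced by the network

def network_error(d, o):
    # E(n) = 1/2 * sum over output nodes of (d_j(n) - o_j(n))^2
    return 0.5 * sum((dj - oj) ** 2 for dj, oj in zip(d, o))

errors = [network_error(d, o) for d, o in zip(targets, outputs)]
E_W = sum(errors) / len(errors)   # total mean squared error over the N examples
print(errors, E_W)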
Feed forward NN
Idea: Credit assignment problem
• Problem of assigning ‘credit’ or ‘blame’ to the individual elements (the
hidden units) involved in forming the overall response of a learning system
• In neural networks, the problem amounts to distributing the network error
over the weights.
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
Training: Backprop algorithm
• Searches for weight values that minimize the total
error of the network over the set of training
examples.
• Training repeatedly performs the following two passes:
– Forward pass: Compute the outputs of all units in the
network, and the error of the output layers.
– Backward pass: The network error is used for updating
the weights (credit assignment problem).
• Starting at the output layer, the error is propagated backwards
through the network, layer by layer. This is done by recursively
computing the local gradient of each neuron.
Backprop
• Back-propagation training algorithm illustrated:
[Figure: forward step (network activation and error computation) followed by
backward step (error propagation).]
• Backprop adjusts the weights of the NN in order
to minimize the network total mean squared
error.
BP Example
[Figure: a 2-2-1 network with bias input x0 and inputs x1, x2, hidden nodes a and b,
output node c, and weights w0a, w1a, w2a, w0b, w1b, w2b, w0c, wac, wbc.]
• XOR training set:

  x0  x1  x2 | y
   1   0   0 | 0
   1   0   1 | 1
   1   1   0 | 1
   1   1   1 | 0

• η = 0.5;  o_x = 1 / (1 + e^{-v_x})
For instance {(1, 0, 0), 0}:

Neuron a: w0a = 0.34, w1a = 0.13, w2a = -0.92  →  va = 0.34,  oa = 0.58
Neuron b: w0b = -0.12, w1b = 0.57, w2b = -0.33  →  vb = -0.12,  ob = 0.47
Neuron c: w0c = -0.99, wac = 0.16, wbc = 0.75   →  vc = -0.54,  oc = 0.37

δc = oc(1 - oc)(tc - oc) = 0.37*(1 - 0.37)*(0 - 0.37) = -0.085
δa = oa(1 - oa) Σk wak δk = 0.58*(1 - 0.58)*0.16*(-0.085) = -0.003
δb = ob(1 - ob) Σk wbk δk = 0.47*(1 - 0.47)*0.75*(-0.085) = -0.016

Δw0a = η δa x0 = 0.5*(-0.003)*1 ≈ -0.0015
Δw0b = η δb x0 = 0.5*(-0.016)*1 = -0.008
Δw0c = η δc x0 = 0.5*(-0.085)*1 ≈ -0.043
Δw1a = η δa x1 = 0.5*(-0.003)*0 = 0
Δw1b = η δb x1 = 0.5*(-0.016)*0 = 0
Δw2a = η δa x2 = 0.5*(-0.003)*0 = 0
Δw2b = η δb x2 = 0.5*(-0.016)*0 = 0
Δwac = η δc oa = 0.5*(-0.085)*0.58 ≈ -0.025
Δwbc = η δc ob = 0.5*(-0.085)*0.47 ≈ -0.020
• Weight updating
Neuron a: w0a ← 0.34 - 0.0015 ≈ 0.3385;  w1a ← 0.13 + 0 = 0.13;  w2a ← -0.92 + 0 = -0.92
Neuron b: w0b ← -0.12 - 0.008 = -0.128;  w1b ← 0.57 + 0 = 0.57;  w2b ← -0.33 + 0 = -0.33
Neuron c: w0c ← -0.99 - 0.043 = -1.033;  wac ← 0.16 - 0.025 = 0.135;  wbc ← 0.75 - 0.02 = 0.73
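The following Python sketch (added here as a check, not part of the original slides) reproduces the hand computation above for the instance {(1, 0, 0), 0}, so the forward pass, local gradients, and weight updates can be verified numerically:

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Initial weights from the example
w0a, w1a, w2a = 0.34, 0.13, -0.92
w0b, w1b, w2b = -0.12, 0.57, -0.33
w0c, wac, wbc = -0.99, 0.16, 0.75
eta = 0.5

x0, x1, x2, t = 1, 0, 0, 0                      # training instance {(1, 0, 0), 0}

# Forward pass
oa = sigmoid(w0a*x0 + w1a*x1 + w2a*x2)          # ~0.58
ob = sigmoid(w0b*x0 + w1b*x1 + w2b*x2)          # ~0.47
oc = sigmoid(w0c*x0 + wac*oa + wbc*ob)          # ~0.37

# Backward pass: local gradients
delta_c = oc * (1 - oc) * (t - oc)              # ~-0.085
delta_a = oa * (1 - oa) * wac * delta_c         # ~-0.003
delta_b = ob * (1 - ob) * wbc * delta_c         # ~-0.016

# Weight updates: w <- w + eta * delta * input
w0c += eta * delta_c * x0;  wac += eta * delta_c * oa;  wbc += eta * delta_c * ob
w0a += eta * delta_a * x0;  w1a += eta * delta_a * x1;  w2a += eta * delta_a * x2
w0b += eta * delta_b * x0;  w1b += eta * delta_b * x1;  w2b += eta * delta_b * x2

print(round(oa, 2), round(ob, 2), round(oc, 2))
print(round(delta_a, 4), round(delta_b, 4), round(delta_c, 4))
print(round(w0a, 4), round(w0b, 4), round(w0c, 4), round(wac, 4), round(wbc, 4))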
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
Weight Update Rule
The Backprop weight update rule is based on the
gradient descent method: take a step in the direction
yielding the maximum decrease of the network error E.
This direction is the opposite of the gradient of E.
$w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = -\eta \frac{\partial E(W)}{\partial w_{ij}}$
Weight Update Rule
The input of a neuron j is $v_j = \sum_{i=0,\dots,m} w_{ij} x_i$.
Using the chain rule we can write:
$\frac{\partial E(W)}{\partial w_{ij}} = \frac{\partial E(W)}{\partial v_j} \frac{\partial v_j}{\partial w_{ij}}$
Moreover, if we define the local gradient of neuron j as
$\delta_j = -\frac{\partial E(W)}{\partial v_j}$
then, from
$\frac{\partial v_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_i w_{ij} x_i = x_i$
we get
$\Delta w_{ij} = \eta\, \delta_j\, x_i$
Weight update
So we have to compute the local gradient $\delta_j = -\frac{\partial E(W)}{\partial v_j}$ of the neurons.
There are two different rules for the two cases:
• j is an output neuron
• j is a hidden neuron
[Figure: the network with input layer, hidden layer, and output layer; output neurons in
green, hidden neurons in brown.]
Weight update of output neuron
If j is an output neuron then, using the chain rule, we obtain:
$\delta_j = -\frac{\partial E(W)}{\partial v_j} = -\frac{\partial E(W)}{\partial e_j}\frac{\partial e_j}{\partial o_j}\frac{\partial o_j}{\partial v_j} = -e_j\,(-1)\,\varphi'(v_j) = e_j\,\varphi'(v_j)$
because $e_j = d_j - o_j$ and $o_j = \varphi(v_j)$.
For j an output neuron, substituting $\delta_j = e_j\,\varphi'(v_j)$ in $\Delta w_{ij} = \eta\,\delta_j x_i$
we get (for the sigmoid with a = 1):
$\Delta w_{ij} = \eta\,(d_j - o_j)\,o_j(1 - o_j)\,x_i$
Weight update of hidden neuron
If j is a hidden neuron, let C be the set of neurons in the layer that follows j (for a
single hidden layer, the output layer). Then:
$\delta_j = -\frac{\partial E(W)}{\partial v_j} = -\sum_{k \in C} \frac{\partial E(W)}{\partial v_k}\frac{\partial v_k}{\partial v_j} = \sum_{k \in C} \delta_k \frac{\partial v_k}{\partial v_j}$
Using the chain rule, $\frac{\partial v_k}{\partial v_j} = \frac{\partial v_k}{\partial o_j}\frac{\partial o_j}{\partial v_j}$.
Moreover, because $v_k = \sum_{i=0,\dots,m} w_{ik}\, o_i$, we have $\frac{\partial v_k}{\partial o_j} = w_{jk}$, and $\frac{\partial o_j}{\partial v_j} = \varphi'(v_j)$.
Hence
$\delta_j = \varphi'(v_j) \sum_{k \in C} \delta_k w_{jk} = o_j(1 - o_j) \sum_{k \in C} \delta_k w_{jk}$
For j a hidden node, substituting $\delta_j$ in $\Delta w_{ij} = \eta\,\delta_j x_i$ we get:
$\Delta w_{ij} = \eta\, x_i\, o_j(1 - o_j) \sum_{k \text{ in next layer}} \delta_k w_{jk}$
Error backpropagation
The flow-graph below illustrates how the errors are backpropagated to hidden neuron j.
[Flow graph: the output errors e_1, ..., e_k, ..., e_m are each multiplied by φ'(v_k) to give
δ_1, ..., δ_k, ..., δ_m, which flow back through the weights w_j1, ..., w_jk, ..., w_jm and are
summed and multiplied by φ'(v_j) to give the local gradient δ_j of hidden neuron j.]
Summary: Delta Rule
• Delta rule: $\Delta w_{ij} = \eta\,\delta_j x_i$, where

  $\delta_j = \varphi'(v_j)(d_j - o_j)$                              if j is an output node
  $\delta_j = \varphi'(v_j) \sum_{k \text{ of next layer}} \delta_k w_{jk}$    if j is a hidden node

  and $\varphi'(v_j) = a\, o_j (1 - o_j)$ for sigmoid activation functions.
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
Network training:
Two types of network training:
• Incremental mode (on-line, stochastic, or per-observation):
  weights are updated after each instance is presented.
• Batch mode (off-line or per-epoch):
  weights are updated after all the patterns are presented.
Backprop algorithm
incremental-mode
n = 1;
initialize w(n) randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of the
          output layer: w_ij ← w_ij + Δw_ij,
          with Δw_ij computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while;

Note: choose a randomized ordering for selecting the examples in the training set,
in order to avoid poor performance.
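For concreteness, a minimal Python implementation of incremental-mode training on the XOR data from the earlier example (an illustrative sketch, not part of the original slides; the network size, learning rate, seed, and stopping threshold are assumed values):

import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# XOR training set: (inputs, target); x0 = 1 is the bias input
examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

random.seed(0)
n_hidden, eta, max_iterations = 2, 0.5, 20000
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(n_hidden)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]   # index 0 is the bias weight

for n in range(max_iterations):
    random.shuffle(examples)                     # randomized ordering of the examples
    error = 0.0
    for x, d in examples:
        # forward pass
        o_h = [sigmoid(sum(w * xi for w, xi in zip(w_hid[h], x))) for h in range(n_hidden)]
        o = sigmoid(w_out[0] + sum(w_out[h + 1] * o_h[h] for h in range(n_hidden)))
        error += 0.5 * (d - o) ** 2
        # backward pass: local gradients
        delta_o = o * (1 - o) * (d - o)
        delta_h = [o_h[h] * (1 - o_h[h]) * w_out[h + 1] * delta_o for h in range(n_hidden)]
        # update the weights in backward order (output layer first)
        w_out[0] += eta * delta_o
        for h in range(n_hidden):
            w_out[h + 1] += eta * delta_o * o_h[h]
            for i in range(3):
                w_hid[h][i] += eta * delta_h[h] * x[i]
    if error < 0.01:                             # stopping criterion
        break

print("iterations:", n + 1, "error:", round(error, 4))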
Backprop algorithm
batch mode
• In the batch-mode the weights are
updated only after all examples have
been processed, using the formula
$w_{ij} \leftarrow w_{ij} + \sum_{x \in \text{training examples}} \Delta w_{ij}^{(x)}$
• The learning process continues on an
epoch-by-epoch basis until the stopping
condition is satisfied.
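A minimal batch-mode sketch for a single sigmoid neuron learning the (linearly separable) OR function; this illustrates accumulating the per-example updates and applying them once per epoch, and is not code from the slides:

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

examples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]   # OR, x0 = bias
w = [0.0, 0.0, 0.0]
eta = 0.5

for epoch in range(2000):
    acc = [0.0, 0.0, 0.0]                           # accumulated updates for this epoch
    for x, d in examples:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        delta = o * (1 - o) * (d - o)
        for i in range(3):
            acc[i] += eta * delta * x[i]            # sum of the per-example delta-rule updates
    w = [wi + ai for wi, ai in zip(w, acc)]         # apply once, after all patterns (batch mode)

print([round(wi, 2) for wi in w])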
Stopping criteria
• Sensible stopping criteria:
– total mean squared error change:
Back-prop is considered to have converged when
the absolute rate of change in the average
squared error per epoch is sufficiently small (in the
range [0.01, 0.1]).
– generalization based criterion:
After
each epoch the NN is tested for generalization
using a different set of examples (validation set). If
the generalization performance is adequate then
stop.
Use of Available Data Set for Training
The available data set is normally split into three
sets as follows:
• Training set – used to update the weights. Patterns in this set are presented
repeatedly, in random order. The weight update equations are applied after a
certain number of patterns.
• Validation set – used to decide when to stop training, only by monitoring
the error.
• Test set – used to test the performance of the neural network. It should not
be used as part of the neural network development cycle.
Early Stopping - Good Generalization
• Running too many epochs may overtrain the network, resulting in
overfitting and poor generalization.
→ Keep a hold-out validation set and test accuracy after every epoch.
Maintain the weights of the best-performing network on the validation set,
and stop training when the validation error increases beyond this point.
[Figure: error vs. number of epochs; the training-set error keeps decreasing while the
validation-set error reaches a minimum and then rises.]
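A schematic Python sketch of the early-stopping loop (illustration only; train_one_epoch and validation_error are hypothetical stand-ins that simulate the curves in the figure rather than train a real network):

import copy

def train_one_epoch(model):
    model["epochs"] += 1                    # stand-in for one epoch of backprop training

def validation_error(model):
    n = model["epochs"]                     # stand-in curve: falls, then rises after epoch 20
    return (n - 20) ** 2 / 400.0 + 0.1

model = {"epochs": 0, "weights": []}
best_err, best_model, patience, bad_epochs = float("inf"), None, 5, 0

for epoch in range(100):
    train_one_epoch(model)
    err = validation_error(model)
    if err < best_err:                      # keep the weights of the best network so far
        best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:          # validation error kept increasing: stop
            break

print("stopped after", model["epochs"], "epochs; best validation error:", round(best_err, 3))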
Model Selection by Cross-validation
• Too few hidden units prevent the network from adequately fitting the
data and learning the concept.
• Too many hidden units leads to overfitting.
 Similar cross-validation methods can be used to
determine an appropriate number of hidden units
by using the optimal test error to select the model
with optimal number of hidden layers and nodes.
NN DESIGN
• Data representation
• Network Topology
• Network Parameters
• Training
Data Representation
• Data representation depends on the problem. In general
NNs work on continuous (real valued) attributes.
Therefore symbolic attributes are encoded into continuous
ones.
• Attributes of different types may have different ranges of
values which affect the training process. Normalization
may be used, like the following one which scales each
attribute to assume values between 0 and 1.
$x_i' = \frac{x_i - \min_i}{\max_i - \min_i}$
for each value $x_i$ of attribute i, where $\min_i$ and $\max_i$ are the minimum and
maximum values of that attribute over the training set.
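A small Python helper (an illustration, not part of the slides) applying this normalization to one attribute column; the sample values are arbitrary:

def min_max_normalize(column):
    # Scale the values of one attribute to [0, 1] using its training-set min and max
    lo, hi = min(column), max(column)
    if hi == lo:                            # constant attribute: map everything to 0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages = [22, 35, 58, 41, 19]
print([round(v, 2) for v in min_max_normalize(ages)])   # [0.08, 0.41, 1.0, 0.56, 0.0]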
Network Topology
• The number of layers and neurons depends on the specific task. In
practice this issue is solved by trial and error.
• Two types of adaptive algorithms can be
used:
– start from a large network and successively
remove some neurons and links until network
performance degrades.
– begin with a small network and introduce new
neurons until performance is satisfactory.
Network parameters
• How are the weights initialized?
• How is the learning rate chosen?
• How many hidden layers and how many
neurons?
• How many examples in the training set?
Initialization of weights
• In general, initial weights are randomly chosen, with
typical values between -1.0 and 1.0 or -0.5 and 0.5.
• If some inputs are much larger than others, random
initialization may bias the network to give much more
importance to larger inputs. In such a case, weights
can be initialized as follows:
$w_{ij} = \pm \frac{1}{2m\,|x_i|}$,  i = 1, ..., m    (for weights from the input to the first layer)

$w_{jk} = \pm \frac{1}{2n\,\left|\sum_i w_{ij} x_i\right|}$,  j = 1, ..., n    (for weights from the first to the second layer)
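A short Python sketch of the simple random scheme described above (values drawn from [-0.5, 0.5]); the layer sizes here are arbitrary example values, and the input-scaled scheme could be substituted where inputs differ widely in magnitude:

import random

def init_weights(n_inputs, n_hidden, n_outputs, scale=0.5):
    # Random initialization in [-scale, scale]; the extra weight per node is the bias
    w_hidden = [[random.uniform(-scale, scale) for _ in range(n_inputs + 1)]
                for _ in range(n_hidden)]
    w_output = [[random.uniform(-scale, scale) for _ in range(n_hidden + 1)]
                for _ in range(n_outputs)]
    return w_hidden, w_output

w_h, w_o = init_weights(n_inputs=4, n_hidden=3, n_outputs=2)
print(len(w_h), len(w_h[0]), len(w_o), len(w_o[0]))   # 3 5 2 4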
Choice of learning rate
• The right value of η depends on the application. Values between 0.1 and
0.9 have been used in many applications.
Size of Training set
• Rule of thumb:
– the number of training examples should be at
least five to ten times the number of weights of
the network.
• Other rule:
  $N \ge \frac{|W|}{1 - a}$
  where |W| = number of weights and a = expected accuracy.
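A quick worked instance of this rule (illustrative numbers only):

n_weights = 120            # |W|: total number of weights in the network
accuracy = 0.9             # a: expected accuracy
n_examples = n_weights / (1 - accuracy)
print(round(n_examples))   # 1200 -> about 10 times the number of weights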
Applications of FFNN
Classification, pattern recognition:
• FFNN can be applied to tackle non-linearly
separable learning tasks.
– Recognizing printed or handwritten characters
– Face recognition
– Classification of loan applications into credit-worthy and
non-credit-worthy groups
– Analysis of sonar radar to determine the nature of the
source of a signal
Regression and forecasting:
• FFNN can be applied to learn non-linear functions
(regression) and, in particular, functions whose input
is a sequence of measurements over time (time
series).
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition
Categorical attributes and multiclasses
• A categorical attribute is usually decomposed into
a series of (0, 1) continuous attributes
– each indicating whether a particular attribute value is present or not.
• Each class corresponds to one output node, the
desired output of the node is “1” for any instance
belonging to this class (otherwise, “0”)
– For each test instance, the final class label is
determined by the output node with the maximum
output value.
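A tiny Python illustration (not from the slides) of both ideas: decomposing a categorical attribute into 0/1 attributes, and picking the class whose output node has the maximum value; category names and output values are made up for the example:

def one_hot(value, categories):
    # Decompose a categorical attribute into a series of (0, 1) attributes
    return [1.0 if value == c else 0.0 for c in categories]

print(one_hot("red", ["red", "green", "blue"]))     # [1.0, 0.0, 0.0]

# One output node per class: the predicted label is the node with the maximum output
outputs = [0.12, 0.81, 0.35]                        # example network outputs for classes 0, 1, 2
predicted_class = max(range(len(outputs)), key=lambda k: outputs[k])
print(predicted_class)                              # 1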
A generalized delta rule
• If η is small then the algorithm learns the weights very slowly, while if η is
large then the large changes of the weights may cause unstable behavior,
with oscillations of the weight values.
• A technique for tackling this problem is the introduction of a momentum
term in the delta rule, which takes into account previous updates. We obtain
the following generalized Delta rule:
  $\Delta w_{ij}(n) = \alpha\, \Delta w_{ij}(n-1) + \eta\, \delta_j(n)\, x_i(n)$
  where α is the momentum constant, 0 ≤ α < 1.
• The momentum term accelerates the descent in steady downhill directions.
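A minimal sketch of the generalized delta rule for a single weight (plain Python; the gradient terms are made-up values meant to mimic a steady downhill direction):

eta, alpha = 0.5, 0.9                       # learning rate and momentum constant (0 <= alpha < 1)
w, prev_dw = 0.1, 0.0

gradient_terms = [0.2, 0.18, 0.21, 0.19]    # delta_j(n) * x_i(n) at successive presentations

for g in gradient_terms:
    dw = alpha * prev_dw + eta * g          # momentum term + plain delta-rule term
    w += dw
    prev_dw = dw
    print(round(dw, 4), round(w, 4))        # the step size grows while the direction is steady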
Neural Net for object recognition
from images
• Objective
– Identify interesting objects from input images
• Face recognition
– Locate faces, happy/sad faces, gender, face pose, orientation
– Recognize specific faces: authorization
• Vehicle recognition (traffic control or safe driving assistant)
– Passenger car, van, pick up, bus, truck
• Traffic sign detection
• Challenges
– Image size (100x100, 10240x10240)
– Object size, pose and object orientation
– Illumination conditions
Example: Face Detection
• Challenges
  – pose variation
  – lighting condition variation
  – facial expression variation
Normal procedures
• Training (identify your problem and build a specific model)
  – Build the training dataset
    • Isolate sample images (images containing faces)
    • Extract the regions containing the objects (regions containing faces)
    • Normalization of size and illumination (e.g., 200x200)
    • Select counter-class examples (non-face regions)
  – Determine the Neural Net
    • The input layer is determined by the input images
      – E.g., a 200x200 image requires 40,000 input dimensions, each containing a
        value between 0-255
    • Neural net architectures
      – A three-layer FF NN (two hidden layers) is a common practice
    • The output layer is determined by the learning problem
      – Bi-class classification or multi-class classification
  – Train the Neural Net
Normal procedures
• Test
– Given a test image
• Select a small region (considering all possibilities of
the object location and size)
– Scanning from the top left to the bottom right
– Sampling at different scale levels
• Feed the region into the network, determine
whether this region contains the object or not
• Repeat the above process for every candidate region
– which is a time-consuming process
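A toy Python sketch of the sliding-window scan described above (illustration only, not the CMU code); region_contains_object is a hypothetical stand-in for the trained network, and the window size, step, and image are arbitrary:

def region_contains_object(region):
    # Placeholder decision: a real system would feed the region into the trained FFNN
    return sum(sum(row) for row in region) > 0

def scan(image, win=4, step=2):
    # Slide a win x win window from the top left to the bottom right and collect hits;
    # to handle different object sizes, the same scan would be repeated at several scales
    hits = []
    for top in range(0, len(image) - win + 1, step):
        for left in range(0, len(image[0]) - win + 1, step):
            region = [row[left:left + win] for row in image[top:top + win]]
            if region_contains_object(region):
                hits.append((top, left))
    return hits

image = [[0] * 16 for _ in range(16)]
image[6][7] = 1                              # a single "object" pixel
print(scan(image))                           # windows overlapping position (6, 7)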
CMU Neural Nets for Face Pose
Recognition
Head pose (1-of-4):
90% accuracy
Face recognition (1-of-20):
90% accuracy
Neural Net Based Face Detection
•Large training set of faces and small set of non-faces
•Training set of non-faces automatically built up:
•Set of images with no faces
•Every ‘face’ detected is added to the non-face training set.
Traffic sign
detection
• Demo
– http://www.mathworks.com/products/demos/videoimage/traffic_sign/vipwarningsigns.html
• Intelligent traffic light control system
– Instead of using loop detectors (like metal detectors)
• Using surveillance video: Detecting vehicle and bicycles
Vehicle Detection
• Intelligent vehicles aim at improving the
driving safety by machine vision
techniques
http://www.mobileye.com/visionRange.shtml
Term Project (1)
• Modifying CMU face recognition source code to
train a classifier for one type of image classification
problem
– You identify your own objective (make your objective
unique)
• Gender, kid/adult recognition etc.
– Available source code (C, Unix)
– Team
• Maximum team members: 3
– Due date (April 30)
– A written report (3 page minimum)
• Your objective
• System architecture
• Experimental results
Alternative choice (2)
• Alternatively, you can propose your own
term project as well.
• Requirement
– Must relate to the neural network and
classification
– Must have a clear objective
– Must involve programming work
– Must have experimental assessment results
– Must have a written report (3 page minimum).
– Send me your proposal by April 4.
CMU NN face recognition source code
• Dr. Tom Mitchell (Machine Learning)
– http://www.cs.cmu.edu/~tom/faces.html
• What is available?
– Image dataset
  • Different classes of images: pose, expression, glasses, etc., in PGM format
– Complete C source code
  • PGM image read/write
  • 3-layer feed-forward neural network architecture
  • Backpropagation learning algorithm
  • Weight visualization
– Documentation
  • A 13-page document listing the details of the datasets and the source code
Outline
• Multi-layer Neural Networks
• Feedforward Neural Networks
– FF NN model
– Backpropagation (BP) Algorithm
– BP rules derivation
– Practical Issues of FFNN
• FFNN for Face Recognition