Artificial Neural Networks

CSCI3230
Introduction to Neural Network II
Liu Pengfei
Week 13, Winter 2015
Outline
1. Backward Propagation Algorithm
   1. Optimization Model & Gradient Descent
   2. Multilayer Networks of Sigmoid Units & Backward Propagation
2. Practical Issues
   1. Over-fitting
   2. Local Minima
Backward Propagation Algorithm
How to update weights to minimize the error between the
target and output?
Optimization Model (Error Minimization)
 Given observed data, find a network which gives the same output as the observed target variable(s).
 Given the topology of the network (number of layers, number of neurons, their connections), find a set of weights to minimize the error function:

E(W) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

where D is the set of training examples, t_d is the target and o_d is the network output for example d.
Gradient Descent
 The activation function is smooth, so E(W) is a smooth function of the weights W  we can use calculus!
 But the minimum is difficult to find analytically.
 So we use an iterative approach: gradient descent.
Gradient Descent
To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

 For a smooth function f(x), \frac{\partial f}{\partial x} is the direction in which f increases most rapidly.
 So iterate

x_{t+1} = x_t - \mu \left. \frac{\partial f}{\partial x} \right|_{x = x_t}

until x converges.
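As an illustration (my own, not from the original slides), here is a minimal gradient descent sketch in Python; the function f(x) = (x - 3)^2 and the learning rate mu are arbitrary choices for the example.

```python
# Minimal gradient descent sketch for a smooth 1-D function.
# f(x) = (x - 3)^2 is an arbitrary example; its derivative is 2(x - 3).

def df(x):
    return 2.0 * (x - 3.0)      # gradient of f at x

x = 0.0      # initial guess x_0
mu = 0.1     # learning rate (step size)

for t in range(1000):
    step = mu * df(x)
    x = x - step                # move against the gradient
    if abs(step) < 1e-8:        # stop once x has converged
        break

print(x)     # approaches the minimum at x = 3
```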
Gradient interpretation
[Figure: the gradient as the direction of steepest ascent on the error surface]
Gradient Descent
 Gradient: \nabla E(W) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n} \right]
 Training rule: \Delta W = -\mu \nabla E(W), i.e. update each weight by \Delta w_i = -\mu \frac{\partial E}{\partial w_i}
 Partial derivatives are the key
Gradient Descent: Example
[Figure: worked example of gradient descent steps]
Gradient Descent
Derivation (how to compute \frac{\partial E}{\partial w_i})
We know: E(W) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
Sigmoid Units
Chain Rule
 Input: x
 Net input: net = \sum_i w_i x_i
 Sigmoid output: sig = \frac{1}{1 + e^{-net}}

\frac{\partial Err}{\partial w_i} = \frac{\partial Err}{\partial sig} \cdot \frac{\partial sig}{\partial net} \cdot \frac{\partial net}{\partial w_i}
Factor 1: \frac{\partial Err}{\partial sig}

Err = \frac{1}{2} (t - sig)^2

\frac{\partial Err}{\partial sig} = -(t - sig)
Factor 2: \frac{\partial sig}{\partial net}

sig = \frac{1}{1 + e^{-net}}

\frac{\partial sig}{\partial net} = \frac{e^{-net}}{(1 + e^{-net})^2} = sig \, (1 - sig)
Factor 3: \frac{\partial net}{\partial w_i}

net = \sum_i w_i x_i

\frac{\partial net}{\partial w_i} = x_i
Sigmoid Units
Combining the three factors:

\frac{\partial Err}{\partial w_i} = -(t - sig) \, sig \, (1 - sig) \, x_i

After knowing how to update the weights in a single sigmoid unit, we can extend the rule to multilayer networks.
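A minimal sketch (not from the slides) of one gradient step for a single sigmoid unit, directly following the three chain-rule factors above; the inputs, weights, target and learning rate are arbitrary example values.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Arbitrary example data: inputs x, initial weights w, target t.
x = [0.5, -1.0, 0.25]
w = [0.1, 0.2, -0.3]
t = 1.0
mu = 0.5  # learning rate

net = sum(wi * xi for wi, xi in zip(w, x))   # net input
sig = sigmoid(net)                           # sigmoid output

# Chain rule: dErr/dw_i = -(t - sig) * sig * (1 - sig) * x_i
for i in range(len(w)):
    dErr_dwi = -(t - sig) * sig * (1.0 - sig) * x[i]
    w[i] = w[i] - mu * dErr_dwi              # gradient descent step

print(w)
```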
Multilayer Networks of Sigmoid Units
 For an output neuron: same as a single sigmoid neuron,

Err = \frac{1}{2} (t - sig)^2

 For a hidden neuron j: the error is the weighted sum of the errors of the connected output neurons,

Err_j = \sum_{i \; connected \; with \; j} w_{ij} \, Err_i

[Figure: hidden neuron j  output of hidden  net of output  error of output]
Multilayer Networks of Sigmoid Units
Back-propagation Algorithm
 BP learns the weights for a multilayer network, given a network with a fixed set of units and interconnections.
 BP employs gradient descent in an attempt to minimize the squared error between the network output values and the target values for these outputs.
 Two-stage learning:
    Forward stage: calculate outputs given input pattern x.
    Backward stage: update weights by calculating deltas.
Multilayer Networks of Sigmoid Units
Back-propagation Algorithm
 Error function for BP: E is defined as the sum of the squared errors over all the output units k for all the training examples d,

E(W) = \frac{1}{2} \sum_{d \in D} \sum_{k \in outputs} (t_{kd} - o_{kd})^2

 The error surface can have multiple local minima:
    Convergence toward some local minimum is guaranteed.
    There is no guarantee of reaching the global minimum.
Multilayer Networks of Sigmoid Units
Back-propagation Algorithm
Process:
[Figure: back-propagation pseudocode  forward pass, then backward pass with weight updates]
Multilayer Networks of Sigmoid Units
Derivation of the BP Rule
[Figure: unit j receiving inputs x_i through weights w_{ij}]
Multilayer Networks of Sigmoid Units
Derivation of the BP Rule
Case 1: Rule for Output Unit Weights
Following the single-unit derivation, for an output unit j with output o_j and target t_j:

\delta_j = o_j (1 - o_j)(t_j - o_j), \qquad \Delta w_{ij} = \mu \, \delta_j \, x_i
Multilayer Networks of Sigmoid Units
Derivation of the BP Rule
Case 2: Rule for Hidden Unit Weights (multivariate chain rule)
For a hidden unit j whose output feeds downstream units k:

\delta_j = o_j (1 - o_j) \sum_k w_{jk} \, \delta_k, \qquad \Delta w_{ij} = \mu \, \delta_j \, x_i
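A short sketch (my own illustration, not the slides' code) of the two delta rules in Python; the values of t, o_out, o_hid and w_out are arbitrary examples for one output unit fed by one hidden unit.

```python
# Illustrative values for one output unit and one hidden unit.
t = 1.0          # target for the output unit
o_out = 0.7      # sigmoid output of the output unit
o_hid = 0.4      # sigmoid output of the hidden unit
w_out = 0.9      # weight from the hidden unit to the output unit

# Case 1: delta for an output unit
delta_out = o_out * (1.0 - o_out) * (t - o_out)

# Case 2: delta for a hidden unit (weighted sum of downstream deltas)
delta_hid = o_hid * (1.0 - o_hid) * (w_out * delta_out)

print(delta_out, delta_hid)
```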
Multilayer Networks of Sigmoid Units
Back-propagation Algorithm
Process (revisited):
[Figure: back-propagation pseudocode with the delta rules of Cases 1 and 2 filled in]
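Putting the pieces together, here is a minimal sketch (not the slides' own code) of one back-propagation training step for a network with one hidden layer; the 2-2-1 architecture, toy data and learning rate are arbitrary choices.

```python
import math, random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Arbitrary tiny network: 2 inputs -> 2 hidden units -> 1 output unit.
random.seed(0)
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(2)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(2)]
mu = 0.5  # learning rate

def train_step(x, t):
    # Forward stage: calculate outputs given input pattern x.
    o_hid = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hid]
    o_out = sigmoid(sum(w * o for w, o in zip(w_out, o_hid)))

    # Backward stage: compute deltas, then update weights.
    delta_out = o_out * (1 - o_out) * (t - o_out)              # Case 1
    delta_hid = [o * (1 - o) * (w_out[j] * delta_out)          # Case 2
                 for j, o in enumerate(o_hid)]

    for j in range(2):
        w_out[j] += mu * delta_out * o_hid[j]
        for i in range(2):
            w_hid[j][i] += mu * delta_hid[j] * x[i]
    return 0.5 * (t - o_out) ** 2                              # squared error

# Example: one pass over a toy training set.
for x, t in [([0.0, 1.0], 1.0), ([1.0, 0.0], 0.0)]:
    print(train_step(x, t))
```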
Practical Issues
Over-fitting and Local Minima
Outline
1. Over-fitting
   1. Splitting
   2. Early stopping
   3. Cross-validation
2. Local Minima
   1. Randomize initial weights & train multiple times
   2. Tune the learning rate appropriately
Over-fitting
Why not choose a method with the best fit to the data?
Preventing Over-fitting
 Goal: to train a neural network having the BEST generalization power
1. When to stop the algorithm?
2. How to evaluate the neural network fairly?
3. How to select a neural network with the best generalization power?
Over-fitting
 Motivation of learning
    Build a model to learn the underlying knowledge
    Apply the model to predict the unseen data
    Generalization
 What is over-fitting?
    Perform well on the training data but poorly on the unseen data
    Memorize the training data
    Specialization
[Figure: a normal fit vs. an over-fitted fit]
Splitting Method (Cross-validation)
 Splitting method (a code sketch follows this list):
1. Random shuffle
2. Divide data into two sets
    Training set: 70% of data
    Validation set: 30% of data
3. Train (update weights) the network using the training set
4. Evaluate (do not update weights) the network using the validation set
5. If a stopping criterion is met, quit; else repeat steps 3-4
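A minimal sketch of the splitting method (my own illustration, not the slides' code); train_epoch and evaluate are hypothetical helpers standing in for the network operations described in steps 3-4.

```python
import random

def split_data(data, train_frac=0.7):
    """Randomly shuffle the data and split it into training and validation sets."""
    data = list(data)
    random.shuffle(data)                      # step 1: random shuffle
    cut = int(len(data) * train_frac)         # step 2: 70% / 30% split
    return data[:cut], data[cut:]

# Hypothetical placeholders for steps 3-4:
# train_epoch(net, training_set) updates the weights;
# evaluate(net, validation_set) returns the error without updating weights.
def training_loop(net, data, train_epoch, evaluate, max_epochs=100):
    training_set, validation_set = split_data(data)
    for epoch in range(max_epochs):           # step 5: repeat until stopping
        train_epoch(net, training_set)        # step 3: update weights
        val_error = evaluate(net, validation_set)  # step 4: no updates
        if val_error < 0.01:                  # example stopping criterion
            break
    return net
```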
Splitting Method (n-fold validation)
 However, there is a problem: as only part of the data is used as validation data, there may be errors in the training methodology which coincidentally are not reflected by the validation data.
 Analogy: a child knows only addition but not subtraction. However, since no question related to subtraction was asked in the maths quiz, he still got 100% marks.
Early Stopping
 Any large enough neural network may lead to over-fitting if over-trained.
 If we know when to stop training, we can prevent over-fitting: early stopping.
[Figure: training and validation error curves with the early stopping point marked]
Criteria for Early Stopping
1. Validation error increases
2. Weights do not update much
3. The training epoch count exceeds a maximum number of iterations defined by you
One epoch means all the data (including both training and validation) has been used once. A code sketch of an early-stopping loop follows.
[Figure: early stopping on the validation error curve]
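A sketch of an early-stopping training loop (illustrative; train_epoch and evaluate are the same assumed helpers as before), combining criteria 1 and 3 above.

```python
def train_with_early_stopping(net, training_set, validation_set,
                              train_epoch, evaluate, max_epochs=1000):
    best_error = float("inf")
    patience, bad_epochs = 5, 0
    for epoch in range(max_epochs):          # criterion 3: epoch limit
        train_epoch(net, training_set)
        val_error = evaluate(net, validation_set)
        if val_error < best_error:
            best_error, bad_epochs = val_error, 0
        else:
            bad_epochs += 1                  # criterion 1: validation error rises
            if bad_epochs >= patience:
                break                        # stop before over-fitting sets in
    return net
```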
Cross-validation
 The data set is randomly mixed, then divided into k folds.
 1st learning: fold 1 is the test set, the remaining folds form the learning set  Error 1
 2nd learning: fold 2 is the test set, the remaining folds form the learning set  Error 2
 
 Mean error = low-biased error estimator
Cross-validation
 Cross-validation: 10-fold
 Divide the input data into 10 parts: p1, p2, , p10
 for i = 1 to 10 do
    Use the parts other than pi for training
    Use pi for validation and get performance_i
 end for
 Take the average of performance_1 to performance_10 (see the sketch below)
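A minimal 10-fold cross-validation sketch in Python (my own illustration; train and evaluate are assumed helpers that fit a model and score it).

```python
def cross_validate(data, train, evaluate, k=10):
    """Split data into k parts; train on k-1 parts, validate on the held-out part."""
    n = len(data)
    folds = [data[i * n // k:(i + 1) * n // k] for i in range(k)]
    performances = []
    for i in range(k):
        validation = folds[i]                             # part p_i
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)                           # train on the other parts
        performances.append(evaluate(model, validation))  # performance_i
    return sum(performances) / k                          # average performance
```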
Example results:

            Fold 1   Fold 2      Fold 10   Mean
Accuracy    0.99     0.97          0.87      0.90
Precision   0.90     0.92          0.95      0.92
Recall      0.85     0.84          0.90      0.91
F-measure   0.86     0.84          0.86      0.87
How to choose the best model?
 Keep a part of the training data as testing data and evaluate your model using the secretly kept dataset.

                   F-measure on the testing data
Fold 1's Model     0.90
Fold 2's Model     0.92
                  0.91
Fold 10's Model    0.99

Best Model: Fold 10's model
[Figure: data split into Training / Validation / Testing, with cross-validation over the training and validation parts]
Local Minima

w_{i,j} \leftarrow w_{i,j} - \mu \frac{\partial E}{\partial w_{i,j}}

 Gradient descent will not always guarantee a global minimum.
 We may get stuck in a local minimum!
 Can you think of some ways to solve the problem?
[Figure: error surface with a global minimum and a local minimum]
Local Minima

w_{i,j} \leftarrow w_{i,j} - \mu \frac{\partial E}{\partial w_{i,j}}

1) Randomize initial weights & train multiple times (see the sketch below)
2) Tune the learning rate appropriately
3) Anymore?
[Figure: error surface with a global minimum and a local minimum]
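A sketch of remedy 1), random restarts (illustrative; make_net, train and evaluate are assumed helpers): train several networks from different random initial weights and keep the one with the lowest error.

```python
def train_with_restarts(make_net, train, evaluate, data, n_restarts=10):
    """Remedy 1): random restarts to escape poor local minima."""
    best_net, best_error = None, float("inf")
    for _ in range(n_restarts):
        net = make_net()                 # fresh network, random initial weights
        net = train(net, data)           # gradient descent from this start
        error = evaluate(net, data)
        if error < best_error:           # keep the best of all runs
            best_net, best_error = net, error
    return best_net
```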
References
 Backward Propagation Tutorial
  http://bi.snu.ac.kr/Courses/g-ai01/Chapter5-NN.pdf
 CS-449: Neural Networks
  http://www.willamette.edu/~gorr/classes/cs449/intro.html
 Cross-validation for detecting and preventing over-fitting
  http://www.autonlab.org/tutorials/overfit10.pdf