An informal account of BackProp

For each pattern in the training set:
• Compute the error at the output nodes
• Compute Δw for each weight in the 2nd layer
• Compute δ (the generalized error expression) for the hidden units
• Compute Δw for each weight in the 1st layer
After accumulating Δw for all the weights, change each weight a little bit, as determined by the learning rate:

Δw_ij = η · δ_i^p · o_j^p
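As a small runnable illustration of this update rule (NumPy; the η, δ and o values are made up for the example, not taken from the notes):

```python
import numpy as np

# Delta-rule weight change for one pattern p: Δw_ij = η · δ_i · o_j
eta = 0.5
delta = np.array([0.032, -0.010])   # δ_i for two receiving units (illustrative)
o = np.array([1.0, 0.5, 0.2])       # o_j activations of three sending units
dW = eta * np.outer(delta, o)       # one Δw per weight, accumulated before the update
print(dW)                           # 2x3 matrix of small weight changes
```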
Backprop Details
Here we go…
Also refer to web notes for derivation
The output layer

[Diagram: unit k feeds hidden unit j via weight w_jk; hidden unit j feeds output unit i via weight w_ij; the output y_i is compared with the target t_i]

E = Error = ½ ∑_i (t_i − y_i)²

ΔW_ij = −η ∂E/∂W_ij        (η: learning rate)
W_ij ← W_ij − η ∂E/∂W_ij

∂E/∂W_ij = (∂E/∂y_i) · (∂y_i/∂x_i) · (∂x_i/∂W_ij) = −(t_i − y_i) · f′(x_i) · y_j

The derivative of the sigmoid is just f′(x_i) = y_i(1 − y_i), so

ΔW_ij = η · (t_i − y_i) · y_i(1 − y_i) · y_j
ΔW_ij = η · y_j · δ_i,   where δ_i = (t_i − y_i) · y_i(1 − y_i)
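A quick numeric sanity check with made-up values (η = 0.1 is mine, not from the notes):

```python
# Output-layer delta and weight change, using the formulas just derived.
t_i, y_i, y_j, eta = 1.0, 0.8, 0.5, 0.1
delta_i = (t_i - y_i) * y_i * (1 - y_i)   # 0.2 * 0.8 * 0.2 = 0.032
dW_ij = eta * y_j * delta_i               # 0.1 * 0.5 * 0.032 = 0.0016
```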
The hidden layer

[Same diagram: unit k feeds hidden unit j via weight w_jk; hidden unit j feeds output unit i via weight w_ij; outputs y_i with targets t_i]

E = Error = ½ ∑_i (t_i − y_i)²

ΔW_jk = −η ∂E/∂W_jk

∂E/∂W_jk = (∂E/∂y_j) · (∂y_j/∂x_j) · (∂x_j/∂W_jk)

∂E/∂y_j = ∑_i (∂E/∂y_i) · (∂y_i/∂x_i) · (∂x_i/∂y_j) = −∑_i (t_i − y_i) · f′(x_i) · W_ij

∂E/∂W_jk = −[∑_i (t_i − y_i) · f′(x_i) · W_ij] · f′(x_j) · y_k

ΔW_jk = η · [∑_i (t_i − y_i) · y_i(1 − y_i) · W_ij] · y_j(1 − y_j) · y_k
ΔW_jk = η · y_k · δ_j,   where

δ_j = [∑_i (t_i − y_i) · y_i(1 − y_i) · W_ij] · y_j(1 − y_j)
δ_j = [∑_i W_ij · δ_i] · y_j(1 − y_j)
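Putting the two delta rules together: a minimal NumPy sketch for one training pattern (my own variable names and toy layer sizes; the slides themselves give no code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(y_k, t, W_jk, W_ij, eta=0.5):
    """One forward/backward pass for a single pattern, sigmoid units, squared error."""
    # forward pass
    y_j = sigmoid(W_jk @ y_k)                 # hidden activations
    y_i = sigmoid(W_ij @ y_j)                 # output activations

    # output layer: δ_i = (t_i - y_i) · y_i · (1 - y_i)
    delta_i = (t - y_i) * y_i * (1.0 - y_i)
    dW_ij = eta * np.outer(delta_i, y_j)      # ΔW_ij = η · y_j · δ_i

    # hidden layer: δ_j = (Σ_i W_ij · δ_i) · y_j · (1 - y_j)
    delta_j = (W_ij.T @ delta_i) * y_j * (1.0 - y_j)
    dW_jk = eta * np.outer(delta_j, y_k)      # ΔW_jk = η · y_k · δ_j

    return W_ij + dW_ij, W_jk + dW_jk

# toy usage: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(0)
W_jk = rng.normal(scale=0.5, size=(3, 2))     # 1st-layer weights
W_ij = rng.normal(scale=0.5, size=(1, 3))     # 2nd-layer weights
W_ij, W_jk = backprop_step(np.array([1.0, 0.0]), np.array([1.0]), W_jk, W_ij)
```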

Momentum term

The speed of learning is governed by the learning rate.
• If the rate is low, convergence is slow.
• If the rate is too high, the error oscillates without reaching the minimum.

Adding a momentum term to the weight update helps:

Δw_ij(n) = α · Δw_ij(n − 1) + η · δ_i(n) · y_j(n),   0 ≤ α < 1

Momentum tends to smooth small weight-error fluctuations:
• it accelerates the descent in steady downhill directions;
• it has a stabilizing effect in directions that oscillate in time.
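A short sketch of the momentum update (the α and η values are illustrative, not from the notes):

```python
import numpy as np

alpha, eta = 0.9, 0.1                        # 0 <= alpha < 1

def momentum_step(prev_dW, delta, y):
    """Δw(n) = α · Δw(n-1) + η · δ_i(n) · y_j(n)."""
    return alpha * prev_dW + eta * np.outer(delta, y)

prev_dW = np.zeros((2, 3))                   # Δw(n-1), zero at the start
dW = momentum_step(prev_dW, np.array([0.03, -0.01]), np.array([1.0, 0.5, 0.2]))
```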
Convergence

• May get stuck in local minima
• Weights may diverge
…but works well in practice

Representation power:
• 2-layer networks: any continuous function
• 3-layer networks: any function
Local Minimum

To escape a local minimum, use a random component (e.g. simulated annealing).
Overfitting and generalization

Too many hidden nodes tend to overfit.
Overfitting in ANNs
Early Stopping (Important!!!)

Stop training when the error goes up on a validation set.

Stopping criteria

Sensible stopping criteria:
• Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
• Generalization-based criterion: after each epoch the network is tested for generalization. If the generalization performance is adequate, stop. If this criterion is used, the part of the training set held out for testing generalization is not used for updating the weights.
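A sketch combining the two criteria above (the threshold values, `patience`, and the `train_one_epoch` / `evaluate` callables are my own placeholders, not part of the notes):

```python
def train_with_stopping(train_one_epoch, evaluate, max_epochs=1000,
                        mse_change_tol=0.01, patience=5):
    """train_one_epoch() updates the weights on the training set and returns its MSE;
    evaluate() returns the error on the held-out validation set (never used for updates)."""
    prev_mse, best_val, bad_epochs = float("inf"), float("inf"), 0
    for epoch in range(max_epochs):
        mse = train_one_epoch()
        # 1) total mean-squared-error change per epoch is sufficiently small
        if abs(prev_mse - mse) < mse_change_tol:
            break
        prev_mse = mse
        # 2) generalization-based criterion: stop when validation error stops improving
        val_err = evaluate()
        if val_err < best_val:
            best_val, bad_epochs = val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_val
```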
Architectural Considerations

What is the right size network for a given job? How many hidden units?
• Too many: no generalization
• Too few: no solution

Possible answer: constructive algorithms, e.g. Cascade Correlation (Fahlman & Lebiere, 1990).
Network Topology

The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error.

Two types of adaptive algorithms can be used:
• pruning: start from a large network and successively remove nodes and links until performance degrades;
• constructive: begin with a small network and introduce new neurons until performance is satisfactory (see the sketch below).
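A minimal sketch of the constructive strategy; `build_net`, `train`, and `evaluate` are assumed caller-supplied helpers, and the grow-by-one-unit schedule is my own simplification:

```python
def grow_network(build_net, train, evaluate, target_error, max_hidden=50):
    """Begin with a small network and add hidden neurons until performance is satisfactory."""
    net = None
    for n_hidden in range(1, max_hidden + 1):
        net = build_net(n_hidden)            # network with n_hidden hidden units
        train(net)
        if evaluate(net) <= target_error:    # performance is satisfactory: stop growing
            break
    return net
```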
Problems and Networks

• Some problems have natural "good" solutions.
• Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed.
• Networks are general-purpose tools.
• Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem.
• Tension: tailoring tools for a specific job vs. exploiting a general-purpose learning mechanism.
Summary

• Multiple-layer feed-forward networks
• Replace the step function with the sigmoid (differentiable) function
• Learn weights by gradient descent on the error function
• Backpropagation algorithm for learning
• Avoid overfitting by early stopping
ALVINN drives 70mph on highways
Use MLP Neural Networks when …

• (vector-valued) real inputs, (vector-valued) real outputs
• You're not interested in understanding how it works
• Long training times are acceptable
• Short execution (prediction) times are required
• You need robustness to noise in the dataset
Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.
• Recognizing printed or handwritten characters
• Face recognition
• Classification of loan applications into credit-worthy and non-credit-worthy groups
• Analysis of sonar/radar signals to determine the nature of the source

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).
Extensions of Backprop Nets

Recurrent Architectures
• Backprop through time
• Elman Nets & Jordan Nets

[Diagrams: an Elman net (context layer copies the hidden layer) and a Jordan net (context layer copies the output layer, with a decay weight α)]

Updating the context as we receive input
• In Jordan nets we model "forgetting" as well
• The recurrent connections have fixed weights
• You can train these networks using good ol' backprop
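A sketch of the two context updates (my own notation; `alpha` stands for the fixed recurrent weight that implements "forgetting" in the Jordan net):

```python
import numpy as np

def elman_context(prev_hidden):
    """Elman net: the context layer is a copy of the previous hidden activations."""
    return np.copy(prev_hidden)

def jordan_context(prev_context, prev_output, alpha=0.5):
    """Jordan net: the context holds a decaying trace of the previous output."""
    return alpha * prev_context + prev_output   # fixed-weight recurrent connections
```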
Recurrent Backprop

[Diagram: a three-node recurrent network (units a, b, c with weights w1–w4) unrolled for 3 iterations into a feed-forward network containing copies of the same weights]

• we'll pretend to step through the network one iteration at a time
• backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent)
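One way to read "average equivalent weights": after backprop through the unrolled network, each copy of a shared weight has its own gradient, and the tied weight is updated with their average (a sketch with made-up numbers):

```python
import numpy as np

def update_tied_weight(w, grads_per_copy, eta=0.1):
    """Update one shared weight from the gradients of its unrolled copies."""
    return w - eta * np.mean(grads_per_copy)

# e.g. the three equivalent copies of w1 in the 3-iteration unrolling:
w1 = update_tied_weight(0.25, [0.10, 0.04, 0.07])   # 0.25 - 0.1*0.07 = 0.243
```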
Connectionist Models in Cognitive Science

[Diagram: a space of connectionist models spanning Structured vs. PDP (Elman) vs. Hybrid, Neural vs. Conceptual, and Existence vs. Data Fitting]

5 levels of Neural Theory of Language

[Diagram of the five levels, from most abstract to most concrete (the vertical axis is labeled "abstraction"):
• Cognition and Language: spatial relations, motor control, metaphor, grammar (psycholinguistic experiments)
• Computation
• Structured Connectionism: triangle nodes, SHRUTI, neural nets and learning
• Computational Neurobiology
• Biology: neural development
The diagram also marks Quiz, Midterm, and Finals against these levels.]
The Color Story: A Bridge between Levels of NTL
(http://www.ritsumei.ac.jp/~akitaoka/color-e.html)
A Tour of the Visual System
• two regions of interest:
– retina
– LGN
The Physics of Light

Light: electromagnetic energy whose wavelength is between 400 nm and 700 nm
(1 nm = 10⁻⁹ meter).
© Stephen E. Palmer, 2002
The Physics of Light

Some examples of the spectra of light sources:
[Figure: photon count vs. wavelength (400–700 nm) for A. Ruby Laser, B. Gallium Phosphide Crystal, C. Tungsten Lightbulb, D. Normal Daylight]
© Stephen E. Palmer, 2002
The Physics of Light

Some examples of the reflectance spectra of surfaces:
[Figure: % photons reflected vs. wavelength (400–700 nm) for Red, Yellow, Blue, and Purple surfaces]
© Stephen E. Palmer, 2002
The Psychophysical Correspondence

There is no simple functional description for the perceived color of all lights under all viewing conditions, but…

A helpful constraint: consider only physical spectra with normal distributions.
[Figure: a normal-distribution spectrum (# photons vs. wavelength, 400–700 nm), characterized by its mean, variance, and area]
© Stephen E. Palmer, 2002
Physiology of Color Vision

Two types of light-sensitive receptors:
• Cones: cone-shaped, less sensitive, operate in high light, color vision
• Rods: rod-shaped, highly sensitive, operate at night, gray-scale vision
© Stephen E. Palmer, 2002
The Microscopic View
Rods and Cones in the Retina
http://www.iit.edu/~npr/DrJennifer/visual/retina.html
What Rods and Cones Detect
Notice how they aren't distributed evenly, and rods are more sensitive to shorter wavelengths.
Physiology of Color Vision

Three kinds of cones: absorption spectra
[Figure: relative absorbance (%) vs. wavelength (400–650 nm) for the three cone types, peaking at roughly 440 nm (S), 530 nm (M), and 560 nm (L)]

Opponent processes (implementation of trichromatic theory):
R/G = L − M
G/R = M − L
B/Y = S − (M + L)
Y/B = (M + L) − S
© Stephen E. Palmer, 2002
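A toy illustration of the opponent combinations listed above, applied to normalized cone responses (the numbers are mine, not data):

```python
def opponent_channels(S, M, L):
    """Opponent-process combinations of the three cone signals."""
    return {"R/G": L - M, "G/R": M - L,
            "B/Y": S - (M + L), "Y/B": (M + L) - S}

# a long-wavelength (reddish) light: R/G comes out positive, B/Y strongly negative
print(opponent_channels(S=0.1, M=0.6, L=0.8))
```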
Physiology of Color Vision

Double Opponent Cells in V1
[Figure: receptive fields of double-opponent cells: Red/Green cells pair R+G− and G+R− subregions; Blue/Yellow cells pair B+Y− and Y+B− subregions]
© Stephen E. Palmer, 2002
Color Blindness
Not everybody perceives colors in the same way!
What numbers do you see in these displays?
© Stephen E. Palmer, 2002
Theories of Color Vision

A Dual Process Wiring Diagram
[Wiring diagram: the trichromatic stage (S, M, L cone signals) feeds the opponent-process stage through excitatory (+) and inhibitory (−) connections, producing the channels R+G− (L − M), G+R− (M − L), Y+B− (M + L − S), B+Y− (S − M − L), W+Bk− (S + M + L), and Bk+W− (−S − M − L)]
© Stephen E. Palmer, 2002
Color Naming
Basic Color Terms (Berlin & Kay)
Criteria:
1. Single words -- not “light-blue” or “blue-green”
2. Frequently used -- not “mauve” or “cyan”
3. Refer primarily to colors -- not “lime” or “gold”
4. Apply to any object -- not “roan” or “blond”
© Stephen E. Palmer, 2002
Color Naming
BCTs in English
Red
Green
Blue
Yellow
Black
White
Gray
Brown
Purple
Orange*
Pink
© Stephen E. Palmer, 2002
Color Naming
Five more BCTs in a study of 98 languages
Light-Blue
Warm
Cool
Light-Warm
Dark-Cool
© Stephen E. Palmer, 2002
The WCS Color Chips
• Basic color terms:
– Single word (not blue-green)
– Frequently used (not mauve)
– Refers primarily to colors (not lime)
– Applies to any object (not blonde)
FYI: English has 11 basic color terms
Results of Kay's Color Study

Stage I:    W-or-R-or-Y, Bk-or-G-or-Bu
Stage II:   W, R-or-Y, Bk-or-G-or-Bu
Stage IIIa: W, R-or-Y, G-or-Bu, Bk
Stage IIIb: W, R, Y, Bk-or-G-or-Bu
Stage IV:   W, R, Y, G-or-Bu, Bk
Stage V:    W, R, Y, G, Bu, Bk
Stage VI:   W, R, Y, G, Bu, Bk, Y+Bk (Brown)
Stage VII:  W, R, Y, G, Bu, Bk, Y+Bk (Brown), R+W (Pink), R+Bu (Purple), R+Y (Orange), B+W (Grey)

If you group languages by the number of basic color terms they have, then as the number of color terms increases, the additional terms specify focal colors.
Color Naming

Typical "developmental" sequence of BCTs
[Diagram: composite terms split into focal terms as the number of BCTs grows:
• 2 terms: Light-warm, Dark-cool
• 3 terms: White, Warm, Dark-cool
• 4 terms: White, Warm, Black, Cool
• 5 terms: White, Red, Yellow, Black, Cool
• 6 terms: White, Red, Yellow, Black, Green, Blue]
© Stephen E. Palmer, 2002
Color Naming
(Berlin & Kay)
Studied color categories in two ways
Boundaries
Best examples
© Stephen E. Palmer, 2002
Color Naming
(Rosch)

MEMORY: focal colors are remembered better than nonfocal colors.
LEARNING: new color categories centered on focal colors are learned faster.
CATEGORIZATION: focal colors are categorized more quickly than nonfocal colors.
© Stephen E. Palmer, 2002
Color Naming

A fuzzy logical model of color naming: FUZZY SETS AND FUZZY LOGIC (Kay & McDaniel)

Fuzzy set theory (Zadeh)
[Figure: degree of membership in "Green" as a function of hue, graded from 0 (not-at-all) through "a little bit", "sorta", and "very" up to 1.0 (extremely)]
© Stephen E. Palmer, 2002
Color Naming

"Primary" color categories
[Figure: degree-of-membership curves over hue for Blue, Green, Yellow, and Red, each peaking at its focal color (focal blue, focal green, focal yellow, focal red)]
© Stephen E. Palmer, 2002
Color Naming
“Primary” color categories
Red
Green
Blue
Yellow
Black
White
© Stephen E. Palmer, 2002
Color Naming

"Derived" color categories
[Figure: the membership curves for Red and Yellow are combined by the fuzzy logical "ANDf" to give the membership curve for Orange]
© Stephen E. Palmer, 2002
Color Naming
“Derived” color categories
Orange = Red ANDf Yellow
Purple = Red ANDf Blue
Gray = Black ANDf White
Pink = Red ANDf White
Brown = Yellow ANDf Black
(Goluboi = Blue ANDf White)
© Stephen E. Palmer, 2002
Color Naming

"Composite" color categories (fuzzy logical "ORf"):
Warm = Red ORf Yellow
Cool = Blue ORf Green
Light-warm = White ORf Warm
Dark-cool = Black ORf Cool
© Stephen E. Palmer, 2002
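A sketch of the model in code, under the common assumption that ANDf and ORf are pointwise min and max (Zadeh's operators); the triangular membership curves and hue values are purely illustrative, not Kay & McDaniel's actual membership functions:

```python
def tri(hue, center, width):
    """Triangular degree-of-membership function on a 0-360 hue circle."""
    d = min(abs(hue - center), 360 - abs(hue - center))
    return max(0.0, 1.0 - d / width)

def red(h):    return tri(h, 0, 60)            # primary categories (fuzzy sets)
def yellow(h): return tri(h, 60, 60)

def orange(h): return min(red(h), yellow(h))   # derived:   Red ANDf Yellow
def warm(h):   return max(red(h), yellow(h))   # composite: Red ORf Yellow

print(orange(30), warm(30))   # at hue 30, Red = Yellow = 0.5, so both give 0.5
```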
Color Naming

FUZZY LOGICAL MODEL OF COLOR NAMING (Kay & McDaniel)

Only 16 basic color terms in hundreds of languages:
• PRIMARY (fuzzy sets): Red, Green, Blue, Yellow, Black, White
• DERIVED (fuzzy ANDf): Orange, Purple, Brown, Pink, Gray, [Light-blue]; e.g. Orange = Yellow ANDf Red
• COMPOSITE (fuzzy ORf): [Warm], [Cool], [Light-warm], [Dark-cool]; e.g. Warm = Yellow ORf Red
© Stephen E. Palmer, 2002