Lecture 13
Artificial Neural Networks
11/2/06 17:29
The Cognitive Inversion
• Computers can do some things very well that are difficult for people
– e.g., arithmetic calculations
– playing chess & other board games
– doing proofs in formal logic & mathematics
– handling large amounts of data precisely
• But computers are very bad at some things that are easy for people (and even some animals)
– e.g., face recognition & general object recognition
– autonomous locomotion
– sensory-motor coordination
• Conclusion: brains work very differently from Von Neumann computers
Two Different Approaches to Computing
• Von Neumann: Narrow but Deep
• Neural Computation: Shallow but Wide
The 100-Step Rule
• Typical recognition tasks take less than one second
• Neurons take several milliseconds to fire
• Therefore there can be at most about 100 sequential processing steps
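The arithmetic behind the 100-step rule can be written out as a rough bound. The specific values below (a recognition time of about 0.5 s and a firing time of about 5 ms) are illustrative assumptions chosen to fall within "less than one second" and "several milliseconds"; they are not figures given on the slide.

```latex
\text{sequential steps} \;\lesssim\; \frac{t_{\text{task}}}{t_{\text{neuron}}}
\;\approx\; \frac{0.5\ \text{s}}{5\ \text{ms}} \;=\; 100
```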
Neurons are Not Logic Gates
• Speed
– electronic logic gates are very fast (nanoseconds)
– neurons are comparatively slow (milliseconds)
• Precision
– logic gates are highly reliable digital devices
– neurons are imprecise analog devices
• Connections
– logic gates have few inputs (usually 1 to 3)
– many neurons have >100 000 inputs
How Wide?
• Retina: 100 million receptors
• Optic nerve: one million nerve fibers
Typical Artificial Neuron
(figure: inputs, connection weights, summation, threshold, activation function, output)
Operation of Artificial Neuron
(figure: a neuron with connection weights 2, –1, 4 and threshold 2)
• Inputs 0, 1, 1: 0×2 = 0, 1×(–1) = –1, 1×4 = 4; net input Σ = 3; 3 > 2, so the output is 1
• Inputs 1, 1, 0: 1×2 = 2, 1×(–1) = –1, 0×4 = 0; net input Σ = 1; 1 < 2, so the output is 0
Feedforward Network
(figure: input layer, hidden layers, output layer)
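The operation of the artificial neuron described above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `neuron_output` is hypothetical, and treating a net input exactly equal to the threshold as "no output" is an assumption, since the slides only show strict comparisons.

```python
def neuron_output(inputs, weights, threshold):
    """Threshold unit: output 1 when the weighted sum exceeds the threshold."""
    # Net input is the sum of each input times its connection weight.
    net_input = sum(x * w for x, w in zip(inputs, weights))
    # Assumption: a net input equal to the threshold does not fire.
    return 1 if net_input > threshold else 0

# Weights and threshold taken from the worked example on the slides.
weights = [2, -1, 4]
threshold = 2

first = neuron_output([0, 1, 1], weights, threshold)   # net input 3, and 3 > 2
second = neuron_output([1, 1, 0], weights, threshold)  # net input 1, and 1 < 2
print(first, second)
```

With the slide's numbers, the first call fires and the second does not.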
Feedforward Network In Operation (1) to (6)
(sequence of figures: the input pattern propagates layer by layer through the network)
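A feedforward network in operation can be sketched as repeated application of the threshold neuron, one layer at a time. The layer sizes, weights, and thresholds below are made-up values for illustration only (the slides do not give any), and the names `layer_output` and `feedforward` are hypothetical.

```python
def layer_output(inputs, weights, thresholds):
    """Compute the outputs of one layer of threshold neurons."""
    outputs = []
    for row, threshold in zip(weights, thresholds):
        net_input = sum(w * x for w, x in zip(row, inputs))
        outputs.append(1 if net_input > threshold else 0)
    return outputs

def feedforward(inputs, layers):
    """Pass activity from the input layer through each later layer in turn."""
    activity = list(inputs)
    for weights, thresholds in layers:
        activity = layer_output(activity, weights, thresholds)
    return activity

# Hypothetical network: 3 inputs, one hidden layer of 2 neurons, 1 output neuron.
layers = [
    ([[2, -1, 4], [1, 1, 1]], [2, 2]),  # hidden layer: weight rows, thresholds
    ([[1, 1]], [1]),                    # output layer fires only if both hidden units fire
]

result_a = feedforward([0, 1, 1], layers)
result_b = feedforward([1, 1, 1], layers)
print(result_a, result_b)
```

Each layer only needs the outputs of the layer before it, which is why the number of sequential steps equals the number of layers rather than the number of neurons.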
Feedforward Network In Operation (7) and (8)
(figures: the network produces its output)
Comparison with Non-Neural Net Approaches
• Non-NN approaches typically decide output from a small number of dominant factors
• NNs typically look at a large number of factors, each of which weakly influences output
• NNs permit:
– subtle discriminations
– holistic judgments
– context sensitivity
Connectionist Architectures
• The knowledge is implicit in the connection weights between the neurons
• Items of knowledge are not stored in dedicated memory locations, as in a Von Neumann computer
• “Holographic” knowledge representation:
– each knowledge item is distributed over many connections
– each connection encodes many knowledge items
• Memory & processing are robust in the face of damage, errors, inaccuracy, noise, …
Differences from Digital Calculation
• Information represented in continuous images (rather than language-like structures)
• Information processing by continuous image processing (rather than explicit rules applied in individual steps)
• Indefiniteness is inevitable (rather than definiteness assumed)
Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Neural nets are trained rather than programmed
– another difference from Von Neumann computation
Learning for Output Neuron (1)
(figure: inputs 0, 1, 1; weights 2, –1, 4; threshold 2)
• Net input Σ = 3; 3 > 2, so the output is 1
• Desired output: 1
• No change!
Learning for Output Neuron (2)
• Same inputs, but desired output: 0
• Output is too large
• Weights on the active inputs are decreased: –1 becomes –1.5 and 4 becomes 3.25
• Net input becomes 1.75; 1.75 < 2, so the output becomes 0
Learning for Output Neuron (3)
(figure: inputs 1, 1, 0; weights 2, –1, 4; threshold 2)
• Net input Σ = 1; 1 < 2, so the output is 0
• Desired output: 0
• No change!
Learning for Output Neuron (4)
• Same inputs, but desired output: 1
• Output is too small
• Weights on the active inputs are increased: 2 becomes 2.75 and –1 becomes –0.5
• Net input becomes 2.25; 2.25 > 2, so the output becomes 1
Credit Assignment Problem
(figure: input layer, hidden layers, output layer, desired output)
• How do we adjust the weights of the hidden layers?
Back-Propagation: Forward Pass
(figure: network with Input and Desired output)
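The four output-neuron cases above follow one pattern: leave the weights alone when the output is right, lower the weights on active inputs when the output is too large, and raise them when it is too small. A minimal sketch of that pattern is the classic perceptron-style rule below. It is an assumption that this matches the lecture's exact rule: the slides change individual weights by different amounts (0.5 and 0.75), whereas this sketch uses a single hypothetical `rate` of 0.75 for every weight and keeps the threshold fixed, so the resulting numbers differ slightly from the slides.

```python
def neuron_output(inputs, weights, threshold):
    """Threshold unit: output 1 when the weighted sum exceeds the threshold."""
    net_input = sum(x * w for x, w in zip(inputs, weights))
    return 1 if net_input > threshold else 0

def learn(inputs, weights, threshold, desired, rate=0.75):
    """Return adjusted weights; they are unchanged when the output is correct."""
    # error is +1 (output too small), -1 (output too large), or 0 (correct).
    error = desired - neuron_output(inputs, weights, threshold)
    # Only weights on active inputs (x = 1) are changed.
    return [w + rate * error * x for w, x in zip(weights, inputs)]

weights = [2, -1, 4]
threshold = 2

unchanged = learn([0, 1, 1], weights, threshold, desired=1)  # output already 1
lowered = learn([0, 1, 1], weights, threshold, desired=0)    # output too large
raised = learn([1, 1, 0], weights, threshold, desired=1)     # output too small
print(unchanged, lowered, raised)
```

After one adjustment, `lowered` gives output 0 on inputs 0, 1, 1 and `raised` gives output 1 on inputs 1, 1, 0, mirroring the corrected outputs on the slides.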
Back-Propagation: Actual vs. Desired Output
Back-Propagation: Correct Output Neuron Weights
Back-Propagation: Correct Last Hidden Layer
Back-Propagation: Correct All Hidden Layers
Back-Propagation: Correct First Hidden Layer
(sequence of figures, each showing the network with Input and Desired output)
Use of Back-Propagation (BP)
• Typically the weights are changed slowly
• Typically the net will not give correct outputs for all training inputs after one adjustment
• Each input/output pair is used repeatedly for training
• BP may be slow
• But there are many better ANN learning algorithms
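The back-propagation steps above (forward pass, compare actual with desired output, correct the output weights, then correct the hidden layers) can be sketched for a tiny network. Several things here are assumptions rather than details from the slides: the hard threshold is replaced by a smooth sigmoid so that errors can be passed backward, the network has one hidden layer of two neurons, the training task is logical OR, and the learning rate and names such as `train_step` are hypothetical.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_hidden, w_out):
    """Forward pass; the last weight in each row multiplies a constant bias of 1."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    output = sigmoid(sum(w * v for w, v in zip(w_out, hidden + [1.0])))
    return hidden, output

def train_step(x, target, w_hidden, w_out, rate=0.5):
    """One small weight adjustment for a single input/output pair."""
    hidden, output = forward(x, w_hidden, w_out)
    # Error signal at the output neuron (actual vs. desired output).
    delta_out = (target - output) * output * (1.0 - output)
    # Error signals for hidden neurons, passed back through the output weights.
    delta_hidden = [delta_out * w_out[j] * h * (1.0 - h) for j, h in enumerate(hidden)]
    # Correct the output neuron weights.
    for j, v in enumerate(hidden + [1.0]):
        w_out[j] += rate * delta_out * v
    # Correct the hidden layer weights.
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += rate * delta_hidden[j] * v

def total_error(data, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in data)

rng = random.Random(0)
w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [rng.uniform(-1, 1) for _ in range(3)]
# Assumed training set: logical OR of two inputs.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

error_before = total_error(data, w_hidden, w_out)
for _ in range(2000):            # each input/output pair is used repeatedly
    for x, t in data:
        train_step(x, t, w_hidden, w_out)
error_after = total_error(data, w_hidden, w_out)
print(error_before, error_after)
```

The many small passes over the same pairs reflect the points above: weights change slowly, and one adjustment is rarely enough.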
ANN Training Procedures
• Supervised training: we show the net the output it should produce for each training input (e.g., BP)
• Reinforcement training: we tell the net if its output is right or wrong, but not what the correct output is
• Unsupervised training: the net attempts to find patterns in its environment without external guidance
Applications of ANNs
• “Neural nets are the second-best way of doing everything”
• If you really understand a problem, you can design a special-purpose algorithm for it, which will beat an NN
• However, if you don’t understand your problem very well, you can generally train an NN to do it well enough
The Hopfield Network
(and constraint satisfaction)
Hopfield Network
• Symmetric weights: wij = wji
• No self-action: wii = 0
• Zero threshold: θ = 0
• Bipolar states: si ∈ {–1, +1}
• Discontinuous bipolar activation function:
$\sigma(h) = \operatorname{sgn}(h) = \begin{cases} -1, & h < 0 \\ +1, & h > 0 \end{cases}$
Positive Coupling
• Positive sense (sign)
• Large strength
Negative Coupling
• Negative sense (sign)
• Large strength
Weak Coupling
• Either sense (sign)
• Little strength
State = –1 & Local Field < 0
(figure: h < 0)
State = –1 & Local Field > 0
(figure: h > 0)
State Reverses
(figure: h > 0)
State = +1 & Local Field > 0
(figure: h > 0)
State = +1 & Local Field < 0
(figure: h < 0)
State Reverses
(figure: h < 0)
Hopfield Net as Soft Constraint Satisfaction System
• States of neurons as yes/no decisions
• Weights represent soft constraints between decisions
– hard constraints must be respected
– soft constraints have degrees of importance
• Decisions change to better respect constraints
• Is there an optimal set of decisions that best respects all constraints?
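The slides use the local field h without writing it out. Under the standard Hopfield definition (stated here as background, not quoted from the slides), the local field of neuron i is the weighted sum of the other neurons' states, and the update sets the state to the sign of that field:

```latex
h_i = \sum_{j} w_{ij}\, s_j ,
\qquad
s_i' = \operatorname{sgn}(h_i) =
\begin{cases}
-1, & h_i < 0 \\
+1, & h_i > 0
\end{cases}
```

So a state of –1 with h > 0, or a state of +1 with h < 0, reverses; in the other two cases the state already agrees with its local field and stays as it is.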
Convergence
• Does such a system converge to a stable
state?
• Under what conditions does it converge?
• There is a sense in which each step relaxes
the “tension” in the system (or increases its
“harmony”)
• But could a relaxation of one neuron lead to
greater tension in other places?
Demonstration of Hopfield Net
Run Hopfield Demo
Quantifying Harmony
(figure: example couplings with harmony values from H = 10 down to H = –10; labels: increasing harmony, decreasing harmony)
Energy
• “Energy” (or “tension”) is the opposite of harmony
• E = –H
(figure: Energy and Harmony scales)
Energy Does Not Increase
• In each step in which a neuron is considered for update:
E{s(t + 1)} – E{s(t)} ≤ 0
• Energy cannot increase
• Energy decreases if any neuron changes
• Must it stop? (Yes)
Conclusion
• If we do asynchronous updating, the Hopfield net must reach a stable, minimum energy state in a finite number of updates
• This does not imply that it is a global minimum
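The claim that energy never increases under asynchronous updating can be checked numerically. The sketch below assumes the standard Hopfield energy E = –½ Σ wij si sj, which is not written on the slides, and uses a small random symmetric network with zero self-weights; the network size, weight values, and seed are arbitrary choices for illustration.

```python
import random

def energy(w, s):
    """Standard Hopfield energy (assumed form): E = -1/2 * sum_ij w_ij s_i s_j."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def update(w, s, i):
    """Asynchronous update of neuron i: align its state with its local field."""
    h = sum(w[i][j] * s[j] for j in range(len(s)))
    if h > 0:
        s[i] = 1
    elif h < 0:
        s[i] = -1
    # Assumption: when h == 0 the state is left unchanged.

rng = random.Random(1)
n = 6
# Symmetric weights (w_ij = w_ji) with no self-action (w_ii = 0).
w = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        w[i][j] = w[j][i] = rng.choice([-2, -1, 1, 2])

s = [rng.choice([-1, 1]) for _ in range(n)]
energies = [energy(w, s)]
for _ in range(10):              # ten sweeps over all neurons, one at a time
    for i in range(n):
        update(w, s, i)
        energies.append(energy(w, s))

print(energies[0], energies[-1])
```

The recorded energies form a non-increasing sequence. The final state is a minimum with respect to single-neuron changes, which need not be the global minimum.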
Conceptual Picture of Descent on Energy Surface
(fig. from Solé & Goodwin)
Energy Surface
(fig. from Haykin Neur. Netw.)
Energy Surface + Flow Lines
(fig. from Haykin Neur. Netw.)
Flow Lines
Basins of Attraction
(fig. from Haykin Neur. Netw.)
Storing Memories as Attractors
(fig. from Solé & Goodwin)
Demonstration of Hopfield Net
Run Hopfield Demo
Example of Pattern Restoration
(five figures from Arbib 1995)
Example of Pattern Completion
(five figures from Arbib 1995)
Example of Association
(five figures from Arbib 1995)
Applications of Hopfield Memory
• Pattern restoration
• Pattern completion
• Pattern generalization
• Pattern association
Hopfield Net for Optimization and for Associative Memory
• For optimization:
– we know the weights (couplings)
– we want to know the minima (solutions)
• For associative memory:
– we know the minima (retrieval states)
– we want to know the weights
Hebb’s Rule
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
(Donald Hebb, The Organization of Behavior, 1949, p. 62)
Example of Hebbian Learning: Pattern Imprinted
Example of Hebbian Learning: Partial Pattern Reconstruction
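For associative memory, the weights have to be derived from the patterns to be stored. A common way to turn Hebb's rule into a formula is the outer-product rule, wij = (1/N) Σ over patterns of xi·xj with wii = 0. That formula is an assumption here, since the slides quote Hebb but do not give an equation. The sketch imprints one bipolar pattern and then restores it from a corrupted copy; the pattern itself and the names `imprint` and `recall` are made up for illustration.

```python
def imprint(patterns):
    """Hebbian (outer-product) weights: neurons that agree get positive coupling."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:                  # no self-action: w_ii stays 0
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, sweeps=5):
    """Asynchronous updates until the sweeps run out; returns the final state."""
    s = list(state)
    for _ in range(sweeps):
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            if h != 0:                      # assumption: keep the state when h == 0
                s[i] = 1 if h > 0 else -1
    return s

pattern = [1, -1, 1, -1, 1, -1, 1, -1]
w = imprint([pattern])

# Corrupt two of the eight elements, then let the network restore them.
noisy = list(pattern)
noisy[0] = -noisy[0]
noisy[3] = -noisy[3]
restored = recall(w, noisy)
print(restored == pattern)
```

The imprinted pattern acts as an attractor: a state that starts close enough to it falls back into it.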
Stochastic Neural Networks
(in particular, the stochastic Hopfield network)
Motivation
• Idea: with low probability, go against the local field
– move up the energy surface
– make the “wrong” microdecision
• Potential value for optimization: escape from local optima
• Potential value for associative memory: escape from spurious states
– because they have higher energy than imprinted states
Trapping in Local Minimum
Escape from Local Minimum
(two figures)
The Stochastic Neuron
Deterministic neuron: $s'_i = \operatorname{sgn}(h_i)$
$\Pr\{s'_i = +1\} = \Theta(h_i)$
$\Pr\{s'_i = -1\} = 1 - \Theta(h_i)$
(Θ: unit step function)
Stochastic neuron:
$\Pr\{s'_i = +1\} = \sigma(h_i)$
$\Pr\{s'_i = -1\} = 1 - \sigma(h_i)$
Logistic sigmoid: $\sigma(h) = \dfrac{1}{1 + \exp(-2h/T)}$
(figure: plot of σ(h) against h)
Properties of Logistic Sigmoid
$\sigma(h) = \dfrac{1}{1 + e^{-2h/T}}$
• As h → +∞, σ(h) → 1
• As h → –∞, σ(h) → 0
• σ(0) = 1/2
Logistic Sigmoid With Varying T
(figures: plots of the logistic sigmoid for T = 0.01, 0.1, 0.5, 1, 10, 100)
• Slope at origin = 1/(2T)
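The stochastic neuron and the effect of T on the logistic sigmoid can be sketched as follows. The formula σ(h) = 1 / (1 + exp(–2h/T)) is the one from the slides; the function names, the sample size, and the seed are assumptions for illustration. The sigmoid is written in a numerically safe form so that very small T does not overflow.

```python
import math
import random

def logistic(h, T):
    """sigma(h) = 1 / (1 + exp(-2h/T)), computed without overflow for small T."""
    x = 2.0 * h / T
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def stochastic_state(h, T, rng):
    """New state is +1 with probability sigma(h), otherwise -1."""
    return 1 if rng.random() < logistic(h, T) else -1

# Low T behaves almost like the deterministic sign function;
# high T makes the decision nearly a coin flip.
nearly_step = (logistic(-1.0, 0.01), logistic(1.0, 0.01))
nearly_coin = logistic(1.0, 100.0)

# Empirical check: the fraction of +1 outcomes approaches sigma(h).
rng = random.Random(0)
samples = [stochastic_state(0.5, 1.0, rng) for _ in range(10000)]
fraction_up = samples.count(1) / len(samples)
print(nearly_step, nearly_coin, fraction_up)
```

At T = 1 and h = 0.5 the neuron goes against its local field a substantial fraction of the time, which is the "wrong microdecision" that allows uphill moves on the energy surface.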
Pseudo-Temperature
• Temperature = measure of thermal energy (heat)
• Thermal energy = vibrational energy of molecules
• A source of random motion
• Pseudo-temperature = a measure of nondirected (random) change
• Logistic sigmoid gives same equilibrium probabilities as Boltzmann-Gibbs distribution
Simulated Annealing
(Kirkpatrick, Gelatt & Vecchi, 1983)
Dilemma
• In the early stages of search, we want a high temperature, so that we will explore the space and find the basins of the global minimum
• In the later stages we want a low temperature, so that we will relax into the global minimum and not wander away from it
• Solution: decrease the temperature gradually during search
Quenching vs. Annealing
• Quenching:
– rapid cooling of a hot material
– may result in defects & brittleness
– local order but global disorder
– locally low-energy, globally frustrated
• Annealing:
– slow cooling (or alternate heating & cooling)
– reaches equilibrium at each temperature
– allows global order to emerge
– achieves global low-energy state
Multiple Domains
(figure: local coherence, global incoherence)
Moving Domain Boundaries
Effect of Moderate Temperature
(fig. from Anderson Intr. Neur. Comp.)
Effect of High Temperature
(fig. from Anderson Intr. Neur. Comp.)
Effect of Low Temperature
(fig. from Anderson Intr. Neur. Comp.)
Annealing Schedule
• Controlled decrease of temperature
• Should be sufficiently slow to allow equilibrium to be reached at each temperature
• With sufficiently slow annealing, the global minimum will be found with probability 1
• Design of schedules is a topic of research
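An annealing schedule for a stochastic Hopfield net can be sketched as a loop that runs several sweeps at each temperature and then cools. Everything specific here is an assumption for illustration: the geometric schedule (multiply T by 0.9), the start and end temperatures, the number of sweeps per temperature, and the standard energy E = –½ Σ wij si sj. The test network has all couplings equal to +1, so it has only two equivalent global minima (all +1 or all –1); it demonstrates the mechanics of the schedule, not an escape from a hard local minimum.

```python
import math
import random

def logistic(h, T):
    """sigma(h) = 1 / (1 + exp(-2h/T)), computed without overflow for small T."""
    x = 2.0 * h / T
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def energy(w, s):
    """Standard Hopfield energy (assumed form)."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def anneal(w, s, rng, t_start=10.0, t_end=0.05, cooling=0.9, sweeps_per_t=20):
    """Stochastic asynchronous updates while the temperature is lowered gradually."""
    s = list(s)
    T = t_start
    while T > t_end:
        for _ in range(sweeps_per_t):       # let the net settle at this temperature
            for i in range(len(s)):
                h = sum(w[i][j] * s[j] for j in range(len(s)))
                s[i] = 1 if rng.random() < logistic(h, T) else -1
        T *= cooling                        # hypothetical geometric cooling step
    return s

n = 8
w = [[0 if i == j else 1 for j in range(n)] for i in range(n)]
rng = random.Random(0)
start = [rng.choice([-1, 1]) for _ in range(n)]
final = anneal(w, start, rng)
print(final, energy(w, final))
```

At high T the states change almost at random (exploration); as T falls the updates become nearly deterministic and the net relaxes into a low-energy state (exploitation).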
Necker Cube
Demonstration of Boltzmann Machine & Necker Cube Example
Run ~mclennan/pub/cube/cubedemo
Biased Necker Cube
Summary
• Non-directed change (random motion) permits escape from local optima and spurious states
• Pseudo-temperature can be controlled to adjust relative degree of exploration and exploitation