Lecture 13
Artificial Neural Networks

The Cognitive Inversion
• Computers can do some things very well that are difficult for people
  – e.g., arithmetic calculations
  – playing chess & other board games
  – doing proofs in formal logic & mathematics
  – handling large amounts of data precisely
• But computers are very bad at some things that are easy for people (and even some animals)
  – e.g., face recognition & general object recognition
  – autonomous locomotion
  – sensory-motor coordination
• Conclusion: brains work very differently from Von Neumann computers

Two Different Approaches to Computing
• Von Neumann: narrow but deep
• Neural computation: shallow but wide

The 100-Step Rule
• Typical recognition tasks take less than one second
• Neurons take several milliseconds to fire
• Therefore there can be at most about 100 sequential processing steps

Neurons are Not Logic Gates
• Speed
  – electronic logic gates are very fast (nanoseconds)
  – neurons are comparatively slow (milliseconds)
• Precision
  – logic gates are highly reliable digital devices
  – neurons are imprecise analog devices
• Connections
  – logic gates have few inputs (usually 1 to 3)
  – many neurons have >100,000 inputs

How Wide?
• Retina: 100 million receptors
• Optic nerve: one million nerve fibers

Typical Artificial Neuron
[two slides of figures: inputs arrive through connection weights, a summation yields the net input (Σ), and an activation function with a threshold produces the output]

Operation of Artificial Neuron
• Example 1: inputs (0, 1, 1), weights (2, –1, 4), threshold 2
  – net input = 0×2 + 1×(–1) + 1×4 = 3; since 3 > 2, the output is 1
• Example 2: inputs (1, 1, 0), same weights and threshold
  – net input = 1×2 + 1×(–1) + 0×4 = 1; since 1 < 2, the output is 0

Feedforward Network
[figure: input layer → hidden layers → output layer]

Feedforward Network In Operation (1)–(8)
[eight slides of figures: activation propagating layer by layer from the input layer to the output layer]

Comparison with Non-Neural Net Approaches
• Non-NN approaches typically decide the output from a small number of dominant factors
• NNs typically look at a large number of factors, each of which weakly influences the output
• NNs permit:
  – subtle discriminations
  – holistic judgments
  – context sensitivity
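The arithmetic on the “Operation of Artificial Neuron” slides is easy to check in code. Below is a minimal Python sketch (illustrative, not from the lecture; the function names are assumptions) of the threshold neuron, plus the layer-by-layer evaluation a feedforward network performs:

# Threshold neuron from the "Operation of Artificial Neuron" slides
# (illustrative sketch, not lecture code).

def step_neuron(inputs, weights, threshold):
    """Weighted sum of the inputs, then a hard threshold."""
    net = sum(x * w for x, w in zip(inputs, weights))   # net input (Σ)
    return 1 if net > threshold else 0                  # activation function

# The two worked examples: weights (2, -1, 4), threshold 2.
assert step_neuron([0, 1, 1], [2, -1, 4], 2) == 1   # net = 3 > 2
assert step_neuron([1, 1, 0], [2, -1, 4], 2) == 0   # net = 1 < 2

def feedforward(inputs, layers, threshold=2):
    """A feedforward net applies such neurons layer by layer;
    layers is a list of weight matrices, one row per neuron."""
    activations = inputs
    for weight_matrix in layers:
        activations = [step_neuron(activations, row, threshold)
                       for row in weight_matrix]
    return activations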
Connectionist Architectures
• The knowledge is implicit in the connection weights between the neurons
• Items of knowledge are not stored in dedicated memory locations, as in a Von Neumann computer
• “Holographic” knowledge representation:
  – each knowledge item is distributed over many connections
  – each connection encodes many knowledge items
• Memory & processing are robust in the face of damage, errors, inaccuracy, noise, …
  – another difference from Von Neumann computation

Differences from Digital Calculation
• Neural nets are trained rather than programmed
• Information is represented in continuous images (rather than language-like structures)
• Information is processed by continuous image processing (rather than explicit rules applied in individual steps)
• Indefiniteness is inevitable (rather than definiteness assumed)

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition

Learning for Output Neuron (1)–(4)
[four slides of figures showing a single threshold output neuron and its desired output: when the actual output matches the desired output, the weights are left unchanged (“No change!”); when the output is too large, the weights on the active inputs are decreased; when the output is too small, they are increased]

Credit Assignment Problem
[figure: input layer → hidden layers → output layer, with the input and desired output marked]
• How do we adjust the weights of the hidden layers?

Back-Propagation: Forward Pass
[figure: the input propagates forward through the hidden layers to the output layer]

Back-Propagation: Actual vs. Desired Output / Correct Output Neuron Weights / Correct Last Hidden Layer / Correct All Hidden Layers / Correct First Hidden Layer
[seven slides of figures: the error between actual and desired output is used to correct the output neuron’s weights, then the hidden-layer weights, layer by layer back toward the input]

Use of Back-Propagation (BP)
• Typically the weights are changed slowly
• Typically the net will not give correct outputs for all training inputs after one adjustment
• Each input/output pair is used repeatedly for training
• BP may be slow
• But there are many better ANN learning algorithms
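Looking back at the “Learning for Output Neuron” slides, the pictured rule is straightforward to sketch in Python. This is an illustration under assumptions (the function name and learning rate are mine, not the lecture’s exact figures); note that it only reaches the output layer, which is exactly the credit assignment problem that back-propagation addresses:

# Illustrative sketch of the output-neuron learning rule pictured above.
# The learning rate 0.25 is an assumption, not the lecture's exact value.

def train_output_neuron(inputs, weights, threshold, desired, rate=0.25):
    """One training step: no change if the output is correct; otherwise
    raise or lower the weights on the active inputs."""
    net = sum(x * w for x, w in zip(inputs, weights))
    output = 1 if net > threshold else 0
    error = desired - output   # 0: correct; +1: too small; -1: too large
    return [w + rate * error * x for x, w in zip(inputs, weights)]

# This rule gives no way to assign credit or blame to hidden-layer
# weights; back-propagation fills that gap.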
ANN Training Procedures
• Supervised training: we show the net the output it should produce for each training input (e.g., BP)
• Reinforcement training: we tell the net if its output is right or wrong, but not what the correct output is
• Unsupervised training: the net attempts to find patterns in its environment without external guidance

Applications of ANNs
• “Neural nets are the second-best way of doing everything”
• If you really understand a problem, you can design a special-purpose algorithm for it, which will beat a NN
• However, if you don’t understand your problem very well, you can generally train a NN to do it well enough

The Hopfield Network
(and constraint satisfaction)

Hopfield Network
• Symmetric weights: wij = wji
• No self-action: wii = 0
• Zero threshold: θ = 0
• Bipolar states: si ∈ {–1, +1}
• Discontinuous bipolar activation function:
  σ(h) = sgn(h) = –1 if h < 0, +1 if h > 0

Positive Coupling
• Positive sense (sign)
• Large strength

Negative Coupling
• Negative sense (sign)
• Large strength

Weak Coupling
• Either sense (sign)
• Little strength

[six slides of figures on state updates: with state = –1 and local field h < 0, the state is stable; with state = –1 and h > 0, the state reverses; with state = +1 and h > 0, it is stable; with state = +1 and h < 0, it reverses]

Hopfield Net as Soft Constraint Satisfaction System
• States of neurons as yes/no decisions
• Weights represent soft constraints between decisions
  – hard constraints must be respected
  – soft constraints have degrees of importance
• Decisions change to better respect constraints
• Is there an optimal set of decisions that best respects all constraints?

Convergence
• Does such a system converge to a stable state?
• Under what conditions does it converge?
• There is a sense in which each step relaxes the “tension” in the system (or increases its “harmony”)
• But could the relaxation of one neuron lead to greater tension in other places?

Demonstration of Hopfield Net
Run Hopfield Demo

Quantifying Harmony
[figure: example networks scored by harmony, from H = 10 down to H = –10, arranged along an axis from increasing to decreasing harmony]

Energy
• “Energy” (or “tension”) is the opposite of harmony
• E = –H

Energy Does Not Increase
• In each step in which a neuron is considered for update:
  E{s(t + 1)} – E{s(t)} ≤ 0
• Energy cannot increase
• Energy decreases if any neuron changes
• Must it stop? (Yes)

Conclusion
• If we do asynchronous updating, the Hopfield net must reach a stable, minimum-energy state in a finite number of updates
• This does not imply that it is a global minimum

Conceptual Picture of Descent on Energy Surface
[fig. from Solé & Goodwin]

Energy Surface
[fig. from Haykin Neur. Netw.]

Energy Surface + Flow Lines
[fig. from Haykin Neur. Netw.]

Flow Lines / Basins of Attraction
[fig. from Haykin Neur. Netw.]

Storing Memories as Attractors
[fig. from Solé & Goodwin]

Demonstration of Hopfield Net
Run Hopfield Demo

Example of Pattern Restoration
[five slides of figures from Arbib 1995 showing successive steps of pattern restoration]
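Before the pattern-completion and association examples, here is a minimal NumPy sketch (an illustration, not the lecture’s demo code) of the dynamics just described: asynchronous sgn updates under symmetric, zero-diagonal weights, with the energy E = –H that the “Energy Does Not Increase” slide argues can only go down:

# Asynchronous Hopfield dynamics (illustrative sketch, not lecture code).
import numpy as np

def energy(W, s):
    """E = -1/2 * s^T W s  (zero thresholds, as on the slides)."""
    return -0.5 * s @ W @ s

def async_step(W, s, rng):
    """Update one randomly chosen neuron; the energy never increases."""
    i = rng.integers(len(s))
    h = W[i] @ s                     # local field on neuron i
    s[i] = 1.0 if h >= 0 else -1.0   # sgn(h); the tie h = 0 goes to +1 here
    return s

rng = np.random.default_rng(0)
W = np.array([[ 0.0,  1.0, -2.0],    # symmetric weights, zero diagonal
              [ 1.0,  0.0,  3.0],
              [-2.0,  3.0,  0.0]])
s = np.array([1.0, -1.0, 1.0])
for _ in range(10):
    s = async_step(W, s, rng)
print(s, energy(W, s))               # settles into a local minimum of E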
Example of Pattern Completion
[five slides of figures from Arbib 1995 showing successive steps of pattern completion]

Example of Association
[five slides of figures from Arbib 1995 showing successive steps of association]

Applications of Hopfield Memory
• Pattern restoration
• Pattern completion
• Pattern generalization
• Pattern association

Hebb’s Rule
“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
—Donald Hebb (The Organization of Behavior, 1949, p. 62)

Hopfield Net for Optimization and for Associative Memory
• For optimization:
  – we know the weights (couplings)
  – we want to know the minima (solutions)
• For associative memory:
  – we know the minima (retrieval states)
  – we want to know the weights

Example of Hebbian Learning: Pattern Imprinted
[figure]

Example of Hebbian Learning: Partial Pattern Reconstruction
[figure]

Stochastic Neural Networks
(in particular, the stochastic Hopfield network)

Trapping in Local Minimum
[figure]

Escape from Local Minimum
[two slides of figures]

Motivation
• Idea: with low probability, go against the local field
  – move up the energy surface
  – make the “wrong” microdecision
• Potential value for optimization: escape from local optima
• Potential value for associative memory: escape from spurious states
  – because they have higher energy than imprinted states

The Stochastic Neuron
• Deterministic neuron: s′i = sgn(hi)
• Stochastic neuron:
  Pr{s′i = +1} = σ(hi)
  Pr{s′i = –1} = 1 – σ(hi)
• Logistic sigmoid: σ(h) = 1 / (1 + exp(–2h/T))

Properties of Logistic Sigmoid
• As h → +∞, σ(h) → 1
• As h → –∞, σ(h) → 0
• σ(0) = 1/2

Logistic Sigmoid With Varying T
[seven slides of figures: plots of σ(h) = 1 / (1 + e^(–2h/T)) for T = 0.01, 0.1, 0.5, 1, 10, and 100; the slope at the origin is 1 / 2T]

Pseudo-Temperature
• Temperature = measure of thermal energy (heat)
• Thermal energy = vibrational energy of molecules
• A source of random motion
• Pseudo-temperature = a measure of nondirected (random) change
• Logistic sigmoid gives the same equilibrium probabilities as the Boltzmann-Gibbs distribution
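The stochastic neuron above is likewise only a few lines of Python. A hedged sketch under the slide’s definitions (the function names are illustrative):

# Stochastic neuron with pseudo-temperature T (illustrative sketch).
import math
import random

def logistic(h, T):
    """sigma(h) = 1 / (1 + exp(-2h/T)); written with tanh, which is the
    same function but numerically safe for large |h|/T."""
    return 0.5 * (1.0 + math.tanh(h / T))

def stochastic_update(h, T):
    """Pr{s' = +1} = sigma(h), Pr{s' = -1} = 1 - sigma(h).
    Small T approximates the deterministic sgn(h); large T
    approaches a fair coin flip (sigma(0) = 1/2)."""
    return 1 if random.random() < logistic(h, T) else -1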
Simulated Annealing
(Kirkpatrick, Gelatt & Vecchi, 1983)

Dilemma
• In the early stages of search, we want a high temperature, so that we will explore the space and find the basins of the global minimum
• In the later stages we want a low temperature, so that we will relax into the global minimum and not wander away from it
• Solution: decrease the temperature gradually during search

Quenching vs. Annealing
• Quenching:
  – rapid cooling of a hot material
  – may result in defects & brittleness
  – local order but global disorder
  – locally low-energy, globally frustrated
• Annealing:
  – slow cooling (or alternate heating & cooling)
  – reaches equilibrium at each temperature
  – allows global order to emerge
  – achieves a global low-energy state

Multiple Domains
[figure: local coherence, global incoherence]

Moving Domain Boundaries
[figure]

Effect of Moderate Temperature
[fig. from Anderson Intr. Neur. Comp.]

Effect of High Temperature
[fig. from Anderson Intr. Neur. Comp.]

Effect of Low Temperature
[fig. from Anderson Intr. Neur. Comp.]

Annealing Schedule
• Controlled decrease of temperature
• Should be sufficiently slow to allow equilibrium to be reached at each temperature
• With sufficiently slow annealing, the global minimum will be found with probability 1
• Design of schedules is a topic of research

Necker Cube
[figure]

Demonstration of Boltzmann Machine & Necker Cube Example
Run ~mclennan/pub/cube/cubedemo

Biased Necker Cube
[figure]

Summary
• Non-directed change (random motion) permits escape from local optima and spurious states
• Pseudo-temperature can be controlled to adjust the relative degree of exploration and exploitation
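As a closing illustration, the pieces above combine into simulated annealing on a stochastic Hopfield net: run stochastic updates at each temperature, then cool gradually. The geometric schedule and its parameters here are assumptions for the sketch; as the “Annealing Schedule” slide notes, schedule design is itself a research topic:

# Simulated annealing on a stochastic Hopfield net (illustrative sketch;
# the geometric cooling schedule and its parameters are assumptions).
import math
import numpy as np

def anneal(W, s, T0=10.0, alpha=0.9, sweeps_per_T=20, T_min=0.05, seed=0):
    """High temperature early (exploration), low temperature late
    (exploitation); slower cooling improves the chance of reaching
    the global minimum."""
    rng = np.random.default_rng(seed)
    T = T0
    while T > T_min:
        for _ in range(sweeps_per_T * len(s)):
            i = rng.integers(len(s))
            h = W[i] @ s                              # local field
            p = 0.5 * (1.0 + math.tanh(h / T))        # Pr{s_i = +1}
            s[i] = 1.0 if rng.random() < p else -1.0
        T *= alpha                                    # cool gradually
    return s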