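An informal account of BackProp

For each pattern in the training set:
• Compute the error at the output nodes
• Compute Δw for each weight in the 2nd layer
• Compute delta (the generalized error expression) for the hidden units
• Compute Δw for each weight in the 1st layer

After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate η:

$$\Delta w_{ij} = \eta \, \delta_{ip} \, o_{jp}$$

Backprop Details

Here we go… Also refer to web notes for derivation

The output layer

[Figure: unit k feeds unit j through weight $w_{jk}$; unit j feeds output unit i through weight $w_{ij}$; $y_i$ is the output of unit i and $t_i$ its target]

$$E = \text{Error} = \tfrac{1}{2} \sum_i (t_i - y_i)^2$$

The derivative of the sigmoid is just $f'(x_i) = y_i (1 - y_i)$. With learning rate $\eta$:

$$W_{ij} \leftarrow W_{ij} - \eta \frac{\partial E}{\partial W_{ij}}, \qquad \Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}}$$

$$\frac{\partial E}{\partial W_{ij}} = \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial W_{ij}} = -(t_i - y_i) \, f'(x_i) \, y_j$$

$$\Delta W_{ij} = \eta \, (t_i - y_i) \, y_i (1 - y_i) \, y_j$$

$$\Delta W_{ij} = \eta \, y_j \, \delta_i, \qquad \delta_i = (t_i - y_i) \, y_i (1 - y_i)$$

The hidden layer

[Figure: the same network, k → j → i, with weights $w_{jk}$ and $w_{ij}$, output $y_i$ and target $t_i$]

$$E = \text{Error} = \tfrac{1}{2} \sum_i (t_i - y_i)^2$$

$$\Delta W_{jk} = -\eta \frac{\partial E}{\partial W_{jk}}, \qquad \frac{\partial E}{\partial W_{jk}} = \frac{\partial E}{\partial y_j} \cdot \frac{\partial y_j}{\partial x_j} \cdot \frac{\partial x_j}{\partial W_{jk}}$$

$$\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i} \cdot \frac{\partial y_i}{\partial x_i} \cdot \frac{\partial x_i}{\partial y_j} = -\sum_i (t_i - y_i) \, f'(x_i) \, W_{ij}$$

$$\frac{\partial E}{\partial W_{jk}} = -\Big[ \sum_i (t_i - y_i) \, f'(x_i) \, W_{ij} \Big] f'(x_j) \, y_k$$

$$\Delta W_{jk} = \eta \Big[ \sum_i (t_i - y_i) \, y_i (1 - y_i) \, W_{ij} \Big] y_j (1 - y_j) \, y_k$$

$$\Delta W_{jk} = \eta \, y_k \, \delta_j, \qquad \delta_j = \Big[ \sum_i (t_i - y_i) \, y_i (1 - y_i) \, W_{ij} \Big] y_j (1 - y_j) = \Big[ \sum_i W_{ij} \, \delta_i \Big] y_j (1 - y_j)$$

Momentum term

The speed of learning is governed by the learning rate.
• If the rate is too low, convergence is slow
• If the rate is too high, the error oscillates without reaching the minimum

$$\Delta w_{ij}(n) = \alpha \, \Delta w_{ij}(n-1) + \eta \, \delta_i(n) \, y_j(n), \qquad 0 \le \alpha < 1$$

• Momentum tends to smooth small weight error fluctuations.
• Momentum accelerates the descent in steady downhill directions.
• Momentum has a stabilizing effect in directions that oscillate in time.

Convergence

• May get stuck in local minima
• Weights may diverge
…but works well in practice

Representation power:
• 2 layer networks: any continuous function
• 3 layer networks: any function

Local Minimum

Use a random component, e.g. simulated annealing

Overfitting and generalization

Too many hidden nodes tends to overfit

Overfitting in ANNs

Early Stopping (Important!!!)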
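As a rough sketch of how the pieces above fit together (the delta-rule updates for both layers, the momentum term, and early stopping on a validation set), the following pure-Python code trains a small sigmoid network. The function names, the `patience` parameter, the random initialization range, and the convention that any bias is supplied as a constant input are all assumptions made for illustration; they are not taken from the slides. Setting `patience=1` corresponds most closely to "stop as soon as the validation error goes up".

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, W_jk, W_ij):
    """x is the input vector y_k. W_jk[j][k]: input->hidden, W_ij[i][j]: hidden->output."""
    y_j = [sigmoid(sum(w * xk for w, xk in zip(row, x))) for row in W_jk]
    y_i = [sigmoid(sum(w * yj for w, yj in zip(row, y_j))) for row in W_ij]
    return y_j, y_i

def total_error(data, W_jk, W_ij):
    """E = 1/2 * sum over patterns and output units of (t_i - y_i)^2."""
    return 0.5 * sum((t - y) ** 2
                     for x, target in data
                     for t, y in zip(target, forward(x, W_jk, W_ij)[1]))

def accumulate_steps(data, W_jk, W_ij):
    """Sum delta_i * y_j and delta_j * y_k over all patterns (this equals -dE/dW)."""
    g_ij = [[0.0] * len(W_ij[0]) for _ in W_ij]
    g_jk = [[0.0] * len(W_jk[0]) for _ in W_jk]
    for x, target in data:
        y_j, y_i = forward(x, W_jk, W_ij)
        # Output layer: delta_i = (t_i - y_i) * y_i * (1 - y_i)
        d_i = [(t - y) * y * (1 - y) for t, y in zip(target, y_i)]
        # Hidden layer: delta_j = (sum_i W_ij * delta_i) * y_j * (1 - y_j)
        d_j = [sum(W_ij[i][j] * d_i[i] for i in range(len(d_i))) * y_j[j] * (1 - y_j[j])
               for j in range(len(y_j))]
        for i, di in enumerate(d_i):
            for j, yj in enumerate(y_j):
                g_ij[i][j] += di * yj
        for j, dj in enumerate(d_j):
            for k, xk in enumerate(x):
                g_jk[j][k] += dj * xk
    return g_jk, g_ij

def train(train_set, val_set, n_hidden=3, eta=0.5, alpha=0.9,
          max_epochs=2000, patience=20, seed=0):
    """Batch backprop with momentum; keeps the weights with the lowest validation error."""
    rng = random.Random(seed)
    n_in, n_out = len(train_set[0][0]), len(train_set[0][1])
    W_jk = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W_ij = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    dW_jk = [[0.0] * n_in for _ in range(n_hidden)]
    dW_ij = [[0.0] * n_hidden for _ in range(n_out)]
    best_err = total_error(val_set, W_jk, W_ij)
    best = ([r[:] for r in W_jk], [r[:] for r in W_ij])
    bad_epochs = 0
    for _ in range(max_epochs):
        g_jk, g_ij = accumulate_steps(train_set, W_jk, W_ij)
        for W, dW, g in ((W_jk, dW_jk, g_jk), (W_ij, dW_ij, g_ij)):
            for r in range(len(W)):
                for c in range(len(W[r])):
                    # dW(n) = alpha * dW(n-1) + eta * delta * y
                    dW[r][c] = alpha * dW[r][c] + eta * g[r][c]
                    W[r][c] += dW[r][c]
        err = total_error(val_set, W_jk, W_ij)
        if err < best_err:
            best_err, bad_epochs = err, 0
            best = ([r[:] for r in W_jk], [r[:] for r in W_ij])
        else:
            bad_epochs += 1
            if bad_epochs >= patience:  # validation error stopped improving
                break
    return best[0], best[1], best_err
```

Returning the best weights seen so far, rather than the final ones, is one common way to make early stopping robust to small fluctuations in the validation error; other choices are possible.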
Stop training when the error goes up on the validation set

Stopping criteria

Sensible stopping criteria:
• Total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
• Generalization-based criterion: After each epoch the NN is tested for generalization. If the generalization performance is adequate then stop. If this stopping criterion is used, then the part of the training set used for testing the network's generalization will not be used for updating the weights.

Architectural Considerations

What is the right size network for a given job? How many hidden units?
• Too many: no generalization
• Too few: no solution
Possible answer: a constructive algorithm, e.g. Cascade Correlation (Fahlman & Lebiere 1990), etc.

Network Topology

The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error. Two types of adaptive algorithms can be used:
• Start from a large network and successively remove some nodes and links until network performance degrades.
• Begin with a small network and introduce new neurons until performance is satisfactory.

Problems and Networks

• Some problems have natural "good" solutions
• Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed
• Networks are general purpose tools.
• Choice of network type, training, architecture, etc. greatly influences the chances of successfully solving a problem
• Tension: tailoring tools for a specific job vs. exploiting a general purpose learning mechanism

Summary

• Multiple layer feed-forward networks
• Replace Step with Sigmoid (differentiable) function
• Learn weights by gradient descent on error function
• Backpropagation algorithm for learning
• Avoid overfitting by early stopping

ALVINN drives 70mph on highways

Use MLP Neural Networks when …

• (vectored) Real inputs, (vectored) real outputs
• You’re not interested in understanding how it works
• Long training times acceptable
• Short execution (prediction) times required
• Robust to noise in the dataset

Applications of FFNN

Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems.
• Recognizing printed or handwritten characters
• Face recognition
• Classification of loan applications into credit-worthy and non-credit-worthy groups
• Analysis of sonar and radar data to determine the nature of the source of a signal

Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets

• Recurrent Architectures
• Backprop through time

Elman Nets & Jordan Nets

[Figure: Elman net and Jordan net diagrams; labels: Input, Context, Hidden, Output, copy-connection weight 1, and α]

• Updating the context as we receive input
• In Jordan nets we model “forgetting” as well
• The recurrent connections have fixed weights
• You can train these networks using good ol’ backprop

Recurrent Backprop

[Figure: a recurrent network over nodes a, b, c with weights w1 to w4, shown next to the same network unrolled for 3 iterations]

• we’ll pretend to step through the network one iteration at a time
• backprop as usual, but average equivalent weights (e.g.
all 3 highlighted edges on the right are equivalent)

Connectionist Models in Cognitive Science

[Figure labels: Structured, PDP (Elman), Hybrid, Neural, Conceptual, Existence, Data Fitting]

5 levels of Neural Theory of Language

[Figure: levels ordered by abstraction, with course milestones Quiz, Midterm, Finals]
• Cognition and Language: Psycholinguistic experiments, Spatial Relation, Motor Control, Metaphor, Grammar
• Computation
• Structured Connectionism: Triangle Nodes, Neural Net and learning, SHRUTI
• Computational Neurobiology
• Biology: Neural Development

The Color Story: A Bridge between Levels of NTL

(http://www.ritsumei.ac.jp/~akitaoka/color-e.html)

A Tour of the Visual System

• two regions of interest:
– retina
– LGN

The Physics of Light

Light: Electromagnetic energy whose wavelength is between 400 nm and 700 nm. (1 nm = 10^-9 meter)

© Stephen E. Palmer, 2002

The Physics of Light

Some examples of the spectra of light sources

[Figure: four panels plotting # Photons against Wavelength (nm), 400 to 700 nm: A. Ruby Laser; B. Gallium Phosphide Crystal; C. Tungsten Lightbulb; D. Normal Daylight]

The Physics of Light

Some examples of the reflectance spectra of surfaces

[Figure: four panels plotting % Photons Reflected against Wavelength (nm), 400 to 700 nm: Red, Yellow, Blue, Purple]

The Psychophysical Correspondence

There is no simple functional description for the perceived color of all lights under all viewing conditions, but …

A helpful constraint: consider only physical spectra with normal distributions

[Figure: a normal distribution of # Photons over Wavelength (nm), 400 to 700 nm, annotated with its mean, variance, and area]

Physiology of Color Vision

Two types of light-sensitive receptors

Cones
• cone-shaped
• less sensitive
• operate in high light
• color vision

Rods
• rod-shaped
• highly sensitive
• operate at night
• gray-scale vision

The Microscopic View

Rods and Cones in the Retina

http://www.iit.edu/~npr/DrJennifer/visual/retina.html

What Rods and Cones Detect

Notice how they aren’t distributed evenly, and the rod is more sensitive to shorter wavelengths

Physiology of Color Vision

Three kinds of cones: S, M, and L, with peak absorbance at 440, 530, and 560 nm

[Figure: absorption spectra of the S, M, and L cones; Relative Absorbance (%) against Wavelength (nm), 400 to 650 nm]

Opponent Processes:
• R/G = L-M
• G/R = M-L
• B/Y = S-(M+L)
• Y/B = (M+L)-S

Implementation of Trichromatic theory

Physiology of Color Vision

Double Opponent Cells in V1

[Figure: Red/Green cells with R+G- and G+R- regions; Blue/Yellow cells with B+Y- and Y+B- regions]

Color Blindness

Not everybody perceives colors in the same way!

What numbers do you see in these displays?

Theories of Color Vision

A Dual Process Wiring Diagram

[Figure: wiring from the Trichromatic Stage (S, M, L cones) through excitatory (+) and inhibitory (-) connections to the Opponent Process Stage; channel labels B+Y-, Y+B-, R+G-, G+R-, W+Bk-, Bk+W-; cone combinations S-M-L, -S+M+L, L-M, M-L, S+M+L, -S-M-L]

Color Naming

Basic Color Terms (Berlin & Kay)

Criteria:
1. Single words -- not “light-blue” or “blue-green”
2. Frequently used -- not “mauve” or “cyan”
3. Refer primarily to colors -- not “lime” or “gold”
4. Apply to any object -- not “roan” or “blond”

Color Naming

BCTs in English: Red, Green, Blue, Yellow, Black, White, Gray, Brown, Purple, Orange*, Pink

Color Naming

Five more BCTs in a study of 98 languages: Light-Blue, Warm, Cool, Light-Warm, Dark-Cool

The WCS Color Chips

• Basic color terms:
– Single word (not blue-green)
– Frequently used (not mauve)
– Refers primarily to colors (not lime)
– Applies to any object (not blonde)

FYI: English has 11 basic color terms

Results of Kay’s Color Study

Stage I: W or R or Y; Bk or G or Bu
Stage II: W; R or Y; Bk or G or Bu
Stage IIIa: W; R or Y; G or Bu; Bk
Stage IIIb: W; R; Y; Bk or G or Bu
Stage IV: W; R; Y; G or Bu; Bk
Stage V: W; R; Y; G; Bu; Bk
Stage VI: W; R; Y; G; Bu; Bk; Y+Bk (Brown)
Stage VII: W; R; Y; G; Bu; Bk; Y+Bk (Brown); R+W (Pink); R+Bu (Purple); R+Y (Orange); Bk+W (Grey)

If you group languages by the number of basic color terms they have, then as the number of color terms increases, the additional terms specify focal colors

Color Naming

Typical “developmental” sequence of BCTs
• 2 Terms: Light-warm, Dark-cool
• 3 Terms: White, Warm, Dark-cool
• 4 Terms: White, Warm, Black, Cool
• 5 Terms: White, Red, Yellow, Black, Cool
• 6 Terms: White, Red, Yellow, Black, Green, Blue

Color Naming (Berlin & Kay)

Studied color categories in two ways:
• Boundaries
• Best examples

Color Naming (Rosch)

• Memory: Focal colors are remembered better than nonfocal colors.
• Learning: New color categories centered on focal colors are learned faster.
• Categorization: Focal colors are categorized more quickly than nonfocal colors.

Color Naming

FUZZY SETS AND FUZZY LOGIC
• Fuzzy set theory (Zadeh)
• A fuzzy logical model of color naming (Kay & McDaniel)

[Figure: Degree of Membership in "Green" against Hue, from 0 (not-at-all) through "a little bit", "sorta", and "very" to 1.0 (extremely)]

Color Naming

“Primary” color categories

[Figure: Degree of Membership (0 to 1) against Hue for Blue, Green, Yellow, and Red, each peaking at its focal color: focal blue, focal green, focal yellow, focal red]

Color Naming

“Primary” color categories: Red, Green, Blue, Yellow, Black, White

Color Naming

“Derived” color categories

Fuzzy logical “ANDf”

[Figure: Degree of Membership (0 to 1) against Hue for Yellow and Red, and for their fuzzy intersection, Orange]

Color Naming

“Derived” color categories
• Orange = Red ANDf Yellow
• Purple = Red ANDf Blue
• Gray = Black ANDf White
• Pink = Red ANDf White
• Brown = Yellow ANDf Black
• (Goluboi = Blue ANDf White)

Color Naming

“Composite” color categories

Fuzzy logical “ORf”
• Warm = Red ORf Yellow
• Cool = Blue ORf Green
• Light-warm = White ORf Warm
• Dark-cool = Black ORf Cool

Color Naming

FUZZY LOGICAL MODEL OF COLOR NAMING (Kay & McDaniel)

Only 16 Basic Color Terms in Hundreds of Languages:
• PRIMARY (fuzzy sets): Red, Green, Blue, Yellow, Black, White
• DERIVED (fuzzy ANDf): Orange, Purple, Brown, Pink, Gray, [Light-blue]
• COMPOSITE (fuzzy ORf): [Warm], [Cool], [Light-warm], [Dark-cool]

[Figure: Degree of Membership (0 to 1.0) against hue for each kind of category, e.g. Yellow (primary), Orange = Yellow ANDf Red (derived), Warm = Yellow ORf Red (composite)]
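
The primary / derived / composite construction can be illustrated with a small Python sketch. It assumes Zadeh's standard operators (min for ANDf, max for ORf); the slides do not define ANDf and ORf precisely, and Kay & McDaniel's own definitions may differ, for example in how a derived category is normalized. The triangular membership functions, the linear 0 to 100 hue scale, and the focal hue values are hypothetical and chosen only to keep the arithmetic simple.

```python
def triangular(focal, width):
    """Membership function that is 1.0 at the focal hue and falls linearly to 0 at focal +/- width."""
    def membership(hue):
        return max(0.0, 1.0 - abs(hue - focal) / width)
    return membership

def and_f(a, b):
    """Fuzzy AND (assumed here to be Zadeh's min)."""
    return lambda hue: min(a(hue), b(hue))

def or_f(a, b):
    """Fuzzy OR (assumed here to be Zadeh's max)."""
    return lambda hue: max(a(hue), b(hue))

# Hypothetical focal hues on an arbitrary linear scale (not taken from the slides).
red = triangular(focal=10, width=20)
yellow = triangular(focal=30, width=20)

orange = and_f(red, yellow)  # derived category: Orange = Red ANDf Yellow
warm = or_f(red, yellow)     # composite category: Warm = Red ORf Yellow

print(orange(20))  # 0.5: halfway between focal red and focal yellow
print(orange(10))  # 0.0: focal red is not orange at all
print(warm(10))    # 1.0: focal red is fully warm
```

Under these assumptions the derived category peaks at 0.5 rather than 1.0, which is one reason a model of this kind might rescale the intersection; the composite category, by contrast, reaches 1.0 at either focal color.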