Neural Networks

Learning Objectives
• Characteristics of neural nets
• Supervised learning – back-propagation
• Probabilistic nets

What is a Neural Network?
According to the DARPA Neural Network Study (1988, AFCEA International Press, p. 60):
"... a neural network is a system composed of many simple processing elements operating in parallel whose function is determined by network structure, connection strengths, and the processing performed at computing elements or nodes."

Characteristics of Neural Nets
The good news: they exhibit some brain-like behaviors that are difficult to program directly, such as:
• learning
• association
• categorization
• generalization
• feature extraction
• optimization
• noise immunity
The bad news:
• neural nets are black boxes
• they can be difficult to train in some cases

There is a wide range of neural network architectures:
• Multi-Layer Perceptron (Back-Prop Nets), 1974-85
• Neocognitron, 1978-84
• Adaptive Resonance Theory (ART), 1976-86
• Self-Organizing Map, 1982
• Hopfield, 1982
• Bi-directional Associative Memory, 1985
• Boltzmann/Cauchy Machine, 1985
• Counterpropagation, 1986
• Radial Basis Function, 1988
• Probabilistic Neural Network, 1988
• General Regression Neural Network, 1991
• Support Vector Machine, 1995

Our single "neuron" model
The inputs x_1, ..., x_n are weighted by w_1, ..., w_n and summed,
D = \sum_{i=1}^{n} w_i x_i
and the output is thresholded:
O = +1 if D \ge 0, and O = -1 if D < 0
The bias b is handled by an extra input x_{n+1} = -1 with weight w_{n+1} = b, so it can be learned like any other weight. The output separates class c_1 (O = +1) from class c_2 (O = -1).

Basic Neuron Model
The j-th neuron forms a weighted sum of the input features and bias,
D_j = \sum_i w_{ji} x_i
and passes it through a threshold activation function to produce its output h_j = f(D_j).
[Figure: feed-forward networks with one or two hidden layers between the input layer and an output layer with classes C1, C2, C3.]

Most neural nets use a smooth activation function
Instead of a hard threshold, the sum D_j = \sum_i w_{ji} x_i is passed through a sigmoidal activation function
f(z) = \frac{1}{1 + e^{-z}}
whose derivative has the convenient form f' = f(1 - f).

Major question: how do we adjust the weights to learn the mapping between inputs and outputs?
Answer: use the back-propagation algorithm, which is just an application of the chain rule of differential calculus.

Consider this simple example: a single input x feeds a hidden unit through weight w, and the hidden output h feeds the output unit through weight u:
h = f(wx),   y = f(uh) = f(u f(wx)) = F(u, w, x)

To learn the weights we can try to minimize the output error. That is, we start with an initial guess for the weights and then present an example of known input x and output value d (a training example). We then form the error
E = \frac{1}{2}(d - y)^2
and adjust the weights to reduce this error. Since
\Delta E = \frac{\partial E}{\partial u}\Delta u + \frac{\partial E}{\partial w}\Delta w
we can make \Delta E \le 0 by choosing
\Delta u = -\lambda \frac{\partial E}{\partial u},   \Delta w = -\lambda \frac{\partial E}{\partial w}   (\lambda ... a positive constant)
This leads to the adjustment rule
u(m+1) = u(m) - \mu \left.\frac{\partial E}{\partial u}\right|_{u(m)},   w(m+1) = w(m) - \mu \left.\frac{\partial E}{\partial w}\right|_{w(m)}
where \mu is the learning rate.

So we now need to find these derivatives of E with respect to the weights, with h = f(wx) and y = f(uh) = F(u, w, x) as before.
For the weight between the hidden layer and the output layer:
\frac{\partial E}{\partial u} = \frac{\partial}{\partial u}\left[\frac{1}{2}(d - y)^2\right] = \frac{\partial}{\partial y}\left[\frac{1}{2}(d - y)^2\right]\frac{\partial y}{\partial u} = (y - d)\, f'(uh)\, h = (y - d)\, y(1 - y)\, h
evaluated at u(m), w(m).
Similarly, for the weight between the input layer and the hidden layer:
\frac{\partial E}{\partial w} = \frac{\partial}{\partial y}\left[\frac{1}{2}(d - y)^2\right]\frac{\partial y}{\partial w} = (y - d)\, f'(uh)\, u\, h'(wx)\, x = (y - d)\, y(1 - y)\, u\, h(1 - h)\, x
evaluated at u(m), w(m).

Thus, the training algorithm is:
1. Initialize the weights to small random values.
2. Using a training set of known pairs of inputs and outputs (x, d), change the weights according to
   w(m+1) = w(m) - \mu (y - d)\, y(1 - y)\, h(1 - h)\, u x
   u(m+1) = u(m) - \mu (y - d)\, y(1 - y)\, h
   (both right-hand sides evaluated at u(m), w(m)), with
   y = f(uh),   h = f(wx)
   until the error E = \frac{1}{2}(d - y)^2 becomes sufficiently small.
[Figure: the error E decreases with iteration number; training stops once E is small enough.]
This is an example of supervised learning.
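To make the two-weight example concrete, here is a minimal Python sketch (not part of the original notes) that trains the scalar network y = f(u f(wx)) with the update rules derived above. The training pair (x, d), the initial weights, the learning rate, and the stopping tolerance are arbitrary illustrative choices.

```python
import math

def f(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# One training example (x, d) and small initial weights (values chosen
# arbitrarily for illustration).
x, d = 1.5, 0.8
w, u = 0.1, -0.2
mu = 0.5          # learning rate

for m in range(2000):
    # Forward pass: h = f(w x), y = f(u h)
    h = f(w * x)
    y = f(u * h)
    E = 0.5 * (d - y) ** 2
    if E < 1e-6:              # stop when the error is sufficiently small
        break
    # Back-propagated gradients from the chain rule:
    #   dE/du = (y - d) y (1 - y) h
    #   dE/dw = (y - d) y (1 - y) u h (1 - h) x
    dE_du = (y - d) * y * (1 - y) * h
    dE_dw = (y - d) * y * (1 - y) * u * h * (1 - h) * x
    # Gradient-descent weight updates
    u -= mu * dE_du
    w -= mu * dE_dw

print(f"after {m} iterations: y = {y:.4f}, target d = {d}, E = {E:.2e}")
```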
One of the most popular neural nets is a feed-forward net with one hidden layer trained by the back-propagation algorithm. It can be shown that, in principle, this type of network can represent an arbitrary input-output mapping or solve an arbitrary classification problem.

Back-propagation algorithm (three-layer feed-forward network)
The network has P input nodes x_i, M hidden nodes h_j, and K output nodes y_k, with weights w_{ji} from the inputs to the hidden nodes and u_{kj} from the hidden nodes to the outputs. The weight updates are

u_{kj}^{new} = u_{kj}^{old} - \mu (y_k - d_k)\, y_k (1 - y_k)\, h_j      (k = 1, ..., K;  j = 1, ..., M)

w_{ji}^{new} = w_{ji}^{old} - \mu \left[\sum_{k=1}^{K} (y_k - d_k)\, y_k (1 - y_k)\, u_{kj}^{old}\right] h_j (1 - h_j)\, x_i      (j = 1, ..., M;  i = 1, ..., P)

with

y_k = \frac{1}{1 + \exp\left(-\sum_{j=1}^{M} u_{kj}^{old} h_j\right)},   h_j = \frac{1}{1 + \exp\left(-\sum_{i=1}^{P} w_{ji}^{old} x_i\right)}
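A minimal NumPy sketch of these three-layer update equations is shown below (not part of the original notes). The layer sizes, the single random training pair, the learning rate, and the iteration count are arbitrary illustrative choices, and bias inputs are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: P inputs, M hidden nodes, K outputs.
P, M, K = 4, 6, 3
rng = np.random.default_rng(0)

# Weights: w[j, i] connects input i to hidden node j,
#          u[k, j] connects hidden node j to output k.
w = 0.1 * rng.standard_normal((M, P))
u = 0.1 * rng.standard_normal((K, M))
mu = 0.2                        # learning rate

# One (input, target) training pair, chosen arbitrarily.
x = rng.standard_normal(P)
d = np.array([1.0, 0.0, 0.0])   # e.g. a one-of-K class label

for _ in range(1000):
    # Forward pass
    h = sigmoid(w @ x)          # h_j = f(sum_i w_ji x_i)
    y = sigmoid(u @ h)          # y_k = f(sum_j u_kj h_j)

    # Output-layer terms: (y_k - d_k) y_k (1 - y_k)
    delta_out = (y - d) * y * (1 - y)
    # Hidden-layer terms: [sum_k delta_out_k u_kj] h_j (1 - h_j)
    delta_hid = (u.T @ delta_out) * h * (1 - h)

    # Weight updates, using the old weights on the right-hand side
    u_new = u - mu * np.outer(delta_out, h)   # u_kj update
    w_new = w - mu * np.outer(delta_hid, x)   # w_ji update
    u, w = u_new, w_new

print("final outputs:", np.round(sigmoid(u @ sigmoid(w @ x)), 3))
```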
Some issues associated with this "backprop" network:
1. design of the training, testing, and validation data sets
2. determination of the network structure
3. selection of the learning rate (\mu)
4. problems with under- or over-training
[Figure: error E versus training iterations for the training set and the testing set; learning should stop where the testing-set error starts to rise, beyond which the network is over-trained.]

Some important issues for neural networks
Pre-processing the data to provide:
• reduction of data dimensionality
• noise filtering or suppression
• enhancement or strengthening of relevant features
• centering the data within a sensory aperture or window
• scanning a window over the data
• invariance in the measurement space to translations, rotations, scale changes, and distortion
• data preparation: analog-to-digital conversion, data scaling, data normalization, thresholding

Some examples of pre-processing include:
• 1-D and 2-D FFTs
• Filtering
• Convolution Kernels
• Correlation Masks or Template Matching
• Autocorrelation
• Edge Detection and Enhancement
• Morphological Image Processing
• Fourier Descriptors
• Walsh, Hadamard, Cosine, Hartley, Hotelling, and Hough Transforms
• Higher-order spectra
• Homomorphic Transformations (e.g. cepstrums)
• Time-frequency transforms (Wavelet, Wigner-Ville, Zak)
• Linear Predictive Coding
• Principal Component Analysis
• Independent Component Analysis
• Geometric Moments
• Thresholding
• Data Sampling
• Scanning

Probabilistic Neural Network (PNN)
Basic idea: use the training samples themselves to obtain a representation of the probability distributions for each class, and then use the Bayes decision rule to make a classification.

The basis functions usually chosen are Gaussians:

f_i(x) = \frac{1}{(2\pi)^{p/2} \sigma^p M_i} \sum_{j=1}^{M_i} \exp\left[\frac{-(x - x_{ij})^T (x - x_{ij})}{2\sigma^2}\right]

where
i ... class number (i = 1, 2, ..., N)
j ... training pattern number
x_{ij} ... j-th training pattern from the i-th class
M_i ... number of training vectors in class i
p ... dimension of the vector x
\sigma ... smoothing factor (standard deviation, or width, of the Gaussians)
f_i(x) ... sum of Gaussians centered at each training pattern from the i-th class, representing the probability density of that class
[Figure: the probability density function for class i is built up from Gaussians centered on the training patterns x_{ij}.]

If we normalize the vectors x and x_{ij} to unit length, and assume the numbers of training samples from each class are in proportion to their a priori probabilities of occurrence, then we can take as our decision function

g_i(x) = M_i f_i(x) = \frac{1}{(2\pi)^{p/2} \sigma^p} \sum_{j=1}^{M_i} \exp\left[\frac{(x \cdot x_{ij}) - 1}{\sigma^2}\right]

Since we decide for a given class k based on

g_k(x) > g_i(x)   for all i = 1, 2, ..., N (i \ne k)

the common constant outside the sum makes no difference, and we can take

g_i(x) = \sum_{j=1}^{M_i} \exp\left[\frac{(x \cdot x_{ij}) - 1}{\sigma^2}\right]

This can now be easily implemented in a neural-network form: the input nodes carry x_1, ..., x_j, ..., x_p; the weights into the pattern units are just the elements of the training patterns x_{ij}; and, for each class i, a summation unit adds the outputs of that class's pattern units 1, ..., M_i to produce the class outputs g_1(x), ..., g_i(x), ..., g_N(x). A small Python sketch of this decision rule appears at the end of these notes.

Characteristics of the PNN
1. no training; the weights are just the training vectors themselves
2. the only parameter that needs to be found is the smoothing factor \sigma
3. the outputs are representative of the probabilities of each class directly
4. the decision surfaces are guaranteed to approach the Bayes-optimal boundaries as the number of training samples grows
5. "outliers" are tolerated
6. sparse samples are adequate for good network performance
7. the network can be updated as new training samples become available
8. it needs to store all the training samples, requiring a large memory
9. testing can be slower than with other nets

References
Specht, D.F., "Probabilistic neural networks," Neural Networks, 3, 109-118, 1990.
Zaknich, A., Neural Networks for Intelligent Signal Processing, World Scientific, 2003.
Haykin, S., Neural Networks: A Comprehensive Foundation, 2nd Ed., Prentice-Hall, 1999.
Bishop, C.M., Neural Networks for Pattern Recognition, Clarendon Press, 1995.

Resources
There are many, many neural network resources and tools available on the web. Some software packages:
MATLAB Neural Network Toolbox   www.mathworks.com
Neuroshell Classifier           www.wardsystems.com
ClassifierXL                    www.analyzerxl.com
Brainmaker                      www.calsci.com
Neurosolutions                  www.nd.com
Neuroxl                         www.neuroxl.com
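As a closing illustration of the PNN decision rule discussed above, here is a minimal Python sketch (not part of the original notes). The toy training vectors, class labels, and smoothing factor \sigma are arbitrary illustrative choices; the vectors are normalized to unit length as the derivation assumes, and the two classes have equal numbers of training samples.

```python
import numpy as np

def pnn_classify(x, train_vectors, train_labels, sigma=0.5):
    """Classify x with the simplified PNN decision function
    g_i(x) = sum_j exp[(x . x_ij - 1) / sigma^2],
    assuming unit-length vectors and class sizes proportional to the priors."""
    x = x / np.linalg.norm(x)
    scores = {}
    for c in np.unique(train_labels):
        Xc = train_vectors[train_labels == c]
        Xc = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)   # unit-length patterns x_ij
        dots = Xc @ x                                         # x . x_ij for each pattern
        scores[c] = np.sum(np.exp((dots - 1.0) / sigma**2))   # g_c(x)
    return max(scores, key=scores.get), scores

# Toy two-class example with arbitrary training vectors.
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.8, 0.3],    # class 1
              [0.1, 1.0], [0.2, 0.9], [0.3, 0.8]])   # class 2
labels = np.array([1, 1, 1, 2, 2, 2])

label, g = pnn_classify(np.array([0.95, 0.15]), X, labels, sigma=0.5)
print("predicted class:", label, "scores:", g)
```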