Introduction to Predictive Learning
LECTURE SET 6: Neural Network Learning
Electrical and Computer Engineering

OUTLINE
• Objectives
  - introduce biologically inspired NN learning methods for clustering, regression and classification
  - explain similarities and differences between statistical and NN methods
  - show examples using synthetic and real-life data
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and discussion

Brief history and motivation for ANN
• Huge interest in understanding the nature and mechanism of biological/human learning
• Biologists and psychologists do not adopt classical parametric statistical learning, because:
  - parametric modeling is not biologically plausible
  - biological information processing is clearly different from algorithmic models of computation
• Mid 1980's: growing interest in applying biologically inspired computational models to:
  - developing computer models (of the human brain)
  - various engineering applications
  → new field: Artificial Neural Networks (~1986–1987)
• ANNs represent nonlinear estimators implementing the ERM approach (usually with the squared-loss function)

History and motivation (cont'd)
• Relationship to the problem of inductive learning: a Generator of samples produces x, the System produces y, and the Learning Machine observes (x, y) — the same learning problem setting
• Neural-style learning algorithm:
  - on-line (flow-through)
  - simple processing
• Biological terminology: an input x arrives at a synapse with weight w and produces output y
  Hebbian rule: Δw ∝ x·y

Neural vs algorithmic computation
• Biological systems do not use the principles of digital circuits:

  Characteristic        Digital        Biological
  Connectivity          1~10           ~10,000
  Signal                digital        analog
  Timing                synchronous    asynchronous
  Signal propagation    feedforward    feedback
  Redundancy            no             yes
  Parallel processing   no             yes
  Learning              no             yes
  Noise tolerance       no             yes

Neural vs algorithmic computation
• Computers excel at algorithmic tasks (well-posed mathematical problems)
• Biological systems are superior to digital systems for ill-posed problems with noisy data
• Example: object recognition [Hopfield, 1987]
  PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons → ~2×10^13 bit operations per second
  OLD PC: ~10^7 gates, cycle time ~10^-7 sec, connectivity = 2 → ~2×10^14 bit operations per second
  Both have similar raw processing capability, but pigeons are better at recognition tasks

Neural terminology and artificial neurons
Some general descriptions of ANNs:
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
http://en.wikipedia.org/wiki/Neural_network
• McCulloch-Pitts neuron (1943): inputs $x_1, \dots, x_d$ (plus $x_0 = 1$) with weights $w_1, \dots, w_d$ and bias b produce the output
  $y = I\big((\mathbf{w}\cdot\mathbf{x}) + b > 0\big)$
• Threshold (indicator) function of the weighted sum of inputs

Goals of ANNs
• Develop models of computation inspired by biological systems
• Study the computational capabilities of networks of interconnected neurons
• Apply these models to real-life applications
Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs

Historical highlights of ANN
  1943       McCulloch-Pitts neuron
  1949       Hebbian learning
  1960's     Rosenblatt (perceptron), Widrow
  60's-70's  dominance of 'hard' AI
  1980's     resurgence of interest (PDP group, MLP, SOM etc.)
  1990's     connection to statistics/VC-theory
  2000's     mature field / fragmentation
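The McCulloch-Pitts threshold unit above is easy to state in code. The following is a minimal sketch (not from the lecture) of the indicator unit $y = I(\mathbf{w}\cdot\mathbf{x} + b > 0)$; the particular weights, bias, and AND-gate example are illustrative assumptions.

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """Threshold (indicator) unit: output 1 if the weighted sum of inputs plus bias is positive."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative example: a 2-input unit acting as a logical AND gate
w = np.array([1.0, 1.0])
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mcculloch_pitts(np.array(x), w, b))
```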
OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and discussion

Sequential estimation of model parameters
• Batch vs on-line (iterative) learning
  - algorithmic (statistical) approaches ~ batch
  - neural-network inspired methods ~ on-line
  BUT the only difference is at the implementation level (so both types of methods should yield similar generalization)
• Recall the ERM inductive principle (for regression):
  $R_{emp}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{x}_i, y_i, \mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(\mathbf{x}_i, \mathbf{w})\big)^2$
• Assume a dictionary parameterization with fixed basis functions:
  $\hat{y} = f(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j\, g_j(\mathbf{x})$

Sequential (on-line) least-squares minimization
• Training pairs (x(k), y(k)) are presented sequentially
• On-line update equations for minimizing the empirical risk (MSE) with respect to the parameters w (gradient-descent learning):
  $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \frac{\partial L\big(\mathbf{x}(k), y(k), \mathbf{w}\big)}{\partial \mathbf{w}}$
  where the gradient is computed via the chain rule:
  $\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}}\,\frac{\partial \hat{y}}{\partial w_j} = 2(\hat{y} - y)\, g_j(\mathbf{x})$
  and the learning rate $\gamma_k$ is a small positive value (decreasing with k)

On-line least-squares minimization algorithm
• Known as the delta rule (Widrow and Hoff, 1960): given initial parameter estimates w(0), update the parameters during each presentation of the k-th training sample (x(k), y(k))
• Step 1: forward pass computation
  $z_j(k) = g_j\big(\mathbf{x}(k)\big), \quad j = 1, \dots, m$
  $\hat{y}(k) = \sum_{j=1}^{m} w_j(k)\, z_j(k)$ — estimated output
• Step 2: backward pass computation
  $\delta(k) = \hat{y}(k) - y(k)$ — error term (delta)
  $w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k), \quad j = 1, \dots, m$

Neural network interpretation of the delta rule
• Forward pass: the unit computes $\hat{y}(k) = \sum_j w_j(k) z_j(k)$ from its inputs $z_1(k), \dots, z_m(k)$ (plus the constant input 1 with weight $w_0$)
• Backward pass: the error $\delta(k) = \hat{y}(k) - y(k)$ drives the weight updates $\Delta w_j(k) = -\gamma_k\,\delta(k)\, z_j(k)$, i.e. $w_j(k+1) = w_j(k) + \Delta w_j(k)$
• Biological learning analogy: a synapse with weight w, input x and output y, with the Hebbian rule Δw ∝ x·y

Theoretical basis for on-line learning
• Standard inductive learning: given training data $\mathbf{z}_1, \dots, \mathbf{z}_n$, find the model providing the minimum of the prediction risk
  $R(\omega) = \int L(\mathbf{z}, \omega)\, p(\mathbf{z})\, d\mathbf{z}$
• Stochastic approximation guarantees minimization of the risk (asymptotically):
  $\omega(k+1) = \omega(k) - \gamma_k\, \mathrm{grad}_{\omega}\, L\big(\mathbf{z}(k), \omega(k)\big)$
• under general conditions on the learning rate:
  $\lim_{k\to\infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$

Practical issues for on-line learning
• Given a finite training set (n samples) $\mathbf{z}_1, \dots, \mathbf{z}_n$, this set is presented sequentially to the learning algorithm many times. Each presentation of the n samples is called an epoch, and the process of repeated presentations is called recycling (of the training data)
• Learning rate schedule: initially set large, then slowly decreasing with k (the iteration number). Typically 'good' learning rate schedules are data-dependent.
• Stopping conditions:
  (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
  (2) early stopping can be used for complexity control

OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
  - MultiLayer Perceptron (MLP) networks
  - Radial Basis Function (RBF) networks
• Methods for unsupervised learning
• Summary and discussion

Multilayer Perceptrons (MLP)
• Recall the graphical NN representation for dictionary methods:
  $\hat{y} = \sum_{j=1}^{m} w_j z_j$  (W is m×1)
  $z_j = g(\mathbf{x}, \mathbf{v}_j) = s\Big(v_{j0} + \sum_{k=1}^{d} x_k v_{jk}\Big)$  (V is d×m)
  where s(t) is a sigmoid activation, e.g.
  $s(t) = \frac{1}{1 + \exp(-t)}$  or  $s(t) = \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)}$
• How to estimate the parameters (weights) via ERM?
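Before turning to MLP training on the next slides, here is a minimal sketch of the delta rule from the sequential-estimation slides above, applied to a dictionary model with fixed basis functions. The particular basis functions, learning-rate schedule, number of epochs, and synthetic sine-squared data are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed dictionary of basis functions g_j(x) (illustrative choice)
def basis(x):
    return np.array([1.0, x, x**2, np.sin(2 * np.pi * x)])

# Synthetic training data (illustrative): sine-squared target plus Gaussian noise
n = 30
X = rng.uniform(0, 1, n)
Y = np.sin(2 * np.pi * X) ** 2 + rng.normal(0, 0.1, n)

m = len(basis(0.0))
w = np.zeros(m)                        # initial parameter estimates w(0)
n_epochs = 50                          # repeated presentations (recycling)

k = 0
for epoch in range(n_epochs):
    for x, y in zip(X, Y):
        k += 1
        gamma = 1.0 / (10 + k)         # decreasing learning-rate schedule
        z = basis(x)                   # Step 1: forward pass, z_j(k) = g_j(x(k))
        y_hat = w @ z                  #         estimated output
        delta = y_hat - y              # Step 2: error term (delta)
        w -= gamma * delta * z         #         w_j(k+1) = w_j(k) - gamma_k * delta(k) * z_j(k)

print("estimated weights:", w)
```

The recycling over epochs and the decreasing rate γ_k follow the practical-issues slide above; the stopping rule (gradient threshold or early stopping) is omitted for brevity.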
Learning for a single neuron (the delta rule)
• Forward pass: $\hat{y}(k) = \sum_j w_j(k) z_j(k)$, computed from inputs $z_1(k), \dots, z_m(k)$ (and the constant input with weight $w_0$)
• Backward pass: $\delta(k) = \hat{y}(k) - y(k)$ and $w_j(k+1) = w_j(k) - \gamma_k\,\delta(k)\, z_j(k)$
• How to implement gradient-descent learning in a network of neurons?

Backpropagation training
• Minimization of
  $R_{emp} = \sum_{i=1}^{n} \big(f(\mathbf{x}_i, W, V) - y_i\big)^2$
  with respect to the parameters (weights) W, V
• Gradient-descent optimization for k = 1, ..., n, ...:
  $V(k+1) = V(k) - \gamma_k\, \mathrm{grad}_V\, L\big(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k)\big)$
  $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \mathrm{grad}_{\mathbf{w}}\, L\big(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k)\big)$
  where $L\big(\mathbf{x}(k), y(k), V(k), \mathbf{w}(k)\big) = \frac{1}{2}\big(f(\mathbf{x}, \mathbf{w}, V) - y\big)^2$
• Careful application of gradient descent leads to the backpropagation algorithm

Backpropagation: forward pass
For training input x(k), estimate the predicted output $\hat{y}(k)$

Backpropagation: backward pass
Update the weights by propagating the error

Details of backpropagation
• Sigmoid activation $s(t) = \frac{1}{1 + \exp(-t)}$ has the simple derivative $s'(t) = s(t)\big(1 - s(t)\big)$
• Poor behaviour for large t ~ saturation. How to avoid saturation?
  - proper initialization (small weights)
  - pre-scaling of inputs (zero mean, unit variance)
• Learning rate schedule (initial, final)
• Stopping rules, number of epochs
• Number of hidden units

Regularization effect of backpropagation
• Backpropagation ~ iterative optimization
• The final model (weights) depends on:
  - the initial point + the final point (stopping rules)
  → initialization and/or stopping rules can be used for model complexity control
  [Plot: training and validation MSE vs number of epochs, with the 'Stop Training' point marked]

Various forms of complexity control
• MLP topology ~ number of hidden units
• Constraints on parameters (weights) ~ weight decay
• Type of optimization algorithm (many versions of backprop., other optimization methods)
• Stopping rules
• Initial conditions (initial 'small' weights)
• Multiple factors make it difficult to control complexity; usually one complexity parameter is varied while keeping all others fixed

Example: univariate regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• MLP network (two hidden units) → underfitting
  [Plot: fitted model and training data, Y vs X]

Example: univariate regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• MLP network (five hidden units) → near optimal
  [Plot: fitted model and training data, Y vs X]

Example: univariate regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• MLP network (20 hidden units) → little overfitting
  [Plot: fitted model and training data, Y vs X]

Backpropagation for classification
• The original MLP is for regression (as shown): $\hat{y} = \sum_{j=1}^{m} w_j z_j$ with hidden units $z_j = g(\mathbf{x}, \mathbf{v}_j)$, W is m×1, V is d×m
• For classification:
  - use a sigmoid output unit (~ logistic regression using the log-likelihood loss — see textbook)
  - during training, use real values 0/1 for the class labels
  - during operation, threshold the output of the trained MLP classifier at 0.5 to predict class labels
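A minimal sketch of on-line backpropagation for a single-hidden-layer MLP with sigmoid hidden units, a linear output unit, and squared loss, as described above. The network size, small-weight initialization, constant learning rate, number of epochs, and synthetic data are illustrative assumptions; in practice a decreasing learning-rate schedule and a stopping rule would be used, as discussed on the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

# Synthetic univariate regression data (illustrative): sine-squared target + noise
n = 30
X = rng.uniform(0, 1, (n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, n)

d, m = X.shape[1], 5                  # input dimension, number of hidden units
V = rng.normal(0, 0.1, (d + 1, m))    # hidden-layer weights (incl. bias row), small initial values
w = rng.normal(0, 0.1, m + 1)         # output weights (incl. bias), small initial values

gamma, n_epochs = 0.05, 2000
for epoch in range(n_epochs):
    for x, y in zip(X, Y):
        xb = np.append(1.0, x)        # add the constant input x0 = 1
        # Forward pass
        z = sigmoid(xb @ V)           # hidden-unit outputs z_j = s(v_j0 + sum_k x_k v_jk)
        zb = np.append(1.0, z)
        y_hat = w @ zb                # linear output unit
        # Backward pass: gradients of L = 0.5 * (y_hat - y)^2
        delta = y_hat - y
        grad_w = delta * zb
        delta_hidden = delta * w[1:] * z * (1.0 - z)   # uses s'(t) = s(t)(1 - s(t))
        grad_V = np.outer(xb, delta_hidden)
        w -= gamma * grad_w
        V -= gamma * grad_V

mse = np.mean([(w @ np.append(1.0, sigmoid(np.append(1.0, x) @ V)) - y) ** 2
               for x, y in zip(X, Y)])
print("training MSE:", mse)
```

Varying the number of hidden units m plays the role of the complexity parameter discussed in the examples above.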
Classification example (Ripley's data set)
• Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
• MLP classifier (two hidden units) → underfitting
  [Plot: decision boundary and training data]

Classification example
• MLP classifier (three hidden units) → near-optimal solution
  [Plot: decision boundary and training data]

Classification example
• MLP classifier (six hidden units) → some overfitting
  [Plot: decision boundary and training data]

MLP software
• MLP software is widely available in the public domain
• Can handle multi-class problems
• For example, the Netlab toolbox (in Matlab) at
  http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
• Many commercial products (full of marketing hype):
  'Nearly 80% Accurate Market Forecasting Software. Get FREE up to date predictions and see for yourself!'

NetTalk (Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation:
http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
• Goal: learning to read (English text) aloud, i.e. learn the mapping English text → phonemes using an MLP classifier network
• Network inputs encode a 7-letter window (the 4th letter, in the middle, is the one to be pronounced)
• Network outputs (26 units) encode phonemes that drive a speech synthesizer
• The MLP network is trained using labeled data (both individual words and unrestricted text)

NetTalk architecture
Input encoding: 7×29 = 203 units
Output encoding: 26 units (phonemes)
Hidden layer: 80 hidden units

Listening to NetTalk-generated speech
Listen to the tape recordings illustrating NETtalk operation, available on YouTube:
http://www.youtube.com/watch?v=gakJlr3GecE
These three recordings contain 3 different audio outputs of NETtalk:
(a) during the first 5 minutes of training, starting with weights initialized to zero;
(b) after training using the set of 10,000 words — this training corresponds to 20 passes (epochs) over the 500-word text;
(c) generated with new text input that was not part of the training set.
After listening to these recordings, answer and comment on the following questions:
- Can you recognize words in recordings (a), (b) and (c)? Explain why.
- Compare the quality of outputs (b) and (c). Which one seems closer to human speech, and why?
Question for discussion (Problem 6.8): why does NETtalk use a seven-letter window?

Radial Basis Function (RBF) networks
• Dictionary parameterization:
  $f_m(\mathbf{x}) = \sum_{j=1}^{m} w_j\, g\!\left(\frac{\|\mathbf{x} - \mathbf{v}_j\|}{\sigma_j}\right) + w_0$
  i.e. $\hat{y} = \sum_{j=1}^{m} w_j z_j$ with $z_j = g(\mathbf{x}, \mathbf{v}_j)$, W is m×1, V is d×m
  - each basis function is (usually) local, with center $\mathbf{v}_j$ and width $\sigma_j$
  - e.g. Gaussian: $g(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{v}\|^2}{2\sigma^2}\right)$
• Typically used for regression or classification

RBF network training
• RBF training (learning) ~ estimation of
  (1) RBF parameters (centers, widths)
  (2) linear weights w
• Non-adaptive implementation:
  (1) estimate the RBF parameters via unsupervised learning (using only the x-values of the training data) — can use SOM, GLA etc.
  (2) estimate the weights w via linear least squares
• Advantages:
  - fast training
  - useful when x-samples are plentiful but labeled (x, y) data are few
• Limitations:
  - cannot discard irrelevant inputs
  - the curse of dimensionality

Non-adaptive RBF training algorithm
1. Choose the number of basis functions (centers) m.
2. Estimate the centers $\mathbf{v}_j$ using the x-values of the training data, via unsupervised learning (SOM, GLA, clustering etc.).
3. Determine the width parameters $\sigma_j$ using the heuristic: for a given center $\mathbf{v}_j$,
   (a) find the distance to the closest other center: $r_j = \min_{k \neq j} \|\mathbf{v}_k - \mathbf{v}_j\|$
   (b) set the width parameter $\sigma_j = \lambda\, r_j$, where the parameter λ controls the degree of overlap between adjacent basis functions; typically $1 \le \lambda \le 3$.
4. Estimate the weights w via linear least squares (minimization of the empirical risk MSE).
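A minimal sketch of the non-adaptive RBF training algorithm above: centers chosen by unsupervised learning (plain k-means is used here in place of GLA/SOM), widths set by the nearest-center heuristic, and output weights estimated by linear least squares. The data, the number of units m, and the overlap parameter λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic univariate regression data (illustrative)
n = 30
X = rng.uniform(0, 1, (n, 1))
Y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, n)

m, lam = 5, 2.0                       # number of RBF centers, overlap parameter lambda in [1, 3]

# Steps 1-2: choose m centers via unsupervised learning (plain k-means stands in for GLA/SOM)
centers = X[rng.choice(n, m, replace=False)].copy()
for _ in range(50):
    labels = np.argmin(np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2), axis=1)
    for j in range(m):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)

# Step 3: width heuristic sigma_j = lambda * distance to the closest other center
dists = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
np.fill_diagonal(dists, np.inf)
sigma = lam * dists.min(axis=1)

# Step 4: estimate the linear weights w (plus constant term w0) by least squares
def design(Xq):
    D = np.linalg.norm(Xq[:, None, :] - centers[None, :, :], axis=2)
    Z = np.exp(-D ** 2 / (2 * sigma ** 2))         # Gaussian basis-function outputs
    return np.hstack([Z, np.ones((len(Xq), 1))])   # last column ~ constant term w0

w, *_ = np.linalg.lstsq(design(X), Y, rcond=None)
print("training MSE:", np.mean((design(X) @ w - Y) ** 2))
```

The same design-matrix function can be reused to evaluate the trained model on new inputs.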
RBF network complexity control
RBF model complexity can be controlled by:
• the number of RBFs — goal: select the optimal number of units (RBFs)
• the RBF width — goal: select the optimal width parameter (for a large number of RBFs)
• penalization of large weights w
See the toy examples next (using the number of units as the complexity parameter).

Example: RBF regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• RBF network with automatic width selection (via cross-validation): 2 RBFs → underfitting
  [Plot: fitted model and training data, Y vs X]

Example: RBF regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• RBF network with automatic width selection: 5 RBFs → ~ optimal
  [Plot: fitted model and training data, Y vs X]

Example: RBF regression
• Data set: 30 samples generated using a sine-squared target function with Gaussian noise (standard deviation 0.1)
• RBF network with automatic width selection: 20 RBFs → overfitting
  [Plot: fitted model and training data, Y vs X]

RBF classification example (Ripley's data)
• Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
• RBF classifier (4 units) → some underfitting
  [Plot: decision boundary and training data]

RBF classification example (cont'd)
• RBF classifier (9 units) → optimal
  [Plot: decision boundary and training data]

RBF classification example (cont'd)
• RBF classifier (25 units) → little overfitting
  [Plot: decision boundary and training data]

OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
  - clustering and vector quantization
  - Self-Organizing Maps (SOM)
  - application example
• Summary and discussion

Overview
Recall from Lecture Set 2: unsupervised learning ~ the data reduction approach.
• Example: training data represented by 3 'centers'

Two types of problems
1. Data reduction: VQ + clustering
   'Model' ~ m points. Vector quantizer Q:
   $f(\mathbf{x}, Q) = \sum_{j=1}^{m} \mathbf{c}_j\, I(\mathbf{x} \in R_j)$
   VQ setting: given n training samples $X = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n\}$, find the coordinates $\mathbf{c}_j$ of the m centers (prototypes) such that the total squared error (distortion) is minimized:
   $R(Q) = \int \|\mathbf{x} - f(\mathbf{x}, Q)\|^2\, p(\mathbf{x})\, d\mathbf{x}$

2. Dimensionality reduction (linear or nonlinear):
   'Model' ~ projection of high-dimensional data onto a low-dimensional space.
   Note: the goal is to estimate a mapping from the d-dimensional input space (d = 2) to a low-dimensional feature space (d* = 1), minimizing
   $R = \int \|\mathbf{x} - f(\mathbf{x})\|^2\, p(\mathbf{x})\, d\mathbf{x}$

Vector quantization and clustering
• Two complementary goals of VQ:
  1. partition the input space into disjoint regions
  2. find the positions of the units (coordinates of the prototypes)
  Note: the optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions)

Generalized Lloyd Algorithm (GLA) for VQ
Given data points $\mathbf{x}(k),\ k = 1, 2, \dots$, a loss function L (e.g., squared loss) and initial centers $\mathbf{c}_j(0),\ j = 1, \dots, m$, perform the following updates upon presentation of $\mathbf{x}(k)$:
1. Find the nearest center to the data point (the winning unit):
   $j = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\|$
2. Update the winning unit coordinates (only) via
   $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k\big(\mathbf{x}(k) - \mathbf{c}_j(k)\big)$
Increment k and iterate steps (1)–(2) above.
Note:
- the learning rate $\gamma_k$ decreases with the iteration number k
- biological interpretations of steps (1)–(2) exist
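A minimal sketch of the flow-through GLA just described. The ring-shaped ('doughnut') data, the number of units, the random order of presentation, and the linearly decreasing learning-rate schedule are illustrative assumptions, roughly matching the examples on the following slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 2-D data: ring ("doughnut") distribution
n = 300
angle = rng.uniform(0, 2 * np.pi, n)
radius = rng.normal(1.0, 0.1, n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]

m = 5
centers = X[rng.choice(n, m, replace=False)].copy()      # initial centers c_j(0)

k_max = 10 * n
for k in range(k_max):
    x = X[rng.integers(n)]                                # present a data point x(k)
    gamma = 0.5 * (1.0 - k / k_max)                       # decreasing learning rate
    j = np.argmin(np.linalg.norm(x - centers, axis=1))    # Step 1: winning unit
    centers[j] += gamma * (x - centers[j])                # Step 2: move the winner only

print(centers)
```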
Batch version of GLA
Given data points $\mathbf{x}_i,\ i = 1, \dots, n$, a loss function L (e.g., squared loss) and initial centers $\mathbf{c}_j(0),\ j = 1, \dots, m$, iterate the following two steps:
1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest-neighbor rule. Partitioning matrix Q:
   $q_{ij} = 1$ if $L\big(\mathbf{x}_i, \mathbf{c}_j(k)\big) = \min_l L\big(\mathbf{x}_i, \mathbf{c}_l(k)\big)$, and $q_{ij} = 0$ otherwise
2. Update the unit coordinates as the centroids of the data:
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij}\,\mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, \dots, m$
Note: the final solution may depend on the initialization (local minima) — a potential problem for both the on-line and the batch GLA.

Numeric example of univariate VQ
Given data {2, 4, 10, 12, 3, 20, 30, 11, 25}, set m = 2.
• Initialization (random): c1 = 3, c2 = 4
• Iteration 1
  Projection: P1 = {2, 3}, P2 = {4, 10, 12, 20, 30, 11, 25}
  Expectation (averaging): c1 = 2.5, c2 = 16
• Iteration 2
  Projection: P1 = {2, 3, 4}, P2 = {10, 12, 20, 30, 11, 25}
  Expectation (averaging): c1 = 3, c2 = 18
• Iteration 3
  Projection: P1 = {2, 3, 4, 10}, P2 = {12, 20, 30, 11, 25}
  Expectation (averaging): c1 = 4.75, c2 = 19.6
• Iteration 4
  Projection: P1 = {2, 3, 4, 10, 11, 12}, P2 = {20, 30, 25}
  Expectation (averaging): c1 = 7, c2 = 25
• Stop, as the algorithm has stabilized with these values.

GLA example 1
• Modeling a doughnut distribution using 5 units: (a) initialization, (b) final positions (of the units)

GLA example 2
• Modeling a doughnut distribution using 3 units: bad initialization → poor local minimum

GLA example 3
• Modeling a doughnut distribution using 20 units: 7 units were never moved by the GLA → the problem of unused units (dead units)

Avoiding local minima with GLA
• Start with many random initializations, then choose the best GLA solution
• Conscience mechanism: force 'dead' units to participate in the competition by keeping a frequency count (of past winnings) for each unit, i.e. for the on-line version of GLA, in Step 1:
  $j = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\| \cdot freq_i(k)$
• Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data

Clustering methods
• Clustering: separating a data set into several groups (clusters) according to some measure of similarity
• Goals of clustering:
  - interpretation (of the resulting clusters)
  - exploratory data analysis
  - preprocessing for supervised learning
  - often the goal is not formally stated
• VQ-style methods (GLA) are often used for clustering, i.e. k-means or c-means
• Many other clustering methods exist as well

Clustering (cont'd)
• Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
  - similarity ~ a distance metric dist(i, j)
  - usually k is given a priori (but not always!)
• Intuitive motivation: similar objects go into one cluster, dissimilar objects into different clusters → the goal is not formally stated
• The similarity (distance) measure is critical, but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
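As a complement to the univariate VQ example above, here is a minimal sketch of the batch GLA (equivalently, k-means) on the same data and with the same initial centers; it should reproduce the iterations shown and stop at centers 7 and 25. As noted above, with a different initialization the algorithm may end in a different local minimum.

```python
import numpy as np

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
centers = np.array([3.0, 4.0])                 # initial centers c1, c2

for it in range(20):
    # Step 1: partition the data by the nearest-neighbor rule
    labels = np.argmin(np.abs(X[:, None] - centers[None, :]), axis=1)
    # Step 2: update each center as the centroid of its cluster
    # (assumes no cluster becomes empty, which holds for this example)
    new_centers = np.array([X[labels == j].mean() for j in range(len(centers))])
    if np.allclose(new_centers, centers):      # stop when the centers have stabilized
        break
    centers = new_centers

print(centers)    # expected to stabilize at approximately [7., 25.]
```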
Self-Organizing Maps: history and biological motivation
• The brain changes its internal structure to reflect life experiences → interaction with the environment is critical at early stages of brain development (the first 1-2 years of life)
• Existence of various regions (maps) in the brain
• How are these maps formed? i.e., what information-processing model leads to map formation?
• T. Kohonen (early 1980's) proposed the SOM

Goal of SOM
• Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (called a map)
• The feature space (Z-space) is 1D or 2D and is discretized as a number of units, e.g. a 10×10 map
• The Z-space has a distance metric → an ordering of the units
• Similarities and differences between VQ and SOM: the data X is mapped into the feature space via G(X) = Z and mapped back via F(Z) = X̂

Self-Organizing Map
Discretization of 2D space via a 10×10 map. In this discrete space, distance relations exist between all pairs of units.
Distance relation ~ map topology (units in the 2D feature space).

SOM algorithm (flow-through)
Given data points $\mathbf{x}(k),\ k = 1, 2, \dots$, a distance metric in the input space (~ Euclidean), a map topology (in z-space), and initial positions of the units (in x-space) $\mathbf{c}_j(0),\ j = 1, \dots, m$, perform the following updates upon presentation of $\mathbf{x}(k)$:
1. Find the nearest unit to the data point (the winning unit, denoted z(k)):
   $z(k) = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\|$
2. Update all units around the winning unit via
   $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k\, K_k\big(z_j, z(k)\big)\big(\mathbf{x}(k) - \mathbf{c}_j(k)\big)$
Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1)–(2) above.

SOM example (one iteration)
Step 1: find the winning unit; Step 2: move the winner and its map neighbors towards the data point.

SOM example (next iteration)
Steps 1 and 2 are repeated with a smaller neighborhood → final map.

Hyper-parameters of SOM
SOM performance depends on parameters (~ user-defined):
• map dimension and topology (usually 1D or 2D)
• number of SOM units ~ quantization level (of the z-space)
• neighborhood function $K_k\big(z_j, z(k)\big)$ ~ usually rectangular or Gaussian (the shape is not important), e.g.
  $K_k\big(z_j, z(k)\big) = \exp\!\left(-\frac{\|z_j - z(k)\|^2}{2\sigma_k^2}\right)$
• neighborhood width decrease schedule (important), e.g. exponential decrease with user-defined initial and final widths:
  $\sigma_k = \sigma_{initial}\left(\frac{\sigma_{final}}{\sigma_{initial}}\right)^{k/k_{max}}$
  (a linear decrease of the neighborhood width is also used)
• learning rate schedule (important), e.g.
  $\gamma_k = \gamma_{initial}\left(\frac{\gamma_{final}}{\gamma_{initial}}\right)^{k/k_{max}}$
  (also a linear decrease)
Note: the learning rate and the neighborhood decrease should be set jointly.

Modeling a uniform distribution via SOM
(a) 300 random samples, (b) 10×10 map
SOM neighborhood: Gaussian
Learning rate: linear decrease, $\gamma(k) = 0.1\,(1 - k/k_{max})$

Position of SOM units: (a) initial, (b) after 50 iterations, (c) after 100 iterations, (d) after 10,000 iterations
[Four plots of the unit positions in the unit square]

Batch SOM (similar to batch GLA)
Given data points $\mathbf{x}_i$, a distance metric (e.g., squared loss), a map topology and initial centers $\mathbf{c}_j(0),\ j = 1, \dots, m$, iterate the following two steps:
1. Partition the data into clusters using the minimum-distance rule. This results in the assignment of the n samples to the m clusters (units) according to the assignment matrix Q:
   $q_{ij} = 1$ if $L\big(\mathbf{x}_i, \mathbf{c}_j(k)\big) = \min_l L\big(\mathbf{x}_i, \mathbf{c}_l(k)\big)$, and $q_{ij} = 0$ otherwise
2. Update the center coordinates as the weighted average of all data samples (in each cluster):
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} \mathbf{x}_i\, K\big(z_j, z_i\big)}{\sum_{i=1}^{n} K\big(z_j, z_i\big)}$
Decrease the neighborhood width, and iterate.

Example: effect of the final neighborhood width
[Plots of the final map for final neighborhood widths of 90%, 50% and 10%]

SOM applications
• Two types of applications:
  - vector quantization
  - clustering of multivariate data
• Main web site: http://www.cis.hut.fi/research/som-research/
• Numerous applications:
  - marketing surveys / segmentation
  - financial / stock market data
  - text data / document map — WEBSOM
  - image data / picture map — PicSOM
  (see the HUT web site)
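A minimal sketch of the flow-through SOM algorithm above, using a 10×10 map, a Gaussian neighborhood with exponentially decreasing width, and a linearly decreasing learning rate, roughly following the uniform-distribution example. The particular schedules, constants, and random presentation order are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 300 samples from the uniform distribution on the unit square
n, d = 300, 2
X = rng.uniform(0, 1, (n, d))

# 10x10 map: unit j has fixed coordinates z_j in the 2-D feature (map) space
side = 10
Z = np.array([[i, j] for i in range(side) for j in range(side)], dtype=float)
C = rng.uniform(0, 1, (side * side, d))              # initial unit positions in input space

k_max = 10000
sigma0, sigmaf = 3.0, 0.1                            # initial / final neighborhood width
for k in range(k_max):
    x = X[rng.integers(n)]                           # present a data point x(k)
    gamma = 0.1 * (1.0 - k / k_max)                  # linearly decreasing learning rate
    sigma = sigma0 * (sigmaf / sigma0) ** (k / k_max)   # exponentially decreasing width
    winner = np.argmin(np.linalg.norm(x - C, axis=1))   # Step 1: winning unit z(k)
    K = np.exp(-np.linalg.norm(Z - Z[winner], axis=1) ** 2 / (2 * sigma ** 2))
    C += gamma * K[:, None] * (x - C)                # Step 2: move all units toward x

print(C[:5])   # positions of the first few units after training
```

Setting the map to one dimension and feeding scaled multivariate data gives the kind of 1D SOM application shown on the following slides.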
Practical issues for SOM
• Pre-scaling of inputs, usually to the [0, 1] range. Why?
• Map topology: usually 1D or 2D
• Number of map units (per dimension)
• Learning rate schedule (for the on-line version)
• Neighborhood type and schedule: initial size (~1), final size
• The final neighborhood size + the number of units affect model complexity.

Modeling US states using a 1D SOM (performed by Feng Cai)
• Purpose: clustering of US states
• Data encoding: each state is described by 5 socio-economic indicators: obesity index, result of the 2004 presidential elections, median income, mean NAEP, IQ score
• Data scaling: each input is scaled independently to the [0, 1] range
• SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05

US states data
  State            Obesity index  Election_04  Median income  Mean NAEP  IQ score
  Hawaii                17             0           49775         238        94
  Colorado              17             1           49617         252       104
  Connecticut           18             0           53325         255        99
  Massachusetts         18             0           50587         257       111
  New Hampshire         18             1           53549         257       102
  Utah                  18             1           48537         250        89
  California            19             0           48113         238        94
  Maryland              19             0           55912         248        95
  New Jersey            19             0           53266         253       103
  Rhode Island          19             0           44311         245        89
  Vermont               19             0           41929         256       102
  Florida               19             1           38533         245        87
  Montana               19             1           33900         254       100
  Oregon                20             0           42704         250       100
  Arizona               20             1           41554         241        92
  Idaho                 20             1           38613         249        96
  New Mexico            20             0           35251         235        85
  Wyoming               20             1           40499         253       102
  Maine                 21             0           37654         253        99
  New York              21             0           42432         251        90
  Washington            21             0           44252         251        92
  South Dakota          21             1           38755         254       100
  Delaware              22             0           50878         250        90
  Illinois              22             0           45906         248        93
  Minnesota             22             0           54931         256       113
  Wisconsin             22             0           46351         252       105
  Nevada                22             1           46289         239        92
  Alaska                23             1           55412         245        92
  Iowa                  23             0           41827         253       109
  Kansas                23             1           42523         253       101
  Missouri              23             1           43955         251        92
  Nebraska              23             1           43566         251       101
  North Dakota          23             1           36717         254       111
  Ohio                  23             1           43332         252       107
  Oklahoma              23             1           35500         244        98
  Pennsylvania          24             0           43577         249        99
  Arkansas              24             1           32423         242        98
  Georgia               24             1           43316         243        93
  Indiana               24             1           41581         251       105
  North Carolina        24             1           38432         252       106
  Virginia              24             1           49974         253        99
  Michigan              25             0           45335         249        99
  Kentucky              25             1           37893         247        94
  Tennessee             25             1           36329         241        90
  Alabama               26             1           36771         236        90
  Louisiana             26             1           33312         238        99
  South Carolina        26             1           38460         246        87
  Texas                 26             1           40659         247        98
  Mississippi           27             1           32447         236        90
  West Virginia         28             1           30072         245        92

SOM Modeling 1 of US states
  Unit   States (assigned to each unit)
  1      HI, CA, MD, RI, NM
  2      OR, ME, NY, WA, DE, IL, PA, MI
  3      CT, MA, NJ, VT, MN, WI
  4      —
  5      CO, NH, MT, WY, SD
  6      KS, NE, ND, OH, IN, NC, VA
  7      UT, ID, AK, IA, MO
  8      FL, AZ, NV, OK, GA, KY, TX
  9      AR, TN, AL, LA, SC, MS, WV

SOM Modeling 2 of US states
- remove the voting input and apply a 1D SOM:
  Unit   States
  1      CO, CT, MA, NH, NJ, MN
  2      WI, IA, ND, OH, IN, NC
  3      VT, MT, OR, ID, WY, ME, SD
  4      KS, MO, NE, PA, VA, MI
  5      UT, MD, NY, WA, DE, IL, AK
  6      HI, CA, RI
  7      FL, AZ, NM, NV
  8      OK, GA, KY, SC, TX
  9      AR, TN, AL, LA, MS, WV

SOM Modeling 2 of US states (cont'd)
- remove the voting input and apply a 1D SOM: [figure]

Clustering of European languages
• Background: historical linguistics studies the relatedness between languages based on phonology, morphology, syntax and lexicon
• Difficulty of the problem: due to the evolving nature of human languages, and globalization
• Hypothesis: similarity based on analysis of a small 'stable' word set
  See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology

SOM clustering of European languages
• Modeling approach: language ~ a 10-word set. Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure.
• Issues: selection of a stable word set; data encoding + distance metric
• Stable word set: the numbers 1 to 10
• Data encoding: Latin alphabet, using the 3 first letters (of each word)
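A minimal sketch of the word-set encoding just outlined and detailed on the following slides: a 27-symbol alphabet with 'BLANK' ~ 0 and A–Z ~ 1–26, an indicator vector over the first three letters of each word, and concatenation of the 10 word vectors into one 1×270 feature vector. The handling of repeated letters (a single 1 per letter) is an assumption; the English word list is taken from the table on the next slide.

```python
import numpy as np

ALPHABET = " abcdefghijklmnopqrstuvwxyz"   # index 0 ~ 'BLANK', 1..26 ~ a..z

def encode_word(word):
    """27-dimensional indicator vector for the first 3 letters of a word (order-insensitive)."""
    v = np.zeros(len(ALPHABET))
    for ch in word.lower()[:3]:
        v[ALPHABET.index(ch)] = 1.0
    return v

def encode_language(ten_words):
    """Concatenate the 10 word vectors into one 1 x 270 feature vector."""
    return np.concatenate([encode_word(w) for w in ten_words])

english = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
x = encode_language(english)
print(x.shape)                               # (270,)
print(np.nonzero(encode_word("one"))[0])     # indices 5 ('e'), 14 ('n'), 15 ('o')
```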
Numbers word set in 18 European languages
Each language is a feature vector encoding 10 words.

  Language     1      2      3       4        5        6       7           8           9         10
  English      one    two    three   four     five     six     seven       eight       nine      ten
  Norwegian    en     to     tre     fire     fem      seks    sju         atte        ni        ti
  Polish       jeden  dwa    trzy    cztery   piec     szesc   sediem      osiem       dziewiec  dziesiec
  Czech        jeden  dva    tri     ctyri    pet      sest    sedm        osm         devet     deset
  Slovakian    jeden  dva    tri     styri    pat      sest    sedem       osem        devat     desat
  Flemish      ien    twie   drie    viere    vuvve    zesse   zevne       achte       negne     tiene
  Croatian     jedan  dva    tri     cetiri   pet      sest    sedam       osam        devet     deset
  Portuguese   um     dois   tres    quarto   cinco    seis    sete        oito        nove      dez
  French       un     deux   trois   quatre   cinq     six     sept        huit        neuf      dix
  Spanish      uno    dos    tres    cuatro   cinco    seis    siete       ocho        nueve     dies
  Italian      uno    due    tre     quattro  cinque   sei     sette       otto        nove      dieci
  Swedish      en     tva    tre     fyra     fem      sex     sju         atta        nio       tio
  Danish       en     to     tre     fire     fem      seks    syv         otte        ni        ti
  Finnish      yksi   kaksi  kolme   nelja    viisi    kuusi   seitseman   kahdeksan   yhdeksan  kymmenen
  Estonian     uks    kaks   kolme   neli     viis     kuus    seitse      kaheksa     uheksa    kumme
  Dutch        een    twee   drie    vier     vijf     zes     zeven       acht        negen     tien
  German       erins  zwei   drie    vier     funf     sechs   sieben      acht        neun      zehn
  Hungarian    egy    ketto  harom   negy     ot       hat     het         nyolc       kilenc    tiz

Data encoding
• Word ~ a feature vector encoding its 3 first letters
• Alphabet ~ 26 letters + 1 symbol 'BLANK':
  'BLANK' ~ 00, A ~ 01, B ~ 02, C ~ 03, D ~ 04, ..., X ~ 24, Y ~ 25, Z ~ 26
  For example, ONE: 'O' ~ 15, 'N' ~ 14, 'E' ~ 05

Word encoding (cont'd)
• Word → 27-dimensional feature vector: for the word 'one' the letter indices are 15, 14 and 05, so the vector has 1's in positions 5, 14 and 15 and 0's elsewhere
• The encoding is insensitive to the order (of the 3 letters)
• Encoding of the 10-word set: concatenate the feature vectors of all words, 'one' + 'two' + ... + 'ten' → the word set is encoded as a vector of dimension 1 × 270

SOM modeling approach
• 2-dimensional SOM (batch algorithm)
  Number of units per dimension = 4
  Initial neighborhood = 1
  Final neighborhood = 0.15
  Total number of iterations = 70

OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and discussion

Summary and discussion
• Neural network methods (vs statistical approaches):
  - new techniques / gradient-descent style methods
  - simple (brute-force) computational approaches
  - black-box models (e.g. the MLP network)
  - biological motivation
• The same fundamental issues: small-sample problems, the curse of dimensionality, nonlinear optimization, complexity control
• Neural network methods implement the ERM or SRM approach (under the predictive learning setting)
• Hype and controversy