
Introduction to
Predictive Learning
LECTURE SET 6
Neural Network Learning
Electrical and Computer Engineering
1
OUTLINE
• Objectives
  - introduce biologically inspired NN learning methods for clustering, regression and classification
  - explain similarities and differences between statistical and NN methods
  - show examples using synthetic and real-life data
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and discussion
2
Brief history and motivation for ANN
• Huge interest in understanding the nature and mechanism of biological/human learning
• Biologists and psychologists do not adopt classical parametric statistical learning, because:
  - parametric modeling is not biologically plausible
  - biological information processing is clearly different from algorithmic models of computation
• Mid-1980's: growing interest in applying biologically inspired computational models to:
  - developing computer models (of the human brain)
  - various engineering applications
  → new field of Artificial Neural Networks (~1986 - 1987)
• ANNs represent nonlinear estimators implementing the ERM approach (usually with a squared-loss function)
3
History and motivation (cont’d)
• Relationship to the problem of inductive learning - the same learning problem setting:
  [diagram: Generator of samples → x; System → y; Learning Machine → ŷ]
• Neural-style learning algorithm:
  - on-line (flow-through)
  - simple processing
• Biological terminology, e.g. a synapse with input x, weight w and output y
  Hebbian rule: Δw ~ xy
4
Neural vs Algorithmic computation
• Biological systems do not use the principles of digital circuits:

  Property            Digital        Biological
  Connectivity        1~10           ~10,000
  Signal              digital        analog
  Timing              synchronous    asynchronous
  Signal propagation  feedforward    feedback
  Redundancy          no             yes
  Parallel proc.      no             yes
  Learning            no             yes
  Noise tolerance     no             yes
5
Neural vs Algorithmic computation
• Computers excel at algorithmic tasks (well-posed mathematical problems)
• Biological systems are superior to digital systems for ill-posed problems with noisy data
• Example: object recognition [Hopfield, 1987]
  PIGEON: ~10^9 neurons, cycle time ~0.1 sec, each neuron sends 2 bits to ~1K other neurons
  → ~2 x 10^13 bit operations per sec
  OLD PC: ~10^7 gates, cycle time 10^-7 sec, connectivity = 2
  → ~2 x 10^14 bit operations per sec
  Both have similar raw processing capability, but pigeons are better at recognition tasks
6
Neural terminology and artificial neurons
Some general descriptions of ANN’s:
http://www.doc.ic.ac.uk/~nd/surprise_96/journal/vol4/cs11/report.html
http://en.wikipedia.org/wiki/Neural_network
• McCulloch-Pitts neuron (1943): a threshold (indicator) function of a weighted sum of inputs,
  $y = g\big((\mathbf{w} \cdot \mathbf{x}) + b\big)$
  with inputs x1,...,xd, weights w1,...,wd, bias b (via the constant input x0 = 1) and threshold function g(t) (see the sketch below)
7
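A minimal Python sketch of such a threshold unit; the weights and bias below (implementing a 2-input AND gate) are illustrative assumptions, not from the slide:

```python
import numpy as np

def mcculloch_pitts(x, w, b):
    """Threshold (indicator) unit: outputs 1 if the weighted sum w.x + b > 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights implementing a 2-input AND gate
w = np.array([1.0, 1.0])
b = -1.5
print([mcculloch_pitts(np.array(x), w, b) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0, 0, 0, 1]
```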
Goals of ANN’s
• Develop models of computation inspired by biological systems
• Study computational capabilities of networks of interconnected neurons
• Apply these models to real-life applications

Learning in NNs = modification (adaptation) of synaptic connections (weights) in response to external inputs
8
Historical highlights of ANN
1943       McCulloch-Pitts neuron
1949       Hebbian learning
1960's     Rosenblatt (perceptron), Widrow
60's-70's  dominance of 'hard' AI
1980's     resurgence of interest (PDP group, MLP, SOM etc.)
1990's     connection to statistics/VC-theory
2000's     mature field/fragmentation
9
OUTLINE
• Objectives
• Brief history and motivation for
artificial neural networks
• Sequential estimation of model
parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and Discussion
10
Sequential estimation of model parameters
• Batch vs on-line (iterative) learning
  - Algorithmic (statistical) approaches ~ batch
  - Neural-network-inspired methods ~ on-line
  BUT the only difference is at the implementation level (so both types of methods should yield similar generalization)
• Recall the ERM inductive principle (for regression):
  $R_{emp}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} L(\mathbf{x}_i, y_i, \mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n} \big(y_i - f(\mathbf{x}_i, \mathbf{w})\big)^2$
• Assume dictionary parameterization with m fixed basis functions:
  $\hat{y} = f(\mathbf{x}, \mathbf{w}) = \sum_{j=1}^{m} w_j\, g_j(\mathbf{x})$
11
Sequential (on-line) least squares minimization
• Training pairs (x(k), y(k)) are presented sequentially
• On-line update equations for minimizing the empirical risk (MSE) with respect to parameters w (gradient-descent learning):
  $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \frac{\partial L(\mathbf{x}(k), y(k), \mathbf{w})}{\partial \mathbf{w}}$
  where the gradient is computed via the chain rule:
  $\frac{\partial L(\mathbf{x}, y, \mathbf{w})}{\partial w_j} = \frac{\partial L}{\partial \hat{y}}\, \frac{\partial \hat{y}}{\partial w_j} = 2(\hat{y} - y)\, g_j(\mathbf{x})$
  and the learning rate $\gamma_k$ is a small positive value (decreasing with k)
12
On-line least-squares minimization algorithm
• Known as the delta rule (Widrow and Hoff, 1960):
  Given initial parameter estimates w(0), update the parameters during each presentation of the k-th training sample (x(k), y(k))
• Step 1: forward pass computation
  $z_j(k) = g_j(\mathbf{x}(k)), \quad j = 1, \ldots, m$
  $\hat{y}(k) = \sum_{j=1}^{m} w_j(k)\, z_j(k)$  - estimated output
• Step 2: backward pass computation
  $\delta(k) = \hat{y}(k) - y(k)$  - error term (delta)
  $w_j(k+1) = w_j(k) - \gamma_k\, \delta(k)\, z_j(k), \quad j = 1, \ldots, m$
  (see the sketch below)
13
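A minimal NumPy sketch of the delta rule above, including the recycling over epochs and the decreasing learning rate discussed a few slides later; the basis functions, the particular schedule gamma0/(1+epoch) and the synthetic data are illustrative assumptions, not part of the original lecture:

```python
import numpy as np

def delta_rule(X, y, basis_fns, n_epochs=200, gamma0=0.2):
    """On-line least squares (delta rule): model y_hat = sum_j w_j * g_j(x),
    with samples presented sequentially and recycled over many epochs."""
    w = np.zeros(len(basis_fns))                      # initial estimates w(0)
    for epoch in range(n_epochs):
        gamma = gamma0 / (1.0 + epoch)                # decreasing learning rate
        for x, target in zip(X, y):
            z = np.array([g(x) for g in basis_fns])   # Step 1: forward pass
            y_hat = w @ z
            delta = y_hat - target                    # Step 2: error term (delta)
            w -= gamma * delta * z                    # weight update
    return w

# Illustrative use: recover y = 2 + 3x from noisy samples with basis {1, x}
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 30)
y = 2 + 3 * X + rng.normal(0, 0.1, 30)
print(delta_rule(X, y, basis_fns=[lambda x: 1.0, lambda x: x]))  # roughly [2, 3]
```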
Neural network interpretation of delta rule
• Forward pass (single neuron with inputs 1, z1(k), ..., zm(k) and weights w0(k), w1(k), ..., wm(k)):
  $\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k)$, with $z_0(k) = 1$
• Backward pass:
  $\delta(k) = \hat{y}(k) - y(k)$
  $\Delta w_j(k) = -\gamma_k\, \delta(k)\, z_j(k)$
  $w_j(k+1) = w_j(k) + \Delta w_j(k)$
• Biological learning (synapse with input x, weight w, output y)
  Hebbian rule: $\Delta w \sim xy$
14
Theoretical basis for on-line learning
• Standard inductive learning: given training data $\mathbf{z}_1, \ldots, \mathbf{z}_n$, find the model providing the minimum of the prediction risk
  $R(\omega) = \int L(\mathbf{z}, \omega)\, p(\mathbf{z})\, d\mathbf{z}$
• Stochastic approximation guarantees minimization of the risk (asymptotically):
  $\omega(k+1) = \omega(k) - \gamma_k\, \mathrm{grad}\, L(\mathbf{z}(k), \omega(k))$
  under general conditions on the learning rate:
  $\lim_{k \to \infty} \gamma_k = 0, \qquad \sum_{k=1}^{\infty} \gamma_k = \infty, \qquad \sum_{k=1}^{\infty} \gamma_k^2 < \infty$
15
Practical issues for on-line learning
• Given a finite training set (n samples) $\mathbf{z}_1, \ldots, \mathbf{z}_n$, this set is presented sequentially to a learning algorithm many times. Each presentation of the n samples is called an epoch, and the process of repeated presentations is called recycling (of the training data)
• Learning rate schedule: initially set large, then slowly decreasing with k (iteration number). Typically, 'good' learning rate schedules are data-dependent
• Stopping conditions:
  (1) monitor the gradient (i.e., stop when the gradient falls below some small threshold)
  (2) early stopping can be used for complexity control
16
OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
  - MultiLayer Perceptron (MLP) networks
  - Radial Basis Function (RBF) networks
• Methods for unsupervised learning
• Summary and discussion
17
Multilayer Perceptrons (MLP)
• Recall the graphical NN representation for dictionary methods:
  $\hat{y} = \sum_{j=1}^{m} w_j z_j$, where W is $m \times 1$,
  $z_j = g(\mathbf{x}, \mathbf{v}_j)$, and V is $d \times m$ (hidden units z1,...,zm fed by inputs x1,...,xd)
• For MLP, each basis function is a sigmoid of a weighted sum of inputs:
  $g(\mathbf{x}, \mathbf{v}_i) = s\!\left(v_{i0} + \sum_{k=1}^{d} x_k v_{ik}\right) = s(\mathbf{x} \cdot \mathbf{v}_i)$
  with logistic sigmoid $s(t) = \frac{1}{1 + \exp(-t)}$ or hyperbolic tangent $s(t) = \tanh(t) = \frac{\exp(t) - \exp(-t)}{\exp(t) + \exp(-t)}$
• How to estimate parameters (weights) via ERM?
18
Learning for a single neuron (delta rule):
• Forward pass (single neuron with inputs 1, z1(k), ..., zm(k) and weights w0(k), ..., wm(k)):
  $\hat{y}(k) = \sum_{j=0}^{m} w_j(k)\, z_j(k)$, with $z_0(k) = 1$
• Backward pass:
  $\delta(k) = \hat{y}(k) - y(k)$
  $\Delta w_j(k) = -\gamma_k\, \delta(k)\, z_j(k)$
  $w_j(k+1) = w_j(k) + \Delta w_j(k)$
• How to implement gradient-descent learning in a network of neurons?
19
Backpropagation training
• Minimization of
  $R_{emp} = \sum_{i=1}^{n} \big(f(\mathbf{x}_i, \mathbf{W}, \mathbf{V}) - y_i\big)^2$
  with respect to parameters (weights) W, V
• Gradient descent optimization for $k = 1, \ldots, n, \ldots$
  $\mathbf{V}(k+1) = \mathbf{V}(k) - \gamma_k\, \mathrm{grad}_V\, L(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k))$
  $\mathbf{w}(k+1) = \mathbf{w}(k) - \gamma_k\, \mathrm{grad}_w\, L(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k))$
  where $L(\mathbf{x}(k), y(k), \mathbf{V}(k), \mathbf{w}(k)) = \frac{1}{2}\big(f(\mathbf{x}, \mathbf{w}, \mathbf{V}) - y\big)^2$
• Careful application of gradient descent leads to the backpropagation algorithm
20
Backpropagation: forward pass
for training input x(k), estimate the predicted output ŷ(k)
21
Backpropagation: backward pass
update the weights by propagating the error
22
Details of backpropagation
• Sigmoid activation $s(t) = \frac{1}{1 + \exp(-t)}$ has a simple derivative:
  $s'(t) = s(t)\big(1 - s(t)\big)$
  → poor behaviour for large t ~ saturation
• How to avoid saturation?
  - proper initialization (small weights)
  - pre-scaling of inputs (zero mean, unit variance)
• Learning rate schedule (initial, final)
• Stopping rules, number of epochs
• Number of hidden units
  (see the sketch below)
23
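A compact NumPy sketch of backpropagation (stochastic gradient descent) for a single-hidden-layer MLP regression network with logistic sigmoid hidden units and a linear output, following the forward/backward passes above and the derivative s'(t) = s(t)(1 - s(t)). The network size, small-weight initialization, learning-rate schedule and synthetic data are illustrative assumptions:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_mlp(X, y, m=5, n_epochs=500, gamma0=0.1, seed=0):
    """SGD training of f(x) = w0 + sum_j w_j * s(v_j . [1, x]) for regression."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    V = rng.normal(0, 0.1, (m, d + 1))        # hidden-layer weights (small init)
    w = rng.normal(0, 0.1, m + 1)             # output weights (incl. bias w[0])
    for epoch in range(n_epochs):
        gamma = gamma0 / (1 + epoch / 50)     # slowly decreasing learning rate
        for x, target in zip(X, y):
            x1 = np.concatenate(([1.0], x))   # input with bias term
            # forward pass
            z = sigmoid(V @ x1)               # hidden-unit outputs
            y_hat = w[0] + w[1:] @ z
            # backward pass
            delta = y_hat - target            # output error
            grad_w = delta * np.concatenate(([1.0], z))
            grad_V = np.outer(delta * w[1:] * z * (1 - z), x1)  # uses s' = s(1-s)
            w -= gamma * grad_w
            V -= gamma * grad_V
    return w, V

def predict(X, w, V):
    Z = sigmoid(np.c_[np.ones(len(X)), X] @ V.T)
    return w[0] + Z @ w[1:]

# Illustrative use on 1D regression data (X must be an (n, d) array)
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (30, 1))
y = np.sin(2 * np.pi * X[:, 0]) ** 2 + rng.normal(0, 0.1, 30)
w, V = train_mlp(X, y, m=5)
```

Trained with m = 2, 5 or 20 hidden units on the 30-sample data of the following slides, a sketch like this should qualitatively reproduce the under-/over-fitting behaviour shown there.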
Regularization Effect of Backpropagation
• Backpropagation ~ iterative optimization
• Final model (weights) depends on:
  - initial point + final point (stopping rules)
  → initialization and/or stopping rules can be used for model complexity control
  [figure: training and validation MSE vs number of epochs; "Stop Training" at the minimum of the validation curve]
  (see the sketch below)
24
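A minimal sketch of early stopping as complexity control: hold out part of the training data as a validation set, monitor validation MSE after every epoch, and keep the weights from the epoch with the lowest validation error (the "Stop Training" point in the figure). For brevity the model here is a polynomial trained by on-line least squares rather than an MLP; the split ratio, polynomial degree, learning rate and target function are illustrative assumptions:

```python
import numpy as np

def early_stopping_fit(x, y, degree=10, n_epochs=200, gamma=0.02, val_frac=0.3, seed=0):
    """SGD on a polynomial model with early stopping: return the weights from
    the epoch with the lowest validation MSE."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(val_frac * len(x))
    val, trn = idx[:n_val], idx[n_val:]          # validation / training split
    Z = np.vander(x, degree + 1)                 # polynomial basis functions
    w = np.zeros(degree + 1)
    best_w, best_mse = w.copy(), np.inf
    for epoch in range(n_epochs):
        for i in trn:                            # one epoch of on-line updates
            delta = Z[i] @ w - y[i]
            w -= gamma * delta * Z[i]
        val_mse = np.mean((Z[val] @ w - y[val]) ** 2)
        if val_mse < best_mse:                   # track the "Stop Training" point
            best_w, best_mse = w.copy(), val_mse
    return best_w

# Illustrative use: noisy sine-squared data, as in the regression examples
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0, 0.1, 30)
w_best = early_stopping_fit(x, y)
```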
Various forms of complexity control
• MLP topology ~ number of hidden units
• Constraints on parameters (weights) ~ weight decay
• Type of optimization algorithm (many versions of backprop., other optimization methods)
• Stopping rules
• Initial conditions (initial 'small' weights)
• Multiple factors make it difficult to control complexity; usually vary one complexity parameter while keeping all others fixed
25
Example: univariate regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• MLP network (two hidden units): underfitting
  [figure: estimated function and training samples, Y vs X]
26
Example: univariate regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• MLP network (five hidden units): near optimal
  [figure: estimated function and training samples, Y vs X]
27
Example: univariate regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• MLP network (20 hidden units): little overfitting
  [figure: estimated function and training samples, Y vs X]
28
Backpropagation for classification
• The original MLP is for regression (as shown):
  $\hat{y} = \sum_{j=1}^{m} w_j z_j$, where W is $m \times 1$, $z_j = g(\mathbf{x}, \mathbf{v}_j)$ and V is $d \times m$ (inputs x1, x2, ..., xd)
• For classification:
  - sigmoid output unit (~ logistic regression using log-likelihood loss - see textbook)
  - during training, use real values 0/1 for class labels
  - during operation, threshold the output of a trained MLP classifier at 0.5 to predict class labels
29
Classification example (Ripley’s data set)
• Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
• MLP classifier (two hidden units): underfitting
  [figure: decision boundary and training samples]
30
Classification Example
• MLP classifier (three hidden units): ~ near-optimal solution
  [figure: decision boundary and training samples]
31
Classification Example
• MLP classifier (six hidden units): some overfitting
  [figure: decision boundary and training samples]
32
MLP software
• MLP software is widely available in the public domain
• Can handle multi-class problems
• For example, the Netlab toolbox (in Matlab) at
  http://www1.aston.ac.uk/eas/research/groups/ncrg/resources/netlab/
• Many commercial products (full of marketing hype):
  'Nearly 80% Accurate Market Forecasting Software - Get FREE up to date predictions and see for yourself!'
33
NetTalk
(Sejnowski and Rosenberg, 1987)
One of the first successful applications of backpropagation:
http://www.cnl.salk.edu/ParallelNetsPronounce/index.php
• Goal: learning to read (English text) aloud, i.e.,
  learn the mapping: English text → phonemes
  using an MLP classifier network
• Network inputs encode a 7-letter window (the 4th letter in the middle needs to be pronounced)
• Network outputs (26 units) encode phonemes that drive a speech synthesizer
• The MLP network is trained using labeled data (both individual words and unrestricted text)
34
NetTalk architecture
Input encoding: 7x29 = 203 units
Output encoding: 26 units (phonemes)
Hidden layer: 80 hidden units
35
Listening to NetTalk-generated speech
Listen to the tape recordings illustrating NETtalk operation, available on YouTube:
http://www.youtube.com/watch?v=gakJlr3GecE
These three recordings contain 3 different audio outputs of NETtalk:
(a) during the first 5 minutes of training, starting with weights initialized to zero.
(b) after training using the set of 10,000 words. This training corresponds to 20 passes (epochs) over a 500-word text.
(c) generated with new text input that was not part of the training set.
After listening to these recordings, answer and comment on the following questions:
- can you recognize words in recordings (a), (b) and (c)? Explain why.
- compare the quality of outputs (b) and (c). Which one seems closer to human speech, and why?
Question for discussion (Problem 6.8):
- Why does NETtalk use a seven-letter window?
36
Radial Basis Function (RBF) Networks
• Dictionary parameterization:
  $f_m(\mathbf{x}) = \sum_{j=1}^{m} w_j\, g\!\left(\frac{\|\mathbf{x} - \mathbf{v}_j\|}{\sigma_j}\right) + w_0$
  - each basis function is (usually) local, with center $\mathbf{v}_j$ and width $\sigma_j$
  - i.e., Gaussian:
  $g(\mathbf{x}) = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{v}\|^2}{2\sigma^2}\right) = \exp\!\left(-\frac{\sum_{j=1}^{d}(x_j - v_j)^2}{2\sigma^2}\right)$
  Network form: $\hat{y} = \sum_{j=1}^{m} w_j z_j$, with W of size $m \times 1$, hidden units $z_j = g(\mathbf{x}, \mathbf{v}_j)$, V of size $d \times m$, and inputs x1,...,xd
• Typically used for regression or classification
37
RBF network training
• RBF training (learning) ~ estimation of
  (1) RBF parameters (centers, widths)
  (2) linear weights w's
• Non-adaptive implementation:
  (1) estimate RBF parameters via unsupervised learning (only x-values of the training data) - can use SOM, GLA etc.
  (2) estimate weights w via linear least squares
• Advantages:
  - fast training
  - useful when x-samples are plentiful but (x,y) data are few
• Limitations: cannot discard irrelevant inputs
  → the curse of dimensionality
38
Non-adaptive RBF training algorithm
1. Choose the number of basis functions (centers) m.
2. Estimate the centers $\mathbf{v}_j$ using the x-values of the training data via unsupervised learning (SOM, GLA, clustering etc.).
3. Determine the width parameters $\sigma_j$ using a heuristic: for a given center $\mathbf{v}_j$,
   (a) find the distance to the closest center:
       $r_j = \min_{k \neq j} \|\mathbf{v}_k - \mathbf{v}_j\|$
   (b) set the width parameter $\sigma_j = \lambda\, r_j$,
   where the parameter $\lambda$ controls the degree of overlap between adjacent basis functions; typically $1 \leq \lambda \leq 3$.
4. Estimate the weights w via linear least squares (minimization of the empirical risk MSE).
   (See the sketch below.)
39
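A compact sketch of this non-adaptive RBF training procedure: centers via a simple batch GLA/k-means (described later in this lecture), widths via the nearest-center heuristic, and output weights via linear least squares. The unit count m, overlap parameter lam and the random initialization of centers are user choices/assumptions:

```python
import numpy as np

def kmeans(X, m, n_iter=50, seed=0):
    """Batch GLA / k-means: estimate m centers from x-values only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), m, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(m):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def train_rbf(X, y, m=5, lam=2.0):
    """Non-adaptive RBF training: centers by clustering, widths by the
    nearest-center heuristic (sigma_j = lam * r_j), weights by least squares."""
    centers = kmeans(X, m)
    dists = np.sqrt(((centers[:, None, :] - centers[None]) ** 2).sum(-1))
    np.fill_diagonal(dists, np.inf)
    widths = lam * dists.min(axis=1)          # distance to the closest center
    Z = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / (2 * widths ** 2))
    Z = np.c_[np.ones(len(X)), Z]             # add the bias term w0
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return centers, widths, w

def rbf_predict(X, centers, widths, w):
    Z = np.exp(-((X[:, None, :] - centers[None]) ** 2).sum(-1) / (2 * widths ** 2))
    return np.c_[np.ones(len(X)), Z] @ w
```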
RBF network complexity control
RBF model complexity can be controlled by:
• the number of RBFs
  Goal: select the optimal number of units (RBFs)
• the RBF width
  Goal: select the optimal width parameter (for a large number of RBFs)
• penalization of large weights w's
See the toy examples next (using the number of units as the complexity parameter)
40
Example: RBF regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• RBF network with automatic width selection (via cross-validation); 2 RBFs: underfitting
  [figure: estimated function and training samples, Y vs X]
41
Example: RBF regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• RBF network with automatic width selection; 5 RBFs: ~ optimal
  [figure: estimated function and training samples, Y vs X]
42
Example: RBF regression
• Data set: 30 samples generated using the sine-squared target function with Gaussian noise (st. deviation 0.1)
• RBF network with automatic width selection; 20 RBFs: overfitting
  [figure: estimated function and training samples, Y vs X]
43
RBF Classification example (Ripley’s data)
• Data set: 250 samples ~ mixture of Gaussians, where Class 0 data has centers (-0.3, 0.7) and (0.4, 0.7), and Class 1 data has centers (-0.7, 0.3) and (0.3, 0.3). The variance of all Gaussians is 0.03.
• RBF classifier (4 units): some underfitting
  [figure: decision boundary and training samples]
44
RBF Classification example (cont’d)
• RBF classifier (9 units): optimal
  [figure: decision boundary and training samples]
45
RBF Classification example (cont’d)
• RBF classifier (25 units): little overfitting
  [figure: decision boundary and training samples]
46
OUTLINE
• Objectives
• Brief history and motivation for artificial neural networks
• Sequential estimation of model parameters
• Methods for supervised learning
• Methods for unsupervised learning
  - clustering and vector quantization
  - Self-Organizing Maps (SOM)
  - application example
• Summary and discussion
47
Overview
Recall from Lecture Set 2: unsupervised learning ~ the data reduction approach
• Example: training data represented by 3 'centers'
  [figure: data samples summarized by 3 centers]
48
Two types of problems
1. Data reduction: VQ + clustering
   'Model' ~ m points
   Vector quantizer Q:
   $f(\mathbf{x}, \omega) = Q(\mathbf{x}) = \sum_{j=1}^{m} \mathbf{c}_j\, I(\mathbf{x} \in R_j)$
   VQ setting: given n training samples $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$,
   find the coordinates $\mathbf{c}_j$ of the m centers (prototypes) such that the total squared error distortion is minimized:
   $R(\omega) = \int \|\mathbf{x} - f(\mathbf{x}, \omega)\|^2\, p(\mathbf{x})\, d\mathbf{x}$
49
2. Dimensionality reduction:
   [figure: 2D data (x1, x2) with linear and nonlinear 1D projections]
   'Model' ~ projection of high-dimensional data onto a low-dimensional space.
   Note: the goal is to estimate a mapping from the d-dimensional input space (d = 2) to a low-dimensional feature space (d* = 1), minimizing
   $R(\omega) = \int \|\mathbf{x} - f(\mathbf{x}, \omega)\|^2\, p(\mathbf{x})\, d\mathbf{x}$
50
Vector Quantization and Clustering
• Two complementary goals of VQ:
  1. partition the input space into disjoint regions
  2. find positions of units (coordinates of prototypes)
  Note: optimal partitioning into regions is according to the nearest-neighbor rule (~ the Voronoi regions)
51
Generalized Lloyd Algorithm(GLA) for VQ
Given data points $\mathbf{x}(k)$, $k = 1, 2, \ldots$, a loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \ldots, m$,
perform the following updates upon presentation of $\mathbf{x}(k)$:
1. Find the nearest center to the data point (the winning unit):
   $j = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\|$
2. Update the winning unit coordinates (only) via
   $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k\, [\mathbf{x}(k) - \mathbf{c}_j(k)]$
Increment k and iterate steps (1) - (2) above.
Note: - the learning rate decreases with iteration number k
      - biological interpretations of steps (1)-(2) exist
52
Batch version of GLA
Given data points $\mathbf{x}_i$, $i = 1, \ldots, n$, a loss function L (i.e., squared loss) and initial centers $\mathbf{c}_j(0)$, $j = 1, \ldots, m$,
iterate the following two steps:
1. Partition the data (assign sample $\mathbf{x}_i$ to unit j) using the nearest-neighbor rule. Partitioning matrix Q:
   $q_{ij} = \begin{cases} 1 & \text{if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)) \\ 0 & \text{otherwise} \end{cases}$
2. Update unit coordinates as centroids of the data:
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} q_{ij}\, \mathbf{x}_i}{\sum_{i=1}^{n} q_{ij}}, \quad j = 1, \ldots, m$
Note: the final solution may depend on initialization (local minima) - a potential problem for both on-line and batch GLA.
53
Numeric Example of univariate VQ
Given data: {2,4,10,12,3,20,30,11,25}, set m=2
• Initialization (random): c1=3,c2=4
• Iteration 1
Projection: P1={2,3} P2={4,10,12,20,30,11,25}
Expectation (averaging): c1=2.5, c2=16
• Iteration 2
Projection: P1={2,3,4}, P2={10,12,20,30,11,25}
Expectation(averaging): c1=3, c2=18
• Iteration 3
Projection: P1={2,3,4,10},P2={12,20,30,11,25}
Expectation(averaging): c1=4.75, c2=19.6
• Iteration 4
Projection: P1={2,3,4,10,11,12}, P2={20,30,25}
Expectation(averaging): c1=7, c2=25
• Stop, as the algorithm has stabilized with these values (see the sketch below)
54
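A short sketch of the batch GLA for this univariate example; run with the same data and initial centers it should reproduce the iterations above (final centers 7 and 25):

```python
import numpy as np

def batch_gla_1d(x, centers, n_iter=10):
    """Batch GLA (k-means) for univariate data: alternate nearest-center
    partitioning and centroid (mean) updates."""
    x = np.asarray(x, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        centers = np.array([x[labels == j].mean() for j in range(len(centers))])
    return centers

print(batch_gla_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], centers=[3, 4]))
# final centers: [7., 25.], matching the iterations above
```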
GLA Example 1
•
Modeling doughnut distribution using 5 units
(a) initialization
(b) final position (of units)
55
GLA Example 2
•
Modeling doughnut distribution using 3 units:
Bad initialization  poor local minimum
56
GLA Example 3
•
Modeling doughnut distribution using 20 units:
7 units were never moved by the GLA
 the problem of unused units (dead units)
57
Avoiding local minima with GLA
• Starting with many random initializations, and then choosing the best GLA solution
• Conscience mechanism: force 'dead' units to participate in the competition by keeping a frequency count (of past winnings) for each unit; i.e., for the on-line version of GLA, in Step 1 use
  $j = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\| \cdot freq_i(k)$
  (see the sketch below)
• Self-Organizing Map: introduce a topological relationship (map), thus forcing the neighbors of the winning unit to move towards the data.
58
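A minimal sketch of the conscience mechanism for the on-line GLA winner-selection step; the exact form of the frequency weighting varies in practice, and multiplying the distance by the raw winning count (as on this slide) is used here:

```python
import numpy as np

def conscience_winner(x, centers, freq):
    """GLA Step 1 with a conscience mechanism: bias the distance by each unit's
    past winning frequency so that rarely-winning ('dead') units can compete."""
    dists = np.sqrt(((centers - x) ** 2).sum(axis=1))
    j = int(np.argmin(dists * freq))      # frequency-weighted winner selection
    freq[j] += 1                          # update the winning count
    return j, freq

# freq is typically initialized to ones, e.g. freq = np.ones(len(centers))
```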
Clustering methods
• Clustering: separating a data set into several groups (clusters) according to some measure of similarity
• Goals of clustering:
  - interpretation (of resulting clusters)
  - exploratory data analysis
  - preprocessing for supervised learning
  - often the goal is not formally stated
• VQ-style methods (GLA) are often used for clustering, i.e., k-means or c-means
• Many other clustering methods exist as well
59
Clustering (cont’d)
• Clustering: partition a set of n objects (samples) into k disjoint groups, based on some similarity measure. Assumptions:
  - similarity ~ distance metric dist(i, j)
  - usually k is given a priori (but not always!)
• Intuitive motivation:
  similar objects go into one cluster, dissimilar objects into different clusters
  → the goal is not formally stated
• The similarity (distance) measure is critical but usually hard to define (objectively). Distance needs to be defined for different types of input variables.
60
Self-Organizing Maps
History and biological motivation
• The brain changes its internal structure to reflect life experiences → interaction with the environment is critical at early stages of brain development (first 1-2 years of life)
• Existence of various regions (maps) in the brain
• How might these maps be formed?
  i.e., an information-processing model leading to map formation
• T. Kohonen (early 1980's) proposed SOM
61
Goal of SOM
• Dimensionality reduction: project given (high-dimensional) data onto a low-dimensional space (called a map)
• The feature space (Z-space) is 1D or 2D and is discretized as a number of units, i.e., a 10x10 map
• Z-space has a distance metric → ordering of units
• Similarities and differences between VQ and SOM
  [diagram: X → G(X) → Z → F(Z) → X̂]
62
Self-Organizing Map
Discretization of 2D space via 10x10 map. In this discrete
space, distance relations exist between all pairs of units.
Distance relation ~ map topology
Units in 2D feature space
63
SOM Algorithm (flow through)
Given data points $\mathbf{x}(k)$, $k = 1, 2, \ldots$, a distance metric in the input space (~ Euclidean), a map topology (in z-space), and initial positions of the units (in x-space) $\mathbf{c}_j(0)$, $j = 1, \ldots, m$,
perform the following updates upon presentation of $\mathbf{x}(k)$:
1. Find the nearest unit to the data point (the winning unit, denoted z(k)):
   $z(k) = \arg\min_i \|\mathbf{x}(k) - \mathbf{c}_i(k)\|$
2. Update all units around the winning unit via
   $\mathbf{c}_j(k+1) = \mathbf{c}_j(k) + \gamma_k\, K(z_j, z(k))\, [\mathbf{x}(k) - \mathbf{c}_j(k)]$
Increment k, decrease the learning rate and the neighborhood width, and repeat steps (1) - (2) above (see the sketch below).
64
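A minimal NumPy sketch of the flow-through SOM for a 1D map with a Gaussian neighborhood in z-space; the unit count, the learning-rate and neighborhood schedules, and the 2D uniform data are illustrative assumptions (schedules are discussed on the next slides):

```python
import numpy as np

def som_update(centers, z_coords, x, gamma, sigma):
    """One on-line SOM step: find the winning unit, then move every unit toward
    x with strength given by its map (z-space) distance to the winner."""
    winner = np.argmin(((centers - x) ** 2).sum(axis=1))           # step 1
    K = np.exp(-((z_coords - z_coords[winner]) ** 2) / (2 * sigma ** 2))
    centers += gamma * K[:, None] * (x - centers)                  # step 2
    return centers

# Illustrative use: 1D map with 10 units in a 2D input space
rng = np.random.default_rng(0)
centers = rng.uniform(0, 1, (10, 2))
z_coords = np.arange(10, dtype=float)          # unit positions on the 1D map
for k, x in enumerate(rng.uniform(0, 1, (1000, 2))):
    gamma = 0.1 * (1 - k / 1000)               # linearly decreasing learning rate
    sigma = 3.0 * (0.05 / 3.0) ** (k / 1000)   # shrinking neighborhood width
    centers = som_update(centers, z_coords, x, gamma, sigma)
```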
SOM example (one iteration)
Step 1:
Step 2:
65
SOM example (next iteration)
Step 1:
Step 2:
Final map
66
Hyper-parameters of SOM
SOM performance depends on parameters (~ user-defined):
• Map dimension and topology (usually 1D or 2D)
• Number of SOM units ~ quantization level (of z-space)
• Neighborhood function $K(z_j, z(k))$ ~ usually rectangular or Gaussian (the shape is not important), e.g.
  $K_k(z_j, z) = \exp\!\left(-\frac{\|z_j - z\|^2}{2\sigma^2(k)}\right)$
• Neighborhood width decrease schedule (important), i.e., exponential decrease for the Gaussian:
  $\sigma(k) = \sigma_{initial} \left(\frac{\sigma_{final}}{\sigma_{initial}}\right)^{k / k_{max}}$
  with user-defined $k_{max}$, $\sigma_{initial}$, $\sigma_{final}$; a linear decrease of the neighborhood width is also used
• Learning rate schedule (important):
  $\gamma(k) = \gamma_0 \left(\frac{\gamma_f}{\gamma_0}\right)^{k / k_{max}}$
  (a linear decrease is also used)
Note: the learning rate and neighborhood decrease should be set jointly.
67
Modeling uniform distribution via SOM
(a) 300 random samples; (b) 10x10 map
[figure: samples in the unit square and the fitted 10x10 SOM]
SOM neighborhood: Gaussian
Learning rate: linear decrease, $\gamma(k) = 0.1\,(1 - k/k_{max})$
68
Position of SOM units: (a) initial, (b) after 50 iterations,
(c) after 100 iterations, (d) after 10,000 iterations
[figure: four panels showing the positions of the SOM units at these four stages]
69
Batch SOM (similar to Batch GLA)
Given data points $\mathbf{x}_i$, a distance metric (i.e., squared loss), a map topology and initial centers $\mathbf{c}_j(0)$, $j = 1, \ldots, m$,
iterate the following two steps:
1. Partition the data into clusters using the minimum-distance rule. This results in the assignment of n samples to m clusters (units) according to the assignment matrix Q:
   $q_{ij} = \begin{cases} 1 & \text{if } L(\mathbf{x}_i, \mathbf{c}_j(k)) = \min_l L(\mathbf{x}_i, \mathbf{c}_l(k)) \\ 0 & \text{otherwise} \end{cases}$
2. Update the center coordinates as the weighted average of all data samples (in each cluster):
   $\mathbf{c}_j(k+1) = \frac{\sum_{i=1}^{n} K(z_j, z_i)\, \mathbf{x}_i}{\sum_{i=1}^{n} K(z_j, z_i)}$
   where $z_i$ is the map coordinate of the unit that sample $\mathbf{x}_i$ is assigned to.
Decrease the neighborhood width, and iterate (see the sketch below).
70
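A minimal sketch of one batch-SOM iteration for a 1D map, directly following the two steps above (assignment by minimum distance, then neighborhood-weighted averaging); the small epsilon in the denominator is an added numerical safeguard, not part of the original formula:

```python
import numpy as np

def batch_som_step(X, centers, z_coords, sigma):
    """One batch SOM iteration: assign each sample to its nearest unit, then
    update each unit as a neighborhood-weighted average of all samples."""
    # step 1: minimum-distance assignment of each sample to a unit
    winners = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    # step 2: weighted average; weights = neighborhood function in z-space
    K = np.exp(-((z_coords[:, None] - z_coords[winners][None, :]) ** 2) / (2 * sigma ** 2))
    centers = (K @ X) / (K.sum(axis=1, keepdims=True) + 1e-12)
    return centers
```

Iterating this step while decreasing sigma toward its final value gives the batch SOM; with a very small sigma it reduces to the batch GLA centroid update.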
Example: effect of the final neighborhood width
[figure: three final maps for final neighborhood widths of 90%, 50% and 10%]
71
SOM Applications
• Two types of applications:
  - vector quantization
  - clustering of multivariate data
• Main web site: http://www.cis.hut.fi/research/som-research/
• Numerous applications:
  - marketing surveys/segmentation
  - financial/stock market data
  - text data/document map - WEBSOM
  - image data/picture map - PicSOM
  see the HUT web site
72
Practical Issues for SOM
• Pre-scaling of inputs, usually to [0, 1] range. Why?
• Map topology: usually 1D or 2D
• Number of map units (per dimension)
• Learning rate schedule (for the on-line version)
• Neighborhood type and schedule: initial size (~1), final size
  The final neighborhood size and the number of units affect model complexity.
73
Modeling US states using 1D SOM
(performed by Feng Cai)
• Purpose: clustering of US states
• Data encoding: each state is described by 5 socio-economic indicators: obesity index, result of the 2004 presidential election, median income, mean NAEP, IQ score
• Data scaling: each input scaled independently to [0, 1] range
• SOM specs: 1D map, 9 units, initial neighborhood width 1, final width 0.05
74
State           Obesity index  Election_04  Median Income  Mean NAEP  IQ score
Hawaii          17             0            49775          238        94
Colorado        17             1            49617          252        104
Connecticut     18             0            53325          255        99
Massachusetts   18             0            50587          257        111
New Hampshire   18             1            53549          257        102
Utah            18             1            48537          250        89
California      19             0            48113          238        94
Maryland        19             0            55912          248        95
New Jersey      19             0            53266          253        103
Rhode Island    19             0            44311          245        89
Vermont         19             0            41929          256        102
Florida         19             1            38533          245        87
Montana         19             1            33900          254        100
Oregon          20             0            42704          250        100
Arizona         20             1            41554          241        92
Idaho           20             1            38613          249        96
New Mexico      20             0            35251          235        85
Wyoming         20             1            40499          253        102
Maine           21             0            37654          253        99
New York        21             0            42432          251        90
Washington      21             0            44252          251        92
South Dakota    21             1            38755          254        100
Delaware        22             0            50878          250        90
Illinois        22             0            45906          248        93
Minnesota       22             0            54931          256        113
Wisconsin       22             0            46351          252        105
Nevada          22             1            46289          239        92
Alaska          23             1            55412          245        92
75
State           Obesity index  Election_04  Median Income  Mean NAEP  IQ score
Iowa            23             0            41827          253        109
Kansas          23             1            42523          253        101
Missouri        23             1            43955          251        92
Nebraska        23             1            43566          251        101
North Dakota    23             1            36717          254        111
Ohio            23             1            43332          252        107
Oklahoma        23             1            35500          244        98
Pennsylvania    24             0            43577          249        99
Arkansas        24             1            32423          242        98
Georgia         24             1            43316          243        93
Indiana         24             1            41581          251        105
North Carolina  24             1            38432          252        106
Virginia        24             1            49974          253        99
Michigan        25             0            45335          249        99
Kentucky        25             1            37893          247        94
Tennessee       25             1            36329          241        90
Alabama         26             1            36771          236        90
Louisiana       26             1            33312          238        99
South Carolina  26             1            38460          246        87
Texas           26             1            40659          247        98
Mississippi     27             1            32447          236        90
West Virginia   28             1            30072          245        92
76
SOM Modeling 1 of US states
Unit   States (assigned to each unit)
1      HI, CA, MD, RI, NM
2      OR, ME, NY, WA, DE, IL, PA, MI
3      CT, MA, NJ, VT, MN, WI
4      (none)
5      CO, NH, MT, WY, SD
6      KS, NE, ND, OH, IN, NC, VA
7      UT, ID, AK, IA, MO
8      FL, AZ, NV, OK, GA, KY, TX
9      AR, TN, AL, LA, SC, MS, WV
77
78
SOM Modeling 2 of US states
- remove the voting input and apply 1D SOM:
Unit   States
1      CO, CT, MA, NH, NJ, MN
2      WI, IA, ND, OH, IN, NC
3      VT, MT, OR, ID, WY, ME, SD
4      KS, MO, NE, PA, VA, MI
5      UT, MD, NY, WA, DE, IL, AK
6      HI, CA, RI
7      FL, AZ, NM, NV
8      OK, GA, KY, SC, TX
9      AR, TN, AL, LA, MS, WV
79
SOM Modeling 2 of US states (cont’d)
- remove voting input and apply 1D SOM:
80
Clustering of European Languages
• Background: historical linguistics studies relatedness between languages based on phonology, morphology, syntax and lexicon
• Difficulty of the problem: due to the evolving nature of human languages and globalization
• Hypothesis: similarity based on analysis of a small 'stable' word set
  See glottochronology, Swadesh list, at http://en.wikipedia.org/wiki/Glottochronology
81
SOM Clustering of European Languages
Modeling approach: language ~ 10-word set. Assuming words in different languages are encoded in the same alphabet, it is possible to perform clustering using some distance measure.
• Issues:
  - selection of a stable word set
  - data encoding + distance metric
• Stable word set: the numbers 1 to 10
• Data encoding: Latin alphabet, use the first 3 letters (of each word)
82
Numbers word set in 18 European languages
Each language is a feature vector encoding 10 words
English:     one, two, three, four, five, six, seven, eight, nine, ten
Norwegian:   en, to, tre, fire, fem, seks, sju, atte, ni, ti
Polish:      jeden, dwa, trzy, cztery, piec, szesc, sediem, osiem, dziewiec, dziesiec
Czech:       jeden, dva, tri, ctyri, pet, sest, sedm, osm, devet, deset
Slovakian:   jeden, dva, tri, styri, pat, sest, sedem, osem, devat, desat
Flemish:     ien, twie, drie, viere, vuvve, zesse, zevne, achte, negne, tiene
Croatian:    jedan, dva, tri, cetiri, pet, sest, sedam, osam, devet, deset
Portuguese:  um, dois, tres, quarto, cinco, seis, sete, oito, nove, dez
French:      un, deux, trois, quatre, cinq, six, sept, huit, neuf, dix
Spanish:     uno, dos, tres, cuatro, cinco, seis, siete, ocho, nueve, dies
Italian:     uno, due, tre, quattro, cinque, sei, sette, otto, nove, dieci
Swedish:     en, tva, tre, fyra, fem, sex, sju, atta, nio, tio
Danish:      en, to, tre, fire, fem, seks, syv, otte, ni, ti
Finnish:     yksi, kaksi, kolme, nelja, viisi, kuusi, seitseman, kahdeksan, yhdeksan, kymmenen
Estonian:    uks, kaks, kolme, neli, viis, kuus, seitse, kaheksa, uheksa, kumme
Dutch:       een, twee, drie, vier, vijf, zes, zeven, acht, negen, tien
German:      erins, zwei, drie, vier, funf, sechs, sieben, acht, neun, zehn
Hungarian:   egy, ketto, harom, negy, ot, hat, het, nyolc, kilenc, tiz
83
Data Encoding
• Word ~ feature vector encoding the first 3 letters
• Alphabet ~ 26 letters + 1 symbol 'BLANK', with vector encoding:

  ALPHABET  'BLANK'  A   B   C   D   ...  X   Y   Z
  INDEX     00       01  02  03  04  ...  24  25  26

  For example, ONE: 'O'~15, 'N'~14, 'E'~05
84
Word Encoding (cont’d)
• Word → 27-dimensional feature vector; e.g., "one" → indices 15, 14, 05, i.e., a binary vector over index positions 0-26 with 1's at positions 05, 14 and 15 and 0's elsewhere
• The encoding is insensitive to the order (of the 3 letters)
• Encoding of the 10-word set: concatenate the feature vectors of all words: 'one' + 'two' + ... + 'ten'
  → the word set is encoded as a vector of dimension [1 x 270]
  (see the sketch below)
85
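A short sketch of this encoding; the alphabet ordering ('BLANK' at index 0, then A-Z) follows the previous slide, and the English word list is taken from the table above:

```python
import numpy as np

ALPHABET = [' '] + [chr(ord('A') + i) for i in range(26)]   # 'BLANK' + A..Z, indices 0..26

def encode_word(word):
    """Encode the first 3 letters of a word as a 27-dimensional binary vector
    (order-insensitive: a 1 at each letter's alphabet index)."""
    v = np.zeros(27)
    for ch in word.upper()[:3]:
        v[ALPHABET.index(ch)] = 1
    return v

def encode_language(words):
    """Concatenate the encodings of the 10 number words -> 270-dim vector."""
    return np.concatenate([encode_word(w) for w in words])

english = ["one", "two", "three", "four", "five",
           "six", "seven", "eight", "nine", "ten"]
x = encode_language(english)
print(x.shape)                               # (270,)
print(np.nonzero(encode_word("one"))[0])     # indices 5, 14, 15 ('E', 'N', 'O')
```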
SOM Modeling Approach
•
2-Dimensional SOM (Batch Algorithm)
Number of Units per dimension=4
Initial Neighborhood =1 Final Neighborhood = 0.15
Total Number of Iterations= 70
86
OUTLINE
• Objectives
• Brief history and motivation for artificial
neural networks
• Sequential estimation of model
parameters
• Methods for supervised learning
• Methods for unsupervised learning
• Summary and discussion
87
Summary and Discussion
• Neural network methods (vs statistical approaches):
  - new techniques/gradient-descent-style methods
  - simple (brute-force) computational approaches
  - black-box models (e.g., MLP network)
  - biological motivation
• The same fundamental issues: small-sample problems, curse of dimensionality, nonlinear optimization, complexity control
• Neural network methods implement the ERM or SRM approach (under the predictive learning setting)
• Hype and controversy
88