MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung
Clustering
Hypothesis: Hebbian synaptic plasticity enables a perceptron to compute the mean of its preferred stimuli.
Unsupervised learning
• Sequence of data vectors
• Learn something about their structure
• Multivariate statistics
• Neural network algorithms
• Brain models
Data can be summarized by a
few prototypes.
Vector quantization
• Many telecom applications
• Codebook of prototypes
• Send index of prototype rather than
whole vector
• Lossy encoding
A single prototype
• Summarize all data with the sample mean:
\mu = \frac{1}{m} \sum_{a=1}^{m} x_a
Multiple prototypes
• Each prototype is the mean of a subset
of the data.
• Divide data into k clusters.
– One prototype for each cluster.
Assignment matrix
• Rows indexed by data vector a, columns by cluster α:
A_{a\alpha} = \begin{cases} 1, & x_a \in \text{cluster } \alpha \\ 0, & \text{otherwise} \end{cases}
• Data structure for cluster memberships.
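A minimal NumPy sketch of this data structure (not from the lecture; the function name `assignment_matrix` is hypothetical): each row corresponds to one data vector, each column to one cluster, and exactly one entry per row is 1.

```python
import numpy as np

def assignment_matrix(labels, k):
    """Build the m-by-k one-hot assignment matrix A from cluster labels.

    A[a, alpha] = 1 if data vector a belongs to cluster alpha, else 0.
    """
    m = len(labels)
    A = np.zeros((m, k))
    A[np.arange(m), labels] = 1.0
    return A

# Example: 5 data vectors assigned to 3 clusters
print(assignment_matrix(np.array([0, 2, 1, 0, 2]), k=3))
```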
k-means algorithm
• Alternate between computing means
and computing assignments.
\mu_\alpha = \frac{\sum_{a=1}^{m} A_{a\alpha} \, x_a}{\sum_{b=1}^{m} A_{b\alpha}}
A_{a\alpha} = 1 \text{ for } \alpha = \arg\min_\beta \| x_a - \mu_\beta \|
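A hedged NumPy sketch of this alternation (not the lecture's code; the function name `kmeans`, the initialization from random data points, and the fixed iteration count are illustrative assumptions):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Sketch of batch k-means: alternate assignment and mean updates.

    X: (m, d) array of data vectors. Returns prototypes mu (k, d) and labels (m,).
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)  # init from data
    labels = np.zeros(m, dtype=int)
    for _ in range(n_iters):
        # Assignment step: A[a, alpha] = 1 for the nearest prototype.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k)
        labels = np.argmin(dists, axis=1)
        # Mean step: each prototype becomes the mean of its assigned vectors.
        for alpha in range(k):
            members = X[labels == alpha]
            if len(members) > 0:
                mu[alpha] = members.mean(axis=0)
    return mu, labels
```

Each pass applies the assignment formula and then the mean formula above, so the objective discussed below never increases.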
Objective function
• Why does it work?
• Method of minimizing an objective
function.
Rubber band computer
\frac{1}{2} \sum_{a=1}^{m} \| x_a - \mu \|^2
• Attach rubber band from each data
vector to the prototype vector.
• The prototype will converge to the
sample mean.
The sample mean maximizes
likelihood
• Gaussian distribution
P_\mu(x) \propto \exp\!\left( -\tfrac{1}{2} \| x - \mu \|^2 \right)
• Maximize
P_\mu(x_1) \, P_\mu(x_2) \cdots P_\mu(x_m)
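To connect this with the rubber-band objective, take the negative logarithm of the product (a standard step, left implicit on the slide):

-\log\big( P_\mu(x_1) \, P_\mu(x_2) \cdots P_\mu(x_m) \big) = \frac{1}{2} \sum_{a=1}^{m} \| x_a - \mu \|^2 + \text{const}

so maximizing the likelihood is the same as minimizing the rubber-band energy; its gradient \sum_a (\mu - x_a) vanishes at \mu = \frac{1}{m} \sum_a x_a, the sample mean.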
Objective function for k-means
E(A, \mu) = \frac{1}{2} \sum_{a=1}^{m} \sum_{\alpha=1}^{k} A_{a\alpha} \| x_a - \mu_\alpha \|^2
\mu = \arg\min_\mu E(A, \mu)
A = \arg\min_A E(A, \mu)
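A small NumPy sketch of this objective (illustrative only; `kmeans_objective` is a hypothetical name), useful for checking that neither step of the alternation increases E:

```python
import numpy as np

def kmeans_objective(X, A, mu):
    """E(A, mu) = (1/2) * sum over a, alpha of A[a, alpha] * ||x_a - mu_alpha||^2."""
    sq_dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # shape (m, k)
    return 0.5 * np.sum(A * sq_dists)
```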
Local minima can exist
Model selection
• How to choose the number of clusters?
• Tradeoff between model complexity and
objective function.
Neural implementation
• A single perceptron can learn the mean
in its weight vector.
• Many competing perceptrons can learn
prototypes for clustering data.
Batch vs. online learning
• Batch
– Store all data vectors in memory explicitly.
• Online
– Data vectors appear sequentially.
– Use one, then discard it.
– Only memory is in learned parameters.
Learning rule 1
w_t = w_{t-1} + \eta \, x_t
Learning rule 2
w_t = w_{t-1} + \eta_t (x_t - w_{t-1}) = (1 - \eta_t) \, w_{t-1} + \eta_t \, x_t
• “weight decay”
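A minimal sketch (not from the slides) assuming the schedule η_t = 1/t, under which learning rule 2 reproduces the batch sample mean exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))         # a stream of data vectors

w = np.zeros(5)
for t, x in enumerate(X, start=1):
    eta = 1.0 / t                       # eta_t = 1/t
    w = w + eta * (x - w)               # learning rule 2

print(np.allclose(w, X.mean(axis=0)))   # True: w equals the batch sample mean
```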
Learning rule 2 again
\Delta w = -\eta \, \frac{\partial}{\partial w} \, \frac{1}{2} \| x - w \|^2
• Is there an objective function?
Stochastic gradient following
The average of the update is in
the direction of the gradient.
Stochastic gradient descent
\Delta w = -\eta \, \frac{\partial}{\partial w} \, e(w, x)
\langle \Delta w \rangle = -\eta \, \frac{\partial E}{\partial w}, \qquad E(w) = \langle e(w, x) \rangle
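A quick numerical check of this claim for the online-mean example, where e(w, x) = ½‖x − w‖² and the single-sample update is Δw = η(x − w) (the data and step size below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(loc=2.0, size=(10000, 3))   # data with mean near [2, 2, 2]
w = np.zeros(3)
eta = 0.1

# Single-sample update: Delta w = -eta * d/dw (1/2)||x - w||^2 = eta * (x - w)
updates = eta * (X - w)

# Their average is -eta * dE/dw with E(w) = <(1/2)||x - w||^2>, i.e. eta * (<x> - w)
print(np.allclose(updates.mean(axis=0), eta * (X.mean(axis=0) - w)))  # True
```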
Convergence conditions
• Learning rate vanishes
– slowly: \sum_t \eta_t = \infty
– but not too slowly: \sum_t \eta_t^2 < \infty
– for example, \eta_t = 1/t satisfies both conditions
• Every limit point of the sequence w_t is a stationary point of E(w)
Competitive learning
• Online version of k-means
y_a = \begin{cases} 1, & \| x - w_a \| \text{ minimal} \\ 0, & \text{other clusters} \end{cases}
\Delta w_a = \eta \, y_a (x - w_a)
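A hedged NumPy sketch of this online rule (the function name, learning rate, and epoch count are assumptions, not the lecture's): only the winning prototype moves toward each input.

```python
import numpy as np

def competitive_learning(X, k, eta=0.05, n_epochs=10, seed=0):
    """Online k-means: the winner y_a = 1 moves toward x; all others stay put."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=k, replace=False)].astype(float)  # prototypes w_a
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            a = np.argmin(np.linalg.norm(x - W, axis=1))  # winner: y_a = 1
            W[a] += eta * (x - W[a])                      # Delta w_a = eta (x - w_a)
    return W
```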
Competition with WTA
• If the wa are normalized
\arg\min_a \| x - w_a \| = \arg\max_a \, w_a \cdot x
Objective function
\min_a \frac{1}{2} \| x - w_a \|^2
Cortical maps
Images removed due to copyright reasons.
Ocular dominance columns
Images removed due to copyright reasons.
Orientation map
Images removed due to copyright reasons.
Kohonen feature map
y_a = \begin{cases} 1, & a \in \text{neighborhood of the closest cluster} \\ 0, & \text{elsewhere} \end{cases}
\Delta w_a = \eta \, y_a (x - w_a)
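A sketch of the same update on a one-dimensional map (illustrative only; the map size, neighborhood radius, and learning rate are assumed): the winner and its map neighbors all move toward the input, which is what produces topographically ordered prototypes.

```python
import numpy as np

def kohonen_map(X, n_units=20, eta=0.1, radius=2, n_epochs=10, seed=0):
    """1-D Kohonen map: the winner and its map neighbors move toward the input."""
    rng = np.random.default_rng(seed)
    W = X[rng.choice(len(X), size=n_units, replace=False)].astype(float)
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            winner = np.argmin(np.linalg.norm(x - W, axis=1))
            # y_a = 1 inside the neighborhood of the winning unit, 0 elsewhere
            neighbors = np.abs(np.arange(n_units) - winner) <= radius
            W[neighbors] += eta * (x - W[neighbors])
    return W
```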
Hypothesis: Receptive fields are learned by computing the mean of a subset of images.
Nature vs. nurture
• Cortical maps
– dependent on visual experience?
– preprogrammed?