MIT Department of Brain and Cognitive Sciences
9.641J, Spring 2005 - Introduction to Neural Networks
Instructor: Professor Sebastian Seung

Clustering

Hypothesis: Hebbian synaptic plasticity enables a perceptron to compute the mean of its preferred stimuli.

Unsupervised learning
• Sequence of data vectors
• Learn something about their structure
• Multivariate statistics
• Neural network algorithms
• Brain models

Data can be summarized by a few prototypes.

Vector quantization
• Many telecom applications
• Codebook of prototypes
• Send the index of a prototype rather than the whole vector
• Lossy encoding

A single prototype
• Summarize all the data with the sample mean.

    \mu = \frac{1}{m} \sum_{a=1}^{m} x_a

Multiple prototypes
• Each prototype is the mean of a subset of the data.
• Divide the data into k clusters.
  – One prototype for each cluster.

Assignment matrix
• Data structure for cluster memberships (rows indexed by data vector a, columns by cluster \alpha).

    A_{a\alpha} = \begin{cases} 1, & x_a \in \text{cluster } \alpha \\ 0, & \text{otherwise} \end{cases}

k-means algorithm
• Alternate between computing means and computing assignments (a code sketch appears at the end of these notes).

    \mu_\alpha = \frac{\sum_{a=1}^{m} A_{a\alpha} x_a}{\sum_{b=1}^{m} A_{b\alpha}}

    A_{a\alpha} = 1 \text{ for } \alpha = \arg\min_\beta \| x_a - \mu_\beta \|

Objective function
• Why does it work?
• It is a method of minimizing an objective function.

Rubber band computer

    \frac{1}{2} \sum_{a=1}^{m} \| x_a - \mu \|^2

• Attach a rubber band from each data vector to the prototype vector.
• The prototype will converge to the sample mean.

The sample mean maximizes likelihood
• Gaussian distribution

    P_\mu(x) \propto \exp\!\left( -\tfrac{1}{2} \| x - \mu \|^2 \right)

• Maximize P_\mu(x_1)\, P_\mu(x_2) \cdots P_\mu(x_m)

Objective function for k-means

    E(A, \mu) = \frac{1}{2} \sum_{a=1}^{m} \sum_{\alpha=1}^{k} A_{a\alpha} \| x_a - \mu_\alpha \|^2

    \mu \leftarrow \arg\min_\mu E(A, \mu)
    A \leftarrow \arg\min_A E(A, \mu)

Local minima can exist

Model selection
• How to choose the number of clusters?
• Trade-off between model complexity and the objective function.

Neural implementation
• A single perceptron can learn the mean in its weight vector.
• Many competing perceptrons can learn prototypes for clustering data.

Batch vs. online learning
• Batch
  – Store all data vectors in memory explicitly.
• Online
  – Data vectors appear sequentially.
  – Use one, then discard it.
  – The only memory is in the learned parameters.

Learning rule 1

    w_t = w_{t-1} + \eta x_t

Learning rule 2

    w_t = w_{t-1} + \eta_t (x_t - w_{t-1}) = (1 - \eta_t) w_{t-1} + \eta_t x_t

• "weight decay"

Learning rule 2 again

    \Delta w = -\eta \frac{\partial}{\partial w} \frac{1}{2} \| x - w \|^2

• Is there an objective function?

Stochastic gradient following
• The average of the update is in the direction of the negative gradient.

Stochastic gradient descent

    \Delta w = -\eta \frac{\partial}{\partial w} e(w, x)

    \langle \Delta w \rangle = -\eta \frac{\partial E}{\partial w}, \qquad E(w) = \langle e(w, x) \rangle

Convergence conditions
• The learning rate vanishes
  – slowly: \sum_t \eta_t = \infty
  – but not too slowly: \sum_t \eta_t^2 < \infty
• Every limit point of the sequence w_t is a stationary point of E(w).

Competitive learning
• Online version of k-means (a second sketch appears at the end of these notes).

    y_a = \begin{cases} 1, & \| x - w_a \| \text{ minimal} \\ 0, & \text{other clusters} \end{cases}

    \Delta w_a = \eta y_a (x - w_a)

Competition with WTA (winner-take-all)
• If the w_a are normalized,

    \arg\min_a \| x - w_a \| = \arg\max_a\, w_a \cdot x

Objective function

    \frac{1}{2} \min_a \| x - w_a \|^2

Cortical maps
Images removed due to copyright reasons.

Ocular dominance columns
Images removed due to copyright reasons.

Orientation map
Images removed due to copyright reasons.

Kohonen feature map

    y_a = \begin{cases} 1, & a \in \text{neighborhood of the closest cluster} \\ 0, & \text{elsewhere} \end{cases}

    \Delta w_a = \eta y_a (x - w_a)

Hypothesis: Receptive fields are learned by computing the mean of a subset of images.

Nature vs. nurture
• Cortical maps
  – dependent on visual experience?
  – preprogrammed?
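
Sketch 1: batch k-means. The alternation described above (assignment step, then mean step, each lowering E(A, \mu)) is compact enough to write out. This is a minimal NumPy sketch, not code from the course: the function name kmeans, the initialization from randomly chosen data vectors, the iteration cap, and the handling of empty clusters are all choices made here for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Batch k-means: alternate assignment and mean steps to reduce E(A, mu).

    X : (m, d) array of data vectors x_a, one per row
    k : number of clusters / prototypes
    Returns (mu, labels): mu is (k, d); labels[a] is the cluster index of x_a.
    """
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Initialize the prototypes mu_alpha at k distinct randomly chosen data vectors.
    mu = X[rng.choice(m, size=k, replace=False)].astype(float)
    labels = np.full(m, -1)
    for _ in range(n_iters):
        # Assignment step: A_{a,alpha} = 1 for alpha = argmin_beta ||x_a - mu_beta||.
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (m, k)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments unchanged, so E(A, mu) can no longer decrease
        labels = new_labels
        # Mean step: mu_alpha = (sum_a A_{a,alpha} x_a) / (sum_b A_{b,alpha}).
        for alpha in range(k):
            members = X[labels == alpha]
            if len(members) > 0:          # leave an empty cluster's prototype in place
                mu[alpha] = members.mean(axis=0)
    return mu, labels
```

For example, mu, labels = kmeans(np.random.randn(500, 2), k=3) returns three prototypes and the cluster index of each data vector. Because local minima can exist, as noted above, a common precaution is to rerun from several random initializations and keep the solution with the lowest E(A, \mu).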
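Sketch 2: competitive learning, the online version of k-means. This is an illustrative sketch under one added assumption: each unit uses the learning-rate schedule \eta_t = 1/t, counting only the inputs it has won, which satisfies the convergence conditions \sum_t \eta_t = \infty and \sum_t \eta_t^2 < \infty given above. The notes do not fix a particular schedule, so any sequence meeting those conditions could be substituted.

```python
import numpy as np

def competitive_learning(stream, k, d, seed=0):
    """Online k-means (competitive learning): only the winning prototype moves.

    stream : iterable of d-dimensional data vectors x, presented one at a time
    k, d   : number of prototypes and input dimension
    Returns W, the (k, d) array of prototype vectors w_a.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(k, d))          # prototype (weight) vectors w_a
    wins = np.zeros(k)                   # how many inputs each unit has won
    for x in stream:
        x = np.asarray(x, dtype=float)
        # Winner-take-all competition: y_a = 1 for the closest prototype, else 0.
        a = np.argmin(np.linalg.norm(W - x, axis=1))
        wins[a] += 1
        eta = 1.0 / wins[a]              # decaying learning rate, as in learning rule 2
        # Delta w_a = eta * y_a * (x - w_a); only the winner is updated.
        W[a] += eta * (x - W[a])
    return W
```

With this schedule each w_a is exactly the running mean of the inputs for which it won the competition, which is the sense in which a competing perceptron learns the mean of its preferred stimuli.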