Dynamics of Learning VQ and Neural Gas

Aree Witoelar, Michael Biehl
Mathematics and Computing Science, University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal) and Anarta Ghosh (Groningen)

Dagstuhl Seminar, 25.03.2007

Outline
• Vector Quantization (VQ)
• Analysis of VQ Dynamics
• Learning Vector Quantization (LVQ)
• Summary

Vector Quantization

Objective: representation of (many) data with (few) prototype vectors.
• Assign each datum ξ^μ to the nearest prototype vector w_j (by a distance measure, e.g. Euclidean).
• Find the optimal set W of prototypes with the lowest quantization error

  E(W) = Σ_{μ=1}^{P} min_j d(ξ^μ, w_j),

i.e. the summed distance of each datum to its nearest prototype.
• This groups the data into clusters, e.g. for classification.

Example: Winner Takes All (WTA)
• initialize K prototype vectors
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example

The prototypes end up in areas with a high density of data; the procedure is a stochastic gradient descent with respect to a cost function.

Problems

Winner Takes All is sensitive to initialization. A "winner takes most" scheme instead updates the prototypes according to their "rank", e.g. Neural Gas. Is it less sensitive to initialization?

(L)VQ algorithms
• intuitive
• fast, powerful algorithms
• flexible
• but: limited theoretical background w.r.t. convergence speed, robustness to initial conditions, etc.

Analysis of VQ Dynamics
• exact mathematical description in very high dimensions
• study of typical learning behavior

Model: two Gaussian clusters of high-dimensional data

Random vectors ξ ∈ ℝ^N are drawn according to

  P(ξ) = Σ_{σ=±1} p_σ P(ξ | σ),   P(ξ | σ) = N(ℓ B_σ, υ_σ)

• classes: σ ∈ {+1, −1} with prior probabilities p_+, p_-, where p_+ + p_- = 1
• cluster centers: B_+, B_- ∈ ℝ^N, separation ℓ
• variances: υ_+, υ_-

The clusters are separable in the projection onto the (B_+, B_-) plane, but not in any other plane; i.e. they are separable in only 2 of the N dimensions. A simple model, but not a trivial one.

Online learning

A sequence of independent random data {ξ^μ, σ^μ}, μ = 1, 2, ..., P, drives the update of the prototype vectors w_s ∈ ℝ^N:

  w_s^μ = w_s^{μ−1} + (η/N) f_s(rank_s, c_s, σ^μ, ...) (ξ^μ − w_s^{μ−1})

• η: learning rate, step size
• f_s[...]: strength and direction of the update; this modulation function describes the algorithm used
• the prototype is moved towards the current datum
• c_s: prototype class; σ^μ = ±1: data class
• rank_s ∈ {1, 2, ..., K}, with rank 1 the "winner"

1. Define a few characteristic quantities of the system:

  R_{sσ}^μ = w_s^μ · B_σ   (projections onto the cluster centers)
  Q_{st}^μ = w_s^μ · w_t^μ   (lengths and overlaps of the prototypes)

with s, t ∈ {1, 2, ..., K} and σ ∈ {−1, +1}.

2. Derive recursion relations of these quantities for a new input datum:

  N (R_{sσ}^μ − R_{sσ}^{μ−1}) = η f_s(...) (b_σ^μ − R_{sσ}^{μ−1})
  N (Q_{st}^μ − Q_{st}^{μ−1}) = η f_s(...) (h_t^μ − Q_{st}^{μ−1}) + η f_t(...) (h_s^μ − Q_{st}^{μ−1}) + η² f_s(...) f_t(...) + O(1/N)

The random vector ξ^μ enters only through its projections

  h_s^μ = w_s^{μ−1} · ξ^μ,   b_σ^μ = B_σ · ξ^μ.

3. Calculate the average recursions. In the thermodynamic limit N → ∞:
• the projections h, b become correlated Gaussian quantities, completely specified in terms of their first and second moments:

  ⟨h_s⟩_σ = ℓ R_{sσ},   ⟨b_τ⟩_σ = ℓ if τ = σ, 0 else
  ⟨h_s h_t⟩_σ − ⟨h_s⟩_σ ⟨h_t⟩_σ = υ_σ Q_{st}
  ⟨h_s b_τ⟩_σ − ⟨h_s⟩_σ ⟨b_τ⟩_σ = υ_σ R_{sτ}
  ⟨b_ρ b_τ⟩_σ − ⟨b_ρ⟩_σ ⟨b_τ⟩_σ = υ_σ if ρ = τ, 0 else

• the characteristic quantities R_{sσ}^μ, Q_{st}^μ self-average with respect to the random sequence of data (fluctuations vanish)
• define a continuous learning time t = μ/N (μ discrete: 1, 2, ..., P; t continuous)

4. Derive ordinary differential equations:

  dR_{sσ}/dt = η ⟨ f_s(...) (b_σ − R_{sσ}) ⟩
  dQ_{st}/dt = η ⟨ f_s(...) (h_t − Q_{st}) ⟩ + η ⟨ f_t(...) (h_s − Q_{st}) ⟩ + η² ⟨ f_s(...) f_t(...) ⟩
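Steps 1-3 can be sanity-checked by direct simulation before solving the ODEs. Below is a minimal sketch (Python/NumPy) of the model and of WTA online learning, recording R_{sσ} and Q_{st} once per unit of learning time t = μ/N. The model parameters match those used in the VQ results later in the talk; the dimension N, the horizon P, the orthonormal choice of B_±, and the random seed are illustrative assumptions, not values fixed by the slides.

```python
import numpy as np

# Model and algorithm parameters (p_+, ell, variances, eta as on the slides;
# N, P and the seed are illustrative choices)
N = 500                      # input dimension (large, mimicking N -> infinity)
K = 2                        # number of prototypes
ell = 1.0                    # cluster separation
p_plus = 0.6                 # prior probability of class +1
var = {+1: 1.5, -1: 1.0}     # cluster variances v_+, v_-
eta = 0.01                   # learning rate
P = 50 * N                   # number of examples, i.e. final time t = 50

rng = np.random.default_rng(0)

# Orthonormal cluster centers B_+, B_- in R^N (an assumed convenient basis)
B = {+1: np.zeros(N), -1: np.zeros(N)}
B[+1][0] = 1.0
B[-1][1] = 1.0

# Prototypes initialized close to the origin, w_s(0) ~ 0 as on the slides
w = 1e-3 * rng.standard_normal((K, N))

history = []  # records (t, R, Q)

for mu in range(1, P + 1):
    # Draw a class label, then a datum from the corresponding Gaussian cluster
    sigma = +1 if rng.random() < p_plus else -1
    xi = ell * B[sigma] + np.sqrt(var[sigma]) * rng.standard_normal(N)

    # Winner-takes-all: only the closest prototype is updated
    dists = np.sum((w - xi) ** 2, axis=1)
    s = int(np.argmin(dists))
    w[s] += (eta / N) * (xi - w[s])

    if mu % N == 0:  # record order parameters once per unit of t = mu/N
        R = np.array([[w[k] @ B[c] for c in (+1, -1)] for k in range(K)])
        Q = w @ w.T
        history.append((mu / N, R, Q))

t_last, R_last, Q_last = history[-1]
print(f"t = {t_last:.0f}\nR =\n{R_last}\nQ =\n{Q_last}")
```

For large N, repeated runs of this simulation produce nearly identical curves for R and Q (self-averaging), which is what licenses the deterministic ODE description of step 4.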
5. Solve for R_{sσ}(t), Q_{st}(t):
• dynamics and asymptotic behavior (t → ∞)
• quantization/generalization error
• sensitivity to initial conditions, learning rates, and the structure of the data

Results: VQ with 2 prototypes

WTA update (f_s is the winner indicator; Θ is the Heaviside step function):

  w_s^μ = w_s^{μ−1} + (η/N) Θ(d_j^μ − d_s^μ) (ξ^μ − w_s^{μ−1}),   j ≠ s

[Figure: numerical integration of the ODEs; the characteristic quantities R_{1+}, R_{1−}, R_{2+}, R_{2−} and Q_{11}, Q_{12}, Q_{22} as functions of t = μ/N, together with the quantization error E(W); here w_s(0) ≈ 0, p_+ = 0.6, ℓ = 1.0, υ_+ = 1.5, υ_- = 1.0, η = 0.01.]

2 prototypes vs. 3 prototypes

[Figure: projections R_{s+}, R_{s−} of the prototypes onto the (B_+, B_-) plane at t = 50, for p_+ > p_-.]

With three prototypes, two of them move to the stronger cluster.

Neural Gas: a winner-takes-most algorithm (3 prototypes)

  w_s^μ = w_s^{μ−1} + (η/N) · exp(−rank_s / λ(t)) / C(t) · (ξ^μ − w_s^{μ−1})

• the update strength decreases exponentially with the rank of the prototype
• λ(t) is large initially and is decreased over time; for λ(t) → 0 the rule becomes identical to WTA
• here λ_i = 2, λ_f = 10⁻²

[Figure: projections of the three prototypes at t = 0 and t = 50, and the quantization error E(W) as a function of t.]

Sensitivity to initialization

[Figure: prototype projections at t = 50 for WTA and for Neural Gas, and E(W) vs. t; WTA shows a "plateau" where ∇H_VQ ≈ 0.]

• WTA (eventually) reaches the minimum of E(W), but depending on the initialization the learning time can become very large
• Neural Gas is more robust w.r.t. initialization

Learning Vector Quantization (LVQ)

Objective: classification of data using prototype vectors.
• Assign data {ξ, σ}, ξ ∈ ℝ^N, to the nearest of the labeled prototypes {w_s, c_s}, w_s ∈ ℝ^N (distance measure, e.g. Euclidean).
• Find the optimal set W for the lowest generalization error, where a datum is misclassified by its nearest prototype when their classes differ:

  g(σ, c_j) = 1 if c_j ≠ σ, 0 else.

LVQ1

  w_s^μ = w_s^{μ−1} + (η/N) Θ(d_j^μ − d_s^μ) c_s σ^μ (ξ^μ − w_s^{μ−1}),   j ≠ s,   c_s σ^μ = ±1

The winner is updated towards the datum if their classes agree (c_s σ^μ = +1) and away from it otherwise (c_s σ^μ = −1). There is no cost function related to the generalization error.

Two prototypes vs. three prototypes: to which class should the 3rd prototype be assigned?
• c = {+1, −1}
• c = {+1, −1, −1}
• c = {+1, +1, −1}

[Figure: prototype projections onto the (B_+, B_-) plane for these three label configurations.]

Generalization error

  ε_g = Σ_{σ=±1} p_σ ⟨ g(σ, c_s) ⟩_σ,   with s the nearest prototype (d_s = min_j d_j),

i.e. the probability that a datum is misclassified by its nearest prototype.

[Figure: ε_g as a function of t for p_+ = 0.6, p_- = 0.4, υ_+ = 1.5, υ_- = 1.0.]

Optimal decision boundary

The optimal decision boundary is the (hyper)plane where

  p_+ P(ξ | σ = +1) = p_- P(ξ | σ = −1).

• equal variances (υ_+ = υ_-): a linear decision boundary
• unequal variances (υ_+ > υ_-): the optimal boundary is curved; K = 2 prototypes can only approximate it linearly, and more prototypes approximate it better (here: optimal with K = 3)

Asymptotic ε_g for υ_+ > υ_- (υ_+ = 0.81, υ_- = 0.25)

[Figure: ε_g(t → ∞) as a function of p_+ for c = {+1, +1, −1} and for c = {+1, −1, −1}, compared with the optimal classifier.]

• where the optimal classifier with K = 3 beats K = 2, LVQ1 with K = 3 is better as well
• where the optimal classifier with K = 3 only equals K = 2, LVQ1 with K = 3 is actually worse
• more prototypes are not always better for LVQ1
• best: place more prototypes on the class with the larger variance

Summary
• dynamics of (Learning) Vector Quantization for high-dimensional data
• Neural Gas: more robust w.r.t. initialization than WTA
• LVQ1: more prototypes are not always better

Outlook
• study different algorithms, e.g. LVQ+/-, LFM, RSLVQ
• more complex models
• multi-prototype, multi-class problems

Reference
M. Biehl, A. Ghosh, and B. Hammer. Dynamics and Generalization Ability of LVQ Algorithms. Journal of Machine Learning Research 8: 323-360 (2007). http://jmlr.csail.mit.edu/papers/v8/biehl07a.html

Questions?

Central Limit Theorem
• Let x_1, x_2, ..., x_N be independent random numbers drawn from an arbitrary probability distribution with finite mean and variance.
• The distribution of the average of the x_j approaches a normal distribution as N becomes large.
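A minimal numerical version of this statement, assuming an exponential source distribution purely for illustration (the slides do not specify which non-normal distribution they use); see the example figure below:

```python
import numpy as np

rng = np.random.default_rng(0)

def averages(n_values, n_terms):
    """Return n_values averages, each over n_terms i.i.d. draws from a
    deliberately non-normal (exponential) distribution with mean 1."""
    x = rng.exponential(scale=1.0, size=(n_values, n_terms))
    return x.mean(axis=1)

for n in (1, 2, 5, 50):
    avg = averages(100_000, n)
    # The mean stays at 1, the standard deviation shrinks like 1/sqrt(N),
    # and a histogram of `avg` looks increasingly Gaussian.
    print(f"N={n:3d}  mean={avg.mean():.3f}  std={avg.std():.3f}")
```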
Example: a non-normal distribution p(x_j) and the distribution of the average (1/N) Σ_{j=1}^{N} x_j.

[Figure: histograms of the average for N = 1, 2, 5, and 50; the distribution approaches a Gaussian.]

Self Averaging

Fluctuations decrease with a larger number of degrees of freedom N; as N → ∞ the fluctuations vanish (the variance of the characteristic quantities becomes zero).

[Figure: Monte Carlo simulations over 100 independent runs.]

"LVQ+/-"

Update the correct and the incorrect winner:

  w_j^μ = w_j^{μ−1} ± (η/N) (ξ^μ − w_j^{μ−1}),   j ∈ {s, t}

• d_s = min_k {d_k} with c_s = σ^μ: the closest prototype of the correct class is moved towards the datum (+)
• d_t = min_k {d_k} with c_t ≠ σ^μ: the closest prototype of a wrong class is moved away from it (−)

For p_+ >> p_- there is strong repulsion by the stronger class and the dynamics are strongly divergent! To overcome the divergence: e.g. early stopping, i.e. stop at ε_g(t) = ε_{g,min}; this is difficult in practice.

[Figure: ε_g(t), showing the divergence and the early-stopping point ε_{g,min}.]

Comparison of LVQ1 and LVQ+/- (c = {+1, +1, −1})

• υ_+ = υ_- = 1.0: LVQ1 outperforms LVQ+/- with early stopping
• υ_+ = 0.81, υ_- = 0.25: LVQ+/- with early stopping outperforms LVQ1 in a certain interval of p_+
• the performance of LVQ+/- depends on the initial conditions
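For concreteness, here is a minimal sketch of the two update rules being compared, following the definitions above; the function names and the NumPy representation are mine, not from the slides:

```python
import numpy as np

def lvq1_step(w, c, xi, sigma, eta, N):
    """One LVQ1 step: the overall winner moves towards the example if the
    classes agree (psi = +1) and away from it otherwise (psi = -1)."""
    s = int(np.argmin(np.sum((w - xi) ** 2, axis=1)))
    psi = 1.0 if c[s] == sigma else -1.0
    w[s] += (eta / N) * psi * (xi - w[s])

def lvq_pm_step(w, c, xi, sigma, eta, N):
    """One LVQ+/- step: the closest prototype of the correct class is
    attracted, the closest prototype of a wrong class is repelled.
    Assumes both classes occur among the prototype labels c."""
    d = np.sum((w - xi) ** 2, axis=1)
    correct = np.flatnonzero(c == sigma)
    wrong = np.flatnonzero(c != sigma)
    s = correct[np.argmin(d[correct])]
    t = wrong[np.argmin(d[wrong])]
    w[s] += (eta / N) * (xi - w[s])
    w[t] -= (eta / N) * (xi - w[t])
```

Because of the divergence for p_+ >> p_-, a practical LVQ+/- run would additionally track ε_g on held-out data and keep the prototype configuration with the smallest error seen so far, which is exactly the early stopping discussed above.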