Dynamics of Learning VQ
and Neural Gas
Aree Witoelar, Michael Biehl
Mathematics and Computing Science
University of Groningen, Netherlands
in collaboration with Barbara Hammer (Clausthal) and Anarta Ghosh (Groningen)
Dagstuhl Seminar, 25.03.2007
Outline
 Vector Quantization (VQ)
 Analysis of VQ Dynamics
 Learning Vector Quantization (LVQ)
 Summary
Vector Quantization
Objective:
representation of (many) data with (few) prototype vectors
Assign data ξ^μ to the nearest prototype vector w_j (by a distance measure, e.g. Euclidean).

Find the optimal set W with the lowest quantization error

$E(W) = \sum_{\mu=1}^{P} d\big(\xi^\mu, w_{j(\mu)}\big)$

i.e. the distance of each data point to its nearest prototype $w_{j(\mu)}$, summed over the data.

This groups the data into clusters, e.g. for classification.
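As a concrete illustration (not part of the original slides), a minimal NumPy sketch of this quantization error, assuming squared Euclidean distance for d:

import numpy as np

def quantization_error(data, prototypes):
    # data: (P, N) array of inputs; prototypes: (K, N) array of prototype vectors
    # squared Euclidean distance from every data point to every prototype
    dists = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # E(W): distance to the nearest prototype, summed over all data
    return dists.min(axis=1).sum()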
Example: Winner Takes All (WTA)
• initialize K prototype vectors
• present a single example
• identify the closest prototype, i.e. the so-called winner
• move the winner even closer towards the example
• prototypes end up in regions with a high density of data
• stochastic gradient descent with respect to a cost function
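A minimal sketch of one WTA update step (my illustration, not from the slides; here eta plays the role of η/N in the talk's scaling):

import numpy as np

def wta_step(prototypes, xi, eta):
    # identify the closest prototype (the winner) for example xi
    dists = ((prototypes - xi) ** 2).sum(axis=1)
    winner = np.argmin(dists)
    # move the winner even closer towards the example
    prototypes[winner] += eta * (xi - prototypes[winner])
    return prototypes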
Problems

Winner Takes All:
• sensitive to initialization

"Winner takes most": update according to "rank", e.g. Neural Gas:
• less sensitive to initialization?
(L)VQ algorithms
• intuitive
• fast, powerful algorithms
• flexible
• limited theoretical background w.r.t. convergence speed,
robustness to initial conditions, etc.
Analysis of VQ Dynamics
• exact mathematical description in very high dimensions
• study of typical learning behavior
Model: two Gaussian clusters of high dimensional data
Random vectors ξ ∈ ℝ^N are drawn according to

$P(\xi) = \sum_{\sigma=\pm 1} p_\sigma\, P(\xi \mid \sigma), \qquad P(\xi \mid \sigma) = \mathcal{N}(\ell B_\sigma,\, \upsilon_\sigma)$

• classes: σ ∈ {+1, −1}
• prior probabilities: p_+, p_- with p_+ + p_- = 1
• cluster centers: B_+, B_- ∈ ℝ^N
• variances: υ_+, υ_-; separation: ℓ

The clusters are separable in the projection onto the (B_+, B_-) plane, but not in other projections: the data are separable only in 2 dimensions → a simple model, but not trivial.
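A sketch of how such data could be generated (my illustration; it assumes orthonormal centers B_+ = e_1, B_- = e_2 and uses parameter values from later slides as defaults):

import numpy as np

def sample_data(P, N, ell=1.0, p_plus=0.6, v_plus=1.5, v_minus=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # orthonormal cluster centers B+ and B- (here: unit vectors e1 and e2)
    B_plus = np.zeros(N); B_plus[0] = 1.0
    B_minus = np.zeros(N); B_minus[1] = 1.0
    # class labels sigma = +/-1 drawn with prior probabilities p+ and p-
    sigma = np.where(rng.random(P) < p_plus, 1, -1)
    centers = np.where((sigma == 1)[:, None], ell * B_plus, ell * B_minus)
    variances = np.where(sigma == 1, v_plus, v_minus)
    # isotropic Gaussian noise with class-dependent variance around the centers
    xi = centers + np.sqrt(variances)[:, None] * rng.standard_normal((P, N))
    return xi, sigma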
Online learning
A sequence of independent random data {ξ^μ, σ^μ}, μ = 1, 2, 3, ..., P, drives the update of the prototype vectors w_s ∈ ℝ^N:

$w_s^\mu = w_s^{\mu-1} + \frac{\eta}{N}\, f_s\big[\mathrm{rank}_s, c_s, \sigma^\mu, \ldots\big]\,\big(\xi^\mu - w_s^{\mu-1}\big)$

• η/N: learning rate (step size)
• f_s[…]: strength and direction of the update; this modulation function describes the algorithm used
• (ξ^μ − w_s^{μ−1}): moves the prototype towards the current data point
• c_s, σ^μ ∈ {+1, −1}: prototype class and data class
• rank_s ∈ {1, 2, ..., K}, with rank 1 the "winner"
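A generic sketch of this update (my illustration; f_s is a pluggable function of the rank, the prototype class and the data class):

import numpy as np

def online_step(w, c, xi, sigma, eta, f_s):
    # w: (K, N) prototypes; c: (K,) prototype classes; xi: (N,) example of class sigma
    N = xi.shape[0]
    dists = ((w - xi) ** 2).sum(axis=1)
    ranks = dists.argsort().argsort() + 1        # rank 1 = winner
    for s in range(w.shape[0]):
        # f_s sets the strength and direction of the update and defines the algorithm
        w[s] += (eta / N) * f_s(ranks[s], c[s], sigma) * (xi - w[s])
    return w

For example, plain WTA is recovered with f_s = lambda rank, c_s, sigma: 1.0 if rank == 1 else 0.0.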
1. Define a few characteristic quantities of the system:

$R_{s\sigma}^\mu = w_s^\mu \cdot B_\sigma$   (projections onto the cluster centers)
$Q_{st}^\mu = w_s^\mu \cdot w_t^\mu$   (lengths and overlaps of the prototypes)

with s, t ∈ {1, 2, ..., K} and σ ∈ {−1, +1}.

2. Derive recursion relations of these quantities for a new input:

$N\big(R_{s\sigma}^\mu - R_{s\sigma}^{\mu-1}\big) = \eta\, f_s[\ldots]\,\big(b_\sigma^\mu - R_{s\sigma}^{\mu-1}\big)$

$N\big(Q_{st}^\mu - Q_{st}^{\mu-1}\big) = \eta\, f_s[\ldots]\,\big(h_t^\mu - Q_{st}^{\mu-1}\big) + \eta\, f_t[\ldots]\,\big(h_s^\mu - Q_{st}^{\mu-1}\big) + \eta^2 f_s[\ldots]\, f_t[\ldots] + \mathcal{O}(1/N)$

The random vector ξ^μ enters only through its projections $h_s^\mu = w_s^{\mu-1} \cdot \xi^\mu$ and $b_\sigma^\mu = B_\sigma \cdot \xi^\mu$.

3. Calculate the averaged recursions.
In the thermodynamic limit N → ∞ ...

• the projections h_s and b_σ become correlated Gaussian quantities, completely specified in terms of their first and second conditional moments:

$\langle h_s \rangle_\sigma = \ell\, R_{s\sigma}, \qquad \langle b_\tau \rangle_\sigma = \ell \ \text{if } \tau = \sigma, \ 0 \ \text{else}$

$\langle h_s h_t \rangle_\sigma - \langle h_s \rangle_\sigma \langle h_t \rangle_\sigma = \upsilon_\sigma\, Q_{st}$

$\langle h_s b_\tau \rangle_\sigma - \langle h_s \rangle_\sigma \langle b_\tau \rangle_\sigma = \upsilon_\sigma\, R_{s\tau}$

$\langle b_\tau b_\rho \rangle_\sigma - \langle b_\tau \rangle_\sigma \langle b_\rho \rangle_\sigma = \upsilon_\sigma \ \text{if } \tau = \rho, \ 0 \ \text{else}$

• the characteristic quantities R_sσ^μ, Q_st^μ self-average with respect to the random sequence of data (fluctuations vanish)

• define a continuous learning time t = μ/N, where μ is discrete (1, 2, ..., P) and t is continuous
4. Derive ordinary differential equations:

$\frac{dR_{s\sigma}}{dt} = \eta\, \big\langle f_s[\ldots]\, (b_\sigma - R_{s\sigma}) \big\rangle$

$\frac{dQ_{st}}{dt} = \eta\, \big\langle f_s[\ldots]\, (h_t - Q_{st}) \big\rangle + \eta\, \big\langle f_t[\ldots]\, (h_s - Q_{st}) \big\rangle + \eta^2\, \big\langle f_s[\ldots]\, f_t[\ldots] \big\rangle$

5. Solve for R_sσ(t), Q_st(t):
• dynamics and asymptotic behavior (t → ∞)
• quantization/generalization error
• sensitivity to initial conditions, learning rates, structure of the data
Results: VQ with 2 prototypes

$w_s^\mu = w_s^{\mu-1} + \frac{\eta}{N} \prod_{j \neq s} \Theta\big(d_j^\mu - d_s^\mu\big)\, \big(\xi^\mu - w_s^{\mu-1}\big)$   (w_s is the winner)

[Figure: numerical integration of the ODEs with w_s(0) ≈ 0, p_+ = 0.6, ℓ = 1.0, υ_+ = 1.5, υ_- = 1.0, η = 0.01; the characteristic quantities R_1±, R_2±, Q_11, Q_22, Q_12 and the quantization error E(W), plotted against t = μ/N]
2 prototypes vs. 3 prototypes

[Figure: projections of the prototypes onto the (B_+, B_-) plane at t = 50, shown in (R_S+, R_S-) coordinates, with separation ℓ and p_+ > p_-]

With three prototypes, two of them move to the stronger cluster.
Neural Gas: a winner-takes-most algorithm

3 prototypes:

$w_s^\mu = w_s^{\mu-1} + \frac{\eta}{N}\, \frac{1}{C}\, \exp\!\big(-\mathrm{rank}_s / \lambda(t)\big)\, \big(\xi^\mu - w_s^{\mu-1}\big)$

• the update strength decreases exponentially with the rank
• λ(t) is large initially and decreased over time (here λ_i = 2, λ_f = 10^{-2})
• λ(t) → 0: identical to WTA

[Figure: prototype projections in the (R_S+, R_S-) plane from t = 0 to t = 50, and the quantization error E(W) against t]
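A sketch of one Neural Gas step (my illustration; it uses zero-based ranks so that the winner gets the full update, and normalizes by C = Σ_s exp(−rank_s/λ); λ would be annealed from λ_i to λ_f over the run):

import numpy as np

def neural_gas_step(w, xi, eta, lam):
    # rank the prototypes by distance to the example (rank 0 = winner)
    dists = ((w - xi) ** 2).sum(axis=1)
    ranks = dists.argsort().argsort()
    # update strength decays exponentially with rank; C normalizes the weights
    g = np.exp(-ranks / lam)
    g /= g.sum()
    w += eta * g[:, None] * (xi - w)
    return w

# a common annealing schedule (an assumption, not from the slides):
# lam(t) = lam_i * (lam_f / lam_i) ** (t / t_max)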
Sensitivity to initialization

[Figure: prototype projections in the (R_S+, R_S-) plane at t = 50 for WTA and for Neural Gas, and the quantization error E(W) against t; WTA can linger on a "plateau" where ∇H_VQ ≈ 0]

WTA:
• (eventually) reaches the minimum of E(W)
• depends on the initialization: possibly long learning times

Neural Gas:
• more robust w.r.t. initialization
Learning Vector Quantization (LVQ)

Objective: classification of data using prototype vectors.

Assign data {ξ, σ}, ξ ∈ ℝ^N, to the nearest of the labeled prototype vectors {w_s, c_s}, w_s ∈ ℝ^N (by a distance measure, e.g. Euclidean).

Find the optimal set W with the lowest generalization error

$\varepsilon_g = \big\langle \lambda(\sigma, c_j) \big\rangle, \qquad \lambda(\sigma, c_j) = \begin{cases} 1 & \text{if } \sigma \neq c_j \\ 0 & \text{else} \end{cases}$

i.e. the probability that data is misclassified by the nearest prototype w_j.
LVQ1: update the winner towards or away from the data

$w_s^\mu = w_s^{\mu-1} + \frac{\eta}{N}\, c_s\, \sigma^\mu \prod_{j \neq s} \Theta\big(d_j^\mu - d_s^\mu\big)\, \big(\xi^\mu - w_s^{\mu-1}\big)$   (w_s is the winner; c_s σ^μ = ±1)

• no cost function related to the generalization error

Two prototypes vs. three prototypes: to which class should the 3rd prototype be assigned?
c = {+1, −1};  c = {+1, −1, −1};  c = {+1, +1, −1}

[Figure: prototype projections in the (R_S+, R_S-) plane for the three label configurations]
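A sketch of one LVQ1 step (my illustration; the sign c_s σ attracts the winner when its label matches the data and repels it otherwise):

import numpy as np

def lvq1_step(w, c, xi, sigma, eta):
    # winner: the prototype nearest to the example
    dists = ((w - xi) ** 2).sum(axis=1)
    s = np.argmin(dists)
    # c[s] * sigma = +1: move towards the data; -1: move away
    w[s] += eta * (c[s] * sigma) * (xi - w[s])
    return w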
Generalization error

$\varepsilon_g = \sum_{\sigma = \pm 1} p_\sigma \Big\langle \sum_{s=1}^{K} \prod_{j \neq s} \Theta\big(d_j - d_s\big)\, \lambda(c_s, \sigma) \Big\rangle_\sigma$   (probability of misclassified data)

[Figure: ε_g against t for p_+ = 0.6, p_- = 0.4, υ_+ = 1.5, υ_- = 1.0]
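An empirical counterpart on sampled data (my illustration): classify each point by its nearest prototype and count the misclassifications:

import numpy as np

def empirical_error(data, labels, w, c):
    # nearest-prototype classification of each data point
    dists = ((data[:, None, :] - w[None, :, :]) ** 2).sum(axis=2)
    predicted = c[dists.argmin(axis=1)]
    # fraction of misclassified data: an estimate of eps_g
    return np.mean(predicted != labels)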
Optimal decision boundary

(hyper)plane where $p_+ P(\xi \mid \sigma = +1) = p_- P(\xi \mid \sigma = -1)$

• equal variances (υ_+ = υ_-): linear decision boundary
• unequal variances (υ_+ > υ_-): curved decision boundary

[Figure: the (B_+, B_-) plane with separation ℓ, p_+ > p_-, and K = 2 prototypes; the optimal boundary compared with the boundary obtained with K = 3]

More prototypes give a better approximation to the optimal decision boundary.
Asymptotic ε_g for υ_+ > υ_- (υ_+ = 0.81, υ_- = 0.25)

[Figure: ε_g(t → ∞) against p_+ for c = {+1, +1, −1} and for c = {+1, −1, −1}]

c = {+1, +1, −1}:
• Optimal: K = 3 better than K = 2
• LVQ1: K = 3 better

c = {+1, −1, −1}:
• Optimal: K = 3 equal to K = 2
• LVQ1: K = 3 worse

• more prototypes are not always better for LVQ1
• best: place more prototypes on the class with the larger variance
Summary
 dynamics of (Learning) Vector Quantization for high
dimensional data
 Neural Gas: more robust w.r.t. initialization than WTA
 LVQ1: more prototypes not always better
Outlook
 study different algorithms e.g. LVQ+/-, LFM, RSLVQ
 more complex models
 multi-prototype, multi-class problems
Reference
Dynamics and Generalization Ability of LVQ Algorithms
M. Biehl, A. Ghosh, and B. Hammer
Journal of Machine Learning Research (8): 323-360 (2007)
http://jmlr.csail.mit.edu/papers/v8/biehl07a.html
Questions?
Central Limit Theorem

• Let x_1, x_2, ..., x_N be independent random numbers drawn from an arbitrary probability distribution with finite mean and variance.
• The distribution of the average of the x_j approaches a normal distribution as N becomes large.

[Figure: example of a non-normal distribution p(x_j), and the distribution of the average $\frac{1}{N} \sum_{j=1}^{N} x_j$ for N = 1, 2, 5, 50]
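A quick numerical check (my illustration, using a uniform distribution, which is clearly non-normal):

import numpy as np

rng = np.random.default_rng(0)
for N in (1, 2, 5, 50):
    # 100000 averages of N independent uniform(0, 1) draws
    means = rng.random((100_000, N)).mean(axis=1)
    # the CLT predicts the std of the average to shrink like 1/sqrt(N)
    print(f"N={N:3d}  std={means.std():.4f}  CLT prediction={np.sqrt(1.0 / (12 * N)):.4f}")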
Self-Averaging

• Fluctuations decrease with a larger number of degrees of freedom N.
• As N → ∞, the fluctuations vanish (the variance becomes zero).

[Figure: Monte Carlo simulations over 100 independent runs]
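A sketch of such a Monte Carlo experiment (my illustration, using the WTA dynamics and model parameters from the talk; the run-to-run variance of R_{1,+} at a fixed learning time should shrink as N grows):

import numpy as np

def final_R1plus(N, rng, t_max=10.0, eta=0.01,
                 p_plus=0.6, ell=1.0, v_plus=1.5, v_minus=1.0):
    # orthonormal cluster centers
    B_plus = np.zeros(N); B_plus[0] = 1.0
    B_minus = np.zeros(N); B_minus[1] = 1.0
    w = 1e-3 * rng.standard_normal((2, N))         # w_s(0) ~ 0
    for _ in range(int(t_max * N)):                # learning time t = mu / N
        sigma = 1 if rng.random() < p_plus else -1
        B, v = (B_plus, v_plus) if sigma == 1 else (B_minus, v_minus)
        xi = ell * B + np.sqrt(v) * rng.standard_normal(N)
        winner = np.argmin(((w - xi) ** 2).sum(axis=1))
        w[winner] += (eta / N) * (xi - w[winner])  # WTA update
    return w[0] @ B_plus                           # R_{1,+} at t = t_max

rng = np.random.default_rng(1)
for N in (50, 200, 800):
    runs = [final_R1plus(N, rng) for _ in range(100)]
    print(f"N={N:4d}  mean={np.mean(runs):.3f}  variance={np.var(runs):.2e}")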
"LVQ +/-": update the closest correct and the closest incorrect prototype

$w_j^\mu = w_j^{\mu-1} + \frac{\eta}{N}\, c_j\, \sigma^\mu\, \big(\xi^\mu - w_j^{\mu-1}\big), \qquad j \in \{s, t\}$

with $d_s = \min_k \{d_k\}$ over prototypes with c_s = σ^μ (closest correct prototype) and $d_t = \min_k \{d_k\}$ over prototypes with c_t ≠ σ^μ (closest incorrect prototype).

p_+ ≫ p_-: strong repulsion by the stronger class → strongly divergent!

To overcome the divergence: e.g. early stopping (difficult in practice), i.e. stop at ε_g(t) = ε_{g,min}.

[Figure: ε_g(t) against t, showing the divergence and the early-stopping point]
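A sketch of one LVQ +/- step (my illustration; the attract/repel pair follows the update rule above):

import numpy as np

def lvq_pm_step(w, c, xi, sigma, eta):
    dists = ((w - xi) ** 2).sum(axis=1)
    correct = (c == sigma)
    # closest prototype with the correct label, and closest with a wrong label
    s = np.where(correct)[0][np.argmin(dists[correct])]
    t = np.where(~correct)[0][np.argmin(dists[~correct])]
    w[s] += eta * (xi - w[s])   # attract the correct winner
    w[t] -= eta * (xi - w[t])   # repel the incorrect winner
    return w

# early stopping (an assumption about the implementation, not from the slides):
# track eps_g on held-out data during the run and keep the prototype
# configuration from the minimum of eps_g(t).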
Comparison of LVQ1 and LVQ +/-, c = {+1, +1, −1}

[Figure: asymptotic ε_g against p_+ for υ_+ = υ_- = 1.0 and for υ_+ = 0.81, υ_- = 0.25]

• υ_+ = υ_- = 1.0: LVQ1 outperforms LVQ +/- with early stopping
• υ_+ = 0.81, υ_- = 0.25: LVQ +/- with early stopping outperforms LVQ1 in a certain interval of p_+
• LVQ +/-: the performance depends on the initial conditions