Summary of Robust Scalable Visualized Clustering in Vector and non Vector Semimetric Spaces

Robust Scalable Visualized Clustering in Vector and non Vector Semimetric Spaces, G. Fox - Parallel Processing Letters, 2013 - World Scientific
The Summary
Mashar Cenk Gencal
School of Informatics and Computing,
Indiana University, Bloomington
mcgencal@indiana.edu
Abstract
In this paper, a new approach to data analytics for large systems is presented, based on robust parallel algorithms that run on clouds and HPC systems. The algorithms cover both the case where the data is described in a vector space and the case where only pairwise distances between points are defined. Moreover, several well-known algorithms are improved in functionality, features, and performance. Furthermore, the steering of complex analytics is discussed, where visualization is important both for the non vector semimetric case and for clustering high dimension vector spaces. Deterministic annealing is used to obtain a fairly fast, robust algorithm. The method is applied to several life sciences applications.
1 Introduction
In the publication, deterministic annealing is first defined, and then clustering is considered with a highlight on some features that are not found in openly available software such as R. After that, the difficulty of parallelization for heterogeneous geometries is discussed. Finally, in the last section, it is suggested to use a data mining algorithm together with some form of visualization, since most datasets do not show whether or not an analysis is adequate. Because of that, it is advised that "The well studied area of dimension reduction deserves more systematic use and show how one can map high dimension vector and non vector spaces into 3 dimensions for this purpose." [1]
2 Deterministic Annealing
Before explaining the idea in the publication, we need to explain Deterministic Annealing, the Gibbs distribution, and the free energy F:
2.1 Gibbs Distribution, Free Energy
Gibbs Measure:
\Pr(x \in C_j) = \frac{e^{-\beta E_x(j)}}{Z_x}, \quad \text{where } \beta = \frac{1}{T}    (1)

x : point
C_j : j-th cluster
E_x(j) : energy function
Z_x : partition function

Z_x = \sum_{k=1}^{K} e^{-\beta E_x(k)}, \quad \text{where } K \text{ is the number of clusters}    (2)
- If β → 0, each point is equally associated with all clusters.
- If β → ∞, each point belongs to exactly one cluster (both limits are illustrated numerically in the short sketch below).
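As a minimal numerical illustration of these two limits (an added sketch, not taken from the publication), the following Python snippet evaluates equations (1) and (2) for a single point with three hypothetical cluster energies:

```python
import numpy as np

def gibbs_probabilities(energies, beta):
    """Equations (1)-(2): soft cluster membership of one point.

    energies[k] plays the role of E_x(k); beta = 1/T.
    """
    weights = np.exp(-beta * np.asarray(energies, dtype=float))
    z_x = weights.sum()          # partition function Z_x, equation (2)
    return weights / z_x         # Pr(x in C_k), equation (1)

# Hypothetical energies of one point with respect to three clusters.
energies = [0.5, 1.0, 4.0]

print(gibbs_probabilities(energies, beta=0.01))  # beta -> 0: close to [1/3, 1/3, 1/3]
print(gibbs_probabilities(energies, beta=50.0))  # beta -> infinity: close to [1, 0, 0]
```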
Total partition function:

Z = \prod_x Z_x    (3)

Expected energy of x:

E_x = \sum_{k=1}^{K} \Pr(x \in C_k)\, E_x(k)    (4)

Thus, the total energy is

E = \sum_x E_x    (5)
If F is the effective cost function (free energy), then from equation (3):

Z = e^{-\beta F} \quad \text{or} \quad F = -\frac{1}{\beta} \log Z = -\frac{1}{\beta} \sum_x \log Z_x    (6)

Therefore, if β → ∞, then F = E.
E_x(j) = \left| x - y_j \right|^2, \quad \text{where } y_j \text{ is the representative of } C_j    (7)
Thus, from equations (1), (2) and (7), the probability that x belongs to C_j is

\Pr(x \in C_j) = \frac{e^{-\beta |x - y_j|^2}}{\sum_{k=1}^{K} e^{-\beta |x - y_k|^2}}    (8)

From equations (6) and (8):

F = -\frac{1}{\beta} \sum_x \log \left( \sum_{k=1}^{K} e^{-\beta |x - y_k|^2} \right)    (9)
If we differentiate this function with respect to y_j and set the derivative to zero, we obtain

\frac{\partial F}{\partial y_j} = 0 \;\Rightarrow\; y_j = \frac{\sum_x x \,\Pr(x \in C_j)}{\sum_x \Pr(x \in C_j)}

Iterating this fixed-point condition derived from (9) gives

y^{(n+1)} = f\left(y^{(n)}\right),

where f is defined component-wise by

y_j^{(n+1)} = \frac{\sum_x x\, e^{-\beta |x - y_j^{(n)}|^2} \big/ \sum_{k=1}^{K} e^{-\beta |x - y_k^{(n)}|^2}}{\sum_x e^{-\beta |x - y_j^{(n)}|^2} \big/ \sum_{k=1}^{K} e^{-\beta |x - y_k^{(n)}|^2}}, \quad \forall j    (10)
2.2 Deterministic Annealing Algorithm
1. Set β = 0.
2. Choose an arbitrary set of initial cluster means.
3. Iterate according to equation (10) until converged.
4. Increase β.
5. If β < β_max, go to step 3.
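As an added illustration (not code from the publication), the following Python sketch implements this loop for points in R^d using the update of equation (10); the schedule constants beta0, growth, and beta_max are assumed illustrative values, not ones prescribed by the paper.

```python
import numpy as np

def da_cluster(X, n_clusters, beta0=1e-3, beta_max=1e3, growth=1.05,
               inner_iters=100, tol=1e-6, seed=0):
    """Deterministic annealing clustering following Section 2.2.

    X: (N, d) array of points. Returns a (K, d) array of cluster centers.
    """
    rng = np.random.default_rng(seed)
    # Step 2: arbitrary initial cluster means (random points drawn from X).
    Y = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)

    beta = beta0                                    # Step 1: start at small beta (high T)
    while beta < beta_max:                          # Steps 4-5: annealing loop
        for _ in range(inner_iters):                # Step 3: iterate equation (10)
            # Squared distances |x - y_k|^2, shape (N, K).
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
            # Soft memberships Pr(x in C_k), equation (8); row max subtracted for stability.
            logits = -beta * d2
            logits -= logits.max(axis=1, keepdims=True)
            P = np.exp(logits)
            P /= P.sum(axis=1, keepdims=True)
            # Membership-weighted means, equation (10).
            Y_new = (P.T @ X) / P.sum(axis=0)[:, None]
            converged = np.abs(Y_new - Y).max() < tol
            Y = Y_new
            if converged:
                break
        beta *= growth                              # Step 4: increase beta (lower T)
    return Y
```

For example, `Y = da_cluster(X, n_clusters=3)` on an (N, d) NumPy array `X` returns three centers; raising `growth` speeds up the schedule at the cost of the robustness that slow annealing provides.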
2.3 The Idea in The Publication
In the publication, optimization problems are approached by finding the global minima of energy functions. Since the Monte Carlo approach of simulated annealing is too slow, the integrals are performed analytically by using approximations from statistical physics. The Hamiltonian H(x) is minimized with respect to the variable x. Then, the idea of averaging with the Gibbs distribution at temperature T is introduced.
P(x) will denote the probability distribution for x at temperature T,

P(x) = \frac{e^{-H(x)/T}}{\int dx\, e^{-H(x)/T}}    (11)

or

P(x) = e^{-H(x)/T + F/T}    (12)

and S(P) will denote the entropy associated with the probability distribution P(x),

S(P) = -\int dx\, P(x) \ln P(x)    (13)

The free energy F, which is minimized, is written in terms of the objective function and the entropy as

F = \langle H \rangle - T\, S(P) = \int dx\, \bigl[ P(x)\,H(x) + T\,P(x)\ln P(x) \bigr] = -T \ln \int dx\, e^{-H(x)/T}    (14)
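The last equality in (14) is worth one added line of checking (not part of the summary itself): substituting the Gibbs form (11)-(12) into the integral,

```latex
% With Z = \int dx\, e^{-H(x)/T} and \ln P(x) = -H(x)/T - \ln Z:
\begin{aligned}
F &= \int dx\, \bigl[ P(x)\,H(x) + T\,P(x)\ln P(x) \bigr] \\
  &= \int dx\, \Bigl[ P(x)\,H(x) + T\,P(x)\bigl(-\tfrac{H(x)}{T} - \ln Z\bigr) \Bigr] \\
  &= -T \ln Z \int dx\, P(x) \;=\; -T \ln \int dx\, e^{-H(x)/T}.
\end{aligned}
```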
At each iteration the temperature is lowered gradually, typically by a multiplicative factor between 0.95 and 0.9995. Since this is difficult for some cases, e.g. vector space clustering and mixture models, a new Hamiltonian, H0(x, ε), is introduced.
P_0(x) = e^{-H_0(x,\varepsilon)/T + F/T}    (15)

F(P_0) = \langle H - T\, S_0(P_0) \rangle|_0 = \langle H - H_0 \rangle|_0 + F_0(P_0), \qquad F_0(P_0) = -T \ln \int dx\, e^{-H_0(x,\varepsilon)/T}    (16)

where ⟨…⟩|_0 denotes averaging with the Gibbs distribution P_0(x), i.e. \int dx\, P_0(x)\,(\cdots), and ε is kept fixed in these integrals.
Since P minimizes F(P), we have

F(P) \le F(P_0)    (17)

This is the basis of the EM method: calculate the expected value of the log-likelihood function (E step) and then find the parameters that maximize that quantity (M step). Therefore,

\langle x \rangle = \langle x \rangle|_0 = \int dx\, x\, P_0(x)    (18)
For the new Hamiltonian function there are three sets of variables: ε, x, φ. It is described that "The first variable set ε are used to approximate the real Hamiltonian H(x, φ) by H0(ε, x, φ); the second variable set x are subject to annealing while one can follow the determination of x by finding yet other parameters φ optimized by traditional method." [1]

It is noted that the deterministic annealing idea has some difficulties, such as how to choose the parameters. It is also challenging to minimize the free energy in equation (16).
3 Vector Space Clustering
3.1 Basic Central or Vector Space Clustering
It is assumed that we have a clustering problem with N points labeled by x and K clusters labeled by k. Let M_x(k) be the probability that point x belongs to cluster k, with the constraint

\sum_{k=1}^{K} M_x(k) = 1 \quad \text{for each } x

Then

H_0 = \sum_{x=1}^{N} \sum_{k=1}^{K} M_x(k)\, \varepsilon_x(k)    (19)

where

\varepsilon_x(k) = \bigl( X(x) - Y(k) \bigr)^2    (20)

X(x) : position of point x
Y(k) : center of cluster k
Clearly, we can write equation (19) as follows:

H_{central} = H_0 = \sum_{x=1}^{N} \sum_{k=1}^{K} M_x(k) \bigl( X(x) - Y(k) \bigr)^2    (21)
If we rewrite equations (15) and (16):

P_0(x) = e^{-H_0(x,\varepsilon)/T} \cdot e^{F/T}    (22)

F(P_0) = -T \ln \sum e^{-H_0(x,\varepsilon)/T}    (23)
After applying equations (21) and (23) to equation (22):

P_0 = \prod_{x=1}^{N} \frac{e^{-\sum_{k=1}^{K} M_x(k)\, \varepsilon_x(k)/T}}{\sum_{k=1}^{K} e^{-\varepsilon_x(k)/T}}    (24)
The optimal vectors {y_v} are derived by maximizing the entropy of the Gibbs distribution. [5]

y_\alpha = \arg\max_{\{y_v\}} \Bigl( -\sum P_0(x) \log P_0(x) \Bigr) = \arg\max_{\{y_v\}} \Bigl( \sum \frac{H_0(x,\varepsilon)\, P_0(x)}{T} + \sum_{x=1}^{N} \log \sum_{k=1}^{K} e^{-\varepsilon_x(k)/T} \Bigr)    (25)
If we take the derivative of equation (25) and set the result to zero:

0 = \sum_{x=1}^{N} \langle M_x(k) \rangle \frac{\partial \varepsilon_x(k)}{\partial y_v}, \qquad \langle M_x(k) \rangle = \frac{e^{-\varepsilon_x(k)/T}}{\sum_{k=1}^{K} e^{-\varepsilon_x(k)/T}}    (26)
The cluster centers Y(k) are then easily determined in the M step: [1]

Y(k) = \frac{\sum_{x=1}^{N} \langle M_x(k) \rangle\, X(x)}{\sum_{x=1}^{N} \langle M_x(k) \rangle}    (27)
The non vector case is not discussed in detail. However, we have the equation

H_{non\ vector} = \sum_{x=1}^{N} \sum_{y=1}^{N} d(x, y) \sum_{k=1}^{K} \frac{M_x(k)\, M_y(k)}{C(k)}    (28)

where C(k) = \sum_{x=1}^{N} M_x(k) is the number of points in the k-th cluster, and \sum_{k=1}^{K} C(k) = N is the total number of points.
3.2 Fuzzy Clustering and K-means
If the temperature is very high, the exponent in equation (26) goes to zero, so every point has equal membership probability in all clusters. In that case, all centers take the same position,

Y(k) = \frac{1}{N} \sum_{x=1}^{N} X(x)    (29)

As the temperature decreases, the points gradually associate with particular centers, and in the zero temperature limit the K-means solution is obtained by assigning each point to its nearest center.
Figure 1 illustrates the idea behind annealing: at high (infinite) temperature, where the objective function is smooth, we cannot get trapped in false minima, and the free energy is minimized. If we decrease the temperature slowly, we track this minimum down to lower temperatures. Hence, annealing is robust as long as the temperature is lowered slowly enough. [1] In the figure, temperatures T2 and T3 show the false minima that appear at finite temperature. This process is approximated in deterministic annealing. The approximation consists of modifying the integrals by evaluating functions at their mean values.
3.3 Multiscale (Hierarchical) and Continuous Clustering