Robust Scalable Visualized Clustering in Vector and non Vector Semimetric Spaces, G. Fox - Parallel Processing Letters, 2013 - World Scientific

The Summary

Mashar Cenk Gencal
School of Informatics and Computing, Indiana University, Bloomington
mcgencal@indiana.edu

Abstract

In this paper, a new approach to data analytics for large systems is presented, based on robust parallel algorithms that run on clouds and HPC systems. Both the case where the data is described in a vector space and the case where only pairwise distances between points are given are covered. Moreover, several known algorithms are improved in functionality, features, and performance. Furthermore, the paper discusses steering complex analytics, where visualization is important both for the non-vector semimetric case and for clustering in high-dimensional vector spaces. Deterministic annealing is used to obtain fairly fast, robust algorithms. The methods are applied to several life-science applications.

1 Introduction

The publication first defines deterministic annealing and then considers clustering, with a highlight on features that are not found in openly available software such as R. After that, the difficulty of parallelization for heterogeneous geometries is discussed. Finally, in the last section, it is suggested that a data mining algorithm should be used together with some form of visualization, since most datasets do not show by themselves whether an analysis is adequate. For this reason, it is advised that "The well studied area of dimension reduction deserves more systematic use and show how one can map high dimension vector and non vector spaces into 3 dimensions for this purpose." [1]

2 Deterministic Annealing

Before explaining the idea in the publication, we need to explain deterministic annealing, the Gibbs distribution, and the free energy F.

2.1 Gibbs Distribution, Free Energy

Gibbs measure:

    Pr(x ∈ C_k) = exp(−β E_x(k)) / Z_x                                  (1)

where β = 1/T, x is a point, C_k is the k-th cluster, E_x(k) is the energy function, and Z_x is the partition function,

    Z_x = Σ_{k=1..K} exp(−β E_x(k))                                     (2)

- if β → 0, each point is equally associated with all clusters.
- if β → ∞, each point belongs to exactly one cluster.

Total partition function:

    Z = Π_x Z_x                                                         (3)

Expected energy of x:

    E_x = Σ_{k=1..K} Pr(x ∈ C_k) E_x(k)                                 (4)

Thus, the total energy is

    E = Σ_x E_x                                                         (5)

If F is the effective cost function (free energy), then from equation (3),

    Z = exp(−βF), so F = −(1/β) log Z = −(1/β) Σ_x log Z_x              (6)

Therefore, if β → ∞, then F = E. For clustering we take

    E_x(k) = |x − y_k|²,                                                (7)

where y_k is the representative (center) of C_k. Thus, from equations (1), (2) and (7), the probability that x belongs to C_k is

    Pr(x ∈ C_k) = exp(−β|x − y_k|²) / Σ_{k'=1..K} exp(−β|x − y_k'|²)    (8)

From equations (6) and (8),

    F = −(1/β) Σ_x log( Σ_{k=1..K} exp(−β|x − y_k|²) )                  (9)

Setting the derivative of this function to zero gives

    ∂F/∂y_k = 0  ⇒  y_k = Σ_x x Pr(x ∈ C_k) / Σ_x Pr(x ∈ C_k)

Solving this condition self-consistently by iteration gives y^(n+1) = f(y^(n)), where f is given by

    y_k^(n+1) = [ Σ_x x · exp(−β|x − y_k^(n)|²) / Σ_{j=1..K} exp(−β|x − y_j^(n)|²) ]
              / [ Σ_x exp(−β|x − y_k^(n)|²) / Σ_{j=1..K} exp(−β|x − y_j^(n)|²) ],  ∀k    (10)

2.2 Deterministic Annealing Algorithm

1. Set β = 0.
2. Choose an arbitrary set of initial cluster means.
3. Iterate according to equation (10) until converged.
4. Increase β.
5. If β < β_max, go to step 3.
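To make the algorithm above concrete, the following is a minimal Python/NumPy sketch of the annealing loop, iterating the fixed point of equation (10) at each temperature with the squared Euclidean energy of equation (7). The function name and the schedule constants (initial β, growth factor, convergence tolerance) are illustrative assumptions, not values taken from the paper.

    import numpy as np

    def da_clustering(X, K, beta0=0.01, beta_max=1e3, growth=1.05, tol=1e-6):
        """Deterministic annealing clustering: anneal beta upward,
        iterating the fixed point of eq. (10) to convergence at each beta."""
        N, d = X.shape
        rng = np.random.default_rng(0)
        Y = X[rng.choice(N, size=K, replace=False)].copy()  # step 2: arbitrary initial means
        beta = beta0                                        # step 1: near-zero beta
        while beta < beta_max:                              # step 5
            while True:                                     # step 3: iterate eq. (10)
                # eq. (8): soft assignments Pr(x in C_k), computed stably
                d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # N x K
                logits = -beta * d2
                logits -= logits.max(axis=1, keepdims=True)
                P = np.exp(logits)
                P /= P.sum(axis=1, keepdims=True)
                # eq. (10): centers become probability-weighted means
                Y_new = (P.T @ X) / P.sum(axis=0)[:, None]
                converged = np.abs(Y_new - Y).max() < tol
                Y = Y_new
                if converged:
                    break
            beta *= growth                                  # step 4: increase beta
        return Y, P

At very small β the assignments are nearly uniform, so all centers start near the overall data mean; raising β gradually sharpens the assignments, which is exactly the annealing behavior described above.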
2.3 The Idea in The Publication

In the publication, optimization problems are treated by finding global minima of energy functions. Since the Monte Carlo approach of simulated annealing is too slow, the integrals are carried out analytically, using various approximations from statistical physics. The Hamiltonian H(χ) is to be minimized over the variables χ. The idea is to average with the Gibbs distribution at temperature T; let P(χ) denote the probability distribution for χ at temperature T,

    P(χ) = exp(−H(χ)/T) / ∫ dχ exp(−H(χ)/T)                             (11)

or

    P(χ) = exp(−H(χ)/T + F/T)                                           (12)

and let S(P) denote the entropy associated with the probability distribution P(χ),

    S(P) = −P(χ) ln P(χ)                                                (13)

The free energy F, built from the objective function and the entropy, is then minimized:

    F = ⟨H − T S(P)⟩ = ∫ dχ [P(χ)H + T P(χ) ln P(χ)] = −T ln ∫ dχ exp(−H(χ)/T)   (14)

At each iteration, the temperature is lowered by a factor between 0.95 and 0.9995. Since the minimization is difficult in some cases, e.g. vector-space clustering and mixture models, an approximating Hamiltonian H0(χ, ε) is introduced:

    P0(χ) = exp(−H0(χ, ε)/T + F0/T)                                     (15)

    F(P0) = ⟨H − T S0(P0)⟩|0 = ⟨H − H0⟩|0 + F0(P0),
    where F0(P0) = −T ln ∫ dχ exp(−H0(χ, ε)/T)                          (16)

Here ⟨…⟩|0 denotes averaging with the Gibbs distribution P0(χ), i.e. ∫ dχ P0(χ)·(…), and ε is held fixed in these integrals. Since P minimizes F(P), we have

    F(P) ≤ F(P0)                                                        (17)

This bound is the basis of the EM method, which calculates the expected value of the log-likelihood function and then finds the parameters maximizing that quantity. Therefore,

    ⟨χ⟩ = ⟨χ⟩|0 = ∫ dχ χ P0(χ)                                          (18)

For the new Hamiltonian there are three sets of variables: ε, χ, θ. It is described that "The first variable set ε are used to approximate the real Hamiltonian H(χ, θ) by H0(ε, χ, θ); the second variable set χ are subject to annealing while one can follow the determination of χ by finding yet other parameters θ optimized by traditional methods." [1]

It is noted that the deterministic annealing idea has some difficulties, such as how to choose the approximating Hamiltonian and its parameters. It is also challenging to minimize the free energy of equation (16).

3 Vector Space Clustering

3.1 Basic Central or Vector Space Clustering

Assume a clustering problem with N points labeled by x and K clusters labeled by k. Let M_x(k) be the probability that point x belongs to cluster k, with the constraint

    Σ_{k=1..K} M_x(k) = 1 for each x.

Then

    H0 = Σ_{x=1..N} Σ_{k=1..K} M_x(k) ε_x(k)                            (19)

where

    ε_x(k) = (X(x) − Y(k))²                                             (20)

with X(x) the position of point x and Y(k) the center of cluster k. Clearly, equation (19) can be written as

    H_central = H0 = Σ_{x=1..N} Σ_{k=1..K} M_x(k) (X(x) − Y(k))²        (21)

Rewriting equations (15) and (16) for this case,

    P0(χ) = exp(−H0(χ, ε)/T + F/T)                                      (22)

    F(P0) = −T ln Σ_χ exp(−H0(χ, ε)/T)                                  (23)

Applying equations (21) and (23) to equation (22), the distribution factorizes over the points:

    P0(χ) = Π_{x=1..N} exp(−Σ_{k=1..K} M_x(k) ε_x(k)/T) / Σ_{k=1..K} exp(−ε_x(k)/T)   (24)

The optimal vectors {y_v} are derived by maximizing the entropy of the Gibbs distribution [5]:

    y_α = argmax_{y_v} ( −Σ P0 ln P0(χ) )
        = argmax_{y_v} ( Σ_χ H0(χ, ε) P0(χ)/T + Σ_{x=1..N} ln Σ_{k=1..K} exp(−ε_x(k)/T) )   (25)

Taking the derivative of equation (25) and setting the result to zero gives

    0 = Σ_{x=1..N} ⟨M_x(k)⟩ ∂ε_x(k)/∂y_v,  where
    ⟨M_x(k)⟩ = exp(−ε_x(k)/T) / Σ_{k'=1..K} exp(−ε_x(k')/T)             (26)

This is the E step; the cluster centers Y(k) are then easily determined in the M step [1]:

    Y(k) = Σ_{x=1..N} ⟨M_x(k)⟩ X(x) / Σ_{x=1..N} ⟨M_x(k)⟩               (27)
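Equations (26) and (27) are precisely an E step and an M step. The following minimal sketch spells them out in Python/NumPy, together with the free energy of equation (23) specialized to the central Hamiltonian (21), which can be monitored during the iteration; the helper names (e_step, m_step, free_energy) are my own choices, not the paper's.

    import numpy as np

    def e_step(X, Y, T):
        """eq. (26): <M_x(k)> as a softmax of -eps_x(k)/T, with eps_x(k) = (X(x)-Y(k))^2."""
        eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # N x K
        logits = -eps / T
        logits -= logits.max(axis=1, keepdims=True)               # numerical stability
        M = np.exp(logits)
        return M / M.sum(axis=1, keepdims=True)

    def m_step(X, M):
        """eq. (27): Y(k) as the <M_x(k)>-weighted mean of the points."""
        return (M.T @ X) / M.sum(axis=0)[:, None]

    def free_energy(X, Y, T):
        """eq. (23)/(9): F = -T * sum_x log sum_k exp(-eps_x(k)/T), via log-sum-exp."""
        eps = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
        m = (-eps / T).max(axis=1)
        return -T * (m + np.log(np.exp(-eps / T - m[:, None]).sum(axis=1))).sum()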
The non-vector case is not discussed in detail. However, we have the equation

    H_nonvector = Σ_{x=1..N} Σ_{y=1..N} d(x, y) Σ_{k=1..K} M_x(k) M_y(k) / C(k)   (28)

where C(k) = Σ_{x=1..N} M_x(k) is the number of points in the k-th cluster and Σ_{k=1..K} C(k) = N is the total number of points.

3.2 Fuzzy Clustering and K-means

If the temperature is very high, the exponent in equation (26) goes to zero and all centers have equal probability. In that case, all centers take the same position,

    Y(k) = Σ_{x=1..N} X(x) / N                                          (29)

As the temperature decreases, the points gradually attach more closely to one center, and in the zero-temperature limit the K-means solution is recovered by assigning each point to its nearest center (both limits are illustrated in the short numeric check below). Figure 1 illustrates the idea behind annealing: at high (infinite) temperature the objective function is smooth, so no false minima are encountered and the free energy is minimized. If the temperature is then decreased slowly, we approach the true minima at lower temperatures. Hence, annealing is robust as long as the temperature is lowered slowly enough. [1] In the figure, temperatures T2 and T3 show the false minima that appear at finite temperature. This process is approximated in deterministic annealing; the approximation consists of modifying the integrals by evaluating functions at their mean values.

3.3 Multiscale (Hierarchical) and Continuous Clustering
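The following small numeric check, reusing e_step and m_step from the sketch in Section 3.1 on invented two-cluster data, illustrates the two limits discussed in Section 3.2: at high temperature the assignments are nearly uniform and all centers collapse to the data mean of equation (29), while as T → 0 the assignments become hard and each point goes to its nearest center, as in K-means.

    import numpy as np

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(-2.0, 0.3, (50, 2)),   # blob around (-2, -2)
                   rng.normal(+2.0, 0.3, (50, 2))])  # blob around (+2, +2)
    Y = X[rng.choice(len(X), size=2, replace=False)]

    M_hot = e_step(X, Y, T=1e6)    # high T: exponent ~ 0, so <M_x(k)> ~ 1/K
    print(M_hot[0])                # -> approximately [0.5, 0.5]
    print(m_step(X, M_hot))        # -> both rows near the data mean, eq. (29)

    M_cold = e_step(X, Y, T=1e-6)  # T -> 0: essentially one-hot assignments
    print(M_cold[0])               # -> nearest center wins, the K-means limit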