
Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

A Probabilistic Approach to Latent Cluster Analysis

Zhipeng Xie, Rui Dong, Zhengheng Deng, Zhenying He, Weidong Yang

School of Computer Science

Fudan University, Shanghai, China

{xiezp, 11210240011, 11210240082, zhenying, wdyang}@fudan.edu.cn

Abstract

Facing a large number of clustering solutions, cluster ensemble methods provide an effective approach to aggregating them into a better one. In this paper, we propose a novel cluster ensemble method from a probabilistic perspective. It assumes that each clustering solution is generated from a latent cluster model, under the control of two probabilistic parameters. Thus, the cluster ensemble problem is reformulated as a maximum-likelihood optimization problem. An EM-style algorithm is designed to solve this problem; it can determine the number of clusters automatically. Experimental results show that the proposed algorithm outperforms the state-of-the-art methods EAC-AL, CSPA, HGPA, and MCLA. Furthermore, our algorithm is stable in the predicted numbers of clusters.

1 Introduction

The goal of cluster analysis is to discover the underlying structure of a dataset (Jain et al., 1999; Jain, 2010). It normally partitions a set of objects so that the objects within the same group are similar while those from different groups are dissimilar. A large number of clustering algorithms have been proposed, e.g. k-Means, Spectral Clustering, Hierarchical Clustering, and Self-Organizing Maps, to name but a few, yet no single one is able to achieve this goal for all datasets. On the same data, different algorithms, or even multiple runs of the same algorithm with different parameters, often lead to clustering solutions that are distinct from each other.

Confronted with a large number of clustering solutions, cluster ensemble (or clustering aggregation) methods have emerged, which try to combine different clustering solutions into a consensus one, in order to improve the quality of the component clustering solutions (Vega-Pons and Ruiz-Shulcloper, 2011). Cluster ensemble methods usually consist of two or three phases: the ensemble generation phase, which produces a variety of clustering solutions; an optional ensemble selection phase, which selects a subset of these clustering solutions; and finally the consensus phase, which induces a unified partition by combining the component ones. In the generation phase, different clustering solutions can be generated by different clustering algorithms, by the same algorithm with different parameter settings or initializations, or by injecting random disturbance into the dataset, such as data resampling (Minaei-Bidgoli et al., 2004), random projection (Fern and Brodley, 2003), and random feature selection (Strehl and Ghosh, 2002). Following the generation phase, the optional ensemble selection phase selects or prunes these clustering solutions according to their qualities and diversities (Fern and Lin, 2008; Azimi and Fern, 2009).

In this paper, we focus on the final phase: clustering combination. There are many algorithms for the combination, which can be categorized according to the kind of information they exploit. The algorithm proposed here falls into the category that makes use of the pairwise similarities between objects, which form a co-association matrix in the context of cluster ensembles. Any clustering algorithm can be applied to this new similarity matrix to find a consensus partition. Evidence Accumulation Clustering (Fred and Jain, 2005), or EAC in short, performs hierarchical clustering with average linkage (AL) or single linkage (SL) on the co-association matrix, where a maximum lifetime criterion is proposed to determine the number of clusters. The Cluster-based Similarity Partitioning Algorithm (CSPA) (Strehl and Ghosh, 2002) uses a graph-partitioning algorithm instead, but requires the number of clusters to be specified manually. Another algorithm, HGPA (Strehl and Ghosh, 2002), can be thought of as an approximation to CSPA. Outside this category, the MCLA algorithm clusters the clusters themselves based on the similarities between clusters, and then assigns each object to its closest meta-cluster. For a thorough list of related algorithms, please refer to the survey by Vega-Pons and Ruiz-Shulcloper (2011).

Although these methods have achieved some success, they are still deficient in several aspects: first, they lack a theoretical underpinning; second, they assume that all clustering solutions are of the same quality, and thus assign the same weight to each clustering solution; last but not least, most of them (except EAC) require the number of clusters to be specified manually. As for the maximum lifetime criterion adopted by the EAC algorithm, it is more or less a rule of thumb that lacks justification. As we shall see in the experiments, it is unstable in determining the number of clusters.


To tackle these problems, we propose a probabilistic method called LAtent Cluster Analysis, or LACA in short. It assumes that there is a latent cluster model which is unobservable. All the observed clustering solutions are generated from the latent model under the control of two probabilistic parameters. Our objective is to seek the latent cluster model with the maximum likelihood.

This paper is organized as follows. In Section 2, we introduce the latent cluster model and build its connection with the observed clustering solutions. We devote Section 3 to an EM-style algorithm for inferring the latent cluster model from the observed clustering solutions. In Section 4, we present the experimental results of the proposed method compared with several state-of-the-art cluster ensemble algorithms. Finally, we conclude in Section 5.

2 Latent Cluster Model

Let $X = \{x_1, x_2, \ldots, x_n\}$ be a set of $n$ objects, where each object $x_i$ may be represented as a multidimensional vector, a string, or in any other form. Taking $X$ as input, a clustering algorithm (called a clusterer) produces a clustering solution that partitions the $n$ objects into groups. By running clustering algorithms multiple times, we can observe an ensemble of clustering solutions, $E = \{C^1, C^2, \ldots, C^{|E|}\}$, where each clustering solution $C^e$ ($1 \le e \le |E|$) partitions the $n$ objects into groups $c^e_1, c^e_2, \ldots, c^e_{|C^e|}$.

With respect to a given clustering solution $C^e$, a co-association relationship between two objects $x_i$ and $x_j$ is defined according to whether they are assigned to the same group:

$$\delta^e_{ij} = \begin{cases} 1, & \text{if } \exists\, c^e_k \in C^e \text{ such that } \{x_i, x_j\} \subseteq c^e_k \\ 0, & \text{otherwise} \end{cases} \quad (1)$$

Two objects $x_i$ and $x_j$ are said to be co-associated in the clustering solution $C^e$ if $\delta^e_{ij} = 1$; otherwise, they are not co-associated.
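To make this definition concrete, the indicator of equation (1) can be computed directly from the group labels of a solution, as in the following minimal NumPy sketch (the function name and the label-vector representation of a solution are illustrative choices, not taken from the paper):

    import numpy as np

    def coassociation_matrix(labels):
        """delta[i, j] = 1 iff objects i and j fall in the same group of this solution (eq. 1)."""
        labels = np.asarray(labels)
        # Outer comparison of group labels gives the pairwise co-association indicators.
        delta = (labels[:, None] == labels[None, :]).astype(int)
        np.fill_diagonal(delta, 0)  # pairs {x_i, x_j} are unordered with i != j
        return delta

    # Example: a solution that groups the first three objects together.
    print(coassociation_matrix([0, 0, 0, 1, 1]))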

Given the ensemble of observed clustering solutions, what we would like to explore is the latent cluster model. We denote the latent cluster model as $\Omega = \{\omega_1, \omega_2, \ldots, \omega_s\}$, where $\omega_l$ ($1 \le l \le s$) is a cluster represented as a subset of objects, and $s$ is the true number of clusters. In this paper, we assume that these $s$ latent clusters are non-overlapping, i.e., $\omega_i \cap \omega_j = \emptyset$ for $1 \le i \ne j \le s$. Based on the latent cluster model $\Omega$, we define the co-cluster function for a given pair of objects $\{x_i, x_j\}$:

$$\Omega_{ij} = \begin{cases} 1, & \text{if } \exists\, \omega_k \in \Omega \text{ such that } \{x_i, x_j\} \subseteq \omega_k \\ 0, & \text{otherwise} \end{cases} \quad (2)$$

This latent cluster model serves as the major factor in determining what clustering solutions can be observed, while other factors, such as the bias of the applied clustering algorithm, influence the observed results and lead to some false positives and false negatives.

To build the connection between a clustering solution $C^e$ and the latent cluster model $\Omega$, we introduce two probabilistic parameters:

- The parameter $\rho_e$ denotes the conditional probability that two objects are co-associated in $C^e$ given that they are co-clustered in the hidden model $\Omega$, that is, $\rho_e = \Pr(\delta^e_{ij} = 1 \mid \Omega_{ij} = 1)$; and
- The parameter $r_e$ denotes the conditional probability that two objects are co-associated in $C^e$ given that they are not co-clustered in the hidden model $\Omega$, that is, $r_e = \Pr(\delta^e_{ij} = 1 \mid \Omega_{ij} = 0)$.

Intuitively, each observed clustering solution provides some evidence about the latent (unobservable) co-cluster relationship between objects. Given a set $E$ of clustering solutions, our objective is to maximize the posterior probability of the latent cluster model $\Omega$:

$$\Omega^* = \arg\max_\Omega \Pr(\Omega \mid E) = \arg\max_\Omega \frac{l(\Omega \mid E) \cdot \Pr(\Omega)}{\Pr(E)} \quad (3)$$

where $l(\Omega \mid E) = \Pr(E \mid \Omega)$ is the likelihood function. Assuming that the clustering solutions are independent of each other and that the prior probabilities of all possible latent cluster models are uniformly distributed, this can be written as the following maximum-likelihood problem:

$$\Omega^* = \arg\max_\Omega l(\Omega \mid E) = \arg\max_\Omega \prod_e l(\Omega \mid C^e), \quad (4)$$

where $l(\Omega \mid C^e) = \Pr(C^e \mid \Omega)$ is the likelihood of the latent model $\Omega$ given the observed clustering solution $C^e$ (or the probability of observing $C^e$ given the latent model $\Omega$). The evidence of observing a clustering solution $C^e$ can be decomposed into the set of co-association relationships of all object pairs, i.e., whether a pair of objects $\{x_i, x_j\}$ is assigned to the same group. Then, we have:

$$l(\Omega \mid C^e) = \prod_{ij:\, \Omega_{ij}=1 \wedge \delta^e_{ij}=1} \rho_e \cdot \prod_{ij:\, \Omega_{ij}=1 \wedge \delta^e_{ij}=0} (1 - \rho_e) \cdot \prod_{ij:\, \Omega_{ij}=0 \wedge \delta^e_{ij}=1} r_e \cdot \prod_{ij:\, \Omega_{ij}=0 \wedge \delta^e_{ij}=0} (1 - r_e) \quad (5)$$

Taking the logarithm of both sides, we get the log-likelihood:

$$L(\Omega \mid C^e) = \log l(\Omega \mid C^e) = n^{\Omega,e}_{11} \log \rho_e + n^{\Omega,e}_{10} \log(1 - \rho_e) + n^{\Omega,e}_{01} \log r_e + n^{\Omega,e}_{00} \log(1 - r_e) \quad (6)$$

where

- $n^{\Omega,e}_{11} = |\{\{x_i, x_j\} \mid \Omega_{ij} = 1 \text{ and } \delta^e_{ij} = 1\}|$ is the number of object pairs that are co-clustered in the hidden model $\Omega$ and also co-associated in $C^e$;
- $n^{\Omega,e}_{10} = |\{\{x_i, x_j\} \mid \Omega_{ij} = 1 \text{ and } \delta^e_{ij} = 0\}|$ is the number of object pairs that are co-clustered in the hidden model but not co-associated in the observed clustering result $C^e$;
- $n^{\Omega,e}_{01} = |\{\{x_i, x_j\} \mid \Omega_{ij} = 0 \text{ and } \delta^e_{ij} = 1\}|$ is the number of object pairs that are not co-clustered in the hidden model but co-associated in the observed clustering result $C^e$; and
- $n^{\Omega,e}_{00} = |\{\{x_i, x_j\} \mid \Omega_{ij} = 0 \text{ and } \delta^e_{ij} = 0\}|$ is the number of object pairs that are neither co-clustered in $\Omega$ nor co-associated in the clustering result $C^e$.
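As an illustration of how these pair counts and the log-likelihood of equation (6) might be computed, the sketch below assumes that both the latent model and the observed solution are given as 0/1 pair-indicator matrices (an assumed representation, not prescribed by the paper):

    import numpy as np

    def log_likelihood(omega, delta, rho_e, r_e):
        """L(Omega | C^e) of eq. (6), with omega and delta as 0/1 pair-indicator matrices."""
        iu = np.triu_indices(omega.shape[0], k=1)      # each unordered pair counted once
        w, d = omega[iu], delta[iu]
        n11 = np.sum((w == 1) & (d == 1))
        n10 = np.sum((w == 1) & (d == 0))
        n01 = np.sum((w == 0) & (d == 1))
        n00 = np.sum((w == 0) & (d == 0))
        return (n11 * np.log(rho_e) + n10 * np.log(1 - rho_e)
                + n01 * np.log(r_e) + n00 * np.log(1 - r_e))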

By substituting equation (5) or (6) into (4), we can reformulate the optimization problem as:

$$\Omega^* = \arg\max_\Omega \prod_e l(\Omega \mid C^e) = \arg\max_\Omega \sum_e L(\Omega \mid C^e) \quad (7)$$

3 The LACA Algorithm

Unfortunately, the latent cluster model and the probabilistic parameters of the observed clustering solutions are all unknown, which makes it impossible to seek the solution directly. We therefore propose an EM-style algorithm, depicted in Figure 1, to deal with the problem.

[Figure 1. The flowchart of the LACA algorithm: the observed clustering solutions feed into parameter initialization, after which latent model generation and parameter re-estimation alternate until convergence, and the solution is output.]

The algorithm consists of four major steps:

- Step 1 (Parameter Initialization): initialize the probabilistic parameters for each clustering solution;
- Step 2 (Latent Model Generation): with the probabilistic parameters fixed, look for a near-optimal solution (a latent model) to the maximum-likelihood problem with a hill-climbing strategy;
- Step 3 (Parameter Re-estimation): with the latent cluster model fixed, estimate the probabilistic parameters for each clustering solution;
- Step 4 (Convergence Test): repeat Step 2 and Step 3 until convergence.

3.1 Parameter Initialization

Given the $|E|$ observed clustering solutions $C^1, C^2, \ldots, C^{|E|}$, we use $count(x_i, x_j)$ to denote the number of clustering solutions in which the objects $x_i$ and $x_j$ are co-associated, and $rcount(x_i, x_j)$ to denote the number of clustering solutions in which they are not co-associated, that is:

$$count(x_i, x_j) = \sum_e \delta^e_{ij}, \quad (8)$$
$$rcount(x_i, x_j) = \sum_e (1 - \delta^e_{ij}). \quad (9)$$

It is evident that $count(x_i, x_j) + rcount(x_i, x_j) = |E|$ for $1 \le i, j \le n$, $i \ne j$.

As far as parameter initialization is concerned, all clustering solutions are taken to be equally plausible. The higher the value of $count(x_i, x_j)$, the more probably the two objects $x_i$ and $x_j$ are co-clustered. Hence, the two probabilistic parameters for each clustering solution $C^e$ are initialized as the following M-estimates:

$$\rho_e = \frac{\sum_{i,j:\, \delta^e_{ij}=1} count(x_i, x_j) + 0.5 \times ESS}{\sum_{i,j} count(x_i, x_j) + ESS}, \quad (10)$$
$$r_e = \frac{\sum_{i,j:\, \delta^e_{ij}=1} rcount(x_i, x_j) + 0.5 \times ESS}{\sum_{i,j} rcount(x_i, x_j) + ESS}, \quad (11)$$

where $ESS$, standing for "equivalent sample size", is set to 30 by default.
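A minimal sketch of this initialization step, assuming the ensemble is given as a list of co-association indicator matrices and ESS = 30 as in the text; the function name and data layout are assumptions of ours:

    import numpy as np

    def initialize_parameters(deltas, ess=30.0):
        """M-estimate initialization of rho_e and r_e, following eqs. (8)-(11)."""
        n = deltas[0].shape[0]
        iu = np.triu_indices(n, k=1)                 # unordered pairs i < j
        count = sum(d[iu] for d in deltas)           # eq. (8), one value per pair
        rcount = len(deltas) - count                 # eq. (9), one value per pair
        rho, r = [], []
        for d in deltas:
            mask = d[iu] == 1                        # pairs co-associated in C^e
            rho.append((count[mask].sum() + 0.5 * ess) / (count.sum() + ess))    # eq. (10)
            r.append((rcount[mask].sum() + 0.5 * ess) / (rcount.sum() + ess))    # eq. (11)
        return np.array(rho), np.array(r)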

3.2 Latent Model Generation

Once the probabilistic parameters are initialized or re-estimated, the latent cluster model can be determined in a hill-climbing manner with respect to the value of the log-likelihood function. We start with an empty latent model where each object corresponds to a singleton cluster, so that no two objects are co-clustered. Then, we iteratively merge two clusters into a larger one. The selection criterion for merging at each step is derived below, step by step.

Let $\Omega = \{\omega_1, \omega_2, \ldots, \omega_t\}$ be the latent cluster model at the current step, and let $\Omega^{(uv)}$ be produced by merging $\omega_u$ and $\omega_v$ in $\Omega$. Due to the fact that $(n^{\Omega^{(uv)},e}_{11} - n^{\Omega,e}_{11}) = -(n^{\Omega^{(uv)},e}_{01} - n^{\Omega,e}_{01})$ and $(n^{\Omega^{(uv)},e}_{10} - n^{\Omega,e}_{10}) = -(n^{\Omega^{(uv)},e}_{00} - n^{\Omega,e}_{00})$, it can be derived that:

$$l(\Omega^{(uv)} \mid E) - l(\Omega \mid E) = \sum_e \left( (n^{\Omega^{(uv)},e}_{11} - n^{\Omega,e}_{11}) \log \frac{\rho_e}{r_e} + (n^{\Omega^{(uv)},e}_{10} - n^{\Omega,e}_{10}) \log \frac{1 - \rho_e}{1 - r_e} \right) \quad (12)$$

Further, it is evident that:

$$n^{\Omega^{(uv)},e}_{11} - n^{\Omega,e}_{11} = \sum_{x_i \in \omega_u,\, x_j \in \omega_v} \delta^e_{ij}, \quad (13)$$
$$n^{\Omega^{(uv)},e}_{10} - n^{\Omega,e}_{10} = \sum_{x_i \in \omega_u,\, x_j \in \omega_v} (1 - \delta^e_{ij}). \quad (14)$$

Substituting (13) and (14) into (12), we get:

$$l(\Omega^{(uv)} \mid E) - l(\Omega \mid E) = \sum_e \left( \sum_{x_i \in \omega_u,\, x_j \in \omega_v} \delta^e_{ij} \cdot \log \frac{\rho_e}{r_e} + \sum_{x_i \in \omega_u,\, x_j \in \omega_v} (1 - \delta^e_{ij}) \cdot \log \frac{1 - \rho_e}{1 - r_e} \right) \quad (15)$$

By interchanging the order of summation, it can be derived that:

$$l(\Omega^{(uv)} \mid E) - l(\Omega \mid E) = \sum_{x_i \in \omega_u,\, x_j \in \omega_v} \sum_e \left( \delta^e_{ij} \cdot \log \frac{\rho_e}{r_e} + (1 - \delta^e_{ij}) \cdot \log \frac{1 - \rho_e}{1 - r_e} \right) \quad (16)$$

If we define the affinity score between two objects $x_i$ and $x_j$ voted by a clustering solution $C^e$ as:

$$score_e(x_i, x_j) = \delta^e_{ij} \log \frac{\rho_e}{r_e} + (1 - \delta^e_{ij}) \log \frac{1 - \rho_e}{1 - r_e}, \quad (17)$$

we can sum up all the scores voted by $C^1, C^2, \ldots, C^{|E|}$ into the corresponding entry $M[i, j]$ of a score matrix $M$, that is,

$$M[i, j] = \sum_e score_e(x_i, x_j). \quad (18)$$

Substituting (17) and (18) into (16), we have:

$$l(\Omega^{(uv)} \mid E) - l(\Omega \mid E) = \sum_{x_i \in \omega_u,\, x_j \in \omega_v} M[i, j] \quad (19)$$

However, using equation (19) directly as the selection criterion may favor merging larger clusters over smaller ones.

Thus, we choose to merge the two clusters $\omega_s$ and $\omega_t$ such that

$$(s, t) = \arg\max_{u,v} \frac{l(\Omega^{(uv)} \mid E) - l(\Omega \mid E)}{|\omega_u| \times |\omega_v|} = \arg\max_{u,v} \frac{1}{|\omega_u| \times |\omega_v|} \sum_{x_i \in \omega_u,\, x_j \in \omega_v} M[i, j] \quad (20)$$

If we think of the score matrix $M$ as a similarity matrix, the selection criterion is exactly average linkage (AL). As a result, hierarchical clustering with average linkage can be applied to the matrix $M$. What is important is that the elements of the score matrix have a clear probabilistic meaning: each element represents the log-likelihood ratio of the corresponding two objects being co-clustered versus not being co-clustered.
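The score matrix of equations (17) and (18) can be assembled in one pass over the ensemble, as in the following sketch (same assumed data layout as the earlier sketches):

    import numpy as np

    def score_matrix(deltas, rho, r):
        """M[i, j] = sum_e score_e(x_i, x_j), following eqs. (17) and (18)."""
        n = deltas[0].shape[0]
        M = np.zeros((n, n))
        for delta, rho_e, r_e in zip(deltas, rho, r):
            pos = np.log(rho_e / r_e)                 # weight of an observed co-association
            neg = np.log((1 - rho_e) / (1 - r_e))     # weight of an observed separation
            M += delta * pos + (1 - delta) * neg
        np.fill_diagonal(M, 0.0)                      # self-pairs carry no evidence
        return M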

Hidden Model Inference via Hierarchical Clustering

By fixing all the parameters $\rho_e$ and $r_e$ ($1 \le e \le |E|$), the score matrix $M$ can be constructed according to (17) and (18). The hidden model can be generated by applying the following agglomerative hierarchical clustering on $M$:

Step 1: We start from the simplest singleton model $\Omega_0$, where each cluster consists of one and only one object.

Step 2: Let $t$ denote the current iteration, and set $t = 1$.

Step 3: At iteration $t$, let $\Omega$ be the cluster model from the previous iteration, i.e., $\Omega = \Omega_{t-1}$. Without loss of generality, we denote $\Omega = \{\omega_1, \omega_2, \ldots, \omega_{|\Omega|}\}$. Two clusters $\omega_s$ and $\omega_t$ in $\Omega$ are selected according to equation (20).

Step 4: If the average link between $\omega_s$ and $\omega_t$ is negative, then terminate the loop and output $\Omega$ as the generated hidden cluster model; otherwise, continue to Step 5.

Step 5: The clusters selected at Step 3 are merged into a new cluster $\omega_{new} = \omega_s \cup \omega_t$. We update $\Omega$ by removing $\omega_s$ and $\omega_t$, and inserting $\omega_{new}$. The updated $\Omega$ then serves as the cluster model of the current iteration, that is, $\Omega_t = \Omega$.

Step 6: If only two clusters are left, terminate and output $\Omega$; otherwise, set $t = t + 1$, and go to Step 3.

Stopping Criterion: This process is repeated until we cannot find a pair of clusters with a positive average link (Step 4), or only two clusters are left (Step 6).
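The agglomerative procedure above can be sketched as follows, given a precomputed score matrix M; the list-of-index-lists representation of clusters and the quadratic pair search are illustrative simplifications, not the authors' implementation:

    import numpy as np

    def generate_latent_model(M):
        """Steps 1-6: greedily merge clusters while the best average link stays positive."""
        clusters = [[i] for i in range(M.shape[0])]   # Step 1: singleton model
        while len(clusters) > 2:                      # Step 6 guard: stop at two clusters
            best, best_pair = -np.inf, None
            for u in range(len(clusters)):            # Step 3: pick the pair maximizing eq. (20)
                for v in range(u + 1, len(clusters)):
                    avg_link = M[np.ix_(clusters[u], clusters[v])].mean()
                    if avg_link > best:
                        best, best_pair = avg_link, (u, v)
            if best <= 0:                             # Step 4: no pair with a positive average link
                break
            u, v = best_pair                          # Step 5: merge the selected clusters
            merged = clusters[u] + clusters[v]
            clusters = [c for k, c in enumerate(clusters) if k not in (u, v)] + [merged]
        return clusters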

3.3 Parameter Re-estimation

Once a latent cluster model $\Omega$ is generated, it can be used to estimate the probabilistic parameters $\rho_e$ and $r_e$ for each clustering solution $C^e$.

Since the parameter $\rho_e$ represents the probability that two objects are co-associated in $C^e$ on condition that they are co-clustered in the latent model $\Omega$, that is, $\rho_e = \Pr(\delta^e_{ij} = 1 \mid \Omega_{ij} = 1)$, it can be estimated as:

$$\rho_e = \frac{|\{(i, j) \mid \Omega_{ij} = 1 \wedge \delta^e_{ij} = 1\}| + 0.5 \times ESS}{|\{(i, j) \mid \Omega_{ij} = 1\}| + ESS}, \quad (21)$$

where $ESS$ is also set to 30 by default.

Because the parameter $r_e$ denotes the probability that two objects are co-associated in $C^e$ on condition that they are not co-clustered in $\Omega$, that is, $r_e = \Pr(\delta^e_{ij} = 1 \mid \Omega_{ij} = 0)$, it can be estimated as:

$$r_e = \frac{|\{(i, j) \mid \Omega_{ij} = 0 \wedge \delta^e_{ij} = 1\}| + 0.5 \times ESS}{|\{(i, j) \mid \Omega_{ij} = 0\}| + ESS}. \quad (22)$$

If a clustering solution assigns each object to a distinct group, the corresponding $\rho_e$ and $r_e$ will both be close to 0. At the other extreme, if it assigns all objects to a single group, the corresponding $\rho_e$ and $r_e$ will both be close to 1.
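A sketch of this re-estimation step, assuming the latent model is a list of clusters (index lists) and each observed solution a co-association indicator matrix, with ESS = 30 as in the text:

    import numpy as np

    def reestimate_parameters(clusters, deltas, ess=30.0):
        """Re-estimate rho_e and r_e from the current latent model, following eqs. (21)-(22)."""
        n = deltas[0].shape[0]
        omega = np.zeros((n, n), dtype=int)           # co-cluster indicator of the latent model
        for c in clusters:
            omega[np.ix_(c, c)] = 1
        iu = np.triu_indices(n, k=1)
        co = omega[iu] == 1                           # pairs co-clustered in Omega
        rho, r = [], []
        for d in deltas:
            d_iu = d[iu]
            rho.append((np.sum(co & (d_iu == 1)) + 0.5 * ess) / (np.sum(co) + ess))    # eq. (21)
            r.append((np.sum(~co & (d_iu == 1)) + 0.5 * ess) / (np.sum(~co) + ess))    # eq. (22)
        return np.array(rho), np.array(r)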

3.4 Convergence Test

Once the probabilistic parameters $\rho_e$ and $r_e$ for each clustering solution $C^e$ are re-estimated, we compute the difference between the re-estimated values and the previous ones. If the sum of absolute differences over all clustering solutions is less than a user-specified threshold, we consider the algorithm to have converged and output the latent model.
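Putting the pieces together, the outer loop alternates Steps 2 and 3 until the parameter change falls below the threshold. The driver below assumes the helper functions sketched earlier (coassociation_matrix, initialize_parameters, score_matrix, generate_latent_model, reestimate_parameters) are in scope; the threshold and iteration cap are illustrative defaults, not values from the paper:

    import numpy as np

    def has_converged(rho_old, r_old, rho_new, r_new, tol=1e-3):
        """Converged when the summed absolute parameter change over all solutions is below tol."""
        change = np.abs(rho_new - rho_old).sum() + np.abs(r_new - r_old).sum()
        return change < tol

    def laca(label_vectors, ess=30.0, tol=1e-3, max_iter=100):
        """EM-style LACA loop: alternate latent model generation and parameter re-estimation."""
        deltas = [coassociation_matrix(y) for y in label_vectors]
        rho, r = initialize_parameters(deltas, ess)                          # Step 1
        clusters = None
        for _ in range(max_iter):                                            # safeguard cap (ours)
            clusters = generate_latent_model(score_matrix(deltas, rho, r))   # Step 2
            rho_new, r_new = reestimate_parameters(clusters, deltas, ess)    # Step 3
            if has_converged(rho, r, rho_new, r_new, tol):                   # Step 4
                break
            rho, r = rho_new, r_new
        return clusters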

4 Experiments

We have conducted extensive experiments to compare

LACA with several state-of-the-art cluster ensemble methods. Our experiments are designed to demonstrate: 1)

LACA is more stable than EAC-AL in determining the number of clusters; 2) LACA outperforms EAC-AL which is also able to determining the number of clusters automatically;

3) A variant verision of LACA, called k -LACA, outperforms

CSPA, HGPA and MCLA.

4.1 Experimental Setup

Table 1: Descriptions of the datasets (number of objects, number of features, and number of classes; e.g., Segmentation: 210 objects, 19 features, 7 classes).

Data sets. We use eight data sets from the UCI machine learning repository (Frank and Asuncion, 2010) in our experiments. The characteristics of the data sets are summarized in Table 1. Note that, for Pendigits, we randomly select 100 objects from each class.

Cluster Ensemble Generation. We choose the K-means algorithm (MacQueen, 1967) as our base clusterer, because of its popularity in many previous cluster ensemble studies. At each run, we generate a cluster ensemble of 200 clustering solutions for a given data set. More specifically, for a dataset of n objects and m features, each clustering solution is produced as follows (see the sketch after this list):

- The size s of the feature subset is first determined by randomly drawing an integer from the range [minS, maxS], where minS is set to 3 and maxS is set to m.
- A random feature subset FS of size s is generated by drawing s different features from the original m features.
- A random integer K is drawn from [minK, maxK], where minK is set to 2 and maxK is set to n/15.
- A clustering solution is obtained by applying the K-means algorithm to the dataset, with access to all the objects but only the s features in FS.
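As referenced in the list above, one way to realize this generation procedure is the following sketch using scikit-learn's KMeans; the function name, the random generator, and the use of scikit-learn are our own choices, and only the ranges of s and K come from the text:

    import numpy as np
    from sklearn.cluster import KMeans

    def generate_ensemble(X, n_solutions=200, min_s=3, min_k=2, rng=None):
        """Random-subspace K-means ensemble: each member uses s random features and K clusters."""
        rng = np.random.default_rng(rng)
        n, m = X.shape
        max_k = max(min_k, n // 15)
        solutions = []
        for _ in range(n_solutions):
            s = rng.integers(min(min_s, m), m + 1)            # feature-subset size in [minS, m]
            fs = rng.choice(m, size=s, replace=False)         # random feature subset FS
            k = rng.integers(min_k, max_k + 1)                # cluster number K in [minK, n/15]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(X[:, fs])
            solutions.append(labels)
        return solutions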

Evaluation Criteria. As all the datasets are labeled, we use the class labels as a surrogate for the true underlying structure of the data. Two commonly used measures, Normalized Mutual Information (NMI) and the F-measure, are chosen to evaluate our approach against the others.

NMI (Strehl and Ghosh, 2002) treats the cluster labels $X$ and the class labels $Y$ as random variables and makes a tradeoff between the mutual information and the number of clusters:

$$NMI = \frac{I(X, Y)}{\sqrt{H(X)\, H(Y)}},$$

where $I(\cdot)$ is the mutual information and $H(\cdot)$ is the entropy.

The F-measure (Manning et al., 2008) views a clustering solution (on a dataset with $n$ objects) as a series of $n(n-1)/2$ decisions, one for each pair of objects. It makes a compromise between the precision and the recall of these decisions:

$$F = \frac{2 \times Precision \times Recall}{Precision + Recall}.$$
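Both measures can be computed directly from label vectors, as in the sketch below; it uses the geometric-mean normalization for NMI, matching the formula above:

    import numpy as np

    def nmi(labels_x, labels_y):
        """NMI = I(X; Y) / sqrt(H(X) * H(Y)) computed from the joint label contingency table."""
        x, y = np.asarray(labels_x), np.asarray(labels_y)
        n = len(x)
        xs, ys = np.unique(x), np.unique(y)
        cont = np.zeros((len(xs), len(ys)))
        for i, xv in enumerate(xs):
            for j, yv in enumerate(ys):
                cont[i, j] = np.sum((x == xv) & (y == yv))
        pxy = cont / n
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        nz = pxy > 0
        mi = np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))
        hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
        hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
        return mi / np.sqrt(hx * hy)

    def pairwise_f_measure(pred, truth):
        """F-measure over the n(n-1)/2 pairwise same-cluster decisions."""
        pred, truth = np.asarray(pred), np.asarray(truth)
        iu = np.triu_indices(len(pred), k=1)
        same_pred = (pred[:, None] == pred[None, :])[iu]
        same_true = (truth[:, None] == truth[None, :])[iu]
        tp = np.sum(same_pred & same_true)
        precision = tp / max(np.sum(same_pred), 1)
        recall = tp / max(np.sum(same_true), 1)
        return 2 * precision * recall / max(precision + recall, 1e-12)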

4.2 Stability of Predicted Cluster Numbers

To the best of our knowledge, most cluster ensemble methods rely on a user-specified number of clusters. The only exception is the maximum lifetime criterion used in the EAC-AL method.

In order to study the stability of our algorithm and EAC-AL in the predicted number of clusters, we generate 30 cluster ensembles for each dataset in the way described in Section 4.1. Our algorithm and EAC-AL are applied to these cluster ensembles to obtain their predicted cluster numbers. The statistics of these numbers are presented in Table 2. It can be seen from the table that the range [Min, Max] of EAC-AL is much wider than that of LACA on each dataset, suggesting that the cluster numbers predicted by EAC-AL fluctuate a lot for each data set, especially for Pendigits and Pima. Similar observations can also be made from the standard deviation of cluster numbers on each dataset. We conjecture that this is because the maximum lifetime strategy is not always effective in predicting cluster numbers, being more or less a rule of thumb that lacks theoretical justification. As for the average value of the predicted cluster numbers, LACA is larger than EAC-AL on 3 datasets and smaller on 5, showing some degree of inconsistency.

Dataset        Method    Min   Max   Mean    Std
Iris           LACA        3     4    3.73   0.45
               EAC-AL      —     —     —      —
Glass          LACA        5     6    5.03   0.18
               EAC-AL      —     —     —      —
Ecoli          LACA        3     5    3.73   0.83
               EAC-AL      —     —     —      —
Libras         LACA        7     9    7.70   0.70
               EAC-AL      6    21   11.50   5.54
Segmentation   LACA        5     7    5.23   0.57
               EAC-AL      —     —     —      —
Seed           LACA        4     5    4.40   0.50
               EAC-AL      —     —     —      —
Pima           LACA        4    10    8.70   1.06
               EAC-AL      4    30   11.80   8.04
Pendigits      LACA       10    16   14.03   1.75
               EAC-AL      4    48   20.53  12.27

Table 2: Statistics of the predicted cluster numbers of LACA and EAC-AL (— indicates a value not available).

4.3 Comparison with EAC-AL

Table 3 reports the NMI and F-measure values of our algorithm and EAC-AL on the same cluster ensembles. Each value reported is obtained by averaging across 30 runs. We can see that our algorithm performs better than EAC-AL on 7 out of 8 datasets. The only exception is the Libras dataset. We conjecture that this is because the average cluster number predicted by EAC-AL is closer to the real number of classes in Libras.

[Table 3: F-measure and NMI values of LACA and EAC-AL on each dataset, averaged over 30 runs. LACA achieves the better values on 7 of the 8 datasets, with Libras the only exception.]

4.4 Comparison with CSPA, HGPA and MCLA

[Figure 2. NMI comparison of k-LACA, CSPA, HGPA and MCLA. One panel per dataset (iris, glass, ecoli, libras, segmentation, seed, pima, pendigits), plotting NMI against the user-specified cluster number k.]

[Figure 3. F-measure comparison of k-LACA, CSPA, HGPA and MCLA. Same layout as Figure 2, plotting the F-measure against the user-specified cluster number k.]

Due to the fact that most existing cluster ensemble methods require a user-specified number of clusters, to make a fair comparison with them, we make a small modification to LACA by accepting a user-specified cluster number k, resulting in a variant called k-LACA. In k-LACA, once the probabilistic parameters have converged, we force the hierarchical clustering to stop merging only when exactly k clusters are left, which are then used as the consensus clustering of k-LACA.
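In terms of the sketch given for Section 3.2, the only change is the stopping rule of the merge loop: the sign of the average link is ignored and merging continues until exactly k clusters remain, for example:

    import numpy as np

    def generate_latent_model_k(M, k):
        """k-LACA variant: merge by maximum average link until exactly k clusters are left."""
        clusters = [[i] for i in range(M.shape[0])]
        while len(clusters) > k:
            best, best_pair = -np.inf, None
            for u in range(len(clusters)):
                for v in range(u + 1, len(clusters)):
                    avg_link = M[np.ix_(clusters[u], clusters[v])].mean()
                    if avg_link > best:
                        best, best_pair = avg_link, (u, v)
            u, v = best_pair
            clusters = ([c for i, c in enumerate(clusters) if i not in (u, v)]
                        + [clusters[u] + clusters[v]])
        return clusters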

On each dataset of l classes, we compare k-LACA with CSPA, HGPA, and MCLA by varying the cluster number k from l to 3×l with a step size of 1. Figures 2 and 3 depict the NMI and F-measure values of k-LACA, CSPA, HGPA and MCLA, respectively, also averaged over 30 runs, under the different user-specified cluster numbers. It is obvious that the curve of our method is better than, or at least competitive with, the others on almost all the datasets. The only exception is observed on the Pima dataset, where the NMI of our method is lower than that of the others. Besides, we also find that the curve of our method is smoother and more stable across the different k values. This suggests that our method consistently achieves high quality at these levels of the hierarchical clustering.

5 Conclusions

In this paper, we proposed a novel cluster ensemble approach that assumes the observed clustering solutions are generated from a latent cluster model. An EM-style algorithm, called LACA, was designed and implemented to maximize the likelihood function. It has exhibited satisfactory performance on the experimental datasets, for two reasons: first, it makes a stable and reliable prediction of the cluster number; second, the vote of each base clustering solution is weighted in a way that reflects the quality of that solution.

Acknowledgments


This work is supported by the National Airplane Research Program (MJ-Y-2011-39), the National Natural Science Foundation of China (No. 61170007), the Shanghai High-Tech Project (11-43), and the Shanghai Leading Academic Discipline Project (No. B114). We are grateful to the anonymous reviewers for their valuable comments.


References

[Azimi and Fern, 2009] Javad Azimi and Xiaoli Z. Fern. Adaptive cluster ensemble selection. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 993-997, 2009.

[Fern and Brodley, 2003] Xiaoli Z. Fern and Carla E. Brodley. Random projection for high dimensional data clustering: a cluster ensemble approach. In Proceedings of the 20th International Conference on Machine Learning, pages 186-193, 2003.

[Fern and Lin, 2008] Xiaoli Z. Fern and Wei Lin. Cluster ensemble selection. Statistical Analysis and Data Mining, 1(3): 379-390, 2008.

[Frank and Asuncion, 2010] A. Frank and A. Asuncion. UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science, 2010. [http://archive.ics.uci.edu/ml]

[Fred and Jain, 2005] Ana L.N. Fred and Anil K. Jain. Combining multiple clusterings using evidence accumulation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6): 835-850, 2005.

[Jain et al., 1999] Anil K. Jain, M.N. Murty, and P.J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3): 264-323, 1999.

[Jain, 2010] Anil K. Jain. Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31(8): 651-666, 2010.

[MacQueen, 1967] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, pages 281-297, 1967.

[Manning et al., 2008] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.

[Minaei-Bidgoli et al., 2004] Behrouz Minaei-Bidgoli, Alexander Topchy, and William F. Punch. Ensembles of partitions via data resampling. In Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC 2004), pages 188-192, 2004.

[Strehl and Ghosh, 2002] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3: 583-617, 2002.

[Vega-Pons and Ruiz-Shulcloper, 2011] Sandro Vega-Pons and José Ruiz-Shulcloper. A survey of clustering ensemble algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 25(3): 337-372, 2011.
