Model-based Clustering on Semantically Heterogeneous Distributed Databases on the Internet

Shuai Zhang, Sally McClean, and Bryan Scotney

School of Computing and Information Engineering, University of Ulster, Coleraine, Northern Ireland, UK

{zhang-s1, si.mcclean, bw.scotney}@ulster.ac.uk

Abstract

The vision of the Semantic Web brings challenges to knowledge discovery on databases in such a heterogeneous, distributed, open environment. The databases are developed independently with semantic information embedded, and they are heterogeneous with respect to data granularity, ontology/schema information, etc. Distributed knowledge discovery (DKD) methods are required to take this semantic information into the discovery process alongside the data, and also to resolve heterogeneity issues among the distributed databases. Most current DKD methods fail to do so because they assume that the distributed databases are parts of a virtual global table, in other words, that they share the same semantics and data structure. In this paper, we propose a model-based clustering method for semantically heterogeneous distributed databases that copes with both requirements. It deals with data generated by a mixture of underlying distributions, represented by a mixture model in which each component corresponds to a different cluster. It also resolves the heterogeneity caused by heterogeneous classification schemes simultaneously with the clustering process, without first homogenizing the heterogeneous local ontologies to a shared ontology.

Introduction

The enormous potential research area of Distributed Knowledge Discovery (DKD) (Merugu et al. 2005; Park et al. 2002; Kargupta et al. 2000) is a result of the further evolution of the Knowledge Discovery in Databases (KDD) problem. It considers the knowledge discovery process in an emerging distributed open computing environment, e.g., the fast-developing Semantic Web (SW) (Berners-Lee et al. 2001). There are a large number of application domains that could benefit from DKD, e.g., customer profiling and targeted selling, facilitation in the area of medicine, and other decision support systems on the web. Work on learning from distributed databases by transformation of traditional machine learning algorithms has been carried out in (Caragea et al. 2005).

Clustering is also one of the important tasks in DKD.


Clustering of distributed databases enables the learning of new concepts that characterize important common features of, and differences between, datasets. It enables global knowledge discovery and general pattern learning, instead of limited learning on specific databases, where the rules cannot be generalized. However, in this heterogeneous distributed open environment (the SW), the process of knowledge discovery on databases equipped with ontology information faces many challenges.

The first challenge of KDD on the SW is that the databases may contain different semantics according to the background, environment and purposes for which they were developed. For example, when discovering patterns among several distributed hospitals, the hospital databases are biased towards different specialists and equipment, and the hospitals may treat patients differently. The semantics of the databases therefore need to be taken into account for knowledge discovery as well as the data themselves. When the data distribution is an integral part of the semantics of the problem, ignoring the context and meaning of the data distribution in each individual model is not appropriate and will lead to wrong results (Wirth et al. 2001). However, in the literature, the majority of DKD methods rest on the underlying assumption that the distributed data to be analyzed are somehow partitioned, either horizontally or vertically, from a single virtual global table that could be generated if the data were centralized. They assume that the distributed databases have the same data distribution pattern and that they share the same semantics (characteristics). This is a narrow view of DKD (Provost 2000), and the assumption does not hold in most practical applications (Tsoumakas et al. 2004). In general, two or more groups of models can be proposed to analyze and capture the different semantics of distributed databases; this is known as Semantically Distributed Knowledge Discovery (Wirth et al. 2001).

Another key issue is the semantic heterogeneity problem (Doan et al. 2005; Caragea et al. 2004; Zhang et al. 2005) caused by the different granularity of the stored data, different conceptualizations, and the various schemas employed by independently developed databases. Here, we are mainly concerned with heterogeneous databases that hold aggregate count data (categorical data) on a set of attributes, where the semantic heterogeneity arises from heterogeneous classification schemes. Managing these heterogeneity problems is a fundamental issue for KDD on the SW, particularly for clustering tasks.

There are a large number of clustering methods available, ranging from largely heuristic techniques to more formal procedures based on statistical models, e.g., model-based clustering (Fraley et al. 1998). Unfortunately, most of them deal only with homogeneous databases. In (Cadez et al. 2000), a general model-based framework is proposed for clustering databases of varying lengths, e.g., time-series data, but still under the same schema. There is also work on clustering databases that differ in data types, e.g., continuous, discrete, categorical (Christen et al. 2000). None of these methods, however, takes the underlying semantics of the databases into account. In our previous work, a method of agglomerative probabilistic clustering was developed on distributed databases for knowledge discovery (McClean et al. 2005).

Considering the two challenges above, in this paper we propose a model-based clustering method for heterogeneous databases distributed on the Internet that requires no prior knowledge of the number of clusters or any other information about their composition. It is a principled way to deal with data generated by a mixture of underlying probability distributions, represented by a mixture model in which each component corresponds to a different cluster. Each cluster reflects underlying semantics of the distributed databases that differ from those of the other clusters.

It uses this parameterization as the basis for a class of models that is sufficiently flexible to accommodate data with widely varying characteristics (Fraley et al. 2002).

Each dataset belongs to a cluster with a certain probability, not just 1 or 0. The resolution of the data heterogeneity problem is embedded within the clustering process for greater efficiency. The method thereby avoids the transformation step and executes the clustering directly on the heterogeneous databases. All the information available from the original data is retained for carrying out the clustering, which leads to better quality results, and the method is computationally efficient because it avoids homogenization.

The model for the composite of the clusters is formulated by maximizing the mixture likelihood using the EM (expectation-maximization) algorithm (Dempster et al. 1977). This is a general and effective approach to maximum likelihood in the presence of incomplete data, and it is also used here to resolve the data heterogeneity issue. Another advantage of mixture-model-based clustering is that it allows the use of approximate Bayes factors to compare models. This gives a systematic means of selecting the parameterization of the model and also of deciding the number of clusters (Fraley et al. 1998). The k-means method can be seen as a special case of model-based clustering, where each cluster is represented by its estimated mean. Such clustering over heterogeneous datasets assists the further development of prediction models for classification, thus facilitating semantically distributed knowledge discovery.

The results obtained by using EM also provide a measure of uncertainty about the associated classification of each dataset. The classifier for each cluster captures the cluster context, and their interoperation can provide useful knowledge with respect to each cluster. The new knowledge discovered from clustering can assist the construction of a Bayesian Belief Network to help build more efficient prediction models, and also association rules.

The novelty of this paper resides in the following: (1) the distributed databases may carry different semantic information within themselves, and this information is taken into account when doing knowledge discovery over those databases on the Internet; different clusters are obtained by using model-based clustering, where each group represents certain semantic knowledge that is different from that of the other groups; (2) it is a principled clustering method that works on heterogeneous datasets that do not share the same ontology structure; the heterogeneity is resolved simultaneously with the clustering in an effective way with no extra computational cost, and the method makes full use of the information carried by the heterogeneous datasets; the degree to which a dataset belongs to a cluster brings further flexibility to this clustering method.

The rest of the paper is organized as follows: the data models are introduced first, followed by a brief introduction to general model-based clustering methods and to clustering on homogeneous categorical data; we then describe our proposed model-based clustering method on heterogeneous datasets together with illustrative examples, the algorithm flowchart, and a practical example on real data; we finish with conclusions and future work.

Terminologies and Data Models

We are concerned with clustering distributed databases that hold aggregate count data (categorical data) in the form of datacubes on a set of attributes that have been classified according to heterogeneous classification schemes. Also, the databases are typically developed independently by different organizations, thus they may have latent heterogeneous semantic information to be investigated.

Definition 1: An ontology is defined as the Cartesian product of a number of attributes A_1, …, A_n, along with their corresponding schema. The attributes we are concerned with are categorical attributes.

Definition 2: Each attribute A has its domain D, where the classes of the domain are given by {c_1, …, c_m}. These classes form a partition of the set of base values of domain D. This partition is called a classification scheme.

Definition 3 : Two ontologies are defined to be semantically equivalent if there is a mapping between their respective schemas. Mappings between the schema values are represented in the form of correspondence matrices.

The local ontologies of the heterogeneous datasets, in the form of such classification schemes, are mapped onto a global ontology through the correspondence matrices. Here is an illustrative example of the heterogeneous clustering problem, including the heterogeneous datasets, the classification schemes for the attributes, their local ontologies, and their mappings to the global ontology.

Example:

Attribute : A=Job;

Domain : D={Full-time, Part-time, Retired, Unwaged}.

Table 1: Semantically heterogeneous datasets that have different classification schemes for attribute "Job"

  Dataset1:  Working = 150,  Retired = 35,  Unwaged = 20   (Total = 205)
  Dataset2:  Full-time = 55,  Part-time = 50,  NotWorking = 90   (Total = 195)
  Dataset3:  Full-time = 80,  Part-time = 45,  Retired = 30,  Unwaged = 15   (Total = 170)

The classification schemes for attribute "Job" in Dataset1, Dataset2 and Dataset3 are as follows:

D_1 = {c_1 = Working, c_2 = Retired, c_3 = Unwaged}, where Working = {Full-time, Part-time}.
D_2 = {Full-time, Part-time, NotWorking}, where NotWorking = {Retired, Unwaged}.
D_3 = {Full-time, Part-time, Retired, Unwaged}.

The local ontologies of these datasets, together with their mappings to the global ontology O_G = {Full-Time, Part-Time, Retired, Unwaged}, are: the local ontology O_1 for Dataset1, {Working, Retired, Unwaged}, where Working maps to the global values Full-Time and Part-Time; the local ontology O_2 for Dataset2, {Full-Time, Part-Time, NotWorking}, where NotWorking maps to Retired and Unwaged; and the ontology O_3 used in Dataset3, which is the same as the global ontology. The correspondence matrices representing these mappings are:

A_{G1} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
A_{G2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad
A_{G3} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}
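To make the mapping concrete, the following small Python/NumPy fragment (our own illustration, with variable names of our choosing, not code from the paper) encodes these three correspondence matrices, with the four global classes as rows and each dataset's local classes as columns; a distribution over the global scheme is then aggregated into a local scheme by a simple matrix product.

```python
import numpy as np

# Global scheme for "Job": Full-time, Part-time, Retired, Unwaged
global_classes = ["Full-time", "Part-time", "Retired", "Unwaged"]

# Correspondence matrices A_Gr: rows index global classes, columns index
# the local classes of dataset r; an entry of 1 means the global class
# maps into that local class.
A_G1 = np.array([[1, 0, 0],   # Full-time -> Working
                 [1, 0, 0],   # Part-time -> Working
                 [0, 1, 0],   # Retired   -> Retired
                 [0, 0, 1]])  # Unwaged   -> Unwaged

A_G2 = np.array([[1, 0, 0],   # Full-time -> Full-time
                 [0, 1, 0],   # Part-time -> Part-time
                 [0, 0, 1],   # Retired   -> NotWorking
                 [0, 0, 1]])  # Unwaged   -> NotWorking

A_G3 = np.eye(4, dtype=int)   # Dataset3 already uses the global scheme

# A distribution over the global classes aggregates into a local scheme
# by summing the probabilities of the global classes mapped together.
theta_global = np.array([0.4, 0.3, 0.2, 0.1])
print(theta_global @ A_G1)   # [0.7, 0.2, 0.1] over {Working, Retired, Unwaged}
print(theta_global @ A_G2)   # [0.4, 0.3, 0.3] over {Full-time, Part-time, NotWorking}
```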

Model-based Clustering Principles

Given data y with independent multivariate observations y_1, …, y_n, the likelihood of a mixture model with G components is

L_{MIX}(\theta_1,\ldots,\theta_G;\ \tau_1,\ldots,\tau_G \mid y) = \prod_{r=1}^{n} \sum_{k=1}^{G} \tau_k\, f_k(y_r \mid \theta_k)    (1)

where f_k and θ_k are the density function (probability mass function for discrete data) and the parameters of the k-th cluster in the mixture respectively, and τ_k is the probability of an observation belonging to the k-th cluster (τ_k ≥ 0; \sum_{k=1}^{G} \tau_k = 1) (Fraley et al. 2002). The EM algorithm is a general approach to maximum likelihood estimation for problems in which the "complete data" x_r = (y_r, z_r) can be viewed as consisting of n multivariate observations x_r, in which y_r is observed and z_r = (z_{r1}, …, z_{rG}) is the unobserved portion of the data (Formula (2)). The estimated value of z_rk lies between 0 and 1, indicating the conditional probability that observation r belongs to cluster k.

z_{rk} = 1 \ \text{if } x_r \text{ belongs to group } k; \quad z_{rk} = 0 \ \text{otherwise}    (2)

For our problem, f_k is the multinomial probability distribution function of cluster k. The observations x_r are the semantically heterogeneous distributed databases; the observed data y_r are the aggregate count data in the datasets, and the unobserved data z_r indicate the cluster to which observation x_r should belong.

Assuming that each z_r is independent and identically distributed as a single multinomial draw from the G clusters with probabilities τ_1, …, τ_G, and that the probability mass function of an observation y_r given z_r is \prod_{k=1}^{G} f_k(y_r \mid \theta_k)^{z_{rk}}, the complete-data likelihood is then

l(\theta_k, \tau_k, z_{rk} \mid x) = \prod_{r=1}^{n} \prod_{k=1}^{G} \big[\tau_k\, f_k(y_r \mid \theta_k)\big]^{z_{rk}}    (3)

Correspondingly, the complete-data log-likelihood is

ll(\theta_k, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk}\, \log\big[\tau_k\, f_k(y_r \mid \theta_k)\big]    (4)

The EM algorithm alternates between two steps. In the "E step", given the observed data and the current estimated parameters, it computes the conditional expectation ẑ_rk (Formula (5)); in the "M step", it determines the parameters τ_k and θ_k of the k-th component that maximize the expected complete-data log-likelihood (Formula (4)) with the values of ẑ_rk fixed from the E step.

\hat{z}_{rk} = \frac{\hat{\tau}_k\, f_k(y_r \mid \hat{\theta}_k)}{\sum_{c=1}^{G} \hat{\tau}_c\, f_c(y_r \mid \hat{\theta}_c)}    (5)
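As a minimal illustration of Formula (5), the following toy sketch (our own function name and made-up density values, not part of the method description) computes the responsibilities ẑ_rk given the current mixing weights and the per-cluster densities evaluated at each observation.

```python
import numpy as np

def responsibilities(tau, dens):
    """E-step of Formula (5): tau is the vector of mixing weights tau_k,
    dens[r, k] holds f_k(y_r | theta_k) for observation r and cluster k.
    Returns z_hat[r, k], the conditional probability that r belongs to k."""
    weighted = dens * tau                              # tau_k * f_k(y_r | theta_k)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Toy example: 3 observations, 2 clusters (hypothetical density values)
tau = np.array([2/3, 1/3])
dens = np.array([[0.02, 0.05],
                 [0.10, 0.01],
                 [0.03, 0.03]])
print(responsibilities(tau, dens))
```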

Model-based Clustering on Multinomial Homogeneous Data

Homogeneous datasets share the same ontology (schema). In a multinomial distribution, each of the n independent trials results in one of m possible outcomes with probabilities p_1, …, p_m. The variable X_i indicates the number of times outcome value x_i was observed over the n trials. The multinomial distribution is defined as the distribution of the vector (X_1, …, X_m), with the probabilities given by

P(X_1 = x_1, \ldots, X_m = x_m) = p_1^{x_1} \cdots p_m^{x_m}    (6)

In model-based clustering on multinomial data, for each of the G clusters the probability distribution within that cluster is denoted θ_k1, …, θ_km, where θ_ki is the probability of value x_i in cluster k, i = 1, …, m and k = 1, …, G.

For multinomial homogeneous databases, the complete-data log-likelihood is

ll(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk}\, \log\Big[\tau_k \prod_{i=1}^{m} \theta_{ki}^{\,n_{ri}}\Big]    (7)

where n_ri is the aggregate count for value i in dataset r.

In the E-step, ẑ_rk is given by Formula (8):

\hat{z}_{rk} = \frac{\hat{\tau}_k\, \hat{\theta}_{k1}^{\,n_{r1}} \cdots \hat{\theta}_{km}^{\,n_{rm}}}{\sum_{c=1}^{G} \hat{\tau}_c\, \hat{\theta}_{c1}^{\,n_{r1}} \cdots \hat{\theta}_{cm}^{\,n_{rm}}}    (8)

In the M-step, the estimate of the probability of a dataset belonging to each group, τ̂_k, and the estimates of the probability distributions within each group, θ̂_ci, are obtained from the data and the current ẑ_rk:

\hat{\tau}_k = \frac{n_k}{n}, \ \text{where } n_k = \sum_{r=1}^{n} \hat{z}_{rk}; \qquad \hat{\theta}_{ci} = \frac{\sum_{r=1}^{n} \hat{z}_{rc}\, n_{ri}}{\sum_{r=1}^{n} \hat{z}_{rc} \big(\sum_{i=1}^{m} n_{ri}\big)}    (9)

The EM iteration continues until it converges or performance criteria are met.

Example (Supermarket association data): The following table shows shopping data for three supermarkets. The counts are the numbers of people who buy cheese and who do not in the three supermarkets.

                 Cheese   NoCheese
Supermarket1         80         45
Supermarket2         56         68
Supermarket3         40         23

We assume that there are two clusters, with probabilities τ_1 and τ_2 indicating the probabilities of a supermarket belonging to cluster 1 and cluster 2 respectively. z_rk indicates the conditional probability that Supermarket r belongs to cluster k, and θ_ki is the conditional probability of value i (either Cheese or NoCheese) in cluster k. The z_rk are first initialized randomly as 1 or 0; for example, Supermarkets 1 and 2 belong to cluster 1 and Supermarket 3 is in the other cluster, so that z_11 = z_21 = z_32 = 1 and z_12 = z_22 = z_31 = 0.

In iteration 1, according to Formula (9):

n_1 = \sum_{r=1}^{3} \hat{z}_{r1} = (1 + 1 + 0) = 2, \qquad \hat{\tau}_1 = \frac{n_1}{n} = \frac{2}{3}, \qquad \hat{\tau}_2 = 1/3

\hat{\theta}_{11} = \frac{1\cdot 80 + 1\cdot 56 + 0\cdot 40}{1\cdot(80+45) + 1\cdot(56+68) + 0\cdot(40+23)} = 0.546

This means that in cluster 1 the probability that people buy "Cheese" is 0.546, so the probability that people do not buy cheese ("NoCheese") is 1 − 0.546 = 0.454. In the same way we obtain that, in cluster 2, the probabilities of buying and not buying cheese are 0.635 and 0.365 respectively.

Then, in the E-step, we can work out the expectations:

\hat{z}_{11} = \frac{(2/3)\,(0.546^{80}\, 0.454^{45})}{(2/3)\,(0.546^{80}\, 0.454^{45}) + (1/3)\,(0.635^{80}\, 0.365^{45})} = 0.174

ẑ_11 is the probability that Supermarket1 belongs to cluster 1. Similarly, we obtain ẑ_12 = 0.826, ẑ_21 = 0.999, ẑ_22 = 0.001, ẑ_31 = 0.42 and ẑ_32 = 0.58.

Given these parameters and the ẑ_rk obtained from the E-step, the value of the complete-data log-likelihood ll(θ_ki, τ_k, z_rk | x) that we want to maximize is (with logarithms taken to base 10)

0.174\,\log(\tfrac{2}{3}\, 0.546^{80}\, 0.454^{45}) + 0.826\,\log(\tfrac{1}{3}\, 0.635^{80}\, 0.365^{45}) + 0.999\,\log(\tfrac{2}{3}\, 0.546^{56}\, 0.454^{68}) + 0.001\,\log(\tfrac{1}{3}\, 0.635^{56}\, 0.365^{68}) + 0.42\,\log(\tfrac{2}{3}\, 0.546^{40}\, 0.454^{23}) + 0.58\,\log(\tfrac{1}{3}\, 0.635^{40}\, 0.365^{23}) = -92.78227

With the newly calculated ẑ, the next iteration of EM starts, and the parameters are calculated in the same way as in iteration 1. The convergence criterion is that the difference between the log-likelihoods of two successive iterations is less than 10^{-6}. In this example it takes 8 iterations to converge. The final values of the parameters are

z_{rk} = \begin{pmatrix} 6.989\times 10^{-5} & 0.99993 \\ 0.99974 & 0.00026 \\ 0.00726 & 0.99274 \end{pmatrix}, \qquad \theta_{ki} = \begin{pmatrix} 0.4523 & 0.5477 \\ 0.63827 & 0.36172 \end{pmatrix}

The log-likelihood for the clustering model is -91.35059. The clusters are {Supermarket2} and {Supermarket1, Supermarket3}.
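For concreteness, here is a compact Python/NumPy sketch of Formulas (7)-(9) applied to the supermarket counts (the script and its variable names are ours, not part of the original experiments). It starts from the same 0/1 initialization as the worked example and, under that assumption, should reproduce the grouping {Supermarket2} versus {Supermarket1, Supermarket3}; note that it uses natural logarithms, so the absolute log-likelihood value differs from the base-10 figure quoted above.

```python
import numpy as np

counts = np.array([[80, 45],    # Supermarket1: Cheese, NoCheese
                   [56, 68],    # Supermarket2
                   [40, 23]])   # Supermarket3
n, m, G = counts.shape[0], counts.shape[1], 2

# Initialize memberships as in the worked example: SM1 and SM2 -> cluster 1
z = np.array([[1., 0.], [1., 0.], [0., 1.]])

prev_ll = -np.inf
for _ in range(100):
    # M-step, Formula (9): mixing weights and within-cluster probabilities
    tau = z.sum(axis=0) / n
    theta = z.T @ counts                              # weighted counts per cluster
    theta = theta / theta.sum(axis=1, keepdims=True)

    # E-step, Formula (8), computed with logs for numerical stability
    log_w = np.log(tau) + counts @ np.log(theta).T    # shape (n, G)
    z = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    z = z / z.sum(axis=1, keepdims=True)

    # Complete-data log-likelihood, Formula (7)
    ll = np.sum(z * log_w)
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(np.round(theta, 4))   # per-cluster probabilities of Cheese / NoCheese
print(np.round(z, 5))       # membership probabilities per supermarket
```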

Model-based Clustering on Semantically Heterogeneous Data

In a distributed database environment, heterogeneity may arise from differences in the granularity of the data stored in the different distributed databases, or from differences in the underlying concepts. An effective algorithm needs to solve the heterogeneity problem with as little information loss as possible while avoiding the homogenization process. Data integration (McClean, Scotney et al. 2003) is one way to solve heterogeneity issues, with ontology mappings provided ab initio using correspondence matrices. The EM algorithm is used for the estimation of mixture models, and it also provides an intuitive approach to the integration of aggregates to handle the heterogeneity issues. This gives the opportunity to execute the clustering and the data integration that resolves the heterogeneity simultaneously, without homogenization beforehand.

Two heterogeneity issues have to be overcome. The first is that the datasets are heterogeneous in the sense that they are extracted from the same population but with attribute values recorded at different levels of detail, i.e., under different ontologies. In addition, model-based clustering addresses datasets that are heterogeneous in the sense that they are extracted from different populations of individuals. By developing model-based clustering algorithms on heterogeneous data, we therefore consider both datasets with different representation structures (schemas) and datasets from different populations that carry different underlying meaning (semantics) from their original background. The algorithm for model-based clustering on semantically heterogeneous databases using the EM algorithm is given below.

Additional terms and notations

g_r is the number of partitions in dataset r, and s is the index over the partitions, s = 1, …, g_r. q_isr encodes the mapping from the shared global ontology to the local ontology of each dataset: for dataset r and global value i, q_isr is 1 if global value i maps into local partition s, and 0 otherwise.

EM for model-based clustering on semantically heterogeneous databases

For the EM algorithm, in the E-step:

\hat{z}_{rk} = \frac{\hat{\tau}_k \prod_{s=1}^{g_r}\big(\sum_{u=1}^{m} \hat{\theta}_{ku}\, q_{usr}\big)^{n_{rs}}}{\sum_{j=1}^{G} \hat{\tau}_j \prod_{s=1}^{g_r}\big(\sum_{u=1}^{m} \hat{\theta}_{ju}\, q_{usr}\big)^{n_{rs}}}    (10)

In the M-step, τ_k remains the same as in Formula (9). The probability of value i in cluster k is updated by Formula (11), in which the value of θ̂_ki^(n) in the current iteration is calculated from the value of θ̂_ki^(n-1) in the previous iteration; the superscript (n) denotes the iteration number.

\hat{\theta}_{ki}^{(n)} = \hat{\theta}_{ki}^{(n-1)} \cdot \frac{\sum_{r=1}^{n} \sum_{s=1}^{g_r} \dfrac{\hat{z}_{rk}\, n_{rs}\, q_{isr}}{\sum_{u=1}^{m} \hat{\theta}_{ku}^{(n-1)}\, q_{usr}}}{\sum_{r=1}^{n} \hat{z}_{rk} \big(\sum_{s=1}^{g_r} n_{rs}\big)}    (11)

The log-likelihood of the clustering model given the semantically heterogeneous data is:

ll(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk}\Big[\log \tau_k + \sum_{s=1}^{g_r} n_{rs} \log\Big(\sum_{i=1}^{m} \theta_{ki}\, q_{isr}\Big)\Big]
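Written as code, the two updates look as follows. This is a minimal Python/NumPy sketch of Formulas (10) and (11) under our own naming conventions (counts[r] holds the local count vector n_rs of dataset r, and Q[r] its correspondence matrix with the m global values as rows); it is an illustration, not the implementation used for the experiments reported below.

```python
import numpy as np

def e_step(counts, Q, tau, theta):
    """E-step, Formula (10): responsibilities z_hat[r, k] for each dataset r.
    counts[r]: local count vector n_rs of dataset r; Q[r]: its correspondence
    matrix (m global values x g_r local partitions); theta[k, u]: cluster-k
    probability of global value u."""
    n, G = len(counts), len(tau)
    log_w = np.zeros((n, G))
    for r in range(n):
        part_probs = theta @ Q[r]                  # (G, g_r): sum_u theta_ku * q_usr
        log_w[r] = np.log(tau) + counts[r] @ np.log(part_probs).T
    z = np.exp(log_w - log_w.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True), log_w

def m_step(counts, Q, z, theta_prev):
    """M-step: Formula (9) for tau and Formula (11) for theta, apportioning
    each local count n_rs over the global values it covers, in proportion to
    the current estimates theta_prev."""
    n_sets = len(counts)
    G, m = theta_prev.shape
    tau = z.sum(axis=0) / n_sets
    num = np.zeros((G, m))
    den = np.zeros(G)
    for r in range(n_sets):
        part_probs = theta_prev @ Q[r]             # (G, g_r)
        # share[k, i, s]: fraction of partition s's count attributed to global value i
        share = theta_prev[:, :, None] * Q[r][None, :, :] / part_probs[:, None, :]
        num += z[r][:, None] * (share * counts[r][None, None, :]).sum(axis=2)
        den += z[r] * counts[r].sum()
    return tau, num / den[:, None]
```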

Example:

Take the semantically heterogeneous data in Table 1 as an example. We assume there are two clusters for these heterogeneous data (the optimal number of clusters can be decided by using the Bayesian Information Criterion, as illustrated in the next section). z_rk is randomly initialized with 0/1 values; e.g., cluster 1 contains Dataset1 and Dataset2 and cluster 2 contains Dataset3, so that z_11 = z_21 = z_32 = 1. Here q_is1 is the correspondence matrix A_G1 for the mapping between the global ontology {Full-time, Part-time, Retired, Unwaged} and the local ontology of Dataset1, {Working, Retired, Unwaged}; q_is2 is A_G2 for Dataset2, and q_is3 is A_G3. There are m = 4 schema values for attribute "Job" in the global ontology. The probability distributions in each cluster are initialized with a uniform distribution:

\hat{\theta}_{ki}^{(0)} = 1/m = 0.25, \qquad k = 1, 2, \quad i = 1, \ldots, 4

In iteration 1:

\hat{\tau}_1 = 2/3, \qquad \hat{\tau}_2 = 1/3

\hat{\theta}_{11}^{(1)} = \frac{\hat{z}_{11}\,\frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)}+\hat{\theta}_{12}^{(0)}}\, n_{11} + \hat{z}_{21}\,\frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)}}\, n_{21} + \hat{z}_{31}\,\frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)}}\, n_{31}}{\hat{z}_{11}(n_{11}+n_{12}+n_{13}) + \hat{z}_{21}(n_{21}+n_{22}+n_{23}) + \hat{z}_{31}(n_{31}+n_{32}+n_{33}+n_{34})}

\qquad\quad = \frac{1\cdot\frac{0.25}{0.25+0.25}\cdot 150 + 1\cdot\frac{0.25}{0.25}\cdot 55 + 0\cdot\frac{0.25}{0.25}\cdot 80}{1\cdot 205 + 1\cdot 195 + 0\cdot 170} = \frac{75 + 55 + 0}{400} = 0.325

The term 0.25 \cdot 150/(0.25+0.25) = 75 is the contribution of Dataset1 to the probability of value 1 (Full-time) in cluster 1: the count 150 is apportioned into 75 and 75, contributing to the values Full-time and Part-time respectively. Here is another example, calculating θ̂_13^(1):

\hat{\theta}_{13}^{(1)} = \frac{\hat{z}_{11}\,\frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)}}\, n_{12} + \hat{z}_{21}\,\frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)}+\hat{\theta}_{14}^{(0)}}\, n_{23} + \hat{z}_{31}\,\frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)}}\, n_{33}}{\hat{z}_{11}(n_{11}+n_{12}+n_{13}) + \hat{z}_{21}(n_{21}+n_{22}+n_{23}) + \hat{z}_{31}(n_{31}+n_{32}+n_{33}+n_{34})} = \frac{1\cdot 35 + \frac{0.25}{0.5}\cdot 90 + 0}{400} = 0.2

In the same way, we can obtain:

\hat{\theta}_{ki}^{(1)} = \begin{pmatrix} 0.325 & 0.3125 & 0.2 & 0.1625 \\ 0.471 & 0.265 & 0.176 & 0.088 \end{pmatrix}

where row k indexes the cluster and the columns correspond to the global values Full-time, Part-time, Retired and Unwaged.

Then, in the E-step, we compute ẑ_rk given the parameters estimated above in the M-step:

\hat{z}_{11} = \frac{\hat{\tau}_1\big[(\hat{\theta}_{11}^{(1)}+\hat{\theta}_{12}^{(1)})^{n_{11}}(\hat{\theta}_{13}^{(1)})^{n_{12}}(\hat{\theta}_{14}^{(1)})^{n_{13}}\big]}{\hat{\tau}_1\big[(\hat{\theta}_{11}^{(1)}+\hat{\theta}_{12}^{(1)})^{n_{11}}(\hat{\theta}_{13}^{(1)})^{n_{12}}(\hat{\theta}_{14}^{(1)})^{n_{13}}\big] + \hat{\tau}_2\big[(\hat{\theta}_{21}^{(1)}+\hat{\theta}_{22}^{(1)})^{n_{11}}(\hat{\theta}_{23}^{(1)})^{n_{12}}(\hat{\theta}_{24}^{(1)})^{n_{13}}\big]}

\qquad = \frac{\tfrac{2}{3}\big[(0.325+0.3125)^{150}\, 0.2^{35}\, 0.1625^{20}\big]}{\tfrac{2}{3}\big[(0.325+0.3125)^{150}\, 0.2^{35}\, 0.1625^{20}\big] + \tfrac{1}{3}\big[(0.471+0.265)^{150}\, 0.176^{35}\, 0.088^{20}\big]} = 0.01598

In the same way, we can get

\hat{z}_{rk} = \begin{pmatrix} 0.015982 & 0.984018 \\ 0.999999 & 0.000001 \\ 0.000196 & 0.999804 \end{pmatrix}

The model log-likelihood in iteration 1 is -250.9899.

The EM iteration continues in the same way until it converges. For this example, it takes only 4 iterations for the algorithm to converge. The values of the parameters after full convergence are:

\tau_k = (0.3333,\ 0.6667)

\theta_{ki} = \begin{pmatrix} 0.28205 & 0.25641 & 0.25502 & 0.20651 \\ 0.46933 & 0.264 & 0.1733333 & 0.09333 \end{pmatrix}

z_{rk} \approx \begin{pmatrix} 4.5\times 10^{-9} & 0.9999999955 \\ 0.9999999779 & 2.21\times 10^{-8} \\ 4.4\times 10^{-9} & 0.9999999956 \end{pmatrix}

The final model log-likelihood is -248.8713. The clusters are {Dataset1, Dataset3} and {Dataset2}.
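To tie the worked example together, a short driver follows (again our own sketch, assuming the e_step and m_step helpers from the earlier code fragment are in scope). It runs the heterogeneous EM on the Table 1 counts from the same 0/1 initialization and should recover the grouping {Dataset1, Dataset3} versus {Dataset2} reported above; it computes the log-likelihood with natural logarithms.

```python
import numpy as np

# Table 1 counts, each in its dataset's own classification scheme
counts = [np.array([150, 35, 20]),        # Dataset1: Working, Retired, Unwaged
          np.array([55, 50, 90]),         # Dataset2: Full-time, Part-time, NotWorking
          np.array([80, 45, 30, 15])]     # Dataset3: global scheme
Q = [np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]),   # A_G1
     np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]),   # A_G2
     np.eye(4)]                                                # A_G3

G, m = 2, 4
z = np.array([[1., 0.], [1., 0.], [0., 1.]])   # initial memberships as in the example
theta = np.full((G, m), 1.0 / m)               # uniform start, theta_ki = 0.25

prev_ll = -np.inf
for _ in range(200):
    tau, theta = m_step(counts, Q, z, theta)   # Formulas (9) and (11)
    z, log_w = e_step(counts, Q, tau, theta)   # Formula (10)
    ll = float(np.sum(z * log_w))              # complete-data log-likelihood
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

print(np.round(tau, 4), np.round(theta, 5))
print(np.round(z, 6))   # should group Dataset1 with Dataset3, Dataset2 alone
```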

Decide the optimal number of clusters by using the Bayesian Information Criterion (BIC)

Bayes factors, approximated by the Bayesian Information Criterion (BIC) (Schwarz 1978), have been applied successfully to the problem of determining the optimal number of clusters in a mixture model, by direct comparison of models with different numbers of clusters (Dasgupta et al. 1998). The fit of a mixture model to a given dataset can only improve (the likelihood can only increase) as more detailed clusters are generated in the model, so the likelihood itself cannot be used to assess the clustering for a given number of clusters. In the BIC, an additional term penalizes the complexity of the model, so that the criterion is maximized for more parsimonious parameterizations and smaller numbers of groups. The BIC is calculated as follows:

2 \log p(x \mid M) \approx \mathrm{const} + 2\, ll_M(x, \hat{\theta}) - m_M \log(n) \equiv \mathrm{BIC}    (12)

where ll_M(x, \hat{\theta}) is the maximized mixture log-likelihood for the model illustrated in Formula (1), m_M is the number of independent parameters to be estimated in the model, and n is the number of datasets to be clustered. The larger the value of the BIC, the stronger the evidence for the model and its associated number of clusters. For clustering models dealing with semantically heterogeneous databases, the mixture log-likelihood is that of Formula (1) with f_k(y_r | θ_k) = \prod_{s=1}^{g_r}\big(\sum_{u=1}^{m} \theta_{ku}\, q_{usr}\big)^{n_{rs}}.

The number of independent parameters in the model must be counted, although the number of clusters itself is not treated as an independent parameter. The mixing probabilities τ_k must be estimated, contributing (G − 1) independent parameters, where G is the number of clusters; the probability distribution within each cluster must also be estimated, and each cluster has (m − 1) independent probabilities, so the G components contribute G(m − 1) independent parameters. Altogether there are

m_M = (G - 1) + G(m - 1)

independent parameters. The BIC for the clustering model on heterogeneous databases can therefore be obtained straightforwardly. The model, with its associated number of clusters, that has the largest BIC is the optimal model for model-based clustering.

Example:

For the same heterogeneous example listed above, we use the BIC to choose the optimal number of clusters, starting from 2 clusters. The structure of the matrix z_rk changes according to the number of clusters, and the values of z_rk for each candidate number of clusters are initialized randomly, subject to the probabilities of each dataset belonging to the clusters summing to 1. The result of the clustering algorithm with the BIC is that 2 is the optimal number of clusters for our example, with log-likelihood -248.8713 and BIC = -501.0824; the BIC for 3 clusters is -503.2472. A standard convention for calibrating BIC differences is that a difference between 2 and 6 constitutes positive evidence (Kass et al. 1995). In our case the BIC difference is 2.1648, which shows positive evidence according to this convention.
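As a minimal sketch of Formula (12) and the parameter count above (our own helper functions, with natural logarithms assumed in the penalty term and purely hypothetical log-likelihood values in the usage line):

```python
import numpy as np

def bic(loglik, G, m, n):
    """Formula (12): BIC = 2*loglik - m_M*log(n), with
    m_M = (G - 1) + G*(m - 1) independent parameters
    (mixing weights plus within-cluster multinomial probabilities)."""
    m_M = (G - 1) + G * (m - 1)
    return 2.0 * loglik - m_M * np.log(n)

def pick_num_clusters(logliks_by_G, m, n):
    """Return the candidate G whose fitted model has the largest BIC.
    logliks_by_G maps each candidate number of clusters to the maximized
    mixture log-likelihood produced by EM for that G."""
    return max(logliks_by_G, key=lambda G: bic(logliks_by_G[G], G, m, n))

# Hypothetical log-likelihoods for G = 2, 3, 4 on n = 3 datasets with m = 4 values
print(pick_num_clusters({2: -248.9, 3: -247.5, 4: -247.1}, m=4, n=3))  # -> 2
```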

Proposed Algorithm and Real Data Experiment Results

The proposed clustering method via multinomial mixture models using the EM algorithm on semantically heterogeneous datasets is shown in Figure 1.

The algorithm was tested on state-specific data from the 2000 US Census, in which the states were clustered on the basis of "gross rent". A sample of the homogeneous data is shown in Table 2. A (synthetic) set of heterogeneous databases (Table 3) was then generated by combining data from the homogeneous dataset for a random selection of states. The result is shown in Figure 2. We compared our results with the ranked "States Median Gross Rent" table supplied by the US Census Bureau for that year, and they are very encouraging. The method successfully finds the most expensive states for renting in the US: {Hawaii, New Jersey, California, Alaska, Nevada}, where the median rent in each of these states is more than $700. The second cluster contains seven somewhat less expensive states whose medians range from $650 to $699, including {Maryland, New York, Virginia, …}. The remaining clusters contain the states with median rents ranging from $555-$650, $497-$555, and $400-$496 respectively, without any exceptions. Example states in these three clusters are {Florida, Georgia, Texas, …}, {Michigan, Ohio, Tennessee, …} and {Iowa, Arkansas, Mississippi, …}.

Initialize ẑ_rk; p is the iteration counter in the loop.
repeat
    M-step:
        n_k = \sum_{r=1}^{n} \hat{z}_{rk}, \qquad \hat{\tau}_k = n_k / n
        \hat{\theta}_{ki}^{(p)} = \hat{\theta}_{ki}^{(p-1)} \cdot \Big[\sum_{r=1}^{n}\sum_{s=1}^{g_r} \frac{\hat{z}_{rk}\, n_{rs}\, q_{isr}}{\sum_{u=1}^{m} \hat{\theta}_{ku}^{(p-1)}\, q_{usr}}\Big] \Big/ \Big[\sum_{r=1}^{n} \hat{z}_{rk}\big(\sum_{s=1}^{g_r} n_{rs}\big)\Big]
    E-step: compute ẑ_rk given the parameters estimated in the M-step:
        \hat{z}_{rk} = \hat{\tau}_k \prod_{s=1}^{g_r}\Big(\sum_{u=1}^{m} \hat{\theta}_{ku}\, q_{usr}\Big)^{n_{rs}} \Big/ \sum_{j=1}^{G} \hat{\tau}_j \prod_{s=1}^{g_r}\Big(\sum_{u=1}^{m} \hat{\theta}_{ju}\, q_{usr}\Big)^{n_{rs}}
    Calculate the complete-data log-likelihood:
        ll^{(p)}(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n}\sum_{k=1}^{G} z_{rk}\Big[\log \tau_k + \sum_{s=1}^{g_r} n_{rs} \log\Big(\sum_{i=1}^{m} \theta_{ki}\, q_{isr}\Big)\Big]
until | ll^{(p)} - ll^{(p-1)} | < 10^{-6}
Compute the BIC for the mixture models with different numbers of clusters, and choose the optimal parameters from the EM.

Figure 1: EM algorithm for clustering via multinomial mixture models on heterogeneous datasets. The strategy described here terminates when the relative difference between the values of the mixture log-likelihood in Equation (1) falls below a small threshold (10^{-4}). It also uses the BIC to determine the optimal number of clusters.

Table 2: A sample of the US states Gross Rent dataset

States       <= $250   $250-$499   $500-$749   $750-$999   $1000-$1499   ...
Alabama        68008      186901      ...         ...        9101
Alaska         35231      131456      ...         ...        5257
Arizona        29837      135811      ...         ...       48722
Arkansas        1981       10917      ...         ...       11296
Wisconsin      44787      209117      ...         ...       78955
...              ...         ...      ...         ...         ...
Wyoming         5906       26685      ...         ...        3076

Table 3: A sample heterogeneous Gross Rent dataset (each state reports gross rent under its own classification scheme)

Arizona       <= $499   $500-$749   $750-$999   $1000-$1499   $1500-$1999   >= $2000
                  ...         ...         ...           ...           ...        ...
Connecticut   <= $250   $250-$499   $500-$749   $750-$999   $1000-$1499   >= $1500
                32102       63966      152735       97511         47845      16522
Delaware      <= $499   $500-$999   $1000-$1499   $1500-$1999   >= $2000
                19979       49972         5688           627         922
Wyoming       <= $250   $250-$499   $500-$749   $750-$999   $1000-$1499   $1500-$1999   >= $2000
                 5906       26685       13150        3076          1124           280         95

Figure 2: Clustering states using the US Gross Rent heterogeneous census data. Five clusters are marked; the more expensive the gross rent in a state, the darker it is shaded.

Conclusions

A method has been proposed to cluster databases distributed in an open heterogeneous environment such as the Semantic Web. These distributed databases may be either homogeneous or heterogeneous with respect to their classification schemes. Databases uploaded to the web are developed independently and thus may differ in their underlying semantics or characteristics. The proposed model-based clustering method using the EM algorithm takes both the heterogeneity and the semantics issues into account. The resulting clusters represent databases that share similar characteristics, which differ from those of the databases in other clusters. The new knowledge discovered from the clustering can assist the construction of a Bayesian Belief Network to help build more efficient prediction models, and also association rule mining.

With burgeoning Semantic Web development, distributed databases are becoming increasingly commonplace, and most of them are heterogeneous and contain semantic information. The potential for learning new knowledge from these distributed databases in such an environment is enormous, and our algorithm therefore has great potential to be applied to a wide range of application domains.

Future Work

Initialization is a very important factor in model-based clustering on heterogeneous databases, as it directly affects how efficiently the EM algorithm converges. Under fairly mild regularity conditions, EM can be shown to converge to a local maximum of the observed-data likelihood. A hierarchical agglomeration method can be used first to obtain clusters that approximately maximize the classification likelihood for each model; the partition obtained from the hierarchical clustering is then used to initialize the conditional probabilities ẑ in an initial M-step of the EM algorithm. This idea can be implemented to extend our current algorithm, as sketched below. Handling outliers is another aspect we will address in future work, and more evaluation using real and simulated data needs to be carried out.
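As a rough sketch of this initialization idea (our own construction, using SciPy's standard agglomerative clustering with Ward linkage on empirical proportions as a simple stand-in for the classification-likelihood criterion mentioned above; the function name and these choices are our assumptions, not part of the paper's method):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def init_z_hierarchical(counts, G):
    """Initialize z_hat by agglomerative clustering of the datasets'
    empirical distributions, then one-hot encoding the resulting labels.
    counts: (n_datasets, m) array of homogeneous (or pre-aggregated) counts."""
    props = counts / counts.sum(axis=1, keepdims=True)   # empirical distributions
    labels = fcluster(linkage(props, method="ward"), t=G, criterion="maxclust")
    z = np.zeros((counts.shape[0], G))
    z[np.arange(counts.shape[0]), labels - 1] = 1.0       # fcluster labels start at 1
    return z

# e.g. on the supermarket counts:
print(init_z_hierarchical(np.array([[80, 45], [56, 68], [40, 23]]), G=2))
```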

Based on the derived clusters, which capture context, high-performance prediction models for classification tasks can be built. Our approach thus further facilitates semantically distributed knowledge discovery on heterogeneous databases on the Internet.

References

Cadez, I., S. Gaffney, and P. Smyth. A General Probabilistic Framework for Clustering Individuals. 2000, Department of Information and Computer Science, University of California, Irvine.

Caragea, D., A. Silvescu, and V. Honavar. A Framework for Learning from Distributed Data Using Sufficient Statistics and its Application to Learning Decision Trees. International Journal of Hybrid Intelligent Systems, 2004. 1(2): p. 80-89.

Caragea, D., J. Pathak, and V. Honavar. Learning Classifiers from Semantically Heterogeneous Data. In CoopIS/DOA/ODBASE (2). 2004. p. 963-980.

Christen, J.A., A. Medrano-Soto, and J. Collado-Vides. Bayesian Clustering and Prediction in Heterogeneous Databases Using Multivariate Mixtures. Journal of Classification.

Dasgupta, A. and A.E. Raftery. Detecting Features in Spatial Point Processes with Clutter via Model-Based Clustering. Journal of the American Statistical Association, 1998. 93(441): p. 294-302.

Dempster, A.P., N.M. Laird, and D.B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 1977. 39: p. 1-38.

Doan, A. and A.Y. Halevy. Semantic Integration Research in the Database Community: A Brief Survey. AI Magazine, 2005. 26(1): p. 83-94.


Fraley, C. and A.E. Raftery. How Many Clusters? Which Clustering Method? Answers via Model-Based Cluster Analysis. The Computer Journal, 1998. 41: p. 578-588.

Fraley, C. and A.E. Raftery. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 2002. 97(458): p. 611-631.

Kargupta, H. and P. Chan, eds. Advances in Distributed and Parallel Knowledge Discovery. 2000, Cambridge, Massachusetts: AAAI Press / The MIT Press. 400 pp.

Kass, R.E. and A.E. Raftery. Bayes Factors. Journal of the American Statistical Association, 1995. 90: p. 773-795.

Berners-Lee, T., J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 2001. 284(5): p. 34-43.

McClean, S.I., B.W. Scotney, and K. Greer. A Scalable Approach to Integrating Heterogeneous Aggregate Views of Distributed Databases. IEEE Transactions on Knowledge and Data Engineering, 2003. 15(1): p. 232-235.

McClean, S.I., et al. Knowledge Discovery by Probabilistic Clustering of Distributed Databases. Data & Knowledge Engineering, 2005. 54(2): p. 189-210.

Merugu, S. and J. Ghosh. A Distributed Learning Framework for Heterogeneous Data Sources. In KDD '05. 2005, ACM Press: Chicago, Illinois, USA. p. 208-217.

Park, B. and H. Kargupta. Distributed Data Mining: Algorithms, Systems, and Applications. In Data Mining Handbook, N. Ye, ed. 2002, IEA. p. 341-358.

Provost, F. Distributed Data Mining: Scaling Up and Beyond. In Advances in Distributed and Parallel Knowledge Discovery, H. Kargupta and P. Chan, eds. 2000: AAAI Press / The MIT Press.

Schwarz, G. Estimating the Dimension of a Model. The Annals of Statistics, 1978. 6(2): p. 461-464.

Tsoumakas, G., L. Angelis, and I. Vlahavas. Clustering Classifiers for Knowledge Discovery from Physically Distributed Databases. Data & Knowledge Engineering, 2004. 49(3): p. 223-242.

Wirth, R., M. Borth, and J. Hipp. When Distribution is Part of the Semantics: A New Problem Class for Distributed Knowledge Discovery. In Proceedings of the 5th ECML/PKDD Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments. 2001: Freiburg, Germany. p. 56-64.

Zhang, J., D. Caragea, and V. Honavar. Learning Ontology-Aware Classifiers. In Proceedings of the 8th International Conference on Discovery Science. 2005, Springer: Singapore. p. 294-307.