School of Computing and Information Engineering, University of Ulster, Coleraine,
Northern Ireland, UK
{zhang-s1, si.mcclean, bw.scotney}@ulster.ac.uk
Abstract
The vision of the Semantic Web brings challenges to knowledge discovery on databases in such a heterogeneous, distributed, open environment. The databases are developed independently with semantic information embedded, and they are heterogeneous with respect to data granularity, ontology/schema information, etc. Distributed knowledge discovery (DKD) methods are required to take semantic information into the process alongside the data, and also to resolve heterogeneity issues among the distributed databases. Most current DKD methods fail to do so because they assume that the distributed databases are parts of a virtual global table, in other words, that they share the same semantics and data structure. In this paper, we propose a model-based clustering method for semantically heterogeneous distributed databases that can cope with these two requirements. It deals with data that are generated by a mixture of underlying distributions represented by a mixture model in which each component corresponds to a different cluster. It also resolves the heterogeneity caused by heterogeneous classification schemes simultaneously with the clustering process, without first homogenizing the heterogeneous local ontologies to a shared ontology.
The enormous potential research area of Distributed Knowledge Discovery (DKD) (Merugu et al. 2005; Park et al. 2002; Kargupta et al. 2000) is a result of the further evolution of the Knowledge Discovery in Databases (KDD) problem. It considers the knowledge discovery process in an emerging distributed open computing environment, e.g., the fast-developing Semantic Web (SW) (Berners-Lee et al. 2001). There are a large number of application domains that could benefit from DKD, e.g., customer profiling and targeted selling, facilitation in the area of medicine, and other decision support systems on the web. Work on learning from distributed databases by transformation of traditional machine learning algorithms has been carried out in (Caragea et al. 2005).
Clustering is also one of the important tasks in DKD.
Clustering of distributed databases enables the learning of new concepts that characterize important common features of, and differences between, datasets. It enables global knowledge discovery and the learning of general patterns, instead of learning limited to specific databases, where the rules cannot be generalized. However, in this heterogeneous distributed open environment (the SW), the process of knowledge discovery on databases equipped with ontology information faces many challenges.
The first challenge of KDD on the SW is that the databases may carry different semantics according to the background, environment and purposes for which they were developed. For example, when discovering patterns across several distributed hospitals, the hospital databases are biased in terms of different specialists and equipment, and the hospitals may treat patients differently. The semantics of the databases need to be taken into account for knowledge discovery, as well as the data themselves. When data distribution is an integral part of the semantics of the problem, ignoring the context and meaning of the data distribution in each individual model is not appropriate and will lead to wrong results (Wirth et al. 2001). However, in the literature, the majority of DKD methods have an underlying assumption that the distributed data to be analyzed are somehow partitioned, either horizontally or vertically, from a single virtual global table that could be generated if the data were centralized.
They assume that the distributed databases have the same data distribution pattern and share the same semantics (characteristics). This is a narrow view of DKD (Provost 2000), and the assumption does not hold in most practical applications (Tsoumakas et al. 2004). In general, two or more groups of models can be proposed to analyze and capture the different semantics of distributed databases; this is known as Semantically Distributed Knowledge Discovery (Wirth et al. 2001).
Another key issue is the semantic heterogeneity problem (Doan et al. 2005; Caragea et al. 2004; Zhang et al. 2005) caused by the different granularity of the data stored, different conceptualizations, and the various schemas employed by the independently developed databases. Here, we are mainly concerned with heterogeneous databases that hold aggregate count data (categorical data) on a set of attributes, where the semantic heterogeneity arises from heterogeneous classification schemes. Managing these heterogeneity problems is a fundamental issue for KDD on the SW, particularly for clustering tasks.
There are a large number of clustering methods available, ranging from those that are largely heuristic to more formal procedures based on statistical models, e.g., model-based clustering (Fraley et al. 1998). Unfortunately, most of them deal only with homogeneous databases. Cadez et al. (2000) proposed a general model-based framework for clustering databases of various lengths, e.g., time-series data, but still under the same schema. There is also work on clustering databases that differ in data types, e.g., continuous, discrete, categorical, etc. (Christen et al. 2000). None of these methods takes the underlying semantics of the databases into account. In our previous work, a method of agglomerative probabilistic clustering of distributed databases was developed for knowledge discovery (McClean et al. 2005).
Considering the two challenges above, in this paper we propose model-based clustering of heterogeneous databases distributed over the Internet, without prior knowledge of the number of clusters or any other information about their composition. It is a principled way to deal with data that are generated by a mixture of underlying probability distributions represented by a mixture model in which each component corresponds to a different cluster. Each cluster reflects underlying semantics of the distributed databases that differ from those of the other clusters.
It uses the parameterization as the basis for a class of models that is sufficiently flexible to accommodate data with widely varying characteristics (Fraley et al. 2002).
Each dataset belongs to a cluster with a certain probability, not just 1 or 0. The resolution of the data heterogeneity problem is embedded within the clustering process for greater efficiency. The method effectively avoids the transformation step and thus executes the clustering directly on the heterogeneous databases. All the information available in the original data is retained for the clustering, which leads to better quality results. It is also computationally efficient, because it avoids the homogenization.
The model for the composite of the clusters is formulated by maximizing the mixture likelihood using the EM (expectation-maximization) algorithm (Dempster et al. 1977). This is a general, effective approach to maximum likelihood in the presence of incomplete data, and it is also used here to resolve the data heterogeneity issue. Another advantage of mixture-model-based clustering is that it allows the use of approximate Bayes factors to compare models. This gives a systematic means of selecting the parameterization of the model and also of deciding the number of clusters (Fraley et al. 1998). The k-means method can be seen as a special case of the model-based clustering method, in which each cluster is represented by its estimated mean. Such clustering over heterogeneous datasets assists the further development of prediction models for classification, thus facilitating semantically distributed knowledge discovery.
The results obtained by using EM also provide a measure of uncertainty about the associated classification of each dataset. The classifier for each cluster captures the cluster context, and their interoperation can provide useful knowledge with respect to each cluster. The new knowledge discovered from clustering can assist the construction of a Bayesian Belief Network to help build more efficient prediction models, and also association rules.
The novelty of this paper resides in the following. (1) The distributed databases may carry different semantic information within themselves, and this information is taken into account when doing knowledge discovery over those databases on the Internet; different clusters are obtained using model-based clustering methods, where each group represents certain semantic knowledge that is different from the other groups. (2) It is a principled clustering method that works on heterogeneous datasets that do not share the same ontology structure; the heterogeneity is resolved simultaneously with the clustering in an effective way with no extra computational cost, and the method makes full use of the information carried by the heterogeneous datasets; the degree to which a dataset belongs to a cluster brings more flexibility to this clustering method.
The rest of the paper is organized as follows: the data model is introduced first, followed by a brief introduction to general model-based clustering methods and to clustering of homogeneous categorical data; we then describe our proposed model-based clustering method for heterogeneous datasets, together with illustrative examples, the algorithm flowchart, and a practical example on census data; we finish with conclusions and future work.
We are concerned with clustering distributed databases that hold aggregate count data (categorical data), in the form of datacubes, on a set of attributes that have been classified according to heterogeneous classification schemes. The databases are typically developed independently by different organizations; thus they may carry latent heterogeneous semantic information to be investigated.
Definition 1: An ontology is defined as the Cartesian product of a number of attributes $A_1, \ldots, A_n$, along with their corresponding schema. The attributes we are concerned with are categorical attributes.
Definition 2: Each attribute $A$ has its domain $D$, where the classes of the domain are given by $\{c_1, \ldots, c_m\}$. These classes form a partition of the set of base values of domain $D$. This partition is called a classification scheme.
Definition 3 : Two ontologies are defined to be semantically equivalent if there is a mapping between their respective schemas. Mappings between the schema values are represented in the form of correspondence matrices.
The local ontologies of the heterogeneous datasets, in the form of such classification schemes, are mapped onto a global ontology through the correspondence matrices. Below is an illustrative example of the heterogeneous clustering problem, including the heterogeneous datasets, the classification schemes for the attributes, their local ontologies, and their mappings to the global ontology.
Example:
Attribute : A=Job;
Domain : D={Full-time, Part-time, Retired, Unwaged}.
Table 1 Semantically heterogeneous datasets that have different classification schemes for attribute "Job"

Data        Working   NotWorking   Full-time   Part-time   Retired   Unwaged   Total
Dataset1      150         -            -           -          35        20      205
Dataset2       -         90           55          50          -         -       195
Dataset3       -          -           80          45          30        15      170
The classification schemes for attribute "Job" in Dataset1, Dataset2 and Dataset3 are as follows:

$D_1$ = {$c_1$ = Working, $c_2$ = Retired, $c_3$ = Unwaged}, where Working = {Full-time, Part-time};
$D_2$ = {Full-time, Part-time, NotWorking}, where NotWorking = {Retired, Unwaged};
$D_3$ = {Full-time, Part-time, Retired, Unwaged}.
The local ontologies of these datasets, together with their mappings to the global ontology, are as follows. The local ontology $O_1$ for Dataset1 has root JOB1 with classes Working, Retired and Unwaged; the local ontology $O_2$ for Dataset2 has root JOB2 with classes Full-Time, Part-Time and NotWorking; the global ontology $O_G$ has root JOB with classes Full-Time, Part-Time, Retired and Unwaged. The ontology $O_3$ used in Dataset3 is the same as the global ontology. The correspondence matrices representing these mappings are:
$$A_{G1} = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \quad
A_{G2} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{pmatrix}, \quad
A_{G3} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}$$
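To make the role of the correspondence matrices concrete, here is a minimal sketch (Python/NumPy; the variable names are ours, chosen for illustration) showing how a count vector expressed over the global scheme is aggregated onto each local scheme by multiplying through the corresponding matrix:

```python
import numpy as np

# Correspondence matrices from the "Job" example: rows index the global classes
# {Full-time, Part-time, Retired, Unwaged}; columns index each dataset's local classes.
A_G1 = np.array([[1, 0, 0],
                 [1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])          # global -> {Working, Retired, Unwaged}
A_G2 = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1],
                 [0, 0, 1]])          # global -> {Full-time, Part-time, NotWorking}
A_G3 = np.eye(4, dtype=int)           # Dataset3 already uses the global scheme

# Dataset3's counts (global scheme) re-expressed under Dataset1's coarser scheme.
global_counts = np.array([80, 45, 30, 15])
print(global_counts @ A_G1)           # -> [125  30  15]
```

The same multiplication is what the clustering algorithm uses implicitly when it maps cluster-level probabilities over the global values onto each dataset's local classes.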
Given data $y$ with independent multivariate observations $y_1, \ldots, y_n$, the likelihood of a mixture model with $G$ components is

$$L_{MIX}(\theta_1,\ldots,\theta_G; \tau_1,\ldots,\tau_G \mid y) = \prod_{r=1}^{n} \sum_{k=1}^{G} \tau_k f_k(y_r \mid \theta_k) \qquad (1)$$

where $f_k$ and $\theta_k$ are, respectively, the density function (probability mass function for discrete data) and the parameters of the $k$th cluster in the mixture, and $\tau_k$ is the probability of an observation belonging to the $k$th cluster ($\tau_k \geq 0$; $\sum_{k=1}^{G} \tau_k = 1$) (Fraley et al. 2002).

The EM algorithm is a general approach to maximum likelihood estimation for problems in which the "complete data" $x_r = (y_r, z_r)$ can be viewed as consisting of $n$ multivariate observations, in which $y_r$ is observed and $z_r = (z_{r1}, \ldots, z_{rG})$ is the unobserved portion of the data (Formula (2)). The value of $z_{rk}$ lies between 0 and 1, indicating the conditional probability that observation $r$ belongs to cluster $k$.

$$z_{rk} = \begin{cases} 1 & \text{if } x_r \text{ belongs to group } k \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

For our problem, $f_k$ is the multinomial probability distribution function of cluster $k$. The observations $x_r$ are the semantically heterogeneous distributed databases; the observed data $y_r$ are the aggregate counts in the datasets, and the unobserved data $z_r$ indicate the cluster to which observation $x_r$ should belong.

Assuming that each $z_r$ is independent and identically distributed according to a multinomial distribution of one draw from the $G$ clusters with probabilities $\tau_1, \ldots, \tau_G$, and that the probability mass function of an observation $y_r$ given $z_r$ is $\prod_{k=1}^{G} f_k(y_r \mid \theta_k)^{z_{rk}}$, the complete-data likelihood is then

$$l(\theta_k, \tau_k, z_{rk} \mid x) = \prod_{r=1}^{n} \prod_{k=1}^{G} \left[ \tau_k f_k(y_r \mid \theta_k) \right]^{z_{rk}} \qquad (3)$$

Correspondingly, the complete-data log-likelihood is

$$ll(\theta_k, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk} \log\left[ \tau_k f_k(y_r \mid \theta_k) \right] \qquad (4)$$

The EM algorithm alternates between two steps. In the "E step", given the observed data and the current estimates of the parameters, it computes the conditional expectation $\hat{z}_{rk}$ (Formula (5)); in the "M step", it determines the parameters $\tau_k$ and $\theta_k$ of the $k$th component that maximize the expected complete-data log-likelihood (Formula (4)), with the values $\hat{z}_{rk}$ fixed from the E step.

$$\hat{z}_{rk} = \frac{\hat{\tau}_k f_k(y_r \mid \hat{\theta}_k)}{\sum_{c=1}^{G} \hat{\tau}_c f_c(y_r \mid \hat{\theta}_c)} \qquad (5)$$
The homogeneous datasets share the same ontology (schema). In a multinomial distribution, each of the $n$ independent trials results in one of $m$ possible outcomes, with probabilities $p_1, \ldots, p_m$. The variable $X_i$ indicates the number of times outcome value $x_i$ was observed over the $n$ trials. The multinomial distribution is defined as the distribution of the vector $(X_1, \ldots, X_m)$, with probabilities given by

$$P(X_1 = x_1, \ldots, X_m = x_m) \propto p_1^{x_1} \cdots p_m^{x_m} \qquad (6)$$

In model-based clustering on multinomial data, for each of the $G$ clusters the probability distribution within that cluster is denoted $\theta_{k1}, \ldots, \theta_{km}$, where $\theta_{ki}$ is the probability of value $x_i$ in cluster $k$, $i = 1, \ldots, m$ and $k = 1, \ldots, G$.

For multinomial homogeneous databases, the complete-data log-likelihood is

$$ll(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk} \log\left[ \tau_k \prod_{i=1}^{m} \theta_{ki}^{n_{ri}} \right] \qquad (7)$$

where $n_{ri}$ is the aggregate count for value $i$ in dataset $r$.

In the E-step, $\hat{z}_{rk}$ is given by Formula (8):

$$\hat{z}_{rk} = \frac{\hat{\tau}_k \left( \hat{\theta}_{k1}^{n_{r1}} \cdots \hat{\theta}_{km}^{n_{rm}} \right)}{\sum_{c=1}^{G} \hat{\tau}_c \left( \hat{\theta}_{c1}^{n_{r1}} \cdots \hat{\theta}_{cm}^{n_{rm}} \right)} \qquad (8)$$

In the M-step, the estimates of the probability distribution for each group, $\hat{\theta}_{ci}$, and of the probability of a dataset belonging to each group, $\hat{\tau}_k$, involving the data and $\hat{z}_{rk}$, are

$$\hat{\tau}_k = \frac{n_k}{n}, \ \text{where } n_k = \sum_{r=1}^{n} \hat{z}_{rk}; \qquad
\hat{\theta}_{ci} = \frac{\sum_{r=1}^{n} \hat{z}_{rc}\, n_{ri}}{\sum_{r=1}^{n} \hat{z}_{rc} \left( \sum_{i=1}^{m} n_{ri} \right)} \qquad (9)$$

The EM iteration continues until it converges or performance criteria are met.

Example (Supermarket association data): The following table shows shopping data for three supermarkets. The counts are the numbers of people who buy cheese and who do not in the three supermarkets.

                 Cheese   NoCheese
Supermarket1       80        45
Supermarket2       56        68
Supermarket3       40        23

We assume that there are two clusters, with probabilities $\tau_1$ and $\tau_2$ indicating the probabilities of a supermarket belonging to cluster 1 and cluster 2 respectively. $z_{rk}$ indicates the conditional probability that Supermarket $r$ belongs to cluster $k$; $\theta_{ki}$ is the conditional probability of value $i$ (either Cheese or NoCheese) in cluster $k$. The $z_{rk}$ are first initialized randomly as 1 or 0. For example, Supermarkets 1 and 2 belong to cluster 1 and Supermarket 3 is in the other cluster; thus $z_{11} = z_{21} = z_{32} = 1$ and $z_{12} = z_{22} = z_{31} = 0$.

In iteration 1, according to Formula (9):

$$n_1 = \sum_{r=1}^{3} \hat{z}_{r1} = (1 + 1 + 0) = 2, \quad \hat{\tau}_1 = \frac{n_1}{n} = \frac{2}{3}, \quad \hat{\tau}_2 = 1/3;$$

$$\hat{\theta}_{11} = \frac{80 \cdot 1 + 56 \cdot 1 + 40 \cdot 0}{(80 + 45) \cdot 1 + (56 + 68) \cdot 1 + (40 + 23) \cdot 0} = 0.546$$

This means that in cluster 1 the probability that people buy "Cheese" is 0.546, so the probability that people do not buy cheese ("NoCheese") is $\hat{\theta}_{12} = 1 - 0.546 = 0.454$. In the same way we obtain that, in cluster 2, the probabilities of people who buy cheese and who do not are 0.635 and 0.365 respectively.

Then, in the E-step, we can work out the expectations:

$$\hat{z}_{11} = \frac{(2/3)\,(0.546^{80} \cdot 0.454^{45})}{(2/3)\,(0.546^{80} \cdot 0.454^{45}) + (1/3)\,(0.635^{80} \cdot 0.365^{45})} = 0.174$$

$\hat{z}_{11}$ is the probability that Supermarket1 belongs to Cluster 1. Similarly, we obtain $\hat{z}_{12} = 0.826$, $\hat{z}_{21} = 0.999$, $\hat{z}_{22} = 0.001$, $\hat{z}_{31} = 0.42$, $\hat{z}_{32} = 0.58$.

Given these parameters, the value of the log-likelihood that we want to maximize, $ll(\theta_{ki}, \tau_k, z_{rk} \mid x)$, is also obtained from the E-step:

$$\begin{aligned}
ll = {} & 0.174\,[\log(\tfrac{2}{3}\, 0.546^{80}\, 0.454^{45})] + 0.826\,[\log(\tfrac{1}{3}\, 0.635^{80}\, 0.365^{45})] \\
      & + 0.999\,[\log(\tfrac{2}{3}\, 0.546^{56}\, 0.454^{68})] + 0.001\,[\log(\tfrac{1}{3}\, 0.635^{56}\, 0.365^{68})] \\
      & + 0.42\,[\log(\tfrac{2}{3}\, 0.546^{40}\, 0.454^{23})] + 0.58\,[\log(\tfrac{1}{3}\, 0.635^{40}\, 0.365^{23})] = -92.78227
\end{aligned}$$

With the newly calculated $\hat{z}$, the next iteration of EM starts. The parameters are calculated in the same way as in iteration 1. The convergence threshold is that the difference between the log-likelihoods of two successive iterations is less than $10^{-6}$. In this example, it takes 8 iterations to converge. The final values of the parameters are

$$z_{rk} = \begin{pmatrix} 6.989 \times 10^{-5} & 0.99993 \\ 0.99974 & 0.00026 \\ 0.00726 & 0.99274 \end{pmatrix}, \qquad
\theta_{ki} = \begin{pmatrix} 0.4523 & 0.5477 \\ 0.63827 & 0.36172 \end{pmatrix}$$

The log-likelihood for the clustering model is $-91.35059$. The clusters are {Supermarket2} and {Supermarket1, Supermarket3}.
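The following is a minimal sketch of this EM procedure applied to the supermarket counts (Python/NumPy; the function and variable names are ours, not part of the original method description). It implements Formulas (7)-(9) with the same hard 0/1 initialization used in the worked example:

```python
import numpy as np

# Aggregate counts per dataset: rows are supermarkets, columns are (Cheese, NoCheese).
counts = np.array([[80., 45.],
                   [56., 68.],
                   [40., 23.]])

def em_multinomial_mixture(n_ri, z_init, tol=1e-6, max_iter=100):
    """EM for a mixture of multinomials on homogeneous aggregate counts (Formulas (7)-(9))."""
    n, m = n_ri.shape
    z = z_init.astype(float)
    ll_old = -np.inf
    for _ in range(max_iter):
        # M-step (Formula (9)): cluster weights and within-cluster value probabilities.
        tau = z.sum(axis=0) / n
        theta = (z.T @ n_ri) / (z.T @ n_ri.sum(axis=1, keepdims=True))
        # E-step (Formula (8)): responsibilities, computed in log space for stability.
        log_num = np.log(tau) + n_ri @ np.log(theta).T
        log_num -= log_num.max(axis=1, keepdims=True)
        z = np.exp(log_num)
        z /= z.sum(axis=1, keepdims=True)
        # Complete-data log-likelihood (Formula (7)); stop when it changes very little.
        ll = np.sum(z * (np.log(tau) + n_ri @ np.log(theta).T))
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return tau, theta, z, ll

# Hard initialization as in the example: Supermarkets 1 and 2 in cluster 1, Supermarket 3 in cluster 2.
z0 = np.array([[1, 0], [1, 0], [0, 1]])
tau, theta, z, ll = em_multinomial_mixture(counts, z0)
```

With this initialization the sketch should reproduce the grouping found in the worked example, with Supermarket2 separated from Supermarkets 1 and 3.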
In a distributed database environment, heterogeneity may arise from differences in the granularity of the data stored in the different distributed databases, or from differences in the underlying concepts. An effective algorithm needs to solve the heterogeneity problem with as little information loss as possible while avoiding the homogenization process. Data integration (McClean et al. 2003) is one way to solve heterogeneity issues, with ontology mappings provided ab initio using correspondence matrices. The EM algorithm is used for the estimation of mixture models, and it also provides an intuitive approach to the integration of aggregates to handle the heterogeneity issues. This gives the opportunity to execute the clustering and the data integration that resolves the heterogeneity simultaneously, without homogenization beforehand.
Two heterogeneity issues have to be overcome. The first is that the datasets are heterogeneous in the sense that they are extracted from the same population but with attribute values at different levels of detail, i.e., under different ontologies. Model-based clustering also addresses heterogeneity in the sense that the datasets may be extracted from different populations of individuals. By developing model-based clustering algorithms for heterogeneous data, we consider both datasets with different representation structures (schemas) and datasets from different populations, which carry different underlying meanings (semantics) from their original backgrounds. The algorithm for model-based clustering on semantically heterogeneous databases using the EM algorithm is described below.
Let $g_r$ be the number of partitions (local classes) in dataset $r$, and let $s$ be the index over the partitions, $s = 1, \ldots, g_r$. $q_{isr}$ represents the mapping from the shared global ontology to the local ontology of each dataset: for dataset $r$, the value of $q_{isr}$ is 1 if global value $i$ maps onto local partition $s$, and 0 otherwise.

For the EM algorithm, in the E-step:

$$\hat{z}_{rk} = \frac{\hat{\tau}_k \prod_{s=1}^{g_r} \left( \sum_{u=1}^{m} \hat{\theta}_{ku}\, q_{usr} \right)^{n_{rs}}}{\sum_{j=1}^{G} \hat{\tau}_j \prod_{s=1}^{g_r} \left( \sum_{u=1}^{m} \hat{\theta}_{ju}\, q_{usr} \right)^{n_{rs}}} \qquad (10)$$

In the M-step, $\hat{\tau}_k$ remains the same as in Formula (9). The probability of value $i$ in cluster $k$ is updated by Formula (11), where the value of $\hat{\theta}_{ki}^{(n)}$ in this iteration is calculated from the value of $\hat{\theta}_{ki}^{(n-1)}$ in the previous iteration; the superscript $(n)$ denotes the iteration number:

$$\hat{\theta}_{ki}^{(n)} = \frac{\displaystyle\sum_{r=1}^{n} \sum_{s=1}^{g_r} \frac{\hat{z}_{rk}\, n_{rs}\, q_{isr}\, \hat{\theta}_{ki}^{(n-1)}}{\sum_{u=1}^{m} \hat{\theta}_{ku}^{(n-1)}\, q_{usr}}}{\displaystyle\sum_{r=1}^{n} \hat{z}_{rk} \sum_{s=1}^{g_r} n_{rs}} \qquad (11)$$

The log-likelihood of the clustering model given the semantically heterogeneous data is

$$ll(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk} \left[ \log \tau_k + \sum_{s=1}^{g_r} n_{rs} \log\left( \sum_{i=1}^{m} \theta_{ki}\, q_{isr} \right) \right]$$
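As an illustration, here is a minimal sketch of this heterogeneous E-step and M-step in Python/NumPy, applied to the "Job" data of Table 1. The function and variable names are ours, and the code is only one way of realizing Formulas (10) and (11), not the authors' implementation:

```python
import numpy as np

# Counts from Table 1, each over its own local classification scheme, and the
# correspondence matrices A_G1, A_G2, A_G3 mapping the m = 4 global values
# {Full-time, Part-time, Retired, Unwaged} onto each local scheme.
counts = [np.array([150., 35., 20.]),            # Dataset1: Working, Retired, Unwaged
          np.array([55., 50., 90.]),             # Dataset2: Full-time, Part-time, NotWorking
          np.array([80., 45., 30., 15.])]        # Dataset3: global scheme
Q = [np.array([[1., 0., 0.], [1., 0., 0.], [0., 1., 0.], [0., 0., 1.]]),
     np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 1.], [0., 0., 1.]]),
     np.eye(4)]

def em_heterogeneous(counts, Q, z_init, m=4, tol=1e-6, max_iter=200):
    n, G = len(counts), z_init.shape[1]
    z = z_init.astype(float)
    theta = np.full((G, m), 1.0 / m)               # uniform within-cluster probabilities
    ll_old = -np.inf
    for _ in range(max_iter):
        # M-step: tau as in Formula (9); theta updated by Formula (11).
        tau = z.sum(axis=0) / n
        num, den = np.zeros((G, m)), np.zeros(G)
        for r in range(n):
            agg = theta @ Q[r]                     # (G, g_r): cluster mass on each local class
            num += z[r][:, None] * ((counts[r] / agg) @ Q[r].T) * theta
            den += z[r] * counts[r].sum()
        theta = num / den[:, None]
        # E-step: responsibilities from Formula (10), computed in log space.
        log_z = np.array([np.log(tau) + (np.log(theta @ Q[r]) * counts[r]).sum(axis=1)
                          for r in range(n)])
        log_z -= log_z.max(axis=1, keepdims=True)
        z = np.exp(log_z)
        z /= z.sum(axis=1, keepdims=True)
        # Complete-data log-likelihood of the heterogeneous model.
        ll = sum((z[r] * (np.log(tau) + (np.log(theta @ Q[r]) * counts[r]).sum(axis=1))).sum()
                 for r in range(n))
        if abs(ll - ll_old) < tol:
            break
        ll_old = ll
    return tau, theta, z, ll

# Hard start as in the worked example below: Dataset1 and Dataset2 in cluster 1, Dataset3 in cluster 2.
z0 = np.array([[1., 0.], [1., 0.], [0., 1.]])
tau, theta, z, ll = em_heterogeneous(counts, Q, z0)
```

With the same initialization as the worked example that follows, the first M-step should reproduce the hand-calculated value $\hat{\theta}_{11}^{(1)} = 0.325$.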
Example:
Take the semantically heterogeneous data in Table 1 as an example. We assume there are two clusters for these heterogeneous data (the optimal number of clusters can be decided using the Bayesian Information Criterion, as illustrated in the next section). $z_{rk}$ is randomly initialized with 0/1 values; e.g., cluster 1 contains Dataset1 and Dataset2 and cluster 2 contains Dataset3, thus $z_{11} = z_{21} = z_{32} = 1$. Here, $q_{is1}$ is the correspondence matrix $A_{G1}$ for the mapping between the global ontology {Full-time, Part-time, Retired, Unwaged} and the local ontology of Dataset1 {Working, Retired, Unwaged}; $q_{is2}$ is $A_{G2}$ for Dataset2, and $q_{is3}$ is $A_{G3}$. There are $m = 4$ schema values for attribute "Job" in the global ontology. The probability distributions in each cluster are initialized with a uniform distribution:
$$\hat{\theta}_{ki}^{(0)} = 1/m = 0.25, \qquad k = 1, 2; \quad i = 1, \ldots, 4$$

In iteration 1:

$$\hat{\tau}_1 = 2/3, \qquad \hat{\tau}_2 = 1/3$$

$$\hat{\theta}_{11}^{(1)} = \frac{\hat{z}_{11} \left( n_{11} \frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)} + \hat{\theta}_{12}^{(0)}} \right) + \hat{z}_{21} \left( n_{21} \frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)}} \right) + \hat{z}_{31} \left( n_{31} \frac{\hat{\theta}_{11}^{(0)}}{\hat{\theta}_{11}^{(0)}} \right)}{\hat{z}_{11}(n_{11} + n_{12} + n_{13}) + \hat{z}_{21}(n_{21} + n_{22} + n_{23}) + \hat{z}_{31}(n_{31} + n_{32} + n_{33} + n_{34})}
= \frac{1 \cdot \frac{0.25 \cdot 150}{0.25 + 0.25} + 1 \cdot \frac{0.25 \cdot 55}{0.25} + 0 \cdot 80}{1 \cdot 205 + 1 \cdot 195 + 0 \cdot 170} = 0.325$$

The term $\frac{0.25 \cdot 150}{0.25 + 0.25}$ is the contribution of Dataset1 to the probability of value 1 (Full-time) in cluster 1: the count 150 is apportioned as 75 and 75 to the values Full-time and Part-time respectively. Here is another example, calculating $\hat{\theta}_{13}^{(1)}$:

$$\hat{\theta}_{13}^{(1)} = \frac{\hat{z}_{11} \left( n_{12} \frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)}} \right) + \hat{z}_{21} \left( n_{23} \frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)} + \hat{\theta}_{14}^{(0)}} \right) + \hat{z}_{31} \left( n_{33} \frac{\hat{\theta}_{13}^{(0)}}{\hat{\theta}_{13}^{(0)}} \right)}{\hat{z}_{11}(n_{11} + n_{12} + n_{13}) + \hat{z}_{21}(n_{21} + n_{22} + n_{23}) + \hat{z}_{31}(n_{31} + n_{32} + n_{33} + n_{34})}
= \frac{1 \cdot 35 + 1 \cdot \frac{0.25 \cdot 90}{0.25 + 0.25} + 0 \cdot 30}{1 \cdot 205 + 1 \cdot 195 + 0 \cdot 170} = 0.2$$

In the same way, we can obtain

$$\hat{\theta}_{ki}^{(1)} = \begin{pmatrix} 0.325 & 0.3125 & 0.2 & 0.1625 \\ 0.471 & 0.265 & 0.176 & 0.088 \end{pmatrix}$$
Then, in the E-step, we compute $\hat{z}_{rk}$ given the parameters estimated above in the M-step:

$$\hat{z}_{11} = \frac{\hat{\tau}_1 \left[ (\hat{\theta}_{11}^{(1)} + \hat{\theta}_{12}^{(1)})^{n_{11}} (\hat{\theta}_{13}^{(1)})^{n_{12}} (\hat{\theta}_{14}^{(1)})^{n_{13}} \right]}{\hat{\tau}_1 \left[ (\hat{\theta}_{11}^{(1)} + \hat{\theta}_{12}^{(1)})^{n_{11}} (\hat{\theta}_{13}^{(1)})^{n_{12}} (\hat{\theta}_{14}^{(1)})^{n_{13}} \right] + \hat{\tau}_2 \left[ (\hat{\theta}_{21}^{(1)} + \hat{\theta}_{22}^{(1)})^{n_{11}} (\hat{\theta}_{23}^{(1)})^{n_{12}} (\hat{\theta}_{24}^{(1)})^{n_{13}} \right]}$$

$$= \frac{\tfrac{2}{3} \left[ (0.325 + 0.3125)^{150}\, 0.2^{35}\, 0.1625^{20} \right]}{\tfrac{2}{3} \left[ (0.325 + 0.3125)^{150}\, 0.2^{35}\, 0.1625^{20} \right] + \tfrac{1}{3} \left[ (0.471 + 0.265)^{150}\, 0.176^{35}\, 0.088^{20} \right]} = 0.01598$$

In the same way, we can get

$$\hat{z}_{rk} = \begin{pmatrix} 0.015982 & 0.984018 \\ 0.999999 & 0.000001 \\ 0.000196 & 0.999804 \end{pmatrix}$$

The model log-likelihood in iteration 1 is $-250.9899$.

The EM iteration continues in the same way until it converges. For this example, it takes only 4 iterations for the algorithm to converge. The values of the parameters after full convergence are as follows:

$$\tau_k = (0.3333,\ 0.6667), \qquad
\theta_{ki} = \begin{pmatrix} 0.28205 & 0.25641 & 0.25502 & 0.20651 \\ 0.46933 & 0.264 & 0.17733 & 0.09333 \end{pmatrix}, \qquad
z_{rk} = \begin{pmatrix} 4.5 \times 10^{-9} & 0.9999999955 \\ 0.9999999779 & 2.21 \times 10^{-8} \\ 4.4 \times 10^{-9} & 0.9999999956 \end{pmatrix}$$

The log-likelihood of the clustering model is $-248.8713$.
The clusters are {Dataset1, Dataset3} and {Dataset2}.

Bayes factors, approximated by the Bayesian Information Criterion (BIC) (Schwarz 1978), have been applied successfully to the problem of determining the optimal number of clusters in a mixture model by the direct comparison of different models with various numbers of clusters (Dasgupta et al. 1998). The fit of a mixture model to a given dataset can only improve (the likelihood can only increase) as more detailed clusters are generated in the model. Hence the likelihood itself cannot be used to assess the clustering for a given number of clusters. In the BIC, an additional term penalizes the complexity of the model, so that the criterion may be maximized for more parsimonious parameterizations and smaller numbers of groups. The BIC is calculated as follows:

$$2 \log p(x \mid M) + \mathrm{const} \approx 2\, ll_M(x, \hat{\theta}) - m_M \log(n) \equiv \mathrm{BIC} \qquad (12)$$

where $ll_M(x, \hat{\theta})$ is the maximized mixture log-likelihood of the model illustrated in Formula (1), $m_M$ is the number of independent parameters to be estimated in the model, and $n$ is the number of datasets to be clustered. The larger the value of the BIC, the stronger the evidence for the model with its associated number of clusters. For the clustering models dealing with semantically heterogeneous databases, the mixture model log-likelihood is

$$ll_M(x, \hat{\theta}) = \sum_{r=1}^{n} \log \left[ \sum_{k=1}^{G} \hat{\tau}_k \prod_{s=1}^{g_r} \left( \sum_{i=1}^{m} \hat{\theta}_{ki}\, q_{isr} \right)^{n_{rs}} \right]$$

The number of independent parameters to be estimated in the model must be counted, but the number of clusters itself is not counted as an independent parameter. The mixing proportions $\tau_k$ need to be estimated, contributing $(G - 1)$ independent parameters, where $G$ is the number of clusters; the probability distribution $\theta_{ki}$ in each cluster also needs to be estimated, and each cluster has $(m - 1)$ independent probabilities, so the $G$ components contribute $G(m - 1)$ independent parameters. Altogether there are

$$m_M = (G - 1) + G(m - 1)$$

independent parameters. Therefore, the BIC for the clustering model on heterogeneous databases can be obtained straightforwardly. The data model, with its associated number of clusters, that has the largest BIC is the optimal model for model-based clustering.
Example:
For the heterogeneous example above, we use the BIC to choose the optimal number of clusters, starting from 2 clusters. The structure of the matrix $z_{rk}$ changes according to the number of clusters, and the values of $z_{rk}$ for each number of clusters are initialized randomly, subject to the probabilities of each dataset belonging to the clusters summing to 1. The result of the clustering algorithm with BIC is that 2 is the optimal number of clusters for our example, with log-likelihood $-248.8713$ and BIC $= -501.0824$. The BIC for 3 clusters is $-503.2472$. A standard convention for calibrating BIC differences is that a difference between 2 and 6 constitutes positive evidence (Kass et al. 1995). In our case, the BIC difference is 2.1648, which shows positive evidence according to this convention.
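As a small illustration of the model-selection step, the sketch below (our own helper functions, not from the paper) computes the penalty term and the BIC of Formula (12) for a fitted model; one would fit the mixture for several candidate numbers of clusters and keep the model with the largest BIC:

```python
import math

def n_independent_params(G, m):
    # (G - 1) mixing proportions plus G * (m - 1) within-cluster probabilities.
    return (G - 1) + G * (m - 1)

def bic(mixture_loglik, G, m, n):
    # Formula (12): BIC = 2 * ll_M(x, theta_hat) - m_M * log(n); larger is better.
    return 2 * mixture_loglik - n_independent_params(G, m) * math.log(n)
```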
The proposed clustering method via multinomial mixture models using the EM algorithm on semantically heterogeneous datasets is shown in Figure 1.
The algorithm was tested on state-specific data from the US Census, in which the states were clustered on the basis of "gross rent" in 2000. A sample of the homogeneous dataset is shown in Table 2. A (synthetic) set of heterogeneous databases (Table 3) was then generated by combining data from the homogeneous dataset for a random selection of states. The result is shown in Figure 2. We compared our results with the ranked "States Median Gross Rent" table supplied by the US Census Bureau for that year, and the results are very encouraging. The algorithm successfully finds the most expensive states for renting in the US: {Hawaii, New Jersey, California, Alaska, Nevada}, where the median for each of these states is more than $700. The second cluster contains somewhat less expensive states, whose medians range from $650 to $699; it includes 7 states, among them {Maryland, New York, Virginia, ...}. The remaining clusters include the states with median rents ranging from $555-$650, $497-$555 and $400-$496 respectively, without any exceptions. Example states in these three clusters are {Florida, Georgia, Texas, ...}, {Michigan, Ohio, Tennessee, ...} and {Iowa, Arkansas, Mississippi, ...}.
Initialize $\hat{z}_{rk}$; let $p$ denote the iteration counter.
repeat
    M-step:
        $n_k = \sum_{r=1}^{n} \hat{z}_{rk}$,  $\hat{\tau}_k = n_k / n$
        $\hat{\theta}_{ki}^{(p)} = \dfrac{\sum_{r=1}^{n} \sum_{s=1}^{g_r} \frac{\hat{z}_{rk}\, n_{rs}\, q_{isr}\, \hat{\theta}_{ki}^{(p-1)}}{\sum_{u=1}^{m} \hat{\theta}_{ku}^{(p-1)}\, q_{usr}}}{\sum_{r=1}^{n} \hat{z}_{rk} \sum_{s=1}^{g_r} n_{rs}}$
    E-step: compute $\hat{z}_{rk}$ given the parameters estimated in the M-step:
        $\hat{z}_{rk} = \dfrac{\hat{\tau}_k \prod_{s=1}^{g_r} \left( \sum_{u=1}^{m} \hat{\theta}_{ku}\, q_{usr} \right)^{n_{rs}}}{\sum_{j=1}^{G} \hat{\tau}_j \prod_{s=1}^{g_r} \left( \sum_{u=1}^{m} \hat{\theta}_{ju}\, q_{usr} \right)^{n_{rs}}}$
    Calculate the complete-data log-likelihood:
        $ll^{(p)}(\theta_{ki}, \tau_k, z_{rk} \mid x) = \sum_{r=1}^{n} \sum_{k=1}^{G} z_{rk} \left[ \log \tau_k + \sum_{s=1}^{g_r} n_{rs} \log\left( \sum_{i=1}^{m} \theta_{ki}\, q_{isr} \right) \right]$
until $|ll^{(p)} - ll^{(p-1)}| < 10^{-6}$
Compute the BIC for the mixture models with different numbers of clusters, and choose the optimal parameters from the EM.

Figure 1 EM algorithm for clustering via multinomial mixture models on heterogeneous datasets. The strategy described here terminates when the relative difference between the values of the mixture log-likelihood in Equation (1) falls below a small threshold ($10^{-4}$). It also uses the BIC to determine the optimal number of clusters.
Table 2 A sample of the US states Gross Rent dataset

States        <=$250    $250-$499   $500-$749   $750-$999   $1000-$1499
Alabama        68008      186901        ...         ...         9101
Alaska         35231      131456        ...         ...         5257
Arizona        29837      135811        ...         ...        48722
Arkansas        1981       10917        ...         ...          ...
...              ...         ...        ...         ...          ...
Wisconsin      44787      209117        ...         ...          ...
Wyoming         5906       26685        ...         ...         3076
Table 3 A sample heterogeneous Gross Rent dataset

Arizona        <=$499   $500-$749   $750-$999   $1000-$1499   $1500-$1999   >=$2000
                 ...        ...         ...          ...           ...          ...
Connecticut    <=$250   $250-$499   $500-$749   $750-$999     $1000-$1499   >=$1500
               32102      63966      152735       97511          47845        16522
Delaware       <=$499   $500-$999   $1000-$1499  $1500-$1999    >=$2000
               19979      49972        5688          627            922
...
Wyoming        <=$250   $250-$499   $500-$749   $750-$999     $1000-$1499   $1500-$1999   >=$2000
                5906      26685       13150        3076           1124           280          95
Figure 2 Clustering States using US Gross Rent Heterogeneous Census Data. Five clusters are marked; the more expensive the gross rent in a state, the darker it is shaded.
A method has been proposed for clustering databases distributed in an open heterogeneous environment such as the Semantic Web. These distributed databases may be either homogeneous or heterogeneous with respect to their classification schemes. The databases uploaded to the web are developed independently and thus may differ in their underlying semantics or characteristics. The proposed model-based clustering method using the EM algorithm takes both the heterogeneity and the semantics issues into account. The resulting clusters represent databases that share similar characteristics, which differ from those of the databases in other clusters. The new knowledge discovered from the clustering can assist the construction of Bayesian Belief Networks to help build more efficient prediction models, and also association rule mining.
With burgeoning Semantic Web development, distributed databases are becoming increasingly commonplace, and most of them are heterogeneous and contain semantic information. The potential for learning new knowledge from distributed databases in such an environment is enormous, and our algorithm has a great opportunity to be applied across a wide range of application domains.
Initialization is a very important factor in model-based clustering of heterogeneous databases, since it directly affects how efficiently the EM algorithm converges. Under fairly mild regularity conditions, EM can be shown to converge to a local maximum of the observed-data likelihood. A hierarchical agglomeration method can be used first to obtain clusters that approximately maximize the classification likelihood for each model. The parameters obtained from the hierarchical clustering are then used to initialize the conditional probabilities $\hat{z}$ in an M-step of the EM algorithm. This idea can be implemented to extend our current algorithm. Handling outliers is another aspect we will address in future work. More evaluation work using real and simulated data also needs to be carried out.
Based on the derived clusters that capture the context, high-performance prediction models for classification tasks can be built on top of this approach. Our approach also further facilitates semantically distributed knowledge discovery on heterogeneous databases on the Internet.
Cadez, I., S. Gaffney, and P. Smyth, A General
Probabilistic Framework for Clustering Individuals . 2000,
Department of Information and Computer Science,
University of California, Irvine.
Caragea, D., A. Silvescu, and V. Honavar, A Framework for Learning from Distributed Data Using Sufficient
Statistics and its Application to Learning Decision Trees .
International Journal of Hybrid Intelligent Systems, 2004.
1 (2): p. 80-89.
Caragea, D., J. Pathak, and V. Honavar, Learning classifiers from Semantically Heterogeneous Data , in
CoopIS/DOA/ODBASE (2) . 2004. p. 963-980.
Christen, J.A., A. Medrano-Soto, and J. Collado-Vides, Bayesian Clustering and Prediction in Heterogeneous Databases Using Multivariate Mixtures. Journal of Classification.
Dasgupta, A. and A.E. Raftery, Detecting features in spatial point processes with clutter via model-based clustering. Journal of the American Statistical Association, 1998. 93 (441): p. 294-302.
Dempster, A.P., N.M. Laird, and D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 1977. 39: p. 1-38.
Doan, A. and A.Y. Halevy, Semantic Integration Research in the Database Community: A Brief Survey.
AI Magazine,
2005. 26 (1): p. 83-94.
Fraley, C. and A.E. Raftery, How Many Clusters? Which
Clustering Methods? Answers Via Model-Based Cluster
Analysis.
Computer Journal, 1998. 41 : p. 578-588.
Fraley, C. and A.E. Raftery, Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association, 2002. 97 (458): p. 611-631.
Kargupta, H. and P. Chan, Advances in Distributed and
Parallel Knowledge Discovery . 2000, Cambridge,
Massachusetts: AAAI Press / The MIT Press. 400.
Kass, R.E. and A.E. Raftery, Bayes Factors. Journal of the American Statistical Association, 1995. 90: p. 773-795.
Berners-Lee, T., J. Hendler, and O. Lassila, The Semantic
Web.
Scientific American, 2001. 284 (5): p. 34-43.
McClean, S.I., B.W. Scotney, and K. Greer, A Scalable
Approach to Integrating Heterogeneous Aggregate Views of Distributed Databases.
IEEE Trans. on Knowledge and
Data Engineering, 2003. 15 (1): p. 232-235.
McClean, S.I., et al., Knowledge discovery by probabilistic clustering of distributed databases.
Data & Knowledge
Engineering, 2005. 54 (2): p. 189-210.
Merugu, S. and J. Ghosh, A Distributed Learning
Framework for Heterogeneous Data Sources , in KDD 05 .
2005, ACM Press: Chicago, Illinois, USA. p. 208-217.
Park, B. and H. Kargupta, Distributed Data Mining:
Algorithms, Systems, and Applications , in Data Mining
Handbook , N. Ye, Editor. 2002, IEA. p. 341-358.
Provost, F., Distributed Data Mining: Scaling up and
Beyond . Advances in Distributed and Parallel Knowledge
Discovery, ed. H. Kargupta and P. Chan. 2000: AAAI
Press/ The MIT Press.
Schwarz, G., Estimating the Dimensions of a Model.
The
Annals of Statistics, 1978. 6 (2): p. 461-464.
Tsoumakas, G., L. Angelis, and I. Vlahavas, Clustering classifiers for knowledge discovery from physically distributed databases.
Data & Knowledge Engineering,
2004. 49 (3): p. 223-242.
Wirth R., M. Borth, and J. Hipp, When distribution is part of the semantics: A new problem class for distributed knowledge discovery , in Proceeding of 5th ECML and
PKDD Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments . 2001: Freiburg, Germany. p. 56-64.
Zhang, J., D. Caragea, and V. Honavar, Learning
Ontology-aware Classifiers , in 8th International
Conference Discovery Science . 2005, Springer: Singapore. p. 294-307.