Tree-based Clustering for Gene Expression Data
Baoying Wang, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
Tel: 1-701-2316257
{baoying.wang, william.perrizo}@ndsu.nodak.edu
ABSTRACT
Data clustering has proven to be a successful data mining technique in the analysis of gene expression data and many other types of data. However, some concerns and challenges remain, for example, in gene expression clustering. In
this paper, we propose an efficient clustering method using
attractor trees. The combination of the density-based approach
and the similarity-based approach considers clusters with
diverse shapes, densities, and sizes. Experiments on gene
expression datasets demonstrate that our approach is efficient
and scalable with competitive accuracy.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval – Clustering.
General Terms: Algorithms
Keywords: Gene expression data, clustering, microarray.
1. INTRODUCTION
Clustering analysis of microarray gene expression data has
been recognized as an important method for gene expression
analysis. However, some concerns and challenges still remain
in gene expression clustering. For example, many traditional
clustering methods originating from non-biological fields may
not work well if the model is not sufficient to capture the
genuine clusters among noisy data.
There are two main groups of clustering methods: similarity-based methods and density-based methods. Most hierarchical clustering methods belong to the similarity-based class. As a result, they cannot handle clusters with arbitrary shapes well. In this paper, we propose an efficient agglomerative hierarchical clustering method using attractor trees. Our method combines the features of both the density-based clustering approach and the similarity-based clustering approach. It takes into consideration clusters with diverse shapes, densities, and sizes, and is capable of dealing with noisy data.
Experiments on standard gene expression datasets demonstrate
that our approach is very efficient and scalable, with
competitive accuracy.
2. THE METHOD
In this section, we propose an efficient agglomerative hierarchical clustering method using attractor trees, called Clustering using Attractor Trees and Merging Processes (CAMP). CAMP consists of two processes: (1) clustering by local attractor trees (CLA) and (2) cluster merging based on similarity (MP). The final clustering result is an attractor tree and a set of bit indexes to the clusters corresponding to each level of the attractor tree. The attractor tree is composed of leaf nodes, which are the local attractors of the attractor sub-trees constructed in the CLA process, and interior nodes, which are the virtual attractors resulting from the MP process. Figure 1 is an example of an attractor tree.
Figure 1. The attractor tree (leaf nodes: local attractors; interior nodes: virtual attractors)
The data set is first grouped into local attractor trees by means of a density-based approach in the CLA process. Each local attractor tree represents a preliminary cluster, the root of which is a density attractor of the cluster. Then the small clusters are merged level-by-level in the MP process according to their similarity, until the whole data set becomes one cluster.
2.1 Density Function
Given a data point x, the density function of x is defined as the
sum of the influence of all data points in the data space X. If we
divide the neighborhood of x into neighborhood rings, then
points within smaller rings have more influence on x than those
in bigger rings. We define the neighborhood ring as follows:
Definition 1. Neighborhood Ring of a data point c with radii r1 and r2 is defined as the set R(c, r1, r2) = {x ∈ X | r1 < |c − x| ≤ r2}, where |c − x| is the distance between x and c. The number of neighbors falling in R(c, r1, r2) is denoted as N = ||R(c, r1, r2)||.
Definition 2. Equal Interval Neighborhood Ring (EINring) of a data point c with radii r1 = kδ and r2 = (k+1)δ is defined as the kth neighborhood ring EINring(c, k, δ) = R(c, r1, r2) = R(c, kδ, (k+1)δ), where δ is a constant. The number of neighbors falling within the kth EINring is denoted as ||EINring(c, k, δ)||.
Let y be a data point within the kth EINring of x. The EINring-based influence function of y on x is defined as:

f(y, x) = fk(x) = 1/k,    k = 1, 2, ...        (1)

The density function of x is defined as the summation of the influence within every EINring neighborhood of x:

DF(x) = Σ_{k≥1} fk(x) · ||EINring(x, k, δ)||        (2)
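To make the density computation concrete, the following Python sketch (an illustration only, not the authors' implementation) counts the neighbors in each equal-interval ring and accumulates the influence 1/k as in Equations (1) and (2). The ring width delta and the truncation max_k are assumed parameters; rings beyond the data's diameter are empty, so the truncation is harmless in practice.

```python
import numpy as np

def einring_density(x, X, delta, max_k=50):
    """EINring-based density of point x w.r.t. data set X (Equations 1-2).

    The k-th ring contributes 1/k times the number of points whose
    distance from x falls in (k*delta, (k+1)*delta].
    """
    dists = np.linalg.norm(X - x, axis=1)   # distances |c - x| to all points
    density = 0.0
    for k in range(1, max_k + 1):
        # neighbors inside the k-th equal-interval neighborhood ring
        in_ring = np.sum((dists > k * delta) & (dists <= (k + 1) * delta))
        density += in_ring / k               # f_k(x) * ||EINring(x, k, delta)||
    return density
```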
2.2 Clustering by Local Attractor Trees
The basic idea of clustering by local attractor trees (CLA) is to partition the data set into clusters in terms of density attractor trees. Given a data point x, if we follow the steepest density-ascending path, the path will finally lead to a local density attractor. If x doesn't have such a path, it is either a local attractor or noise. All points whose steepest ascending paths lead to the same local attractor form a cluster. The resultant graph is a collection of local attractor trees, each with a local attractor as its root. The leaves are the boundary points of the clusters. An example of a dataset and its attractor trees is shown in Figure 2.
Figure 2. A dataset and the attractor trees
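As a rough sketch of the CLA step (again an illustration, not the authors' implementation), the routine below climbs from each point to its densest neighbor within an assumed neighborhood radius and records the parent links; points with no denser neighbor become local attractors (roots), and the parent links form the local attractor trees. It reuses the einring_density sketch above, and the radius parameter is a hypothetical choice for defining the neighborhood of the steepest-ascent step.

```python
import numpy as np

def build_attractor_trees(X, delta, radius):
    """Assign each point a parent on its steepest density-ascending path.

    Returns (parent, labels): parent[i] is the denser neighbor that point i
    climbs to (or i itself if i is a local attractor or isolated), and
    labels[i] is the local attractor (tree root) that point i is attracted to.
    """
    n = len(X)
    density = np.array([einring_density(x, X, delta) for x in X])
    parent = np.arange(n)
    for i in range(n):
        dists = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.where((dists > 0) & (dists <= radius))[0]
        if len(nbrs) == 0:
            continue                          # isolated point: attractor or noise
        j = nbrs[np.argmax(density[nbrs])]    # steepest ascent: densest neighbor
        if density[j] > density[i]:
            parent[i] = j                     # climb toward the denser neighbor
    labels = np.empty(n, dtype=int)
    for i in range(n):                        # follow parent links to the root
        r = i
        while parent[r] != r:
            r = parent[r]
        labels[i] = r
    return parent, labels
```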
2.3 Cluster Merging Process
We consider both relative connectivity and relative closeness, and define the similarity between clusters i and j as follows:

CS(i, j) = (hi / fi + hj / fj) · 1 / d(Ai, Aj)        (3)
where hi is the average height of the ith attractor tree; fi is the average fan-out of the ith attractor tree; and d(Ai, Aj) is the Euclidean distance between the two local attractors Ai and Aj. The calculations of hi and fi are discussed later.
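For illustration, CS(i, j) of Equation (3) can be computed directly from per-cluster summaries. The sketch below assumes a small hypothetical record holding each cluster's attractor coordinates, size, average tree height, and average fan-out; it is not part of the paper's implementation.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ClusterSummary:
    attractor: np.ndarray   # coordinates of the (virtual) attractor A_i
    size: int               # ||C_i||, number of points in the cluster
    height: float           # h_i, average height of the attractor tree
    fanout: float           # f_i, average fan-out of the attractor tree

def cluster_similarity(ci: ClusterSummary, cj: ClusterSummary) -> float:
    """CS(i, j) of Equation (3): relative connectivity over attractor distance."""
    dist = np.linalg.norm(ci.attractor - cj.attractor)   # d(A_i, A_j)
    return (ci.height / ci.fanout + cj.height / cj.fanout) / dist
```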
After the local attractor trees are built in the CLA process, the cluster merging process (MP) starts combining the most similar sub-cluster pairs level-by-level based on the similarity measure. When two clusters are merged, their two local attractor trees are combined into a new tree, called a virtual local attractor tree. It is called "virtual" because the new root is not a real attractor; it is only a virtual attractor that would attract all points of the two sub-trees. The cluster merging proceeds recursively by combining (virtual) attractor trees.
After merging, we need to compute the new virtual attractor Av, the average height hv, and the average fan-out fv of the new virtual attractor tree. Take two clusters Ci and Cj, for example, and assume the size of Cj is greater than or equal to that of Ci, i.e., ||Cj|| ≥ ||Ci||. We then have the following equations:
Avl = ||Cj|| (Ail + Ajl) / (||Ci|| + ||Cj||),    l = 1, 2, ..., d        (4)

hv = Max{hi, hj} + ||Cj|| · d(Ai, Aj) / (||Ci|| + ||Cj||)        (5)

fv = (||Ci|| · fi + ||Cj|| · fj) / (||Ci|| + ||Cj||)        (6)

where Ail is the lth attribute of the attractor Ai, ||Ci|| is the size of cluster Ci, d(Ai, Aj) is the distance between the two local attractors Ai and Aj, and hi and fi are the average height and the average fan-out of the ith attractor tree, respectively.
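The following is a sketch of the merge bookkeeping of Equations (4)-(6), reusing the ClusterSummary record from the similarity sketch above. It is an illustration under the reading of the equations given here, not the authors' code.

```python
import numpy as np

def merge_clusters(ci: ClusterSummary, cj: ClusterSummary) -> ClusterSummary:
    """Combine two attractor trees into a virtual one (Equations 4-6)."""
    if cj.size < ci.size:
        ci, cj = cj, ci                                   # enforce ||Cj|| >= ||Ci||, as in the text
    total = ci.size + cj.size
    dist = np.linalg.norm(ci.attractor - cj.attractor)    # d(A_i, A_j)
    a_v = cj.size * (ci.attractor + cj.attractor) / total       # Eq. (4), applied per attribute l
    h_v = max(ci.height, cj.height) + cj.size * dist / total    # Eq. (5)
    f_v = (ci.size * ci.fanout + cj.size * cj.fanout) / total   # Eq. (6)
    return ClusterSummary(attractor=a_v, size=total, height=h_v, fanout=f_v)
```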
3. PERFORMANCE STUDY
We used three microarray expression datasets, DS1, DS2, and DS3, from [1,2,3]. DS1 contains expression levels of 8,613 human genes measured at 12 time points. DS2 is a gene expression matrix of 6221 × 80. DS3 is the largest dataset, with 13,413 genes under 36 experimental conditions. The total run times of the different algorithms on DS1, DS2, and DS3 are shown in Figure 3. Note that our approach outperformed k-means, BIRCH, and CAST substantially when the dataset is large. In particular, our approach ran almost 3 times faster than k-means and CAST on DS3.

Figure 3. Run time comparisons

The clustering results are evaluated by means of Hubert's Γ statistic [3]. The accuracy experiments show that both CAMP and CAST are more accurate than the other two algorithms. However, CAST is neither efficient nor scalable.

4. CONCLUSION
CAMP combines the features of both the density-based clustering approach and the similarity-based clustering approach. This combination accommodates clusters with diverse shapes, densities, and sizes, and is capable of dealing with noise. Experiments on standard gene expression datasets demonstrated that our approach is efficient and scalable, with competitive accuracy.

5. REFERENCES
1. Ben-Dor, A., Shamir, R., and Yakhini, Z. "Clustering gene expression patterns," Journal of Computational Biology, Vol. 6, 1999, pp. 281-297.
2. Tavazoie, S., Hughes, J. D., et al. "Systematic determination of genetic network architecture," Nature Genetics, Vol. 22, 1999, pp. 281-285.
3. Tseng, V. S. and Kao, C. "An Efficient Approach to Identifying and Validating Clusters in Multivariate Datasets with Applications in Gene Expression Analysis," Journal of Information Science and Engineering, Vol. 20, No. 4, 2004, pp. 665-677.
4. Zhang, T., Ramakrishnan, R., and Livny, M. "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proceedings of the ACM SIGMOD Int'l Conf. on Management of Data, 1996.