

KNN-BLOCK DBSCAN: Fast Clustering
for Large-Scale Data
Yewang Chen, Lida Zhou, Songwen Pei, Senior Member, IEEE, Zhiwen Yu, Senior Member, IEEE,
Yi Chen, Xin Liu, Member, IEEE, Jixiang Du, and Naixue Xiong, Senior Member, IEEE
Abstract—Large-scale data clustering is an essential key to the big data problem. However, due to high complexity, no existing approach is "optimal" for big data, which remains a great challenge. In this article, a simple but fast approximate DBSCAN, namely, KNN-BLOCK DBSCAN, is proposed based on two findings: 1) the problem of identifying whether a point is a core point or not is, in fact, a kNN problem and 2) a point has a similar density distribution to its neighbors, and neighboring points are highly likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely, FLANN, to detect core blocks (CBs), noncore blocks, and noise blocks within which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to proper clusters is also invented to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and that it outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.
Manuscript received December 29, 2018; revised May 31, 2019 and July 25,
2019; accepted November 18, 2019. Date of publication December 18, 2019;
date of current version May 18, 2021. This work was supported in
part by the National Natural Science Foundation of China under Grant
61673186, Grant 61972010, Grant 61975124, Grant 61722205, Grant
61751205, Grant 61572199, and Grant U1611461, in part by the Funds
from State Key Laboratory of Computer Architecture, ICT, CAS under Grant
CARCH201807, in part by the Open Project of Provincial Key Laboratory
for Computer Information Processing Technology, Soochow University under
Grant KJS1839, in part by the Quanzhou City Science and Technology
Program of China under Grant 2018C114R, in part by the Open Project
of Beijing Key Laboratory of Big Data Technology for Food Safety under
Grant BTBD-2019KF06, in part by the Key Research and Development
Program of Guang Dong Province under Grant 2018B010107002, and in
part by the Grant from the Guang Dong Natural Science Funds under
Grant 2017A030312008. This article was recommended by Associate Editor
G. Nicosia. (Corresponding authors: Songwen Pei; Zhiwen Yu.)
Y. Chen is with the College of Computer Science and Technology, Huaqiao
University (Xiamen Campus), Xiamen 361021, China, also with the Beijing
Key Laboratory of Big Data Technology for Food Safety, Beijing Technology
and Business University, Beijing 100048, China, also with the Provincial
Key Laboratory for Computer Information Processing Technology, Soochow
University, Suzhou 215301, China, and also with the Fujian Key Laboratory
of Big Data Intelligence and Security, Huaqiao University (Xiamen Campus),
Xiamen 361021, China (e-mail: ywchen@hqu.edu.cn).
L. Zhou and X. Liu are with the College of Computer Science and
Technology, Huaqiao University, Quanzhou 362021, China.
S. Pei is with the Shanghai Key Laboratory of Modern Optical Systems,
University of Shanghai for Science and Technology, Shanghai 200093, China
(e-mail: swpei@usst.edu.cn).
Z. Yu is with the School of Computer Science and Engineering,
South China University of Technology, Guangzhou 510640, China (e-mail:
zhwyu@scut.edu.cn).
Y. Chen is with the Beijing Key Laboratory of Big Data Technology for
Food Safety, Beijing Technology and Business University, Beijing, China.
J. Du is with the College of Computer Science and Technology, Huaqiao
University, Quanzhou 362021, China, and also with the Fujian Key Laboratory
of Big Data Intelligence and Security, Huaqiao University, Quanzhou 362021,
China.
N. Xiong is with the Department of Mathematics and Computer Science,
Northeastern State University, Tahlequah, OK 74464 USA.
Color versions of one or more figures in this article are available at
https://doi.org/10.1109/TSMC.2019.2956527.
Digital Object Identifier 10.1109/TSMC.2019.2956527
Index Terms—DBSCAN, FLANN, kNN, KNN-BLOCK DBSCAN.
I. INTRODUCTION
CLUSTERING analysis is the task of grouping objects
according to measured or perceived intrinsic characteristics or similarity, aiming to retrieve some natural groups from a
set of patterns or points. It is a fundamental technique in many
applications, such as data mining, pattern recognition, etc., and
many researchers believe that clustering is an essential key for
analyzing big data [1].
Currently, thousands of clustering algorithms have been proposed, for example, k-means [2], mean shift [3], DBSCAN [4], spectral clustering [5], [6], Dirichlet mixture models [7], [8], clustering based on supervised learning [9], and clustering by local cores [10], [11]. According to Jain et al. [12], these clustering approaches fall into different categories: centroid-based clustering, partitioning clustering, density-based clustering, etc.
The goal of density-based clustering is to identify dense regions with arbitrary shape, which can be measured by the
density of a given point. An identified cluster is usually a
region with high density, while outliers are regions with low
densities. Hence, density-based clustering is one of the most
popular paradigms. There are many algorithms of this kind,
such as DBSCAN [4], OPTICS [13], DPeak [14]–[16], mean
shift [3], DCore [11], etc., where DBSCAN [4] is the most
famous one and has been widely used.
Unfortunately, most existing clustering approaches do not work well for large-scale data, due to their high complexities. For example, the complexity of k-means is O(ktn), where t is the number of iterations, and DBSCAN runs in O(n²). In this article, a fast approximate algorithm, named KNN-BLOCK DBSCAN,¹ is proposed to speed up DBSCAN, which is able to deal with large-scale data. We also concentrate on comparing our algorithm with DBSCAN, ρ-approximate DBSCAN [17], and AnyDBC.
¹https://github.com/XFastDataLab/KNN-BLOCK-DBSCAN
TABLE I
DESCRIPTION OF MAIN VARIABLES AND SYMBOLS USED IN THIS ARTICLE
The main contributions of this article are listed as follows.
1) We find that the key problem in DBSCAN of identifying
the type of each point is a kNN problem in essence.
Therefore, many techniques of this field, such as
FLANN [18], kd-tree [19], cover tree [20], etc., can be
utilized.
2) Based on the general rule that a point has a similar density distribution to its neighbors, and that neighboring points are likely to be of the same type (core, border, or noise), a technique is proposed to identify blocks within which all points have the same type: CBs, noncore blocks, and noise blocks.
3) A fast algorithm is also invented for merging CBs and
assigning noncore points to corresponding clusters.
Before introducing the proposed algorithm, we would like to present the main variables and symbols used in this article as follows. Let P be a set of n points in D-dimensional space R^D; pi ∈ P be the ith point of P; dp,q (or dist(p, q)) be the distance between points p and q, where the distance can be the Euclidean or Chebyshev distance; ε be the scanning radius of DBSCAN; dp,(i) be the distance from p to its ith nearest neighbor; and p(i) be the ith nearest neighbor of p. More symbols are shown in Table I.
The remainder of this article is organized as follows. Section II introduces the related work of DBSCAN
and nearest neighbor query. Section III revisits FLANN,
DBSCAN, and ρ-approximate DBSCAN. Section IV
addresses the proposed method, KNN-BLOCK DBSCAN,
in detail, including basic ideas, processes, and algorithms.
Section V presents experiments and comparisons with ρ-approximate DBSCAN and other competitors on several data sets. Section VI gives the conclusion and our future work that could improve the proposed method.
II. RELATED WORK
A. Variants of DBSCAN
DBSCAN is designed to discover clusters of arbitrary shape. It needs two parameters: one is the scanning radius ε, and the other is MinPts, which is used as a density threshold for deciding whether a point is a core point or not.
If a tree-based spatial index is used, the average complexity was claimed to be reduced to O(n log(n)) [4]. However, this turns out to be a misclaim: as pointed out by Gunawan and de Berg [21], DBSCAN actually runs in O(n²) time, regardless of ε and MinPts. Unfortunately, this misclaim has been widely accepted as a building brick in many research papers and textbooks, e.g., [22]–[24]. Furthermore, DBSCAN is almost useless in high dimensions, due to the so-called "curse of dimensionality."
Mahran and Mahar [25] introduced an algorithm named GriDBSCAN to enhance the performance of DBSCAN by using grid partitioning and merging, yielding high performance with the advantage of a high degree of parallelism. But this technique is inappropriate for high-dimensional data because the effect of redundancy in this algorithm increases exponentially with dimension. Similarly, Gunawan and de Berg [21] proposed an algorithm named Fast-DBSCAN to improve DBSCAN for two-dimensional (2-D) data, which imposes a grid T on the 2-D space, where each cell of T has side length ε/√2. If a nonempty cell c contains at least MinPts points, then this cell is called a core cell, and all points in this cell are core points because the maximum distance within this cell is ε. Therefore, it is unnecessary to compute densities for each point in a core cell. Gan and Tao [17] proposed an algorithm named ρ-approximate DBSCAN, also based on the grid technique, for large data sets, and achieved an excellent complexity of O(n) in low dimensions. But it degenerates to an O(n²) algorithm in high or even relatively high-dimensional data spaces. Besides, parallel GridDBSCAN [26] and GMDBSCAN [27] are also grid-based DBSCAN variants.
AnyDBC [28] compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. To speed up the range query process, it uses kd-trees [14] for indexing data, and performs substantially fewer range queries compared to DBSCAN while still guaranteeing the exact final result of DBSCAN.
There are some other variants of DBSCAN as follows. IDBSCAN [29] is a sampling-based DBSCAN, which is able to handle large spatial databases with minimum I/O cost by incorporating a better sampling technique, and dramatically reduces the memory requirement for clustering. KIDBSCAN [30] presents a new technique based on the concept of IDBSCAN, in which k-means is used to find the high-density center points and then IDBSCAN is used to expand clusters from these high-density center points. Based on IDBSCAN, Quick IDBSCAN (QIDBSCAN) [31] uses four marked boundary objects (MBOs) to expand computing directly.
Moreover, because exact clustering is too costly, much interest has been generated over the past two decades in approximate methods, including our algorithm, to speed up the original DBSCAN. Here, the approximation means that the clustering result may differ from that of the original DBSCAN. For example, in the original DBSCAN, a data point p may be classified into one cluster, while in approximate
Algorithm 1 [18] SearchKmeansTree
1: Input: query point q; the K value of kNN; the maximum number of examined points L; k-means tree T;
2: count := 0;
3: PQ := empty priority queue
4: R := empty priority queue
5: curNode := T
6: TraverseKmeansTree(curNode, PQ, R, count, q)
7: while PQ <> NULL and count < L do
8:   curNode := top of PQ
9:   TraverseKmeansTree(curNode, PQ, R, count, q)
10: end while
11: Return K top points from R
DBSCAN, it may be assigned to another cluster. A scalable RNN-DBSCAN [32] solution was investigated to improve DBSCAN by using an approximate kNN algorithm. NG-DBSCAN [33] is an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of this algorithm makes it scalable to very large data sets; its approximate nature makes it fast, yet capable of producing high-quality clustering results.
B. Nearest Neighbor Searching Algorithms
In the past few decades, many researchers have produced a large amount of fruitful research in the field of nearest neighbor query, and many techniques have been proposed and successfully applied to accelerate the process of searching neighbors, for example, partition trees (kd-tree [34], [35], the semi-convex hull tree [36]) and hashing techniques, as well as ANN methods based on the trinary-projection tree [37].
Because exact search is time-consuming for many applications, the approximate nearest neighbor query is an option in some cases: it returns nonoptimal results, but runs much faster. For example, FLANN [18], [38] uses the priority search k-means tree or the multiple randomized kd-tree forest [39], which can give the best performance on data spaces of a wide range of dimensionalities. In this article, we mainly use it to improve the performance of DBSCAN.
III. FLANN, ρ-APPROXIMATE DBSCAN REVISITED
FLANN: In this article, we use FLANN with the priority search k-means tree to perform the nearest neighbor query. The priority search k-means tree is constructed by k-means (see [18, Algorithm 1]), which recursively partitions the data points at each level into χ distinct regions until the total number of points in a region is less than χ, where χ is called the branching factor (default value 512). In [18] this parameter is denoted as K; we use the character χ instead to distinguish it from the K value of kNN.
As Algorithm 1 shows, given a query point q, the priority search k-means tree is searched by the following steps.
Algorithm 2 [18] TraverseKmeansTree
1: Input: current node curNode; priority queue PQ; priority queue R; count; query point q
2: if curNode is leaf then
3:   search all points in curNode and add them to R
4:   count := count + |curNode|
5: else
6:   subNodes := sub nodes of curNode
7:   nearestSubNode := nearest node of subNodes to q
8:   subNodes := subNodes - nearestSubNode
9:   PQ := PQ + subNodes
10:  TraverseKmeansTree(nearestSubNode, PQ, R, count, q)
11: end if
Fig. 1. Example of core cells. Core cells are shown in gray, and each point
in the core cell is a core point [17].
1) Initially, traverse the tree from the root to the leaf node nearest to q by always following the nonleaf node with the closest cluster center to q, and add all unexplored branches along the path to a priority queue (PQ; lines 7–9 in Algorithm 2), which is sorted in increasing distance from q to the boundary of the branch being added to the queue.
2) Restart the traversal from the top branch in the queue (line 10 in Algorithm 2).
Let I be the maximum number of k-means iterations, and L be the number of points examined by FLANN. The height of the tree is about log(n)/log(χ) if the tree is balanced. During each traversal from top to bottom, about log(n)/log(χ) inner nodes and one leaf node are checked. Thus, the complexity of FLANN is about O(L D log(n)/log(χ)), where L is the number of examined points.
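As a rough illustration of how such a query is issued in practice, the C++ fragment below sketches a kNN query against a FLANN index built with the priority search k-means tree. It is an assumption-laden sketch based on FLANN's public C++ interface (flann::Index, KMeansIndexParams, SearchParams, knnSearch); exact types and parameter lists may differ across FLANN versions.

// Sketch only: based on FLANN's C++ interface; signatures may vary by version.
#include <flann/flann.hpp>
#include <vector>

// Building a priority search k-means tree over n points of dimension dim
// (chi is the branching factor, I the number of k-means iterations):
//   flann::Matrix<float> data(ptr, n, dim);
//   flann::Index<flann::L2<float>> index(data, flann::KMeansIndexParams(chi, I));
//   index.buildIndex();

// Query the K nearest neighbors of one point; L bounds the number of examined points.
std::vector<int> knn_query(flann::Index<flann::L2<float>>& index,
                           float* query_pt, int dim, int K, int L) {
    flann::Matrix<float> query(query_pt, 1, dim);
    std::vector<int> idx(K);
    std::vector<float> dst(K);
    flann::Matrix<int> indices(idx.data(), 1, K);
    flann::Matrix<float> dists(dst.data(), 1, K);
    index.knnSearch(query, indices, dists, K, flann::SearchParams(L));  // checks = L
    return idx;
}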
ρ-Approximate DBSCAN: For simplicity, the basic concepts and terms of DBSCAN [4] (e.g., core points, density-reachable, cluster, noise, etc.) are not presented here. Aiming to improve DBSCAN, the ρ-approximate algorithm imposes a simple quadtree-like hierarchical grid T on the D-dimensional space, and divides the data space into a set of nonempty cells. Each cell is a D-dimensional hyper-square with side length ε/√D. Fig. 1 shows an example in 2-D space. Then, it builds a graph G = (V, E) in which each vertex is a core cell and, given two different core cells c1 and c2:
1) if ∃p1 ∈ c1, p2 ∈ c2 such that dist(p1, p2) ≤ ε, there is an edge between c1 and c2;
2) if no point p1 ∈ c1 is within the (1 + ρ)ε-neighborhood of any point p2 ∈ c2, there is no edge between c1 and c2;
3) otherwise, either decision is acceptable (don't care).
Based on the graph G and the quadtree-like hierarchical grid, an approximate range counting algorithm is designed to solve the problem of DBSCAN.
IV. PROPOSED ALGORITHM
A. Analysis of the Drawbacks of DBSCAN
DBSCAN runs in O(n²), and most of its variants still do not work well for large-scale data. In order to find the underlying causes, we analyze the fundamental techniques used in traditional clustering approaches, and find the following significant deficiencies.
1) A brute force algorithm is used in the original DBSCAN to compute the density of an arbitrary data point, whose complexity is O(n). However, there are many redundancies. Suppose di,k and dj,k are already known, while di,j is unknown. If |di,k − dj,k| > ε or di,k + dj,k ≤ ε, then we can infer di,j > ε or di,j ≤ ε according to the triangle inequality, respectively. In such cases, the distance computation for di,j is unnecessary (see the sketch after this list).
2) In the case where the grid technique is used, the side length of each cell is fixed to ε/√D, which implies that it is almost useless in high dimensions [40].
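To make the pruning in item 1 concrete, the following minimal sketch (our own illustration, not the authors' code) decides the relation between di,j and ε from two cached distances whenever the triangle inequality allows it.

#include <cmath>

// Returns +1 if d_ij > eps can be inferred, 0 if d_ij <= eps can be inferred,
// and -1 if the bounds are inconclusive and d_ij must actually be computed.
int infer_relation(double d_ik, double d_jk, double eps) {
    if (std::fabs(d_ik - d_jk) > eps) return +1;  // lower bound: |d_ik - d_jk| <= d_ij
    if (d_ik + d_jk <= eps)           return 0;   // upper bound: d_ij <= d_ik + d_jk
    return -1;                                    // neither bound settles the comparison
}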
Fig. 2. Example of a CB. MinPts = 8, ε = eps, and there are eight red points within Nε/2(p); then all red points are core points.
Fig. 3. Example of an NCB. MinPts = 22, ε = eps, r > eps, and the total number of points within Nr(p) (the outer circle) is 21; then all red points are noncore points, because they are all within Nr−ε(p).
B. Basic Ideas
As mentioned above, DBSCAN cannot deal with large-scale
data due to its high complexity. According to our observation
and analysis on DBSCAN, there are two findings as follows.
1) The key problem of DBSCAN is to find core points,
which is a kNN problem in essence, because the density
defined in DBSCAN is the total number of points within
a specified neighborhood, and all neighbors of a core
point should be reported for merging.
2) Points p and q should have similar neighborhoods, provided p and q are close; the closer they are, the more similar their neighborhoods are. Thus, it is highly possible that a point has the same type as its neighbors.
Hence, it is reasonable to utilize the kNN technique to solve the problem of DBSCAN. Formally, let K = MinPts and p(1), . . . , p(K) be the first K nearest neighbors of p, where p(i) is the ith nearest neighbor (1 ≤ i ≤ K); then we have the following.
Theorem 1:
1) If dp,(K) ≤ ε, then p is a core point.
2) p is a noncore point if dp,(i) > ε, where 1 ≤ i ≤ K.
Proof: 1) Because dp,(K) ≤ ε, we have dp,(1) ≤ dp,(2) ≤ · · · ≤ dp,(K) ≤ ε, so |Nε(p)| ≥ K = MinPts, and p is a core point.
2) Because 1 ≤ i ≤ K and dp,(i) > ε, we have ε < dp,(i) ≤ dp,(K). Thus, |Nε(p)| < K = MinPts, i.e., p is a noncore point.
As a result of Theorem 1, we argue that the problem of
identifying whether a point is a core point or not is a kNN
problem.
Theorem 2: If dp,(K) ≤ ε/2, then p(1), p(2), . . . , p(K) are all core points.
Proof: Because dp,(K) ≤ ε/2 ≤ ε, according to the triangle inequality, we have ∀i, j ∈ [1, K], dist(p(i), p(j)) ≤ ε. Therefore, ∀i ∈ [1, K], we have |Nε(p(i))| ≥ K, i.e., p(1), p(2), . . . , p(K) are all core points.
Definition 1 (Core Block (CB)): Nξ(p) is a CB with respect to p and ξ, if every q ∈ Nξ(p) is a core point. It is denoted as CB(p, ξ), and p is called the center of CB(p, ξ).
As Fig. 2 shows, all red points are within Nε/2(p), and the total number of red points is 8, which is equal to MinPts; then, according to Theorem 2, all red points are core points. Therefore, Nε/2(p) is a CB.
Fig. 4. Example of an NOB. MinPts = 22, ε = eps, and r > 2ε; then all red points within the green circle are noise, because Nr−ε(p) is a noncore block, which implies there is no core point within the red circle.
Theorem 3: Let dp,(K) = r. 1) If r > ε, then ∀q ∈ Nr−ε(p) is a noncore point. 2) If r > 2ε, then ∀q ∈ Nr−2ε(p) is noise.
Proof:
1) Because dp,(K) = r > ε, ∀q ∈ Nr−ε(p) we have Nε(q) ⊆ Nr(p); therefore, |Nε(q)| < |Nr(p)| = MinPts. Thus, q is a noncore point.
2) Because dp,(K) = r > 2ε, ∀q ∈ Nr−2ε(p) we have Nε(q) ⊆ Nr−ε(p), and because Nr−ε(p) is a noncore block (NCB), there is no core point in Nε(q); then q is noise.
Definition 2 (Noncore Block (NCB)): Nξ(p) is an NCB with respect to p and ξ, if every q ∈ Nξ(p) is a noncore point. It is denoted as NCB(p, ξ), and p is called the center of NCB(p, ξ).
Definition 3 (Noise Block (NOB)): Nξ(p) is an NOB with respect to p and ξ, if every q ∈ Nξ(p) is noise. It is denoted as NOB(p, ξ), and p is called the center of NOB(p, ξ).
Obviously, an NOB is an NCB, but an NCB may not be an NOB; neither an NCB nor an NOB is a CB, and vice versa.
Fig. 3 addresses an example of Theorem 3 (1). Because MinPts = 22, ε = eps, and r > eps, it is impossible for any point within the blue circle to find enough neighbors within its ε-neighborhood (because the total number of points within Nr(p), i.e., the outer circle, is 21). Thus, all points within the blue circle are noncore points, i.e., Nr−ε(p) is an NCB.
Fig. 4 is another example to explain Theorem 3 (2). Because r > 2ε, all points within the green circle are noncore points, and it is also impossible for any point within the green circle to
Fig. 5. Framework of KNN-BLOCK DBSCAN. It uses FLANN to identify CBs, NCBs, and NOBs, then merges CBs, assigns points in NCBs to proper clusters, and discards noise.
Algorithm 3 KNN-BLOCK DBSCAN(P, ε, MinPts)
1: Input: P is the input data; [ε, MinPts];
2: Output: cluster id of each point;
3: Initialize core-blocks set CBs = {φ}
4: Initialize non-core-blocks set NCBs = {φ}
5: K := MinPts, cur_cid := 0 // current cluster id
6: for each unvisited point p ∈ P do
7:   {p(1), . . . , p(K)} := FLANN::kNN(p, P)
8:   ξ := dp,(K), Nξ(p) := {p(1), p(2), . . . , p(K)}
9:   if ξ ≤ ε then
10:    cur_cid := cur_cid + 1
11:    if ξ ≤ ε/2 then
12:      push Nξ(p) into CBs // a core block found
13:      ∀s ∈ Nξ(p) mark s as core-point and visited
14:    else
15:      push N0(p) into CBs // single core point
16:      mark p as core-point and visited
17:    end if
18:    curCorePts := core points already found in Nξ(p)
19:    exist_cids := clusters found in curCorePts
20:    merge exist_cids into cur_cid
21:    assign Nξ(p) to cluster cur_cid
22:  else if ε < ξ ≤ 2ε then
23:    push Nξ−ε(p) into NCBs
24:    mark all points within Nξ−ε(p) as visited
25:  else if ξ > 2ε then
26:    mark ∀q ∈ Nξ−2ε(p) as noise and visited
27:  end if
28: end for
29: CBCENT := extract all center points from CBs
30: Create an index tree by FLANN from CBCENT
31: MergeCoreBlocks(CBs, CBCENT, cbIDs, ε)
32: AssignNonCoreBlocks(NCBs, CBs, CBCENT, ε)
find any core point from which it is directly density-reachable, because Nr−ε(p) is a noncore block, which implies there is no core point within the red circle. Thus, points within Nr−2ε(p) are all outliers, i.e., Nr−2ε(p) is an NOB.
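The sketch below (our own illustration; the struct and function names are hypothetical, not taken from the released code) shows how one kNN query can be turned into a block following Theorems 1–3: the K = MinPts neighbors of p and the distance ξ = dp,(K) determine whether a CB, a single core point, an NCB, or an NOB is produced.

#include <vector>

enum class BlockType { CoreBlock, SingleCore, NonCoreBlock, NoiseBlock };

struct Block {
    int center;                // index of the query point p
    double radius;             // block radius (xi, 0, xi - eps, or xi - 2*eps)
    std::vector<int> members;  // indices of the kNN points covered by the block
    BlockType type;
};

// knn: the K = MinPts nearest neighbors of p; xi: distance to the K-th one.
Block classify(int p, const std::vector<int>& knn, double xi, double eps) {
    Block b{p, xi, knn, BlockType::SingleCore};
    if (xi <= eps / 2.0) {
        b.type = BlockType::CoreBlock;     // Theorem 2: all K neighbors are core points
    } else if (xi <= eps) {
        b.type = BlockType::SingleCore;    // Theorem 1(1): only p is known to be a core point
        b.members = {p};
        b.radius = 0.0;
    } else if (xi <= 2.0 * eps) {
        b.type = BlockType::NonCoreBlock;  // Theorem 3(1): members should be restricted
        b.radius = xi - eps;               // to those within xi - eps of p
    } else {
        b.type = BlockType::NoiseBlock;    // Theorem 3(2): members should be restricted
        b.radius = xi - 2.0 * eps;         // to those within xi - 2*eps of p
    }
    return b;
}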
Definition 4: A core block CB(p, ξ1) is density-reachable from another core block CB(q, ξ2), if ∃s ∈ CB(p, ξ1) and w ∈ CB(q, ξ2) such that s is density-reachable from w.
Definition 5: A point p is density-reachable from a core block CB(q, ξ), if ∃s ∈ CB(q, ξ) such that p is density-reachable from s.
Algorithm 4 MergeCoreBlocks(CBs, CBCENT, ε)
1: Input: CBs: core-blocks; CBCENT: core-block centers set; ε is the parameter of DBSCAN;
2: for each core-block CB(p, ξ1) do
3:   Neibs := FLANN::RangeSearch(p, 2ε, CBCENT)
4:   for each q ∈ Neibs do
5:     let CB(q, ξ2) be the core-block of q
6:     if p and q are in different clusters then
7:       if dp,q ≤ ξ1 + ξ2 + ε then
8:         BruteForceMerge(CB(p, ξ1), CB(q, ξ2))
9:       end if
10:    end if
11:  end for
12: end for
Algorithm 5 AssignNonCoreBlocks(NCBs, CBs, ε)
1: Input: NCBs: non-core-blocks; CBs: core blocks; ε is the parameter of DBSCAN;
2: for each non-core-block NCB(p, ξ1) do
3:   r := ξ1 + 1.5ε;
4:   Neibs := FLANN::RangeSearch(p, r, CBCENT)
5:   if ∃q ∈ Neibs s.t. dp,q ≤ (ε − ξ1) then
6:     merge NCB(p, ξ1) into the cluster of q
7:     process next non-core-block
8:   else
9:     for each unclassified o ∈ NCB(p, ξ1) do
10:      if ∃q ∈ Neibs s.t. dp,q ≤ (ε + ξ1 + ξ2) then
11:        if ∃s ∈ CB(q, ξ2) s.t. do,s ≤ ε then
12:          assign o to the cluster of q
13:          process next unclassified point o
14:        end if
15:      end if
16:    end for
17:  end if
18: end for
Comprehensively, based on the two findings mentioned above, the difference between this article and other variants of DBSCAN mainly lies in the following: 1) kNN is used, instead of a range query algorithm, to identify core points and noncore points by blocks (CBs, NCBs, and NOBs); 2) each block has a dynamic range, while the width of the grid used in ρ-approximate DBSCAN and Fast-DBSCAN is a constant; and 3) CBs can be processed in a simple way which is far more efficient than grids.
C. Algorithms
In this section, we outline the proposed method. The framework of KNN-BLOCK DBSCAN is shown in Fig. 5. First, it uses FLANN to identify CBs, NCBs, and NOBs. Second, for any pair of CBs, it merges them into the same cluster provided they are density-reachable from each other. Third, for each point p in NCBs, KNN-BLOCK DBSCAN may assign p to a cluster if there exists a core point from which it is density-reachable. The details are shown in Algorithms 3, 4, 5, and 6, respectively.
1) Types and Blocks Identification: As Algorithm 3 shows, for each unvisited point p in P, it uses FLANN::kNN to
Algorithm 6 BruteForceMerge(CB(p, ξ1), CB(q, ξ2))
1: Input: CB(p, ξ1): a core-block; CB(q, ξ2): another core-block;
2: Initialize two point sets O = {φ} and S = {φ}
3: for each point o in CB(q, ξ2) do
4:   push o to O if do,p < ε + ξ1
5: end for
6: for each point s in CB(p, ξ1) do
7:   push s to S if ds,q < ε + ξ2
8: end for
9: if ∃o ∈ O, s ∈ S, s.t. do,s ≤ ε then
10:   merge CB(p, ξ1) and CB(q, ξ2)
11: end if
Fig. 6. Three cases of two CBs. (a) Two CBs that can be merged directly. (b) A case that can be skipped directly, for the two CBs are far from each other. (c) The third case, which needs to be checked in detail.
retrieve the first K (K = MinPts) nearest neighbors of p.
According to Theorem 1, the type of p can be identified.
If p is a core point, we may find a core block according
to Theorem 2 (lines 11–13). If p is not a core point, we may
find an NCB (lines 22–24) or noise block (lines 25 and 26)
according to Theorem 3.
2) Blocks Merging: Let CB(p, ξ1 ) and CB(q, ξ2 ) be two
CBs, there are three cases as described below.
Case 1 (dp,q ≤ ): As image (a) in Fig. 6 shows, because p
is directly density-reachable from q, both CBs can be merged
into a same cluster directly.
As shown in lines 20 and 21 of Algorithm 3, suppose CB(p, ξ1) is a newly identified CB; if some points within CB(p, ξ1) have already been assigned to other clusters, then these clusters can be merged directly.
Case 2 (dp,q > ε + ξ1 + ξ2): As illustrated in Fig. 6(b), the two CBs are far away from each other and there is no need to merge them, because, according to the triangle inequality, there is no point in CB(p, ξ1) that is density-reachable from another point in CB(q, ξ2).
Case 3 (ε < dp,q ≤ ξ1 + ξ2 + ε): As Fig. 6(c) addresses, CB(p, ξ1) and CB(q, ξ2) have no intersection, and they can be merged if there exists a pair of points (o1, o2) with dist(o1, o2) ≤ ε, o1 ∈ CB(p, ξ1), and o2 ∈ CB(q, ξ2).
In order to detect this case effectively, a simple method is proposed, as Algorithm 6 illustrates. First, we select a point set O ⊆ CB(q, ξ2) such that ∀m ∈ O, dp,m ≤ ε + ξ1, and a point set S ⊆ CB(p, ξ1) such that ∀s ∈ S, dq,s ≤ ε + ξ2. Then, we simply utilize a brute force algorithm to check whether there exist two points o ∈ O, s ∈ S that are directly density-reachable from each other, and merge the two CBs if yes. As Fig. 7 shows, set O is within the right shadow region, while S is within the left shadow region. Only points in the two shadow regions are checked, instead of the whole two CBs.
Fig. 7. Example of case (3) for merging CBs. CB(p, ξ1) and CB(q, ξ2) are two CBs; only points in the two shadow regions can possibly be directly density-reachable from each other.
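For illustration, the following sketch (ours, with hypothetical names) combines the three cases and the shadow-region filter of Algorithm 6 into a single check; dist stands for the chosen metric.

#include <vector>

// Decide whether core blocks CB(p, xi1) and CB(q, xi2) should be merged.
// cb1/cb2 hold member indices, c1/c2 the center indices, pts the data set.
bool should_merge(const std::vector<int>& cb1, int c1, double xi1,
                  const std::vector<int>& cb2, int c2, double xi2,
                  double eps, const std::vector<std::vector<double>>& pts,
                  double (*dist)(const std::vector<double>&, const std::vector<double>&)) {
    double dpq = dist(pts[c1], pts[c2]);
    if (dpq <= eps) return true;              // Case 1: the centers are directly density-reachable
    if (dpq > eps + xi1 + xi2) return false;  // Case 2: too far, no pair of points can be within eps
    std::vector<int> S, O;                    // Case 3: brute force over the two shadow regions only
    for (int s : cb1) if (dist(pts[s], pts[c2]) < eps + xi2) S.push_back(s);
    for (int o : cb2) if (dist(pts[o], pts[c1]) < eps + xi1) O.push_back(o);
    for (int s : S)
        for (int o : O)
            if (dist(pts[s], pts[o]) <= eps) return true;
    return false;
}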
3) Borders Identification: Finally, given a CB CB(p, ξ1) and an NCB NCB(q, ξ2), Algorithm 5 (AssignNonCoreBlocks) is called to identify border points in NCB(q, ξ2) that are density-reachable from CB(p, ξ1). Similar to Fig. 6, there are also three cases, as described below.
Case 1 (dp,q > ε + ξ1 + ξ2): NCB(q, ξ2) is far from CB(p, ξ1); then it is unnecessary to consider them.
Case 2 (dp,q ≤ ε − ξ2): Because NCB(q, ξ2) is totally contained in Nε(p), all points within NCB(q, ξ2) are density-reachable from p. Therefore, all points in NCB(q, ξ2) are assigned to the cluster of p directly.
Case 3 (ε − ξ2 < dp,q ≤ ε + ξ1 + ξ2): It is necessary to check whether each point within NCB(q, ξ2) is density-reachable from p. Similar to Fig. 7, only points within the two shadow regions are checked.
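Analogously, the following sketch (again ours, with hypothetical names) mirrors the three cases for one noncore block NCB(q, ξ2) against one core block CB(p, ξ1), returning the NCB members that can be assigned to the cluster of CB(p, ξ1).

#include <vector>

// ncb/cb hold member indices, q/p the center indices, xi2/xi1 the block radii.
std::vector<int> reachable_members(const std::vector<int>& ncb, int q, double xi2,
                                   const std::vector<int>& cb,  int p, double xi1,
                                   double eps, const std::vector<std::vector<double>>& pts,
                                   double (*dist)(const std::vector<double>&, const std::vector<double>&)) {
    double dpq = dist(pts[p], pts[q]);
    if (dpq > eps + xi1 + xi2) return {};  // Case 1: the blocks are too far apart
    if (dpq <= eps - xi2) return ncb;      // Case 2: the whole NCB lies inside N_eps(p)
    std::vector<int> assigned;             // Case 3: test each NCB point against core points
    for (int o : ncb) {                    // inside the shadow region of CB(p, xi1)
        for (int s : cb) {
            if (dist(pts[s], pts[q]) < eps + xi2 && dist(pts[o], pts[s]) <= eps) {
                assigned.push_back(o);
                break;
            }
        }
    }
    return assigned;
}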
D. Complexity Analysis
Let n be the cardinality of the data set and b0 = b1 + b2 + b3 be the total number of all blocks, where b1, b2, and b3 are the total numbers of CBs, NCBs, and NOBs, respectively. On average, b0 = β(n/MinPts), where β is a factor depending on the distribution of the data, and b0 is usually far less than n provided [ε, MinPts] are well chosen (how to choose good parameters for DBSCAN is another big topic, such as OPTICS [13] and others [41]–[43], which is out of the scope of this article). The complexity of Algorithm 3 is analyzed as follows.
Space Complexity: As shown in the above algorithms, each block should be saved; thus, the space cost is about O(MinPts · b0) = O(βn).
Time Complexity:
1) From lines 6–29 of Algorithm 3, we can infer that FLANN::kNN will be called about b0 times. As we know, in the case of using the priority search k-means tree, FLANN::kNN runs in O(L D log(n)/log(χ)) expected time [18] for each query, where L is the number of data points examined by FLANN, D is the dimension, and χ is the branching factor of the tree used in FLANN. Thus, the complexity of finding blocks is about O(b0[L D log(n)/log(χ)]).
2) The complexity of creating a tree by FLANN from CBCENT is about O(b1 D log(b1)).
3) The complexity of Algorithm 4: There are two main parts as follows.
a) There are b1 CBs; for each CB, FLANN::RangeSearch is called to find its 2ε-neighbors from CBCENT, and the complexity is about O(b1[L D log(b1)/log(χ)]).
b) For each CB, the total number of points in a CB is usually far less than n, i.e., MinPts << n; then the complexity of Algorithm 6 is, on average, about O(MinPts).
Hence, since MinPts << n can be regarded as a constant, the complexity of Algorithm 4 is about O(b1[L D log(b1)/log(χ)]).
4) The complexity of Algorithm 5: There are also two main parts as follows.
a) There are b2 NCBs. For each NCB, we call FLANN::RangeSearch to find its (ξ1 + 1.5ε)-neighbors from CBCENT; the complexity is about O(b2[L D log(b1)/log(χ)]).
b) The average complexity of assigning an unclassified point in NCBs to a cluster (from line 5 to line 17) is about O(MinPts[L D log(b1)/log(χ)]).
Hence, the complexity of Algorithm 5 is less than O(b2 MinPts[L D log(b1)/log(χ)]), and is about O(UCPtsNum[L D log(b1)/log(χ)]), where UCPtsNum is the total number of unclassified points in all NCBs.
As mentioned above, b0 = b1 + b2 + b3 = βn/MinPts is far less than n provided [ε, MinPts] are well chosen; then the overall time complexity is about O(b0[L D log(n)/log(χ)]) = O([βn/MinPts][L D log(n)/log(χ)]) < O(L D n log(n)/log(χ)).
In the case of dealing with very high-dimensional data sets, FLANN::kNN degenerates to an O(n) algorithm, and then the complexity of KNN-BLOCK DBSCAN is about O(b0[L D n/log(χ)]).
In the worst case, if there is no CB and FLANN::kNN runs in O(n), the complexity of KNN-BLOCK DBSCAN is O(n²).
V. EXPERIMENTS
A. Algorithms and Setup
In this section, to evaluate the correctness and effectiveness of the proposed approach, several experiments are conducted on different data sets on an Intel Core i7-3630 CPU @ 2.50 GHz with 8 GB RAM. We mainly compare the proposed algorithm with ρ-approximate DBSCAN, AnyDBC [28], and pure kNN-based DBSCAN.
1) "KNN-BLOCK" is KNN-BLOCK DBSCAN, which is coded in C++ and runs on the Windows 10 64-bit operating system; the tree used in FLANN is the priority search k-means tree, and the cluster number χ of k-means is 10.
2) Approx is ρ-approximate DBSCAN, which is also written in C++ and runs on the Linux (Ubuntu 14.04 LTS) operating system.
3) AnyDBC is the efficient anytime density-based clustering algorithm [28].
4) kNN-based DBSCAN is an algorithm that only uses the FLANN::kNN technique to accelerate DBSCAN, as shown in Algorithm 7, and its complexity is about O(L D n log(n)/log(χ)), where L is the number of data points examined by FLANN, D is the dimension, and χ is the branching factor of the tree used in FLANN.
Algorithm 7 Pure kNN-Based DBSCAN
1: Input: data set P, and ε, MinPts;
2: coreSet := {φ}
3: for each unclassified p ∈ P do
4:   neighbors := FLANN::kNN(p, MinPts);
5:   if dp,(MinPts) ≤ ε then
6:     push p into coreSet
7:   end if
8: end for
9: for each core point p ∈ coreSet do
10:   neibCores := find core points from the k-neighbors of p
11:   merge neibCores and p into one cluster
12: end for
13: for each pair of clusters c1 and c2 do
14:   merge c1 and c2 if ∃p1 ∈ c1 and p2 ∈ c2 s.t. p1 is density-reachable from p2
15: end for
16: find border points and assign them
B. Data Sets
Data sets come from UCI (https://archive.ics.uci.edu/ml/index.php), including PAM (PAMAP2), HOUSE (household), USCENCUS (USCensus 1990), gas sensor, FMA (a dataset for music analysis), AAS-1K (Amazon access samples), HIGGS, etc., where AAS-1K is a 1000-dimensional data set extracted from the 20000-dimensional data set AAS. For each data set, all duplicate points are removed to make each point unique, all missing values are set to 0, and each dimension of these data sets is normalized to [0, 10^5]. The following part of this section lists brief descriptions of these data sets.
PAM39D is a real 39-dimensional dataset, PAMAP2, with cardinality n = 3,850,505; PAM4D is a real dataset obtained by taking the first four principal components (PCA) of PAMAP2; Household: dim = 7, n = 2049280; USCENCUS: dim = 36, n = 365100; GasSensor (Ethylene-CO): dim = 16, n = 4208261; MoCap: dim = 36, n = 65536; APS (APS Failure at Scania Trucks): dim = 170, n = 30000; Font (CALIBRI): dim = 36, n = 19068; HIGGS: dim = 28, n = 11000000; FMA: dim = 512, n = 106574; AAS-1K: AAS is a large sparse data set, and AAS-1K is a subset extracted from AAS with dim = 1000, n = 30000.
C. Two Examples of Clustering
We benchmark KNN-BLOCK DBSCAN on two 2-D test cases to reveal the process in detail, as shown in Fig. 8. The left case is aggregation [44], and the right case comes from [1].
Specifically, Fig. 8(a) presents the original data distribution. Fig. 8(b) draws CBs, NCBs, and NOBs, plotted as black, green, and red circles, respectively. The radius of each circle is different, which means each block has a different size. We can also see that NCBs usually distribute along the border of CBs, and NOBs appear far from CBs. Fig. 8(c) illustrates the result of merging CBs, which is the most important step to identify clusters. In Fig. 8(d), as mentioned in Section IV-C3, there are three cases to process NCBs: the green circles represent
Fig. 8. Two examples presenting the process of KNN-BLOCK DBSCAN. (a) The original data distribution. (b) Three kinds of blocks found by KNN-BLOCK DBSCAN, where black circles are core blocks, green circles are NCBs, and red circles are NOBs. (c) Clusters found after merging CBs. (d) The assignment of NCBs to corresponding clusters; the red balls are NCBs that can be assigned to their nearest clusters, and the green circles in (d) are those that find no cluster to be assigned to. (e) The final result of KNN-BLOCK DBSCAN, where black points are noise. (f) The result of the original DBSCAN.
TABLE II
RUNTIME COMPARISONS ON SUBSETS OF HOUSE AND PAM WITH n INCREASING. THE SPEEDUP OF KNN-BLOCK DBSCAN OVER ITS COMPETITOR IS GIVEN IN BRACKETS. (UNIT: SECOND)
case (1): because they are far from all core points, all points within these NCBs are classified as noise; the red balls illustrate case (2): each of them is assigned to one cluster from which it is density-reachable; in case (3), for each point p within the unclassified NCBs, if a core point q from which p is density-reachable is identified, then p is classified to the cluster of q. Fig. 8(e) exhibits the final result of KNN-BLOCK DBSCAN, where black points are noise; and Fig. 8(f) shows the result obtained by the original DBSCAN.
It is observed that KNN-BLOCK DBSCAN obtains nearly the same result as DBSCAN with high efficiency, because it processes data by blocks and thereby avoids a large number of redundant distance computations.
D. Runtime Comparisons With ρ-Approximate DBSCAN
The first experiment is conducted on a set of subsets of HOUSE and PAM4D to observe the complexities of the proposed algorithm and ρ-approximate DBSCAN with different [ε, MinPts]. Figs. 9 and 10 present the results of the two algorithms, and Table II reveals more details. We also conduct experiments on the whole data sets of HOUSE, PAM4D, KDD04, USCENCUS, REACTION, MOPCAP, BODONI, HIGGS, FMA, and AAS-1K, respectively, and Table III shows the comparison of runtime with different [ε, MinPts].
From the two figures and two tables, we can make the following observations.
1) Both algorithms prefer large ε and small MinPts. For example, on the data set HOUSEHOLD, both KNN-BLOCK DBSCAN and ρ-approximate DBSCAN run best when [ε, MinPts] = [5000, 100], and the worst case happens when [ε, MinPts] = [1000, 200]. On other data sets, things are similar, as shown in Table III.
2) Both algorithms run in linear expected time on low-dimensional data sets.
3) We can see that on the large-scale data sets PAM4D, HOUSEHOLD, and HIGGS, our algorithm
TABLE III
RUNTIME COMPARISONS ON DIFFERENT DATA SETS WITH DIFFERENT ε AND MinPts. THE SPEEDUP OF KNN-BLOCK DBSCAN OVER ITS COMPETITOR IS GIVEN IN BRACKETS. (UNIT: SECOND)
is much better: the speedup of KNN-BLOCK DBSCAN over its competitor is about 2.5–6 times on HOUSEHOLD, 1.4–3 times on PAM4D, and 16–17 times on HIGGS (28 dim). On other relatively high-dimensional data sets, e.g., MOPCAP (36 dim), APS (170 dim), BODONI (256 dim), FMA (512 dim), and AAS-1K (1000 dim), KNN-BLOCK DBSCAN still performs well, while ρ-approximate DBSCAN degenerates to an O(n²) algorithm, which conforms to our analysis in Section II. It is also notable that the performance of KNN-BLOCK DBSCAN drops with the dimension, e.g., the proposed algorithm spends much more time on HIGGS than on PAM4D, and ε should be relatively larger in high dimensions than in low dimensions.
From these experiments, we can see that KNN-BLOCK DBSCAN accelerates ρ-approximate DBSCAN greatly, and is promising for processing such large-scale data.
Fig. 9. Runtime comparisons on subsets of HOUSE with n increasing.
E. Runtime Comparisons With AnyDBC
To make comparisons with AnyDBC, we conduct experiments on the same two data sets, namely, GasSensor (Ethylene-CO) and PAM39D, as shown in Fig. 11 (the result of AnyDBC is obtained by running the binary program provided by the authors on our machine). It is observed
Fig. 10. Runtime comparisons on subsets of PAM4D with n increasing.
TABLE IV
RUNTIME COMPARISONS WITH PURE KNN-BASED DBSCAN
Fig. 11. Runtime comparisons with AnyDBC and ρ-approximate DBSCAN on GasSensor and PAM39D; MinPts is fixed to 50.
that KNN-BLOCK DBSCAN outperforms AnyDBC and ρ-approximate DBSCAN; especially on PAM39D, KNN-BLOCK DBSCAN runs far faster than AnyDBC.
F. Runtime Comparisons With Pure kNN-Based DBSCAN
In this part, KNN-BLOCK DBSCAN is compared with pure kNN-based DBSCAN on some data sets, and the results are shown in Table IV. From this table, we can see that KNN-BLOCK DBSCAN runs far faster than the pure kNN-based algorithm, and the speedup varies from 1.42 to 5.48. Clearly, in most cases, the speedup is larger than 2, which proves that the block technique plays an important role in our algorithm and greatly speeds up DBSCAN.
G. Effect of ε and MinPts
In this section, we check the effect of [ε, MinPts] on the proposed algorithm. PAM4D is used in this experiment, with cardinality 3,850,505 and dimension 4. Table V reveals the execution details of kNN, MergeCB (Algorithm 4), and AssignNCB (Algorithm 5), as well as the numbers of CBs, NCBs, and NOBs.
Fig. 12. Runtime distributions with the changing of ε and MinPts on PAM4D and HOUSEHOLD, respectively.
TABLE V
EXECUTION TIMES OF KNN, MERGECB, AND ASSIGNNCB, AS WELL AS BLOCKS FOUND ON PAM WITH DIFFERENT [ε, MinPts]
TABLE VI
TOTAL NUMBERS OF CBS, NCBS, AND NOBS FOUND ON DIFFERENT DATA SETS
TABLE VIII
COMPARISONS OF NMI FOR KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN (n = 5000)
TABLE VII
COMPARISONS OF Omega-index FOR KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN (n = 5000)
TABLE IX
EXAMPLE OF COMPUTING PRECISION FOR KNN-BLOCK DBSCAN BASED ON THREE MATCHED LABEL PAIRS: ("A1," "B2"), ("A2," "B1"), AND ("A3," "B4") FOUND BY KUHN–MUNKRES
As the two bold columns show, the number of kNN executions is the same as the number of blocks found by KNN-BLOCK DBSCAN. It is observed that: 1) the runtime and the number of kNN executions increase linearly with MinPts; 2) the number of MergeCB executions rapidly decreases with MinPts; and 3) the fewer CBs there are, the more NCBs and NOBs there are.
Fig. 12 also provides more details of the runtime distribution on PAM4D and HOUSEHOLD with the changing of ε and MinPts, respectively. Hence, we can infer that: 1) the complexity of KNN-BLOCK DBSCAN mainly depends on the number of kNN executions and 2) KNN-BLOCK DBSCAN prefers large ε and small MinPts, which yields fewer executions of kNN due to the larger number of CBs identified.
1) When MinPts is small and ε is large, most blocks will be identified as CBs, and their number is about n/MinPts. For example, as Table V shows, [7000, 100] runs fastest, followed by [5000, 100], then [3000, 100], and then [1000, 100].
2) When MinPts is large and ε is small, few CBs are found; thus kNN will be called more frequently, and the algorithm degenerates to an O(n²) algorithm in the worst case. As shown in Table III, when the parameters are [1000, 30000] and [3000, 30000], the runtime is much longer than the others.
TABLE X
ACCURACY, RECALL, AND F1-SCORE OF KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN ON SUBSETS OF HOUSE, PAM4D, MOPCAP, AND APS
H. Statistics of Three Kinds of Blocks
In this section, to observe the numbers of the three kinds of blocks with respect to different ε and MinPts, experiments are conducted on several whole data sets, including HOUSE, PAM4D, KDD04, USCENCUS, REACTION, MOPCAP, and BODONI.
Table VI exhibits statistics of CBs, NCBs, and NOBs with respect to different ε and MinPts on all these data sets. From this table, we can see that the total number of blocks, especially CBs, is far less than the cardinality n, which reveals that many distance computations are filtered out.
I. Omega-Index and NMI Evaluations
The Omega-index [45] and normalized mutual information (NMI) [46] are two well-known methods to evaluate clustering results; similar to [47], we use them to compare KNN-BLOCK DBSCAN and ρ-approximate DBSCAN. Because the complexities of the Omega-index and NMI are high (O(n²)), we only conduct experiments on subsets of HOUSE, PAM4D, MOPCAP, and APS with n = 5000.
In these experiments, we compute the Omega-index and NMI scores of both algorithms by comparing their results with those obtained from the original DBSCAN. As Tables VII and VIII show, the performances of both algorithms are similar, and the results are all close to 1, which indicates that both algorithms nearly agree with the original DBSCAN.
J. Accuracy of KNN-BLOCK DBSCAN
To evaluate the accuracy of KNN-BLOCK DBSCAN, some experiments are conducted based on the assumption that the clustering labels obtained by DBSCAN are the ground truth. The reasons are as follows.
1) This article is only motivated to accelerate DBSCAN, without concerning whether the clustering result is good or not, which is another topic beyond the scope of this article. It is expected that the clustering results should be the same as those of the original DBSCAN provided the parameters (ε, MinPts) are the same.
2) Both KNN-BLOCK DBSCAN and ρ-approximate DBSCAN are approximate algorithms; the more similar their clustering results are to those of the original DBSCAN, the better. Hence, we argue it is reasonable to use the clustering results of DBSCAN as the ground truth.
Specifically, the idea is that each data point belongs to a unique predefined cluster and its predicted cluster should correspond either to only one predefined cluster or to none [11]. Any pair of data points in the same predefined cluster is considered to be incorrectly clustered if the predicted cluster does not match the predefined cluster to which they belong, even if both points appear in the same predicted cluster. Therefore, we evaluate the precision of the two approaches as follows.
Step 1 (Clustering): Given a data set and [ε, MinPts], suppose Lab1 = {A1, A2, . . . , Ak} and Lab2 = {B1, B2, . . . , Bm} are the clustering labels obtained by DBSCAN and KNN-BLOCK DBSCAN, respectively.
Step 2 (Matching): It is well known that different clustering algorithms may yield different labels on the same data set. For example, cluster "A1" labeled by DBSCAN may be the same as "B2" obtained by KNN-BLOCK DBSCAN. Hence, it is reasonable to match the labels first, and use the matched labels to compute the accuracy. In this article, the Kuhn–Munkres algorithm [48] performs the task of maximum matching between two different cluster label sets, which has also been used in [11] and [49].
Step 3 (Computing Accuracy): Suppose there are three clusters with labels "A1," "A2," and "A3" obtained by DBSCAN on one data set, but KNN-BLOCK DBSCAN labels them with "B1," "B2," "B3," and "B4," and Kuhn–Munkres finds three matched pairs: ("A1," "B2"), ("A2," "B1"), and ("A3," "B4"). If the labels of a point p obtained by DBSCAN and KNN-BLOCK DBSCAN match, then the prediction of p is correct, e.g., ("A1" and "B2"); otherwise it is wrong, e.g., ("A1" and "B1"). Table IX shows more details. Suppose there are eight points in the data set; the second row lists the labels obtained by DBSCAN, and the third row is the clustering result of KNN-BLOCK DBSCAN. We can see that two cases are wrongly predicted because (A1, B4) and (A2, B3) are not matched pairs. Therefore, the total precision is (8 − 2)/8 = 75%.
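As a small sketch of Step 3 (ours; the names are illustrative), given the matched label pairs produced by the matching step, the precision is simply the fraction of points whose two labels form a matched pair.

#include <map>
#include <string>
#include <vector>

// labels1[i], labels2[i]: labels of point i from DBSCAN and KNN-BLOCK DBSCAN.
// matched: matched label pairs, e.g., {{"A1","B2"}, {"A2","B1"}, {"A3","B4"}}.
double precision(const std::vector<std::string>& labels1,
                 const std::vector<std::string>& labels2,
                 const std::map<std::string, std::string>& matched) {
    size_t correct = 0;
    for (size_t i = 0; i < labels1.size(); ++i) {
        auto it = matched.find(labels1[i]);
        if (it != matched.end() && it->second == labels2[i]) ++correct;  // matched pair
    }
    return static_cast<double>(correct) / labels1.size();
}

For the eight points of Table IX, this count yields (8 − 2)/8 = 75%, matching the example above.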
Because the original DBSCAN has high complexity, we only test on small data sets. Here, we extract four subsets from HOUSE, PAM4D, APS, and MOCAP, and use them as test cases. Also, because DBSCAN is nondeterministic (sensitive to iteration order), some border points may be assigned to different clusters according to the order in which they appear. Therefore, the accuracy is computed only by comparing core points.
Table X shows that both algorithms achieve high accuracy. On the low-dimensional data sets (HOUSE and PAM4D), the precision, recall, and F1-score of both approximate algorithms are about 98%–100%, and there is only a small drop on the high-dimensional data sets (MOCAP and APS), where they are about 94.5%–97.7%.
VI. CONCLUSION
DBSCAN runs in O(n²) expected time and is not suitable for large-scale data. ρ-approximate DBSCAN is designed to replace DBSCAN for big data; however, it can only work in very low dimensions. In this article, we analyze the underlying causes of why current approaches fail in clustering large-scale data, and find that the grid technique is nearly useless for high-dimensional data.
Aiming to tame the problems mentioned above, an approximate approach named KNN-BLOCK DBSCAN is proposed for large-scale data based on two findings: 1) the key problem of DBSCAN, finding core points, is a kNN problem in essence and 2) a point has a similar density distribution to its neighbors, which implies it is highly possible that a point has the same type (core/border/noise) as its neighbors.
Therefore, we argue that the kNN technique, e.g., FLANN, can be utilized to identify CBs, NCBs, and NOBs, which include only core points, noncore points, and noise, respectively. Then, we proposed an algorithm to merge CBs that are density-reachable from each other and assign each point in NCBs to a proper cluster.
The superiority of KNN-BLOCK DBSCAN over ρ-approximate DBSCAN is that it processes data by blocks, each of which has a dynamic range, instead of the grids used in ρ-approximate DBSCAN, which have a fixed width, and a fast kNN technique is used to identify the types of points. Given a fixed intrinsic dimensionality, the complexity of the proposed algorithm is about O([βn/MinPts][L D log(n)/log(χ)]), where L is a constant, D is the dimension, β is a factor of the data distribution, and χ is the branching factor of the tree used in FLANN.
Experiments show that KNN-BLOCK DBSCAN runs faster than ρ-approximate DBSCAN and pure kNN-based DBSCAN with high accuracy, even on some relatively high-dimensional data sets, e.g., APS (170 dim), BODONI (256 dim), FMA (512 dim), and AAS-1K (1000 dim), where ρ-approximate DBSCAN degenerates to an O(n²) algorithm while KNN-BLOCK DBSCAN can still run very fast.
Our future work is to improve the proposed algorithm and apply it in real applications in the following aspects.
1) Try to use other exact kNN techniques, such as the cover tree and the semi-convex hull tree [36], to improve the accuracy of KNN-BLOCK DBSCAN.
2) Parallelize KNN-BLOCK DBSCAN on GPUs with a highly efficient strategy for scheduling data to make the proposed algorithm faster.
3) Apply it in our other research, such as image retrieval [50], vehicle reidentification [51], [52], vehicle crushing analysis [53], and auditing for shared cloud data [54]–[56].
REFERENCES
[1] A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern
Recognit. Lett., vol. 31, no. 8, pp. 651–666, 2010.
[2] A. Likas, N. Vlassis, and J. J. Verbeek, “The global k-means clustering
algorithm,” Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003.
[3] Y. Cheng, “Mean shift, mode seeking, and clustering,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 17, no. 8, pp. 790–799,
Aug. 1995.
[4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm
for discovering clusters in large spatial databases with noise,” in Proc.
KDD, vol. 96, 1996, pp. 226–231.
[5] U. Von Luxburg, “A tutorial on spectral clustering,” Stat. Comput.,
vol. 17, no. 4, pp. 395–416, 2007.
[6] H. Chang and D.-Y. Yeung, “Robust path-based spectral clustering,”
Pattern Recognit., vol. 41, no. 1, pp. 191–203, 2008.
[7] W. Fan, H. Sallay, and N. Bouguila, “Online learning of hierarchical
Pitman–Yor process mixture of generalized Dirichlet distributions with
feature selection,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 9,
pp. 2048–2061, Sep. 2017.
[8] W. Fan, N. Bouguila, J. Du, and X. Liu, “Axially symmetric data clustering through Dirichlet process mixture models of Watson distributions,”
IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1683–1694,
Jun. 2019.
[9] L. Duan, S. Cui, Y. Qiao, and B. Yuan, “Clustering based on supervised learning of exemplar discriminative information,” IEEE Trans.
Syst., Man, Cybern., Syst., to be published.
[10] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and L. Yang, “A novel cluster
validity index based on local cores,” IEEE Trans. Neural Netw. Learn.
Syst., vol. 30, no. 4, pp. 985–999, Apr. 2019.
[11] Y. Chen et al., “Decentralized clustering by finding loose and distributed
density cores,” Inf. Sci., vols. 433–434, pp. 649–660, Apr. 2018.
[12] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,”
ACM Comput. Surveys, vol. 31, no. 3, pp. 264–323, 1999.
[13] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics:
Ordering points to identify the clustering structure,” in Proc. ACM
SIGMOD Rec., vol. 28, 1999, pp. 49–60.
[14] A. Rodriguez and A. Laio, “Clustering by fast search and find of density
peaks,” Science, vol. 344, no. 6191, pp. 1492–1496, 2014.
[15] Y. Chen et al., “Fast density peak clustering for large scale data based
on KNN,” Knowl. Based Syst., vol. 187, Jan. 2020, Art. no. 104824.
[16] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and Y. Lijun, “Clustering with
local density peaks-based minimum spanning tree,” IEEE Trans. Knowl.
Data Eng., to be published.
[17] J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, and
approximation,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2015,
pp. 519–530.
[18] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for
high dimensional data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36,
no. 11, pp. 2227–2240, Nov. 2014.
[19] J. L. Bentley, “Multidimensional binary search trees used for associative
searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975.
[20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest
neighbor,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 97–104.
[21] A. Gunawan and M. de Berg, “A faster algorithm for DBSCAN,” Ph.D.
dissertation, Dept. Math. Comput. Sci., Univ. Eindhoven, Eindhoven,
The Netherlands, 2013.
[22] V. Chaoji, M. Al Hasan, S. Salem, and M. J. Zaki, “SPARCL: Efficient
and effective shape-based clustering,” in Proc. 8th IEEE Int. Conf. Data
Min., 2008, pp. 93–102.
[23] E. H.-C. Lu, V. S. Tseng, and P. S. Yu, “Mining cluster-based temporal mobile sequential patterns in location-based service environments,”
IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 914–927, Jun. 2011.
[24] S. K. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining.
Boston, MA, USA: CRC Press, 2004.
[25] S. Mahran and K. Mahar, “Using grid for accelerating density-based
clustering,” in Proc. 8th IEEE Int. Conf. Comput. Inf. Technol. (CIT),
2008, pp. 35–40.
[26] K. Sonal, G. Poonam, S. Ankit, K. Dhruv, S. Balasubramaniam, and
N. Goyal, “Exact, fast and scalable parallel DBSCAN for commodity
platforms,” in Proc. 18th Int. Conf. Distrib. Comput. Netw., 2017, p. 14.
[27] X. Chen, Y. Min, Y. Zhao, and P. Wang, “GMDBSCAN: Multi-density
DBSCAN cluster based on grid,” in Proc. IEEE Int. Conf. e-Bus. Eng.,
2008, pp. 780–783.
[28] S. T. Mai, I. Assent, and M. Storgaard, “AnyDBC: An efficient anytime
density-based clustering algorithm for very large complex datasets,” in
Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2016,
pp. 1025–1034.
[29] B. Borah and D. K. Bhattacharyya, “An improved sampling-based
DBSCAN for large spatial databases,” in Proc. Int. Conf. Intell. Sens.
Inf. Process., 2004, pp. 92–96.
[30] C.-F. Tsai and C.-W. Liu, “KIDBSCAN: A new efficient data clustering algorithm,” in Proc. Artif. Intell. Soft Comput. (ICAISC), 2006,
pp. 702–711.
[31] C. Tsai and T. Huang, “QIDBSCAN: A quick density-based clustering technique,” in Proc. Int. Symp. Comput. Consum. Control, 2012,
pp. 638–641.
[32] A. Bryant and K. Cios, “RNN-DBSCAN: A density-based clustering
algorithm using reverse nearest neighbor density estimates,” IEEE Trans.
Knowl. Data Eng., vol. 30, no. 6, pp. 1109–1121, Jun. 2018.
[33] A. Lulli, M. Dell’Amico, P. Michiardi, and L. Ricci, “NG-DBSCAN:
Scalable density-based clustering for arbitrary data,” Proc. VLDB
Endow., vol. 10, no. 3, pp. 157–168, 2016.
[34] F. Gieseke, J. Heinermann, C. E. Oancea, and C. Igel, “Buffer KD trees:
Processing massive nearest neighbor queries on GPUs,” in Proc. ICML,
2014, pp. 172–180.
[35] Y. Chen, L. Zhou, Y. Tang, N. Bouguila, and H. Wang, “Fast neighbor
search by using revised k-d tree,” Inf. Sci., vol. 472, pp. 145–162, 2019.
[36] Y. Chen, L. Zhou, and N. Bouguila, “Semi-convex hull tree: Fast nearest
neighbor queries for large scale data on GPUs,” in Proc. IEEE Int. Conf.
Data Min., 2018, pp. 911–916.
[37] J. Wang et al., “Trinary-projection trees for approximate nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2,
pp. 388–403, Feb. 2014.
[38] M. Muja and D. G. Lowe, “Fast approximate nearest neighbors with
automatic algorithm configuration,” in Proc. Int. Conf. Comput. Vis.
Theory Appl. (VISSAPP), 2009, pp. 331–340.
[39] C. Silpa-Anan and R. Hartley, “Optimised KD-trees for fast image
descriptor matching,” in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit. (CVPR), 2008, pp. 1–8.
[40] Y. Chen, S. Tang, N. Bouguila, C. Wang, J. Du, and H. L. Li, “A fast
clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data,” Pattern Recognit., vol. 83,
pp. 375–387, Nov. 2018.
[41] A. Karami and R. Johansson, “Choosing DBSCAN parameters automatically using differential evolution,” Int. J. Comput. Appl., vol. 91, no. 7,
pp. 1–11, 2014.
[42] H. Zhou, P. Wang, and H. Li, “Research on adaptive parameters determination in DBSCAN algorithm,” J. Xi’an Univ. Technol., vol. 9, no. 7,
pp. 1967–1973, 2012.
[43] F. O. Ozkok and M. Celik, “A new approach to determine eps parameter
of DBSCAN algorithm,” Int. J. Intell. Syst. Appl. Eng., vol. 4, no. 5,
pp. 247–251, 2017.
[44] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” in
Proc. Int. Conf. Data Eng. (ICDE), 2005, pp. 341–352.
[45] L. M. Collins and C. W. Dent, “Omega: A general formulation of
the rand index of cluster recovery suitable for non-disjoint solutions,”
Multivariate Behav. Res., vol. 23, no. 2, pp. 231–242, 1988.
[46] A. Strehl and J. Ghosh, “Cluster ensembles: A knowledge reuse framework for combining partitionings,” in Proc. 18th Nat. Conf. Artif. Intell.,
2002, pp. 93–99.
[47] M. A. Patwary, D. Palsetia, A. Agrawal, W.-K. Liao, F. Manne, and
A. Choudhary, “Scalable parallel optics data clustering using graph algorithmic techniques,” in Proc. Int. Conf. High Perform. Comput. Netw.
Storage Anal. (SC), 2013, pp. 1–12.
[48] H. W. Kuhn, “The Hungarian method for the assignment problem,”
Naval Res. Logist. Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[49] Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, and N. Xiong, “DHeat: A
density heat-based algorithm for clustering with effective radius,” IEEE
Trans. Syst., Man, Cybern., Syst., vol. 48, no. 4, pp. 649–660, Apr. 2018.
[50] X. Liu, Z. Hu, H. Ling, and Y. Cheung, “MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval,”
IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[51] J. Hou, H. Zeng, L. Cai, J. Zhu, J. Chen, and K.-K. Ma,
“Multi-label learning with multi-label smoothing regularization
for vehicle re-identification,” Neurocomputing, vol. 345, pp. 15–22,
Jun. 2019.
[52] J. Zhu et al., “Vehicle re-identification using quadruple directional deep
learning features,” IEEE Trans. Intell. Transp. Syst., to be published.
[53] Y. Zhang, X. Xu, J. Wang, T. Chen, and C. H. Wang, “Crushing
analysis for novel bio-inspired hierarchical circular structures subjected to axial load,” Int. J. Mech. Sci., vol. 140, pp. 407–431,
May 2018.
[54] H. Tian, F. Nan, C.-C. Chang, Y. Huang, J. Lu, and Y. Du,
“Privacy-preserving public auditing for secure data storage in fog-to-cloud computing,” J. Netw. Comput. Appl., vol. 127, pp. 59–69,
Feb. 2019.
[55] H. Tian, F. Nan, H. Jiang, C.-C. Chang, J. Ning, and Y. Huang,
“Public auditing for shared cloud data with efficient and secure group
management,” Inf. Sci., vol. 472, pp. 107–125, Jan. 2019.
[56] H. Tian et al., “Public audit for operation behavior logs with error locating in cloud storage,” Soft Comput., vol. 23, no. 11, pp. 3779–3792,
Jun. 2019.
Yewang Chen received the B.S. degree in management of information systems from Huaqiao
University, Quanzhou, China, in 2001, and the
Ph.D. degree in software engineering from Fudan
University, Shanghai, China, in 2009.
He is currently an Associate Professor with
the School of Computer Science and Technology,
Huaqiao University, and the Fujian Key Laboratory
of Big Data Intelligence and Security, Huaqiao
University (Xiamen Campus), Xiamen, China. He
is also with the Beijing Key Laboratory of Big Data
Technology for Food Safety, Beijing Technology and Business University,
Beijing, China, and the Provincial Key Laboratory for Computer Information
Processing Technology, Soochow University, Suzhou, China. His current
research interests include machine learning and data mining.
Lida Zhou received the B.S. degree in computer
science from the College of Computer Science
and Technology, Central China Normal University,
Wuhan, China, in 2012. He is currently pursuing a postgraduate degree with the School of Computer Science and Technology, Huaqiao University (Xiamen Campus), Xiamen, China.
His current research interests include machine learning and pattern recognition.
Songwen Pei (SM’19) received the B.S. degree in
computer science from the National University of Defense Technology, Changsha, China, in 2003,
the M.S. degree in computer science from Guizhou
University, Guiyang, China, in 2006, and the Ph.D.
degree in computer science from Fudan University,
Shanghai, China, in 2009.
He is currently an Associate Professor with
the University of Shanghai for Science and
Technology, Shanghai. Since 2011, he has been a Guest Researcher with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He was a Research Scientist with the University of California at Irvine, Irvine, CA, USA, from 2013 to 2015, and with the Queensland University of Technology, Brisbane, QLD, Australia, in 2017. His research interests include heterogeneous multicore systems, cloud computing, and big data.
Dr. Pei is a board member of CCF-TCCET and CCF-TCARCH. He is a
member of ACM and CCF in China.
Zhiwen Yu (SM’14) received the Ph.D. degree
in computer science from the City University of
Hong Kong, Hong Kong, in 2008.
He is a Professor with the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. He has published more than 140 refereed journal and international conference papers, including 40 IEEE TRANSACTIONS papers. His research areas focus on data mining, machine learning, pattern recognition, and intelligent computing.
Prof. Yu is a Distinguished Member of the China Computer Federation and the Vice Chair of the ACM Guangzhou Chapter. He is a Senior Member of ACM.
Yi Chen received the Ph.D. degree in computer
science from the Beijing Institute of Technology,
Beijing, China, in 2002.
She is currently a Professor of computer science
with Beijing Technology and Business University,
Beijing, where she is the Director of Beijing
Key Laboratory of Big Data Technology for Food
Safety. Her research interests mainly focus on information visualization, visual analytics, and big data technology for food quality and safety, including high-dimensional, hierarchical, spatio-temporal, and graph data visual analytics.
Xin Liu (M’08) received the M.S. degree in applied
mathematics from Hubei University, Wuhan, China,
in 2009, and the Ph.D. degree in computer science
from Hong Kong Baptist University, Hong Kong, in
2013.
He was a Visiting Scholar with the Computer
and Information Sciences Department, Temple
University, Philadelphia, PA, USA, from 2017 to
2018. He is currently an Associate Professor with the
Department of Computer Science and Technology,
Huaqiao University, Quanzhou, China, and also with
the State Key Laboratory of Integrated Services Networks, Xidian University,
Xi’an, China. His present research interests include multimedia analysis,
computer vision, pattern recognition, and machine learning.
Jixiang Du received the B.Sc. and M.Sc. degrees
in vehicle engineering from the Hefei University
of Technology, Hefei, China, in September 1999
and July 2002, respectively, and the Ph.D. degree
in pattern recognition and intelligent system from
the University of Science and Technology of China,
Hefei, in December 2005.
He is currently a Professor with the College
of Computer Science and Technology, Huaqiao
University, Quanzhou, China.
Naixue Xiong (SM’12) received the first Ph.D.
degree in software engineering from Wuhan
University, Wuhan, China, in 2007, and the second
Ph.D. degree in dependable networks from the Japan
Advanced Institute of Science and Technology,
Nomi, Japan, in 2007.
He worked with Colorado Technical University, Colorado Springs, CO, USA, the Wentworth Institute of Technology, Boston, MA, USA, and Georgia State University, Atlanta, GA, USA, for many years. He
is currently a Professor with Northeastern State
University, Tahlequah, OK, USA. His research interests include cloud computing, security and dependability, parallel and distributed computing, networks,
and optimization theory.