IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS: SYSTEMS, VOL. 51, NO. 6, JUNE 2021

KNN-BLOCK DBSCAN: Fast Clustering for Large-Scale Data

Yewang Chen, Lida Zhou, Songwen Pei, Senior Member, IEEE, Zhiwen Yu, Senior Member, IEEE, Yi Chen, Xin Liu, Member, IEEE, Jixiang Du, and Naixue Xiong, Senior Member, IEEE

Abstract—Large-scale data clustering is essential for big data analysis. However, no existing approach is "optimal" for big data because of high complexity, which remains a great challenge. In this article, a simple but fast approximate DBSCAN, namely, KNN-BLOCK DBSCAN, is proposed based on two findings: 1) the problem of identifying whether a point is a core point or not is, in fact, a kNN problem and 2) a point has a density distribution similar to that of its neighbors, and neighboring points are very likely to be of the same type (core point, border point, or noise). KNN-BLOCK DBSCAN uses a fast approximate kNN algorithm, namely, FLANN, to detect core blocks (CBs), noncore blocks, and noise blocks within which all points have the same type; a fast algorithm for merging CBs and assigning noncore points to proper clusters is also devised to speed up the clustering process. The experimental results show that KNN-BLOCK DBSCAN is an effective approximate DBSCAN algorithm with high accuracy, and it outperforms other current variants of DBSCAN, including ρ-approximate DBSCAN and AnyDBC.

Manuscript received December 29, 2018; revised May 31, 2019 and July 25, 2019; accepted November 18, 2019. Date of publication December 18, 2019; date of current version May 18, 2021. This work was supported in part by the National Natural Science Foundation of China under Grant 61673186, Grant 61972010, Grant 61975124, Grant 61722205, Grant 61751205, Grant 61572199, and Grant U1611461; in part by the Funds from the State Key Laboratory of Computer Architecture, ICT, CAS under Grant CARCH201807; in part by the Open Project of the Provincial Key Laboratory for Computer Information Processing Technology, Soochow University under Grant KJS1839; in part by the Quanzhou City Science and Technology Program of China under Grant 2018C114R; in part by the Open Project of the Beijing Key Laboratory of Big Data Technology for Food Safety under Grant BTBD-2019KF06; in part by the Key Research and Development Program of Guangdong Province under Grant 2018B010107002; and in part by the Guangdong Natural Science Funds under Grant 2017A030312008. This article was recommended by Associate Editor G. Nicosia. (Corresponding authors: Songwen Pei; Zhiwen Yu.)

Y. Chen is with the College of Computer Science and Technology, Huaqiao University (Xiamen Campus), Xiamen 361021, China, also with the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing 100048, China, also with the Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou 215301, China, and also with the Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University (Xiamen Campus), Xiamen 361021, China (e-mail: ywchen@hqu.edu.cn).

L. Zhou and X. Liu are with the College of Computer Science and Technology, Huaqiao University, Quanzhou 362021, China.

S. Pei is with the Shanghai Key Laboratory of Modern Optical Systems, University of Shanghai for Science and Technology, Shanghai 200093, China (e-mail: swpei@usst.edu.cn).

Z. Yu is with the School of Computer Science and Engineering, South China University of Technology, Guangzhou 510640, China (e-mail: zhwyu@scut.edu.cn).

Y. Chen is with the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing, China.

J. Du is with the College of Computer Science and Technology, Huaqiao University, Quanzhou 362021, China, and also with the Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University, Quanzhou 362021, China.

N. Xiong is with the Department of Mathematics and Computer Science, Northeastern State University, Tahlequah, OK 74464 USA.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TSMC.2019.2956527.

Digital Object Identifier 10.1109/TSMC.2019.2956527
Index Terms—DBSCAN, FLANN, kNN, KNN-BLOCK DBSCAN.

I. INTRODUCTION

Clustering analysis is the task of grouping objects according to measured or perceived intrinsic characteristics or similarity, aiming to retrieve natural groups from a set of patterns or points. It is a fundamental technique in many applications, such as data mining and pattern recognition, and many researchers believe that clustering is an essential key for analyzing big data [1]. Thousands of clustering algorithms have been proposed, for example, k-means [2], mean shift [3], DBSCAN [4], spectral clustering [5], [6], Dirichlet mixture models [7], [8], clustering based on supervised learning [9], and clustering by local cores [10], [11]. According to Jain et al. [12], these clustering approaches fall into different categories: centroid-based clustering, partitioning clustering, density-based clustering, etc.

The goal of density-based clustering is to identify dense regions of arbitrary shape, where density is measured around a given point. An identified cluster is usually a high-density region, while outliers lie in low-density regions. Hence, density-based clustering is one of the most popular paradigms. There are many algorithms of this kind, such as DBSCAN [4], OPTICS [13], DPeak [14]–[16], mean shift [3], and DCore [11], among which DBSCAN [4] is the most famous and has been widely used.

Unfortunately, most existing clustering approaches do not work well for large-scale data because of their high complexities. For example, the complexity of k-means is O(ktn), where t is the number of iterations, and DBSCAN runs in O(n^2). In this article, a fast approximate algorithm named KNN-BLOCK DBSCAN (source code: https://github.com/XFastDataLab/KNN-BLOCK-DBSCAN) is proposed to speed up DBSCAN, and it is able to deal with large-scale data. We concentrate on comparing our algorithm with DBSCAN, ρ-approximate DBSCAN [17], and AnyDBC.

TABLE I DESCRIPTION OF MAIN VARIABLES AND SYMBOLS USED IN THIS ARTICLE

The main contributions of this article are listed as follows.
1) We find that the key problem in DBSCAN of identifying the type of each point is a kNN problem in essence.
Therefore, many techniques from this field, such as FLANN [18], the kd-tree [19], and the cover tree [20], can be utilized.
2) Based on the general rule that a point has a density distribution similar to that of its neighbors, and that neighboring points are likely to be of the same type (core, border, or noise), a technique is proposed to identify blocks within which all points have the same type, namely, CBs, noncore blocks, and noise blocks.
3) A fast algorithm is also devised for merging CBs and assigning noncore points to the corresponding clusters.

Before introducing the proposed algorithm, we present the main variables and symbols used in this article. Let P be a set of n points in D-dimensional space R^D; p_i ∈ P be the ith point of P; d_{p,q} (or dist(p, q)) be the distance between points p and q, where the distance can be the Euclidean or Chebychev distance; ε be the scanning radius of DBSCAN; d_{p,(i)} be the distance from p to its ith nearest neighbor; and p_(i) be the ith nearest neighbor of p. More symbols are shown in Table I.

The remainder of this article is organized as follows. Section II introduces related work on DBSCAN and nearest-neighbor query. Section III revisits FLANN, DBSCAN, and ρ-approximate DBSCAN. Section IV describes the proposed method, KNN-BLOCK DBSCAN, in detail, including the basic ideas, processes, and algorithms. Section V presents experiments and comparisons with ρ-approximate DBSCAN and other competitors on several data sets. Section VI gives the conclusion and outlines future work that could improve the proposed method.

II. RELATED WORK

A. Variants of DBSCAN

DBSCAN is designed to discover clusters of arbitrary shape. It needs two parameters: the scanning radius ε and MinPts, which is used as a density threshold for deciding whether a point is a core point or not. If a tree-based spatial index is used, the average complexity was claimed to be reduced to O(n log(n)) [4]. However, this turns out to be a misclaim: as pointed out by Gunawan and de Berg [21], DBSCAN actually runs in O(n^2) time, regardless of ε and MinPts. Unfortunately, this misclaim is widely accepted as a building brick in many research papers and textbooks, e.g., [22]–[24]. Furthermore, DBSCAN is almost useless in high dimensions because of the so-called "curse of dimensionality."

Mahran and Mahar [25] introduced an algorithm named GriDBSCAN to enhance the performance of DBSCAN by using grid partitioning and merging, yielding high performance with the advantage of a high degree of parallelism. However, this technique is inappropriate for high-dimensional data, because the effect of redundancy in the algorithm increases exponentially with the dimension. Similarly, Gunawan and de Berg [21] proposed an algorithm named Fast-DBSCAN to improve DBSCAN for two-dimensional (2-D) data, which imposes a grid T on the 2-D space, where each cell of T has side length ε/√2. If a nonempty cell c contains at least MinPts points, then this cell is called a core cell, and all points in it are core points, because the maximum distance between two points within the cell is ε. Therefore, it is unnecessary to compute the density of each point in a core cell. Gan and Tao [17] proposed an algorithm named ρ-approximate DBSCAN, also based on the grid technique, for large data sets, and achieved an excellent O(n) complexity in low dimensions. However, it degenerates to an O(n^2) algorithm in high-dimensional, or even moderately high-dimensional, data spaces. Besides, parallel GridDBSCAN [26] and GMDBSCAN [27] are also grid-based variants of DBSCAN.
AnyDBC [28] compresses the data into smaller density-connected subsets called primitive clusters and labels objects based on connected components of these primitive clusters to reduce the label propagation time. To speed up the range-query process, it uses kd-trees [19] for indexing the data and performs substantially fewer range queries than DBSCAN while still guaranteeing the exact final result of DBSCAN.

There are some other variants of DBSCAN as follows. IDBSCAN [29] is a sampling-based DBSCAN, which is able to handle large spatial databases with minimum I/O cost by incorporating a better sampling technique, and it dramatically reduces the memory requirement for clustering. KIDBSCAN [30] presents a new technique based on the concept of IDBSCAN, in which k-means is used to find the high-density center points and then IDBSCAN is used to expand clusters from these high-density center points. Based on IDBSCAN, Quick IDBSCAN (QIDBSCAN) [31] uses four marked boundary objects (MBOs) to expand clusters directly.

Moreover, because exact clustering is too costly, much interest has been generated in the past two decades in approximate methods, including our algorithm, to speed up the original DBSCAN. Here, approximation means that the clustering result may differ from that of the original DBSCAN. For example, in the original DBSCAN, a data point p may be classified into one cluster, while in an approximate DBSCAN, it may be assigned to another cluster. A scalable RNN-DBSCAN [32] solution was investigated to improve DBSCAN by using an approximate kNN algorithm. NG-DBSCAN [33] is an approximate density-based clustering algorithm that operates on arbitrary data and any symmetric distance measure. The distributed design of this algorithm makes it scalable to very large data sets; its approximate nature makes it fast, yet capable of producing high-quality clustering results.

B. Nearest Neighbors Searching Algorithms

In the past few decades, many researchers have carried out fruitful research in the field of nearest-neighbor query, and many techniques have been proposed and successfully applied to accelerate the search for neighbors, for example, partition trees (the kd-tree [34], [35] and the semi-convex hull tree [36]) and approximate nearest-neighbor methods based on trinary-projection trees [37]. Because exact search is too time-consuming for many applications, approximate nearest-neighbor query is an option in some cases: it may return nonoptimal results, but it runs much faster. For example, FLANN [18], [38] uses the priority search k-means tree or multiple randomized kd-trees [39], which give the best performance over a wide range of data dimensionalities. In this article, we mainly use it to improve the performance of DBSCAN.
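As a concrete illustration of how such a query is issued, the sketch below calls FLANN's priority search k-means tree from C++. It assumes FLANN's C++ interface (flann::Index, KMeansIndexParams, SearchParams), whose exact overloads vary between versions; the branching factor and the number of checks shown here are illustrative values, not the settings used later in this article.

    // Minimal sketch: kNN query with FLANN's priority search k-means tree.
    // FLANN 1.8-style C++ API assumed; error handling omitted.
    #include <flann/flann.hpp>
    #include <vector>

    // Returns, for each of the nq query points, the indices of its K nearest
    // neighbors among the n data points (row-major float arrays of width dim).
    std::vector<int> knn_with_flann(float* data, int n, float* queries, int nq,
                                    int dim, int K) {
        flann::Matrix<float> dataset(data, n, dim);
        flann::Matrix<float> query(queries, nq, dim);

        // Priority search k-means tree: branching factor (chi) and k-means iterations.
        flann::Index<flann::L2<float>> index(dataset,
                                             flann::KMeansIndexParams(32, 11));
        index.buildIndex();

        std::vector<int> idx(static_cast<size_t>(nq) * K);
        std::vector<float> dst(static_cast<size_t>(nq) * K);  // L2 yields squared distances
        flann::Matrix<int> indices(idx.data(), nq, K);
        flann::Matrix<float> dists(dst.data(), nq, K);

        // 'checks' bounds how many points are examined (the L of this article).
        index.knnSearch(query, indices, dists, K, flann::SearchParams(128));
        return idx;
    }

In KNN-BLOCK DBSCAN, K is set to MinPts, and the returned distance to the Kth neighbor is compared against ε, as formalized in Section IV.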
III. FLANN, ρ-APPROXIMATE DBSCAN REVISITED

FLANN: In this article, we use FLANN with the priority search k-means tree to perform the nearest-neighbor query. The priority k-means tree is constructed by k-means (see [18, Algorithm 1]), which recursively partitions the data points at each level into χ distinct regions until the total number of points in a region is less than χ. Here χ is called the branching factor (with default value 512); in [18] it is denoted by K, the cluster number of the k-means tree, but we write χ to distinguish it from the K value of kNN.

Algorithm 1 [18] SearchKmeansTree
1: Input: query point q; the K value of kNN; the maximum number of examined points L; k-means tree T;
2: count := 0
3: PQ := empty priority queue
4: R := empty priority queue
5: curNode := T
6: TraverseKmeansTree(curNode, PQ, R, count, q)
7: while PQ ≠ ∅ and count < L do
8:   curNode := top of PQ
9:   TraverseKmeansTree(curNode, PQ, R, count, q)
10: end while
11: Return the K top points from R

Algorithm 2 [18] TraverseKmeansTree
1: Input: current node curNode; priority queue PQ; priority queue R; count; query point q
2: if curNode is a leaf then
3:   search all points in curNode and add them to R
4:   count := count + |curNode|
5: else
6:   subNodes := sub nodes of curNode
7:   nearestSubNode := the node of subNodes nearest to q
8:   subNodes := subNodes − nearestSubNode
9:   PQ := PQ + subNodes
10:  TraverseKmeansTree(nearestSubNode, PQ, R, count, q)
11: end if

As Algorithm 1 shows, given a query point q, the priority k-means tree is searched by the following steps.
1) Initially traverse the tree from the root to the leaf node nearest to q, always following the nonleaf node whose cluster center is closest to q, and add all unexplored branches along the path to a priority queue (PQ) (lines 7–9 in Algorithm 2), which is sorted in increasing distance from q to the boundary of the branch being added to the queue.
2) Restart the traversal from the top branch in the queue (line 10 in Algorithm 2).

Let I be the maximum number of k-means iterations, and let L be the number of points examined by FLANN. The height of the tree is about log(n)/log(χ) if the tree is balanced. During each traversal from top to bottom, about O(log(n)/log(χ)) inner nodes and one leaf node are checked. Thus, the complexity of FLANN is about O(L·D·log(n)/log(χ)), where L is the number of examined points.

ρ-Approximate DBSCAN: For simplicity, the basic concepts and terms of DBSCAN [4] (e.g., core point, density-reachable, cluster, and noise) are not repeated here. Aiming to improve DBSCAN, the ρ-approximate algorithm imposes a simple quadtree-like hierarchical grid T on the D-dimensional space and divides the data space into a set of nonempty cells. Each cell is a D-dimensional hyper-square with side length ε/√D. Fig. 1 shows an example in 2-D space.

Fig. 1. Example of core cells. Core cells are shown in gray, and each point in a core cell is a core point [17].

Then, it builds a graph G = (V, E) by relaxing the edge condition: each vertex is a core cell, and, given two different core cells c1 and c2,
1) if ∃p1 ∈ c1, p2 ∈ c2 such that dist(p1, p2) ≤ ε, there is an edge between c1 and c2;
2) if no p1 ∈ c1 lies within the (1 + ρ)ε-neighborhood of any p2 ∈ c2, there is no edge between c1 and c2;
3) otherwise, either decision is acceptable (don't care).
Based on the graph G and the quadtree-like hierarchical grid, an approximate range-counting algorithm is designed to solve the problem of DBSCAN approximately.
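For concreteness, a minimal sketch of how such a grid can be imposed is given below: each point is mapped to the integer coordinates of the hyper-square of side ε/√D containing it, and points with the same coordinate vector share a cell. The function name and types are illustrative only, not taken from [17].

    // Cell coordinates of a point under a grid of side eps/sqrt(D).
    #include <cmath>
    #include <vector>

    std::vector<long long> cell_of(const std::vector<double>& x, double eps) {
        const double side = eps / std::sqrt(static_cast<double>(x.size()));
        std::vector<long long> cell(x.size());
        for (size_t j = 0; j < x.size(); ++j)
            cell[j] = static_cast<long long>(std::floor(x[j] / side));  // index along dimension j
        return cell;
    }
    // Since the cell diagonal equals eps, any cell holding at least MinPts
    // points contains only core points.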
IV. PROPOSED ALGORITHM

A. Drawbacks of DBSCAN

DBSCAN runs in O(n^2), and most of its variants still do not work well for large-scale data. In order to find the underlying causes, we analyzed the fundamental techniques used in traditional clustering approaches and found the following significant deficiencies.
1) The brute-force algorithm is used in the original DBSCAN to compute the density of an arbitrary data point, so each density computation costs O(n). However, there are many redundancies. Suppose d_{i,k} and d_{j,k} are already known, while d_{i,j} is unknown. If |d_{i,k} − d_{j,k}| > ε or d_{i,k} + d_{j,k} ≤ ε, then we can infer d_{i,j} > ε or d_{i,j} ≤ ε, respectively, by the triangle inequality, and the distance computation for d_{i,j} is unnecessary.
2) When the grid technique is used, the side length of each cell is fixed to ε/√D, which implies that it is almost useless in high dimensions [40].

B. Basic Ideas

As mentioned above, DBSCAN cannot deal with large-scale data because of its high complexity. According to our observation and analysis of DBSCAN, there are two findings.
1) The key problem of DBSCAN is to find core points, which is a kNN problem in essence, because the density defined in DBSCAN is the total number of points within a specified neighborhood, and all neighbors of a core point must be reported for merging.
2) Points p and q should have similar neighborhoods, provided p and q are close; the closer they are, the more similar their neighborhoods are. Thus, it is highly possible that a point has the same type as its neighbors.

Hence, it is reasonable to utilize the kNN technique to solve the problem of DBSCAN. Formally, let K = MinPts and p_(1), . . . , p_(K) be the first K nearest neighbors of p. Then we have the following.

Theorem 1: 1) If d_{p,(K)} ≤ ε, then p is a core point. 2) p is a noncore point if d_{p,(i)} > ε for some 1 ≤ i ≤ K.
Proof: 1) Because d_{p,(K)} ≤ ε, we have d_{p,(1)} ≤ d_{p,(2)} ≤ · · · ≤ d_{p,(K)} ≤ ε, so |N_ε(p)| ≥ K = MinPts and p is a core point. 2) Because 1 ≤ i ≤ K and d_{p,(i)} > ε, we have ε < d_{p,(i)} ≤ d_{p,(K)}. Thus, |N_ε(p)| < K = MinPts, i.e., p is a noncore point.

As a result of Theorem 1, we argue that the problem of identifying whether a point is a core point or not is a kNN problem.

Theorem 2: If d_{p,(K)} ≤ ε/2, then p_(1), p_(2), . . . , p_(K) are all core points.
Proof: Because d_{p,(K)} ≤ ε/2 ≤ ε, the triangle inequality gives dist(p_(i), p_(j)) ≤ ε for all i, j ∈ [1, K]. Therefore, |N_ε(p_(i))| ≥ K for all i ∈ [1, K], i.e., p_(1), p_(2), . . . , p_(K) are all core points.

Definition 1 (Core Block (CB)): N_ξ(p) is a CB with respect to p and ξ if every q ∈ N_ξ(p) is a core point. It is denoted CB(p, ξ), and p is called the center of CB(p, ξ).

Fig. 2. Example of a CB. MinPts = 8, ε = eps, and there are eight red points within N_{ε/2}(p); all red points are core points.

As Fig. 2 shows, all red points are within N_{ε/2}(p), and the total number of red points is 8, which equals MinPts; then, according to Theorem 2, all red points are core points. Therefore, N_{ε/2}(p) is a CB.

Fig. 3. Example of an NCB. MinPts = 22, ε = eps, r > eps, and the total number of points within N_r(p) (the outer circle) is 21; all red points are noncore points, because they are all within N_{r−ε}(p).

Fig. 4. Example of an NOB. MinPts = 22, ε = eps, and r > 2·eps; all red points within the green circle are noise, because N_{r−ε}(p) is a noncore block, which implies that there is no core point within the red circle.

Theorem 3: Let d_{p,(K)} = r. 1) If r > ε, then every q ∈ N_{r−ε}(p) is a noncore point. 2) If r > 2ε, then every q ∈ N_{r−2ε}(p) is noise.
Proof: 1) Because d_{p,(K)} = r > ε, for every q ∈ N_{r−ε}(p) we have N_ε(q) ⊆ N_r(p); therefore, |N_ε(q)| < |N_r(p)| = MinPts. Thus, q is a noncore point. 2) Because d_{p,(K)} = r > 2ε, for every q ∈ N_{r−2ε}(p) we have N_ε(q) ⊆ N_{r−ε}(p), and because N_{r−ε}(p) is a noncore block (NCB), there is no core point in N_ε(q); hence q is noise.
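The triangle-inequality steps behind Theorems 2 and 3 can be spelled out explicitly; a brief restatement in LaTeX notation, using the symbols defined above:

    % Theorem 2: if d_{p,(K)} \le \varepsilon/2, any two of the K neighbours are within \varepsilon of each other:
    d\bigl(p_{(i)},p_{(j)}\bigr) \le d\bigl(p_{(i)},p\bigr) + d\bigl(p,p_{(j)}\bigr)
      \le \tfrac{\varepsilon}{2} + \tfrac{\varepsilon}{2} = \varepsilon .

    % Theorem 3(1): if d_{p,(K)} = r > \varepsilon and q \in N_{r-\varepsilon}(p), then for every w \in N_{\varepsilon}(q)
    d(p,w) \le d(p,q) + d(q,w) \le (r-\varepsilon) + \varepsilon = r,
      \quad\text{hence } N_{\varepsilon}(q) \subseteq N_{r}(p).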
Definition 2 (Noncore Block (NCB)): N_ξ(p) is an NCB with respect to p and ξ if every q ∈ N_ξ(p) is a noncore point. It is denoted NCB(p, ξ), and p is called the center of NCB(p, ξ).

Definition 3 (Noise Block (NOB)): N_ξ(p) is an NOB with respect to p and ξ if every q ∈ N_ξ(p) is noise. It is denoted NOB(p, ξ), and p is called the center of NOB(p, ξ).

Obviously, an NOB is an NCB, but an NCB may not be an NOB; and neither an NCB nor an NOB is a CB, and vice versa.

Fig. 3 shows an example of Theorem 3 (1). Because MinPts = 22, ε = eps, and r > eps, it is impossible for any point within the blue circle to find enough neighbors within its ε-neighborhood (because the total number of points within N_r(p), i.e., the outer circle, is 21). Thus, all points within the blue circle are noncore points, i.e., N_{r−ε}(p) is an NCB.

Fig. 4 is another example, explaining Theorem 3 (2). Because r > 2ε, all points within the green circle are noncore points, and it is also impossible for any point q within the green circle to find a core point from which q is directly density-reachable, because N_{r−ε}(p) is a noncore block, which implies that there is no core point within the red circle. Thus, the points within N_{r−2ε}(p) are all outliers, i.e., N_{r−2ε}(p) is an NOB.

Definition 4: A core block CB(p, ξ1) is density-reachable from another core block CB(q, ξ2) if there exist s ∈ CB(p, ξ1) and w ∈ CB(q, ξ2) such that s is density-reachable from w.

Definition 5: A point p is density-reachable from a core block CB(q, ξ) if there exists s ∈ CB(q, ξ) such that p is density-reachable from s.

Fig. 5. Framework of KNN-BLOCK DBSCAN. It uses FLANN to identify CBs, NCBs, and NOBs, then merges CBs, assigns points in NCBs to proper clusters, and discards noise.

Algorithm 3 KNN-BLOCK DBSCAN(P, ε, MinPts)
1: Input: P is the input data; [ε, MinPts];
2: Output: the cluster id of each point;
3: Initialize the core-block set CBs = {φ}
4: Initialize the noncore-block set NCBs = {φ}
5: K := MinPts, cur_cid := 0 // current cluster id
6: for each unvisited point p ∈ P do
7:   {p_(1), . . . , p_(K)} := FLANN::kNN(p, P)
8:   ξ := d_{p,(K)}, N_ξ(p) := {p_(1), p_(2), . . . , p_(K)}
9:   if ξ ≤ ε then
10:    cur_cid := cur_cid + 1
11:    if ξ ≤ ε/2 then
12:      push N_ξ(p) into CBs // a core block is found
13:      mark every s ∈ N_ξ(p) as a core point and visited
14:    else
15:      push N_0(p) into CBs // a single core point
16:      mark p as a core point and visited
17:    end if
18:    curCorePts := the core points already found in N_ξ(p)
19:    exist_cids := the clusters found in curCorePts
20:    merge exist_cids into cur_cid
21:    assign N_ξ(p) to cluster cur_cid
22:  else if ε < ξ ≤ 2ε then
23:    push N_{ξ−ε}(p) into NCBs
24:    mark all points within N_{ξ−ε}(p) as visited
25:  else if ξ > 2ε then
26:    mark every q ∈ N_{ξ−2ε}(p) as noise and visited
27:  end if
28: end for
29: CBCENT := extract all center points from CBs
30: Create an index tree by FLANN from CBCENT
31: MergeCoreBlocks(CBs, CBCENT, cbIDs, ε)
32: AssignNonCoreBlocks(NCBs, CBs, CBCENT, ε)
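A self-contained sketch of the identification phase (lines 6–28 of Algorithm 3) is given below. To keep the example compilable on its own, a brute-force kNN stands in for FLANN, and the cluster bookkeeping of lines 18–21, as well as the later merging and assignment steps, is omitted; the type and function names are illustrative and not taken from the released code.

    // Identification of CBs, NCBs, and noise blocks following Theorems 1-3.
    // label: 0 = unvisited, 1 = core, 2 = noncore, 3 = noise. Assumes n > MinPts.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Block { int center; double radius; std::vector<int> members; };

    void identify_blocks(const std::vector<std::vector<double>>& P, double eps,
                         int MinPts, std::vector<Block>& CBs,
                         std::vector<Block>& NCBs, std::vector<int>& label) {
        const int n = static_cast<int>(P.size()), K = MinPts;
        label.assign(n, 0);
        auto dist = [&](int a, int b) {
            double s = 0;
            for (size_t j = 0; j < P[a].size(); ++j) { double d = P[a][j] - P[b][j]; s += d * d; }
            return std::sqrt(s);
        };
        for (int p = 0; p < n; ++p) {
            if (label[p] != 0) continue;                      // one kNN query per unvisited point
            std::vector<std::pair<double, int>> nn;           // brute-force kNN (FLANN in practice)
            for (int q = 0; q < n; ++q) if (q != p) nn.push_back({dist(p, q), q});
            std::partial_sort(nn.begin(), nn.begin() + K, nn.end());
            const double xi = nn[K - 1].first;                // distance to the K-th neighbor
            auto members_within = [&](double r) {
                std::vector<int> m{p};
                for (int i = 0; i < K; ++i) if (nn[i].first <= r) m.push_back(nn[i].second);
                return m;
            };
            if (xi <= eps / 2.0) {                            // Theorem 2: a core block of radius xi
                Block b{p, xi, members_within(xi)};
                for (int s : b.members) label[s] = 1;
                CBs.push_back(b);
            } else if (xi <= eps) {                           // Theorem 1: p alone is a core point
                CBs.push_back(Block{p, 0.0, {p}});
                label[p] = 1;
            } else if (xi <= 2.0 * eps) {                     // Theorem 3(1): noncore block of radius xi - eps
                Block b{p, xi - eps, members_within(xi - eps)};
                for (int s : b.members) label[s] = 2;
                NCBs.push_back(b);
            } else {                                          // Theorem 3(2): N_{xi-2eps}(p) is all noise
                for (int s : members_within(xi - 2.0 * eps)) label[s] = 3;
            }
        }
    }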
Comprehensively, based on the two findings mentioned above, the differences between this article and other variants of DBSCAN mainly lie in: 1) kNN, instead of a range-query algorithm, is used to identify core and noncore points by blocks (CBs, NCBs, and NOBs); 2) each block has a dynamic range, while the width of the grid used in ρ-approximate DBSCAN and Fast-DBSCAN is a constant; and 3) CBs can be processed in a simple way that is far more efficient than grids.

C. Algorithms

In this section, we outline the proposed method. The framework of KNN-BLOCK DBSCAN is shown in Fig. 5. First, it uses FLANN to identify CBs, NCBs, and NOBs. Second, for any pair of CBs, it merges them into the same cluster provided that they are density-reachable from each other. Third, for each point p in the NCBs, KNN-BLOCK DBSCAN may assign p to a cluster if there exists a core point from which it is density-reachable. The details are shown in Algorithms 3, 4, 5, and 6, respectively.

Algorithm 4 MergeCoreBlocks(CBs, CBCENT, ε)
1: Input: CBs: the core blocks; CBCENT: the set of core-block centers; ε: the parameter of DBSCAN;
2: for each core block CB(p, ξ1) do
3:   Neibs := FLANN::RangeSearch(p, 2ε, CBCENT)
4:   for each q ∈ Neibs do
5:     let CB(q, ξ2) be the core block of q
6:     if p and q are in different clusters then
7:       if d_{p,q} ≤ ξ1 + ξ2 + ε then
8:         BruteForceMerge(CB(p, ξ1), CB(q, ξ2))
9:       end if
10:    end if
11:  end for
12: end for

Algorithm 5 AssignNonCoreBlocks(NCBs, CBs, CBCENT, ε)
1: Input: NCBs: the noncore blocks; CBs: the core blocks; CBCENT: the set of core-block centers; ε: the parameter of DBSCAN;
2: for each noncore block NCB(p, ξ1) do
3:   r := ξ1 + 1.5ε
4:   Neibs := FLANN::RangeSearch(p, r, CBCENT)
5:   if ∃q ∈ Neibs s.t. d_{p,q} ≤ (ε − ξ1) then
6:     merge NCB(p, ξ1) into the cluster of q
7:     process the next noncore block
8:   else
9:     for each unclassified o ∈ NCB(p, ξ1) do
10:      if ∃q ∈ Neibs s.t. d_{p,q} ≤ (ε + ξ1 + ξ2) then
11:        if ∃s ∈ CB(q, ξ2) s.t. d_{o,s} ≤ ε then
12:          assign o to the cluster of q
13:          process the next unclassified point o
14:        end if
15:      end if
16:    end for
17:  end if
18: end for

Algorithm 6 BruteForceMerge(CB(p, ξ1), CB(q, ξ2))
1: Input: CB(p, ξ1): a core block; CB(q, ξ2): another core block;
2: Initialize two point sets O = {φ} and S = {φ}
3: for each point o in CB(q, ξ2) do
4:   push o into O if d_{o,p} < ε + ξ1
5: end for
6: for each point s in CB(p, ξ1) do
7:   push s into S if d_{s,q} < ε + ξ2
8: end for
9: if ∃o ∈ O, s ∈ S s.t. d_{o,s} ≤ ε then
10:  merge CB(p, ξ1) and CB(q, ξ2)
11: end if

Fig. 6. Three cases for two CBs. (a) The two CBs can be merged directly. (b) A case that can be skipped directly because the CBs are far from each other. (c) The third case, which needs to be checked in detail.

1) Types and Blocks Identification: As Algorithm 3 shows, for each unvisited point p in P, FLANN::kNN is used to retrieve the first K (K = MinPts) nearest neighbors of p. According to Theorem 1, the type of p can be identified. If p is a core point, we may find a core block according to Theorem 2 (lines 11–13). If p is not a core point, we may find an NCB (lines 22–24) or a noise block (lines 25 and 26) according to Theorem 3.

2) Blocks Merging: Let CB(p, ξ1) and CB(q, ξ2) be two CBs; there are three cases, as described below.
Case 1 (d_{p,q} ≤ ε): As Fig. 6(a) shows, because p is directly density-reachable from q, both CBs can be merged into the same cluster directly. As shown in lines 20 and 21 of Algorithm 3, suppose CB(p, ξ1) is a newly identified CB; if some points within CB(p, ξ1) have already been assigned to other clusters, then these clusters can be merged directly.

Case 2 (d_{p,q} > ε + ξ1 + ξ2): As illustrated in Fig. 6(b), the two CBs are far away from each other, and there is no need to merge them, because, by the triangle inequality, no point in CB(p, ξ1) is density-reachable from a point in CB(q, ξ2).

Case 3 (ε < d_{p,q} ≤ ξ1 + ξ2 + ε): As Fig. 6(c) shows, CB(p, ξ1) and CB(q, ξ2) have no intersection, but they can be merged if there exists a pair of points (o1, o2), o1 ∈ CB(p, ξ1) and o2 ∈ CB(q, ξ2), with dist(o1, o2) ≤ ε. To detect this case effectively, a simple method is used, as Algorithm 6 illustrates. First, we select a point set O ⊆ CB(q, ξ2) such that ∀o ∈ O, d_{p,o} ≤ ε + ξ1, and a point set S ⊆ CB(p, ξ1) such that ∀s ∈ S, d_{q,s} ≤ ε + ξ2. Then, we simply use a brute-force check to determine whether there exist two points o ∈ O, s ∈ S that are directly density-reachable from each other, and merge the two CBs if so.

Fig. 7. Example of case (3) for merging CBs. CB(p, ξ1) is one CB and CB(q, ξ2) is another; only points in the two shaded regions can possibly be directly density-reachable from each other.

As Fig. 7 shows, the set O lies within the right shaded region, while S lies within the left shaded region. Only points in the two shaded regions are checked, instead of the whole two CBs.

3) Borders Identification: Finally, given a CB CB(p, ξ1) and an NCB NCB(q, ξ2), Algorithm 5 (AssignNonCoreBlocks) is called to identify border points in NCB(q, ξ2) that are density-reachable from CB(p, ξ1). Similar to Fig. 6, there are again three cases.

Case 1 (d_{p,q} > ε + ξ1 + ξ2): NCB(q, ξ2) is far from CB(p, ξ1), so it is unnecessary to consider them together.

Case 2 (d_{p,q} ≤ ε − ξ2): Because NCB(q, ξ2) is totally contained in N_ε(p), all points within NCB(q, ξ2) are density-reachable from p. Therefore, all points in NCB(q, ξ2) are assigned to the cluster of p directly.

Case 3 (ε − ξ2 < d_{p,q} ≤ ε + ξ1 + ξ2): It is necessary to check whether each point within NCB(q, ξ2) is density-reachable from p. Similar to Fig. 7, only points within the two shaded regions are checked.
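The Case 3 check can be sketched as follows, mirroring Algorithm 6: candidate points are first pruned with the triangle inequality against the opposite block center, and only the surviving pairs are tested. The CoreBlock type and the function name are illustrative, not the released implementation.

    // Case-3 test: do two core blocks contain a directly density-reachable pair?
    #include <cmath>
    #include <vector>

    using Point = std::vector<double>;
    struct CoreBlock { Point center; double radius; std::vector<Point> members; };

    static double dist(const Point& a, const Point& b) {
        double s = 0;
        for (size_t j = 0; j < a.size(); ++j) { double d = a[j] - b[j]; s += d * d; }
        return std::sqrt(s);
    }

    bool should_merge(const CoreBlock& cb1, const CoreBlock& cb2, double eps) {
        std::vector<const Point*> O, S;
        for (const Point& o : cb2.members)          // keep only points of cb2 within eps + xi1 of cb1's center
            if (dist(o, cb1.center) < eps + cb1.radius) O.push_back(&o);
        for (const Point& s : cb1.members)          // keep only points of cb1 within eps + xi2 of cb2's center
            if (dist(s, cb2.center) < eps + cb2.radius) S.push_back(&s);
        for (const Point* o : O)
            for (const Point* s : S)
                if (dist(*o, *s) <= eps) return true;   // a pair within eps exists: merge the blocks
        return false;
    }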
D. Complexity Analysis

Let n be the cardinality of the data set, and let b0 = b1 + b2 + b3 be the total number of blocks, where b1, b2, and b3 are the total numbers of CBs, NCBs, and NOBs, respectively. On average, b0 = βn/MinPts, where β is a factor describing the distribution of the data, and b0 is usually far less than n provided [ε, MinPts] are well chosen (how to choose good parameters for DBSCAN is another big topic, e.g., OPTICS [13] and others [41]–[43], which is out of the scope of this article). The complexity of Algorithm 3 is analyzed as follows.

Space Complexity: As the above algorithms show, each block must be stored, so the space cost is about O(MinPts · b0) = O(βn).

Time Complexity:
1) From lines 6–28 of Algorithm 3, FLANN::kNN is called about b0 times. In the case of the priority search k-means tree, FLANN::kNN runs in O(L·D·log(n)/log(χ)) expected time per query [18], where L is the number of data points examined by FLANN, D is the dimension, and χ is the branching factor of the tree used in FLANN. Thus, the complexity of finding blocks is about O(b0·[L·D·log(n)/log(χ)]).
2) The complexity of creating an index tree by FLANN from CBCENT is about O(b1·D·log(b1)).
3) The complexity of Algorithm 4 has two main parts: a) there are b1 CBs, and for each CB FLANN::RangeSearch is called to find its 2ε-neighbors from CBCENT, so the complexity is about O(b1·[L·D·log(b1)/log(χ)]); and b) the total number of points in a CB is usually far less than n, i.e., MinPts << n, so the complexity of Algorithm 6 is about O(MinPts) on average. Hence, since MinPts << n can be regarded as a constant, the complexity of Algorithm 4 is about O(b1·[L·D·log(b1)/log(χ)]).
4) The complexity of Algorithm 5 also has two main parts: a) there are b2 NCBs, and for each NCB we call FLANN::RangeSearch to find its (ξ1 + 1.5ε)-neighbors from CBCENT, so the complexity is about O(b2·[L·D·log(b1)/log(χ)]); and b) the average complexity of assigning an unclassified point in an NCB to a cluster (lines 5–17) is about O(MinPts·[L·D·log(b1)/log(χ)]). Hence, the complexity of Algorithm 5 is about O(UCPtsNum·[L·D·log(b1)/log(χ)]) < O(b2·MinPts·[L·D·log(b1)/log(χ)]), where UCPtsNum is the total number of unclassified points in all NCBs.

As mentioned above, b0 = b1 + b2 + b3 = βn/MinPts is far less than n provided [ε, MinPts] are well chosen, so the overall time complexity is about O(b0·[L·D·log(n)/log(χ)]) = O([βn/MinPts]·[L·D·log(n)/log(χ)]) < O(L·D·n·log(n)/log(χ)). When dealing with very high-dimensional data sets, FLANN::kNN degenerates to an O(n) algorithm, and then the complexity of KNN-BLOCK DBSCAN is about O(b0·[L·D·n/log(χ)]). In the worst case, if there is no CB and FLANN::kNN runs in O(n), the complexity of KNN-BLOCK DBSCAN is O(n^2).

V. EXPERIMENTS

A. Algorithms and Set Up

In this section, to evaluate the correctness and effectiveness of the proposed approach, several experiments are conducted on different data sets on an Intel Core i7-3630 CPU @ 2.50 GHz with 8 GB RAM. We mainly compare the proposed algorithm with ρ-approximate DBSCAN, AnyDBC [28], and pure kNN-based DBSCAN.
1) "KNN-BLOCK" is KNN-BLOCK DBSCAN, which is coded in C++ and runs on the Windows 10 64-bit operating system; the tree used in FLANN is the priority search k-means tree, and the cluster number χ of k-means is 10.
2) Approx is ρ-approximate DBSCAN, which is also written in C++ and runs on the Linux (Ubuntu 14.04 LTS) operating system.
3) AnyDBC is the efficient anytime density-based clustering algorithm [28].
4) kNN-based DBSCAN is an algorithm that only uses the FLANN::kNN technique to accelerate DBSCAN, as shown in Algorithm 7; its complexity is about O(L·D·n·log(n)/log(χ)), where L is the number of data points examined by FLANN, D is the dimension, and χ is the branching factor of the tree used in FLANN.

Algorithm 7 Pure kNN-Based DBSCAN
1: Input: data set P, and ε, MinPts;
2: coreSet := {φ}
3: for each unclassified p ∈ P do
4:   neibors := FLANN::kNN(p, MinPts)
5:   if d_{p,(MinPts)} ≤ ε then
6:     push p into coreSet
7:   end if
8: end for
9: for each core point p ∈ coreSet do
10:  neibCores := find core points from the k-neighbors of p
11:  merge neibCores and p into one cluster
12: end for
13: for each pair of clusters c1 and c2 do
14:  merge c1 and c2 if ∃p1 ∈ c1 and p2 ∈ c2 s.t. p1 is density-reachable from p2
15: end for
16: find border points and assign them
B. Data Sets

The data sets come from UCI (https://archive.ics.uci.edu/ml/index.php), including PAM (PAMAP2), HOUSE (household), USCENCUS (USCensus 1990), gas sensor, FMA (dataset for music analysis), AAS-1K (Amazon access samples), HIGGS, etc., where AAS-1K is a 1000-dimensional data set extracted from the 20 000-dimensional data set AAS. For each data set, all duplicate points are removed to make each point unique, all missing values are set to 0, and each dimension is normalized to [0, 10^5]. Brief descriptions of these data sets follow. PAM39D is a real 39-dimensional data set, PAMAP2, with cardinality n = 3,850,505; PAM4D is a real data set obtained by taking the first four principal components (PCA) of PAMAP2; Household: dim = 7, n = 2,049,280; USCENCUS: dim = 36, n = 365,100; GasSensor (Ethylene-CO): dim = 16, n = 4,208,261; MoCap: dim = 36, n = 65,536; APS (APS Failure at Scania Trucks): dim = 170, n = 30,000; Font (CALIBRI): dim = 36, n = 19,068; HIGGS: dim = 28, n = 11,000,000; FMA: dim = 512, n = 106,574; AAS-1K: AAS is a large sparse data set, and AAS-1K is a subset extracted from AAS with dim = 1000, n = 30,000.

C. Two Examples of Clustering

We benchmark KNN-BLOCK DBSCAN on two 2-D test cases to reveal the process in detail, as shown in Fig. 8. The left case is aggregation [44], and the right case comes from [1].

Fig. 8. Two examples presenting the process of KNN-BLOCK DBSCAN. (a) Original data distribution. (b) The three kinds of blocks found by KNN-BLOCK DBSCAN, where black circles are CBs, green circles are NCBs, and red circles are NOBs. (c) Clusters found after merging CBs. (d) Assignment of NCBs to the corresponding clusters; the red balls are NCBs that can be assigned to their nearest clusters, and the green circles are those for which no cluster is found. (e) Final result of KNN-BLOCK DBSCAN, where black points are noise. (f) Result of the original DBSCAN.

Specifically, Fig. 8(a) presents the original data distribution. Fig. 8(b) draws the CBs, NCBs, and NOBs, plotted as black, green, and red circles, respectively. The radius of each circle is different, which means that each block has a different size. We can also see that NCBs usually distribute along the border of CBs, and NOBs appear far from CBs. Fig. 8(c) illustrates the result of merging CBs, which is the most important step in identifying clusters. In Fig. 8(d), as mentioned in Section IV-C3, there are three cases for processing NCBs: the green circles represent
(U NIT: S ECOND ) case (1), because they are far from all core-points, all points within these NCBs are classified as noise; the red balls illustrate case (2), each of them is assigned to one cluster from which it is density-reachable; in case (3), for each point p within unclassified NCBs, if q is identified as a core from which p is density-reachable, then p is classified to the cluster of q. Fig. 8(e) exhibits the final result of KNN-BLOCK DBSCAN, where black points are noise; and Fig. 8(f) shows the result obtained by original DBSCAN. It is observed that KNN-BLOCK DBSCAN nearly obtains the same result as DBSCAN with high efficiency, because it processes data by blocks, and reduces a large number of redundant distance computations. D. Runtime Comparisons With ρ-Approximate DBSCAN The first experiment is conducted on a set of subsets of HOUSE and PAM4D to observe the complexities of the proposed algorithm and ρ-approximate DBSCAN with different [, MinPts]. Figs. 9 and 10 present the results of two algorithms, and Table II reveals more details. We also conduct experiments on the whole data sets of HOUSE, PAM4D, KDD04, USCENCUS, REACTION, MOPCAP, BODONI, HIGGS, FMA, and AAS1K, respectively, and Table III shows the comparison of runtime with different [, MinPts]. From two figures and two tables, we can observe as follows. 1) Both algorithms prefer large and small MinPts. For example, on data set HouseHolod, both KNN-BLOCK DBSCAN and ρ-approximate DBSCAN run best when [, MinPts] = [5000, 100], and the worst case happens when [, MinPts] = [1000, 200]. On other data sets, things are similar as shown in Table III. 2) Both algorithms run in linear expected time in low dimensional data sets. 3) We can see that on large-scale data sets PAM4D, HOUSEHOLD, and HIGGS, our algorithm Authorized licensed use limited to: Thangal Kunju Musaliar College of Engineering. Downloaded on March 20,2023 at 14:57:36 UTC from IEEE Xplore. Restrictions apply. CHEN et al.: KNN-BLOCK DBSCAN: FAST CLUSTERING FOR LARGE-SCALE DATA 3947 TABLE III RUNTIME C OMPARISONS ON D IFFERENT DATA S ETS W ITH D IFFERENT AND M IN P TS . T HE S PEEDUP OF KNN-BLOCK DBSCAN OVER I TS C OMPETITOR I S G IVEN IN B RACKETS . (U NIT: S ECOND ) is much better, the speedup of KNN-BLOCK DBSCAN over its competitor is about 2.5–6 times on HOUSEHOLD, 1.4–3 times on PAM4D, and 16 and 17 times on HIGGS (28 dim). On other relative high-dimensional data sets, e.g., MOPCAP (36 dim) APS (170 dim), BODONI (256 dim), FMA (512 dim), and AAS-1K(1000 dim) KNN-BLOCK DBSCAN still performs well, while ρ-approximate degenerates to be an O(n2 ) algorithm which conforms to our analysis mentioned in Section II. It is also notable that the performance of KNN-BLOCK DBSCAN drops with the dimension, e.g., the proposed algorithm spends much more time on HIGGS than that on PAM4D, and the should be relatively larger in high dimension than that of low dimension. From these experiments, we can see that KNN-BLOCK DBSCAN accelerates ρ-approximate DBSCAN greatly, and is promising for processing such large-scale data. Fig. 9. Runtime comparisons on subsets of HOUSE with n increasing. E. Runtime Comparisons With AnyDBC To make comparisons with AnyDBC, we conduct experiments on the two same data sets, namely, GasSensor (Ethylence-co) and PAM39D, as shown in Fig. 11 (the result of AnyDBC is obtained by running the binary program provided by the authors on our machine). It is observed Authorized licensed use limited to: Thangal Kunju Musaliar College of Engineering. 
Fig. 11. Runtime comparisons with AnyDBC and ρ-approximate DBSCAN on GasSensor and PAM39D; MinPts is fixed to 50.

It is observed that KNN-BLOCK DBSCAN outperforms AnyDBC and ρ-approximate DBSCAN; in particular, on PAM39D, KNN-BLOCK DBSCAN runs far faster than AnyDBC.

F. Runtime Comparisons With Pure kNN-Based DBSCAN

In this part, KNN-BLOCK DBSCAN is compared with pure kNN-based DBSCAN on several data sets, and the results are shown in Table IV. From this table, we can see that KNN-BLOCK DBSCAN runs far faster than the pure kNN-based algorithm, and the speedup varies from 1.42 to 5.48. In most cases, the speedup is larger than 2, which shows that the block technique plays an important role in our algorithm and greatly speeds up DBSCAN.

TABLE IV RUNTIME COMPARISONS WITH PURE KNN-BASED DBSCAN

G. Effect of ε and MinPts

In this section, we check the effect of [ε, MinPts] on the proposed algorithm. PAM4D is used in this experiment, with cardinality 3,850,505 and dimension 4. Table V reveals the execution details of kNN, MergeCB (Algorithm 4), and AssignNCB (Algorithm 5), as well as the numbers of CBs, NCBs, and NOBs.

TABLE V EXECUTION TIMES OF KNN, MERGECB, AND ASSIGNNCB, AS WELL AS BLOCKS FOUND ON PAM WITH DIFFERENT [ε, MinPts]

Fig. 12. Runtime distributions with changing ε and MinPts on PAM4D and HOUSEHOLD, respectively.

As the two bold columns of Table V show, the number of kNN executions equals the number of blocks found by KNN-BLOCK DBSCAN. It is observed that: 1) the runtime and the number of kNN executions increase roughly linearly with MinPts; 2) the number of MergeCB executions decreases rapidly with MinPts; and 3) the fewer the CBs, the more the NCBs and NOBs. Fig. 12 also provides more details of the runtime distribution on PAM4D and HOUSEHOLD with changing ε and MinPts, respectively. Hence, we can infer that: 1) the complexity of KNN-BLOCK DBSCAN mainly depends on the number of kNN executions and 2) KNN-BLOCK DBSCAN prefers large ε and small MinPts, which yields fewer kNN executions because a larger number of CBs is identified.
1) When MinPts is small and ε is large, most blocks are identified as CBs, and their number is about n/MinPts. For example, as Table V shows, [7000, 100] runs fastest, followed by [5000, 100], then [3000, 100], and then [1000, 100].
2) When MinPts is large and ε is small, few CBs are found; thus, kNN is called more frequently, and the algorithm degenerates to O(n^2) in the worst case. As shown in Table III, when the parameters are [1000, 30 000] and [3000, 30 000], the runtime is much longer than for the other settings.
H. Statistics of Three Kinds of Blocks

In this section, to observe the numbers of the three kinds of blocks with respect to different ε and MinPts, experiments are conducted on several whole data sets, including HOUSE, PAM4D, KDD04, USCENCUS, REACTION, MOPCAP, and BODONI. Table VI exhibits statistics of the CBs, NCBs, and NOBs with respect to different ε and MinPts on all data sets. From this table, we can see that the total number of blocks, especially CBs, is far less than the cardinality n, which reveals that many distance computations are filtered out.

TABLE VI TOTAL NUMBERS OF CBS, NCBS, AND NOBS FOUND ON DIFFERENT DATA SETS

I. Omega-Index and NMI Evaluations

The Omega-index [45] and normalized mutual information (NMI) [46] are two well-known measures for evaluating clustering results; similar to [47], we use them to compare KNN-BLOCK DBSCAN and ρ-approximate DBSCAN. Because the complexities of the Omega-index and NMI are high (O(n^2)), we only conduct experiments on subsets of HOUSE, PAM4D, MOPCAP, and APS with n = 5000. In these experiments, we compute the Omega-index and NMI scores of both algorithms by comparing their results with those obtained from the original DBSCAN. As Tables VII and VIII show, the performances of both algorithms are similar, and the scores are all close to 1, which indicates that both algorithms nearly agree with the original DBSCAN.

TABLE VII COMPARISONS OF Omega-index FOR KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN (n = 5000)

TABLE VIII COMPARISONS OF NMI FOR KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN (n = 5000)
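For reference, the NMI used here can be stated briefly; the geometric-mean normalization below is the form used in [46], with X and Y denoting the two label assignments, p_ij the fraction of points placed in cluster i by X and cluster j by Y, and p_i, q_j the corresponding marginals. A score of 1 means the two clusterings agree up to relabeling.

    \mathrm{NMI}(X,Y) = \frac{I(X;Y)}{\sqrt{H(X)\,H(Y)}}, \qquad
    I(X;Y) = \sum_{i,j} p_{ij}\,\log\frac{p_{ij}}{p_i\,q_j}, \qquad
    H(X) = -\sum_i p_i\,\log p_i .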
J. Accuracy of KNN-BLOCK DBSCAN

To evaluate the accuracy of KNN-BLOCK DBSCAN, some experiments are conducted under the assumption that the clustering labels obtained by DBSCAN are the ground truth. The reasons are as follows. 1) This article is only motivated to accelerate DBSCAN, without concern for whether the clustering result itself is good or not, which is another topic outside the scope of this article; it is expected that the clustering results should be the same as those of the original DBSCAN provided the parameters (ε, MinPts) are the same. 2) Both KNN-BLOCK DBSCAN and ρ-approximate DBSCAN are approximate algorithms, and the more similar their clustering results are to those of the original DBSCAN, the better. Hence, we argue that it is reasonable to use the clustering results of DBSCAN as ground truth.

Specifically, the idea is that each data point belongs to a unique predefined cluster and its predicted cluster should correspond either to only one predefined cluster or to none [11]. Any pair of data points in the same predefined cluster is considered to be incorrectly clustered if the predicted cluster does not match the predefined cluster to which they belong, even if both points appear in the same predicted cluster. Therefore, we evaluate the precision of the two approaches as follows.

Step 1 (Clustering): Given a data set and [ε, MinPts], suppose Lab1 = {A1, A2, . . . , Ak} and Lab2 = {B1, B2, . . . , Bm} are the clustering labels obtained by DBSCAN and KNN-BLOCK DBSCAN, respectively.

Step 2 (Matching): It is well known that different clustering algorithms may yield different labels on the same data set. For example, cluster "A1" labeled by DBSCAN may be the same as "B2" obtained by KNN-BLOCK DBSCAN. Hence, it is reasonable to match the labels first and use the matched labels to compute the accuracy. In this article, the Kuhn–Munkres algorithm [48] performs the maximum matching between the two label sets, as also used in [11] and [49].

Step 3 (Computing Accuracy): Suppose there are three clusters labeled "A1," "A2," and "A3" obtained by DBSCAN on a data set, but KNN-BLOCK DBSCAN labels the same data with "B1," "B2," "B3," and "B4," and Kuhn–Munkres finds three matched pairs: ("A1," "B2"), ("A2," "B1"), and ("A3," "B4"). If the labels of a point p obtained by DBSCAN and KNN-BLOCK DBSCAN form a matched pair, the prediction for p is correct, e.g., ("A1," "B2"); otherwise it is wrong, e.g., ("A1," "B1").

TABLE IX EXAMPLE OF COMPUTING PRECISION FOR KNN-BLOCK DBSCAN BASED ON THREE MATCHED LABEL PAIRS: ("A1," "B2"), ("A2," "B1"), AND ("A3," "B4") FOUND BY KUHN–MUNKRES

Table IX shows more details. Suppose there are eight points in the data set; the second row lists the labels obtained by DBSCAN, and the third row is the clustering result of KNN-BLOCK DBSCAN. We can see that two points are wrongly predicted, because (A1, B4) and (A2, B3) are not matched pairs. Therefore, the total precision is (8 − 2)/8 = 75%.
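Once the matching is fixed, the precision of Step 3 reduces to counting the points whose two labels form a matched pair; a minimal sketch is given below, assuming the Kuhn–Munkres matching has already been computed (the matching itself is not shown, and the names are illustrative).

    // Precision of Step 3 given per-point labels and the matched label pairs.
    #include <map>
    #include <string>
    #include <vector>

    double precision(const std::vector<std::string>& dbscan_labels,
                     const std::vector<std::string>& knnblock_labels,
                     const std::map<std::string, std::string>& matched /* e.g. A1->B2 */) {
        int correct = 0;
        for (size_t i = 0; i < dbscan_labels.size(); ++i) {
            auto it = matched.find(dbscan_labels[i]);
            if (it != matched.end() && it->second == knnblock_labels[i]) ++correct;  // labels form a matched pair
        }
        return static_cast<double>(correct) / dbscan_labels.size();
    }
    // For the eight points of Table IX, where two label pairs do not match,
    // this returns 6/8 = 75%.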
Because the original DBSCAN has a high complexity, we only test on small data sets. Here, we extract four subsets from HOUSE, PAM4D, APS, and MOCAP and use them as test cases. Also, because DBSCAN is nondeterministic (sensitive to the iteration order), some border points may be assigned to different clusters depending on the order in which they appear; therefore, the accuracy is computed by comparing core points only.

TABLE X ACCURACY, RECALL, AND F1-SCORE OF KNN-BLOCK DBSCAN AND ρ-APPROXIMATE DBSCAN ON SUBSETS OF HOUSE, PAM4D, MOPCAP, AND APS

Table X shows that both algorithms achieve high accuracy. On the low-dimensional data sets (HOUSE and PAM4D), the precision, recall, and F1-score of both approximate algorithms are about 98%–100%, and there is only a small drop on the high-dimensional data sets (MOCAP and APS), to about 94.5%–97.7%.

VI. CONCLUSION

DBSCAN runs in O(n^2) expected time and is not suitable for large-scale data. ρ-approximate DBSCAN is designed to replace DBSCAN for big data; however, it only works well in very low dimensions. In this article, we analyzed the underlying causes of why current approaches fail in clustering large-scale data and found that the grid technique is nearly useless for high-dimensional data. To tame the problems mentioned above, an approximate approach named KNN-BLOCK DBSCAN is proposed for large-scale data based on two findings: 1) the key problem in DBSCAN of finding core points is a kNN problem in essence and 2) a point has a density distribution similar to that of its neighbors, which implies that it is highly likely that a point has the same type (core/border/noise) as its neighbors. Therefore, we argue that a kNN technique, e.g., FLANN, can be utilized to identify CBs, NCBs, and NOBs, which contain only core points, border points, and noise, respectively. Then, we proposed an algorithm to merge CBs that are density-reachable from each other and to assign each point in the NCBs to a proper cluster. The superiority of KNN-BLOCK DBSCAN over ρ-approximate DBSCAN is that it processes data by blocks, each of which has a dynamic range, instead of the fixed-width grid used in ρ-approximate DBSCAN, and that a fast kNN technique is used to identify the types of points. Given a fixed intrinsic dimensionality, the complexity of the proposed algorithm is about O([βn/MinPts][L·D·log(n)/log(χ)]), where L is a constant, D is the dimension, β is a factor of the data distribution, and χ is the branching factor of the tree used in FLANN.

Experiments show that KNN-BLOCK DBSCAN runs faster than ρ-approximate DBSCAN and pure kNN-based DBSCAN with high accuracy, even on some relatively high-dimensional data sets, e.g., APS (170 dim), BODONI (256 dim), FMA (512 dim), and AAS-1K (1000 dim), where ρ-approximate DBSCAN degenerates to an O(n^2) algorithm while KNN-BLOCK DBSCAN can still run very fast.

Our future work is to improve the proposed algorithm and apply it in real applications in the following aspects.
1) Use other exact kNN techniques, such as the cover tree and the semi-convex hull tree [36], to improve the accuracy of KNN-BLOCK DBSCAN.
2) Parallelize KNN-BLOCK DBSCAN on GPUs with a highly efficient data-scheduling strategy to make the proposed algorithm faster.
3) Apply it in our other research, such as image retrieval [50], vehicle reidentification [51], [52], vehicle crushing analysis [53], and auditing for shared cloud data [54]–[56].

REFERENCES

[1] A. K. Jain, "Data clustering: 50 years beyond K-means," Pattern Recognit. Lett., vol. 31, no. 8, pp. 651–666, 2010. [2] A. Likas, N. Vlassis, and J. J. Verbeek, "The global k-means clustering algorithm," Pattern Recognit., vol. 36, no. 2, pp. 451–461, 2003. [3] Y. Cheng, "Mean shift, mode seeking, and clustering," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 8, pp. 790–799, Aug. 1995. [4] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in Proc. KDD, vol. 96, 1996, pp. 226–231. [5] U. Von Luxburg, "A tutorial on spectral clustering," Stat. Comput., vol. 17, no. 4, pp. 395–416, 2007. [6] H. Chang and D.-Y. Yeung, "Robust path-based spectral clustering," Pattern Recognit., vol. 41, no. 1, pp. 191–203, 2008. [7] W. Fan, H. Sallay, and N. Bouguila, "Online learning of hierarchical Pitman–Yor process mixture of generalized Dirichlet distributions with feature selection," IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 9, pp. 2048–2061, Sep. 2017. [8] W. Fan, N. Bouguila, J. Du, and X. Liu, "Axially symmetric data clustering through Dirichlet process mixture models of Watson distributions," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 6, pp. 1683–1694, Jun. 2019. [9] L. Duan, S. Cui, Y. Qiao, and B. Yuan, "Clustering based on supervised learning of exemplar discriminative information," IEEE Trans. Syst., Man, Cybern., Syst., to be published. [10] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and L. Yang, "A novel cluster validity index based on local cores," IEEE Trans. Neural Netw. Learn. Syst., vol. 30, no. 4, pp. 985–999, Apr. 2019. [11] Y. Chen et al., "Decentralized clustering by finding loose and distributed density cores," Inf. Sci., vols. 433–434, pp. 649–660, Apr. 2018. [12] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surveys, vol. 31, no. 3, pp. 264–323, 1999. [13] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, "OPTICS: Ordering points to identify the clustering structure," in Proc. ACM SIGMOD Rec., vol. 28, 1999, pp. 49–60. [14] A. Rodriguez and A. Laio, "Clustering by fast search and find of density peaks," Science, vol. 344, no. 6191, pp. 1492–1496, 2014. [15] Y. Chen et al., "Fast density peak clustering for large scale data based on KNN," Knowl.
Based Syst., vol. 187, Jan. 2020, Art. no. 104824. [16] D. Cheng, Q. Zhu, J. Huang, Q. Wu, and Y. Lijun, “Clustering with local density peaks-based minimum spanning tree,” IEEE Trans. Knowl. Data Eng., to be published. [17] J. Gan and Y. Tao, “DBSCAN revisited: Mis-claim, un-fixability, and approximation,” in Proc. ACM SIGMOD Int. Conf. Manag. Data, 2015, pp. 519–530. [18] M. Muja and D. G. Lowe, “Scalable nearest neighbor algorithms for high dimensional data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 11, pp. 2227–2240, Nov. 2014. [19] J. L. Bentley, “Multidimensional binary search trees used for associative searching,” Commun. ACM, vol. 18, no. 9, pp. 509–517, 1975. [20] A. Beygelzimer, S. Kakade, and J. Langford, “Cover trees for nearest neighbor,” in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 97–104. [21] A. Gunawan and M. de Berg, “A faster algorithm for DBSCAN,” Ph.D. dissertation, Dept. Math. Comput. Sci., Univ. Eindhoven, Eindhoven, The Netherlands, 2013. [22] V. Chaoji, M. Al Hasan, S. Salem, and M. J. Zaki, “SPARCL: Efficient and effective shape-based clustering,” in Proc. 8th IEEE Int. Conf. Data Min., 2008, pp. 93–102. [23] E. H.-C. Lu, V. S. Tseng, and P. S. Yu, “Mining cluster-based temporal mobile sequential patterns in location-based service environments,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 6, pp. 914–927, Jun. 2011. [24] S. K. Pal and P. Mitra, Pattern Recognition Algorithms for Data Mining. Boston, MA, USA: CRC Press, 2004. [25] S. Mahran and K. Mahar, “Using grid for accelerating density-based clustering,” in Proc. 8th IEEE Int. Conf. Comput. Inf. Technol. (CIT), 2008, pp. 35–40. [26] K. Sonal, G. Poonam, S. Ankit, K. Dhruv, S. Balasubramaniam, and N. Goyal, “Exact, fast and scalable parallel DBSCAN for commodity platforms,” in Proc. 18th Int. Conf. Distrib. Comput. Netw., 2017, p. 14. [27] X. Chen, Y. Min, Y. Zhao, and P. Wang, “GMDBSCAN: Multi-density DBSCAN cluster based on grid,” in Proc. IEEE Int. Conf. e-Bus. Eng., 2008, pp. 780–783. [28] S. T. Mai, I. Assent, and M. Storgaard, “AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Disc. Data Min., 2016, pp. 1025–1034. [29] B. Borah and D. K. Bhattacharyya, “An improved sampling-based DBSCAN for large spatial databases,” in Proc. Int. Conf. Intell. Sens. Inf. Process., 2004, pp. 92–96. [30] C.-F. Tsai and C.-W. Liu, “KIDBSCAN: A new efficient data clustering algorithm,” in Proc. Artif. Intell. Soft Comput. (ICAISC), 2006, pp. 702–711. [31] C. Tsai and T. Huang, “QIDBSCAN: A quick density-based clustering technique,” in Proc. Int. Symp. Comput. Consum. Control, 2012, pp. 638–641. [32] A. Bryant and K. Cios, “RNN-DBSCAN: A density-based clustering algorithm using reverse nearest neighbor density estimates,” IEEE Trans. Knowl. Data Eng., vol. 30, no. 6, pp. 1109–1121, Jun. 2018. [33] A. Lulli, M. Dell’Amico, P. Michiardi, and L. Ricci, “NG-DBSCAN: Scalable density-based clustering for arbitrary data,” Proc. VLDB Endow., vol. 10, no. 3, pp. 157–168, 2016. [34] F. Gieseke, J. Heinermann, C. E. Oancea, and C. Igel, “Buffer KD trees: Processing massive nearest neighbor queries on GPUs,” in Proc. ICML, 2014, pp. 172–180. [35] Y. Chen, L. Zhou, Y. Tang, N. Bouguila, and H. Wang, “Fast neighbor search by using revised k-d tree,” Inf. Sci., vol. 472, pp. 145–162, 2019. [36] Y. Chen, L. Zhou, and N. Bouguila, “Semi-convex hull tree: Fast nearest neighbor queries for large scale data on GPUs,” in Proc. IEEE Int. Conf. 
[37] J. Wang et al., "Trinary-projection trees for approximate nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 2, pp. 388–403, Feb. 2014.
[38] M. Muja and D. G. Lowe, "Fast approximate nearest neighbors with automatic algorithm configuration," in Proc. Int. Conf. Comput. Vis. Theory Appl. (VISSAPP), 2009, pp. 331–340.
[39] C. Silpa-Anan and R. Hartley, "Optimised KD-trees for fast image descriptor matching," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2008, pp. 1–8.
[40] Y. Chen, S. Tang, N. Bouguila, C. Wang, J. Du, and H. L. Li, "A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data," Pattern Recognit., vol. 83, pp. 375–387, Nov. 2018.
[41] A. Karami and R. Johansson, "Choosing DBSCAN parameters automatically using differential evolution," Int. J. Comput. Appl., vol. 91, no. 7, pp. 1–11, 2014.
[42] H. Zhou, P. Wang, and H. Li, "Research on adaptive parameters determination in DBSCAN algorithm," J. Xian Univ. Technol., vol. 9, no. 7, pp. 1967–1973, 2012.
[43] F. O. Ozkok and M. Celik, "A new approach to determine eps parameter of DBSCAN algorithm," Int. J. Intell. Syst. Appl. Eng., vol. 4, no. 5, pp. 247–251, 2017.
[44] A. Gionis, H. Mannila, and P. Tsaparas, "Clustering aggregation," in Proc. Int. Conf. Data Eng. (ICDE), 2005, pp. 341–352.
[45] L. M. Collins and C. W. Dent, "Omega: A general formulation of the Rand index of cluster recovery suitable for non-disjoint solutions," Multivariate Behav. Res., vol. 23, no. 2, pp. 231–242, 1988.
[46] A. Strehl and J. Ghosh, "Cluster ensembles: A knowledge reuse framework for combining partitionings," in Proc. 18th Nat. Conf. Artif. Intell., 2002, pp. 93–99.
[47] M. A. Patwary, D. Palsetia, A. Agrawal, W.-K. Liao, F. Manne, and A. Choudhary, "Scalable parallel OPTICS data clustering using graph algorithmic techniques," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. (SC), 2013, pp. 1–12.
[48] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Res. Logist. Quart., vol. 2, nos. 1–2, pp. 83–97, 1955.
[49] Y. Chen, S. Tang, S. Pei, C. Wang, J. Du, and N. Xiong, "DHeat: A density heat-based algorithm for clustering with effective radius," IEEE Trans. Syst., Man, Cybern., Syst., vol. 48, no. 4, pp. 649–660, Apr. 2018.
[50] X. Liu, Z. Hu, H. Ling, and Y. Cheung, "MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[51] J. Hou, H. Zeng, L. Cai, J. Zhu, J. Chen, and K.-K. Ma, "Multi-label learning with multi-label smoothing regularization for vehicle re-identification," Neurocomputing, vol. 345, pp. 15–22, Jun. 2019.
[52] J. Zhu et al., "Vehicle re-identification using quadruple directional deep learning features," IEEE Trans. Intell. Transp. Syst., to be published.
[53] Y. Zhang, X. Xu, J. Wang, T. Chen, and C. H. Wang, "Crushing analysis for novel bio-inspired hierarchical circular structures subjected to axial load," Int. J. Mech. Sci., vol. 140, pp. 407–431, May 2018.
[54] H. Tian, F. Nan, C.-C. Chang, Y. Huang, J. Lu, and Y. Du, "Privacy-preserving public auditing for secure data storage in fog-to-cloud computing," J. Netw. Comput. Appl., vol. 127, pp. 59–69, Feb. 2019.
[55] H. Tian, F. Nan, H. Jiang, C.-C. Chang, J. Ning, and Y. Huang, "Public auditing for shared cloud data with efficient and secure group management," Inf. Sci., vol. 472, pp. 107–125, Jan. 2019.
[56] H. Tian et al., "Public audit for operation behavior logs with error locating in cloud storage," Soft Comput., vol. 23, no. 11, pp. 3779–3792, Jun. 2019.
Yewang Chen received the B.S. degree in management of information system from Huaqiao University, Quanzhou, China, in 2001, and the Ph.D. degree in software engineering from Fudan University, Shanghai, China, in 2009. He is currently an Associate Professor with the School of Computer Science and Technology, Huaqiao University, and the Fujian Key Laboratory of Big Data Intelligence and Security, Huaqiao University (Xiamen Campus), Xiamen, China. He is also with the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University, Beijing, China, and the Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China. His current research interests include machine learning and data mining.

Lida Zhou received the B.S. degree in computer science from the College of Computer Science and Technology, Central China Normal University, Wuhan, China, in 2012. He is currently pursuing a postgraduate degree with the School of Computer Science and Technology, Huaqiao University (Xiamen Campus), Xiamen, China. His current research interests are machine learning and pattern recognition.

Songwen Pei (SM'19) received the B.S. degree in computer science from the National University of Defense Technology, Changsha, China, in 2003, the M.S. degree in computer science from Guizhou University, Guiyang, China, in 2006, and the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2009. He is currently an Associate Professor with the University of Shanghai for Science and Technology, Shanghai. Since 2011, he has been a Guest Researcher with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; he was also a Research Scientist with the University of California at Irvine, Irvine, CA, USA, from 2013 to 2015, and with the Queensland University of Technology, Brisbane, QLD, Australia, in 2017. His research interests include heterogeneous multicore systems, cloud computing, and big data. Dr. Pei is a board member of CCF-TCCET and CCF-TCARCH. He is a member of ACM and CCF in China.

Zhiwen Yu (SM'14) received the Ph.D. degree in computer science from the City University of Hong Kong, Hong Kong, in 2008. He is a Professor with the School of Computer Science and Engineering, South China University of Technology, Guangzhou, China. He has published more than 140 refereed journal and international conference papers, including 40 IEEE TRANSACTIONS papers. His research areas focus on data mining, machine learning, pattern recognition, and intelligent computing. Prof. Yu is a Distinguished Member of the China Computer Federation and the Vice Chair of the ACM Guangzhou Chapter. He is a Senior Member of ACM.

Yi Chen received the Ph.D. degree in computer science from the Beijing Institute of Technology, Beijing, China, in 2002. She is currently a Professor of computer science with Beijing Technology and Business University, Beijing, where she is the Director of the Beijing Key Laboratory of Big Data Technology for Food Safety.
Her research interests mainly focus on information visualization, visual analytics, and big data technology for food quality and safety, including high-dimensional, hierarchical, spatio-temporal, and graph data visual analytics.

Xin Liu (M'08) received the M.S. degree in applied mathematics from Hubei University, Wuhan, China, in 2009, and the Ph.D. degree in computer science from Hong Kong Baptist University, Hong Kong, in 2013. He was a Visiting Scholar with the Computer and Information Sciences Department, Temple University, Philadelphia, PA, USA, from 2017 to 2018. He is currently an Associate Professor with the Department of Computer Science and Technology, Huaqiao University, Quanzhou, China, and also with the State Key Laboratory of Integrated Services Networks, Xidian University, Xi'an, China. His present research interests include multimedia analysis, computer vision, pattern recognition, and machine learning.

Jixiang Du received the B.Sc. and M.Sc. degrees in vehicle engineering from the Hefei University of Technology, Hefei, China, in September 1999 and July 2002, respectively, and the Ph.D. degree in pattern recognition and intelligent system from the University of Science and Technology of China, Hefei, in December 2005. He is currently a Professor with the College of Computer Science and Technology, Huaqiao University, Quanzhou, China.

Naixue Xiong (SM'12) received the first Ph.D. degree in software engineering from Wuhan University, Wuhan, China, in 2007, and the second Ph.D. degree in dependable networks from the Japan Advanced Institute of Science and Technology, Nomi, Japan, in 2007. He worked with Colorado Technical University, Colorado Springs, CO, USA, Wentworth Technology Institution, Boston, MA, USA, and Georgia State University, Atlanta, GA, USA, for many years. He is currently a Professor with Northeastern State University, Tahlequah, OK, USA. His research interests include cloud computing, security and dependability, parallel and distributed computing, networks, and optimization theory.