Environmental Science and Pollution Research
https://doi.org/10.1007/s11356-023-26780-1
Applications of Emerging Green Technologies for Efficient Valorization of Agro-Industrial Waste: A Roadmap Towards Sustainable Environment and Circular Economy
Entropy-based grid approach for handling outliers: a case study
to environmental monitoring data
Anwar Shah1 · Bahar Ali1 · Fazal Wahab2 · Inam Ullah3 · Kassian T.T. Amesho4,5,6,7,8,9 · Muhammad Shafiq10,11
Received: 5 November 2022 / Accepted: 29 March 2023
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023
Responsible editor: Marcus Schulz

Correspondence: Muhammad Shafiq (Srsshafiq@gmail.com)

Extended author information available on the last page of the article
Abstract
Grid-based approaches render an efficient framework for data clustering in the presence of incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses entropy on the dataset as a whole, or on individual clusters, to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection is concerned with data points that are isolated in grid cells; such points are either far from the dense regions or isolated near them, and are therefore declared explicit outliers. Implicit outlier detection is associated with the detection of outliers that deviate from the normal pattern in a less obvious way. Such outliers are determined using the entropy change of the dataset, or of a specific cluster, for each deviation. An elbow based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on the CHAMELEON datasets and other similar datasets suggest that the proposed approach(es) detect outliers more precisely and extend outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resultant clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF and HBOS. Finally, a case study on detecting outliers in environmental data has been carried out using the proposed approach, with results generated on our synthetically prepared datasets. The performance shows that the proposed approach may be an industry-oriented solution to outlier detection in environmental monitoring data.
Keywords Environmental monitoring data · Entropy · Grid · Hard clusters · Industrial · Outliers
Introduction
An outlier is a data point or group of points that deviates from the normal pattern of a given dataset. In general, outliers can be divided into three classes based on their behavior: global, conditional (or contextual), and collective outliers. Global outliers deviate significantly from the entire dataset (Liu et al. 2010; Yang et al. 2020). Conditional outliers deviate from the rest of the dataset with respect to some selected context (Bharti et al. 2019; Wang and Davidson 2009). In contrast, collective outliers are a group of instances that collectively deviate from
the rest of the dataset (Rajeswari et al. 2018; Sitanggang and Baehaki 2015). Outliers occur for several reasons, such as inappropriate scaling, improper data collection, measurement errors, sampling errors, and human or experimental errors (Campello et al. 2013; Campos et al. 2016b; Chen et al. 2017; Mia Hubert and Segaert 2015). Outlier detection plays a major role in almost every quantitative discipline, such as cybersecurity, finance, and machine learning (Shah et al. 2021, 2022; Shafiq et al. 2020b, a). It has many applications, including fault diagnosis, web analytics, medical diagnosis, fraud detection, criminal activity detection, and malware detection (Andersson et al. 2016; Jabez and Muthukumar 2015; Lin and Brown 2006; Lucas et al. 2020; Malini and Pushpa 2017; Sandosh et al. 2020; Shafiq et al. 2020c).
Outlier detection is an essential task in data processing, as undetected outliers may severely affect the final evaluation of results.
Therefore, several approaches have been devised for outlier detection, including statistical-based, distance-based, density-based, deviation-based, subspace-based, depth-based, and clustering-based approaches. We handle outlier detection in the context of clustering.
The main focus of cluster-based outlier detection methods is to define and observe compactness in clusters to find outliers. An outlier may be an independent object or appear as an individual cluster (Christy et al. 2015). Clustering methods implicitly define an outlier, or group of outliers, as background noise in which the clusters are embedded in the foreground (Ester et al. 1996; Qiu et al. 2003; Luo et al. 2007). Density-based spatial clustering (DBSCAN) is a well-known method to discover clusters and the surrounding noise in spatial databases (Ester et al. 1996). This method defines outliers based on the densities of collections of objects: an object lying in a low-density region under the same parameters is considered an outlier. Several extensions of and variants to this method have been proposed to handle various problems (Birant and Kut 2007; Borah and Bhattacharyya 2004; Duan et al. 2007; He et al. 2014; Louhichi et al. 2014; McInnes et al. 2017).
Many researchers work in the same direction by modifying, extending, improving or refining the basic approach for different applications (Birant and Kut 2007; Borah and Bhattacharyya 2004; Duan et al. 2007; He et al. 2014; McInnes et al. 2017; Tran et al. 2013). Some approaches declare an object an outlier based on its distance from the centroid of the cluster (Hautamäki et al. 2005; Zhang and Leung 2003). An approach towards group outlier detection, which considers small-size clusters as group outliers, has been devised in Jiang et al. (2001). Another approach is to detect outliers through a separate cluster, called the noise or outlier cluster (Gan and Ng 2017; Rehm et al. 2007): the potential outliers in the dataset have a high level of association with that cluster. Besides these approaches, fuzzy and graph-based methods have also been devised to handle outliers (Xu et al. 2007; Yang et al. 2010). These existing approaches employ some threshold or constraint to isolate outliers from inliers, and they have two key issues. The first is the determination of a suitable threshold. The second is that they apply a single criterion to a multi-class dataset: in a multi-class environment, each class may have a different density, structure and geometry, and a single criterion may fail to determine all the outliers in the dataset.
We introduce an entropy-based gridding approach, EGO, and examine its efficiency and application in outlier detection. EGO is motivated by the fact that entropy is related to the randomness of a system (Guseva and Kuznetsov 2017). More specifically, entropy measures the separation of an object from the centroid of its corresponding cluster, and removing an object can lead to a drop in the entropy of a particular cluster or dataset. Removing objects from a family of objects having the same distribution (sparse or dense) affects the entropy in a systematic way, which indicates that the instances are members of the same cluster and not outliers. Where the entropy drop caused by removing an object deviates from the regular drops, the presence of an outlier is indicated. EGO can be applied to the whole dataset at once, or to individual clusters as EGO_CB. EGO_CB works on a cluster basis, determining the outliers corresponding to each given hard cluster. The refinement capability of EGO_CB is greater than that of EGO, at the cost of increased time complexity. The stopping criteria of EGO and EGO_CB can be optimized using the entropy-geometry elbow. These approaches are evaluated against the inverse relation between object geometry and entropy drop for each specific geometry of the objects in the underlying grid. Experimental results on two-dimensional real and multi-dimensional textual datasets suggest that entropy-based outlier detection detects and removes an additional 4.5% to 8.6% of outliers that remain undetected by DBSCAN or HDBSCAN. Moreover, it leads to greater cluster compactness and better cluster quality compared to state-of-the-art outlier detection methods.
Furthermore, the proposed approaches have been employed on environmental monitoring data taken as a case study. Environmental monitoring data is essential for analyzing and detecting harmful residuals impacting human health. Usually, industrial plants have monitoring systems and meteorological stations that enumerate the key variables of air quality in the presence of the residuals released by these complexes. The measurements may be contaminated with outliers that must be discarded to obtain a consistent dataset. To evaluate the proposed approach's performance, synthetic outliers have been added to the environmental monitoring data. The correlation between fresh-air attributes is computed using the Pearson Correlation Coefficient (PCC) (Benesty et al. 2009); similarly, the correlation of the attributes is computed in the presence of outliers. The attributes are coordinated based on the correlation, which separates abnormal patterns from regular patterns. The proposed approach then detects the outliers using the entropy-based gridding approach. The detection accuracy of the proposed approach for environmental monitoring data is 96.67%, which suggests that our methods are suitable for industrial use.
The remainder of the paper is structured as follows. The background is discussed in the "Background" section; the "Entropy-based gridding approach using elbow method" and "Experimental results and evaluations" sections describe the proposed approaches and the experiments conducted with them, respectively; the "Conclusion" section concludes the paper.
Background
This section introduces the background required to explain
the proposed approach(es). In particular, we discuss data
scaling, gridding, and entropy.
Data scaling
Density-based clustering is capable of identifying clusters of arbitrary size or shape, but has difficulty obtaining clusters with varying densities. A number of efforts have been made to cope with this challenge (Güngör and Özmen 2017; Zhu et al. 2018, 2016). For instance, Rescale is a one-dimensional density-ratio scaling approach, whereas DScale is a multi-dimensional distance-based scaling preprocessing technique (Zhu et al. 2018, 2016).
DScale rescales all the computed distances between the instances of a dataset. Consider a universal set U = {x1, x2, x3, …, xn} with n objects. Each object xi has A attributes, i.e., xi = (xi^1, …, xi^A), where xi^a represents the a-th attribute of object xi. To rescale the computed distance d(x, y) to d′(x, y), DScale defines the scaling function r(x) as,
$$r(x) = \left( \frac{|N_\eta(x, d)| \times m^A}{n \times \eta^A} \right)^{\frac{1}{A}} \tag{1}$$
where Nη(x, d) = {y ∈ U | d(x, y) ≤ η} is the η-neighborhood of x, η ∈ (0, 1) is the radius parameter of the neighborhood, m = max_{x,y ∈ U} d(x, y) is the maximum pairwise distance, and n is the number of objects in U. The scaling function r(x) scales the computed distance matrix DM = [d(x, y)] to the scaled distance matrix D′M = [d′(x, y)] as,
$$d'(x, y) = \begin{cases} d(x, y) \times r(x) & \text{if } y \in N_\eta \\[4pt] (d(x, y) - \eta) \times \dfrac{m - \eta \, r(x)}{m - \eta} + \eta \, r(x) & \text{if } y \in U \setminus N_\eta \end{cases} \tag{2}$$
The above equation defines two cases. The first rescales the distance between x and the elements inside its η-neighborhood using linear scaling, where the number of elements remains the same. The second rescales the distance between x and y ∈ U \ Nη using min-max normalization to preserve the object ranking. The effect of rescaling is illustrated in Fig. 1.

Fig. 1 Data scaling with DScale
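To make Eqs. 1-2 concrete, the following is a minimal sketch of DScale-style rescaling of a pairwise distance matrix. This is our illustration, not the authors' released implementation; the function and variable names are our own.

```python
import numpy as np

def dscale(DM, eta, A):
    """Rescale a pairwise distance matrix DM (n x n) following Eqs. 1-2.

    eta : neighborhood radius in (0, 1); A : number of attributes.
    A sketch of the DScale idea, not the paper's reference implementation.
    """
    n = DM.shape[0]
    m = DM.max()                                 # maximum pairwise distance
    nbr = (DM <= eta).sum(axis=1)                # |N_eta(x, d)| per object
    r = ((nbr * m**A) / (n * eta**A)) ** (1.0 / A)   # Eq. 1
    DM2 = np.empty_like(DM, dtype=float)
    for i in range(n):
        inside = DM[i] <= eta                    # y in N_eta(x): linear scaling
        DM2[i, inside] = DM[i, inside] * r[i]
        out = ~inside                            # y outside: min-max style rescale
        DM2[i, out] = (DM[i, out] - eta) * (m - eta * r[i]) / (m - eta) + eta * r[i]
    return DM2
```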
Gridding and grid-based algorithms
A grid is a collection of uniformly arranged straight lines intersecting each other at equal intervals, representing a series of connected rectangles or squares. Grid-based approaches add value to the performance of algorithms due to their scalability, lower time complexity and parallel processing nature (Amini et al. 2014; Erskine et al. 2006; Kotsiantis and Pintelas 2004; Rai and Singh 2010; Rokach 2009; Xu and Tian 2015). These algorithms are used to formulate solutions for many problems, such as outlier detection, clustering, task scheduling and pathfinding (Agrawal et al. 1998; Bai et al. 2016; Blythe et al. 2005; Lee and Cho 2016; Liao et al. 2004; Ma and Chow 2004; Mahmoud et al. 2016; Ohadi et al. 2020; Park and Lee 2004; Pilevar and Sukumar 2005; Sheikholeslami et al. 2002; Wang et al. 2009, 1997; Yap 2002). The performance of these algorithms depends upon the granularity and mesh refinement (Liao et al. 2004; Ohadi et al. 2020). Some notable granularity and refinement approaches operate on static, dynamic, local and adaptive grids (Batra and Ko 1992; Berger et al. 1989; Berger and Oliger 1984; Eiseman 1987; Fakhari and Lee 2014; Fuchs 1986; Osekowska et al. 2014; Rencis and Mullen 1986).

Grid-based approaches are employed in two steps. First, the statistical data corresponding to each cell are collected using a precisely selected grid (Batra and Ko 1992; Berger et al. 1989; Berger and Oliger 1984; Eiseman 1987; Fakhari and Lee 2014; Fuchs 1986; Rencis and Mullen 1986). Then, the specific operation, such as outlier detection or clustering, is performed on the deployed grid without accessing the data inside the database.

Entropy

Fig. 2 A graphical representation of the entropy of an event

In the area of machine learning, entropy defines the randomness or impurity of a given dataset. Shannon's entropy model uses a base-2 logarithmic function, log2(P(x)), to measure entropy; its behavior as the probability P(x) of an event increases is depicted in Fig. 2. In the case of more than one event or element, the accumulative entropy can be calculated using the following equation,
$$H(x) = -\sum_{i=0}^{n-1} P(x_i) \log_2 P(x_i) \tag{3}$$
where H(x) is the Shannon entropy, P(xi) is the probability of occurrence of the random variable xi, and −log2 P(xi) is its information content. The above equation is the formal definition of entropy, which is the baseline for calculating the information gain of a system.
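As a small illustration of Eq. 3 (our sketch; the empirical probability estimate is an assumption):

```python
import numpy as np

def shannon_entropy(values):
    """Accumulative Shannon entropy (Eq. 3) of a sequence of observations."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()               # empirical probabilities P(x_i)
    return float(-(p * np.log2(p)).sum())   # H(x) = -sum P(x_i) log2 P(x_i)

print(shannon_entropy([0, 0, 1, 1]))  # 1.0 bit: maximally random binary data
print(shannon_entropy([0, 0, 0, 0]))  # 0.0 bits: no randomness
```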
An alternative approach to randomness is distance: elements become more similar to each other as they come closer together in some feature space. Therefore, we argue that distance or remoteness can be substituted for randomness in situations where the elements are weighed in the same feature space while holding different positions. The volume of the system varies with the distance among the objects. Let us consider Fig. 3 to understand the underlying approach.
Fig. 3 Visual demonstration of the distantial entropy of a cluster

In Fig. 3(a), a circle represents cluster c1 with a centroid co, visualized by a black point in the center. The red points represent the objects of cluster c1. The Euclidean distance (d_far) between the object o1 and the centroid co is d1. It is inversely proportional to the probability of similarity (P_sim) of object o1 to the centroid co, which is given as follows,

$$P_{sim} \propto \frac{1}{d_{far}}, \qquad P_{sim} = C \times \frac{1}{d_{far}} \tag{4}$$

where C is the constant of proportionality, which depends upon the weight or importance of the objects; generally its value is equal to 1. Furthermore, E1 is the accumulative entropy for Fig. 3(a). Similarly, in Fig. 3(b), the circle represents cluster c1 with a centroid co visualized by a black point in the center, and the red points are the objects belonging to cluster c1. The Euclidean distance between the object o1 and the centroid co is d2. The entropy of object o1 will increase because d2 > d1, which leads to an increase in the accumulative entropy from E1 to E2. Mathematically, we have

$$\begin{aligned}
E_1 &= -\sum_{i=0}^{n-1} P(x_i) \log_2 P(x_i) \\
    &= -\left[ \sum_{i=0}^{n-2} P(x_i) \log_2 P(x_i) + P(x_1) \log_2 P(x_1) \right] \\
    &= -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + P_{sim}(x_1) \log_2 P_{sim}(x_1) \right] \\
    &= -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + \frac{1}{d_1} \log_2 \frac{1}{d_1} \right]
\end{aligned} \tag{5}$$

Similarly,

$$E_2 = -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + \frac{1}{d_2} \log_2 \frac{1}{d_2} \right] \tag{6}$$

From Eqs. 5-6, E2 > E1 as d2 > d1.
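A small numeric sketch of this distantial-entropy argument (ours; it takes Eq. 4 with C = 1, so P_sim = 1/d, and evaluates Eqs. 5-6 directly):

```python
import numpy as np

def distantial_entropy(dists):
    """Accumulative entropy of a cluster from centroid distances (Eqs. 4-6).

    P_sim = 1 / d (Eq. 4 with C = 1); per-object entropy term: -P log2 P.
    """
    p = 1.0 / np.asarray(dists, dtype=float)
    return float(-(p * np.log2(p)).sum())

# Distances chosen in (1, e), where the per-object term -P log2 P grows
# with distance, matching the E2 > E1 argument above (our assumption).
base = [1.2, 1.2, 1.2]                   # objects near the centroid
E1 = distantial_entropy(base + [1.5])    # o1 at distance d1 = 1.5
E2 = distantial_entropy(base + [2.5])    # o1 moved out to d2 = 2.5
print(E1, E2, E2 > E1)                   # entropy grows as o1 moves away
```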
Environmental monitoring data
Environmental data monitoring is an essential process for maintaining a pollution-free environment. Pollution is directly associated with industrial plants and their operations. The analysis is carried out to observe the range of contaminants such as ozone (O3), nitrogen dioxide (NO2), lead, excessive carbon and its byproducts, and other particulate materials dangerous to health, measured in units of mg/m3 or parts per billion (ppb) by chromatographs. An excess of these substances in the air may have dangerous impacts on human health; for instance, an excess of NO2 may cause skin irritation, while high levels of SO2 may cause bronchitis, asthma and even heart attacks. The air quality stations measure the contaminating substances, whereas the meteorological stations measure the wind speed, wind direction, humidity, and temperature. The environmental monitoring data is obtained using chromatographs and stored in the environmental database for analysis. Unfortunately, the data contain abnormalities arising from several possible causes, such as instrumentation faults, communication channel faults, electrical faults, and problems with emission sources. This leads to masking and swamping effects: masking means treating an abnormal measurement as normal, while swamping means treating a normal measurement as abnormal. Both can severely affect the precautionary measures taken based on these analyses.
Many authors have formulated solutions to cope with these types of outliers. Pearson worked on the identification of system-generated outliers, arguing that an outlier is not always a byproduct of wrong measurements but can also be caused by faulty system operations (Pearson 2002). The severity of these outliers has also been discussed in Kadlec et al. (2009) and Warne et al. (2004). Garces and Sbarbaro presented a nonlinear approach to detect outliers (Garces and Sbarbaro 2009). Alameddine et al. comparatively analyzed three outlier detection mechanisms on environmental monitoring data, particularly water quality in different lakes (Alameddine et al. 2010). Veselík et al. automated the process of outlier detection based on statistical analysis (Veselík et al. 2020).
Entropy-based gridding approach using elbow method
In this section, we present a gridding approach that makes use of the elbow method based on the entropy measurement of each cluster. Entropy is used to measure the randomness in the clusters, while the elbow determines whether objects should be decided as outliers or inliers. In particular, for the improved gridded hard cluster(s), we calculate an initial entropy for the system and keep evaluating the entropy for implicit outliers with different geometries. The elbow method finds the outliers by establishing a trade-off between the entropy and the suspected outliers.
Data scaling
It is not always true that a dataset contains clusters or classes with the same densities, and a conventional machine-learning algorithm can behave unexpectedly in the case of clusters with varied densities. Data scaling is therefore required to balance the densities of all the clusters to achieve optimum results. There are two scaling methods, namely density-ratio and DScale (Zhu et al. 2018, 2016), used for two-dimensional and n-dimensional data, respectively. We use DScale for data scaling via Eq. 2: the distance matrix DM is first computed based on the Euclidean distance and then transformed to the scaled matrix D′M using Eqs. 1-2. Data scaling may improve the clustering process (Zhu et al. 2018, 2016).
Hard clustering
Let D = {x1, x2, …, xn} be a dataset containing a finite number n of objects. Scaling the distances between the objects yields the computed scaled matrix D′M. A conventional machine-learning algorithm clusters the scaled data into groups based on the similarities between objects, such that C = {c1, c2, …, cK}. The conventional machine-learning algorithms used in this paper are HDBSCAN (for initial hard clusters with noise removal) and k-means (for oval-shaped initial hard clusters).
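A minimal sketch of this initial hard-clustering step (ours; the parameter values, e.g. n_clusters and min_cluster_size, are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
import hdbscan  # pip install hdbscan

X = np.random.rand(500, 2)  # placeholder for the scaled dataset

# Oval-shaped clusters: k-means with K chosen for the dataset
kmeans_labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)

# Arbitrary-shaped clusters with noise removal: HDBSCAN labels noise as -1
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
```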
Explicit outlier detection
Grid representation
The next step is to represent the dataset on a grid. The proposed approach(es) work on normalized attributes and on the division of the unit square (in the case of two attributes) or the unit hypercube (in the case of more than two attributes) into an equally distributed grid (Gu et al. 2017).
This approach uses the Euclidean distance metric to measure the influence of each attribute. Therefore, we first normalize the A attributes to the range [0, 1] to balance the effect of each attribute during distance analysis (Goldstein 2014); this scales the whole problem space to a unit hypercube and requires O(n·A) operations. Further, we divide the unit square (in the case of two attributes) or the unit hypercube (in the case of more than two attributes) into p equidistant parts such that p ∈ N. Hence, the total number of grid cells becomes p^A, where p is the number of grid cells in one dimension and A is the number of attributes in the data space. Each grid cell can be located through an A-tuple (j0, j1, …, j_{A−1}), where j_k ∈ {0, 1, …, p − 1}. Furthermore, we can map this multidimensional grid to a single-value index, i.e., I = {0, 1, 2, …, p^A − 1}. Algorithm 1 demonstrates the grid-based representation of the data space, as shown in Fig. 4 and supported by Table 1.

Fig. 4 Data points represented on a grid
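A minimal sketch of this normalization-and-indexing step (ours, assuming zero-based cell indices):

```python
import numpy as np

def grid_index(x, p):
    """Map a normalized point x in [0,1]^A to a single-value grid-cell index.

    Cell coordinates j_k in {0, ..., p-1}; flat index I in {0, ..., p^A - 1}.
    """
    j = np.minimum((np.asarray(x) * p).astype(int), p - 1)  # A-tuple (j_0, ..., j_{A-1})
    A = len(j)
    return int(sum(j[k] * p**k for k in range(A)))          # flatten the A-tuple

print(grid_index([0.34, 0.81], p=5))  # cell (1, 4) -> index 1 + 4*5 = 21
```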
The clustered data is logically gridded using Algorithm 1, and the objects in the dataset are then tested as possible outliers. The grid plays a key role in the process of explicit outlier detection: an object with all of its immediate neighboring cells empty is an exceptional case and an obvious outlier. The remaining objects are arranged in sets based on their neighboring empty cells, and each object in these sets is examined as a possible outlier based on its acquired geometry.
Entropy measurement
The process of explicit outlier detection is followed by an implicit outlier detection mechanism. First, the initial entropy of the remaining dataset is determined, say E1. The elements are divided into eight sets with respect to their immediate empty neighboring cells, i.e., G_e = {e1, e2, …, e8}, where e_i represents the elements with i immediate empty cells. Now, to examine the effect on the entropy of declaring a set of objects as outliers, we compute the power set of G_e. The power set P(G_e) holds all possible combinations of objects existing in various geometries, arranged in 2^8 subsets of the set G_e. Note that we do not consider ∅ in our analysis, so |P(G_e)| − 1 subsets are examined. Each time a subset of G_e is examined for outlier verification, the entropy is calculated for the dataset excluding that subset. The resulting graph creates an elbow at a specific point, marking the possible boundary between outliers and inliers.
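The following sketch (ours) illustrates the mechanics on a small 2-D grid: objects are grouped into G_e by their number of immediate empty neighbors, and Eq. 3 is evaluated with each non-empty subset of G_e excluded. Applying Eq. 3 to the cell-occupancy distribution is our simplification of the paper's object-level probabilities.

```python
import numpy as np
from itertools import chain, combinations

def empty_neighbors(occ, i, j):
    """Number of empty cells among the 8 immediate neighbors (off-grid counts as empty)."""
    cnt = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0):
                a, b = i + di, j + dj
                inside = 0 <= a < occ.shape[0] and 0 <= b < occ.shape[1]
                cnt += 0 if (inside and occ[a, b] > 0) else 1
    return cnt

def cell_entropy(counts):
    """Eq. 3 over the occupancy distribution of the non-empty cells."""
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

occ = np.random.default_rng(0).poisson(0.7, size=(10, 10))
cells = [(i, j) for i in range(10) for j in range(10) if occ[i, j] > 0]
Ge = {g: [c for c in cells if empty_neighbors(occ, *c) == g] for g in range(1, 9)}
groups = [g for g, members in Ge.items() if members]

# Entropy with each non-empty subset of G_e excluded (power set minus the empty set)
entropies = {}
for subset in chain.from_iterable(combinations(groups, r) for r in range(1, len(groups) + 1)):
    kept = occ.astype(float)
    for g in subset:
        for i, j in Ge[g]:
            kept[i, j] = 0.0                  # exclude suspected outlier cells
    vals = kept[kept > 0]
    entropies[subset] = cell_entropy(vals) if vals.size else 0.0
```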
Elbow
In machine learning, the elbow method based on the sum of squared distances is often used to determine the number of clusters. Similarly, here an entropy-based graph is plotted against the objects with various geometries. An elbow evolves at the point of the abrupt jump, which determines the boundary between outliers and inliers. Moreover, the process may potentially be repeated using various settings of empty cells around an object, such as 3-by-3 or 5-by-5 cells; increasing the processing area leads to the detection of more outliers.
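One simple way to locate such an elbow (our sketch; the paper does not prescribe this exact rule) is to take the largest successive drop in the entropy curve:

```python
import numpy as np

def elbow_index(entropies):
    """Index of the largest successive drop in an entropy curve (the elbow)."""
    e = np.asarray(entropies, dtype=float)
    drops = e[:-1] - e[1:]          # entropy drop at each removal step
    return int(np.argmax(drops))    # boundary between inliers and outliers

curve = [4.90, 4.86, 4.83, 4.31, 4.29]  # hypothetical entropy per geometry step
print(elbow_index(curve))                # 2: the abrupt jump occurs after step 2
```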
Table 1 Hash table to map data points in a grid

Object | Grid position (I) | Cells | x | y | Identifier Id | Location Id
x1 | I1 | 4 | 3.3 | 18.5 | - | -
x2 | I2 | 4 | 3.8 | 18.2 | - | -
x3 | I3 | 10, 11 | 1 | 1 | Id1 | 2 cells (I3)
x4 | I4 | 27, 28 | 2 | 4 | Id2 | 2 cells (I4)
x5 | I5 | 38, 39 | 3 | 9 | Id3 | 2 cells (I5)
x6 | I6 | 43, 44 | 4 | 16 | Id4 | 2 cells (I6)
x7 | I7 | 15, 16, 21, 22 | 3 | 12.5 | Id5 | 4 cells (I7)
A special case (group outliers)

Group outliers are a special case of outliers: they may exist in groups located away from the normal clusters. The outliers detected in the above steps are verified against their neighbors as follows,

– Case 1: The immediate neighboring cells all contain outliers.
– Case 2: An immediate neighbor is not an outlier; we check all of its neighborhood elements. If all the elements in its neighborhood match, for most of their neighbors, the threshold acquired for the other outliers, we declare them group outliers.
– Case 3: An immediate neighbor is not an outlier; we check all the immediate neighbors of that non-outlier object. The object is decided to be an inlier if most of its immediate neighbors are inliers.
A demonstration of entropy measurement in the proposed approach(es)
The entropy measurement is the backbone of the proposed method(s). To understand the underlying mechanism, consider the dataset shown in Fig. 5.

Fig. 5 Grid-based representation of a dataset

There are two clusters c1 and c2 in the dataset, represented by blue and green points, respectively. The hard clusters are obtained using a clustering algorithm, i.e., k-means for oval-shaped clusters and HDBSCAN for non-oval-shaped cluster datasets. The geometry of the different objects is grasped and stored in a hash table similar to Table 1. Objects with one or more non-empty nearest-neighbor cells are considered inliers; however, objects whose nearest neighboring cells are all empty are considered outliers in the first place. Next, the centroid of each cluster is computed. The distance between the centroid and each object belonging to that cluster is computed using the Euclidean distance metric as follows,
$$d(o_{c_i}, o_i) = \sqrt{\sum_{a=1}^{A} \left( o_{c_i}^{a} - o_i^{a} \right)^2} \tag{7}$$

Here o_ci is the centroid of cluster ci and oi is an object belonging to cluster ci; the attributes of each object are denoted by A. The next step is to scale the distances between zero and one. The objects are categorized based on their geometry. Each category of objects is removed in turn and the entropy of the complete dataset is calculated. The computed entropy is plotted against the geometry. The algorithm iterates over each specific geometry of objects, and its effect on the entropy of the dataset is recorded. The maximum change in the entropy compared to the initial entropy is taken as the boundary between inliers and outliers.

Similarly, the process can be repeated for cluster-based entropy analysis, which is more robust. The entropy of each cluster is calculated and plotted against the object geometries. We get an entropy-geometry elbow for each cluster and, finally, an optimally cleaned dataset.

Algorithm to grid the space
This section introduces an algorithm to logically grid the space. The Grid-Map algorithm is shown as Algorithm 1. It is used to grid a dataset and map its objects to the constructed grid. A scaled matrix D′M is the only input to the algorithm. The outputs include a grid-based representation of the data objects, the number of objects in each cell n[i], and a mapping of the dataset coordinates to the grid index I = {0, 1, …, p^A − 1}.

The algorithm receives a universal set U and a scaled matrix D′M of the distances between the objects of U. The first for loop normalizes the A attributes into the range [0, 1] for each object in U. The second for loop divides the normalized attributes into p parts, or equivalently the unit hypercube (in the multi-attribute case) into p^A grid cells (gridcell); the collection of all grid cells produces a grid. The third for loop counts the number of objects in each cell and stores it in the array n[i] corresponding to the index I. The final loop maps the coordinates of each instance to its position on the grid, as shown in Table 1. An object located in more than one cell is given an identifier, which removes the redundancy in the process of reverse mapping at a later stage. The algorithm returns a grid G, the number of objects in each cell n[i], and their mapping to the coordinate system.
Algorithm 1 Grid-Map algorithm

Input: A scaled matrix D′M from Algorithm 2 or 3, and p > 1
Output: A grid G, the number of objects in each cell n[i], and a mapping of the data into the grid

1: function Grid-Map(D′M)
2:   for each x_i ∈ U do
3:     Normalize the A attributes between 0 and 1
4:   end for
5:   for each a in A do
6:     Divide the unit interval [0, 1] into p parts (or equivalently the unit hypercube [0, 1]^A into p^A grid cells, gridcell) to obtain G
7:   end for
8:   for each i in p^A do
9:     n[i] = the number of objects in gridcell i
10:  end for
11:  for each x_i in U do
12:    Map the coordinates of x_i to a gridcell ∈ G
13:    Check for duplicate entries
14:    if x_i ∈ more than one cell then
15:      Mark it with an identifier
16:    end if
17:    Record the mapping
18:  end for
19:  return (G, n[i], Mapping)
20: end function

Algorithm for complete dataset-based EGO

This section elaborates the working mechanism of EGO (Algorithm 2) for a complete dataset. The input to the algorithm is a universal set with a finite number of instances. The output is a set of outliers OUT and a set of non-outliers NonOUT representing compact clusters. The algorithm starts by computing the scaled matrix D′M from the distance matrix DM of the objects in the universal set U. Line 2 is a function call; it takes a single argument and returns three values: an initial grid G representing the objects, the number of objects in each cell n[i] of G, and a mapping of the objects in the coordinate system to the cells of G. In line 3, the algorithm partitions the data using a hard clustering algorithm in the case of unlabeled data, or creates partitions from the classification labels. Line 4 calculates the entropy E_U of the complete dataset using Eq. 3. There are two for loops in this algorithm. In the first, lines 5-10, each cell in grid G is examined for being empty or having empty neighbors; such cells are excluded, and the objects in them are considered explicit outliers. Next, in lines 11-16, the algorithm enters a nested for loop that goes through the power set G_e, excludes the corresponding cells, and calculates the entropy. Line 17 determines the optimum entropy by considering a comparatively abrupt elbow. Line 18 accumulates the objects in the cells that cause the elbow. Lines 19 and 20 separate the outliers and inliers into the two sets OUT and NonOUT.

Algorithm 2 An algorithm for complete dataset-based EGO
Input: A universal set U = {x1, x2, x3, …, xn}
Output: The sets OUT and NonOUT of outliers and inliers

1: Compute a distance matrix DM from U
2: (G, n[i], Mapping) = Grid-Map(D′M)  // function call
3: Obtain initial partitions C = {c1, c2, c3, …, cK} using a hard clustering algorithm or the data labels
4: Calculate the entropy E_U of the dataset using Eq. 3
5: for each cell ∈ G do
6:   Examine the neighboring cells
7:   if all neighboring cells = ∅ ∨ cell = ∅ then
8:     The cell is empty or otherwise contains outliers
9:   end if
10: end for
11: for i in G_e do
12:   for each non-empty cell ∈ G do
13:     Exclude all the cells having i empty neighbors
14:   end for
15:   Calculate the entropy E_i of the dataset using Eq. 3
16: end for
17: Compare the E_i values for the optimum entropy drop E using the elbow
18: Detached = the cells causing the optimum entropy drop E
19: OUT = Mapping(Detached → objects)
20: NonOUT = Mapping(U − Detached → objects)
Algorithm for cluster-based EGO (EGO_CB)
Algorithm 3 works on individual clusters, and the result is accumulated at the end. A universal set with a finite number of instances is the input; the outputs are a set of outliers OUT and a set of non-outliers NonOUT representing the compact clusters. The algorithm computes a scaled matrix D′M for the objects in the universal set U. In line 2, a function with a single argument is called, which returns three values: an initial grid G representing the objects, the number of objects in each cell n[i] of G, and a mapping of the objects in the coordinate system to the cells of G. Line 3 creates the initial partition using a hard clustering algorithm or the data labels. Line 4 computes the entropy E_U of the complete dataset using Eq. 3. Two for loops are used in this algorithm. In the first, lines 5-10, each cell in grid G is examined for being empty or having empty neighbors; these cells and the objects inside them are considered explicit outliers. In lines 11-18, the algorithm enters a nested for loop: for each cluster c_k, it calculates the initial entropy of c_k, loops through the power set G_e, excludes the cells, recalculates the entropy of c_k, and ends by finding an optimum entropy E_i. Line 19 accumulates the objects in the cells that cause an elbow for each c_k. Lines 20 and 21 separate the outliers and inliers into the two sets OUT and NonOUT.
Algorithm for group outlier-based EGO (EGO_clique)

This section presents the EGO algorithm for detecting group outliers (Algorithm 4). The set of inliers NonOUT and the desired number of clusters n_d are the two inputs; the algorithm outputs a family of group outliers OUT_clique. The algorithm computes a scaled matrix D′M for the objects in the universal set U. Line 1 maps the given inliers to the coordinate system using Algorithm 1. Lines 2-9 loop over each cell in the grid G, followed by an if-else statement that checks for possible group outliers: a group outlier is suggested if all immediate neighbors contain outliers, or if a maximum number of the immediate neighbors' own immediate neighbors contain outliers; otherwise the objects are inliers. Line 10 checks the optimum entropy drop for each group, and for combinations of groups, to obtain the desired number of clusters. Line 11 removes the non-desired regions. Lines 12-14 decide the groups of outliers based on the optimum entropy.
Algorithm 3 An algorithm for cluster-based EGO

Input: A universal set U = {x1, x2, x3, …, xn}
Output: The sets OUT and NonOUT of outliers and inliers

1: Compute a distance matrix DM from U
2: (G, n[i], Mapping) = Grid-Map(D′M)  // function call
3: Obtain initial partitions C = {c1, c2, c3, …, cK} using a hard clustering algorithm or the data labels
4: Calculate the entropy E_U of the dataset using Eq. 3
5: for each cell ∈ G do
6:   Examine the neighboring cells
7:   if all neighboring cells = ∅ ∨ cell = ∅ then
8:     The cell is empty or otherwise contains outliers
9:   end if
10: end for
11: for each c_k ∈ C do
12:   Calculate the entropy E_U of c_k using Eq. 3
13:   for i in G_e do
14:     Exclude all the cells having i empty neighbors
15:     Calculate the entropy E_i of c_k using Eq. 3
16:   end for
17:   Compare the E_i values for the optimum entropy drop E using the elbow
18: end for
19: Detached = the cells causing the optimum entropy drop E for each c_k
20: OUT = Mapping(Detached → objects)
21: NonOUT = Mapping(U − Detached → objects)
Algorithm 4 An algorithm for group outlier-based EGO

Input: The set of inliers NonOUT from Algorithm 2 or Algorithm 3, and the desired number of clusters n_d
Output: The set OUT_clique representing the group outliers

1: Get the Mapping for U using Algorithm 1
2: for each cell ∈ G do
3:   if the immediate neighbors of cell contain outliers by the elbow ∨ ∃ neighbors whose own neighbors mostly contain outliers by the elbow then
4:     A group of outliers
5:   else if ∃ neighbors whose own neighbors mostly contain inliers then
6:     Inliers
7:     Calculate the entropy E_i using Eq. 3
8:   end if
9: end for
10: Check the optimum entropy drop E for each group, and for combinations of groups, to obtain the desired number of clusters
11: Detached = the removed cells
12: if E is optimum then
13:   OUT_clique = Mapping(Detached → objects)
14: end if
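A sketch of the neighbor-vote rule in lines 2-9 of Algorithm 4 (ours; the majority threshold of 0.5 is an assumption):

```python
import numpy as np

def is_group_outlier(outlier, i, j, thresh=0.5):
    """Decide whether cell (i, j) belongs to a group outlier.

    outlier : boolean grid marking cells already flagged by the elbow.
    A cell joins a group outlier if all immediate neighbors are flagged,
    or if most of its neighbors' neighbors are flagged (Cases 1-2);
    otherwise it is treated as an inlier (Case 3).
    """
    def neighbors(a, b):
        return [(a + di, b + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
                and 0 <= a + di < outlier.shape[0]
                and 0 <= b + dj < outlier.shape[1]]

    nb = neighbors(i, j)
    if all(outlier[a, b] for a, b in nb):                     # Case 1
        return True
    second = [outlier[a2, b2] for a, b in nb for a2, b2 in neighbors(a, b)]
    return np.mean(second) > thresh                           # Case 2 vs. Case 3
```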
Experimental results and evaluations
This section reports the results of our approach and its comparative analysis against some of the best outlier detection approaches. We measure the performance of our new algorithms EGO and EGO_CB in an unsupervised scenario. We use traditional clustering algorithms that perform cluster generation along with outlier detection; the proposed approaches are then applied to the initially purified clusters to investigate their capability for further outlier detection and cluster purification. These experiments are performed using five unlabeled and three labeled datasets. The unlabeled datasets are t4.8k, t5.8k, t8.8k, t7.10k and A3, while the labeled datasets are D31, 20newsgroup and 50-class Amazon reviews. The t4.8k, t5.8k and t8.8k datasets are bi-featured, containing 8,000 objects each, and the t7.10k dataset has 10,000 instances including many noise points (Karypis et al. 1999; McInnes et al. 2017). The datasets A3 and D31 are also bi-featured, containing 7,500 and 3,100 objects, respectively (Kärkkäinen and Fränti 2002; Veenman et al. 2002). The 20newsgroup (20NG) and 50-class Amazon reviews (50CR) are textual datasets (Chen and Liu 2014; Fei and Liu 2016). The 20NG dataset contains news on 20 topics, with around one thousand news items per topic (Lang 1995). The 50CR dataset contains Amazon reviews for 50 electronic products, with one thousand reviews per product. The initial clusters are obtained using the recent clustering algorithm HDBSCAN. We then apply EGO and EGO_CB to these initial clustering results to evaluate their performance.
Experiments for EGO and EGO_CB
We now evaluate the performance of the EGO and EGO_CB algorithms for cluster compactness in the presence of noise. Tables 2 and 3 show the detailed analysis of the proposed approaches based on internal validation measures, i.e., the Davies-Bouldin (DB) index, the Dunn index and the silhouette. The DB index is the ratio of within-cluster scatter to the separation between the clusters in a dataset. The Dunn index compares the within-cluster variance to the separation between the means of different clusters. The silhouette compares the cohesion or consistency within a cluster to the separation between clusters. Overall, these internal validation measures reflect within-cluster consistency, connectedness, and the distance or separation between clusters.

The unlabeled datasets are evaluated using DBSCAN and HDBSCAN for metric evaluation, and then Algorithms 2 and 3 are applied on top of the HDBSCAN algorithm, which yields a significant improvement in compactness, connectedness and separation between the different clusters by detecting and removing outliers. In the case of the labeled datasets, the initial partitions are obtained using the data labels. The proposed algorithms are then applied on top of the label-based partitions, improving the cluster or class compactness, the connectivity among objects and the separation between clusters while detecting outliers.

The results for DBSCAN, HDBSCAN and EGO are listed in Table 2. It may be noted that the initial partitions are obtained using HDBSCAN or the data labels. The EGO algorithm adds further within-cluster cohesion and separation between clusters.

Table 2 Comparison of DB index, Dunn index and silhouette for EGO
Dataset | Approach | DB index | Dunn index | Silhouette
t4.8k | DBSCAN | 1.030 ± 0.004 | 0.567 ± 0.001 | 0.309 ± 0.001
t4.8k | HDBSCAN | 1.020 ± 0.002 | 0.601 ± 0.001 | 0.316 ± 0.002
t4.8k | EGO | 1.010 ± 0.002 | 0.667 ± 0.002 | 0.341 ± 0.017
t5.8k | DBSCAN | 0.637 ± 0.012 | 0.663 ± 0.001 | 0.457 ± 0.012
t5.8k | HDBSCAN | 0.578 ± 0.003 | 0.696 ± 0.004 | 0.573 ± 0.014
t5.8k | EGO | 0.539 ± 0.002 | 0.741 ± 0.003 | 0.589 ± 0.013
t7.10k | DBSCAN | 0.631 ± 0.003 | 0.587 ± 0.002 | 0.424 ± 0.011
t7.10k | HDBSCAN | 0.619 ± 0.013 | 0.617 ± 0.012 | 0.452 ± 0.011
t7.10k | EGO | 0.588 ± 0.012 | 0.676 ± 0.011 | 0.499 ± 0.017
t8.8k | DBSCAN | 0.626 ± 0.010 | 0.587 ± 0.012 | 0.431 ± 0.011
t8.8k | HDBSCAN | 0.603 ± 0.010 | 0.631 ± 0.002 | 0.471 ± 0.003
t8.8k | EGO | 0.573 ± 0.015 | 0.689 ± 0.020 | 0.501 ± 0.023
A3 | DBSCAN | 0.611 ± 0.004 | 0.589 ± 0.005 | 0.510 ± 0.004
A3 | HDBSCAN | 0.593 ± 0.043 | 0.641 ± 0.027 | 0.527 ± 0.022
A3 | EGO | 0.525 ± 0.022 | 0.703 ± 0.011 | 0.632 ± 0.003
D31 | DBSCAN | 0.581 ± 0.030 | 0.614 ± 0.003 | 0.545 ± 0.034
D31 | Labeled | 0.548 ± 0.006 | 0.638 ± 0.002 | 0.576 ± 0.003
D31 | EGO | 0.410 ± 0.020 | 0.687 ± 0.003 | 0.671 ± 0.004
20NG | DBSCAN | 0.898 ± 0.041 | 0.510 ± 0.022 | 0.318 ± 0.021
20NG | Labeled | 0.861 ± 0.007 | 0.545 ± 0.003 | 0.339 ± 0.005
20NG | EGO | 0.723 ± 0.005 | 0.593 ± 0.002 | 0.401 ± 0.020
50CR | DBSCAN | 0.917 ± 0.013 | 0.503 ± 0.032 | 0.300 ± 0.001
50CR | Labeled | 0.894 ± 0.003 | 0.539 ± 0.002 | 0.302 ± 0.003
50CR | EGO | 0.782 ± 0.002 | 0.601 ± 0.001 | 0.396 ± 0.007
Table 3 Comparison of DB index, Dunn index and silhouette for EGO_CB

Dataset | Approach | DB index | Dunn index | Silhouette
t4.8k | DBSCAN | 1.030 ± 0.004 | 0.567 ± 0.001 | 0.309 ± 0.001
t4.8k | HDBSCAN | 1.020 ± 0.002 | 0.601 ± 0.001 | 0.316 ± 0.002
t4.8k | EGO_CB | 0.998 ± 0.002 | 0.693 ± 0.002 | 0.355 ± 0.019
t5.8k | DBSCAN | 0.637 ± 0.012 | 0.663 ± 0.001 | 0.457 ± 0.012
t5.8k | HDBSCAN | 0.578 ± 0.003 | 0.696 ± 0.004 | 0.573 ± 0.014
t5.8k | EGO_CB | 0.522 ± 0.002 | 0.756 ± 0.031 | 0.600 ± 0.015
t7.10k | DBSCAN | 0.631 ± 0.003 | 0.587 ± 0.002 | 0.424 ± 0.011
t7.10k | HDBSCAN | 0.619 ± 0.013 | 0.617 ± 0.012 | 0.452 ± 0.011
t7.10k | EGO_CB | 0.540 ± 0.004 | 0.693 ± 0.012 | 0.510 ± 0.016
t8.8k | DBSCAN | 0.626 ± 0.010 | 0.587 ± 0.012 | 0.431 ± 0.011
t8.8k | HDBSCAN | 0.603 ± 0.010 | 0.631 ± 0.002 | 0.471 ± 0.003
t8.8k | EGO_CB | 0.569 ± 0.017 | 0.694 ± 0.003 | 0.504 ± 0.024
A3 | DBSCAN | 0.611 ± 0.004 | 0.589 ± 0.005 | 0.510 ± 0.004
A3 | HDBSCAN | 0.593 ± 0.043 | 0.641 ± 0.027 | 0.527 ± 0.022
A3 | EGO_CB | 0.500 ± 0.024 | 0.723 ± 0.017 | 0.641 ± 0.003
D31 | DBSCAN | 0.581 ± 0.030 | 0.614 ± 0.003 | 0.545 ± 0.034
D31 | Labeled | 0.548 ± 0.006 | 0.638 ± 0.002 | 0.576 ± 0.003
D31 | EGO_CB | 0.400 ± 0.021 | 0.691 ± 0.004 | 0.677 ± 0.004
20NG | DBSCAN | 0.898 ± 0.041 | 0.510 ± 0.022 | 0.318 ± 0.021
20NG | Labeled | 0.861 ± 0.007 | 0.545 ± 0.003 | 0.339 ± 0.005
20NG | EGO_CB | 0.710 ± 0.003 | 0.599 ± 0.003 | 0.405 ± 0.012
50CR | DBSCAN | 0.917 ± 0.013 | 0.503 ± 0.032 | 0.300 ± 0.001
50CR | Labeled | 0.894 ± 0.003 | 0.539 ± 0.002 | 0.302 ± 0.003
50CR | EGO_CB | 0.771 ± 0.002 | 0.617 ± 0.002 | 0.404 ± 0.001
The loose instances are detached, and similar instances form a separate cluster. Among the CHAMELEON datasets, t5.8k is affected the most, whereas t4.8k shows a small improvement after the application of EGO. EGO detected more outliers for the A3 dataset, enhancing its compactness and yielding the largest improvement among the unlabeled datasets. The labeled datasets include D31 and the two textual datasets. The performance of EGO on the labeled datasets is comparatively higher among the considered datasets. The textual datasets are highly congested and therefore require careful observation for outlier detection. The proposed approaches detect the outliers in a unique way, avoiding space loss and reducing complexity.

Table 3 reports the results for EGO_CB in terms of three evaluation metrics, namely the DB index, Dunn index and silhouette coefficient. It may be noted that for all datasets the results are improved by eliminating more outliers. The compactness of the initial clusters and the outlier detection performance of EGO_CB are slightly higher than those of EGO. The outliers are more precisely detected, the clusters are further cleaned, and more compactness is added to each cluster. The EGO_CB performance on the labeled datasets is more pronounced due to their highly congested nature.
Visual demonstration of the proposed approach
This section visually demonstrates the performance of the proposed approach EGO_CB. In particular, the cluster representation in the presence of noise is elaborated using HDBSCAN and EGO_CB. The visuals show the superior performance of EGO_CB over the highly reputable HDBSCAN algorithm. For the visual demonstration, we consider two unlabeled datasets and one labeled dataset. In the case of the unlabeled datasets t5.8k and t7.10k, the initial clusters are obtained using HDBSCAN, while for the labeled dataset D31 the primary partitions are obtained using the data labels.

Figure 6 shows the visuals for the unlabeled dataset t5.8k, for which the initial labels are obtained as in a supervised setting. The clusters obtained after HDBSCAN are shown in Fig. 6(b)-(c). There are six clusters in this dataset (represented by different colors), with noise all around them represented by black points. We may observe that the compactness of these clusters is remarkably improved by EGO_CB through the elimination of more outliers, as shown in Fig. 6(d). The last panel, Fig. 6(e), shows the most related instances of all six clusters; these instances are most likely to increase cluster consistency.
Fig. 6 Visual results of cluster-based EGO for the t5.8k dataset
Next, we turn to another unlabeled dataset, t7.10k, shown in Fig. 7(a). Figure 7(b) shows the HDBSCAN process of detecting outliers for the sake of compact cluster formation. Figure 7(c) visualizes EGO_CB detecting outliers from scratch, which shows its ability as a standalone method. There are nine clusters in this dataset (represented by different colors), with noise all around them represented by black points. We may note that EGO_CB detects more outliers and improves cluster quality, as shown in Fig. 7(c). Figure 7(d) shows the nine most compact clusters; these clusters are highly cohesive, with maximum distance from each other.
We now look into the performance of the EGO_CB algorithm on the labeled dataset D31, shown in Fig. 8(a). This labeled dataset arranges 3,100 instances into thirty-one classes, as shown in Fig. 8(b). In the case of a labeled dataset, the partitions are obtained using the data labels without applying any clustering algorithm; EGO_CB is then applied on top of the label-separated dataset. The classes become condensed and further apart due to the application of EGO_CB, as shown in Fig. 8(c). The loosely attached instances, especially those at the boundaries, are detached from the clusters. Figure 8(d) shows the instances belonging to each cluster or class. The compactness of these clusters can be observed visually and quantitatively in Tables 2 and 3.
Comparative analysis of the proposed approach(es)
The comparative analysis of outlier detection algorithms is a tangled and chaotic task. This is because of some basic issues, including the unavailability of suitable datasets with proper ground truth, the selection of optimal parameters for outlier determination, and the choice of how outliers are discriminated from the rest of the data (Campos et al. 2016a; Xu et al. 2018). Despite these difficulties, we compared the proposed method with well-known approaches to assess its effectiveness under the same set of conditions.
Fig. 7 Visual results of cluster-based EGO for the t7.10k dataset

Fig. 8 Visual results of cluster-based EGO for the D31 dataset
It may be noted that the proposed approach has the advantage of additional outlier detection because it is applied on top of improved hard clusters. Besides this, for a fair comparison, we also consider the self-contained nature of the proposed approach. The datasets used in this set of experiments are Cardiotocography and Pima Indian Diabetes (Campos et al. 2016a). These datasets are semantically meaningful, i.e., they reflect real-world data that include instances deviating from the normal pattern (for example, a dataset of players with a particular class of badminton players). The Cardiotocography dataset contains three classes of individually collected data from people screened for heart disease: normal, suspected, and pathological patients. The normal people are inliers, while the suspected and pathological classes are considered outliers. The Pima Indian Diabetes dataset has two classes, containing normal people and patients with diabetes; the normal people are inliers and the diabetic patients are considered outliers. To reflect the rare nature of outliers in the considered datasets, we randomly downsampled the outlier classes to 5%, 10%, 15% and 20% (Campos et al. 2016a). The experiments are repeated ten times and the mean of all results is taken. The following measures evaluate the performance of the proposed approach(es) (Ali et al. 2021; Campos et al. 2016a; Xu et al. 2018),
$$\text{Accuracy} = \frac{\text{Correctly detected outliers and inliers}}{\text{Total objects}}, \tag{8}$$

$$\text{Precision} = \frac{\text{Correctly detected outliers}}{\text{Total identified outliers}}, \tag{9}$$

$$\text{Recall} = \frac{\text{Correctly detected outliers}}{\text{Total actual outliers}}, \tag{10}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{11}$$
These equations express how accurately EGO and EGO_CB classify a given instance as an outlier or inlier. In particular, precision shows the proportion of predicted outliers that are actually outliers, i.e., how many of the predicted outliers (true outliers plus false outliers) are true outliers. Recall shows the proportion of actual outliers that are correctly identified. Accuracy shows the proportion of correctly classified instances, both outliers and inliers. The F1 score summarizes the performance in a single metric and is therefore computed as the harmonic mean of precision and recall. The detailed results are reported in Tables 4 and 5.
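For concreteness, Eqs. 8-11 computed from raw confusion counts (our sketch, with hypothetical counts):

```python
def outlier_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 (Eqs. 8-11) from confusion counts,
    treating 'outlier' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(outlier_metrics(tp=18, fp=5, tn=170, fn=7))  # hypothetical counts
```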
The performance of the proposed approach(es) is compared with six well-known outlier detection approaches: RE3WC (reduction- and elevation-based three-way clustering (Ali et al. 2021)), LOF (local outlier factor (Breunig et al. 2000)), LoOP (local outlier probabilities (Kriegel et al. 2009)), ABOD (angle-based outlier detection (Kriegel et al. 2008)), CBLOF (cluster-based local outlier factor (He et al. 2003)) and HBOS (histogram-based outlier score (Goldstein and Dengel 2012)). These approaches are based on a criterion-based ranking of each object to detect outliers.

Table 4 shows the results for the 5% and 10% downsampled outlier classes in the Cardiotocography and Pima Indian datasets. The results for both EGO and EGO_CB are listed at the top of each comparison. The listed values for the different metrics are the means of ten repeated experiments for each approach. In the case of the Pima Indian Diabetes dataset, the accuracy of EGO for 5% and 10% is the same as that of RE3WC, while the proportion of actual outliers classified correctly is slightly lower for 10% downsampled outliers compared to RE3WC. EGO_CB works on a cluster basis and therefore has the maximum accuracy for both 5% and 10% downsampled outliers. The precision values of LoOP are higher for 5% and very close for 10% downsampled outliers on the Pima Indian Diabetes dataset. In the case of the Cardiotocography dataset, the proposed approaches perform better in comparison. The recall values of ABOD for 5% and 10%, and of CBLOF for 10% downsampled outliers, are higher compared to the proposed approaches. The F1 score of the proposed approaches is always better than that of the participating outlier detection algorithms for 5% and 10% downsampled outliers.

Table 5 shows the results for the 15% and 20% downsampled outlier classes in the Cardiotocography and Pima Indian datasets. The computed metric values for EGO and EGO_CB are listed at the top of each comparison. The tabulated values are the means of ten repeated experiments for each approach. In the case of the Pima Indian Diabetes dataset, the proposed approaches perform better except for precision: the precision value of LOF for 15% is slightly higher compared to EGO, and RE3WC, LOF and ABOD beat both EGO and EGO_CB for 20% downsampled outliers. In the case of the Cardiotocography dataset, EGO_CB outperforms all the compared approaches, while EGO is beaten only in precision. The F1 scores show that the proposed approaches consistently produce better results compared to the participating approaches and can therefore be considered a better solution for outlier detection.
Experiments on environmental monitoring data
This section presents the experiments related to the detection capability of the proposed approach on environmental monitoring data. These experiments are conducted on both real and synthetically prepared datasets. First, the attributes in the dataset are analyzed based on their correlation.
Table 4 Results on the Pima Indian Diabetes and Cardiotocography datasets with 5% and 10% downsampled outliers

5% outliers

Approach | Pima: Accuracy | Recall | Precision | F1-Score | Cardiotocography: Accuracy | Recall | Precision | F1-Score
EGO_CB | 0.91 ± 0.02 | 0.53 ± 0.03 | 0.55 ± 0.01 | 0.54 ± 0.02 | 0.94 ± 0.04 | 0.57 ± 0.03 | 0.58 ± 0.03 | 0.57 ± 0.02
EGO | 0.87 ± 0.01 | 0.47 ± 0.02 | 0.51 ± 0.02 | 0.49 ± 0.02 | 0.91 ± 0.03 | 0.51 ± 0.03 | 0.49 ± 0.02 | 0.50 ± 0.02
RE3WC | 0.87 ± 0.01 | 0.43 ± 0.04 | 0.50 ± 0.03 | 0.46 ± 0.04 | 0.88 ± 0.01 | 0.51 ± 0.01 | 0.29 ± 0.01 | 0.37 ± 0.13
LOF | 0.86 ± 0.02 | 0.08 ± 0.05 | 0.31 ± 0.08 | 0.13 ± 0.07 | 0.88 ± 0.01 | 0.51 ± 0.01 | 0.29 ± 0.01 | 0.37 ± 0.13
LoOP | 0.88 ± 0.09 | 0.10 ± 0.03 | 0.70 ± 0.31 | 0.17 ± 0.09 | 0.93 ± 0.03 | 0.22 ± 0.02 | 0.53 ± 0.04 | 0.31 ± 0.02
ABOD | 0.85 ± 0.02 | 0.29 ± 0.08 | 0.37 ± 0.05 | 0.32 ± 0.04 | 0.89 ± 0.03 | 0.62 ± 0.02 | 0.34 ± 0.02 | 0.44 ± 0.02
CBLOF | 0.75 ± 0.07 | 0.13 ± 0.04 | 0.11 ± 0.04 | 0.12 ± 0.04 | 0.86 ± 0.02 | 0.42 ± 0.02 | 0.21 ± 0.01 | 0.28 ± 0.07
HBOS | 0.71 ± 0.08 | 0.14 ± 0.09 | 0.09 ± 0.03 | 0.11 ± 0.07 | 0.85 ± 0.03 | 0.22 ± 0.07 | 0.14 ± 0.02 | 0.17 ± 0.02

10% outliers

Approach | Pima: Accuracy | Recall | Precision | F1-Score | Cardiotocography: Accuracy | Recall | Precision | F1-Score
EGO_CB | 0.87 ± 0.07 | 0.54 ± 0.06 | 0.73 ± 0.07 | 0.62 ± 0.08 | 0.91 ± 0.06 | 0.51 ± 0.05 | 0.66 ± 0.06 | 0.58 ± 0.05
EGO | 0.84 ± 0.08 | 0.44 ± 0.05 | 0.69 ± 0.06 | 0.54 ± 0.05 | 0.89 ± 0.06 | 0.51 ± 0.05 | 0.55 ± 0.07 | 0.53 ± 0.06
RE3WC | 0.84 ± 0.09 | 0.45 ± 0.09 | 0.69 ± 0.03 | 0.54 ± 0.02 | 0.89 ± 0.02 | 0.48 ± 0.01 | 0.55 ± 0.01 | 0.51 ± 0.01
LOF | 0.80 ± 0.08 | 0.16 ± 0.02 | 0.68 ± 0.11 | 0.26 ± 0.04 | 0.88 ± 0.03 | 0.24 ± 0.03 | 0.59 ± 0.03 | 0.34 ± 0.03
LoOP | 0.81 ± 0.05 | 0.19 ± 0.02 | 0.71 ± 0.07 | 0.30 ± 0.09 | 0.88 ± 0.04 | 0.25 ± 0.03 | 0.56 ± 0.04 | 0.34 ± 0.02
ABOD | 0.80 ± 0.09 | 0.33 ± 0.02 | 0.53 ± 0.03 | 0.41 ± 0.07 | 0.87 ± 0.02 | 0.65 ± 0.08 | 0.50 ± 0.03 | 0.56 ± 0.07
CBLOF | 0.73 ± 0.08 | 0.15 ± 0.02 | 0.26 ± 0.04 | 0.19 ± 0.03 | 0.86 ± 0.03 | 0.58 ± 0.03 | 0.47 ± 0.02 | 0.52 ± 0.09
HBOS | 0.74 ± 0.09 | 0.26 ± 0.05 | 0.35 ± 0.03 | 0.30 ± 0.02 | 0.84 ± 0.03 | 0.26 ± 0.08 | 0.37 ± 0.02 | 0.31 ± 0.02
Table 5 Results on Pima Indian Diabetes and Cardiotocography datasets with 15% and 20% downsampled outliers

15% outliers

| Approach | Pima Accuracy | Pima Recall | Pima Precision | Pima F1-Score | Cardio Accuracy | Cardio Recall | Cardio Precision | Cardio F1-Score |
|---|---|---|---|---|---|---|---|---|
| EGO_CB | 0.74 ± 0.07 | 0.37 ± 0.03 | 0.59 ± 0.05 | 0.45 ± 0.05 | 0.89 ± 0.07 | 0.61 ± 0.04 | 0.76 ± 0.03 | 0.68 ± 0.05 |
| EGO | 0.72 ± 0.08 | 0.31 ± 0.04 | 0.55 ± 0.03 | 0.40 ± 0.04 | 0.85 ± 0.05 | 0.50 ± 0.05 | 0.66 ± 0.04 | 0.57 ± 0.06 |
| RE3WC | 0.71 ± 0.08 | 0.29 ± 0.04 | 0.50 ± 0.02 | 0.36 ± 0.04 | 0.84 ± 0.08 | 0.45 ± 0.04 | 0.57 ± 0.02 | 0.51 ± 0.07 |
| LOF | 0.72 ± 0.08 | 0.14 ± 0.05 | 0.58 ± 0.04 | 0.23 ± 0.02 | 0.85 ± 0.04 | 0.25 ± 0.01 | 0.73 ± 0.12 | 0.37 ± 0.08 |
| LoOP | 0.70 ± 0.07 | 0.19 ± 0.02 | 0.48 ± 0.09 | 0.27 ± 0.07 | 0.84 ± 0.02 | 0.27 ± 0.01 | 0.61 ± 0.08 | 0.37 ± 0.09 |
| ABOD | 0.68 ± 0.07 | 0.32 ± 0.04 | 0.45 ± 0.04 | 0.37 ± 0.05 | 0.85 ± 0.09 | 0.59 ± 0.06 | 0.57 ± 0.05 | 0.58 ± 0.05 |
| CBLOF | 0.64 ± 0.09 | 0.25 ± 0.06 | 0.34 ± 0.05 | 0.28 ± 0.01 | 0.82 ± 0.05 | 0.42 ± 0.06 | 0.51 ± 0.07 | 0.47 ± 0.09 |
| HBOS | 0.70 ± 0.06 | 0.26 ± 0.08 | 0.46 ± 0.04 | 0.33 ± 0.03 | 0.79 ± 0.03 | 0.26 ± 0.04 | 0.38 ± 0.07 | 0.31 ± 0.07 |

20% outliers

| Approach | Pima Accuracy | Pima Recall | Pima Precision | Pima F1-Score | Cardio Accuracy | Cardio Recall | Cardio Precision | Cardio F1-Score |
|---|---|---|---|---|---|---|---|---|
| EGO_CB | 0.72 ± 0.04 | 0.31 ± 0.08 | 0.40 ± 0.04 | 0.35 ± 0.05 | 0.88 ± 0.05 | 0.58 ± 0.03 | 0.83 ± 0.04 | 0.68 ± 0.05 |
| EGO | 0.69 ± 0.05 | 0.29 ± 0.04 | 0.38 ± 0.05 | 0.33 ± 0.04 | 0.84 ± 0.07 | 0.52 ± 0.03 | 0.72 ± 0.04 | 0.60 ± 0.03 |
| RE3WC | 0.69 ± 0.04 | 0.26 ± 0.09 | 0.64 ± 0.05 | 0.37 ± 0.06 | 0.84 ± 0.09 | 0.43 ± 0.06 | 0.75 ± 0.08 | 0.55 ± 0.05 |
| LOF | 0.65 ± 0.09 | 0.06 ± 0.01 | 0.53 ± 0.08 | 0.11 ± 0.09 | 0.79 ± 0.03 | 0.10 ± 0.08 | 0.82 ± 0.12 | 0.17 ± 0.07 |
| LoOP | 0.67 ± 0.09 | 0.23 ± 0.04 | 0.58 ± 0.05 | 0.33 ± 0.06 | 0.81 ± 0.03 | 0.26 ± 0.07 | 0.73 ± 0.12 | 0.39 ± 0.07 |
| ABOD | 0.68 ± 0.04 | 0.24 ± 0.01 | 0.63 ± 0.05 | 0.35 ± 0.09 | 0.82 ± 0.03 | 0.52 ± 0.05 | 0.63 ± 0.06 | 0.57 ± 0.03 |
| CBLOF | 0.61 ± 0.04 | 0.17 ± 0.05 | 0.34 ± 0.08 | 0.23 ± 0.01 | 0.79 ± 0.09 | 0.38 ± 0.05 | 0.56 ± 0.03 | 0.45 ± 0.07 |
| HBOS | 0.61 ± 0.07 | 0.17 ± 0.05 | 0.34 ± 0.02 | 0.23 ± 0.04 | 0.71 ± 0.03 | 0.10 ± 0.06 | 0.22 ± 0.07 | 0.14 ± 0.05 |
Table 6 Results on environmental monitoring data for a sample size of 2000 with 10% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 20 | 18 | 5 |
| Carbon monoxide | 24 | 23 | 6 |
| Carbon dioxide | 22 | 21 | 7 |
| Lead | 25 | 23 | 6 |
| Nitrogen monoxide | 24 | 22 | 7 |
| Nitrogen dioxide | 22 | 21 | 8 |
| Ground-level ozone | 21 | 19 | 4 |
| Sulfur monoxide | 22 | 20 | 3 |
| Sulfur dioxide | 20 | 18 | 3 |
computed based on the correlation coefficient. The correlation matrix defines the linear relationship between each pair of variables. Mathematically, it is defined as follows:

$$\text{Matrix} = \begin{bmatrix} P(a,a) & P(a,b) \\ P(b,a) & P(b,b) \end{bmatrix} \tag{12}$$

$$\rho = \frac{\operatorname{Cov}(a,b)}{\sigma_a \, \sigma_b} \tag{13}$$

Eqs. 12 and 13 give the correlation between two attributes, a and b. Here, $\sigma_a$ and $\sigma_b$ denote the standard deviations of a and b, and $\operatorname{Cov}(a,b)$ represents their covariance.
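To make Eqs. 12 and 13 concrete, the following minimal sketch computes the pairwise Pearson correlation for two simulated air-quality attributes. The series names and magnitudes are illustrative assumptions, not values from this study.

```python
# Minimal sketch of Eqs. 12-13: pairwise Pearson correlation between two
# simulated air-quality attributes (illustrative data, not from the paper).
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Eq. 13: rho = Cov(a, b) / (sigma_a * sigma_b)."""
    cov = np.cov(a, b, bias=True)[0, 1]      # population covariance Cov(a, b)
    return float(cov / (a.std() * b.std()))  # normalise by the standard deviations

rng = np.random.default_rng(0)
co = rng.normal(0.5, 0.1, 2000)               # hypothetical CO readings
no2 = 0.6 * co + rng.normal(0.0, 0.05, 2000)  # partially correlated NO2 readings

matrix = np.corrcoef(co, no2)                 # Eq. 12: the 2x2 correlation matrix
assert np.isclose(matrix[0, 1], pearson(co, no2))
print(matrix)
```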
The correlation between the various components determines the balance of components in fresh air. Secondly, it is observed that a certain amount of contamination is tolerable until it reaches a severe level, and this rule was strictly followed while computing the correlations between the various air components. Contamination was added synthetically to observe the outlier detection performance of the proposed approach, which was then employed on the contaminated dataset. The results are tabulated in Tables 6, 7, and 8; the injection step is sketched below.
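The synthetic-contamination step can be sketched as follows. The pollutant series, contamination fraction, and shift magnitude (four standard deviations) are illustrative assumptions, not the exact procedure used in this study.

```python
# Hedged sketch of synthetic contamination: shift a chosen fraction of
# readings far from the normal pattern and remember where they were injected.
import numpy as np

def contaminate(series: np.ndarray, fraction: float, rng, scale: float = 4.0):
    """Return a contaminated copy of `series` and the injected indices."""
    out = series.copy()
    n_out = int(len(series) * fraction)
    idx = rng.choice(len(series), size=n_out, replace=False)
    out[idx] += scale * series.std() * rng.choice([-1.0, 1.0], size=n_out)
    return out, idx

rng = np.random.default_rng(1)
so2 = rng.normal(20.0, 2.0, 2000)                  # sample size 2000, as in Table 6
so2_dirty, true_idx = contaminate(so2, 0.10, rng)  # 10% contamination
```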
Table 7 Results on environmental monitoring data for a sample size of 2000 with 15% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 30 | 28 | 5 |
| Carbon monoxide | 30 | 28 | 5 |
| Carbon dioxide | 40 | 39 | 6 |
| Lead | 36 | 35 | 6 |
| Nitrogen monoxide | 25 | 23 | 5 |
| Nitrogen dioxide | 35 | 33 | 4 |
| Ground-level ozone | 35 | 32 | 2 |
| Sulfur monoxide | 36 | 33 | 1 |
| Sulfur dioxide | 33 | 31 | 2 |

Table 8 Results on environmental monitoring data for a sample size of 2000 with 20% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 50 | 49 | 5 |
| Carbon monoxide | 40 | 38 | 4 |
| Carbon dioxide | 40 | 38 | 4 |
| Lead | 40 | 39 | 6 |
| Nitrogen monoxide | 40 | 37 | 3 |
| Nitrogen dioxide | 40 | 39 | 6 |
| Ground-level ozone | 40 | 40 | 8 |
| Sulfur monoxide | 55 | 53 | 5 |
| Sulfur dioxide | 55 | 54 | 4 |
Table 6 presents the results for EGO on environmental monitoring data using a sample size of 2000 with 10% contamination. The detection rates for carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 90.00%, 95.83%, 95.45%, 92.00%, 91.67%, 95.45%, 90.48%, 90.91%, and 90.00%, respectively. The overall performance for 10% added contamination is 92.42%. There is a relationship between detected outliers and false detections: a higher outlier detection rate also raises the error rate. This indicates that these outliers are very sparse and share boundaries with the non-outlier components of the air.
Table 7 shows the results of the proposed approach on environmental monitoring data using a sample size of 2000 with 15% contamination. The detection rates for the contaminations carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 93.33%, 93.33%, 97.50%, 97.22%, 92.00%, 94.29%, 91.43%, 91.67%, and 93.94%, respectively. The overall performance for 15% added contamination is 93.86%. The results improved as more contamination was added. The same relationship between detection and error rate can be observed: as the number of true positives increases, false positives increase as well. This indicates that these outliers are far from each other but lie near the overlapping boundaries of the non-outlier components in the air.
Table 8 tabulates the performance of the proposed approach on environmental monitoring data using a sample size of 2000 with 20% contamination. The detection rates for the contaminations carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 98.00%, 95.00%, 95.00%, 97.50%, 92.50%, 97.50%, 100.00%, 96.36%, and 98.18%, respectively. The overall performance for 20% added contamination is 96.67%. The results keep improving as more contamination is added. The observations are the same: there is a relationship between detection and error rate, and an increase in the detection rate increases the error rate, since an increase in the number of true positives may also increase the false positives. This suggests that the outliers with high detection rates lie close to each other, with boundaries overlapping the non-contaminated components.
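The per-pollutant rates and overall scores quoted above appear to be detected/injected ratios averaged across pollutants; the short sketch below reproduces the 92.42% figure for Table 6 under that assumption.

```python
# Minimal sketch reproducing the Table 6 figures, assuming each rate is
# (detected / injected) and the overall score is the mean of those rates.
injected = {"Carbon": 20, "Carbon monoxide": 24, "Carbon dioxide": 22,
            "Lead": 25, "Nitrogen monoxide": 24, "Nitrogen dioxide": 22,
            "Ground-level ozone": 21, "Sulfur monoxide": 22, "Sulfur dioxide": 20}
detected = {"Carbon": 18, "Carbon monoxide": 23, "Carbon dioxide": 21,
            "Lead": 23, "Nitrogen monoxide": 22, "Nitrogen dioxide": 21,
            "Ground-level ozone": 19, "Sulfur monoxide": 20, "Sulfur dioxide": 18}

rates = {k: 100.0 * detected[k] / injected[k] for k in injected}
overall = sum(rates.values()) / len(rates)
print(f"Carbon: {rates['Carbon']:.2f}%  overall: {overall:.2f}%")  # 90.00%, 92.42%
```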
Conclusion

Entropy is used in many machine learning applications, and the concept of randomness applies wherever uncertainty exists. We introduced EGO, which combines two concepts: grids and entropy. The grid distributes the data effectively and facilitates analysis, while entropy measures randomness. In this study, entropy is treated as a distance function whose value drops when a member is separated from its respective cluster. EGO is applied on top of hard clusters obtained from a hard clustering algorithm or from data labels. Explicit outliers are removed first; entropy is then computed over the whole dataset to establish its initial randomness, and the remaining sparse instances are examined for implicit outliers. The method is flexible and can also be applied per cluster; this cluster-based variant, EGO_CB, has higher time complexity in exchange for improved results. The proposed approach(es) are comparatively analyzed against well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Experimental results indicate that EGO and EGO_CB effectively detect outliers while producing compact clusters, and that the clustering results of other algorithms on noise-polluted datasets can be further refined by EGO and EGO_CB. In particular, our approach(es) detect an additional 4.5% to 8.6% of outliers that remain undetected by the well-known approaches. Moreover, outlier detection in environmental data analysis is carried out as a case study of the proposed approach. The experiments suggest that the proposed approach(es) may be considered a suitable choice for obtaining compact clusters in the presence of noise, while their robustness on environmental data encourages industrial adoption.
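For readers who wish to experiment with the idea, the following sketch illustrates the flavor of the two ingredients, a grid that bins the data and an entropy score that reacts to removing a point. It is a simplified illustration under an assumed grid resolution, not the authors' exact EGO procedure.

```python
# Simplified illustration (not the exact EGO algorithm): bin the data on a
# grid and score each point by how much its removal changes the Shannon
# entropy of the cell-occupancy distribution. bins=10 is an assumed parameter.
import numpy as np

def occupancy_entropy(points: np.ndarray, bins: int = 10) -> float:
    """Shannon entropy of the grid-cell occupancy distribution."""
    hist, _ = np.histogramdd(points, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]                        # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def entropy_change_scores(points: np.ndarray, bins: int = 10) -> np.ndarray:
    """Deviation score per point: entropy change caused by removing it."""
    base = occupancy_entropy(points, bins)
    return np.array([abs(occupancy_entropy(np.delete(points, i, axis=0), bins) - base)
                     for i in range(len(points))])

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (500, 2)), [[8.0, 8.0]]])  # one isolated point
print(entropy_change_scores(data).argmax())  # expected: 500, the isolated point
```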
Author contribution Conceptualization: Anwar Shah and Bahar Ali. Methodology: Anwar Shah, Bahar Ali, and Kassian T.T. Amesho. Software: Anwar Shah, Bahar Ali, Fazal Wahab, and Inam Ullah. Validation: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Formal analysis: Anwar Shah. Investigation: Anwar Shah, Bahar Ali, Inam Ullah, and Muhammad Shafiq. Resources: Anwar Shah and Inam Ullah. Data curation: Anwar Shah, Fazal Wahab, and Inam Ullah. Writing - original draft preparation: Anwar Shah. Writing - review and editing: Anwar Shah and Bahar Ali. Visualization: Anwar Shah and Bahar Ali. Supervision: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Project administration: Anwar Shah, Inam Ullah, Kassian T.T. Amesho, and Muhammad Shafiq. Funding acquisition: Kassian T.T. Amesho, Muhammad Shafiq, Shahid Anwar, and Ahyoung Choi.

Funding This work is supported by the National Natural Science Foundation of China, Grant No. 62250410365.

Availability of data and materials The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Declarations
Ethics statement The studies involving human participants were
reviewed and approved by the National Natural Science Foundation
of China, Grant No. 62250410365. The ethics committee waived the
requirement of written informed consent for participation. Written
informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this
article.
Consent to participate Informed consent was obtained from all individual participants included in the study.
Consent for publication The authors of the article “Entropy Based Grid
Approach for Handling Outliers: A Case Study to Environmental Monitoring Data” give consent for the publication of all the identifiable
details, including text, images, and materials, to be published in the
journal Environmental Science and Pollution Research.
Conflict of interest The authors declare no competing interests.
References
Agrawal R, Gehrke J, Gunopulos D, et al (1998) Automatic subspace
clustering of high dimensional data for data mining applications.
In: Proceedings of the international conference on Management of
data. pp 94–105
Alameddine I, Kenney MA, Gosnell RJ et al (2010) Robust multivariate
outlier detection methods for environmental data. J Environ Eng
136(11):1299–1304
Ali B, Azam N, Shah A et al (2021) A spatial filtering inspired three-way clustering approach with application to outlier detection. Int J Approx Reason 130:1–21
Amini A, Wah TY, Saboohi H (2014) On density-based data streams
clustering algorithms: A survey. J Comput Sci Technol 29(1):116–
141
Andersson JL, Graham MS, Zsoldos E et al (2016) Incorporating outlier detection and replacement into a non-parametric framework
for movement and distortion correction of diffusion mr images.
NeuroImage 141:556–572
Bai M, Wang X, Xin J et al (2016) An efficient algorithm for distributed
density-based outlier detection on big data. Neurocomputing
181:19–28
Batra R, Ko KI (1992) An adaptive mesh refinement technique for the
analysis of shear bands in plane strain compression of a thermoviscoplastic solid. Comput Mech 10(6):369–379
Benesty J, Chen J, Huang Y, et al (2009) Pearson correlation coefficient.
In: Noise reduction in speech processing. Springer, p 1–4
Berger MJ, Oliger J (1984) Adaptive mesh refinement for hyperbolic
partial differential equations. J Comput Phys 53(3):484–512
Berger MJ, Colella P et al (1989) Local adaptive mesh refinement for
shock hydrodynamics. J Comput Phys 82(1):64–84
Bharti S, Pattanaik K, Pandey A (2019) Contextual outlier detection for
wireless sensor networks. J Ambient Intell Humanized Comput
1–20
Birant D, Kut A (2007) St-dbscan: An algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221
Blythe J, Jain S, Deelman E et al (2005) Task scheduling strategies
for workflow-based applications in grids. In: IEEE International
Symposium on Cluster Computing and the Grid, vol 2005. pp 759–
767
Borah B, Bhattacharyya D (2004) An improved sampling-based dbscan
for large spatial databases. In: Proceedings of the International
conference on intelligent sensing and information processing. pp
92–96
Breunig MM, Kriegel HP, Ng RT, et al (2000) Lof: identifying density-based local outliers. In: Proceedings of the international conference on Management of data. pp 93–104
Campello RJ, Moulavi D, Sander J (2013) Density-based clustering
based on hierarchical density estimates. In: Proceedings of the
Pacific-Asia conference on knowledge discovery and data mining.
pp 160–172
Campos GO, Zimek A, Sander J et al (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical
study. Data Min Knowl Discov 30(4):891–927
Chen J, Sathe S, Aggarwal C, et al (2017) Outlier detection with autoencoder ensembles. In: Proceedings of the international conference
on data mining. pp 90–98
Chen Z, Liu B (2014) Mining topics in documents: standing on the
shoulders of big data. In: Proceedings of the international conference on Knowledge discovery and data mining. pp 1116–1125
Christy A, Gandhi GM, Vaithyasubramanian S (2015) Cluster based
outlier detection algorithm for healthcare data. Procedia Comput
Sci 50:209–215
Duan L, Xu L, Guo F et al (2007) A local-density based spatial clustering
algorithm with noise. Inf Syst 32(7):978–986
Eiseman PR (1987) Adaptive grid generation. Comput Methods Appl
Mech Eng 64(1–3):321–376
Erskine RH, Green TR, Ramirez JA, et al (2006) Comparison of grid-based algorithms for computing upslope contributing area. Water Resour Res 42(9)
Ester M, Kriegel HP, Sander J, et al (1996) A density-based algorithm
for discovering clusters in large spatial databases with noise. In:
Knowledge Discovery and Data Mining. pp 226–231
Fakhari A, Lee T (2014) Finite-difference lattice boltzmann method
with a block-structured adaptive-mesh-refinement technique. Phys
Rev E 89(3):033310
Fei G, Liu B (2016) Breaking the closed world assumption in text classification. In: Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. pp 506–514
Fuchs L (1986) A local mesh-refinement technique for incompressible
flows. Comput Fluids 14(1):69–81
Gan G, Ng MKP (2017) K-means clustering with outlier removal. Pattern Recogn Lett 90:8–14
Garces H, Sbarbaro D (2009) Outliers detection in environmental monitoring data. IFAC Proc 42(23):330–335
Goldstein M, Dengel A (2012) Histogram-based outlier score (hbos):
A fast unsupervised anomaly detection algorithm. Poster Demo
Track 59–63
Goldstein MB (2014) Anomaly detection in large datasets. Verlag Dr.
Hut
Gu Y, Ganesan RK, Bischke B, et al (2017) Grid-based outlier detection
in large data sets for combine harvesters. In: Proceedings of the
International Conference on Industrial Informatics. pp 811–818
Güngör E, Özmen A (2017) Distance and density based clustering algorithm using gaussian kernel. Expert Syst Appl 69:10–20
Guseva AI, Kuznetsov IA (2017) The use of entropy measure for higher
quality machine learning algorithms in text data processing. In:
Proceedings of the International Conference on Future Internet of
Things and Cloud Workshops. pp 47–52
Hautamäki V, Cherednichenko S, Kärkkäinen I, et al (2005) Improving k-means by outlier removal. In: Scandinavian Conference on
Image Analysis. Springer, pp 978–987
He Y, Tan H, Luo W et al (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers.
Pattern Recogn Lett 24(9–10):1641–1650
Jabez J, Muthukumar B (2015) Intrusion detection system (ids):
anomaly detection using outlier detection approach. Procedia
Comput Sci 48:338–346
Jiang MF, Tseng SS, Su CM (2001) Two-phase clustering process for outliers detection. Pattern Recognit Lett 22(6–7):691–700
Kadlec P, Gabrys B, Strandt S (2009) Data-driven soft sensors in the
process industry. Comput Chem Eng 33(4):795–814
Karypis G, Han EH, Kumar V (1999) Chameleon: Hierarchical clustering using dynamic modeling. Computer 32(8):68–75
Kotsiantis S, Pintelas P (2004) Recent advances in clustering: A brief
survey. Trans Inf Sci Appl 1(1):73–81
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the international
conference on Knowledge discovery and data mining. pp 444–
452
Kriegel HP, Kröger P, Schubert E, et al (2009) Loop: local outlier probabilities. In: Proceedings of the conference on Information and
knowledge management. pp 1649–1652
Kärkkäinen I, Fränti P (2002) Dynamic local search algorithm for the clustering problem. Department of Computer Science, University of Joensuu, Tech Rep A-2002-6
Lang K (1995) Newsweeder: Learning to filter netnews. In: Machine
Learning Proceedings 1995. Elsevier, p 331–339
Lee J, Cho NW (2016) Fast outlier detection using a grid-based algorithm. PLoS ONE 11(11):e0165972
Liao WK, Liu Y, Choudhary A (2004) A grid-based clustering algorithm using adaptive mesh refinement. In: Proceedings of the international conference on data mining. pp 61–69
Lin S, Brown DE (2006) An outlier-based data association method for
linking criminal incidents. Decis Support Syst 41(3):604–615
Liu B, Yin J, Xiao Y, et al (2010) Exploiting local data uncertainty to
boost global outlier detection. In: Proceedings of the International
Conference on Data Mining, pp 304–313
Louhichi S, Gzara M, Abdallah HB (2014) A density based algorithm for discovering clusters with varied density. In: Proceedings of the World Congress on Computer Applications and Information Systems. pp 1–6
Lucas Y, Portier PE, Laporte L et al (2020) Towards automated feature
engineering for credit card fraud detection using multi-perspective
hmms. Futur Gener Comput Syst 102:393–402
Luo J, Xu L, Jamont JP et al (2007) Flood decision support system on
agent grid: method and implementation. Enterp Inf Syst 1(1):49–
68
Ma EW, Chow TW (2004) A new shifting grid clustering algorithm.
Pattern Recogn 37(3):503–514
Mahmoud E, Elmogy AM, Sarhan A (2016) Enhancing grid local outlier factor algorithm for better outlier detection. Artif Intell Mach
Learn J 16(1):13–21
Malini N, Pushpa M (2017) Analysis on credit card fraud identification
techniques based on knn and outlier detection. In: Proceedings
of the third International Conference on Advances in Electrical,
Electronics, Information, Communication and Bio-Informatics. pp
255–258
McInnes L, Healy J, Astels S (2017) hdbscan: Hierarchical density
based clustering. J Open Source Softw 2(11):205
Hubert M, Rousseeuw PJ, Segaert P (2015) Discussion of multivariate functional outlier detection. Stat Methods Appl 24(2):177–202
Ohadi N, Kamandi A, Shabankhah M, et al (2020) Sw-dbscan: A grid-based dbscan algorithm for large datasets. In: Proceedings of the International Conference on Web Research (ICWR). pp 139–145
Osekowska E, Johnson H, Carlsson B (2014) Grid size optimization
for potential field based maritime anomaly detection. Transp Res
Procedia 3:720–729
Park NH, Lee WS (2004) Statistical grid-based clustering over data
streams. ACM Sigmod Rec 33(1):32–37
Pearson RK (2002) Outliers in process modeling and identification.
IEEE Trans Control Syst Technol 10(1):55–63
Pilevar AH, Sukumar M (2005) Gchl: A grid-clustering algorithm for
high-dimensional very large spatial data bases. Pattern Recogn Lett
26(7):999–1010
Qiu GF, Li HZ, Xu LD et al (2003) A knowledge processing method
for intelligent systems based on inclusion degree. Expert Syst
20(4):187–195
Rai P, Singh S (2010) A survey of clustering techniques. Int J Comput
Appl 7(12):1–5
Rajeswari A, Yalini S, Janani R, et al (2018) A comparative evaluation
of supervised and unsupervised methods for detecting outliers. In:
Proceedings of the Second International Conference on Inventive
Communication and Computational Technologies. pp 1068–1073
Rehm F, Klawonn F, Kruse R (2007) A novel approach to noise clustering for outlier detection. Soft Comput 11(5):489–494
Rencis JJ, Mullen RL (1986) Solution of elasticity problems by a
self-adaptive mesh refinement technique for boundary element
computation. Int J Numer Methods Eng 23(8):1509–1527
Rokach L (2009) A survey of clustering algorithms. In: Data mining
and knowledge discovery handbook. p 269–298
Sandosh S, Govindasamy V, Akila G (2020) Enhanced intrusion detection system via agent clustering and classification based on outlier
detection. Peer-to-Peer Netw Appl 1–8
Shafiq M, Tian Z, Bashir AK et al (2020) Corrauc: a malicious bot-iot traffic detection method in iot network using machine-learning techniques. IEEE Internet Things J 8(5):3242–3254
Shafiq M, Tian Z, Bashir AK et al (2020) Iot malicious traffic identification using wrapper-based feature selection mechanisms. Comput
Secur 94:101863
Shafiq M, Tian Z, Sun Y et al (2020) Selection of effective machine
learning algorithm and bot-iot attacks traffic identification for
internet of things in smart city. Futur Gener Comput Syst 107:433–
442
Shah A, Azam N, Ali B et al (2021) A three-way clustering approach
for novelty detection. Inf Sci 569:650–668
Shah A, Azam N, Alanazi E, et al (2022) Image blurring and sharpening
inspired three-way clustering approach. Appl Intell 1–25
Sheikholeslami S, Chatterjee S, Zhang A (2002) A multi-resolution
clustering approach for very large spatial databases. In: Proceedings of the International Conference on Formal Ontology in
Information Systems. pp 622–630
Sitanggang IS, Baehaki DAM (2015) Global and collective outliers
detection on hotspot data as forest fires indicator in riau province,
indonesia. In: Proceedings of the International Conference on Spatial Data Mining and Geographical Knowledge Services. pp 66–70
Tran TN, Drab K, Daszykowski M (2013) Revised dbscan algorithm
to cluster data with dense adjacent clusters. Chemometr Intell Lab
Syst 120:92–96
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell
24(9):1273–1280
Veselík P, Sejkorová M, Nieoczym A, et al (2020) Outlier identification
of concentrations of pollutants in environmental data using modern
statistical methods. Pol J Environ Stud 29(1)
Wang B, Xiao G, Yu H, et al (2009) Distance-based outlier detection on uncertain data. In: Proceedings of the International Conference on Computer and Information Technology. pp 293–298
Wang W, Yang J, Muntz R, et al (1997) Sting: A statistical information grid approach to spatial data mining. In: Proceeding of the
conference very large data bases. pp 186–195
Wang X, Davidson I (2009) Discovering contexts and contextual
outliers using random walks in graphs. In: Proceedings of the International Conference on Data Mining. pp 1034–1039
Warne K, Prasad G, Rezvani S et al (2004) Statistical and computational intelligence techniques for inferential model development:
a comparative evaluation and a novel proposition for fusion. Eng
Appl Artif Intell 17(8):871–885
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms.
Ann Data Sci 2(2):165–193
Xu X, Yuruk N, Feng Z, et al (2007) Scan: a structural clustering algorithm for networks. In: Proceedings of the international conference
on Knowledge discovery and data mining. pp 824–833
Xu X, Liu H, Li L et al (2018) A comparison of outlier detection
techniques for high-dimensional data. Int J Comput Intell Syst
11(1):652–662
Yang H, Antonante P, Tzoumas V et al (2020) Graduated non-convexity
for robust spatial perception: From non-minimal solvers to global
outlier rejection. IEEE Robot Autom Lett 5(2):1127–1134
Yang X, Zhang G, Lu J et al (2010) A kernel fuzzy c-means
clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Trans Fuzzy Syst
19(1):105–115
Yap P (2002) Grid-based path-finding. In: Conference of the Canadian
Society for Computational Studies of Intelligence. pp 44–55
Zhang JS, Leung YW (2003) Robust clustering by pruning outliers.
IEEE Trans Syst Man Cybern 33(6):983–998
Zhu Y, Ting KM, Carman MJ (2016) Density-ratio based clustering for discovering clusters with varying densities. Pattern Recogn 60:983–997
Zhu Y, Ting KM, Angelova M (2018) A distance scaling method to improve density-based clustering. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp 389–400
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such
publishing agreement and applicable law.
Authors and Affiliations

Anwar Shah1 · Bahar Ali1 · Fazal Wahab2 · Inam Ullah3 · Kassian T.T. Amesho4,5,6,7,8,9 · Muhammad Shafiq10,11

Anwar Shah
anwar.shah@nu.edu.pk

Bahar Ali
bahar.ali@nu.edu.pk

Fazal Wahab
1728039@stu.edu.cn

Inam Ullah
inam.fragrance@gmail.com

Kassian T.T. Amesho
kassian.amesho@gmail.com

1 National University of Computer and Emerging Sciences, Karachi, Pakistan
2 College of Computer Science and Technology, Northeastern University Shenyang, Shenyang, China
3 BK21 Chungbuk Information Technology Education and Research Center, Chungbuk National University, Cheongju, South Korea
4 Institute of Environmental Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
5 Center for Emerging Contaminants Research, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
6 Tshwane School for Business and Society, Faculty of Management of Sciences, Tshwane University of Technology, Pretoria, South Africa
7 Centre for Environmental Studies, The International University of Management, Main Campus, Dorado Park Ext 1, Windhoek, Namibia
8 Regent Business School, Durban 4001, South Africa
9 Destinies Biomass Energy and Farming Pty Ltd, P.O. Box 7387, Swakopmund, Namibia
10 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
11 Shenyang Normal University, Shenyang, China