Environmental Science and Pollution Research
https://doi.org/10.1007/s11356-023-26780-1
Applications of Emerging Green Technologies for Efficient Valorization of Agro-Industrial Waste: A Roadmap Towards Sustainable Environment and Circular Economy
Entropy-based grid approach for handling outliers: a case study
to environmental monitoring data
Anwar Shah1 · Bahar Ali1 · Fazal Wahab2 · Inam Ullah3 · Kassian T.T. Amesho4,5,6,7,8,9 · Muhammad Shafiq10,11
Received: 5 November 2022 / Accepted: 29 March 2023
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023
Responsible editor: Marcus Schulz

Correspondence: Muhammad Shafiq (Srsshafiq@gmail.com)

Extended author information available on the last page of the article
Abstract
Grid-based approaches render an efficient framework for data clustering in the presence of incomplete, inexplicit, and uncertain data. This paper proposes an entropy-based grid approach (EGO) for outlier detection in clustered data. Given hard clusters obtained from a hard clustering algorithm, EGO uses entropy on the dataset as a whole, or on individual clusters, to detect outliers. EGO works in two steps: explicit outlier detection and implicit outlier detection. Explicit outlier detection is concerned with data points that are isolated in grid cells; such points are either far from the dense regions or isolated near them, and are therefore declared explicit outliers. Implicit outlier detection is associated with the detection of outliers that deviate from the normal pattern in a less obvious way. Such outliers are determined using the entropy change of the dataset, or of a specific cluster, for each deviation. An elbow based on the trade-off between entropy and object geometries optimizes the outlier detection process. Experimental results on the CHAMELEON datasets and other similar datasets suggest that the proposed approach(es) detect outliers more precisely and extend outlier detection capability by an additional 4.5% to 8.6%. Moreover, the resultant clusters become more precise and compact when the entropy-based gridding approach is applied on top of hard clustering algorithms. The performance of the proposed algorithms is compared with well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF and HBOS. Finally, a case study on detecting outliers in environmental data has been carried out using the proposed approach, with results generated on our synthetically prepared datasets. The performance shows that the proposed approach may be an industry-oriented solution to outlier detection in environmental monitoring data.
Keywords Environmental monitoring data · Entropy · Grid · Hard clusters · Industrial · Outliers
Introduction
An outlier is a data point or group of points that deviates from the normal pattern of a given dataset. In general, outliers can be divided into three classes based on their behavior: global, conditional (or contextual), and collective outliers. Global outliers deviate significantly from the entire dataset (Liu et al. 2010; Yang et al. 2020). Conditional outliers deviate from the rest of the dataset with respect to some selected context (Bharti et al. 2019; Wang and Davidson 2009). In contrast, collective outliers are a group of instances that collectively deviate from
the rest of the dataset (Rajeswari et al. 2018; Sitanggang and Baehaki 2015). Outliers occur for several reasons, such as inappropriate scaling, improper data collection, measurement errors, sampling errors, and human or experimental errors (Campello et al. 2013; Campos et al. 2016b; Chen et al. 2017; Mia Hubert and Segaert 2015). Outlier detection plays a major role in almost every quantitative discipline, such as cybersecurity, finance, and machine learning (Shah et al. 2021, 2022; Shafiq et al. 2020b, a). It has many applications, including fault diagnosis, web analytics, medical diagnosis, fraud detection, criminal activity detection, and malware detection (Andersson et al. 2016; Jabez and Muthukumar 2015; Lin and Brown 2006; Lucas et al. 2020; Malini and Pushpa 2017; Sandosh et al. 2020; Shafiq et al. 2020c).
Outlier detection is an essential task in data processing, as undetected outliers may severely affect the final evaluation of results.
Therefore, several approaches have been devised for outlier detection, including statistical-based, distance-based, density-based, deviation-based, subspace-based, depth-based, and clustering-based approaches. We handle outlier detection in the context of clustering.
The main focus of cluster-based outlier detection methods is to define and observe compactness in clusters to find outliers. An outlier may be an independent object or appear as an individual cluster (Christy et al. 2015). Clustering methods implicitly define an outlier, or group of outliers, as background noise in which the clusters are embedded in the foreground (Ester et al. 1996; Qiu et al. 2003; Luo et al. 2007). Density-based spatial clustering (DBSCAN) is a well-known method to discover clusters and the surrounding noise in spatial databases (Ester et al. 1996). This method defines outliers based on the densities of collections of objects: an object lying in a low-density region under the same parameters is considered an outlier. Several extensions of and variants to this method have been proposed to handle various problems (Birant and Kut 2007; Borah and Bhattacharyya 2004; Duan et al. 2007; He et al. 2014; Louhichi et al. 2014; McInnes et al. 2017).
Many researchers work in the same direction by modifying, extending, improving or refining the basic approach for different applications (Birant and Kut 2007; Borah and Bhattacharyya 2004; Duan et al. 2007; He et al. 2014; McInnes et al. 2017; Tran et al. 2013). Some approaches declare an object an outlier based on its distance from the centroid of the cluster (Hautamäki et al. 2005; Zhang and Leung 2003). An approach towards group outlier detection, which considers small-size clusters as group outliers, has been devised in Jiang et al. (2001). Another approach is to detect outliers through a separate cluster, called the noise or outlier cluster (Gan and Ng 2017; Rehm et al. 2007): the potential outliers in the dataset have a high level of association with that cluster. Besides these approaches, fuzzy and graph-based methods have also been devised to handle outliers (Xu et al. 2007; Yang et al. 2010). These existing approaches employ some threshold or constraint to isolate outliers from inliers, and they have two key issues. The first is the determination of a suitable threshold. The second is that they apply a single criterion to a multi-class dataset: in a multi-class environment, each class may have a different density, structure and geometry, and a single criterion may fail to determine all the outliers in the dataset.
We introduce an entropy-based gridding approach, EGO, and examine its efficiency and application in outlier detection. EGO is motivated by the fact that entropy is related to the randomness of a system (Guseva and Kuznetsov 2017). More specifically, entropy measures the separation of an object from the centroid of its corresponding cluster, and removing an object can lead to a drop in the entropy of a particular cluster or dataset. Removing objects from a family of objects having the same distribution (sparse or dense) affects the entropy in a systematic way, which indicates that the instances are members of the same cluster and not outliers. Where the entropy drop caused by removing an object deviates from the regular drops, the presence of an outlier is indicated. EGO can be applied to the whole dataset at once, or to individual clusters as EGO_CB. EGO_CB works on a cluster basis, determining the outliers corresponding to each given hard cluster. The refinement capability of EGO_CB is greater than that of EGO, at the cost of increased time complexity. The stopping criteria of EGO and EGO_CB can be optimized using the entropy-geometry elbow. These approaches are evaluated against the inverse relation between object geometry and entropy drop for each specific geometry of the objects in the underlying grid. Experimental results on two-dimensional real and multi-dimensional textual datasets suggest that entropy-based outlier detection detects and removes an additional 4.5% to 8.6% of outliers that remain undetected by DBSCAN or HDBSCAN. Moreover, it leads to greater cluster compactness and better cluster quality compared to state-of-the-art outlier detection methods.
Furthermore, the proposed approaches have been employed on environmental monitoring data taken as a case study. Environmental monitoring data is essential for analyzing and detecting harmful residuals impacting human health. Usually, industrial plants have monitoring systems and meteorological stations that enumerate the key variables of air quality in the presence of the residuals released by these complexes. The measurements may be contaminated with outliers that must be discarded to obtain a consistent dataset. To evaluate the proposed approach's performance, synthetic outliers have been added to the environmental monitoring data. The correlation between fresh-air attributes is computed using the Pearson Correlation Coefficient (PCC) (Benesty et al. 2009); similarly, the correlation of the attributes is computed in the presence of outliers. The attributes are coordinated based on the correlation, which separates abnormal patterns from regular patterns. The proposed approach then detects the outliers using the entropy-based gridding approach. The detection accuracy of the proposed approach for environmental monitoring data is 96.67%, which suggests that our methods are suitable for industrial use.
The remainder of the paper is structured as follows. The background is discussed in the "Background" section; the "Entropy-based gridding approach using elbow method" and "Experimental results and evaluations" sections describe the proposed approaches and the experiments conducted with them, respectively; the "Conclusion" section concludes the paper.
Background
This section introduces the background required to explain
the proposed approach(es). In particular, we discuss data
scaling, gridding, and entropy.
Data scaling
Density-based clustering is capable of identifying clusters of arbitrary size or shape, but has difficulty obtaining clusters with varying densities. A number of efforts have been made to cope with this challenge (Güngör and Özmen 2017; Zhu et al. 2018, 2016). For instance, Rescale is a one-dimensional density-ratio scaling approach, whereas DScale is a multi-dimensional distance-based scaling preprocessing technique (Zhu et al. 2018, 2016).
DScale rescales all the computed distances between the instances of a dataset. Consider a universal set U = {x1, x2, x3, …, xn} with n objects. Each object xi has A attributes, i.e., xi = (xi^1, …, xi^A), where xi^a represents the a-th attribute of object xi. To rescale the computed distance d(x, y) to d′(x, y), DScale defines the scaling function r(x) as,
$$r(x) = \left( \frac{|N_\eta(x, d)| \times m^A}{n \times \eta^A} \right)^{\frac{1}{A}} \tag{1}$$
where Nη(x, d) = {y ∈ U | d(x, y) ≤ η} is the η-neighborhood of x, η ∈ (0, 1) is the radius parameter of the neighborhood, m = max_{x,y ∈ U} d(x, y) is the maximum pairwise distance, and n is the number of objects in U. The scaling function r(x) scales the computed distance matrix DM = [d(x, y)] to the scaled distance matrix D′M = [d′(x, y)] as,
$$d'(x, y) = \begin{cases} d(x, y) \times r(x) & \text{if } y \in N_\eta \\[4pt] (d(x, y) - \eta) \times \dfrac{m - \eta \, r(x)}{m - \eta} + \eta \, r(x) & \text{if } y \in U \setminus N_\eta \end{cases} \tag{2}$$
The above equation defines two cases. The first rescales the distance between x and the elements inside its η-neighborhood using linear scaling, where the number of elements remains the same. The second rescales the distance between x and y ∈ U \ Nη using min-max normalization to preserve the object ranking. The effect of rescaling is illustrated in Fig. 1.

Fig. 1 Data scaling with DScale
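To make Eqs. 1-2 concrete, the following is a minimal sketch of DScale-style rescaling of a pairwise distance matrix. This is our illustration, not the authors' released implementation; the function and variable names are our own.

```python
import numpy as np

def dscale(DM, eta, A):
    """Rescale a pairwise distance matrix DM (n x n) following Eqs. 1-2.

    eta : neighborhood radius in (0, 1); A : number of attributes.
    A sketch of the DScale idea, not the paper's reference implementation.
    """
    n = DM.shape[0]
    m = DM.max()                                 # maximum pairwise distance
    nbr = (DM <= eta).sum(axis=1)                # |N_eta(x, d)| per object
    r = ((nbr * m**A) / (n * eta**A)) ** (1.0 / A)   # Eq. 1
    DM2 = np.empty_like(DM, dtype=float)
    for i in range(n):
        inside = DM[i] <= eta                    # y in N_eta(x): linear scaling
        DM2[i, inside] = DM[i, inside] * r[i]
        out = ~inside                            # y outside: min-max style rescale
        DM2[i, out] = (DM[i, out] - eta) * (m - eta * r[i]) / (m - eta) + eta * r[i]
    return DM2
```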
Gridding and grid-based algorithms
A grid is a collection of uniformly arranged straight lines intersecting each other at equal intervals, representing a series of connected rectangles or squares. Grid-based approaches add value to the performance of algorithms due to their scalability, lower time complexity and parallel processing nature (Amini et al. 2014; Erskine et al. 2006; Kotsiantis and Pintelas 2004; Rai and Singh 2010; Rokach 2009; Xu and Tian 2015). These algorithms are used to formulate solutions for many problems, such as outlier detection, clustering, task scheduling and pathfinding (Agrawal et al. 1998; Bai et al. 2016; Blythe et al. 2005; Lee and Cho 2016; Liao et al. 2004; Ma and Chow 2004; Mahmoud et al. 2016; Ohadi et al. 2020; Park and Lee 2004; Pilevar and Sukumar 2005; Sheikholeslami et al. 2002; Wang et al. 2009, 1997; Yap 2002). The performance of these algorithms depends upon the granularity and mesh refinement (Liao et al. 2004; Ohadi et al. 2020). Some notable granularity and refinement approaches operate on static, dynamic, local and adaptive grids (Batra and Ko 1992; Berger et al. 1989; Berger and Oliger 1984; Eiseman 1987; Fakhari and Lee 2014; Fuchs 1986; Osekowska et al. 2014; Rencis and Mullen 1986).

Grid-based approaches are employed in two steps. First, the statistical data corresponding to each cell are collected using a precisely selected grid (Batra and Ko 1992; Berger et al. 1989; Berger and Oliger 1984; Eiseman 1987; Fakhari and Lee 2014; Fuchs 1986; Rencis and Mullen 1986). Then, the specific operation, such as outlier detection or clustering, is performed on the deployed grid without accessing the data inside the database.

Entropy

Fig. 2 A graphical representation of the entropy of an event

In the area of machine learning, entropy defines the randomness or impurity of a given dataset. Shannon's entropy model uses a base-2 logarithmic function, log2(P(x)), to measure entropy; its behavior as the probability P(x) of an event increases is depicted in Fig. 2. In the case of more than one event or element, the accumulative entropy can be calculated using the following equation,
$$H(x) = -\sum_{i=0}^{n-1} P(x_i) \log_2 P(x_i) \tag{3}$$
where H(x) is the Shannon entropy, P(xi) is the probability of occurrence of the random variable xi, and −log2 P(xi) is its information content. The above equation is the formal definition of entropy, which is the baseline for calculating the information gain of a system.
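As a small illustration of Eq. 3 (our sketch; the empirical probability estimate is an assumption):

```python
import numpy as np

def shannon_entropy(values):
    """Accumulative Shannon entropy (Eq. 3) of a sequence of observations."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()               # empirical probabilities P(x_i)
    return float(-(p * np.log2(p)).sum())   # H(x) = -sum P(x_i) log2 P(x_i)

print(shannon_entropy([0, 0, 1, 1]))  # 1.0 bit: maximally random binary data
print(shannon_entropy([0, 0, 0, 0]))  # 0.0 bits: no randomness
```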
An alternative approach to randomness is distance: elements become more similar to each other as they come closer together in some feature space. Therefore, we argue that distance or remoteness can be substituted for randomness in situations where the elements are weighed in the same feature space while holding different positions. The volume of the system varies with the distance among the objects. Let us consider Fig. 3 to understand the underlying approach.
Fig. 3 Visual demonstration of the distantial entropy of a cluster

In Fig. 3(a), a circle represents cluster c1 with a centroid co, visualized by a black point in the center. The red points represent the objects of cluster c1. The Euclidean distance (d_far) between the object o1 and the centroid co is d1. It is inversely proportional to the probability of similarity (P_sim) of object o1 to the centroid co, which is given as follows,

$$P_{sim} \propto \frac{1}{d_{far}}, \qquad P_{sim} = C \times \frac{1}{d_{far}} \tag{4}$$

where C is the constant of proportionality, which depends upon the weight or importance of the objects; generally its value is equal to 1. Furthermore, E1 is the accumulative entropy for Fig. 3(a). Similarly, in Fig. 3(b), the circle represents cluster c1 with a centroid co visualized by a black point in the center, and the red points are the objects belonging to cluster c1. The Euclidean distance between the object o1 and the centroid co is d2. The entropy of object o1 will increase because d2 > d1, which leads to an increase in the accumulative entropy from E1 to E2. Mathematically, we have

$$\begin{aligned}
E_1 &= -\sum_{i=0}^{n-1} P(x_i) \log_2 P(x_i) \\
    &= -\left[ \sum_{i=0}^{n-2} P(x_i) \log_2 P(x_i) + P(x_1) \log_2 P(x_1) \right] \\
    &= -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + P_{sim}(x_1) \log_2 P_{sim}(x_1) \right] \\
    &= -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + \frac{1}{d_1} \log_2 \frac{1}{d_1} \right]
\end{aligned} \tag{5}$$

Similarly,

$$E_2 = -\left[ \sum_{i=0}^{n-2} P_{sim}(x_i) \log_2 P_{sim}(x_i) + \frac{1}{d_2} \log_2 \frac{1}{d_2} \right] \tag{6}$$

From Eqs. 5-6, E2 > E1 as d2 > d1.
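A small numeric sketch of this distantial-entropy argument (ours; it takes Eq. 4 with C = 1, so P_sim = 1/d, and evaluates Eqs. 5-6 directly):

```python
import numpy as np

def distantial_entropy(dists):
    """Accumulative entropy of a cluster from centroid distances (Eqs. 4-6).

    P_sim = 1 / d (Eq. 4 with C = 1); per-object entropy term: -P log2 P.
    """
    p = 1.0 / np.asarray(dists, dtype=float)
    return float(-(p * np.log2(p)).sum())

# Distances chosen in (1, e), where the per-object term -P log2 P grows
# with distance, matching the E2 > E1 argument above (our assumption).
base = [1.2, 1.2, 1.2]                   # objects near the centroid
E1 = distantial_entropy(base + [1.5])    # o1 at distance d1 = 1.5
E2 = distantial_entropy(base + [2.5])    # o1 moved out to d2 = 2.5
print(E1, E2, E2 > E1)                   # entropy grows as o1 moves away
```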
Environmental monitoring data
Environmental data monitoring is an essential process for maintaining a pollution-free environment. Pollution is directly associated with industrial plants and their operations. The analysis is carried out to observe the range of contaminants such as ozone (O3), nitrogen dioxide (NO2), lead, excessive carbon and its byproducts, and other particulate materials dangerous to health, measured in units of mg/m3 or parts per billion (ppb) by chromatographs. An excess of these substances in the air may have dangerous impacts on human health; for instance, an excess of NO2 may cause skin irritation, while high levels of SO2 may cause bronchitis, asthma and even heart attacks. The air quality stations measure the contaminating substances, whereas the meteorological stations measure the wind speed, wind direction, humidity, and temperature. The environmental monitoring data is obtained using chromatographs and stored in the environmental database for analysis. Unfortunately, the data contain abnormalities arising from several possible causes, such as instrumentation faults, communication channel faults, electrical faults, and problems with emission sources. This leads to masking and swamping effects: masking means treating an abnormal measurement as normal, while swamping means treating a normal measurement as abnormal. Both can severely affect the precautionary measures taken based on these analyses.
Many authors have formulated solutions to cope with these types of outliers. Pearson worked on the identification of system-generated outliers, arguing that an outlier is not always a byproduct of wrong measurements but can also be caused by faulty system operations (Pearson 2002). The severity of these outliers has also been discussed in Kadlec et al. (2009) and Warne et al. (2004). Garces and Sbarbaro presented a nonlinear approach to detect outliers (Garces and Sbarbaro 2009). Alameddine et al. comparatively analyzed three outlier detection mechanisms on environmental monitoring data, particularly water quality in different lakes (Alameddine et al. 2010). Veselík et al. automated the process of outlier detection based on statistical analysis (Veselík et al. 2020).
Entropy-based gridding approach using elbow method
In this section, we present a gridding approach that makes use of the elbow method based on the entropy measurement of each cluster. Entropy is used to measure the randomness in the clusters, while the elbow determines whether objects should be decided as outliers or inliers. In particular, for the improved gridded hard cluster(s), we calculate an initial entropy for the system and keep evaluating the entropy for implicit outliers with different geometries. The elbow method finds the outliers by establishing a trade-off between the entropy and the suspected outliers.
Data scaling
It is not always true that a dataset contains clusters or classes with the same densities, and a conventional machine-learning algorithm can behave unexpectedly in the case of clusters with varied densities. Data scaling is therefore required to balance the densities of all the clusters to achieve optimum results. There are two scaling methods, namely density-ratio and DScale (Zhu et al. 2018, 2016), used for two-dimensional and n-dimensional data, respectively. We use DScale for data scaling via Eq. 2: the distance matrix DM is first computed based on the Euclidean distance and then transformed to the scaled matrix D′M using Eqs. 1-2. Data scaling may improve the clustering process (Zhu et al. 2018, 2016).
Hard clustering
Let D = {x1, x2, …, xn} be a dataset containing a finite number n of objects. Scaling the distances between the objects yields the computed scaled matrix D′M. A conventional machine-learning algorithm clusters the scaled data into groups based on the similarities between objects, such that C = {c1, c2, …, cK}. The conventional machine-learning algorithms used in this paper are HDBSCAN (for initial hard clusters with noise removal) and k-means (for oval-shaped initial hard clusters).
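A minimal sketch of this initial hard-clustering step (ours; the parameter values, e.g. n_clusters and min_cluster_size, are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans
import hdbscan  # pip install hdbscan

X = np.random.rand(500, 2)  # placeholder for the scaled dataset

# Oval-shaped clusters: k-means with K chosen for the dataset
kmeans_labels = KMeans(n_clusters=6, n_init=10).fit_predict(X)

# Arbitrary-shaped clusters with noise removal: HDBSCAN labels noise as -1
hdb_labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(X)
```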
Explicit outlier detection
Grid representation
The next step is to represent the dataset on a grid. The proposed approach(es) work on normalized attributes and on the division of the unit square (in the case of two attributes) or the unit hypercube (in the case of more than two attributes) into an equally distributed grid (Gu et al. 2017).
This approach uses the Euclidean distance metric to measure the influence of each attribute. Therefore, we first normalize the A attributes to the range [0, 1] to balance the effect of each attribute during distance analysis (Goldstein 2014); this scales the whole problem space to a unit hypercube and requires O(n·A) operations. Further, we divide the unit square (in the case of two attributes) or the unit hypercube (in the case of more than two attributes) into p equidistant parts such that p ∈ N. Hence, the total number of grid cells becomes p^A, where p is the number of grid cells in one dimension and A is the number of attributes in the data space. Each grid cell can be located through an A-tuple (j0, j1, …, j_{A−1}), where j_k ∈ {0, 1, …, p − 1}. Furthermore, we can map this multidimensional grid to a single-value index, i.e., I = {0, 1, 2, …, p^A − 1}. Algorithm 1 demonstrates the grid-based representation of the data space, as shown in Fig. 4 and supported by Table 1.

Fig. 4 Data points represented on a grid
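A minimal sketch of this normalization-and-indexing step (ours, assuming zero-based cell indices):

```python
import numpy as np

def grid_index(x, p):
    """Map a normalized point x in [0,1]^A to a single-value grid-cell index.

    Cell coordinates j_k in {0, ..., p-1}; flat index I in {0, ..., p^A - 1}.
    """
    j = np.minimum((np.asarray(x) * p).astype(int), p - 1)  # A-tuple (j_0, ..., j_{A-1})
    A = len(j)
    return int(sum(j[k] * p**k for k in range(A)))          # flatten the A-tuple

print(grid_index([0.34, 0.81], p=5))  # cell (1, 4) -> index 1 + 4*5 = 21
```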
The clustered data is logically gridded using Algorithm 1, and the objects in the dataset are then tested as possible outliers. The grid plays a key role in the process of explicit outlier detection: an object with all of its immediate neighboring cells empty is an exceptional case and an obvious outlier. The remaining objects are arranged in sets based on their neighboring empty cells, and each object in these sets is examined as a possible outlier based on its acquired geometry.
Entropy measurement
The process of explicit outlier detection is followed by an implicit outlier detection mechanism. First, the initial entropy of the remaining dataset is determined, say E1. The elements are divided into eight sets with respect to their immediate empty neighboring cells, i.e., G_e = {e1, e2, …, e8}, where e_i represents the elements with i immediate empty cells. Now, to examine the effect on the entropy of declaring a set of objects as outliers, we compute the power set of G_e. The power set P(G_e) holds all possible combinations of objects existing in various geometries, arranged in 2^8 subsets of the set G_e. Note that we do not consider ∅ in our analysis, so |P(G_e)| − 1 subsets are examined. Each time a subset of G_e is examined for outlier verification, the entropy is calculated for the dataset excluding that subset. The resulting graph creates an elbow at a specific point, marking the possible boundary between outliers and inliers.
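The following sketch (ours) illustrates the mechanics on a small 2-D grid: objects are grouped into G_e by their number of immediate empty neighbors, and Eq. 3 is evaluated with each non-empty subset of G_e excluded. Applying Eq. 3 to the cell-occupancy distribution is our simplification of the paper's object-level probabilities.

```python
import numpy as np
from itertools import chain, combinations

def empty_neighbors(occ, i, j):
    """Number of empty cells among the 8 immediate neighbors (off-grid counts as empty)."""
    cnt = 0
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            if (di, dj) != (0, 0):
                a, b = i + di, j + dj
                inside = 0 <= a < occ.shape[0] and 0 <= b < occ.shape[1]
                cnt += 0 if (inside and occ[a, b] > 0) else 1
    return cnt

def cell_entropy(counts):
    """Eq. 3 over the occupancy distribution of the non-empty cells."""
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

occ = np.random.default_rng(0).poisson(0.7, size=(10, 10))
cells = [(i, j) for i in range(10) for j in range(10) if occ[i, j] > 0]
Ge = {g: [c for c in cells if empty_neighbors(occ, *c) == g] for g in range(1, 9)}
groups = [g for g, members in Ge.items() if members]

# Entropy with each non-empty subset of G_e excluded (power set minus the empty set)
entropies = {}
for subset in chain.from_iterable(combinations(groups, r) for r in range(1, len(groups) + 1)):
    kept = occ.astype(float)
    for g in subset:
        for i, j in Ge[g]:
            kept[i, j] = 0.0                  # exclude suspected outlier cells
    vals = kept[kept > 0]
    entropies[subset] = cell_entropy(vals) if vals.size else 0.0
```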
Elbow
In machine learning, the elbow method based on the sum of squared distances is often used to determine the number of clusters. Similarly, here an entropy-based graph is plotted against the objects with various geometries. An elbow evolves at the point of the abrupt jump, which determines the boundary between outliers and inliers. Moreover, the process may potentially be repeated using various settings of empty cells around an object, such as 3-by-3 or 5-by-5 cells; increasing the processing area leads to the detection of more outliers.
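One simple way to locate such an elbow (our sketch; the paper does not prescribe this exact rule) is to take the largest successive drop in the entropy curve:

```python
import numpy as np

def elbow_index(entropies):
    """Index of the largest successive drop in an entropy curve (the elbow)."""
    e = np.asarray(entropies, dtype=float)
    drops = e[:-1] - e[1:]          # entropy drop at each removal step
    return int(np.argmax(drops))    # boundary between inliers and outliers

curve = [4.90, 4.86, 4.83, 4.31, 4.29]  # hypothetical entropy per geometry step
print(elbow_index(curve))                # 2: the abrupt jump occurs after step 2
```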
Table 1 Hash table to map data points in a grid

Object | Grid position (I) | Cells | x | y | Identifier Id | Location Id
x1 | I1 | 4 | 3.3 | 18.5 | - | -
x2 | I2 | 4 | 3.8 | 18.2 | - | -
x3 | I3 | 10, 11 | 1 | 1 | Id1 | 2 cells (I3)
x4 | I4 | 27, 28 | 2 | 4 | Id2 | 2 cells (I4)
x5 | I5 | 38, 39 | 3 | 9 | Id3 | 2 cells (I5)
x6 | I6 | 43, 44 | 4 | 16 | Id4 | 2 cells (I6)
x7 | I7 | 15, 16, 21, 22 | 3 | 12.5 | Id5 | 4 cells (I7)
A special case (group outliers)

Group outliers are a special case of outliers: they may exist in groups located away from the normal clusters. The outliers detected in the above steps are verified against their neighbors as follows,

– Case 1: The immediate neighboring cells all contain outliers.
– Case 2: An immediate neighbor is not an outlier; we check all of its neighborhood elements. If all the elements in its neighborhood match, for most of their neighbors, the threshold acquired for the other outliers, we declare them group outliers.
– Case 3: An immediate neighbor is not an outlier; we check all the immediate neighbors of that non-outlier object. The object is decided to be an inlier if most of its immediate neighbors are inliers.
A demonstration of entropy measurement in the proposed approach(es)
The entropy measurement is the backbone of the proposed method(s). To understand the underlying mechanism, consider the dataset shown in Fig. 5.

Fig. 5 Grid-based representation of a dataset

There are two clusters c1 and c2 in the dataset, represented by blue and green points, respectively. The hard clusters are obtained using a clustering algorithm, i.e., k-means for oval-shaped clusters and HDBSCAN for non-oval-shaped cluster datasets. The geometry of the different objects is grasped and stored in a hash table similar to Table 1. Objects with one or more non-empty nearest-neighbor cells are considered inliers; however, objects whose nearest neighboring cells are all empty are considered outliers in the first place. Next, the centroid of each cluster is computed. The distance between the centroid and each object belonging to that cluster is computed using the Euclidean distance metric as follows,
$$d(o_{c_i}, o_i) = \sqrt{\sum_{a=1}^{A} \left( o_{c_i}^{a} - o_i^{a} \right)^2} \tag{7}$$

Here o_ci is the centroid of cluster ci and oi is an object belonging to cluster ci; the attributes of each object are denoted by A. The next step is to scale the distances between zero and one. The objects are categorized based on their geometry. Each category of objects is removed in turn and the entropy of the complete dataset is calculated. The computed entropy is plotted against the geometry. The algorithm iterates over each specific geometry of objects, and its effect on the entropy of the dataset is recorded. The maximum change in the entropy compared to the initial entropy is taken as the boundary between inliers and outliers.

Similarly, the process can be repeated for cluster-based entropy analysis, which is more robust. The entropy of each cluster is calculated and plotted against the object geometries. We get an entropy-geometry elbow for each cluster and, finally, an optimally cleaned dataset.

Algorithm to grid the space
This section introduces an algorithm to logically grid the space. The Grid-Map algorithm is shown as Algorithm 1. It is used to grid a dataset and map its objects to the constructed grid. A scaled matrix D′M is the only input to the algorithm. The outputs include a grid-based representation of the data objects, the number of objects in each cell n[i], and a mapping of the dataset coordinates to the grid index I = {0, 1, …, p^A − 1}.

The algorithm receives a universal set U and a scaled matrix D′M of the distances between the objects of U. The first for loop normalizes the A attributes into the range [0, 1] for each object in U. The second for loop divides the normalized attributes into p parts, or equivalently the unit hypercube (in the multi-attribute case) into p^A grid cells (gridcell); the collection of all grid cells produces a grid. The third for loop counts the number of objects in each cell and stores it in the array n[i] corresponding to the index I. The final loop maps the coordinates of each instance to its position on the grid, as shown in Table 1. An object located in more than one cell is given an identifier, which removes the redundancy in the process of reverse mapping at a later stage. The algorithm returns a grid G, the number of objects in each cell n[i], and their mapping to the coordinate system.
Algorithm 1 Grid-Map algorithm

Input: A scaled matrix D′M from Algorithm 2 or 3, and p > 1
Output: A grid G, the number of objects in each cell n[i], and a mapping of the data into the grid

1: function Grid-Map(D′M)
2:   for each x_i ∈ U do
3:     Normalize the A attributes between 0 and 1
4:   end for
5:   for each a in A do
6:     Divide the unit interval [0, 1] into p parts (or equivalently the unit hypercube [0, 1]^A into p^A grid cells, gridcell) to obtain G
7:   end for
8:   for each i in p^A do
9:     n[i] = the number of objects in gridcell i
10:  end for
11:  for each x_i in U do
12:    Map the coordinates of x_i to a gridcell ∈ G
13:    Check for duplicate entries
14:    if x_i ∈ more than one cell then
15:      Mark it with an identifier
16:    end if
17:    Record the mapping
18:  end for
19:  return (G, n[i], Mapping)
20: end function

Algorithm for complete dataset-based EGO

This section elaborates the working mechanism of EGO (Algorithm 2) for a complete dataset. The input to the algorithm is a universal set with a finite number of instances. The output is a set of outliers OUT and a set of non-outliers NonOUT representing compact clusters. The algorithm starts by computing the scaled matrix D′M from the distance matrix DM of the objects in the universal set U. Line 2 is a function call; it takes a single argument and returns three values: an initial grid G representing the objects, the number of objects in each cell n[i] of G, and a mapping of the objects in the coordinate system to the cells of G. In line 3, the algorithm partitions the data using a hard clustering algorithm in the case of unlabeled data, or creates partitions from the classification labels. Line 4 calculates the entropy E_U of the complete dataset using Eq. 3. There are two for loops in this algorithm. In the first, lines 5-10, each cell in grid G is examined for being empty or having empty neighbors; such cells are excluded, and the objects in them are considered explicit outliers. Next, in lines 11-16, the algorithm enters a nested for loop that goes through the power set G_e, excludes the corresponding cells, and calculates the entropy. Line 17 determines the optimum entropy by considering a comparatively abrupt elbow. Line 18 accumulates the objects in the cells that cause the elbow. Lines 19 and 20 separate the outliers and inliers into the two sets OUT and NonOUT.

Algorithm 2 An algorithm for complete dataset-based EGO
Input: A universal set U = {x1, x2, x3, …, xn}
Output: The sets OUT and NonOUT of outliers and inliers

1: Compute a distance matrix DM from U
2: (G, n[i], Mapping) = Grid-Map(D′M)  // function call
3: Obtain initial partitions C = {c1, c2, c3, …, cK} using a hard clustering algorithm or the data labels
4: Calculate the entropy E_U of the dataset using Eq. 3
5: for each cell ∈ G do
6:   Examine the neighboring cells
7:   if all neighboring cells = ∅ ∨ cell = ∅ then
8:     The cell is empty or otherwise contains outliers
9:   end if
10: end for
11: for i in G_e do
12:   for each non-empty cell ∈ G do
13:     Exclude all the cells having i empty neighbors
14:   end for
15:   Calculate the entropy E_i of the dataset using Eq. 3
16: end for
17: Compare the E_i values for the optimum entropy drop E using the elbow
18: Detached = the cells causing the optimum entropy drop E
19: OUT = Mapping(Detached → objects)
20: NonOUT = Mapping(U − Detached → objects)
Algorithm for cluster-based EGO (EGO_CB)
Algorithm 3 works on individual clusters, and the result is accumulated at the end. A universal set with a finite number of instances is the input; the outputs are a set of outliers OUT and a set of non-outliers NonOUT representing the compact clusters. The algorithm computes a scaled matrix D′M for the objects in the universal set U. In line 2, a function with a single argument is called, which returns three values: an initial grid G representing the objects, the number of objects in each cell n[i] of G, and a mapping of the objects in the coordinate system to the cells of G. Line 3 creates the initial partition using a hard clustering algorithm or the data labels. Line 4 computes the entropy E_U of the complete dataset using Eq. 3. Two for loops are used in this algorithm. In the first, lines 5-10, each cell in grid G is examined for being empty or having empty neighbors; these cells and the objects inside them are considered explicit outliers. In lines 11-18, the algorithm enters a nested for loop: for each cluster c_k, it calculates the initial entropy of c_k, loops through the power set G_e, excludes the cells, recalculates the entropy of c_k, and ends by finding an optimum entropy E_i. Line 19 accumulates the objects in the cells that cause an elbow for each c_k. Lines 20 and 21 separate the outliers and inliers into the two sets OUT and NonOUT.
Algorithm for group outlier-based EGO (EGO_clique)

This section presents the EGO algorithm for detecting group outliers (Algorithm 4). The set of inliers NonOUT and the desired number of clusters n_d are the two inputs; the algorithm outputs a family of group outliers OUT_clique. The algorithm computes a scaled matrix D′M for the objects in the universal set U. Line 1 maps the given inliers to the coordinate system using Algorithm 1. Lines 2-9 loop over each cell in the grid G, followed by an if-else statement that checks for possible group outliers: a group outlier is suggested if all immediate neighbors contain outliers, or if a maximum number of the immediate neighbors' own immediate neighbors contain outliers; otherwise the objects are inliers. Line 10 checks the optimum entropy drop for each group, and for combinations of groups, to obtain the desired number of clusters. Line 11 removes the non-desired regions. Lines 12-14 decide the groups of outliers based on the optimum entropy.
Algorithm 3 An algorithm for cluster-based EGO

Input: A universal set U = {x1, x2, x3, …, xn}
Output: The sets OUT and NonOUT of outliers and inliers

1: Compute a distance matrix DM from U
2: (G, n[i], Mapping) = Grid-Map(D′M)  // function call
3: Obtain initial partitions C = {c1, c2, c3, …, cK} using a hard clustering algorithm or the data labels
4: Calculate the entropy E_U of the dataset using Eq. 3
5: for each cell ∈ G do
6:   Examine the neighboring cells
7:   if all neighboring cells = ∅ ∨ cell = ∅ then
8:     The cell is empty or otherwise contains outliers
9:   end if
10: end for
11: for each c_k ∈ C do
12:   Calculate the entropy E_U of c_k using Eq. 3
13:   for i in G_e do
14:     Exclude all the cells having i empty neighbors
15:     Calculate the entropy E_i of c_k using Eq. 3
16:   end for
17:   Compare the E_i values for the optimum entropy drop E using the elbow
18: end for
19: Detached = the cells causing the optimum entropy drop E for each c_k
20: OUT = Mapping(Detached → objects)
21: NonOUT = Mapping(U − Detached → objects)
Algorithm 4 An algorithm for group outlier-based EGO

Input: The set of inliers NonOUT from Algorithm 2 or Algorithm 3, and the desired number of clusters n_d
Output: The set OUT_clique representing the group outliers

1: Get the Mapping for U using Algorithm 1
2: for each cell ∈ G do
3:   if the immediate neighbors of cell contain outliers by the elbow ∨ ∃ neighbors whose own neighbors mostly contain outliers by the elbow then
4:     A group of outliers
5:   else if ∃ neighbors whose own neighbors mostly contain inliers then
6:     Inliers
7:     Calculate the entropy E_i using Eq. 3
8:   end if
9: end for
10: Check the optimum entropy drop E for each group, and for combinations of groups, to obtain the desired number of clusters
11: Detached = the removed cells
12: if E is optimum then
13:   OUT_clique = Mapping(Detached → objects)
14: end if
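A sketch of the neighbor-vote rule in lines 2-9 of Algorithm 4 (ours; the majority threshold of 0.5 is an assumption):

```python
import numpy as np

def is_group_outlier(outlier, i, j, thresh=0.5):
    """Decide whether cell (i, j) belongs to a group outlier.

    outlier : boolean grid marking cells already flagged by the elbow.
    A cell joins a group outlier if all immediate neighbors are flagged,
    or if most of its neighbors' neighbors are flagged (Cases 1-2);
    otherwise it is treated as an inlier (Case 3).
    """
    def neighbors(a, b):
        return [(a + di, b + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
                if (di, dj) != (0, 0)
                and 0 <= a + di < outlier.shape[0]
                and 0 <= b + dj < outlier.shape[1]]

    nb = neighbors(i, j)
    if all(outlier[a, b] for a, b in nb):                     # Case 1
        return True
    second = [outlier[a2, b2] for a, b in nb for a2, b2 in neighbors(a, b)]
    return np.mean(second) > thresh                           # Case 2 vs. Case 3
```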
Experimental results and evaluations
This section reports the results of our approach and its comparative analysis against some of the best outlier detection approaches. We measure the performance of our new algorithms EGO and EGO_CB in an unsupervised scenario. We use traditional clustering algorithms that perform cluster generation along with outlier detection; the proposed approaches are then applied to the initially purified clusters to investigate their capability for further outlier detection and cluster purification. These experiments are performed using five unlabeled and three labeled datasets. The unlabeled datasets are t4.8k, t5.8k, t8.8k, t7.10k and A3, while the labeled datasets are D31, 20newsgroup and 50-class Amazon reviews. The t4.8k, t5.8k and t8.8k datasets are bi-featured, containing 8,000 objects each, and the t7.10k dataset has 10,000 instances including many noise points (Karypis et al. 1999; McInnes et al. 2017). The datasets A3 and D31 are also bi-featured, containing 7,500 and 3,100 objects, respectively (Kärkkäinen and Fränti 2002; Veenman et al. 2002). The 20newsgroup (20NG) and 50-class Amazon reviews (50CR) are textual datasets (Chen and Liu 2014; Fei and Liu 2016). The 20NG dataset contains news on 20 topics, with around one thousand news items per topic (Lang 1995). The 50CR dataset contains Amazon reviews for 50 electronic products, with one thousand reviews per product. The initial clusters are obtained using the recent clustering algorithm HDBSCAN. We then apply EGO and EGO_CB to these initial clustering results to evaluate their performance.
Experiments for EGO and EGO_CB
We now evaluate the performance of the EGO and EGO_CB algorithms for cluster compactness in the presence of noise. Tables 2 and 3 show the detailed analysis of the proposed approaches based on internal validation measures, i.e., the Davies-Bouldin (DB) index, the Dunn index and the silhouette. The DB index is the ratio of within-cluster scatter to the separation between the clusters in a dataset. The Dunn index compares the within-cluster variance to the separation between the means of different clusters. The silhouette compares the cohesion or consistency within a cluster to the separation between clusters. Overall, these internal validation measures reflect within-cluster consistency, connectedness, and the distance or separation between clusters.

The unlabeled datasets are evaluated using DBSCAN and HDBSCAN for metric evaluation, and then Algorithms 2 and 3 are applied on top of the HDBSCAN algorithm, which yields a significant improvement in compactness, connectedness and separation between the different clusters by detecting and removing outliers. In the case of the labeled datasets, the initial partitions are obtained using the data labels. The proposed algorithms are then applied on top of the label-based partitions, improving the cluster or class compactness, the connectivity among objects and the separation between clusters while detecting outliers.

The results for DBSCAN, HDBSCAN and EGO are listed in Table 2. It may be noted that the initial partitions are obtained using HDBSCAN or the data labels. The EGO algorithm adds further within-cluster cohesion and separation between clusters.

Table 2 Comparison of DB index, Dunn index and silhouette for EGO
Dataset | Approach | DB index | Dunn index | Silhouette
t4.8k | DBSCAN | 1.030 ± 0.004 | 0.567 ± 0.001 | 0.309 ± 0.001
t4.8k | HDBSCAN | 1.020 ± 0.002 | 0.601 ± 0.001 | 0.316 ± 0.002
t4.8k | EGO | 1.010 ± 0.002 | 0.667 ± 0.002 | 0.341 ± 0.017
t5.8k | DBSCAN | 0.637 ± 0.012 | 0.663 ± 0.001 | 0.457 ± 0.012
t5.8k | HDBSCAN | 0.578 ± 0.003 | 0.696 ± 0.004 | 0.573 ± 0.014
t5.8k | EGO | 0.539 ± 0.002 | 0.741 ± 0.003 | 0.589 ± 0.013
t7.10k | DBSCAN | 0.631 ± 0.003 | 0.587 ± 0.002 | 0.424 ± 0.011
t7.10k | HDBSCAN | 0.619 ± 0.013 | 0.617 ± 0.012 | 0.452 ± 0.011
t7.10k | EGO | 0.588 ± 0.012 | 0.676 ± 0.011 | 0.499 ± 0.017
t8.8k | DBSCAN | 0.626 ± 0.010 | 0.587 ± 0.012 | 0.431 ± 0.011
t8.8k | HDBSCAN | 0.603 ± 0.010 | 0.631 ± 0.002 | 0.471 ± 0.003
t8.8k | EGO | 0.573 ± 0.015 | 0.689 ± 0.020 | 0.501 ± 0.023
A3 | DBSCAN | 0.611 ± 0.004 | 0.589 ± 0.005 | 0.510 ± 0.004
A3 | HDBSCAN | 0.593 ± 0.043 | 0.641 ± 0.027 | 0.527 ± 0.022
A3 | EGO | 0.525 ± 0.022 | 0.703 ± 0.011 | 0.632 ± 0.003
D31 | DBSCAN | 0.581 ± 0.030 | 0.614 ± 0.003 | 0.545 ± 0.034
D31 | Labeled | 0.548 ± 0.006 | 0.638 ± 0.002 | 0.576 ± 0.003
D31 | EGO | 0.410 ± 0.020 | 0.687 ± 0.003 | 0.671 ± 0.004
20NG | DBSCAN | 0.898 ± 0.041 | 0.510 ± 0.022 | 0.318 ± 0.021
20NG | Labeled | 0.861 ± 0.007 | 0.545 ± 0.003 | 0.339 ± 0.005
20NG | EGO | 0.723 ± 0.005 | 0.593 ± 0.002 | 0.401 ± 0.020
50CR | DBSCAN | 0.917 ± 0.013 | 0.503 ± 0.032 | 0.300 ± 0.001
50CR | Labeled | 0.894 ± 0.003 | 0.539 ± 0.002 | 0.302 ± 0.003
50CR | EGO | 0.782 ± 0.002 | 0.601 ± 0.001 | 0.396 ± 0.007
Table 3 Comparison of DB index, Dunn index and silhouette for EGO_CB

Dataset | Approach | DB index | Dunn index | Silhouette
t4.8k | DBSCAN | 1.030 ± 0.004 | 0.567 ± 0.001 | 0.309 ± 0.001
t4.8k | HDBSCAN | 1.020 ± 0.002 | 0.601 ± 0.001 | 0.316 ± 0.002
t4.8k | EGO_CB | 0.998 ± 0.002 | 0.693 ± 0.002 | 0.355 ± 0.019
t5.8k | DBSCAN | 0.637 ± 0.012 | 0.663 ± 0.001 | 0.457 ± 0.012
t5.8k | HDBSCAN | 0.578 ± 0.003 | 0.696 ± 0.004 | 0.573 ± 0.014
t5.8k | EGO_CB | 0.522 ± 0.002 | 0.756 ± 0.031 | 0.600 ± 0.015
t7.10k | DBSCAN | 0.631 ± 0.003 | 0.587 ± 0.002 | 0.424 ± 0.011
t7.10k | HDBSCAN | 0.619 ± 0.013 | 0.617 ± 0.012 | 0.452 ± 0.011
t7.10k | EGO_CB | 0.540 ± 0.004 | 0.693 ± 0.012 | 0.510 ± 0.016
t8.8k | DBSCAN | 0.626 ± 0.010 | 0.587 ± 0.012 | 0.431 ± 0.011
t8.8k | HDBSCAN | 0.603 ± 0.010 | 0.631 ± 0.002 | 0.471 ± 0.003
t8.8k | EGO_CB | 0.569 ± 0.017 | 0.694 ± 0.003 | 0.504 ± 0.024
A3 | DBSCAN | 0.611 ± 0.004 | 0.589 ± 0.005 | 0.510 ± 0.004
A3 | HDBSCAN | 0.593 ± 0.043 | 0.641 ± 0.027 | 0.527 ± 0.022
A3 | EGO_CB | 0.500 ± 0.024 | 0.723 ± 0.017 | 0.641 ± 0.003
D31 | DBSCAN | 0.581 ± 0.030 | 0.614 ± 0.003 | 0.545 ± 0.034
D31 | Labeled | 0.548 ± 0.006 | 0.638 ± 0.002 | 0.576 ± 0.003
D31 | EGO_CB | 0.400 ± 0.021 | 0.691 ± 0.004 | 0.677 ± 0.004
20NG | DBSCAN | 0.898 ± 0.041 | 0.510 ± 0.022 | 0.318 ± 0.021
20NG | Labeled | 0.861 ± 0.007 | 0.545 ± 0.003 | 0.339 ± 0.005
20NG | EGO_CB | 0.710 ± 0.003 | 0.599 ± 0.003 | 0.405 ± 0.012
50CR | DBSCAN | 0.917 ± 0.013 | 0.503 ± 0.032 | 0.300 ± 0.001
50CR | Labeled | 0.894 ± 0.003 | 0.539 ± 0.002 | 0.302 ± 0.003
50CR | EGO_CB | 0.771 ± 0.002 | 0.617 ± 0.002 | 0.404 ± 0.001
The loose instances are detached, and similar instances form a separate cluster. Among the CHAMELEON datasets, t5.8k is affected the most, whereas t4.8k shows a small improvement after the application of EGO. EGO detected more outliers for the A3 dataset, enhancing its compactness and yielding the largest improvement among the unlabeled datasets. The labeled datasets include D31 and the two textual datasets. The performance of EGO on the labeled datasets is comparatively higher among the considered datasets. The textual datasets are highly congested and therefore require careful observation for outlier detection. The proposed approaches detect the outliers in a unique way, avoiding space loss and reducing complexity.

Table 3 reports the results for EGO_CB in terms of three evaluation metrics, namely the DB index, Dunn index and silhouette coefficient. It may be noted that for all datasets the results are improved by eliminating more outliers. The compactness of the initial clusters and the outlier detection performance of EGO_CB are slightly higher than those of EGO. The outliers are more precisely detected, the clusters are further cleaned, and more compactness is added to each cluster. The EGO_CB performance on the labeled datasets is more pronounced due to their highly congested nature.
Visual demonstration of the proposed approach
This section visually demonstrates the performance of the proposed approach EGO_CB. In particular, the cluster representation in the presence of noise is elaborated using HDBSCAN and EGO_CB. The visuals show the superior performance of EGO_CB over the highly reputable HDBSCAN algorithm. For the visual demonstration, we consider two unlabeled datasets and one labeled dataset. In the case of the unlabeled datasets t5.8k and t7.10k, the initial clusters are obtained using HDBSCAN, while for the labeled dataset D31 the primary partitions are obtained using the data labels.

Figure 6 shows the visuals for the unlabeled dataset t5.8k, for which the initial labels are obtained as in a supervised setting. The clusters obtained after HDBSCAN are shown in Fig. 6(b)-(c). There are six clusters in this dataset (represented by different colors), with noise all around them represented by black points. We may observe that the compactness of these clusters is remarkably improved by EGO_CB through the elimination of more outliers, as shown in Fig. 6(d). The last panel, Fig. 6(e), shows the most related instances of all six clusters; these instances are most likely to increase cluster consistency.
Fig. 6 Visual results of cluster-based EGO for the t5.8k dataset
Next, we turn to another unlabeled dataset, t7.10k, shown in Fig. 7(a). Figure 7(b) shows the HDBSCAN process of detecting outliers for the sake of compact cluster formation. Figure 7(c) visualizes EGO_CB detecting outliers from scratch, which shows its ability as a standalone method. There are nine clusters in this dataset (represented by different colors), with noise all around them represented by black points. We may note that EGO_CB detects more outliers and improves cluster quality, as shown in Fig. 7(c). Figure 7(d) shows the nine most compact clusters; these clusters are highly cohesive, with maximum distance from each other.
We now look into the performance of the EGO_CB algorithm on the labeled dataset D31, shown in Fig. 8(a). This labeled dataset arranges 3,100 instances into thirty-one classes, as shown in Fig. 8(b). In the case of a labeled dataset, the partitions are obtained using the data labels without applying any clustering algorithm; EGO_CB is then applied on top of the label-separated dataset. The classes become condensed and further apart due to the application of EGO_CB, as shown in Fig. 8(c). The loosely attached instances, especially those at the boundaries, are detached from the clusters. Figure 8(d) shows the instances belonging to each cluster or class. The compactness of these clusters can be observed visually and quantitatively in Tables 2 and 3.
Comparative analysis of the proposed approach(es)
The comparative analysis of outlier detection algorithms is a tangled and chaotic task. This is because of some basic issues, including the unavailability of suitable datasets with proper ground truth, the selection of optimal parameters for outlier determination, and the choice of how outliers are discriminated from the rest of the data (Campos et al. 2016a; Xu et al. 2018). Despite these difficulties, we compared the proposed method with well-known approaches to assess its effectiveness under the same set of conditions.
Fig. 7 Visual results of cluster-based EGO for the t7.10k dataset

Fig. 8 Visual results of cluster-based EGO for the D31 dataset
It may be noted that the proposed approach has the advantage of additional outlier detection because it is applied on top of improved hard clusters. Besides this, for a fair comparison, we also consider the self-contained nature of the proposed approach. The datasets used in this set of experiments are Cardiotocography and Pima Indian Diabetes (Campos et al. 2016a). These datasets are semantically meaningful, i.e., they reflect real-world data that include instances deviating from the normal pattern (for example, a dataset of players with a particular class of badminton players). The Cardiotocography dataset contains three classes of individually collected data from people screened for heart disease: normal, suspected, and pathological patients. The normal people are inliers, while the suspected and pathological classes are considered outliers. The Pima Indian Diabetes dataset has two classes, containing normal people and patients with diabetes; the normal people are inliers and the diabetic patients are considered outliers. To reflect the rare nature of outliers in the considered datasets, we randomly downsampled the outlier classes to 5%, 10%, 15% and 20% (Campos et al. 2016a). The experiments are repeated ten times and the mean of all results is taken. The following measures evaluate the performance of the proposed approach(es) (Ali et al. 2021; Campos et al. 2016a; Xu et al. 2018),
$$\text{Accuracy} = \frac{\text{Correctly detected outliers and inliers}}{\text{Total objects}}, \tag{8}$$

$$\text{Precision} = \frac{\text{Correctly detected outliers}}{\text{Total identified outliers}}, \tag{9}$$

$$\text{Recall} = \frac{\text{Correctly detected outliers}}{\text{Total actual outliers}}, \tag{10}$$

$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}. \tag{11}$$
These equations express how accurately EGO and EGO_CB classify a given instance as an outlier or inlier. In particular, precision shows the proportion of predicted outliers that are actually outliers, i.e., how many of the predicted outliers (true outliers plus false outliers) are true outliers. Recall shows the proportion of actual outliers that are correctly identified. Accuracy shows the proportion of correctly classified instances, both outliers and inliers. The F1 score summarizes the performance in a single metric and is therefore computed as the harmonic mean of precision and recall. The detailed results are reported in Tables 4 and 5.
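For concreteness, Eqs. 8-11 computed from raw confusion counts (our sketch, with hypothetical counts):

```python
def outlier_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 (Eqs. 8-11) from confusion counts,
    treating 'outlier' as the positive class."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

print(outlier_metrics(tp=18, fp=5, tn=170, fn=7))  # hypothetical counts
```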
The performance of the proposed approach(es) is compared with six well-known outlier detection approaches: RE3WC (reduction- and elevation-based three-way clustering (Ali et al. 2021)), LOF (local outlier factor (Breunig et al. 2000)), LoOP (local outlier probabilities (Kriegel et al. 2009)), ABOD (angle-based outlier detection (Kriegel et al. 2008)), CBLOF (cluster-based local outlier factor (He et al. 2003)) and HBOS (histogram-based outlier score (Goldstein and Dengel 2012)). These approaches are based on a criterion-based ranking of each object to detect outliers.

Table 4 shows the results for the 5% and 10% downsampled outlier classes in the Cardiotocography and Pima Indian datasets. The results for both EGO and EGO_CB are listed at the top of each comparison. The listed values for the different metrics are the means of ten repeated experiments for each approach. In the case of the Pima Indian Diabetes dataset, the accuracy of EGO for 5% and 10% is the same as that of RE3WC, while the proportion of actual outliers classified correctly is slightly lower for 10% downsampled outliers compared to RE3WC. EGO_CB works on a cluster basis and therefore has the maximum accuracy for both 5% and 10% downsampled outliers. The precision values of LoOP are higher for 5% and very close for 10% downsampled outliers on the Pima Indian Diabetes dataset. In the case of the Cardiotocography dataset, the proposed approaches perform better in comparison. The recall values of ABOD for 5% and 10%, and of CBLOF for 10% downsampled outliers, are higher compared to the proposed approaches. The F1 score of the proposed approaches is always better than that of the participating outlier detection algorithms for 5% and 10% downsampled outliers.

Table 5 shows the results for the 15% and 20% downsampled outlier classes in the Cardiotocography and Pima Indian datasets. The computed metric values for EGO and EGO_CB are listed at the top of each comparison. The tabulated values are the means of ten repeated experiments for each approach. In the case of the Pima Indian Diabetes dataset, the proposed approaches perform better except for precision: the precision value of LOF for 15% is slightly higher compared to EGO, and RE3WC, LOF and ABOD beat both EGO and EGO_CB for 20% downsampled outliers. In the case of the Cardiotocography dataset, EGO_CB outperforms all the compared approaches, while EGO is beaten only in precision. The F1 scores show that the proposed approaches consistently produce better results compared to the participating approaches and can therefore be considered a better solution for outlier detection.
Experiments on environmental monitoring data
This section presents the experiments related to the detection capability of the proposed approach on environmental monitoring data. These experiments are conducted on both real and synthetically prepared datasets. First, the attributes in the dataset are analyzed based on their correlation.
Table 4 Results on the Pima Indian Diabetes and Cardiotocography datasets with 5% and 10% downsampled outliers

5% outliers

Approach | Pima: Accuracy | Recall | Precision | F1-Score | Cardiotocography: Accuracy | Recall | Precision | F1-Score
EGO_CB | 0.91 ± 0.02 | 0.53 ± 0.03 | 0.55 ± 0.01 | 0.54 ± 0.02 | 0.94 ± 0.04 | 0.57 ± 0.03 | 0.58 ± 0.03 | 0.57 ± 0.02
EGO | 0.87 ± 0.01 | 0.47 ± 0.02 | 0.51 ± 0.02 | 0.49 ± 0.02 | 0.91 ± 0.03 | 0.51 ± 0.03 | 0.49 ± 0.02 | 0.50 ± 0.02
RE3WC | 0.87 ± 0.01 | 0.43 ± 0.04 | 0.50 ± 0.03 | 0.46 ± 0.04 | 0.88 ± 0.01 | 0.51 ± 0.01 | 0.29 ± 0.01 | 0.37 ± 0.13
LOF | 0.86 ± 0.02 | 0.08 ± 0.05 | 0.31 ± 0.08 | 0.13 ± 0.07 | 0.88 ± 0.01 | 0.51 ± 0.01 | 0.29 ± 0.01 | 0.37 ± 0.13
LoOP | 0.88 ± 0.09 | 0.10 ± 0.03 | 0.70 ± 0.31 | 0.17 ± 0.09 | 0.93 ± 0.03 | 0.22 ± 0.02 | 0.53 ± 0.04 | 0.31 ± 0.02
ABOD | 0.85 ± 0.02 | 0.29 ± 0.08 | 0.37 ± 0.05 | 0.32 ± 0.04 | 0.89 ± 0.03 | 0.62 ± 0.02 | 0.34 ± 0.02 | 0.44 ± 0.02
CBLOF | 0.75 ± 0.07 | 0.13 ± 0.04 | 0.11 ± 0.04 | 0.12 ± 0.04 | 0.86 ± 0.02 | 0.42 ± 0.02 | 0.21 ± 0.01 | 0.28 ± 0.07
HBOS | 0.71 ± 0.08 | 0.14 ± 0.09 | 0.09 ± 0.03 | 0.11 ± 0.07 | 0.85 ± 0.03 | 0.22 ± 0.07 | 0.14 ± 0.02 | 0.17 ± 0.02

10% outliers

Approach | Pima: Accuracy | Recall | Precision | F1-Score | Cardiotocography: Accuracy | Recall | Precision | F1-Score
EGO_CB | 0.87 ± 0.07 | 0.54 ± 0.06 | 0.73 ± 0.07 | 0.62 ± 0.08 | 0.91 ± 0.06 | 0.51 ± 0.05 | 0.66 ± 0.06 | 0.58 ± 0.05
EGO | 0.84 ± 0.08 | 0.44 ± 0.05 | 0.69 ± 0.06 | 0.54 ± 0.05 | 0.89 ± 0.06 | 0.51 ± 0.05 | 0.55 ± 0.07 | 0.53 ± 0.06
RE3WC | 0.84 ± 0.09 | 0.45 ± 0.09 | 0.69 ± 0.03 | 0.54 ± 0.02 | 0.89 ± 0.02 | 0.48 ± 0.01 | 0.55 ± 0.01 | 0.51 ± 0.01
LOF | 0.80 ± 0.08 | 0.16 ± 0.02 | 0.68 ± 0.11 | 0.26 ± 0.04 | 0.88 ± 0.03 | 0.24 ± 0.03 | 0.59 ± 0.03 | 0.34 ± 0.03
LoOP | 0.81 ± 0.05 | 0.19 ± 0.02 | 0.71 ± 0.07 | 0.30 ± 0.09 | 0.88 ± 0.04 | 0.25 ± 0.03 | 0.56 ± 0.04 | 0.34 ± 0.02
ABOD | 0.80 ± 0.09 | 0.33 ± 0.02 | 0.53 ± 0.03 | 0.41 ± 0.07 | 0.87 ± 0.02 | 0.65 ± 0.08 | 0.50 ± 0.03 | 0.56 ± 0.07
CBLOF | 0.73 ± 0.08 | 0.15 ± 0.02 | 0.26 ± 0.04 | 0.19 ± 0.03 | 0.86 ± 0.03 | 0.58 ± 0.03 | 0.47 ± 0.02 | 0.52 ± 0.09
HBOS | 0.74 ± 0.09 | 0.26 ± 0.05 | 0.35 ± 0.03 | 0.30 ± 0.02 | 0.84 ± 0.03 | 0.26 ± 0.08 | 0.37 ± 0.02 | 0.31 ± 0.02
Table 5 Results on Pima Indian Diabetes and Cardiotocography datasets with 15% and 20% downsampled outliers

15% outliers

| Approach | Pima Accuracy | Pima Recall | Pima Precision | Pima F1-Score | Cardio Accuracy | Cardio Recall | Cardio Precision | Cardio F1-Score |
|---|---|---|---|---|---|---|---|---|
| EGO_CB | 0.74 ± 0.07 | 0.37 ± 0.03 | 0.59 ± 0.05 | 0.45 ± 0.05 | 0.89 ± 0.07 | 0.61 ± 0.04 | 0.76 ± 0.03 | 0.68 ± 0.05 |
| EGO | 0.72 ± 0.08 | 0.31 ± 0.04 | 0.55 ± 0.03 | 0.40 ± 0.04 | 0.85 ± 0.05 | 0.50 ± 0.05 | 0.66 ± 0.04 | 0.57 ± 0.06 |
| RE3WC | 0.71 ± 0.08 | 0.29 ± 0.04 | 0.50 ± 0.02 | 0.36 ± 0.04 | 0.84 ± 0.08 | 0.45 ± 0.04 | 0.57 ± 0.02 | 0.51 ± 0.07 |
| LOF | 0.72 ± 0.08 | 0.14 ± 0.05 | 0.58 ± 0.04 | 0.23 ± 0.02 | 0.85 ± 0.04 | 0.25 ± 0.01 | 0.73 ± 0.12 | 0.37 ± 0.08 |
| LoOP | 0.70 ± 0.07 | 0.19 ± 0.02 | 0.48 ± 0.09 | 0.27 ± 0.07 | 0.84 ± 0.02 | 0.27 ± 0.01 | 0.61 ± 0.08 | 0.37 ± 0.09 |
| ABOD | 0.68 ± 0.07 | 0.32 ± 0.04 | 0.45 ± 0.04 | 0.37 ± 0.05 | 0.85 ± 0.09 | 0.59 ± 0.06 | 0.57 ± 0.05 | 0.58 ± 0.05 |
| CBLOF | 0.64 ± 0.09 | 0.25 ± 0.06 | 0.34 ± 0.05 | 0.28 ± 0.01 | 0.82 ± 0.05 | 0.42 ± 0.06 | 0.51 ± 0.07 | 0.47 ± 0.09 |
| HBOS | 0.70 ± 0.06 | 0.26 ± 0.08 | 0.46 ± 0.04 | 0.33 ± 0.03 | 0.79 ± 0.03 | 0.26 ± 0.04 | 0.38 ± 0.07 | 0.31 ± 0.07 |

20% outliers

| Approach | Pima Accuracy | Pima Recall | Pima Precision | Pima F1-Score | Cardio Accuracy | Cardio Recall | Cardio Precision | Cardio F1-Score |
|---|---|---|---|---|---|---|---|---|
| EGO_CB | 0.72 ± 0.04 | 0.31 ± 0.08 | 0.40 ± 0.04 | 0.35 ± 0.05 | 0.88 ± 0.05 | 0.58 ± 0.03 | 0.83 ± 0.04 | 0.68 ± 0.05 |
| EGO | 0.69 ± 0.05 | 0.29 ± 0.04 | 0.38 ± 0.05 | 0.33 ± 0.04 | 0.84 ± 0.07 | 0.52 ± 0.03 | 0.72 ± 0.04 | 0.60 ± 0.03 |
| RE3WC | 0.69 ± 0.04 | 0.26 ± 0.09 | 0.64 ± 0.05 | 0.37 ± 0.06 | 0.84 ± 0.09 | 0.43 ± 0.06 | 0.75 ± 0.08 | 0.55 ± 0.05 |
| LOF | 0.65 ± 0.09 | 0.06 ± 0.01 | 0.53 ± 0.08 | 0.11 ± 0.09 | 0.79 ± 0.03 | 0.10 ± 0.08 | 0.82 ± 0.12 | 0.17 ± 0.07 |
| LoOP | 0.67 ± 0.09 | 0.23 ± 0.04 | 0.58 ± 0.05 | 0.33 ± 0.06 | 0.81 ± 0.03 | 0.26 ± 0.07 | 0.73 ± 0.12 | 0.39 ± 0.07 |
| ABOD | 0.68 ± 0.04 | 0.24 ± 0.01 | 0.63 ± 0.05 | 0.35 ± 0.09 | 0.82 ± 0.03 | 0.52 ± 0.05 | 0.63 ± 0.06 | 0.57 ± 0.03 |
| CBLOF | 0.61 ± 0.04 | 0.17 ± 0.05 | 0.34 ± 0.08 | 0.23 ± 0.01 | 0.79 ± 0.09 | 0.38 ± 0.05 | 0.56 ± 0.03 | 0.45 ± 0.07 |
| HBOS | 0.61 ± 0.07 | 0.17 ± 0.05 | 0.34 ± 0.02 | 0.23 ± 0.04 | 0.71 ± 0.03 | 0.10 ± 0.06 | 0.22 ± 0.07 | 0.14 ± 0.05 |
Table 6 Results on environmental monitoring data for a sample size of 2000 with 10% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 20 | 18 | 5 |
| Carbon monoxide | 24 | 23 | 6 |
| Carbon dioxide | 22 | 21 | 7 |
| Lead | 25 | 23 | 6 |
| Nitrogen monoxide | 24 | 22 | 7 |
| Nitrogen dioxide | 22 | 21 | 8 |
| Ground-level ozone | 21 | 19 | 4 |
| Sulfur monoxide | 22 | 20 | 3 |
| Sulfur dioxide | 20 | 18 | 3 |
computed based on the correlation coefficient. The correlation matrix defines the linear relationship between each pair of variables. Mathematically, it is defined as follows:

$$\text{Matrix} = \begin{bmatrix} P(a,a) & P(a,b) \\ P(b,a) & P(b,b) \end{bmatrix} \tag{12}$$

$$\rho = \frac{\operatorname{Cov}(a,b)}{\sigma_a \, \sigma_b} \tag{13}$$

Eqs. 12 and 13 give the correlation between two attributes, a and b. Here, $\sigma_a$ and $\sigma_b$ denote the standard deviations of a and b, and $\operatorname{Cov}(a,b)$ represents their covariance.
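To make Eqs. 12 and 13 concrete, the following minimal sketch computes the pairwise Pearson correlation for two simulated air-quality attributes. The series names and magnitudes are illustrative assumptions, not values from this study.

```python
# Minimal sketch of Eqs. 12-13: pairwise Pearson correlation between two
# simulated air-quality attributes (illustrative data, not from the paper).
import numpy as np

def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Eq. 13: rho = Cov(a, b) / (sigma_a * sigma_b)."""
    cov = np.cov(a, b, bias=True)[0, 1]      # population covariance Cov(a, b)
    return float(cov / (a.std() * b.std()))  # normalise by the standard deviations

rng = np.random.default_rng(0)
co = rng.normal(0.5, 0.1, 2000)               # hypothetical CO readings
no2 = 0.6 * co + rng.normal(0.0, 0.05, 2000)  # partially correlated NO2 readings

matrix = np.corrcoef(co, no2)                 # Eq. 12: the 2x2 correlation matrix
assert np.isclose(matrix[0, 1], pearson(co, no2))
print(matrix)
```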
The correlation between the various components determines the balance of components in fresh air. Secondly, it is observed that a certain amount of contamination is tolerable until it reaches a severe level, and this rule was strictly followed while computing the correlations between the various air components. Contamination was added synthetically to observe the outlier detection performance of the proposed approach, which was then employed on the contaminated dataset. The results are tabulated in Tables 6, 7, and 8; the injection step is sketched below.
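The synthetic-contamination step can be sketched as follows. The pollutant series, contamination fraction, and shift magnitude (four standard deviations) are illustrative assumptions, not the exact procedure used in this study.

```python
# Hedged sketch of synthetic contamination: shift a chosen fraction of
# readings far from the normal pattern and remember where they were injected.
import numpy as np

def contaminate(series: np.ndarray, fraction: float, rng, scale: float = 4.0):
    """Return a contaminated copy of `series` and the injected indices."""
    out = series.copy()
    n_out = int(len(series) * fraction)
    idx = rng.choice(len(series), size=n_out, replace=False)
    out[idx] += scale * series.std() * rng.choice([-1.0, 1.0], size=n_out)
    return out, idx

rng = np.random.default_rng(1)
so2 = rng.normal(20.0, 2.0, 2000)                  # sample size 2000, as in Table 6
so2_dirty, true_idx = contaminate(so2, 0.10, rng)  # 10% contamination
```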
Table 7 Results on environmental monitoring data for a sample size of 2000 with 15% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 30 | 28 | 5 |
| Carbon monoxide | 30 | 28 | 5 |
| Carbon dioxide | 40 | 39 | 6 |
| Lead | 36 | 35 | 6 |
| Nitrogen monoxide | 25 | 23 | 5 |
| Nitrogen dioxide | 35 | 33 | 4 |
| Ground-level ozone | 35 | 32 | 2 |
| Sulfur monoxide | 36 | 33 | 1 |
| Sulfur dioxide | 33 | 31 | 2 |

Table 8 Results on environmental monitoring data for a sample size of 2000 with 20% contamination

| Contamination | Outliers injected | Outliers detected | False detections |
|---|---|---|---|
| Carbon | 50 | 49 | 5 |
| Carbon monoxide | 40 | 38 | 4 |
| Carbon dioxide | 40 | 38 | 4 |
| Lead | 40 | 39 | 6 |
| Nitrogen monoxide | 40 | 37 | 3 |
| Nitrogen dioxide | 40 | 39 | 6 |
| Ground-level ozone | 40 | 40 | 8 |
| Sulfur monoxide | 55 | 53 | 5 |
| Sulfur dioxide | 55 | 54 | 4 |
Table 6 presents the results for EGO on environmental monitoring data using a sample size of 2000 with 10% contamination. The detection rates for carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 90.00%, 95.83%, 95.45%, 92.00%, 91.67%, 95.45%, 90.48%, 90.91%, and 90.00%, respectively. The overall performance for 10% added contamination is 92.42%. There is a relationship between detected outliers and false detections: a higher outlier detection rate also raises the error rate. This indicates that these outliers are very sparse and share boundaries with the non-outlier components of the air.
Table 7 shows the results of the proposed approach on environmental monitoring data using a sample size of 2000 with 15% contamination. The detection rates for the contaminations carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 93.33%, 93.33%, 97.50%, 97.22%, 92.00%, 94.29%, 91.43%, 91.67%, and 93.94%, respectively. The overall performance for 15% added contamination is 93.86%. The results improved as more contamination was added. The same relationship between detection and error rate can be observed: as the number of true positives increases, false positives increase as well. This indicates that these outliers are far from each other but lie near the overlapping boundaries of the non-outlier components in the air.
Table 8 tabulates the performance of the proposed approach on environmental monitoring data using a sample size of 2000 with 20% contamination. The detection rates for the contaminations carbon, carbon monoxide, carbon dioxide, lead, nitrogen monoxide, nitrogen dioxide, ground-level ozone, sulfur monoxide, and sulfur dioxide are 98.00%, 95.00%, 95.00%, 97.50%, 92.50%, 97.50%, 100.00%, 96.36%, and 98.18%, respectively. The overall performance for 20% added contamination is 96.67%. The results keep improving as more contamination is added. The observations are the same: there is a relationship between detection and error rate, and an increase in the detection rate increases the error rate, since an increase in the number of true positives may also increase the false positives. This suggests that the outliers with high detection rates lie close to each other, with boundaries overlapping the non-contaminated components.
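The per-pollutant rates and overall scores quoted above appear to be detected/injected ratios averaged across pollutants; the short sketch below reproduces the 92.42% figure for Table 6 under that assumption.

```python
# Minimal sketch reproducing the Table 6 figures, assuming each rate is
# (detected / injected) and the overall score is the mean of those rates.
injected = {"Carbon": 20, "Carbon monoxide": 24, "Carbon dioxide": 22,
            "Lead": 25, "Nitrogen monoxide": 24, "Nitrogen dioxide": 22,
            "Ground-level ozone": 21, "Sulfur monoxide": 22, "Sulfur dioxide": 20}
detected = {"Carbon": 18, "Carbon monoxide": 23, "Carbon dioxide": 21,
            "Lead": 23, "Nitrogen monoxide": 22, "Nitrogen dioxide": 21,
            "Ground-level ozone": 19, "Sulfur monoxide": 20, "Sulfur dioxide": 18}

rates = {k: 100.0 * detected[k] / injected[k] for k in injected}
overall = sum(rates.values()) / len(rates)
print(f"Carbon: {rates['Carbon']:.2f}%  overall: {overall:.2f}%")  # 90.00%, 92.42%
```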
Conclusion

Entropy is used in many machine learning applications, and the concept of randomness applies wherever uncertainty exists. We introduced EGO, which combines two concepts: grids and entropy. The grid distributes the data effectively and facilitates analysis, while entropy measures randomness. In this study, entropy is treated as a distance function whose value drops when a member is separated from its respective cluster. EGO is applied on top of hard clusters obtained from a hard clustering algorithm or from data labels. Explicit outliers are removed first; entropy is then computed over the whole dataset to establish its initial randomness, and the remaining sparse instances are examined for implicit outliers. The method is flexible and can also be applied per cluster; this cluster-based variant, EGO_CB, has higher time complexity in exchange for improved results. The proposed approach(es) are comparatively analyzed against well-known outlier detection algorithms, including DBSCAN, HDBSCAN, RE3WC, LOF, LoOP, ABOD, CBLOF, and HBOS. Experimental results indicate that EGO and EGO_CB effectively detect outliers while producing compact clusters, and that the clustering results of other algorithms on noise-polluted datasets can be further refined by EGO and EGO_CB. In particular, our approach(es) detect an additional 4.5% to 8.6% of outliers that remain undetected by the well-known approaches. Moreover, outlier detection in environmental data analysis is carried out as a case study of the proposed approach. The experiments suggest that the proposed approach(es) may be considered a suitable choice for obtaining compact clusters in the presence of noise, while their robustness on environmental data encourages industrial adoption.
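For readers who wish to experiment with the idea, the following sketch illustrates the flavor of the two ingredients, a grid that bins the data and an entropy score that reacts to removing a point. It is a simplified illustration under an assumed grid resolution, not the authors' exact EGO procedure.

```python
# Simplified illustration (not the exact EGO algorithm): bin the data on a
# grid and score each point by how much its removal changes the Shannon
# entropy of the cell-occupancy distribution. bins=10 is an assumed parameter.
import numpy as np

def occupancy_entropy(points: np.ndarray, bins: int = 10) -> float:
    """Shannon entropy of the grid-cell occupancy distribution."""
    hist, _ = np.histogramdd(points, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]                        # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def entropy_change_scores(points: np.ndarray, bins: int = 10) -> np.ndarray:
    """Deviation score per point: entropy change caused by removing it."""
    base = occupancy_entropy(points, bins)
    return np.array([abs(occupancy_entropy(np.delete(points, i, axis=0), bins) - base)
                     for i in range(len(points))])

rng = np.random.default_rng(2)
data = np.vstack([rng.normal(0, 1, (500, 2)), [[8.0, 8.0]]])  # one isolated point
print(entropy_change_scores(data).argmax())  # expected: 500, the isolated point
```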
Author contribution Conceptualization: Anwar Shah and Bahar Ali. Methodology: Anwar Shah, Bahar Ali, and Kassian T.T. Amesho. Software: Anwar Shah, Bahar Ali, Fazal Wahab, and Inam Ullah. Validation: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Formal analysis: Anwar Shah. Investigation: Anwar Shah, Bahar Ali, Inam Ullah, and Muhammad Shafiq. Resources: Anwar Shah and Inam Ullah. Data curation: Anwar Shah, Fazal Wahab, and Inam Ullah. Writing - original draft preparation: Anwar Shah. Writing - review and editing: Anwar Shah and Bahar Ali. Visualization: Anwar Shah and Bahar Ali. Supervision: Anwar Shah, Bahar Ali, and Muhammad Shafiq. Project administration: Anwar Shah, Inam Ullah, Kassian T.T. Amesho, and Muhammad Shafiq. Funding acquisition: Kassian T.T. Amesho, Muhammad Shafiq, Shahid Anwar, and Ahyoung Choi.

Funding This work is supported by the National Natural Science Foundation of China, Grant No. 62250410365.

Availability of data and materials The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

Declarations
Ethics statement The studies involving human participants were
reviewed and approved by the National Natural Science Foundation
of China, Grant No. 62250410365. The ethics committee waived the
requirement of written informed consent for participation. Written
informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this
article.
Consent to participate Informed consent was obtained from all individual participants included in the study.
Consent for publication The authors of the article “Entropy Based Grid
Approach for Handling Outliers: A Case Study to Environmental Monitoring Data” give consent for the publication of all the identifiable
details, including text, images, and materials, to be published in the
journal Environmental Science and Pollution Research.
Conflict of interest The authors declare no competing interests.
References
Agrawal R, Gehrke J, Gunopulos D, et al (1998) Automatic subspace
clustering of high dimensional data for data mining applications.
In: Proceedings of the international conference on Management of
data. pp 94–105
Alameddine I, Kenney MA, Gosnell RJ et al (2010) Robust multivariate
outlier detection methods for environmental data. J Environ Eng
136(11):1299–1304
Ali B, Azam N, Shah A et al (2021) A spatial filtering inspired three-way clustering approach with application to outlier detection. Int J Approx Reason 130:1–21
Amini A, Wah TY, Saboohi H (2014) On density-based data streams
clustering algorithms: A survey. J Comput Sci Technol 29(1):116–
141
Andersson JL, Graham MS, Zsoldos E et al (2016) Incorporating outlier detection and replacement into a non-parametric framework
for movement and distortion correction of diffusion mr images.
NeuroImage 141:556–572
Bai M, Wang X, Xin J et al (2016) An efficient algorithm for distributed
density-based outlier detection on big data. Neurocomputing
181:19–28
Batra R, Ko KI (1992) An adaptive mesh refinement technique for the
analysis of shear bands in plane strain compression of a thermoviscoplastic solid. Comput Mech 10(6):369–379
Benesty J, Chen J, Huang Y, et al (2009) Pearson correlation coefficient.
In: Noise reduction in speech processing. Springer, p 1–4
Berger MJ, Oliger J (1984) Adaptive mesh refinement for hyperbolic
partial differential equations. J Comput Phys 53(3):484–512
Berger MJ, Colella P et al (1989) Local adaptive mesh refinement for
shock hydrodynamics. J Comput Phys 82(1):64–84
Bharti S, Pattanaik K, Pandey A (2019) Contextual outlier detection for
wireless sensor networks. J Ambient Intell Humanized Comput
1–20
Birant D, Kut A (2007) St-dbscan: An algorithm for clustering spatial-temporal data. Data Knowl Eng 60(1):208–221
Blythe J, Jain S, Deelman E et al (2005) Task scheduling strategies
for workflow-based applications in grids. In: IEEE International
Symposium on Cluster Computing and the Grid, vol 2005. pp 759–
767
Borah B, Bhattacharyya D (2004) An improved sampling-based dbscan
for large spatial databases. In: Proceedings of the International
conference on intelligent sensing and information processing. pp
92–96
Breunig MM, Kriegel HP, Ng RT, et al (2000) Lof: identifying density-based local outliers. In: Proceedings of the international conference on Management of data. pp 93–104
Campello RJ, Moulavi D, Sander J (2013) Density-based clustering
based on hierarchical density estimates. In: Proceedings of the
Pacific-Asia conference on knowledge discovery and data mining.
pp 160–172
Campos GO, Zimek A, Sander J et al (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical
study. Data Min Knowl Discov 30(4):891–927
Chen J, Sathe S, Aggarwal C, et al (2017) Outlier detection with autoencoder ensembles. In: Proceedings of the international conference
on data mining. pp 90–98
Chen Z, Liu B (2014) Mining topics in documents: standing on the
shoulders of big data. In: Proceedings of the international conference on Knowledge discovery and data mining. pp 1116–1125
Christy A, Gandhi GM, Vaithyasubramanian S (2015) Cluster based
outlier detection algorithm for healthcare data. Procedia Comput
Sci 50:209–215
Duan L, Xu L, Guo F et al (2007) A local-density based spatial clustering
algorithm with noise. Inf Syst 32(7):978–986
Eiseman PR (1987) Adaptive grid generation. Comput Methods Appl
Mech Eng 64(1–3):321–376
Erskine RH, Green TR, Ramirez JA, et al (2006) Comparison of grid-based algorithms for computing upslope contributing area. Water Resour Res 42(9)
Ester M, Kriegel HP, Sander J, et al (1996) A density-based algorithm
for discovering clusters in large spatial databases with noise. In:
Knowledge Discovery and Data Mining. pp 226–231
Fakhari A, Lee T (2014) Finite-difference lattice boltzmann method
with a block-structured adaptive-mesh-refinement technique. Phys
Rev E 89(3):033310
Fei G, Liu B (2016) Breaking the closed world assumption in text classification. In: Proceedings of the Conference of the North American
Chapter of the Association for Computational Linguistics: Human
Language Technologies. pp 506–514
Fuchs L (1986) A local mesh-refinement technique for incompressible
flows. Comput Fluids 14(1):69–81
Gan G, Ng MKP (2017) K-means clustering with outlier removal. Pattern Recogn Lett 90:8–14
Garces H, Sbarbaro D (2009) Outliers detection in environmental monitoring data. IFAC Proc 42(23):330–335
Goldstein M, Dengel A (2012) Histogram-based outlier score (hbos):
A fast unsupervised anomaly detection algorithm. Poster Demo
Track 59–63
Goldstein MB (2014) Anomaly detection in large datasets. Verlag Dr.
Hut
Gu Y, Ganesan RK, Bischke B, et al (2017) Grid-based outlier detection
in large data sets for combine harvesters. In: Proceedings of the
International Conference on Industrial Informatics. pp 811–818
Güngör E, Özmen A (2017) Distance and density based clustering algorithm using gaussian kernel. Expert Syst Appl 69:10–20
Guseva AI, Kuznetsov IA (2017) The use of entropy measure for higher
quality machine learning algorithms in text data processing. In:
Proceedings of the International Conference on Future Internet of
Things and Cloud Workshops. pp 47–52
Hautamäki V, Cherednichenko S, Kärkkäinen I, et al (2005) Improving k-means by outlier removal. In: Scandinavian Conference on
Image Analysis. Springer, pp 978–987
He Y, Tan H, Luo W et al (2014) Mr-dbscan: a scalable mapreduce-based dbscan algorithm for heavily skewed data. Front Comput Sci 8(1):83–99
He Z, Xu X, Deng S (2003) Discovering cluster-based local outliers.
Pattern Recogn Lett 24(9–10):1641–1650
Jabez J, Muthukumar B (2015) Intrusion detection system (ids):
anomaly detection using outlier detection approach. Procedia
Comput Sci 48:338–346
Jiang MF, Tseng SS, Su CM (2001) Two-phase clustering process for outliers detection. Pattern Recognit Lett 22(6–7):691–700
Kadlec P, Gabrys B, Strandt S (2009) Data-driven soft sensors in the
process industry. Comput Chem Eng 33(4):795–814
Karypis G, Han EH, Kumar V (1999) Chameleon: Hierarchical clustering using dynamic modeling. Computer 32(8):68–75
Kotsiantis S, Pintelas P (2004) Recent advances in clustering: A brief
survey. Trans Inf Sci Appl 1(1):73–81
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proceedings of the international
conference on Knowledge discovery and data mining. pp 444–
452
Kriegel HP, Kröger P, Schubert E, et al (2009) Loop: local outlier probabilities. In: Proceedings of the conference on Information and
knowledge management. pp 1649–1652
Kärkkäinen I, Fränti P (2002) Dynamic local search algorithm for the clustering problem. Department of Computer Science, University of Joensuu, Tech Rep A-2002-6
Lang K (1995) Newsweeder: Learning to filter netnews. In: Machine
Learning Proceedings 1995. Elsevier, p 331–339
Lee J, Cho NW (2016) Fast outlier detection using a grid-based algorithm. PLoS ONE 11(11):e0165972
Liao WK, Liu Y, Choudhary A (2004) A grid-based clustering algorithm using adaptive mesh refinement. In: Proceedings of the international conference on data mining. pp 61–69
Lin S, Brown DE (2006) An outlier-based data association method for
linking criminal incidents. Decis Support Syst 41(3):604–615
Liu B, Yin J, Xiao Y, et al (2010) Exploiting local data uncertainty to
boost global outlier detection. In: Proceedings of the International
Conference on Data Mining, pp 304–313
Louhichi S, Gzara M, Abdallah HB (2014) A density based algorithm for discovering clusters with varied density. In: Proceedings of the World Congress on Computer Applications and Information Systems. pp 1–6
Lucas Y, Portier PE, Laporte L et al (2020) Towards automated feature
engineering for credit card fraud detection using multi-perspective
hmms. Futur Gener Comput Syst 102:393–402
Luo J, Xu L, Jamont JP et al (2007) Flood decision support system on
agent grid: method and implementation. Enterp Inf Syst 1(1):49–
68
Ma EW, Chow TW (2004) A new shifting grid clustering algorithm.
Pattern Recogn 37(3):503–514
Mahmoud E, Elmogy AM, Sarhan A (2016) Enhancing grid local outlier factor algorithm for better outlier detection. Artif Intell Mach
Learn J 16(1):13–21
Malini N, Pushpa M (2017) Analysis on credit card fraud identification
techniques based on knn and outlier detection. In: Proceedings
of the third International Conference on Advances in Electrical,
Electronics, Information, Communication and Bio-Informatics. pp
255–258
McInnes L, Healy J, Astels S (2017) hdbscan: Hierarchical density
based clustering. J Open Source Softw 2(11):205
Hubert M, Rousseeuw PJ, Segaert P (2015) Discussion of multivariate functional outlier detection. Stat Methods Appl 24(2):177–202
Ohadi N, Kamandi A, Shabankhah M, et al (2020) Sw-dbscan: A grid-based dbscan algorithm for large datasets. In: Proceedings of the International Conference on Web Research (ICWR). pp 139–145
Osekowska E, Johnson H, Carlsson B (2014) Grid size optimization
for potential field based maritime anomaly detection. Transp Res
Procedia 3:720–729
Park NH, Lee WS (2004) Statistical grid-based clustering over data
streams. ACM Sigmod Rec 33(1):32–37
Pearson RK (2002) Outliers in process modeling and identification.
IEEE Trans Control Syst Technol 10(1):55–63
Pilevar AH, Sukumar M (2005) Gchl: A grid-clustering algorithm for
high-dimensional very large spatial data bases. Pattern Recogn Lett
26(7):999–1010
Qiu GF, Li HZ, Xu LD et al (2003) A knowledge processing method
for intelligent systems based on inclusion degree. Expert Syst
20(4):187–195
Rai P, Singh S (2010) A survey of clustering techniques. Int J Comput
Appl 7(12):1–5
Rajeswari A, Yalini S, Janani R, et al (2018) A comparative evaluation
of supervised and unsupervised methods for detecting outliers. In:
Proceedings of the Second International Conference on Inventive
Communication and Computational Technologies. pp 1068–1073
Rehm F, Klawonn F, Kruse R (2007) A novel approach to noise clustering for outlier detection. Soft Comput 11(5):489–494
Rencis JJ, Mullen RL (1986) Solution of elasticity problems by a
self-adaptive mesh refinement technique for boundary element
computation. Int J Numer Methods Eng 23(8):1509–1527
Rokach L (2009) A survey of clustering algorithms. In: Data mining
and knowledge discovery handbook. p 269–298
Sandosh S, Govindasamy V, Akila G (2020) Enhanced intrusion detection system via agent clustering and classification based on outlier
detection. Peer-to-Peer Netw Appl 1–8
Shafiq M, Tian Z, Bashir AK et al (2020) Corrauc: a malicious bot-iot traffic detection method in iot network using machine-learning techniques. IEEE Internet Things J 8(5):3242–3254
Shafiq M, Tian Z, Bashir AK et al (2020) Iot malicious traffic identification using wrapper-based feature selection mechanisms. Comput
Secur 94:101863
Shafiq M, Tian Z, Sun Y et al (2020) Selection of effective machine
learning algorithm and bot-iot attacks traffic identification for
internet of things in smart city. Futur Gener Comput Syst 107:433–
442
Shah A, Azam N, Ali B et al (2021) A three-way clustering approach
for novelty detection. Inf Sci 569:650–668
Shah A, Azam N, Alanazi E, et al (2022) Image blurring and sharpening
inspired three-way clustering approach. Appl Intell 1–25
Sheikholeslami S, Chatterjee S, Zhang A (2002) A multi-resolution
clustering approach for very large spatial databases. In: Proceedings of the International Conference on Formal Ontology in
Information Systems. pp 622–630
Sitanggang IS, Baehaki DAM (2015) Global and collective outliers
detection on hotspot data as forest fires indicator in riau province,
indonesia. In: Proceedings of the International Conference on Spatial Data Mining and Geographical Knowledge Services. pp 66–70
Tran TN, Drab K, Daszykowski M (2013) Revised dbscan algorithm
to cluster data with dense adjacent clusters. Chemometr Intell Lab
Syst 120:92–96
Veenman CJ, Reinders MJT, Backer E (2002) A maximum variance cluster algorithm. IEEE Trans Pattern Anal Mach Intell
24(9):1273–1280
Veselík P, Sejkorová M, Nieoczym A, et al (2020) Outlier identification
of concentrations of pollutants in environmental data using modern
statistical methods. Pol J Environ Stud 29(1)
Wang B, Xiao G, Yu H, et al (2009) Distance-based outlier detection on uncertain data. In: Proceedings of the International Conference on Computer and Information Technology. pp 293–298
Wang W, Yang J, Muntz R, et al (1997) Sting: A statistical information grid approach to spatial data mining. In: Proceeding of the
conference very large data bases. pp 186–195
Wang X, Davidson I (2009) Discovering contexts and contextual
outliers using random walks in graphs. In: Proceedings of the International Conference on Data Mining. pp 1034–1039
Warne K, Prasad G, Rezvani S et al (2004) Statistical and computational intelligence techniques for inferential model development:
a comparative evaluation and a novel proposition for fusion. Eng
Appl Artif Intell 17(8):871–885
Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms.
Ann Data Sci 2(2):165–193
Xu X, Yuruk N, Feng Z, et al (2007) Scan: a structural clustering algorithm for networks. In: Proceedings of the international conference
on Knowledge discovery and data mining. pp 824–833
Xu X, Liu H, Li L et al (2018) A comparison of outlier detection
techniques for high-dimensional data. Int J Comput Intell Syst
11(1):652–662
Yang H, Antonante P, Tzoumas V et al (2020) Graduated non-convexity
for robust spatial perception: From non-minimal solvers to global
outlier rejection. IEEE Robot Autom Lett 5(2):1127–1134
Yang X, Zhang G, Lu J et al (2010) A kernel fuzzy c-means
clustering-based fuzzy support vector machine algorithm for classification problems with outliers or noises. IEEE Trans Fuzzy Syst
19(1):105–115
Yap P (2002) Grid-based path-finding. In: Conference of the Canadian
Society for Computational Studies of Intelligence. pp 44–55
Zhang JS, Leung YW (2003) Robust clustering by pruning outliers.
IEEE Trans Syst Man Cybern 33(6):983–998
Zhu Y, Ting KM, Carman MJ (2016) Density-ratio based clustering for discovering clusters with varying densities. Pattern Recogn 60:983–997
Zhu Y, Ting KM, Angelova M (2018) A distance scaling method to improve density-based clustering. In: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp 389–400
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds
exclusive rights to this article under a publishing agreement with the
author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such
publishing agreement and applicable law.
Authors and Affiliations

Anwar Shah1 · Bahar Ali1 · Fazal Wahab2 · Inam Ullah3 · Kassian T.T. Amesho4,5,6,7,8,9 · Muhammad Shafiq10,11

Anwar Shah
anwar.shah@nu.edu.pk

Bahar Ali
bahar.ali@nu.edu.pk

Fazal Wahab
1728039@stu.edu.cn

Inam Ullah
inam.fragrance@gmail.com

Kassian T.T. Amesho
kassian.amesho@gmail.com

1 National University of Computer and Emerging Sciences, Karachi, Pakistan
2 College of Computer Science and Technology, Northeastern University Shenyang, Shenyang, China
3 BK21 Chungbuk Information Technology Education and Research Center, Chungbuk National University, Cheongju, South Korea
4 Institute of Environmental Engineering, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
5 Center for Emerging Contaminants Research, National Sun Yat-Sen University, Kaohsiung 804, Taiwan
6 Tshwane School for Business and Society, Faculty of Management of Sciences, Tshwane University of Technology, Pretoria, South Africa
7 Centre for Environmental Studies, The International University of Management, Main Campus, Dorado Park Ext 1, Windhoek, Namibia
8 Regent Business School, Durban 4001, South Africa
9 Destinies Biomass Energy and Farming Pty Ltd, P.O. Box 7387, Swakopmund, Namibia
10 Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou, China
11 Shenyang Normal University, Shenyang, China