Uploaded by Sunita Varma

11b2c1c5-a98d-4cd2-b718-487a322e8a46

advertisement
Natl. Acad. Sci. Lett.
https://doi.org/10.1007/s40009-023-01249-4
NEWS/VIEWS AND COMMENTS
Classify‑Imbalance Data Sets in IoT Framework of Agriculture
Field with Multivariate Sensors Using Centroid‑Based
Oversampling Method
Namrata Bhatt1
· Sunita Varma2
© The Author(s), under exclusive licence to The National Academy of Sciences, India 2023
Abstract The use of effective IOT devices and decision
learning for the prediction of crop growth in the agriculture
field is encouraging ways to boost economic growth in the
farming sector. Increasing operating costs and degradation
of the atmosphere are the key issues in the area of agriculture. A predictive model with advanced data analysis is
needed to process massive amounts of data collected through
multivariate sensors deployed in the agriculture field. In
classification predictive modeling achieving high accuracy
is extremely challenging due to the high-imbalance characteristics of training data. There is a need to improve the classification performance of imbalanced data, which happens
when there are insufficient instances of the data that represent either of the class labels. That also affects the robustness
of the predictive model and significantly causes the loss of
essential crop growth information and crucial details from
an abnormal class. Therefore, there is a need to establish
an effective classification model approach for limited and
imbalanced agriculture datasets that are getting distorted
in favor of the majority class while becoming unfavorably
insensitive to the minority class target. This paper introduces
Synthetic Minority Oversampling Technique (SMOTE), a
new attribute selection methodology based on the centroidbased oversampling method, and the k-nearest neighbor
(kNN) classifier. The collected results were compared with
the traditional oversampling technique. The experimental
* Namrata Bhatt
n.03633sharma@gmail.com
Sunita Varma
sunita.varma19@gmail.com
1
Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal,
Madhya Pradesh, India
2
Information Technology Department, SGSITS, Indore,
Madhya Pradesh, India
results show that the proposed algorithm increases the
overall efficiency in terms of accuracy by 1–4%, precision
improvement by 2–4%, and recall value of 2–10%.
Keywords IOT · Multivariate sensors · Class imbalance ·
SMOTE · Centroid method · Predictive model
The Internet of Things (IoT) is not only restricted to interfacing things together, in addition, permits things to convey
and deal with information. Many researchers are interested
now in the potential of the IoT to transform key businesses
for the betterment of the world through smart devices [1]. It
also gives a controllable and smart framework, dependent
on the sensor data. Data collected from various sensors are
combined to increase the computational capacity and performance of algorithms [2]. However, there are numerous
difficulties in building up an IoT framework, for example,
interoperability, robustness, and security, and still ensuring
the accuracy and reliability of the predictive model.
The increasing growth of IoT systems has a large impact
on the agriculture industry. Agriculture has a major role
in India’s economic growth. A healthy society and a sustainable environment necessitate the production of healthy
crops and the development of efficient plant-growing methods. Government promotes many policies to drive IoT 4.0
through smart devices and system development in the agriculture industry [3].
Most data sets obtained from the real world are not perfect; they suffer from missing values that affect the prediction. The system generates implicit as well as random errors
which are processing errors generated by sensors. So the
quality of the dataset depends on both the class label and
the attribute label. Due to the inherently complex nature of
the dataset, learning from imbalanced data requires some
13
Vol.:(0123456789)
N. Bhatt, S. Varma
new approaches, understandings, and tools to transform that
data. Moreover, this can’t guarantee an efficient solution to
the problem. In the worst cases, it will turn into complete
waste with zero residues to reuse [4].
The main aim is to overcome the problem of imbalance
in the dataset and for that apply the centroid-based oversampling method by which the outcomes showed a great
improvement in the performance with the classifier algorithm. Following are some novel contributions to the proposed work:
1. Choose a minority class input vector by oversampling.
2. Find its k-nearest neighbor (kNN) by replacing the
Euclidean distance calculation formula with the proposed centroid-based method.
3. The suggested approach also enhances the model’s decision-making and predictability at a fairly comparable
classification level.
4. As a result of validation, it increases the overall efficiency in terms of accuracy, precision, sensitivity
(recall), and f-score.
The methods developed to address class imbalance are
often divided into two groups: internal approaches and external approaches. While classifier knowledge is needed to balance the distribution, the internal approach simply changes
the data set to fix the imbalance problem [5]. Instead of
learning processes that rebalance the class distribution via
a pre-processing step, the external approach relies on data
where knowledge of classifiers is not required. This approach
is versatile and further divided into two processes: (i) Undersampling which balances a dataset by reducing instances in
the majority class and (ii) Oversampling, which equalizes
the class ratio by including similar examples of the minority class [6].
The basic sampling approaches which deal with undersampling or oversampling in the dataset are: (i) Random
undersampling (RUS) and (ii) Random oversampling (ROS).
ROS involves choosing random samples from the minority
class instances by adding multiple copies; therefore, a particular instance will be chosen more than once [7]. Chawla
et al. [8] proposed the SMOTE method, an intelligent oversampling strategy that gets around the ROS issue and facilitates the over-fitting of the classifier. The fundamental idea
behind creating new minority class samples is to calculate
several minority class samples that are close to one another.
[9].
H. Zheng et al. [10] focus on imbalance issues and
work on three distinct sampling methods, such as SMOTETomek hybrid sampling, Cluster centroids undersampling, and Borderline-SMOTE oversampling. Also, used
a combination of classifier learners including kNN for
prediction. C.-R. Wang et al. [11] investigate the number
13
of clusters and neighbors of the minority samples with
the BIRCH and BMCSMOTE (Over-Sampling Technique)
methods, respectively, using different classifiers. Liu
et al. [12] build fuzzy-based information decomposition
that simultaneously addresses the issues of missing value
estimates and imbalanced data in pattern classification
(FID). Cheng et al. [13] modified SMOTE algorithm with
the help of the Gaussian mixture model. Noorhalim et al.
[14] identify a class imbalance problem (CIP) that can be
solved by using a decision tree and kNN with and without
sampling in learning classification.
Y. Long et al. [15] suggested that three separate models—linear discriminant analysis, SVM, and kNN—be
used to track various types of drought stress and reveal
chances for non-destructive plant drought stress. Eide
et al. [16] explain multivariate sensors require multivariate analysis so that data input from various blocks can
combine and notches in the final model and apply principal
component analysis. A. S. Palli et al. [17] suggest finding
synthetic samples for every sub-cluster by data observations and applying the centroid method to accelerate the
number of samples in the minority class. L. Wang et al.
[18] investigate that due to the extremely tiny size of the
minority class, oversampling approaches, i.e., Smote, Random oversampling, Minority Weighted, and Cluster-based
are unable to generate enough effective synthetic cases.
A systematic study shows that most of the researchers focus on creating synthetic samples using the SMOTE
[10–13] and kNN [14, 15] to solve the issue of oversampling and classification, which have almost similar performance. Instead of fully focusing on creating synthetic samples, adequate attention should also be given to improve
the model performance.
SMOTE [8, 13] is the much more common and extensively used oversampling method. It creates synthetic data
using the k-nearest neighbor (kNN) technique. To begin,
locate its K closest neighbors in the minority class (m1,
m2… mk). Then, among these K closest neighbors, one
instance N is randomly selected. Random number n is created between 0 and 1 to multiply the variance between m
and N, i.e., n (N − m). Finally, M can be used to create a
new synthetic minority class instance. The equation is,
M = m + n × (N − m)
(1)
Synthetically created samples used any metric in which
the difference in distance between the feature and its
neighbors is calculated, at random along the line segments
joining all its neighbors as illustrated in Fig. 1.
Rather than applying randomized sampling strategies,
SMOTE can improve the over-fitting effect by introducing
synthetic instances on new positions. It also widens the
decision area of minority class examples; there are still
Classify‑Imbalance Data Sets in IoT Framework of Agriculture Field with Multivariate Sensors…
two flaws in it. The first is that it can transmit noisy information, resulting in new instances emerging in inconvenient locations. The second is that all instances in SMOTE
have the same locality parameter K, but the dissemination
features are ignored. This problem can be solved by a proposed solution.
The basic idea is to balance the class distribution in the
training and testing dataset with the proposed centroidbased oversampling method and apply the kNN classifier for
accurate prediction of the model with the high-dimensional
multivariate dataset. However, kNN shows great results with
the Euclidean distance method when data are normalized
and have a low dimension. Thus, the proposed technique
is recommended to address all of the previously described
shortcomings rather than the Euclidean distance method.
A suggested methodology that identifies suitable synthetic
samples with decision regions of the minority class uncovers
how similarity metrics respond to low- and high-dimensional
datasets. The technique for the centroid-based algorithm is
briefly detailed below,
This approach is used to oversample the data set which
is based on the concept of calculating the median of three
points as a distance measure. It is highly clear and holds up
to a lot of data sets with high dimensionality. In classification, the main focus of the predictive model is not only to
assign a class label to two or more classes but also to work in
conjunction with good features that appropriately decide the
classes. The effectiveness of the model has been evaluated
using K-fold cross-validation.
Algorithm 2: kNN Algorithm for the classification model.
Step 1:
Step 2:
Step 3:
Step 4:
Step 5:
Step 6:
Step 7:
Load and split the dataset into training and testing sets; apply tenfold cross-validations.
Choose the value of k. (i.e., k = 3 and k = 2)
Repeat step 1 until the required number of training points for the desired class is obtained.
Instead of using Euclidean distance, calculate the distance measurements using the centroid-based method.
Sort the distances in ascending order for the collection of distances and indices.
Based on the closest distance value from the existing data points, classify the category of the predicted data point.
Using feature similarity, the data point is categorized into one of the K groups.
13
N. Bhatt, S. Varma
kNN is a very simple, easy-to-understand, and frequently
used learning algorithm used in classification to build powerful classifiers and also does not require any training to
make real-time predictions. The centroid-based method used
as the distance metric, to find the distance between the point
and a distribution measure in both oversampling and classification can be formulated as(( K
)( k
))
∑
∑
Dcentroid (X, Y) =
Xi∕k)
Yi∕k
(2)
i
i
where X and Y are the data points that have to be predicted.
k is the number of neighbors.
In this direction, we proposed methodologies which are
shown in Fig. 2. The step-by-step procedure is as follows;
Step 1: A multivariate sensor [14, 16] produces a lot of
data at first.
Step 2: First, examine the imbalance factor. If so, go to
step 3.
Step 3: Apply the centroid-based oversampling approach
to limit outliers in the data set.
Step 4: Go to step 5, if a dataset is already balanced.
Step 5: Apply tenfold cross-validation with three repetitions done after the oversampling of the training, preventing data leakage.
Step 6: Apply the kNN classifier for accurate prediction
of the model.
Step 7: The classification matrices are generated to evaluate the model performance.
There are so many parameters to evaluate the performance of a classification model. The key classification
metrics are: Accuracy, Recall, Precision, and F1- Score
can be calculated asAccuracy: It computes the overall percentage of correct
predictions.
[
]
Accuracy = TN + TP∕(TN + TP + FN + FP)
Fig. 1 SMOTE method creating synthetic instances
Multivariate Sensor dataset
Check
Imbalance
NO
Yes
Select Minority
Class Samples
Balance dataset
Apply Centroid-Based oversampling method on
training dataset using k=2 and k=3
Apply 10-Fold cross validation to separate
training and testing dataset
Apply kNN Classifier using K=2 and K=3 on
oversampled training dataset for model
Classification matrices used to evaluate the
model performance
Output
Fig. 2 Flow diagram of a proposed centroid-based oversampling
method
13
Classify‑Imbalance Data Sets in IoT Framework of Agriculture Field with Multivariate Sensors…
Table 1 Comparison of methods of different parameters with different values of k
Dataset1
K=2
K=3
Parameters
KNN
SMOTE + KNN
SMOTE + KNN
Centroid-based
Parameters
KNN
SMOTE + KNN
SMOTE + KNN
Centroid-based
Accuracy
Recall
Precision
F-measure
0.95
0.85
0.86
0.85
0.96
0.84
0.85
0.84
0.97
0.94
0.89
0.91
Accuracy
Recall
Precision
F-measure
0.93
0.89
0.86
0.87
0.94
0.91
0.89
0.89
0.98
0.93
0.91
0.91
Dataset2
K=2
K=3
Parameters
KNN
SMOTE + KNN
SMOTE + KNN
Centroid-based
Parameters
KNN
SMOTE + KNN
SMOTE + KNN
Centroid-based
Accuracy
Recall
Precision
F-measure
0.94
0.78
0.89
0.83
0.93
0.81
0.84
0.82
0.95
0.96
0.85
0.9
Accuracy
Recall
Precision
F-measure
0.92
0.74
0.87
0.79
0.91
0.81
0.84
0.82
0.97
0.92
0.81
0.86
Dataset1 [19]
Dataset2 [20]
Fig. 3 Comparison of classification methods with the proposed method for datasets [19, 20] with different values of k
Recall: It gives the fraction you correctly identified as
actual positive out of all positives.
[
]
Recall = TP∕(TP + FN)
Precision: The calculation of successfully identified positives out of all positives predicted.
[
]
Precision = TP∕(TP + FP)
13
N. Bhatt, S. Varma
F1-score: It is determined as the harmonic mean of the
model’s precision and recall.
[
]
F1 − score = (2 ∗ Precision ∗ Recall)∕(Precision + Recall)
Several experiments were carried out in python to test
the performance of the suggested technique. Two datasets
related to agricultural crop production are used to verify it.
The dataset1[19] was generated using information from the
feed grains database, which has two separate files and 1154
columns with values for production rate, yield, production
year, and farm price relative to global production. Another
dataset2 [20] was obtained from publicly available records
of the Indian government with an attribute value of rainfall,
temperature, production, irrigation area, and yield of crops.
The results of both datasets are shown below.
The result obtained in Table 1 for different values of k
(i.e., k = 2 and 3) for dataset [19, 20] can be represented in
a graph for proper visualization, which is shown in Fig. 3.
Table 1 shows that the proposed centroid-based oversampling method has the greatest classification accuracy when
matched with SMOTE and kNN for all values of k.
We present a centroid-based oversampling method to handle the difficulty of an imbalanced dataset by creating synthetic samples in a distributed manner. The proposed method
can also work in recovering missing values in datasets by
creating synthetic samples in affected fields. According to
the experimental findings, the suggested algorithm outperforms the traditional approach in terms of model accuracy
by (1–4%), precision improvement by (2–4%), and Recall
& F-score by (2–4%). Once the best oversampling strategy
for a particular dataset has been identified, it can be used
to increase classifier precision. Future work will focus on
applying the methodology to multi-dimensional data and
combining the centroid oversampling method with various
classifiers to improve classification accuracy.
References
1. Siegel JE, Kumar S, Sharma SE, Member IEEE (2018) The future
internet of things: secure, efficient, and model-based. Proc IEEE
Int Things J 5(4):2386–2398
2. Luciani R, Laneve G (2019) Agricultural monitoring, an automation procedure for crop mapping and yield estimation: The
Great Rift Valley of Kenya case. IEEE J Select Top Appl Earth
Observ Remote Sens 12(7):2196–2208
3. Manogaran G, Hsu C-H, Rawal BS, Muthu B, Mavromoustakis CX, Mastorakis G (2021) ISOF: information scheduling
and optimization framework for improving the performance of
agriculture systems aided by industry 4.0. IEEE Int Things J
8(5):3120–3129
4. Loyola-González O, Martínez-Trinidad JFCO, Carrasco-Ochoa
JA, García-Borroto M (2019) Cost-sensitive pattern-based classification for class imbalance problems. IEEE Access 7:60411–60427
13
5. Lin C et al (2018) Minority oversampling in kernel adaptive subspaces for class imbalanced datasets. IEEE Trans Knowl Data Eng
30(5):950–962
6. Ibrahim MH (2021) ODBOT: outlier detection-based oversampling technique for imbalanced datasets learning. Neural Comput Appl 33:15781–15806. https://​d oi.​o rg/​1 0.​1 007/​
s00521-​021-​06198-x
7. Chen M-Y, Chiang H-S, Huang W-K (2022) Efficient generative
adversarial networks for imbalanced traffic collision datasets.
IEEE Trans Intell Transp Syst 23(10):19864–19873. https://​doi.​
org/​10.​1109/​TITS.​2022.​31623​95
8. Chawla V, Bowyer KW, Hall LO, Kegelmeyer WP (2002)
SMOTE: synthetic minority over-sampling technique. J Artif
Intell Res 16(1):321–357
9. Sharma A, Singh PK, Chandra R (2022) SMOTified-GAN for
class imbalanced pattern classification problems. IEEE Access
10:30655–30665. https://​doi.​org/​10.​1109/​ACCESS.​2022.​31589​
77
10. Zheng H, Sherazi SWA, Lee JY (2021) A stacking ensemble prediction model for the occurrences of major adverse cardiovascular
events in patients with acute coronary syndrome on imbalanced
data. IEEE Access 9:113692–113704. https://​doi.​org/​10.​1109/​
ACCESS.​2021.​30997​95
11. Wang C-R, Shao X-H (2021) An improving majority weighted
minority oversampling technique for imbalanced classification
problem. IEEE Access 9:5069–5082. https://​doi.​org/​10.​1109/​
ACCESS.​2020.​30479​23
12. Liu S, Zhang J, Xiang Y, Zhou W (2017) Fuzzy-based information
decomposition for incomplete and imbalanced data learning. IEEE
Trans Fuzzy Syst 25(6):1476–1490
13. Cheng K, Zhang C, Yu H, Yang X, Zou H, Gao S (2019) Grouped
SMOTE with noise filtering mechanism for classifying imbalanced data. IEEE Access 7:170668–170681. https://​doi.​org/​10.​
1109/​ACCESS.​2019.​29550​86
14. Noorhalim N, Ali A, Shamsuddin SM (2019) Handling Imbalanced Ratio for Class Imbalance Problem Using SMOTE. In: Kor
LK, Ahmad AR, Idrus Z, Mansor K (eds) Proceedings of the third
international conference on computing, mathematics and statistics
(iCMS2017). Springer, Singapore
15. Long Y, Ma M (2022) Recognition of drought stress state of
tomato seedling based on chlorophyll fluorescence imaging. IEEE
Access 10:48633–48642. https://d​ oi.o​ rg/1​ 0.1​ 109/A
​ CCESS.2​ 022.3
16. Eide I, Westad F (2018) Automated multivariate analysis of multisensor data submitted online: real-time environmental monitoring.
PLoS ONE 13(1):e0189443
17. Palli AS, Jaafar J, Hashmani MA, Gomes HM, Gilal AR (2022) A
hybrid sampling approach for imbalanced binary and multi-class
data using clustering analysis. IEEE Access 10:118639–118653.
https://​doi.​org/​10.​1109/​ACCESS.​2022.​32184​63
18. Wang L, Hybrid JY (2020) Algorithm of DBSCAN and improved
SMOTE for oversampling. Comput Eng Appl 56(18):111–118
19. Agriculture (2016) Attribution: Department of Agriculture, Economic Research Service Feed Grains Database, Version ade449eb.
Retrieved from https://d​ ata.w
​ orld/a​ gricu​ lture/f​ eed-g​ rains-d​ ataba​ se
20. Jambekar S, Nema S, Saquib Z (2018) Prediction of crop production in india using data mining techniques. In: 2018 fourth
international conference on computing communication control
and automation (ICCUBEA), Pune, India, 2018, pp 1–5, https://​
doi.​org/​10.​1109/​ICCUB​EA.​2018.​86974​46
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.
Download