Natl. Acad. Sci. Lett. https://doi.org/10.1007/s40009-023-01249-4

NEWS/VIEWS AND COMMENTS

Classify-Imbalance Data Sets in IoT Framework of Agriculture Field with Multivariate Sensors Using Centroid-Based Oversampling Method

Namrata Bhatt1 · Sunita Varma2

© The Author(s), under exclusive licence to The National Academy of Sciences, India 2023

Abstract  The use of effective IoT devices and decision learning to predict crop growth in the agriculture field is an encouraging way to boost economic growth in the farming sector. Rising operating costs and degradation of the environment are the key issues in agriculture. A predictive model with advanced data analysis is needed to process the massive amounts of data collected through multivariate sensors deployed in the field. In classification predictive modeling, achieving high accuracy is extremely challenging because of the highly imbalanced character of the training data. Classification performance on imbalanced data needs to be improved; imbalance arises when there are insufficient instances representing one of the class labels. It also weakens the robustness of the predictive model and causes the loss of essential crop-growth information and crucial details from the abnormal class. An effective classification approach is therefore needed for limited, imbalanced agriculture datasets, which become distorted in favor of the majority class while growing unfavorably insensitive to the minority-class target. This paper builds on the Synthetic Minority Oversampling Technique (SMOTE) and introduces a new attribute-selection methodology based on a centroid-based oversampling method together with the k-nearest neighbor (kNN) classifier. The collected results were compared with the traditional oversampling technique.
The experimental results show that the proposed algorithm increases overall accuracy by 1–4%, precision by 2–4%, and recall by 2–10%.

Keywords  IoT · Multivariate sensors · Class imbalance · SMOTE · Centroid method · Predictive model

* Namrata Bhatt, n.03633sharma@gmail.com; Sunita Varma, sunita.varma19@gmail.com
1 Rajiv Gandhi Proudyogiki Vishwavidyalaya, Bhopal, Madhya Pradesh, India
2 Information Technology Department, SGSITS, Indore, Madhya Pradesh, India

The Internet of Things (IoT) is not restricted to connecting things together; it also allows things to communicate and process information. Many researchers are now interested in the potential of the IoT to transform key industries for the betterment of the world through smart devices [1]. It also provides a controllable and smart framework based on sensor data. Data collected from various sensors are combined to increase the computational capacity and performance of algorithms [2]. However, there are numerous difficulties in building an IoT framework, for example interoperability, robustness, and security, while still ensuring the accuracy and reliability of the predictive model. The increasing growth of IoT systems has a large impact on the agriculture industry. Agriculture plays a major role in India's economic growth. A healthy society and a sustainable environment require the production of healthy crops and the development of efficient plant-growing methods. The government promotes many policies to drive IoT 4.0 through smart devices and system development in the agriculture industry [3]. Most datasets obtained from the real world are not perfect; they suffer from missing values that affect prediction. Sensors generate implicit as well as random processing errors, so the quality of a dataset depends on both the class labels and the attribute values.
Due to the inherently complex nature of such datasets, learning from imbalanced data requires new approaches, understandings, and tools to transform the data; even then, an efficient solution cannot be guaranteed, and in the worst case the effort is wasted entirely [4]. The main aim is to overcome the imbalance in the dataset by applying a centroid-based oversampling method, with which the outcomes show a clear improvement in performance with the classifier algorithm. The novel contributions of the proposed work are:
1. Choose a minority-class input vector for oversampling.
2. Find its k-nearest neighbors (kNN), replacing the Euclidean distance calculation with the proposed centroid-based method.
3. The suggested approach also enhances the model's decision-making and predictability at a comparable classification level.
4. On validation, it increases overall efficiency in terms of accuracy, precision, sensitivity (recall), and F-score.
The methods developed to address class imbalance are often divided into two groups: internal and external approaches. Internal approaches modify the learning algorithm itself and therefore require knowledge of the classifier to balance the distribution; external approaches instead rebalance the class distribution in a data pre-processing step, so no knowledge of the classifier is required [5]. The external approach is versatile and is further divided into two processes: (i) undersampling, which balances a dataset by reducing instances of the majority class, and (ii) oversampling, which equalizes the class ratio by adding similar examples of the minority class [6]. The basic sampling approaches are (i) random undersampling (RUS) and (ii) random oversampling (ROS).
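As a minimal illustration of the simpler of these baselines, random oversampling can be sketched in a few lines of NumPy. The function name `random_oversample` and the toy data are our own, not from the paper:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Random oversampling (ROS): duplicate randomly chosen minority
    samples until both classes have the same number of instances.
    Minimal sketch for binary labels; not the paper's proposed method."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    # Sample minority indices with replacement, so an instance may repeat.
    idx = rng.choice(np.flatnonzero(y == minority), size=deficit, replace=True)
    return np.vstack([X, X[idx]]), np.concatenate([y, y[idx]])

# Toy data: 4 majority instances (label 0), 1 minority instance (label 1).
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
Xb, yb = random_oversample(X, y)
# After balancing, each class appears 4 times.
```

Because ROS only copies existing points, a duplicated instance can dominate its neighborhood and encourage over-fitting, which is the weakness SMOTE addresses next.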
ROS chooses random samples from the minority-class instances and adds multiple copies of them, so a particular instance may be chosen more than once [7]. Chawla et al. [8] proposed the SMOTE method, an intelligent oversampling strategy that avoids the over-fitting problem of ROS. Its fundamental idea is to create new minority-class samples from several minority-class samples that lie close to one another [9]. H. Zheng et al. [10] focus on imbalance issues with three distinct sampling methods, SMOTETomek hybrid sampling, cluster-centroids undersampling, and Borderline-SMOTE oversampling, and use a combination of classifier learners, including kNN, for prediction. C.-R. Wang et al. [11] investigate the number of clusters and neighbors of the minority samples with the BIRCH and BMCSMOTE (over-sampling technique) methods, respectively, using different classifiers. Liu et al. [12] build fuzzy-based information decomposition (FID), which simultaneously addresses missing-value estimation and imbalanced data in pattern classification. Cheng et al. [13] modify the SMOTE algorithm with the help of a Gaussian mixture model. Noorhalim et al. [14] identify a class imbalance problem (CIP) that can be addressed by using a decision tree and kNN, with and without sampling, in learning classification. Y. Long et al. [15] suggest three separate models, linear discriminant analysis, SVM, and kNN, to track various types of drought stress, opening the way to non-destructive detection of plant drought stress. Eide et al. [16] explain that multivariate sensors require multivariate analysis so that data inputs from various blocks can be combined in the final model, and apply principal component analysis. A. S. Palli et al. [17] find synthetic samples for every sub-cluster from the data observations and apply the centroid method to increase the number of samples in the minority class. L.
Wang et al. [18] investigate that, because the minority class is extremely small, oversampling approaches such as SMOTE, random oversampling, minority-weighted, and cluster-based methods are unable to generate enough effective synthetic cases. A systematic study shows that most researchers focus on creating synthetic samples using SMOTE [10–13] and kNN [14, 15] to address oversampling and classification, with almost similar performance. Instead of focusing entirely on creating synthetic samples, adequate attention should also be given to improving model performance. SMOTE [8, 13] is the most common and extensively used oversampling method. It creates synthetic data using the k-nearest neighbor (kNN) technique. First, the K closest neighbors of a minority instance m are located in the minority class (m1, m2, ..., mk). Then one instance N is randomly selected among these K closest neighbors. A random number n between 0 and 1 is created to multiply the difference between m and N, i.e., n(N − m). Finally, a new synthetic minority-class instance M is created as

M = m + n × (N − m)    (1)

Synthetic samples are thus created, under the chosen distance metric between a feature vector and its neighbors, at random positions along the line segments joining the instance to its neighbors, as illustrated in Fig. 1. By introducing synthetic instances at new positions rather than applying randomized duplication, SMOTE reduces the over-fitting effect and widens the decision area of minority-class examples. It still has two flaws. The first is that it can propagate noisy information, so new instances emerge in inconvenient locations. The second is that all instances in SMOTE share the same locality parameter K, while the distribution characteristics are ignored. These problems are addressed by the proposed solution.
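Equation (1) can be sketched directly in NumPy; the helper name `smote_sample` and the toy points below are illustrative assumptions, not code from the paper:

```python
import numpy as np

def smote_sample(m, neighbors, rng):
    """Create one synthetic minority instance per Eq. (1):
    M = m + n * (N - m), where N is one of the k nearest minority
    neighbors of m and n is drawn uniformly from [0, 1).
    Sketch of classic SMOTE, not the paper's centroid-based variant."""
    N = neighbors[rng.integers(len(neighbors))]  # pick one of the k neighbors
    n = rng.random()                             # gap factor in [0, 1)
    return m + n * (N - m)

rng = np.random.default_rng(42)
m = np.array([1.0, 2.0])                         # a minority instance
neighbors = np.array([[1.5, 2.5], [0.5, 1.5]])   # its k = 2 minority neighbors
M = smote_sample(m, neighbors, rng)
# M lies on the line segment joining m and the chosen neighbor.
```

Because n is confined to [0, 1), every synthetic point stays strictly between m and an existing minority neighbor, which is exactly why noisy neighbors can pull new instances into inconvenient regions.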
The basic idea is to balance the class distribution in the training and testing datasets with the proposed centroid-based oversampling method and to apply the kNN classifier for accurate prediction on the high-dimensional multivariate dataset. kNN shows good results with the Euclidean distance method when data are normalized and low-dimensional; the proposed technique is therefore recommended to address the previously described shortcomings of the Euclidean distance method. The suggested methodology, which identifies suitable synthetic samples within the decision regions of the minority class, uncovers how similarity metrics respond to low- and high-dimensional datasets. The centroid-based algorithm is briefly detailed below. This approach oversamples the dataset based on the concept of calculating the median of three points as a distance measure. It is conceptually simple and holds up on many datasets of high dimensionality. In classification, the main focus of the predictive model is not only to assign a class label among two or more classes but also to work with good features that appropriately separate the classes. The effectiveness of the model has been evaluated using K-fold cross-validation.

Algorithm 2: kNN algorithm for the classification model.
Step 1: Load and split the dataset into training and testing sets; apply tenfold cross-validation.
Step 2: Choose the value of k (i.e., k = 3 and k = 2).
Step 3: Repeat step 1 until the required number of training points for the desired class is obtained.
Step 4: Instead of Euclidean distance, calculate the distance measurements using the centroid-based method.
Step 5: Sort the distances in ascending order and collect the distances and indices.
Step 6: Based on the closest distance value from the existing data points, classify the category of the predicted data point.
Step 7: Using feature similarity, the data point is categorized into one of the K groups.

kNN is a simple, easy-to-understand, and frequently used learning algorithm for building powerful classifiers; it requires no explicit training phase to make real-time predictions. The centroid-based method used as the distance metric, to find the distance between a point and a distribution measure in both oversampling and classification, can be formulated as

D_centroid(X, Y) = (Σ_{i=1}^{k} X_i / k) × (Σ_{i=1}^{k} Y_i / k)    (2)

where X and Y are the data points to be predicted and k is the number of neighbors. The proposed methodology is shown in Fig. 2. The step-by-step procedure is as follows:
Step 1: A multivariate sensor [14, 16] initially produces a large amount of data.
Step 2: Examine the imbalance factor; if the data are imbalanced, go to step 3.
Step 3: Apply the centroid-based oversampling approach to limit outliers in the dataset.
Step 4: If the dataset is already balanced, go to step 5.
Step 5: Apply tenfold cross-validation with three repetitions; oversampling is applied to the training folds only, preventing data leakage.
Step 6: Apply the kNN classifier for accurate prediction of the model.
Step 7: Generate the classification matrices to evaluate the model performance.

There are many parameters for evaluating the performance of a classification model. The key classification metrics, accuracy, recall, precision, and F1-score, are calculated as follows.
Accuracy computes the overall percentage of correct predictions:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Recall gives the fraction correctly identified as positive out of all actual positives:
Recall = TP / (TP + FN)
Precision is the fraction of successfully identified positives out of all predicted positives:
Precision = TP / (TP + FP)
F1-score is the harmonic mean of the model's precision and recall:
F1-score = (2 × Precision × Recall) / (Precision + Recall)

Fig. 1 SMOTE method creating synthetic instances

Fig. 2 Flow diagram of the proposed centroid-based oversampling method: check the multivariate sensor dataset for imbalance; if imbalanced, select minority-class samples and apply the centroid-based oversampling method to the training dataset using k = 2 and k = 3; apply 10-fold cross-validation to separate training and testing data; apply the kNN classifier with K = 2 and K = 3 on the oversampled training dataset; evaluate the model with classification matrices.

Table 1 Comparison of methods on different parameters with different values of k

Dataset1 [19]
K = 2        kNN    SMOTE + kNN   Centroid-based + kNN
Accuracy     0.95   0.96          0.97
Recall       0.85   0.84          0.94
Precision    0.86   0.85          0.89
F-measure    0.85   0.84          0.91

K = 3        kNN    SMOTE + kNN   Centroid-based + kNN
Accuracy     0.93   0.94          0.98
Recall       0.89   0.91          0.93
Precision    0.86   0.89          0.91
F-measure    0.87   0.89          0.91

Dataset2 [20]
K = 2        kNN    SMOTE + kNN   Centroid-based + kNN
Accuracy     0.94   0.93          0.95
Recall       0.78   0.81          0.96
Precision    0.89   0.84          0.85
F-measure    0.83   0.82          0.90

K = 3        kNN    SMOTE + kNN   Centroid-based + kNN
Accuracy     0.92   0.91          0.97
Recall       0.74   0.81          0.92
Precision    0.87   0.84          0.81
F-measure    0.79   0.82          0.86

Fig. 3 Comparison of classification methods with the proposed method for datasets [19, 20] with different values of k

Several experiments were carried out in Python to test the performance of the suggested technique.
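A minimal Python sketch of the classifier and metric computations described above is given below. It assumes that Eq. (2) denotes the product of the two points' feature means, which is our reading of the formula; the function names and toy data are illustrative, not the authors' implementation:

```python
import numpy as np

def centroid_distance(x, y):
    """Distance measure per our reading of Eq. (2): the product of the
    feature means of the two points, (sum_i x_i / k)(sum_i y_i / k).
    The exact form in the original paper may differ."""
    k = len(x)
    return (np.sum(x) / k) * (np.sum(y) / k)

def knn_predict(X_train, y_train, x, k=3, metric=centroid_distance):
    """Plain kNN (Algorithm 2, steps 4-7): compute distances with the
    chosen metric, sort ascending, and majority-vote the k nearest."""
    d = np.array([metric(row, x) for row in X_train])
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts,
    matching the four formulas in the text."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy example (our own data, not the paper's datasets):
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.0], [3.8, 4.2]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(X_train, y_train, np.array([4.1, 3.9]), k=3)
acc, prec, rec, f1 = classification_metrics(tp=8, tn=5, fp=2, fn=1)
```

Note that `metric` is pluggable, so the same vote can be run with Euclidean distance for comparison, mirroring the paper's kNN vs. centroid-based experiments.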
Two datasets related to agricultural crop production are used to verify it. Dataset1 [19] was generated from the feed-grains database, which has two separate files and 1154 columns with values for production rate, yield, production year, and farm price relative to global production. Dataset2 [20] was obtained from publicly available records of the Indian government, with attribute values for rainfall, temperature, production, irrigation area, and crop yield. The results for both datasets are shown below. The results obtained in Table 1 for different values of k (i.e., k = 2 and 3) for datasets [19, 20] are also represented graphically in Fig. 3. Table 1 shows that the proposed centroid-based oversampling method has the greatest classification accuracy compared with SMOTE and kNN for all values of k. We presented a centroid-based oversampling method that handles the difficulty of an imbalanced dataset by creating synthetic samples in a distributed manner. The proposed method can also recover missing values in datasets by creating synthetic samples in the affected fields. According to the experimental findings, the suggested algorithm outperforms the traditional approach in model accuracy by 1–4%, precision by 2–4%, and recall and F-score by 2–4%. Once the best oversampling strategy for a particular dataset has been identified, it can be used to increase classifier precision. Future work will focus on applying the methodology to multi-dimensional data and combining the centroid oversampling method with various classifiers to improve classification accuracy.

References
1. Siegel JE, Kumar S, Sharma SE (2018) The future internet of things: secure, efficient, and model-based. IEEE Int Things J 5(4):2386–2398
2.
Luciani R, Laneve G (2019) Agricultural monitoring, an automation procedure for crop mapping and yield estimation: the Great Rift Valley of Kenya case. IEEE J Select Top Appl Earth Observ Remote Sens 12(7):2196–2208
3. Manogaran G, Hsu C-H, Rawal BS, Muthu B, Mavromoustakis CX, Mastorakis G (2021) ISOF: information scheduling and optimization framework for improving the performance of agriculture systems aided by industry 4.0. IEEE Int Things J 8(5):3120–3129
4. Loyola-González O, Martínez-Trinidad JF, Carrasco-Ochoa JA, García-Borroto M (2019) Cost-sensitive pattern-based classification for class imbalance problems. IEEE Access 7:60411–60427
5. Lin C et al (2018) Minority oversampling in kernel adaptive subspaces for class imbalanced datasets. IEEE Trans Knowl Data Eng 30(5):950–962
6. Ibrahim MH (2021) ODBOT: outlier detection-based oversampling technique for imbalanced datasets learning. Neural Comput Appl 33:15781–15806. https://doi.org/10.1007/s00521-021-06198-x
7. Chen M-Y, Chiang H-S, Huang W-K (2022) Efficient generative adversarial networks for imbalanced traffic collision datasets. IEEE Trans Intell Transp Syst 23(10):19864–19873. https://doi.org/10.1109/TITS.2022.3162395
8. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 16:321–357
9. Sharma A, Singh PK, Chandra R (2022) SMOTified-GAN for class imbalanced pattern classification problems. IEEE Access 10:30655–30665. https://doi.org/10.1109/ACCESS.2022.3158977
10. Zheng H, Sherazi SWA, Lee JY (2021) A stacking ensemble prediction model for the occurrences of major adverse cardiovascular events in patients with acute coronary syndrome on imbalanced data. IEEE Access 9:113692–113704. https://doi.org/10.1109/ACCESS.2021.3099795
11. Wang C-R, Shao X-H (2021) An improving majority weighted minority oversampling technique for imbalanced classification problem. IEEE Access 9:5069–5082. https://doi.org/10.1109/ACCESS.2020.3047923
12. Liu S, Zhang J, Xiang Y, Zhou W (2017) Fuzzy-based information decomposition for incomplete and imbalanced data learning. IEEE Trans Fuzzy Syst 25(6):1476–1490
13. Cheng K, Zhang C, Yu H, Yang X, Zou H, Gao S (2019) Grouped SMOTE with noise filtering mechanism for classifying imbalanced data. IEEE Access 7:170668–170681. https://doi.org/10.1109/ACCESS.2019.2955086
14. Noorhalim N, Ali A, Shamsuddin SM (2019) Handling imbalanced ratio for class imbalance problem using SMOTE. In: Kor LK, Ahmad AR, Idrus Z, Mansor K (eds) Proceedings of the third international conference on computing, mathematics and statistics (iCMS2017). Springer, Singapore
15. Long Y, Ma M (2022) Recognition of drought stress state of tomato seedling based on chlorophyll fluorescence imaging. IEEE Access 10:48633–48642. https://doi.org/10.1109/ACCESS.2022.3
16. Eide I, Westad F (2018) Automated multivariate analysis of multi-sensor data submitted online: real-time environmental monitoring. PLoS ONE 13(1):e0189443
17. Palli AS, Jaafar J, Hashmani MA, Gomes HM, Gilal AR (2022) A hybrid sampling approach for imbalanced binary and multi-class data using clustering analysis. IEEE Access 10:118639–118653. https://doi.org/10.1109/ACCESS.2022.3218463
18. Wang L et al (2020) Hybrid algorithm of DBSCAN and improved SMOTE for oversampling. Comput Eng Appl 56(18):111–118
19. Agriculture (2016) Attribution: Department of Agriculture, Economic Research Service, Feed Grains Database, version ade449eb. Retrieved from https://data.world/agriculture/feed-grains-database
20. Jambekar S, Nema S, Saquib Z (2018) Prediction of crop production in India using data mining techniques.
In: 2018 fourth international conference on computing communication control and automation (ICCUBEA), Pune, India, 2018, pp 1–5. https://doi.org/10.1109/ICCUBEA.2018.8697446

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.