Group-9 Project Proposal

I. Dataset:
We are using the New Plant Diseases Dataset: https://www.kaggle.com/datasets/vipoooool/new-plant-diseases-dataset
This dataset consists of about 87K RGB images of healthy and diseased crop leaves in 38 different classes. The full dataset is split 80/20 into training and validation sets, preserving the directory structure.

Goal: The image dataset contains healthy and diseased crop leaves across 38 classes, of which we will select 10 classes to evaluate our proposed algorithm.

II. Related Works:
We have referred to five academic papers on plant/leaf disease classification using K-Nearest Neighbor (KNN). Our findings from the papers are below.

Paper-1: Coffee Plant Disease Classification Using K-Nearest Neighbor
Authors: Muhammad Alif Naufal Yasin, Wikky Fawwaz Al Maki
Link: https://ieeexplore.ieee.org/document/9914843
The dataset and techniques used in this paper are detailed below.

Dataset: The full dataset contains 2209 images of arabica coffee leaf symptoms divided into five classes. Only four classes are used in the paper, namely healthy, rust, phoma, and Cercospora, with 1669 images in total.

Summary: Images are first resized before classification. Features are then extracted from the images using Color Moments and the Gray-Level Co-occurrence Matrix (GLCM), and classification is carried out using KNN. KNN is chosen as the classification algorithm for its simplicity of implementation and high performance.

Techniques used:
Preprocessing data: Images are resized to 100 x 100 pixels so that all photos have a consistent size.
Feature extraction: Feature extraction follows resizing, using Color Moment and GLCM. For Color Moment, the image is converted to the RGB and YCrCb color spaces.
The extracted features serve as the model, which KNN then uses to classify the test data, yielding an accuracy value. For GLCM, the resized image is converted to grayscale before GLCM features are extracted.

KNN Classification:
a) KNN with Color Moment: Color Moment features are tested with the two color spaces and k values from 1 to 15. The YCrCb color space gives the highest accuracy, 82.3% at k=9.
b) KNN with GLCM: Classification is performed with KNN using GLCM features computed at various angles. Experiments with a single angle and with a combination of two angles give 61% accuracy.
c) KNN with Color Moment (RGB or YCrCb) + GLCM: To improve accuracy further, the two feature extraction methods, Color Moment and GLCM, are combined. Color Moment in the RGB color space combined with GLCM gives 74.2% accuracy, and Color Moment in the YCrCb color space combined with GLCM gives 81.3% accuracy. After trying different values of k with the YCrCb Color Moment and GLCM combination, the highest accuracy of 83.5% is obtained.

Paper-2: Disease Detection in Plants Using KNN Algorithm
Authors: Surbhi Garg, Divya Dixit, Sudeept Singh Yadav
Link: https://ieeexplore.ieee.org/document/10074491
The dataset and techniques used in this paper are detailed below.

Dataset: The dataset used is the public Plant Village dataset for plant disease detection. It contains 87000 RGB images of healthy and unhealthy plant leaves in 38 classes, of which 25 classes are considered for examination.

Techniques used:
Pre-Processing: The input image is used to decide which disease to detect. For feature extraction, the image is converted to grayscale.
Segmentation: The k-means clustering method is used to divide the input image into sections.
Feature Extraction: In the second phase, the GLCM algorithm is used to extract image features and store them in a database. GLCM is a tabular description that indicates how often different combinations of pixel intensity values occur within an image. The algorithm determines the total number of pixels within the image matrix and saves the computed pixels. The similarity of pixels in the matrix is compared using the histogram procedure, the dissimilarity factor is determined from the matrix, and the elements are normalized by dividing the pixels.
Classification: Once the features are extracted, the KNN algorithm is used to separate the classes and identify the disease shown in the image. An accuracy of 93% is achieved in identifying the disease of a plant.

Paper-3: Plant Leaf Disease Recognition Using Random Forest, KNN, SVM and CNN
Authors: Bijaya Kumar Hatuwal, Aman Shakya, and Basanta Joshi
Link: https://www.researchgate.net/profile/BijayaHatuwal/publication/351708837_Plant_Leaf_Disease_Recognition_Using_Random_Forest_KNN_SVM_and_CNN/links/60a5d05092851c43da02c7d5/Plant-Leaf-DiseaseRecognition-Using-Random-Forest-KNN-SVM-and-CNN.pdf
The dataset and techniques used in this paper are detailed below.

Dataset: The study used JPG images covering various plant species and diseases, sourced from the Kaggle Plant Village dataset. The dataset is divided into a ‘train’ folder, which has images for training ML models, and a ‘Valid’ folder, which has images for model validation. The split ratio is 80% of the data for training and 20% for testing. The dataset covered various categories as shown below.

Techniques Used:
In this study, ten properties are extracted from color and texture as features from the images.
The mean and standard deviation of each color channel (red, green, and blue) are calculated for each image. To reduce noise, the images are pre-processed by first converting them to grayscale and then applying blurring. The study employs the Haralick texture features algorithm to extract texture-based features from the grayscale images, including contrast, correlation, entropy, and inverse difference moment. These color and texture features are the input to machine learning models, namely K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Random Forest Classifier (RFC), which predict and classify plant diseases from images. At prediction time, the file path of an image is provided as input and feature extraction is performed on it.

KNN can be used for both classification and regression. In the KNN model, k = 5 is selected, giving an accuracy of 76.96%. Although the highest accuracy occurs at k = 1, that value is not used, to avoid over-reliance on a single nearest neighbor's vote. An elbow criterion plot of the mean error is also made. For KNN on the test images, the weighted averages of precision, recall, and F1-score are 0.78, 0.77, and 0.77 respectively, with a support of 5914. The Random Forest model, built with 250 estimators, reaches an accuracy of 87.436%. SVM produces an accuracy of 78.61%, and the Convolutional Neural Network (CNN) model has a training accuracy of 97.89%. Among all the models, CNN produced the highest accuracy.

Conclusion: Future work in this research domain involves expanding the dataset to include a wider variety of plant species with various textures and diseases. Hyperparameter optimization for the machine learning models can be improved through techniques such as grid search, allowing more efficient model tuning and potentially better predictive performance.
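To illustrate why Paper-3 avoids k = 1 and inspects the error across several k values, here is a minimal, dependency-free sketch of KNN majority voting and a per-k error rate. The function names `knn_predict` and `error_rate`, the two-dimensional feature vectors, and the labels are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Rank training samples by Euclidean distance to the query, keep the closest k.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    # On a tied vote, Counter keeps first-seen order, so the closer label wins.
    return votes.most_common(1)[0][0]

def error_rate(train_X, train_y, test_X, test_y, k):
    """Fraction of test samples that KNN misclassifies for a given k."""
    wrong = sum(knn_predict(train_X, train_y, x, k) != y
                for x, y in zip(test_X, test_y))
    return wrong / len(test_y)

# Hypothetical 2-D feature vectors (e.g. two texture features per leaf).
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9)]
# The sample at (0.3, 0.3) is a deliberately noisy label inside the healthy cluster.
train_y = ["healthy", "healthy", "healthy", "diseased",
           "diseased", "diseased", "diseased"]
test_X = [(0.25, 0.25), (1.0, 0.95)]
test_y = ["healthy", "diseased"]

# Values one would plot for an elbow-style comparison of k.
for k in (1, 3, 5, 7):
    print(k, error_rate(train_X, train_y, test_X, test_y, k))
```

On this toy data a single noisy neighbor misleads k = 1, while k = 3 and k = 5 outvote it. With k = 7 every training sample votes, so the larger class always wins. An elbow-style comparison is meant to expose both failure modes.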
Paper-4: Review on Emerging Trends in Detection of Plant Diseases using Image Processing with Machine Learning
Authors: Punitha Kartikeyan, Gyanesh Shrivastava
Link: https://www.researchgate.net/publication/348541626_Review_on_Emerging_Trends_in_Detection_of_Plant_Diseases_using_Image_Processing_with_Machine_Learning
The datasets and techniques discussed in this paper are detailed below.

Dataset: This paper is a review that gives an overview of various papers and classification techniques. One of the papers it reviews is "A Color and Texture Based Approach for the Detection and Classification of Plant Leaf Disease Using KNN Classifier," which used 237 leaf images sourced from the Arkansas plant disease database. Another is "Recognition of diseases in paddy leaves using knn classifier," which used 300 images of diseased paddy plants.

Techniques used: The detection of plant disease involves five major steps: image acquisition, image pre-processing, image segmentation, feature extraction, and classification. The paper discusses techniques for classifying plant diseases with classifiers such as Support Vector Machine (SVM), Artificial Neural Network (ANN), K-Nearest Neighbors (KNN), and other methods.

K-Nearest Neighbors: Researchers have used KNN to detect plant diseases. For instance, Xu et al. used it to identify nitrogen and potassium deficiencies in tomato plants. They extracted texture features with the Fourier transform and color features in the LAB color space, selected the best features using a genetic algorithm, and applied a fuzzy version of KNN. The system achieved an accuracy of over 82.5% in diagnosing these deficiencies.
a) In one of the studies, Suresha et al. focused on paddy plant diseases such as Blast and Brown Spot, using about 300 images of affected paddy plants.
They employed image segmentation techniques to separate diseased areas from healthy ones and extracted features related to the shape of the affected regions. By applying KNN, they achieved a disease classification accuracy of 76.59%.
b) Another study, by Hossain et al., targeted various plant diseases, including Alternaria Alternata, Anthracnose, Bacterial Blight, Leaf Spot, and Canker. They used 237 leaf images from the Arkansas plant disease database, applying segmentation to isolate the diseased regions. Feature extraction was done using the Gray-Level Co-occurrence Matrix (GLCM), and 5-fold cross-validation was used to avoid overfitting. This approach resulted in a disease detection accuracy of 96.76%.
c) Arivazhagan et al. adopted an image analysis technique for disease identification in plant leaves, achieving an accuracy of 94.74%. Al-Hiary et al. employed Otsu segmentation and K-means clustering to identify plant and stem diseases, using color co-occurrence for feature extraction. Their approach achieved precision values ranging from 83% to 94%.

Conclusion and Future Work: Among the various classification techniques, SVM and ANN methods have been widely recognized for their high accuracy in plant disease detection. Future advancements may involve hybrid algorithms that integrate genetic algorithms, ant colony optimization, cuckoo optimization, and particle swarm optimization with SVM, ANN, and KNN, promising more efficient disease detection. Mobile applications with built-in remedial solutions could help farmers easily detect various plant issues, from leaf and stem diseases to nutrient deficiencies.
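The 5-fold cross-validation mentioned for the Hossain et al. study can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure from the reviewed paper: the helper name `k_fold_indices` is hypothetical, samples are split in their stored order, and a real experiment would normally shuffle (and ideally stratify by class) before splitting.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs so each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 237 is the image count reported for the Arkansas data; k = 5 as in the study.
folds = list(k_fold_indices(237, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

Averaging the accuracy over the five held-out folds gives an estimate that does not depend on a single train/test split, which is why the technique is described as a guard against overfitting.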
Paper-5: Rice Leaf Disease Image Classifications Using KNN Based on GLCM Feature Extraction
Authors: R A Saputra, Suharyanto, S Wasiyanti, D F Saefudin, A Supriyatna, A Wibowo
Link: https://iopscience.iop.org/article/10.1088/1742-6596/1641/1/012080/pdf
The dataset and techniques used in this paper are detailed below.

Dataset: The dataset has 120 images of rice leaf disease from the UCI repository. The paper classifies images of three rice leaf diseases, namely Bacterial leaf blight, Brown spot, and Leaf smut.

Summary: First, features for texture analysis are extracted using the GLCM method, with five feature values: contrast, energy, entropy, homogeneity, and correlation. Classification is then carried out using KNN, which is chosen for its simplicity of implementation and high performance.

Techniques used:
Feature extraction: GLCM is used to extract features for texture analysis. The matrix holds the probability that two pixels with given intensities occur at a certain distance and angular orientation from each other in the image; the two pixels are separated by distance d at orientation angle θ. The features Contrast, Energy, Entropy, Homogeneity, and Correlation are then calculated.
KNN Algorithm: After feature extraction, the dataset is divided into 10 parts for ten-fold cross-validation, in which each part in turn serves as the testing data while the remaining parts serve as the training data.
Measurement of accuracy of algorithms: The confusion matrix is used to measure the performance of the KNN algorithm. From the values in the confusion matrix, the accuracy and kappa values of the model can be calculated.
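As a worked illustration of how accuracy and kappa follow from a confusion matrix, the sketch below computes both for a three-class case. The function name `accuracy_and_kappa` is hypothetical and the counts are made up for illustration (they are not the paper's results); kappa here is Cohen's kappa, assumed to be the statistic the paper refers to.

```python
def accuracy_and_kappa(cm):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    # Observed agreement: share of samples on the diagonal (this is the accuracy).
    observed = sum(cm[i][i] for i in range(len(cm))) / total
    row_totals = [sum(row) for row in cm]
    col_totals = [sum(col) for col in zip(*cm)]
    # Agreement expected by chance from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Made-up counts for three classes with 40 images each (120 in total).
cm = [
    [30, 5, 5],   # true: bacterial leaf blight
    [6, 28, 6],   # true: brown spot
    [4, 8, 28],   # true: leaf smut
]
accuracy, kappa = accuracy_and_kappa(cm)
print(f"accuracy={accuracy:.4f} kappa={kappa:.4f}")
```

Kappa discounts the agreement that would occur by chance, so for balanced classes it is noticeably lower than the raw accuracy.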
Conclusion: GLCM is used for feature extraction and the KNN algorithm for classification of rice leaf disease, with the best k searched over k = 1 to 20. The paper's experiments show that k = 11 gives the highest accuracy among the tested k values, 65.83%, with a kappa of 0.485.

III. From your readings, summarize the techniques that can be applied to your dataset highlighting the pros and cons for each.
We will use the KNN algorithm and the GLCM algorithm. GLCM will be used for feature extraction from the leaf images, and KNN will be used to classify the leaves into various classes.

Advantages for considering KNN algorithm:
KNN performs well in multi-class classification problems, making it a suitable choice for image datasets with multiple categories or classes. It is adaptive, meaning it can adapt to different datasets with complex and nonlinear patterns. KNN also provides transparency in its decision-making process: the classification decisions can be visualized and interpreted by examining the k nearest neighbors of a test sample.

Advantages for using GLCM for feature extraction:
GLCM is highly effective at capturing texture information within an image. It characterizes the relationships between pixel values, allowing it to distinguish between various textures, patterns, and structures. It focuses on local image regions and is extremely sensitive to small-scale variations in texture, which makes it well suited to applications where detailed local texture analysis is needed. It is supported by various programming languages and libraries, making it easy to access.

IV. Methodology
Implementation will be done in a Jupyter notebook (Python) using libraries such as pandas, numpy, sklearn, and matplotlib. The methodology below will be followed step by step:
Pre-Processing: First, each image is resized so that all photos have a consistent size.
Feature Extraction: Next, the feature extraction stage identifies the most discriminating characteristics, which a machine learning algorithm can consume more easily. We use the GLCM technique for this.
Classification: We use the KNN algorithm to divide the leaf images into different classes and identify the disease according to the extracted features.
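The GLCM feature extraction step can be sketched as follows. This is a minimal, dependency-free illustration and not the final implementation: it assumes the image is already grayscale and quantized to a small number of gray levels, uses a single offset of one pixel to the right (d = 1, θ = 0°), and the helper names `glcm` and `texture_features` are hypothetical. Definitions of energy and homogeneity vary slightly between libraries, so the formulas used here are stated in the comments.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for the pixel offset (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            # Count the pair only if the neighbor lies inside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    # Divide by the number of pairs so the entries are probabilities.
    return [[n / total for n in row] for row in counts]

def texture_features(p):
    """Contrast, energy, and homogeneity of a normalized GLCM `p`."""
    idx = range(len(p))
    return {
        # Contrast: intensity difference between neighbors, weighted by (i - j)^2.
        "contrast": sum(p[i][j] * (i - j) ** 2 for i in idx for j in idx),
        # Energy: taken here as the sum of squared entries.
        "energy": sum(p[i][j] ** 2 for i in idx for j in idx),
        # Homogeneity: taken here as the sum of p / (1 + (i - j)^2).
        "homogeneity": sum(p[i][j] / (1 + (i - j) ** 2) for i in idx for j in idx),
    }

# Two toy 4x4 images with 2 gray levels: wide stripes versus a checkerboard.
stripes = [[0, 0, 1, 1] for _ in range(4)]
checker = [[(r + c) % 2 for c in range(4)] for r in range(4)]

stripes_features = texture_features(glcm(stripes, levels=2))
checker_features = texture_features(glcm(checker, levels=2))
print(stripes_features)
print(checker_features)
```

The checkerboard changes intensity at every horizontal step, so its contrast is higher and its homogeneity lower than the striped image. Feature vectors of this kind, computed per leaf image, are what the KNN classifier would receive as input.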