Comparing Decision Trees, Artificial Neural Networks, and Support Vector Machines for Diabetes Prediction in the Pima Indians Dataset

Sherief Elsowiny
TSYS School of Computer Science
Columbus State University
Columbus, GA, USA
elsowiny_sherief@columbusstate.edu

Abstract—Diabetes is a major global health issue, affecting millions of people worldwide. Early prediction of diabetes can help in effective management and treatment, leading to improved patient outcomes. In this study, I compared three machine learning models, Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM), for diabetes prediction using the Pima Indians Diabetes dataset. The dataset was preprocessed, including normalizing the data and splitting it into training and testing sets. The models were trained on the training set, and their performance was evaluated using accuracy, precision, recall, F1-score, and confusion matrices. The SVM model achieved the highest accuracy of 74.89%, followed by the ANN model with 74.03% accuracy and the DT model with 70.56% accuracy. Model optimization was performed using hyperparameter tuning techniques such as GridSearchCV and RandomizedSearchCV, which improved model performance. The study also analyzed feature importance and visualizations to better understand the data and the performance of the models.

Keywords—Pima Indians Diabetes Dataset, Hyperparameter Optimization, Model Evaluation, Feature Importance, Feature Scaling, Model Visualization, Imbalanced Data

I. INTRODUCTION

Diabetes is a major global health issue, affecting millions of people worldwide. Early prediction of diabetes can help in effective management and treatment, leading to improved patient outcomes. Machine learning techniques have shown great promise in the early detection of various diseases, including diabetes [1].
In this study, I compared three machine learning models, Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM), for diabetes prediction using the Pima Indians Diabetes dataset. A study by Khanam and Foo [1] applied ANN and SVM models to the Pima Indians Diabetes dataset and reported promising results. Another study by Nam [3] investigated the use of Decision Trees for diabetes prediction using the same dataset. These studies have demonstrated the potential of machine learning techniques in predicting diabetes; however, there is still room for improvement and exploration of other methods, such as feature selection and optimization. I therefore compare the performance of ANN, Decision Tree, and SVM models on this dataset and investigate the use of hyperparameter tuning techniques, such as GridSearchCV and RandomizedSearchCV, to enhance model performance. The study also analyzes feature importance and visualizations to better understand the data and the performance of the models. In the subsequent sections, I describe the proposed method, present experimental data and results, and draw conclusions from the findings.

II. PROPOSED METHOD

A. Model Selection

The three models chosen for this study, Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM), were selected based on their popularity and proven effectiveness in solving classification problems, particularly in the healthcare domain. In recent years, machine learning has been increasingly applied to various aspects of healthcare, including early disease detection, diagnosis, personalized treatment planning, and prognosis [1]. Among the numerous machine learning techniques, ANN, Decision Trees, and SVM have gained popularity due to their effectiveness in solving classification problems [1].
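In scikit-learn terms, the three model families correspond to the estimators below. This is an illustrative sketch only: the hyperparameter values shown are common defaults or settings mentioned later in the paper, not the exact configurations used in the experiments.

```python
# Illustrative instantiation of the three model families compared in
# this study, using scikit-learn. Hyperparameters are assumptions
# (common defaults), not the study's final settings.
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

ann = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500, random_state=42)
dt = DecisionTreeClassifier(random_state=42)
svm = SVC(kernel="rbf", C=1.0, random_state=42)

# A dictionary like this makes it easy to train and evaluate all three
# models in a single loop.
models = {"ANN": ann, "DT": dt, "SVM": svm}
```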
ANN is an artificial intelligence technique that mimics the human brain's neural networks, while Decision Trees and SVM are statistical learning methods that aim to find the best decision boundaries for classification [2].

1) Artificial Neural Networks (ANN): ANN is a computational model inspired by the human brain's neural networks. It consists of interconnected nodes, or neurons, that process and transmit information. ANN has shown great potential in pattern recognition and classification tasks due to its ability to learn complex, nonlinear relationships between features and outcomes. One of the strengths of ANN is its ability to generalize from the training data, making it robust to noise and variations in the input data. However, a potential weakness of ANN is the risk of overfitting, especially when the network architecture is too complex or when there is insufficient training data.

2) Decision Trees (DT): DT is a hierarchical, tree-like structure that recursively splits the input space based on feature values to make decisions or predictions. DT is an interpretable model, allowing for easy visualization and understanding of the decision-making process. This transparency is particularly useful in medical applications, where interpretability is crucial for decision-making. One of the strengths of DT is its simplicity and ease of implementation. However, a major weakness is its tendency to overfit, especially when the tree becomes too deep or complex. Pruning techniques are often employed to address this issue.

3) Support Vector Machines (SVM): SVM is a supervised learning algorithm that aims to find the best decision boundary, or hyperplane, that separates the data into different classes. SVM is particularly effective when dealing with high-dimensional data and has shown robust performance in various classification tasks.
The primary strength of SVM lies in its ability to handle nonlinear relationships through the use of kernel functions. However, one weakness of SVM is its sensitivity to the choice of the kernel function and its parameters, which may require extensive fine-tuning to achieve optimal performance.

B. Data Preprocessing

Data preprocessing is a crucial step in any machine learning project, as it ensures that the input data is clean, well-structured, and suitable for model training. In this study, the following preprocessing steps were employed:

Feature Scaling: Feature scaling was performed to normalize the range of the input features, ensuring that each feature contributes equally to the model's performance. This is particularly important for algorithms like SVM, which are sensitive to the scale of the input data.

Train-Test Splitting: The dataset was split into training and testing sets to evaluate the performance of the models on unseen data. This helps to assess the model's generalizability and prevent overfitting.

C. Feature Visualization

To better understand the data and its distribution, I visualized the features using distribution plots and a correlation heatmap. The distribution plots show the distribution of each feature in the dataset, revealing any possible trends or patterns. The correlation heatmap illustrates the relationship between each pair of features, providing insights into potential multicollinearity issues.

III. EXPERIMENTAL DATA AND RESULTS

In the experiment, I compared the performance of three machine learning models, Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM), for diabetes prediction using the Pima Indians Diabetes dataset. The results of the experiment demonstrated varying performance levels among the models.
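The experimental workflow of Section II (scaling, train-test splitting, training, evaluation) can be sketched as follows. This is a minimal illustration, not the study's actual code: the random array stands in for the real Pima Indians data (768 samples, 8 features), and the 70/30 split ratio is an assumption.

```python
# Minimal sketch of the pipeline: split, scale, train, evaluate.
# The random data below is a stand-in for the Pima Indians Diabetes
# dataset (768 rows, 8 features, binary outcome); the split ratio
# and model settings are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))     # stand-in features
y = rng.integers(0, 2, size=768)  # stand-in binary outcome

# Split first, then fit the scaler on the training set only, so no
# information from the test set leaks into preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

model = SVC(kernel="rbf")
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"SVM accuracy: {acc:.4f}")
```

The same scaled split can be reused to fit and score each of the three models in turn, which is what the per-model results below summarize.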
The SVM model outperformed the other models with an accuracy of 74.89%, closely followed by the ANN model with an accuracy of 74.03%, and finally the Decision Tree model with an accuracy of 70.56%. A detailed examination of the classification report revealed that the SVM model not only achieved the highest accuracy but also exhibited better F1-scores for both classes compared to the other models. This indicates a balanced performance between precision and recall for the SVM model. On the other hand, the Decision Tree model had the highest recall for class 1 (0.70), signifying that it identified the highest proportion of actual positive cases. However, its precision for class 1 was the lowest (0.56), implying that it also misclassified more negative cases as positive.

A. Artificial Neural Network (ANN)

The ANN model demonstrated a satisfactory performance in predicting diabetes, with an accuracy of 74.03%. Its precision, recall, and F1-score for class 0 were 0.79, 0.83, and 0.81, respectively, while for class 1, these values were 0.64, 0.57, and 0.61. These results indicate that the ANN model can effectively classify diabetes cases while maintaining a balance between false positives and false negatives.

B. Decision Tree (DT)

1) Model Performance: The DT model showed a moderate performance in diabetes prediction, achieving an accuracy of 70.56%. Its precision, recall, and F1-score for class 0 were 0.82, 0.71, and 0.76, respectively, while for class 1, these values were 0.56, 0.70, and 0.62. Although the model can classify diabetes cases, it may have a higher rate of false positives or false negatives compared to the ANN model.

2) Feature Importance: The feature importance analysis for the Decision Tree model was conducted to better understand the contribution of each feature in the classification process. This analysis helps identify the most relevant features and potentially improve model performance by focusing on these features during the training process.

C. Support Vector Machines (SVM)

The SVM model exhibited an impressive performance, with an accuracy of 74.89%. Its precision, recall, and F1-score for class 0 were 0.80, 0.81, and 0.81, respectively, while for class 1, these values were 0.64, 0.62, and 0.63. This model can effectively classify diabetes cases and maintains a balance between false positives and false negatives, similar to the ANN model.

Overall, the experiment's results highlight the potential of machine learning techniques, particularly SVM and ANN, in predicting diabetes cases. These models can contribute to early diagnosis and treatment planning for patients, ultimately improving patient outcomes. The Decision Tree model, although not as accurate as the other models, still offers valuable insights into feature importance and could be further optimized to enhance its performance. Additionally, the optimized versions of the models showed improvements in their respective performances, demonstrating the benefits of optimization techniques in machine learning applications.

IV. MODEL OPTIMIZATION

A. ANN Optimization

The ANN model optimization process involved three main stages, which are detailed below:

1. Hidden Layer Optimization

To optimize the number of hidden layers, I experimented with various architectures to find the one that delivered the best performance. A graph was plotted to visualize the relationship between the number of hidden units and model accuracy. The architectures tested included single-layer networks with 64 and 128 hidden units, as well as a two-layer network with 64 hidden units in each layer. The results indicated that the optimal architecture consisted of a single hidden layer with 128 units, which provided the highest accuracy among the tested architectures.

2. Hyperparameter Optimization Using RandomizedSearchCV

In addition to optimizing the hidden layer architecture, a comprehensive search for the best combination of hyperparameters was conducted using the RandomizedSearchCV method from the scikit-learn library. The parameters considered in this search included the activation function (relu or tanh), the learning rate (0.001 or 0.01), and the regularization parameter alpha (0.0001, 0.001, or 0.01). The best combination of hyperparameters was found to be a tanh activation function, a learning rate of 0.001, and an alpha value of 0.001. This optimized ANN model achieved an accuracy of 74.89%.

3. Training and Validation Loss Curves

The training process of the ANN model was visualized by plotting both the training and validation loss curves. These curves provided insights into the model's learning process and helped identify potential issues such as overfitting or underfitting. The training and validation loss curves were close together and steadily decreasing, indicating that the model was learning effectively without overfitting.

B. Decision Tree Optimization

To improve the performance of the decision tree model, I focused on optimizing its hyperparameters by tuning the maximum depth and minimum samples per leaf. Through experimentation, I tested different combinations of max_depth and min_samples_leaf values and assessed their impact on the model's accuracy. The best combination I found was max_depth = 5 and min_samples_leaf = 2, resulting in an accuracy of 75.76%. I also attempted to balance the dataset by having an equal number of samples for each class (diabetic and non-diabetic). However, the accuracy of the model trained on the balanced dataset was 68.32%, which was lower than the accuracy achieved using the original dataset. By visualizing the optimized decision tree with the best parameters (max_depth = 5 and min_samples_leaf = 2), I observed a cleaner and more interpretable decision-making structure. This pruned decision tree performs better than the initial one, providing a more robust model with better generalization capabilities.

C. SVM Optimization
For the support vector machine (SVM) model, I used a heatmap visualization to identify the best hyperparameters by analyzing the relationship between the C and gamma parameters. The optimal parameters were found to be C = 1 and gamma = 0.01. The optimized SVM model achieved an accuracy of 75.32%. The heatmap provided valuable insights into how the model's performance varied with different combinations of C and gamma values, helping to identify the region where the model performed best. The heatmap visualization can be found in the "SVM Visualization" section.

V. OPTIMIZATION ANALYSIS

By optimizing the hyperparameters and analyzing the respective visualizations, I was able to improve the performance of the decision tree (DT), artificial neural network (ANN), and support vector machine (SVM) models, achieving accuracies of 75.76%, 74.89%, and 75.32%, respectively. The visualizations played a crucial role in understanding the impact of different hyperparameters on the models' performances and provided insights into the optimal configurations.

The decision tree model experienced the most significant improvement in performance after optimization. Initially, the accuracy was lower, but by tuning the max_depth and min_samples_leaf parameters, the model's accuracy increased to 75.76%. This improvement indicates the effectiveness of the optimization process. In comparison, the artificial neural network and support vector machine models showed only minor improvements in accuracy after optimization. The ANN model's accuracy increased slightly from 74.36% to 74.89% after tuning the learning rate and hidden layer sizes. Similarly, the SVM model's accuracy increased from 74.46% to 75.32% after optimizing the C and gamma parameters. Although these improvements are minor, they still demonstrate the importance of hyperparameter tuning in achieving better model performance.
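A grid search of the kind used for the SVM's C and gamma parameters can be sketched with scikit-learn's GridSearchCV. The parameter grid and the random stand-in data below are illustrative assumptions; the study reports C = 1 and gamma = 0.01 as the best combination found.

```python
# Sketch of an exhaustive grid search over the SVM's C and gamma
# parameters. The grid values are illustrative, and the random data
# is a stand-in for the scaled Pima features.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

# cv_results_ holds the mean cross-validated score for every
# (C, gamma) pair, which is the data a heatmap like the one in
# Section IV visualizes.
print(search.best_params_, search.best_score_)
```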
A bar graph comparing the accuracies of the six models (three optimized and three non-optimized) further illustrates the improvements achieved through the optimization process. The optimized models consistently outperform their non-optimized counterparts, showcasing the value of hyperparameter tuning and model optimization in maximizing the predictive power of machine learning algorithms.

VI. LIMITATIONS AND FUTURE WORK

One limitation of the current study is the use of a single dataset for model training and evaluation, which could limit the generalizability of the findings to other populations with diabetes. Moreover, challenges in optimizing hyperparameters for the ANN model, such as the number of hidden units and the learning rate, were encountered during the research. The impact of class imbalance on model performance should also be considered, as it could potentially lead to biased predictions. Future research could explore other machine learning models for diabetes prediction or focus on incorporating additional features and datasets to improve predictive performance. Integrating clinical knowledge into the modeling process could also help enhance the interpretability and applicability of the models in real-world scenarios.

VII. CONCLUSION

In this study, I compared the performance of three machine learning models, Artificial Neural Networks (ANN), Decision Trees (DT), and Support Vector Machines (SVM), for diabetes prediction using the Pima Indians Diabetes dataset. I also analyzed the feature importance for the DT model and visualized the relationships between features using distribution plots and a correlation heatmap. The results of my comparison indicate that the ANN and SVM models perform similarly well in predicting diabetes cases, while the DT model exhibits slightly lower performance.
These findings suggest that machine learning techniques, particularly ANN and SVM, can be effectively applied to diabetes prediction and could contribute to early diagnosis and treatment planning for patients. As part of future work, I could explore the use of ensemble methods to combine the strengths of different models, potentially leading to improved prediction performance. Additionally, incorporating feature selection techniques may help in identifying the most relevant features and reducing the complexity of the models. Expanding the scope of this study to include additional features or datasets could also improve the accuracy and generalizability of the models.

VIII. ACKNOWLEDGEMENTS

I would like to express my deepest appreciation to my professor, Dr. Rania Hodhod, for her invaluable guidance, support, and encouragement throughout the course of this research. Her expertise and insights have been essential in the successful completion of this project. I would also like to extend my gratitude to my brother, Aiman Elsowiny, for his assistance and constructive feedback during the development of this study. His contributions have been greatly appreciated.

IX. REFERENCES

[1] J. J. Khanam and S. Y. Foo, "A comparison of machine learning algorithms for diabetes prediction," ICT Express, vol. 7, no. 4, pp. 432–439, 2021.
[2] S. Mohapatra, J. Swain, and M. Mohanty, "Detection of Diabetes Using Multilayer Perceptron: Proceedings of ICICA 2018," in Lecture Notes in Electrical Engineering, vol. 499, pp. 117–126, 2019, doi: 10.1007/978-981-13-2182-5_11.
[3] H. Nam, "Predicting Diabetes Using Tree-based Methods," Dissertation, 2019.
[4] S. Elsowiny, "Diabetes Prediction using Machine Learning," GitHub repository. Available: https://github.com/elsowiny/diabetesprediction.