Comparing Diabetes Prediction in the Pima Indians Dataset

Comparing Decision Trees, Artificial Neural
Networks, and Support Vector Machines for
Diabetes Prediction in the Pima Indians Dataset
Sherief Elsowiny
TSYS School of Computer Science
Columbus State University
Columbus, GA, USA
Abstract—Diabetes is a major global health issue, affecting
millions of people worldwide. Early prediction of diabetes can help
in effective management and treatment, leading to improved
patient outcomes. In this study, I compared three machine
learning models, Artificial Neural Networks (ANN), Decision
Trees (DT), and Support Vector Machines (SVM), for diabetes
prediction using the Pima Indians Diabetes dataset. Preprocessing
consisted of normalizing the features and splitting the data
into training and testing sets. The models were trained on the
training set, and their performance was evaluated using accuracy,
precision, recall, F1-score, and confusion matrices. The SVM
model achieved the highest accuracy of 74.89%, followed by the
ANN model with 74.03% accuracy, and the DT model with 70.56%
accuracy. Model optimization was performed using
hyperparameter tuning techniques like GridSearchCV and
RandomizedSearchCV, which improved model performance. The
study also analyzed feature importance and visualizations to better
understand the data and the performance of the models.
Hyperparameter Optimization, Model Evaluation, Feature
Importance, Feature Scaling, Model Visualization, Imbalanced
Diabetes is a major global health issue, affecting millions
of people worldwide. Early prediction of diabetes can help in
effective management and treatment, leading to improved
patient outcomes. Machine learning techniques have shown
great promise in the early detection of various diseases,
including diabetes [1]. In this study, I compared three machine
learning models, Artificial Neural Networks (ANN), Decision
Trees, and Support Vector Machines (SVM), for diabetes
prediction using the Pima Indians Diabetes dataset.
A study by Khanam and Foo [1] applied ANN and SVM
models to the Pima Indians Diabetes dataset and reported
promising results. Another study by Nam [3] investigated the
use of Decision Trees for diabetes prediction using the same
dataset. These studies have demonstrated the potential of
machine learning techniques in predicting diabetes; however,
there is still room for improvement and exploration of other
methods, such as feature selection and optimization.
In this study, I seek to compare the performance of ANN,
Decision Trees, and SVM models for diabetes prediction
using the Pima Indians Diabetes dataset. Moreover, I
investigate the use of hyperparameter tuning techniques, such
as GridSearchCV and RandomizedSearchCV, for model
optimization to enhance performance. The study also analyzes
feature importance and visualizations to better understand the
data and the performance of the models.
In the subsequent sections, I describe the proposed
method, present experimental data and results, and draw
conclusions from the findings.
A. Model Selection
The three models chosen for this study, Artificial Neural
Networks (ANN), Decision Trees (DT), and Support Vector
Machines (SVM), were selected based on their popularity and
proven effectiveness in solving classification problems,
particularly in the healthcare domain.
In recent years, machine learning has been increasingly
applied to various aspects of healthcare, including early
disease detection, diagnosis, personalized treatment planning,
and prognosis [1]. Among the numerous machine learning
techniques, ANN, Decision Trees, and SVM have gained
popularity due to their effectiveness in solving classification
problems [1]. ANN is an artificial intelligence technique that
mimics the human brain's neural networks, while Decision
Trees and SVM are statistical learning methods that aim to
find the best decision boundaries for classification [2].
1) Artificial Neural Networks (ANN): ANN is a
computational model inspired by the human brain's neural
networks. It consists of interconnected nodes or neurons that
process and transmit information. ANN has shown great
potential in pattern recognition and classification tasks due
to its ability to learn complex, nonlinear relationships
between features and outcomes. One of the strengths of ANN
is its ability to generalize from the training data, making it
robust to noise and variations in the input data. However, a
potential weakness of ANN is the risk of overfitting, especially
when the network architecture is too complex or when there
is insufficient training data.
Previous studies have explored the application of machine
learning techniques for diabetes prediction [1, 3].
2) Decision Trees (DT): DT is a hierarchical, tree-like
structure that recursively splits the input space based on
feature values to make decisions or predictions. DT is an
interpretable model, allowing for easy visualization and
understanding of the decision-making process. This
transparency is particularly useful in medical applications
where interpretability is crucial for decision-making. One of
the strengths of DT is its simplicity and ease of
implementation. However, a major weakness is its tendency
to overfit, especially when the tree becomes too deep or
complex. Pruning techniques are often employed to address
this issue.
3) Support Vector Machines (SVM): SVM is a supervised
learning algorithm that aims to find the best decision
boundary or hyperplane that separates the data into different
classes. SVM is particularly effective when dealing with high-dimensional data and has shown robust performance in
various classification tasks. The primary strength of SVM lies
in its ability to handle nonlinear relationships through the use
of kernel functions. However, one weakness of SVM is its
sensitivity to the choice of the kernel function and its
parameters, which may require extensive fine-tuning to
achieve optimal performance.
B. Data Preprocessing
Data preprocessing is a crucial step in any machine
learning project, as it ensures that the input data is clean, well-structured, and suitable for model training. In this study, the
following preprocessing steps were employed:
Feature Scaling: The input features were scaled to a common
range so that features with large numeric values do not
dominate those with small ones. This is particularly important
for algorithms like SVM, which are sensitive to the scale of
the input data.
Train-Test Splitting: The dataset was split into training and
testing sets so that the models could be evaluated on unseen
data. Held-out evaluation measures how well a model
generalizes and reveals overfitting.
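The two preprocessing steps above could be sketched as follows with scikit-learn. This is an illustration under stated assumptions, not the study's actual code. The data is a synthetic stand-in with the dataset's shape (768 records, 8 features). The 30% test fraction is an assumption; the reported accuracies are consistent with a 231-sample test set, which is what a 30% hold-out of 768 records yields. Fitting the scaler on the training split only is a common precaution against information leakage rather than something the text specifies.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption): 768 rows, 8 features,
# roughly one third positive labels.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(768, 8))
y = (rng.random(768) < 0.35).astype(int)

# Hold out 30% of the rows for testing (assumed fraction); stratify keeps
# the class ratio similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Learn the scaling statistics from the training split only, then apply
# the same transformation to both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)  # (537, 8) (231, 8)
```

After scaling, each training feature has zero mean and unit variance, while the test features are only approximately standardized because they reuse the training statistics.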
C. Feature Visualization
To better understand the data and its distribution, I
visualized the features using distribution plots and a
correlation heatmap. The distribution plots show how the
values of each feature are spread across the dataset,
revealing any trends or patterns. The correlation heatmap
shows the pairwise correlation between features, which helps
flag potential multicollinearity issues.
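The numbers behind such a correlation heatmap can be computed directly. The sketch below is illustrative only: the data is synthetic, the column names are those of the commonly distributed CSV version of the dataset (an assumption, since the text does not list them), and one pair of columns is deliberately made correlated so that the check has something to find. The 0.7 threshold is likewise an arbitrary choice.

```python
import numpy as np

# Column names as found in the widely shared CSV (assumption).
feature_names = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]

# Synthetic data; the last column is built to track the first one so that
# the example contains one strongly correlated pair.
rng = np.random.default_rng(0)
X = rng.normal(size=(768, 8))
X[:, 7] = 0.8 * X[:, 0] + 0.2 * rng.normal(size=768)

# Pearson correlation between every pair of columns (8 x 8 matrix).
corr = np.corrcoef(X, rowvar=False)

# List feature pairs whose absolute correlation exceeds the threshold.
threshold = 0.7
strong_pairs = [
    (feature_names[i], feature_names[j], round(float(corr[i, j]), 2))
    for i in range(len(feature_names))
    for j in range(i + 1, len(feature_names))
    if abs(corr[i, j]) > threshold
]
print(strong_pairs)
```

The `corr` matrix is what a heatmap function would render; listing the pairs above a threshold gives the same multicollinearity signal in text form.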
In the experiment, I compared the performance of
three machine learning models, Artificial Neural Networks
(ANN), Decision Trees (DT), and Support Vector Machines
(SVM), for diabetes prediction using the Pima Indians
Diabetes dataset. The results of the experiment demonstrated
varying performance levels among the models. The SVM
model outperformed the other models with an accuracy of
74.89%, closely followed by the ANN model with an accuracy
of 74.03%, and finally the Decision Tree model with an
accuracy of 70.56%.
A detailed examination of the classification reports
showed that the SVM model not only achieved the highest
accuracy but also had the highest class 1 F1-score (0.63) and
tied the ANN model for the highest class 0 F1-score (0.81).
This indicates a balanced performance between precision and
recall for the SVM model. The Decision Tree model, on the
other hand, had the highest recall for class 1 (0.70), meaning
that it identified the largest proportion of actual positive
cases. However, its precision for class 1 was the lowest
(0.56), meaning that it also misclassified more negative cases
as positive.
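The relationship between precision, recall, and F1-score discussed above can be checked with a few lines of arithmetic. The helper names below are my own, and the true-positive, false-positive, and false-negative counts are hypothetical values chosen only to reproduce a precision of 0.56 and a recall of 0.70; the study's confusion matrices are not reproduced here.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics for one class from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    return precision, recall, f1_from(precision, recall)


# Decision Tree, class 1, using the reported precision and recall.
print(round(f1_from(0.56, 0.70), 2))  # 0.62

# Hypothetical counts that give the same precision and recall.
p, r, f = precision_recall_f1(tp=56, fp=44, fn=24)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.56 0.7 0.62
```

Because the reported precision and recall are themselves rounded, recomputing F1 from them can differ from a reported F1-score by 0.01.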
The experiment's results highlight the potential of machine
learning techniques, particularly SVM and ANN, in predicting
diabetes cases. These models can contribute to early diagnosis
and treatment planning for patients, ultimately improving
patient outcomes. The Decision Tree model, although not as
accurate as the other models, still offers valuable insights into
feature importance and could be further optimized to enhance
its performance. Additionally, the optimized versions of the
models showed improvements in their respective
performances, demonstrating the benefits of optimization
techniques in machine learning applications.
A. Artificial Neural Network (ANN)
The ANN model performed satisfactorily in predicting
diabetes, with an accuracy of 74.03%. Its precision, recall,
and F1-score for class 0 were 0.79, 0.83, and 0.81,
respectively, while for class 1, these values were 0.64, 0.57,
and 0.61. These results indicate that the ANN model can
effectively classify diabetes cases while maintaining a balance
between false positives and false negatives.
B. Decision Tree (DT)
1) Model Performance
The DT model showed moderate performance in
diabetes prediction, achieving an accuracy of 70.56%. Its
precision, recall, and F1-score for class 0 were 0.82, 0.71, and
0.76, respectively, while for class 1, these values were 0.56,
0.70, and 0.62. Compared to the ANN model, its lower class 1
precision (0.56 versus 0.64) indicates more false positives,
while its higher class 1 recall (0.70 versus 0.57) indicates
fewer false negatives.
2) Feature Importance
The feature importance analysis for the Decision Tree
model was conducted to better understand the contribution of
each feature in the classification process. This analysis helps
identify the most relevant features and potentially improve
model performance by focusing on these features during the
training process.
C. Support Vector Machines (SVM)
The SVM model achieved the highest accuracy of the
three baseline models, 74.89%. Its precision, recall, and
F1-score for class 0 were 0.80, 0.81, and 0.81, respectively,
while for class 1, these values were 0.64, 0.62, and 0.63. Like
the ANN model, this model can effectively classify diabetes
cases and maintains a balance between false positives and
false negatives.
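The per-model evaluation reported above (accuracy, per-class precision, recall, and F1-score, plus feature importances for the Decision Tree) could be reproduced along the following lines. This is a hedged sketch: the data is generated with make_classification as a stand-in for the real dataset, the model settings are scikit-learn defaults or arbitrary choices rather than the study's configuration, and MLPClassifier is assumed as the ANN implementation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# The ANN and SVM are wrapped in a pipeline so that scaling is learned on
# the training data only; the tree does not need scaled inputs.
models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0),
    ),
    "SVM": make_pipeline(StandardScaler(), SVC()),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies[name] = accuracy_score(y_test, y_pred)
    print(f"{name}: accuracy = {accuracies[name]:.4f}")
    # Per-class precision, recall, and F1-score.
    print(classification_report(y_test, y_pred, digits=2))

# Impurity-based importances of the fitted tree, one value per feature.
importances = models["DT"].feature_importances_
print(importances.round(3))
```

The importances sum to 1, so each value can be read as the share of the tree's total impurity reduction attributed to that feature.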
A. ANN Optimization
The ANN model optimization process involved three main
stages, which are detailed below:
1. Hidden Layer Optimization
To optimize the hidden layer architecture, I experimented with
various architectures to find the one that delivered the best
performance. A graph was plotted to visualize the
relationship between the number of hidden units and model
accuracy. The architectures tested included single-layer
networks with 64 and 128 hidden units, as well as a two-layer
network with 64 hidden units in each layer. The results
indicated that the optimal architecture consisted of a single
hidden layer with 128 units, which provided the highest
accuracy among the tested architectures.
2. Hyperparameter Optimization
In addition to optimizing the hidden layer architecture, I ran a
comprehensive search for the best combination of
hyperparameters using the RandomizedSearchCV method from
the scikit-learn library. The parameters considered in this
search included the activation function (relu or tanh), the
learning rate (0.001 or 0.01), and the regularization
parameter alpha (0.0001, 0.001, or 0.01). The best
combination of hyperparameters was found to be a tanh
activation function, a learning rate of 0.001, and an alpha
value of 0.001. This optimized ANN model achieved an
accuracy of 74.89%.
3. Training and Validation Loss Curves
The training process of the ANN model was visualized by
plotting both the training and validation loss curves. These
curves provided insights into the model's learning process
and helped identify potential issues such as overfitting or
underfitting. The training and validation loss curves were
close together and steadily decreasing, indicating that the
model was learning effectively without overfitting.
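The randomized hyperparameter search over activation function, learning rate, and alpha described for the ANN could look roughly like this. The candidate values are the ones listed in the text and the hidden layer is fixed at 128 units as reported; everything else is an assumption for illustration: MLPClassifier as the network implementation, the pipeline step names, the synthetic data, n_iter, and the number of folds.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-in so the sketch runs quickly (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Scaling is part of the pipeline so it is refit inside every CV fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("mlp", MLPClassifier(hidden_layer_sizes=(128,), max_iter=300,
                          random_state=0)),
])

# Candidate values taken from the text; "mlp__" targets the pipeline step.
param_distributions = {
    "mlp__activation": ["relu", "tanh"],
    "mlp__learning_rate_init": [0.001, 0.01],
    "mlp__alpha": [0.0001, 0.001, 0.01],
}

# Sample 5 of the 12 possible combinations and score each with 3-fold CV.
search = RandomizedSearchCV(
    pipe, param_distributions, n_iter=5, cv=3,
    scoring="accuracy", random_state=0,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

With only 12 combinations in total, an exhaustive GridSearchCV would also be affordable here; the randomized variant mainly pays off when the search space is larger.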
B. Decision Tree Optimization
To improve the performance of the decision tree model, I
tuned two hyperparameters: the maximum depth and the
minimum number of samples per leaf. I tested different
combinations of max_depth and min_samples_leaf values and
assessed their impact on the model's accuracy. The best
combination I found was max_depth = 5 and
min_samples_leaf = 2, resulting in an accuracy of 75.76%.
I also attempted to balance the dataset by using an equal
number of samples for each class (diabetic and non-diabetic).
However, the accuracy of the model trained on the balanced
dataset was 68.32%, which was lower than the accuracy
achieved using the original dataset.
By visualizing the optimized decision tree with the best
parameters (max_depth = 5 and min_samples_leaf = 2), I
observed a cleaner and more interpretable decision-making
structure. This pruned decision tree performs better than the
initial one, providing a more robust model.
C. SVM Optimization
For the support vector machine (SVM) model, I used a
heatmap visualization to identify the best hyperparameters by
analyzing the relationship between the C and gamma
parameters. The optimal parameters were found to be C = 1
and gamma = 0.01. The optimized SVM model achieved an
accuracy of 75.32%. The heatmap provided valuable insights
into how the model's performance varied with different
combinations of C and gamma values, helping to identify the
region where the model performed best. The heatmap
visualization can be found in the "SVM Visualization"
By optimizing the hyperparameters and analyzing
the respective visualizations, I was able to improve the
performance of the decision tree (DT), artificial neural
network (ANN), and support vector machine (SVM) models,
achieving accuracies of 75.76%, 74.89%, and 75.32%,
respectively. The visualizations played a crucial role in
understanding the impact of different hyperparameters on the
models' performances and provided insights into the optimal
The decision tree model experienced the most
significant improvement in performance after optimization.
Its accuracy rose from 70.56% to 75.76% after tuning the
max_depth and min_samples_leaf parameters, a gain of about
5.2 percentage points. This improvement indicates
the effectiveness of the optimization process.
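The decision tree tuning over max_depth and min_samples_leaf can be sketched with GridSearchCV. The candidate values below are assumptions (the text reports only the best pair, 5 and 2), the data is synthetic, and selecting by 5-fold cross-validation on the training split is one reasonable protocol rather than a description of what was done in the study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=768, n_features=8, n_informative=5,
    weights=[0.65, 0.35], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Hypothetical candidate values; None means the depth is unlimited.
param_grid = {
    "max_depth": [3, 5, 7, 10, None],
    "min_samples_leaf": [1, 2, 5, 10],
}

# Try every combination, scoring each by 5-fold CV accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid, cv=5, scoring="accuracy",
)
search.fit(X_train, y_train)

# After the search, the best tree is refit on the full training split and
# can be scored on the held-out test data.
test_accuracy = search.score(X_test, y_test)
print(search.best_params_)
print(round(test_accuracy, 4))
```

Limiting depth and requiring more samples per leaf both restrict how finely the tree can partition the training data, which is why these two parameters act as a form of pruning.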
In comparison, the artificial neural network and support
vector machine models showed negligible improvements in
accuracy after optimization. The ANN model's accuracy
increased slightly from 74.36% to 74.89% after tuning the
learning rate and hidden layer sizes.
Similarly, the SVM model's accuracy increased from 74.46%
to 75.32% after optimizing the C and gamma parameters.
Although these improvements are minor, they still
demonstrate the importance of hyperparameter tuning in
achieving better model performance.
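For the SVM, the C and gamma search and the score grid behind a heatmap could be produced as follows. This is a sketch under assumptions: an RBF kernel (the text does not name the kernel, but gamma applies to it), hypothetical candidate values that include the reported optimum of C = 1 and gamma = 0.01, synthetic data, and 5-fold cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, random_state=0
)

# Hypothetical candidate values on a logarithmic scale.
C_values = [0.1, 1, 10, 100]
gamma_values = [0.001, 0.01, 0.1, 1]

pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])
search = GridSearchCV(
    pipe,
    {"svc__C": C_values, "svc__gamma": gamma_values},
    cv=5, scoring="accuracy",
)
search.fit(X, y)

# Arrange the mean CV accuracy into a (C, gamma) grid; rows follow
# C_values and columns follow gamma_values.
scores = np.zeros((len(C_values), len(gamma_values)))
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    i = C_values.index(params["svc__C"])
    j = gamma_values.index(params["svc__gamma"])
    scores[i, j] = mean_score

print(scores.round(3))
print(search.best_params_)
```

The `scores` array is the matrix a plotting library would draw as a heatmap; its largest entry corresponds to the best C and gamma pair found by the search.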
A bar graph comparing the accuracies of the six models (three
optimized and three not optimized) further illustrates the
improvements achieved through the optimization process.
The optimized models consistently outperform their non-optimized counterparts, showcasing the value of
hyperparameter tuning and model optimization in
maximizing the predictive power of machine learning models.
One limitation of the current study is the use of a
single dataset for model training and evaluation. This could
limit the generalizability of my findings to other populations
with diabetes. Moreover, challenges in optimizing
hyperparameters for the ANN model, such as the number of
hidden units and learning rate, were encountered during the
research. The impact of class imbalance on model
performance should also be considered, as it could potentially
lead to biased predictions.
Future research could explore other machine learning models
for diabetes prediction or focus on incorporating additional
features and datasets to improve the predictive performance.
Integrating clinical knowledge in the modeling process could
also help enhance the interpretability and applicability of the
models in real-world scenarios.
In this study, I compared the performance of three
machine learning models, Artificial Neural Networks (ANN),
Decision Trees (DT), and Support Vector Machines (SVM),
for diabetes prediction using the Pima Indians Diabetes
dataset. I also analyzed the feature importance for the DT
model and visualized the relationship between features using
distribution plots and a correlation heatmap. The results of
my comparison indicate that the ANN and SVM models
perform similarly well in predicting diabetes cases, while the
DT model exhibits slightly lower performance.
These findings suggest that machine learning
techniques, particularly ANN and SVM, can be effectively
applied to diabetes prediction and could contribute to early
diagnosis and treatment planning for patients. As part of
future work, I could explore the use of ensemble methods to
combine the strengths of different models, potentially leading
to improved prediction performance. Additionally,
incorporating feature selection techniques may help in
identifying the most relevant features and reducing the
complexity of the models. Expanding the scope of this study
to include additional features or datasets could also improve
the accuracy and generalizability of the models.
I would like to express my deepest appreciation to my
professor, Dr. Rania Hodhod, for her invaluable guidance,
support, and encouragement throughout the course of this
research. Her expertise and insights have been essential in the
successful completion of this project.
I would also like to extend my gratitude to my brother, Aiman
Elsowiny, for his assistance and constructive feedback during
the development of this study. His contributions have been
greatly appreciated.
