Project Description: Diabetes Prediction Using Machine Learning
1. Introduction
This project focuses on predicting diabetes using machine learning techniques, specifically logistic
regression and random forest classifiers. The dataset used is the Pima Indians Diabetes dataset, which
includes various health metrics such as plasma glucose concentration, blood pressure, BMI, age, and
diabetes pedigree function. The goal is to analyze the data, identify key predictors of diabetes, and build
a predictive model to classify individuals as diabetic or non-diabetic.
2. Dataset Overview
The dataset contains 768 entries and 9 columns: 8 numerical features and 1 binary target variable.
- Numerical Features:
  - time_pregnant_no: Number of times pregnant.
  - plasma_concentration: Plasma glucose concentration.
  - diastolic_blood_pressure: Diastolic blood pressure (mm Hg).
  - triceps_skinfold_thickness: Triceps skinfold thickness (mm).
  - serum_insulin: 2-hour serum insulin (μU/mL).
  - bmi: Body mass index (weight in kg/(height in m)^2).
  - diabetes_pedigree: Diabetes pedigree function (genetic influence).
  - age: Age in years.
- Target Variable:
  - class: Binary outcome (0 = No Diabetes, 1 = Diabetes).
Class Distribution
- Non-diabetic (Class 0): 500 cases.
- Diabetic (Class 1): 268 cases.
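These counts correspond to roughly a 65/35 class split. The short calculation below is an illustrative sketch, not part of the original analysis. It makes the imbalance explicit and derives the accuracy that a trivial majority-class predictor would reach, which is a useful reference point when reading the model accuracy reported in Section 4.

```python
# Class counts as reported above.
counts = {0: 500, 1: 268}
total = sum(counts.values())  # 768 entries

# Share of each class in the dataset.
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"class {label}: {counts[label]} cases ({share:.1%})")

# A classifier that always predicts the majority class would reach this accuracy.
majority_baseline = max(shares.values())
print(f"majority-class baseline accuracy: {majority_baseline:.1%}")
```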
3. Data Preprocessing
Handling Missing Values
- Zero values in key features (plasma_concentration, diastolic_blood_pressure,
triceps_skinfold_thickness, serum_insulin, bmi) were treated as missing and replaced with the median of
each column.
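The imputation step can be sketched as follows. This is a minimal illustration, assuming the median is computed over the non-zero values of each column (the report does not say whether zeros were excluded before taking the median). The helper name impute_zeros_with_median and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_zeros_with_median(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Return a copy of df where zeros in `columns` are replaced by the column median."""
    out = df.copy()
    for col in columns:
        # Mark zeros as missing first so they do not drag the median down
        # (assumption: the median is taken over the non-zero values).
        values = out[col].replace(0, np.nan)
        out[col] = values.fillna(values.median())
    return out

# Made-up values for illustration only (not rows from the dataset).
demo = pd.DataFrame({
    "plasma_concentration": [100, 0, 120, 140],
    "bmi": [30.0, 25.0, 0.0, 35.0],
})
cleaned = impute_zeros_with_median(demo, ["plasma_concentration", "bmi"])
print(cleaned)
```

In the demo frame, the zero glucose value becomes 120 (the median of 100, 120, 140) and the zero BMI becomes 30.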
Train-Test Split
- The dataset was split into 70% training and 30% testing sets, stratified to maintain class distribution.
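A stratified 70/30 split of this kind might look like the sketch below. The features are random noise standing in for the real data, and random_state=42 for the split is an assumption (the report only states a seed for the model).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the reported class counts (500 vs. 268).
# The feature values are random noise, not the real measurements.
rng = np.random.default_rng(42)
X = rng.normal(size=(768, 8))
y = np.array([0] * 500 + [1] * 268)

# stratify=y keeps the diabetic share (about 35%) the same in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

print(f"train rows: {len(y_train)}, test rows: {len(y_test)}")
print(f"diabetic share: train={y_train.mean():.3f}, test={y_test.mean():.3f}")
```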
Feature Scaling
- Standard scaling (StandardScaler) was applied to standardize the features (zero mean, unit variance) for logistic regression.
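StandardScaler transforms each feature as z = (x - mean) / standard deviation. The sketch below assumes the scaler is fitted on the training set only and then reused on the test set, which is the usual way to avoid leaking test statistics into training; the report does not state which subset the scaler was fitted on. The arrays are random placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random placeholder data standing in for the train/test feature matrices.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=120, scale=30, size=(100, 3))
X_test = rng.normal(loc=120, scale=30, size=(40, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print("train means after scaling:", X_train_scaled.mean(axis=0).round(6))
print("train stds after scaling: ", X_train_scaled.std(axis=0).round(6))
```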
4. Model Training & Evaluation
Logistic Regression
- Model Configuration:
  - max_iter=1000, random_state=42, class_weight='balanced' (to handle class imbalance).
- Performance Metrics:
  - Accuracy: 73.4%.
  - Precision (Class 1): 60%.
  - Recall (Class 1): 70%.
  - F1-Score (Class 1): 65%.
  - ROC AUC Score: 0.813 (good discriminative ability).
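The configuration and metrics above could be reproduced along the following lines. This is a sketch on synthetic data generated with make_classification, so the printed numbers will not match the reported ones; only the LogisticRegression arguments are taken from the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 768 rows, 8 features, roughly a 65/35 class split.
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           weights=[0.65, 0.35], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Model configuration as stated in the report.
model = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
model.fit(X_train_s, y_train)

y_pred = model.predict(X_test_s)              # hard labels at the 0.5 threshold
y_prob = model.predict_proba(X_test_s)[:, 1]  # probability of class 1

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "roc_auc": roc_auc_score(y_test, y_prob),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

As a consistency check on the reported figures, precision 0.60 and recall 0.70 give F1 = 2(0.60)(0.70)/(0.60 + 0.70) ≈ 0.646, which agrees with the reported 65%.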
ROC Curve
- The model shows good separation between classes, with an AUC of 0.813.
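For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen diabetic case receives a higher score than a randomly chosen non-diabetic case. The toy example below (four made-up scores, not model output) shows that the area under the ROC curve and this pairwise interpretation give the same number.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

# Four made-up examples: true labels and predicted scores for class 1.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# ROC curve points (false positive rate, true positive rate) over all thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly.
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = np.mean(pos[:, None] > neg[None, :])

print(roc_auc, pairwise)  # both 0.75: 3 of the 4 pairs are ranked correctly
```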
Random Forest Classifier
- Feature Importance Analysis:
  - plasma_concentration was the most influential feature.
  - bmi and diabetes_pedigree also contributed significantly.
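A feature importance ranking of this kind is typically read from the fitted forest's feature_importances_ attribute. The sketch below uses synthetic data whose target is deliberately driven by two columns so that a ranking is visible; the hyperparameters (n_estimators=200, random_state=42) are assumptions, since the report does not list the random forest settings.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = [
    "time_pregnant_no", "plasma_concentration", "diastolic_blood_pressure",
    "triceps_skinfold_thickness", "serum_insulin", "bmi",
    "diabetes_pedigree", "age",
]

# Synthetic data: random features, with a made-up target that depends mainly
# on plasma_concentration and, more weakly, on bmi.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(768, 8)), columns=FEATURES)
signal = X["plasma_concentration"] + 0.5 * X["bmi"] + rng.normal(scale=0.5, size=768)
y = (signal > 0.4).astype(int)

# Hyperparameters are illustrative assumptions.
forest = RandomForestClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = pd.Series(forest.feature_importances_, index=FEATURES)
importances = importances.sort_values(ascending=False)
print(importances)
```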
5. Key Findings
Finding 1: Feature Importance
- Logistic Regression Coefficients:
  - Plasma glucose concentration had the highest positive coefficient.
  - BMI and age were also significant predictors.
- Random Forest Importance:
  - Plasma glucose remained the top feature, followed by BMI and diabetes pedigree.
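Because the features were standardized, logistic regression coefficients can be compared directly: each one is the change in log-odds per one standard deviation of the feature, and exp(coefficient) is the corresponding odds ratio. The sketch below ranks coefficients by magnitude on synthetic data with made-up effect sizes; it illustrates the reading of the coefficients, not the values obtained in this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

FEATURES = ["plasma_concentration", "bmi", "age", "diabetes_pedigree"]

# Synthetic data with made-up effect sizes (glucose strongest, pedigree none).
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=FEATURES)
log_odds = 1.2 * X["plasma_concentration"] + 0.6 * X["bmi"] + 0.3 * X["age"] - 0.5
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced")
model.fit(X_scaled, y)

# Rank coefficients by absolute size; the sign gives the direction of the effect.
coefs = pd.Series(model.coef_[0], index=FEATURES)
coefs = coefs.sort_values(key=np.abs, ascending=False)
odds_ratios = np.exp(coefs)  # odds multiplier per one standard deviation

print(pd.DataFrame({"coefficient": coefs, "odds_ratio": odds_ratios}))
```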
Finding 2: BMI and Diabetes Risk
- Boxplot Analysis:
  - Individuals with diabetes had a higher median BMI than non-diabetics.
  - This is consistent with the established link between obesity and diabetes.
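The comparison behind the boxplot reduces to a per-class median. A minimal sketch with made-up BMI values (not taken from the dataset):

```python
import pandas as pd

# Made-up BMI values for illustration only.
demo = pd.DataFrame({
    "bmi":   [22.0, 27.5, 30.1, 31.0, 34.5, 38.2],
    "class": [0,    0,    0,    1,    1,    1],
})

# Median BMI per outcome class (the centre line of each box in a boxplot).
median_bmi = demo.groupby("class")["bmi"].median()
print(median_bmi)
```

With matplotlib installed, demo.boxplot(column="bmi", by="class") would draw the corresponding plot.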
Finding 3: Distribution Analysis
- Plasma Glucose:
  - Diabetics had higher glucose levels (right-shifted distribution).
- Age:
  - Diabetes was more prevalent among older individuals; most 20–30-year-olds were non-diabetic.
- Diabetes Pedigree:
  - Higher pedigree values were associated with a slightly higher diabetes risk.
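One way to quantify the age pattern is to compute the share of diabetic cases per age band. The sketch below uses made-up rows and arbitrary band edges; both are assumptions for illustration.

```python
import pandas as pd

# Made-up ages and outcomes for illustration only.
demo = pd.DataFrame({
    "age":   [21, 24, 28, 29, 33, 37, 41, 45, 52, 60],
    "class": [0,  0,  0,  1,  0,  1,  1,  0,  1,  1],
})

# Left-closed age bands: [20, 30), [30, 40), [40, 50), [50, 90).
bins = [20, 30, 40, 50, 90]
demo["age_band"] = pd.cut(demo["age"], bins=bins, right=False)

# The mean of a 0/1 outcome is the prevalence within each band.
prevalence = demo.groupby("age_band", observed=True)["class"].mean()
print(prevalence)
```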
Finding 4: Pairwise Relationships (Pairplot)
- Plasma Glucose vs. BMI:
  - Positive correlation; higher glucose and BMI were associated with a higher likelihood of diabetes.
- Age vs. BMI:
  - Older individuals had a wider BMI range and higher diabetes prevalence.
Finding 5: Blood Pressure Analysis
- Diabetic individuals had a slightly higher median diastolic BP (~75 mm Hg vs. ~70 mm Hg).
- This is consistent with the known association between diabetes and elevated blood pressure (metabolic syndrome).
Finding 6: Correlation Heatmap
- Weak to Moderate Correlations:
  - plasma_concentration and age (r = 0.26).
  - bmi and diastolic_blood_pressure (r = 0.29).
- diabetes_pedigree was largely uncorrelated with the other features.
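The values in a correlation heatmap are pairwise Pearson coefficients, which pandas computes with DataFrame.corr(). The sketch below builds synthetic columns with weak built-in relationships (not the real data) and lists the feature pairs ordered by absolute correlation.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic columns with weak built-in relationships; not the real dataset.
rng = np.random.default_rng(1)
n = 768
age = rng.normal(size=n)
bp = rng.normal(size=n)
df = pd.DataFrame({
    "age": age,
    "plasma_concentration": 0.3 * age + rng.normal(size=n),
    "diastolic_blood_pressure": bp,
    "bmi": 0.3 * bp + rng.normal(size=n),
})

corr = df.corr()  # Pearson correlation matrix (symmetric, ones on the diagonal)

# Collect each unordered feature pair once and rank by absolute correlation.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}
ranked = sorted(pairs.items(), key=lambda item: abs(item[1]), reverse=True)

for (a, b), r in ranked:
    print(f"{a} vs. {b}: r = {r:.2f}")
```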
6. Predictions on New Data
A sample prediction was made for a hypothetical individual:
- Input Features (in the feature order listed in Section 2): [6, 148, 72, 35, 0, 33.6, 0.627, 50].
- Prediction: Diabetic (Class 1).
- Probabilities:
  - No Diabetes: 16.4%.
  - Diabetes: 83.6%.
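A prediction for a new record has to pass through the same preprocessing as the training data. In the input above the serum_insulin value is 0, which Section 3 treats as missing, so the sketch below assumes it is replaced by the training median before scaling. The model is trained on synthetic data with arbitrary location and scale values, so the printed probabilities will differ from the 16.4% / 83.6% reported here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 0-based positions of the columns where a zero is treated as missing:
# plasma_concentration, diastolic_blood_pressure, triceps_skinfold_thickness,
# serum_insulin, bmi.
ZERO_AS_MISSING_IDX = [1, 2, 3, 4, 5]

# Synthetic training data. loc/scale are arbitrary illustrative values,
# not statistics of the real dataset, and the target rule is made up.
rng = np.random.default_rng(42)
loc = np.array([4.0, 120.0, 70.0, 29.0, 125.0, 32.0, 0.5, 33.0])
scale = np.array([3.0, 30.0, 12.0, 9.0, 85.0, 7.0, 0.3, 11.0])
X_train = rng.normal(loc, scale, size=(768, 8))
y_train = (X_train[:, 1] + 2.0 * X_train[:, 5] + rng.normal(0, 20, 768) > 190).astype(int)
train_medians = np.median(X_train, axis=0)

# Scaling and the classifier are chained so new inputs are scaled consistently.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42, class_weight="balanced"),
)
pipeline.fit(X_train, y_train)

# The input record from the report; serum_insulin (index 4) is 0.
sample = np.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50]], dtype=float)
sample_clean = sample.copy()
for idx in ZERO_AS_MISSING_IDX:
    if sample_clean[0, idx] == 0:
        sample_clean[0, idx] = train_medians[idx]  # impute with the training median

proba = pipeline.predict_proba(sample_clean)[0]  # [P(class 0), P(class 1)]
label = int(pipeline.predict(sample_clean)[0])
print(f"prediction: class {label}")
print(f"P(no diabetes) = {proba[0]:.1%}, P(diabetes) = {proba[1]:.1%}")
```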
7. Conclusion & Insights
- Top Predictors: Plasma glucose was the strongest predictor in both models, followed by BMI; age (logistic regression) and diabetes pedigree (random forest) ranked next.
- Model Performance: Logistic regression achieved 73.4% accuracy with a ROC AUC of 0.813.
- Clinical Relevance:
  - High glucose and BMI are critical indicators.
  - Blood pressure and genetic factors (pedigree) provide additional risk stratification.
- Future Work:
  - Experiment with other models (e.g., SVM, XGBoost).
  - Include more features (e.g., lifestyle factors).
This project demonstrates how machine learning can be applied to diabetes risk prediction and
identifies the risk factors most strongly associated with diabetes in this dataset.