
Diabetes Prediction with Logistic Regression

Name: Monika Mahawar
Roll number: b23ci1024
This project aims to build a logistic regression model to predict the onset of diabetes based on the Pima Indians Diabetes Database. The dataset contains various diagnostic measures for individuals, and the goal is to develop a model that can accurately classify whether a person has diabetes or not.
Step 1. Import necessary libraries:
- numpy for numerical operations
- pandas for data manipulation
- matplotlib for plotting
- sklearn for preprocessing and model building
- seaborn for advanced visualizations
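The import step above might look like the following (standard aliases; a sketch of a typical setup, not the report's exact cell):

```python
# Standard aliases for the libraries listed above.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix
```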
Step 2. Load the dataset:
- Read the CSV file into a pandas DataFrame.
Step 3. Check for missing values:
- Identify any missing values in the dataset.
Step 4. Normalize numerical data:
- Select numerical columns.
- Use MinMaxScaler to scale the numerical data between 0 and 1.
Step 5. Check for categorical data:
- Identify if there are any categorical columns in the dataset.
Step 6. Display basic information about the dataset:
- Use dataset.info() to get an overview of the data types and missing values.
- Use dataset.describe() to get summary statistics of the numerical features.
- Display the first few rows (head) and last few rows (tail) of the dataset.
- Print the shape of the dataset (number of rows and columns).
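Step 6's inspection calls, sketched on a small stand-in frame (the notebook runs them on the full Pima DataFrame):

```python
import pandas as pd

dataset = pd.DataFrame({
    "Glucose": [148, 85, 183, 89],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

dataset.info()             # dtypes and non-null counts
print(dataset.describe())  # summary statistics for numeric columns
print(dataset.head())      # first few rows
print(dataset.tail())      # last few rows
print(dataset.shape)       # (number of rows, number of columns)
```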
Step 7. Perform exploratory data analysis (EDA):
- Create histograms to visualize the distribution of each feature.
- Create a pairplot to visualize relationships between pairs of features.
- Calculate the correlation matrix to understand the relationships between features.
- Create a heatmap to visualize the correlation matrix.
- Create box plots to visualize the distribution of each feature and identify outliers.
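The correlation part of this EDA can be sketched with pandas alone; the plotting calls are indicated as comments. The column values below are illustrative, not the real Pima data:

```python
import pandas as pd

# Illustrative stand-in for the Pima DataFrame.
dataset = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

corr = dataset.corr()  # pairwise correlation matrix
print(corr["Outcome"].sort_values(ascending=False))

# Plotting calls of the kind used in the notebook (assume matplotlib/seaborn):
# dataset.hist(figsize=(12, 8))              # histograms per feature
# sns.pairplot(dataset, hue="Outcome")       # pairwise relationships
# sns.heatmap(corr, annot=True)              # correlation heatmap
# dataset.plot(kind="box", subplots=True)    # box plots to spot outliers
```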
Observations from the plots:
# Histograms:
- Features like 'Pregnancies' and 'Age' show skewed distributions.
- Features like 'Glucose', 'BloodPressure', and 'Insulin' have potential outliers.
# Pairplot:
- There seems to be a positive correlation between 'Glucose' and 'Outcome'. Higher
glucose levels appear to be associated with a higher probability of diabetes.
- 'BMI' and 'Age' also show some correlation with 'Outcome'.
# Correlation Heatmap:
- 'Glucose' has a strong positive correlation with 'Outcome'.
- 'BMI' and 'Age' also have moderate positive correlations with 'Outcome'.
- Some features might have moderate correlations with each other (e.g., 'BMI' and
'Insulin').
# Box Plots:
- Several features have potential outliers, which might need further investigation.
- The box plots help visualize the spread and central tendency of each feature.
# Overall Observations:
- Glucose levels seem to be a strong predictor of diabetes.
- BMI and age also play a role in the development of diabetes.
- There might be some relationships between features, which could be explored
further.
- Outliers might need to be handled carefully during model building.
The code implements a logistic regression model from scratch to predict diabetes.
# Observations from the scatter plots
The scatter plots show the relationship between two features (Glucose and
BloodPressure) before and after applying different techniques.
# Before Sigmoid:
- The initial scatter plot shows the distribution of data points with respect to the
'Outcome' variable.
- It helps visualize the overall relationship between the features and the target
variable.
# After Sigmoid:
- This plot shows the data after applying the sigmoid function to the linear model.
- The sigmoid function transforms the output into a probability between 0 and 1.
- It helps visualize the predicted probabilities for each data point.
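A minimal sigmoid, as used in the from-scratch model, can be written as:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real-valued linear score into (0, 1),
    # read as the predicted probability of diabetes.
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))  # 0.5: a score of exactly 0 means "undecided"
```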
# After Binary Cross-Entropy Loss:
- This plot displays the data after calculating the binary cross-entropy loss.
- Binary cross-entropy measures the difference between the predicted probabilities
and the actual labels.
- It helps visualize how well the model is predicting the probabilities.
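Binary cross-entropy in this setting might be implemented as follows (the `eps` clipping is a common numerical guard, an addition here, not necessarily in the original code):

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    # Mean negative log-likelihood of the labels under the
    # predicted probabilities; clipping avoids log(0).
    y_prob = np.clip(y_prob, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_prob)
                    + (1 - y_true) * np.log(1 - y_prob))
```

Maximally uncertain predictions (all 0.5) give a loss of ln 2 ≈ 0.693; confident correct predictions drive the loss toward 0.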
# After L1 Loss:
- This plot shows the data after applying L1 regularization.
- L1 regularization adds a penalty to the sum of absolute values of the weights.
- It helps prevent overfitting and improves the model's generalization ability.
# After L2 Loss:
- This plot shows the data after applying L2 regularization.
- L2 regularization adds a penalty to the sum of squared values of the weights.
- It also helps prevent overfitting and improves the model's generalization ability.
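The two penalties differ only in the norm they apply to the weight vector (`lam` is a hypothetical regularization strength):

```python
import numpy as np

def l1_penalty(w, lam):
    return lam * np.sum(np.abs(w))  # tends to push small weights to exactly zero

def l2_penalty(w, lam):
    return lam * np.sum(w ** 2)     # shrinks all weights smoothly toward zero

w = np.array([1.0, -2.0])
print(l1_penalty(w, 0.5))  # 1.5
print(l2_penalty(w, 0.5))  # 2.5
```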
# After Gradient Descent:
- This plot shows the data after applying gradient descent to optimize the model's
parameters.
- Gradient descent iteratively updates the weights and bias to minimize the loss
function.
- It helps the model learn the best fit for the data.
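Putting the pieces together, a plain batch-gradient-descent fit might look like this sketch (the learning rate, iteration count, and toy data are illustrative, not the project's values):

```python
import numpy as np

def fit_logistic(X, y, lr=0.5, n_iters=2000):
    # Minimizes binary cross-entropy with batch gradient descent.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid of the linear model
        grad_w = X.T @ (p - y) / len(y)         # dL/dw
        grad_b = np.mean(p - y)                 # dL/db
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 1-D data: the decision boundary should settle near x = 1.5.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = fit_logistic(X, y)
preds = (1.0 / (1.0 + np.exp(-(X @ w + b))) > 0.5).astype(int)
print(preds)  # [0 0 1 1]
```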
Observations and reasoning behind the ROC curve
- The model achieved a reasonable accuracy, indicating its ability to classify diabetes
cases.
- The ROC curve shows the model's performance across different thresholds.
- The AUC ROC score provides a comprehensive measure of the model's ability to
distinguish between positive and negative classes.
- Further analysis of the confusion matrix helps understand the types of errors the
model makes (false positives and false negatives).
- Fine-tuning hyperparameters and exploring different regularization techniques can
potentially improve the model's performance.
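These metrics can be computed with sklearn; the labels and predicted probabilities below are made up purely for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points along the ROC curve
auc = roc_auc_score(y_true, y_prob)               # area under it
cm = confusion_matrix(y_true, (y_prob >= 0.5).astype(int))
print(auc)  # fraction of (negative, positive) pairs ranked correctly
print(cm)   # rows: actual 0/1; columns: predicted 0/1
```

Here 8 of the 9 (negative, positive) pairs are ranked correctly, so the AUC is 8/9; the confusion matrix shows one false negative and no false positives at the 0.5 threshold.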
Observations from the loss curve
# The loss curve shows that the binary cross-entropy loss decreases rapidly in the
initial iterations and then gradually converges.
# L1 loss and L2 loss also show a similar trend, but they converge at different rates.
# This indicates that the model is learning and improving its predictions over time.
# The specific rate of convergence depends on the learning rate and the complexity
of the dataset.
Observations for the ROC curves: best features vs. complete data
- The ROC curves for the best features and for the complete data show similar performance.
- Both curves have a good AUC score, indicating good model performance.
- This suggests that the selected features ('Glucose' and 'BMI') are strong predictors of diabetes.
- The slight difference in AUC might be due to the reduced dimensionality of the best-features model.
# Observation:
# The decision boundary appears to effectively separate the data points into two
classes (likely representing individuals with and without diabetes).
# The plot shows how the model makes predictions based on the combination of
'Glucose' and 'BMI' values.
# The model seems to be more sensitive to 'Glucose' levels, as the decision
boundary is more pronounced along the 'Glucose' axis.
# This suggests that 'Glucose' plays a more significant role in predicting the outcome
compared to 'BMI' in this specific model.
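For a two-feature model, the decision boundary is the line where the sigmoid's argument is zero. The weights below are illustrative (not fitted values), chosen so that 'Glucose' dominates, mirroring the observation above:

```python
import numpy as np

# Boundary: w_g * Glucose + w_b * BMI + b = 0  (i.e. sigmoid = 0.5).
# Illustrative weights on [0, 1]-scaled features.
w_g, w_b, b = 3.0, 1.0, -2.0

def predict(glucose, bmi):
    z = w_g * glucose + w_b * bmi + b
    return int(1.0 / (1.0 + np.exp(-z)) > 0.5)

def boundary_bmi(glucose):
    # BMI value that sits exactly on the boundary for a given Glucose.
    return -(w_g * glucose + b) / w_b

print(predict(0.8, 0.2))  # high glucose wins even at low BMI -> 1
print(predict(0.2, 0.8))  # low glucose wins even at high BMI -> 0
```

Because w_g > w_b, moving along the Glucose axis changes the prediction faster than moving along the BMI axis, which is what a boundary "more pronounced along the Glucose axis" means.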