Uploaded by feruzifeye0

Introduction
This study addresses two predictive modeling tasks: predicting Walmart sales and analyzing the
Pima Indians Diabetes dataset.
Q1. Predictive Modeling For Walmart Sales.
Introduction
The input parameters for predicting Walmart sales include historical sales data and external
factors such as holidays and economic indicators. The output parameter is the predicted weekly
sales for each store. We first explored the structure and general characteristics of the dataset,
starting with its summary statistics:
Dataset Description
The output above provides summary statistics for each numerical column in the dataset.
● Count - indicates the number of non-null values in each column
● Mean - represents the average value of each column
● Standard Deviation (std) - measures the spread of the values around the mean
● Min - denotes the minimum value in each column
● 25% (25th percentile) - also called the first quartile; 25% of the values fall at or below
it
● 50% (50th percentile) - the median, that is, the middle value of the column
● 75% (75th percentile) - also called the third quartile; 75% of the values fall at or below
it
● Max - denotes the maximum value in each column
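The statistics listed above can be reproduced by hand. The following is a minimal sketch in Python using only the standard library (the appendix scripts themselves are in R); the function name `describe` and the sample values are assumptions made for illustration and are not taken from the Walmart data. Quartiles are computed with linear interpolation, which is one common convention among several.

```python
import statistics

def describe(values):
    """Summarise a numeric column, skipping missing (None) entries."""
    present = [v for v in values if v is not None]
    # "inclusive" treats the data as the full population range and
    # interpolates linearly between neighbouring values.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    return {
        "count": len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),  # sample standard deviation (n - 1)
        "min": min(present),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(present),
    }

# Hypothetical values, for illustration only; None marks a missing entry.
weekly_sales = [1.0, 2.0, 3.0, 4.0, 5.0, None]
summary = describe(weekly_sales)
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```

Note that the count excludes the missing entry, which is why it can be smaller than the number of rows.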
The output above displays the data type of each column in the dataset. Here is a brief
explanation of each data type.
● Int64 - represents integer values
● Object - represents string values or mixed data types
● Float64 - represents floating-point values
The output above indicates the number of missing values in each column of the dataset. Here is
an explanation.
● Missing Values - each column name is listed together with the count of missing values in
that column. As observed, no column has any missing values.
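The per-column missing-value count described above can be sketched as follows. This is an illustrative Python example using only the standard library (the appendix uses R's `colSums(is.na(data))` for the same purpose). The inline CSV text, the `Store` and `Temperature` column names, and the helper `count_missing` are assumptions; unlike the real dataset, the sample deliberately contains gaps so the count is not all zeros.

```python
import csv
import io

# Hypothetical CSV extract; an empty field stands in for a missing value.
raw = """Store,Weekly_Sales,Temperature
1,1000.50,42.3
1,,38.5
2,1200.75,
"""

def count_missing(rows, columns):
    """Return {column: number of empty or absent entries}."""
    counts = {column: 0 for column in columns}
    for row in rows:
        for column in columns:
            if row.get(column) in (None, ""):
                counts[column] += 1
    return counts

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)
missing = count_missing(rows, reader.fieldnames)
print(missing)
```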
The output below shows a histogram for each numerical feature in the dataset. These histograms
illustrate the distribution of each feature, which helps in understanding its central tendency
and variability. They are also useful for identifying outliers, understanding data ranges and
making informed decisions during data analysis. Here is the output:
The output below visualizes the correlations between the numerical features as a heatmap. The
heatmap offers a graphic overview of how strongly each pair of numerical features in the dataset
is related. This is the result:
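The values behind such a heatmap are pairwise Pearson correlation coefficients. The sketch below computes them in plain Python (the appendix uses R's `cor()`); the helper names and the three short columns are hypothetical and only show the mechanics.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Return a nested dict: matrix[a][b] is the correlation of a with b."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}

# Hypothetical columns, for illustration only.
columns = {
    "Weekly_Sales": [10.0, 12.0, 14.0, 16.0],
    "Temperature": [30.0, 25.0, 20.0, 15.0],  # falls as sales rise
    "Fuel_Price": [2.0, 2.5, 2.0, 2.5],
}
matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

Each coefficient lies between -1 and 1. Values near either extreme indicate a strong linear relationship, and values near 0 indicate little or none.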
Algorithm Analysis and Selection
We then analyzed the performance of the models using the following metrics:
● Mean Absolute Error (MAE) - the average absolute difference between the actual and
predicted values. A lower MAE indicates better performance.
● Mean Squared Error (MSE) - the average of the squared differences between the actual and
predicted values. A lower MSE indicates better performance.
● R-Squared - the proportion of the variance in the target variable that the model accounts
for. Values closer to 1 indicate a better fit.
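The three metrics above can be written out directly from their definitions. This is a small Python sketch, not the evaluation code used for the report; the `actual` and `predicted` lists are made-up numbers chosen so the results are easy to check by hand.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted)^2."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 11.0]

print("MAE:", mae(actual, predicted))
print("MSE:", mse(actual, predicted))
print("R-squared:", r_squared(actual, predicted))
```

Because MSE squares each error, the single error of 2 contributes four times as much as an error of 1, which is why MSE penalizes large misses more heavily than MAE does.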
Discussion and Conclusion
Above is the performance of the various algorithms. In light of these performance metrics, the
gradient boosting model appears to be the best choice for predicting Walmart sales. Here is
why:
The mean absolute error is lowest for gradient boosting, which shows that its predictions are,
on average, the closest to the actual sales figures. It also has the lowest mean squared error,
indicating that it makes fewer large errors than the other models. Its R-squared value is the
highest, indicating that it captures more of the underlying patterns in the data than the other
models do. Gradient boosting is also generally regarded as fairly robust to outliers and noisy
data, and it frequently offers consistent and dependable performance across a variety of
datasets. However, although gradient boosting is a powerful and versatile machine learning
technique, it requires careful parameter tuning, its computational cost may make it unsuitable
for real-time applications, and its performance can be affected by the quality of the data.
Based on the evaluation measures used, gradient boosting performs better than linear regression
and random forest in both accuracy and predictive power, although the latter two methods also
offer respectable performance.
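To make the idea behind gradient boosting concrete, the sketch below fits a toy booster to a single feature. Each round fits a one-split "stump" to the current residuals and adds a damped copy of it to the running prediction. This is an illustrative Python example under simplifying assumptions (squared-error loss, one feature, stumps only); it is not the `gbm` model used in the appendix, and the data, function names, number of rounds and learning rate are all hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that minimises squared error."""
    best = None
    # Exclude the largest x so the right-hand side is never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]

def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3):
    """Return the base prediction, the fitted stumps and the training MSE per round."""
    base = sum(ys) / len(ys)          # start from the mean of the target
    predictions = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return base, stumps, history

def boosted_predict(base, stumps, learning_rate, x):
    total = base
    for stump in stumps:
        total += learning_rate * stump_predict(stump, x)
    return total

# Hypothetical nonlinear data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9, 6.0, 6.1]
base, stumps, history = fit_gradient_boosting(xs, ys, n_rounds=20, learning_rate=0.3)
print("training MSE after round 1: ", round(history[0], 4))
print("training MSE after round 20:", round(history[-1], 4))
```

The training error shrinks round by round because each stump corrects part of what the previous rounds got wrong. The same mechanism explains the cost noted above: the rounds must be fitted one after another, and the number of rounds and the learning rate both need tuning.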
Gradient boosting produced the best accuracy among the evaluated models, as indicated by its
lower MAE and MSE and its higher R-squared value. The model showed promising performance in
predicting sales at Walmart stores. The random forest and linear regression algorithms still
provide useful results even though their metrics are slightly worse. One of the gradient
boosting model's strongest points is its capacity to predict sales accurately by capturing
complex relationships and nonlinearities in the data. Future work may concentrate on more
sophisticated modeling techniques, on integrating supplementary data sources, and on
establishing robust deployment and monitoring structures to maintain accuracy and relevance in
retail settings.
Q2. Pima Indians Diabetes Dataset Analysis.
Introduction
For the analysis of the Pima Indians Diabetes dataset, the input parameters consist of medical
and demographic information about individuals, such as glucose level, blood pressure, age, and
number of pregnancies. The output parameter is the prediction of gestational diabetes mellitus
in pregnant women.
Dataset Description
We first printed a few rows of the dataset to get a general overview of it.
Here is a sample of the dataset:
We then summarized the dataset's numerical properties statistically, including the mean,
standard deviation, minimum, maximum and quartiles. An overview of the statistics is provided
below:
Here are the definitions of the terms used:
● Count - indicates the number of non-null values in each column
● Mean - represents the average value of each column
● Std (standard deviation) - measures the spread of values around the mean
● Min - denotes the minimum value in each column
● 25% (25th percentile) - also referred to as the first quartile; 25% of the values fall at or
below it
● 50% (50th percentile) - the median, that is, the middle value of the column
● 75% (75th percentile) - also referred to as the third quartile; 75% of the values fall at or
below it
● Max - represents the maximum value in each column
We then checked the dataset for missing values and printed the count of missing values for each
column. Below is a screenshot of the count:
To further visualize the dataset, we created a pair plot, which is a grid of scatter plots
showing the relationships between pairs of features. We set the parameter hue=Outcome to color
the points by the outcome variable, which shows how the different features vary with the target
variable. Below is the visualization:
We also generated a heatmap of the correlation matrix of the numerical features. This helps in
understanding the linear relationships among the features and between the features and the
target variable. Below is a screenshot of this visualization:
Algorithm Analysis and Selection
We used two algorithms to build classification models: logistic regression and a random forest
classifier. Each model was trained to learn the patterns and relationships within the data.
Logistic regression turned out to be the better of the two, so it is the model we focus on. We
also printed a classification report, which includes metrics such as precision, recall, F1 score
and support, to provide insight into the model's performance.
We also computed the confusion matrix, which summarizes the performance of the classification
model by contrasting the true labels with the predicted labels. We then visualized the
confusion matrix as a heatmap to make the results easier to interpret. Below is the output:
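The confusion matrix and the classification-report metrics are closely linked: precision, recall, F1 score and support for the positive class can all be derived from the four cells of the matrix. The sketch below shows this in plain Python. The labels are invented for illustration and are not the report's predictions, and the helper names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels coded 0/1."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 0 and guess == 0:
            tn += 1
        elif truth == 0 and guess == 1:
            fp += 1
        else:
            fn += 1
    return tn, fp, fn, tp

def class_report(y_true, y_pred):
    """Metrics for the positive class (label 1), plus overall accuracy."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "precision": precision,          # of predicted positives, share correct
        "recall": recall,                # of actual positives, share found
        "f1": f1,                        # harmonic mean of the two
        "support": tp + fn,              # number of actual positives
        "accuracy": (tp + tn) / len(y_true),
    }

# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print("tn, fp, fn, tp =", confusion_matrix(y_true, y_pred))
print(class_report(y_true, y_pred))
```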
Discussion and Conclusion
We compared the performance of the two models; the output of the analysis is shown below:
Comparing the two models, logistic regression outperforms the random forest classifier in terms
of accuracy. It also has higher precision, recall and F1 score for class 1 than the random
forest classifier. The two models have almost the same ROC-AUC score, with logistic regression
being slightly higher. Based on these metrics, logistic regression appears to be the better
model for predicting the presence of gestational diabetes mellitus in pregnant women.
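The ROC-AUC score mentioned above has a simple interpretation: it is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as half. The Python sketch below computes it directly from that definition. The labels and scores are hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0      # positive ranked above negative
            elif pos == neg:
                wins += 0.5      # tie counts as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

print("ROC-AUC:", roc_auc(y_true, scores))
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so two models with nearly equal ROC-AUC rank cases about equally well even if their accuracy at a fixed 0.5 threshold differs.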
The coefficient values show the impact of each feature on the log odds of the outcome variable
(the presence of gestational diabetes). Glucose has the highest positive coefficient, indicating
that higher glucose levels are associated with an increased likelihood of gestational diabetes.
BMI also has a strong positive influence on the likelihood of gestational diabetes. Age and the
diabetes pedigree function also have positive coefficients, suggesting that older age and a
family history of diabetes increase the likelihood of gestational diabetes. Conversely, features
such as blood pressure and insulin have negative coefficients, indicating a negative
relationship with the likelihood of gestational diabetes.
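Because the coefficients act on the log odds, exponentiating a coefficient gives an odds ratio: the factor by which the odds of a positive outcome are multiplied when that feature increases by one unit (one standard deviation if the features were standardized). The Python sketch below illustrates this. The coefficient and intercept values are invented placeholders, not the fitted values from the report.

```python
import math

def sigmoid(z):
    """Map log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients on standardized features; NOT the report's results.
coefficients = {"Glucose": 1.1, "BMI": 0.7, "BloodPressure": -0.2}
intercept = -0.8

# exp(coefficient) is the multiplicative change in the odds per unit increase.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

def predict_probability(features):
    log_odds = intercept + sum(coefficients[name] * value
                               for name, value in features.items())
    return sigmoid(log_odds)

baseline = predict_probability({"Glucose": 0.0, "BMI": 0.0, "BloodPressure": 0.0})
raised = predict_probability({"Glucose": 1.0, "BMI": 0.0, "BloodPressure": 0.0})

for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
print(f"probability at baseline: {baseline:.3f}")
print(f"probability with Glucose one unit higher: {raised:.3f}")
```

An odds ratio above 1 corresponds to a positive coefficient and an odds ratio below 1 to a negative one.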
Here are the coefficient results:
Features with positive coefficients have a positive relationship with the likelihood of
gestational diabetes, while features with negative coefficients have a negative relationship.
Random forest classifiers do not have coefficients in the same sense as logistic regression.
Instead, we use feature importance values to indicate the relative importance of each feature;
a higher importance value suggests a greater influence on the model's predictions. Glucose is
identified as the most important feature by the random forest classifier, indicating that it
has strong predictive power for gestational diabetes. BMI and age also have significant
importance, suggesting that they contribute to predicting gestational diabetes. Features such as
insulin and skin thickness have relatively lower importance in the random forest model compared
to logistic regression. Below is a screenshot of the feature importances:
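Random forest importance scores of the Gini kind are built from the decrease in Gini impurity that each split achieves, accumulated over the trees. The sketch below shows that building block for a single split, in plain Python. It is a simplified illustration rather than the full random forest computation, and the feature values, thresholds and helper names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    if not labels:
        return 0.0
    p1 = sum(labels) / len(labels)
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2

def gini_decrease(values, labels, threshold):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

# Hypothetical values, for illustration only.
outcome = [0, 0, 0, 1, 1, 1]
glucose = [85, 90, 100, 140, 150, 160]   # separates the classes cleanly
skin_thickness = [20, 35, 25, 22, 33, 27]  # barely separates them

print("Glucose split at 120:       ", round(gini_decrease(glucose, outcome, 120), 4))
print("Skin thickness split at 26: ", round(gini_decrease(skin_thickness, outcome, 26), 4))
```

A feature whose splits remove more impurity, as the glucose column does here, accumulates a larger importance score.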
Appendix
model_walmartSale.R
# Step 1: Data Understanding and Preprocessing
# Load data
data <- read.csv("walmart_sale.csv")
# Data Exploration
summary(data)
str(data)
unique(data$Holiday_Flag)
colSums(is.na(data))
# Data Preprocessing
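# Convert the 'Date' column to R's Date class
# (as.Date() expects ISO dates such as 2010-02-05 by default; pass `format =` if the file differs)
data$Date <- as.Date(data$Date)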
# Feature Engineering
data$Day_of_Week <- as.numeric(format(data$Date, "%w"))
data$Month <- as.numeric(format(data$Date, "%m"))
data$Year <- as.numeric(format(data$Date, "%Y"))
# Step 2: Feature Selection
# Select relevant features
# cor() returns a matrix, so the Weekly_Sales column is selected by name rather than with `$`
correlation_matrix <- cor(data[,-c(1, 2)])
relevant_features <- names(sort(correlation_matrix[, "Weekly_Sales"], decreasing = TRUE)[-1])
# Step 3: Model Selection
# Split data into features (X) and target variable (y)
X <- data[,relevant_features]
y <- data$Weekly_Sales
# Split data into training and testing sets
set.seed(42)
sample <- sample.int(n = nrow(data), size = floor(.8*nrow(data)), replace = FALSE)
X_train <- X[sample, ]
X_test <- X[-sample, ]
y_train <- y[sample]
y_test <- y[-sample]
# Model Selection and Evaluation
# Model Building
library(caret)
# Pipeline for preprocessing and model building
pipeline_lr <- trainControl(method = "cv", number = 5)
pipeline_rf <- trainControl(method = "cv", number = 5)
pipeline_gb <- trainControl(method = "cv", number = 5)
# Fit models
set.seed(42)
model_lr <- train(Weekly_Sales ~ ., data = data.frame(Weekly_Sales = y_train, X_train),
                  method = "lm", trControl = pipeline_lr)
model_rf <- train(Weekly_Sales ~ ., data = data.frame(Weekly_Sales = y_train, X_train),
                  method = "rf", trControl = pipeline_rf)
model_gb <- train(Weekly_Sales ~ ., data = data.frame(Weekly_Sales = y_train, X_train),
                  method = "gbm", trControl = pipeline_gb)
# Model Evaluation
pred_lr <- predict(model_lr, newdata = X_test)
pred_rf <- predict(model_rf, newdata = X_test)
pred_gb <- predict(model_gb, newdata = X_test)
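# Evaluation Metrics
# Helper that prints MAE, MSE and R-squared (the listing calls it but did not define it)
printMetrics <- function(actual, predicted) {
  mae <- mean(abs(actual - predicted))
  mse <- mean((actual - predicted)^2)
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  cat("MAE:", mae, "\nMSE:", mse, "\nR-squared:", r2, "\n")
}
cat("\nLinear Regression:\n")
printMetrics(y_test, pred_lr)
cat("\nRandom Forest:\n")
printMetrics(y_test, pred_rf)
cat("\nGradient Boosting:\n")
printMetrics(y_test, pred_gb)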
# Feature Importances (for Random Forest model)
feature_importances <- varImp(model_rf)
importance_df <- data.frame(Feature = rownames(feature_importances$importance),
                            Importance = feature_importances$importance$Overall)
importance_df <- importance_df[order(-importance_df$Importance), ]
# Visualization of Feature Importances
library(ggplot2)
ggplot(importance_df, aes(x = Importance, y = Feature)) +
geom_bar(stat = "identity") +
ggtitle("Feature Importances") +
xlab("Importance") +
ylab("Feature") +
theme_minimal()
model_pimaDiabetes.R
library(tidyverse)
library(caret)
library(randomForest)
library(ROCR)
library(corrplot)  # needed for corrplot() below
# Load the dataset
diabetes_data <- read.csv('pima_diabetes.csv')
# Display the first few rows of the dataset
print(head(diabetes_data))
# Get summary statistics
print(summary(diabetes_data))
# Check for missing values
print(colSums(is.na(diabetes_data)))
# Visualize distributions and relationships
# Outcome is coded 0/1; add 1 so that both classes map to visible palette colours
pairs(diabetes_data[, -9], col = diabetes_data$Outcome + 1)
# Visualize the correlation matrix
correlation_matrix <- cor(diabetes_data[, -9])
corrplot(correlation_matrix, method = 'color', type = 'upper', order = 'hclust',
tl.col = 'black', tl.srt = 45)
# Split features and target variable
X <- diabetes_data[, -9]
y <- diabetes_data$Outcome
# Split data into training and testing sets
set.seed(42)
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]
# Standardize features
scaler <- preProcess(X_train, method = c('center', 'scale'))
X_train_scaled <- predict(scaler, X_train)
X_test_scaled <- predict(scaler, X_test)
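# Train the logistic regression model on the scaled training set only,
# so that the test rows stay unseen and match the scaling used for prediction
train_scaled <- data.frame(X_train_scaled, Outcome = y_train)
model <- glm(Outcome ~ ., data = train_scaled, family = binomial)
summary(model)
# Predictions on the scaled test set
y_pred <- ifelse(predict(model, newdata = X_test_scaled, type = 'response') > 0.5, 1, 0)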
# Evaluate model
accuracy <- mean(y_pred == y_test)
print(paste("Accuracy:", accuracy))
# Classification report
print(confusionMatrix(factor(y_pred), factor(y_test)))
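# Train Random Forest Classifier on the training set only
# (Outcome is converted to a factor so randomForest performs classification, not regression)
train_rf <- data.frame(X_train, Outcome = factor(y_train))
rf_model <- randomForest(Outcome ~ ., data = train_rf, importance = TRUE)
# Predictions
rf_y_pred <- predict(rf_model, newdata = X_test)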
# Evaluate Random Forest Classifier
rf_accuracy <- mean(rf_y_pred == y_test)
print(paste("Random Forest Classifier Accuracy:", rf_accuracy))
print(confusionMatrix(factor(rf_y_pred), factor(y_test)))
# Interpret Random Forest Feature Importance
random_forest_feature_importance <- data.frame(Feature = names(X_train), Importance =
rf_model$importance[,"MeanDecreaseGini"]) %>%
arrange(desc(Importance))
print("Random Forest Feature Importance:")
print(random_forest_feature_importance)