Introduction

This study addresses two predictive modeling tasks: predicting Walmart weekly sales and analyzing the Pima Indians Diabetes dataset.

Q1. Predictive Modeling for Walmart Sales

Introduction

The input parameters for predicting Walmart sales include historical sales data and external factors such as holidays and economic indicators. The output parameter is the predicted weekly sales for each store. We first explored the structure and general characteristics of the dataset.

Dataset Description

The output above provides summary statistics for each numerical column in the dataset.
● Count - the number of non-null values in each column.
● Mean - the average value of each column.
● Standard deviation (std) - a measure of how widely the values are spread around the mean.
● Min - the minimum value in each column.
● 25% (25th percentile) - also called the first quartile; 25% of the values fall at or below it.
● 50% (50th percentile) - the median, i.e. the middle value of the column.
● 75% (75th percentile) - also called the third quartile; 75% of the values fall at or below it.
● Max - the maximum value in each column.

The output above displays the data type of each column in the dataset. Here is a brief explanation of each data type.
● Int64 - integer values.
● Object - string values or mixed data types.
● Float64 - floating-point values.

The output above indicates the number of missing values in each column of the dataset.
● Missing Values - each column name is listed together with the count of missing values in that column.
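The three summaries described above (descriptive statistics, column types and missing-value counts) can be sketched in R, the language of the appendix scripts. This is a minimal illustration on a small made-up data frame, not the report's actual output: the data frame `toy`, its values and the helper `describe_column` are hypothetical. Note that R reports column types as integer, numeric and character rather than Int64, Float64 and Object.

```r
# Toy data frame standing in for the sales data (all values are made up)
toy <- data.frame(
  Store = c(1L, 1L, 2L, 2L),
  Weekly_Sales = c(1500.5, 1620.0, 980.25, NA),
  Holiday_Flag = c(0L, 1L, 0L, 0L)
)

# Hypothetical helper: count of non-missing values, mean, standard deviation
# and the minimum, quartiles and maximum of one numeric column
describe_column <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE, names = FALSE)
  c(count = sum(!is.na(x)),
    mean = mean(x, na.rm = TRUE),
    std = sd(x, na.rm = TRUE),
    min = q[1], "25%" = q[2], "50%" = q[3], "75%" = q[4], max = q[5])
}

stats <- sapply(toy, describe_column)   # one column of statistics per variable
col_types <- sapply(toy, class)         # data type of each column
missing_counts <- colSums(is.na(toy))   # number of missing values per column

print(round(stats, 2))
print(col_types)
print(missing_counts)
```

In this toy frame Weekly_Sales deliberately contains one missing value, so its count is lower than that of the other columns.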
As observed, no column has missing values.

The output below is a set of histograms showing the distribution of the numerical features in the dataset. These histograms help in understanding the central tendency and variability of each feature. They are also useful for identifying outliers, understanding data ranges and making informed decisions during data analysis. Here is the output:

The output below visualizes the correlations between the numerical features using a heatmap, which gives a graphical overview of how the features relate to one another. This is the result:

Algorithm Analysis and Selection

We then analyzed the performance of the models using the following metrics.
● Mean Absolute Error (MAE) - the average absolute difference between the actual and predicted values. A lower MAE indicates better performance.
● Mean Squared Error (MSE) - the mean of the squared differences between the predicted and actual values. A lower MSE indicates better performance.
● R-Squared - the proportion of the variance in the target variable that the model accounts for. Values closer to 1 indicate a better fit.

Discussion and Conclusion

Above is the performance of the various algorithms. Based on these metrics, the gradient boosting model appears to be the best choice for predicting Walmart sales, for the following reasons. Gradient boosting has the lowest mean absolute error, which shows that its predictions are, on average, the closest to the actual sales figures. It also has the lowest mean squared error, indicating that it makes fewer large errors than the other models. Finally, its R-squared value indicates that it captures the underlying patterns in the data better than the other models.
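For reference, the three metrics used in this comparison can be written out as follows. These are the standard textbook definitions, which the report does not state explicitly; the symbols are introduced here for illustration: $y_i$ is the actual weekly sales value for test observation $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the actual values, and $n$ is the number of test observations.

```latex
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
```

Because MSE squares each error, a few large misses raise it much more than they raise MAE, which is why the two metrics are reported side by side.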
Gradient boosting is often regarded as resilient to outliers and noisy data, and it frequently offers consistent and dependable performance across a variety of datasets. Although it is a powerful and versatile machine learning technique, it requires careful parameter tuning, and its computational cost may make it unsuitable for real-time applications. Its performance can also be affected by the quality of the data.

Based on the evaluation measures used, gradient boosting performs better than linear regression and random forest in both accuracy and predictive power, although the latter two methods also offer respectable performance. Gradient boosting produced the best accuracy among the evaluated models, as indicated by its lower MAE and MSE and its higher R-squared value. The suggested model showed promising performance in predicting sales at Walmart stores. The random forest and linear regression models still provide useful insight, even though their metrics are slightly worse. One of the model's strongest points is its capacity to predict sales accurately by capturing complex relationships and nonlinearities in the data. Subsequent investigations may concentrate on more sophisticated modeling techniques, integrating supplementary data sources, and establishing resilient deployment and monitoring structures to maintain precision and relevance in retail settings.

Q2. Pima Indians Diabetes Dataset Analysis

Introduction

For the analysis of the Pima Indians diabetes dataset, the input parameters consist of medical and demographic information about individuals, such as glucose levels, blood pressure, age, and pregnancy status.
The output parameter is the prediction of gestational diabetes mellitus in pregnant women.

Dataset Description

We first printed a few rows of the dataset to get an overview of its contents. Here is a sample of the dataset:

The numerical properties of the dataset, including the mean, standard deviation, minimum, maximum and quartiles, are then summarized statistically. An overview of the statistics is provided below:

Here are the definitions of the terms used:
● Count - the number of non-null values in each column.
● Mean - the average value of each column.
● Std (standard deviation) - a measure of the spread of values around the mean.
● Min - the minimum value in each column.
● 25% (25th percentile) - also referred to as the first quartile; 25% of the values fall at or below it.
● 50% (50th percentile) - the middle value, or median, of the column.
● 75% (75th percentile) - also referred to as the third quartile; 75% of the values fall at or below it.
● Max - the maximum value in each column.

We then checked for missing values and printed the count of missing values for each column. Below is a screenshot of the counts:

To visualize the dataset further, we created a pair plot, which is a grid of scatter plots showing the relationships between pairs of features. We set the parameter hue=Outcome to color the points by the outcome variable, which makes it possible to see how the different features vary with the target variable. Below is the visualization:

We also generated a heatmap of the correlation matrix of the numerical features. This helps in understanding the linear relationships between the features and the target variable. Below is a screenshot of this visualization:

Algorithm Analysis and Selection

We used two algorithms to build a classification model: logistic regression and a random forest classifier.
We trained the model using logistic regression, which turned out to be the better of the two, to learn patterns and relationships within the data. We also printed a classification report, which includes metrics such as precision, recall, F1 score and support, to provide insight into the model's performance. In addition, we computed the confusion matrix, which summarizes the performance of the classification model by contrasting the true labels with the predicted labels. We then visualized the confusion matrix as a heatmap to make the results easier to interpret. Below is the output:

Discussion and Conclusion

We compared the performance of the two models, and the output of the analysis is shown below:

The logistic regression model outperforms the random forest classifier in terms of accuracy. It also has higher precision, recall and F1 score for class 1 than the random forest classifier. Both models have almost the same ROC-AUC score, with logistic regression being slightly higher. Based on these metrics, logistic regression appears to be the better model, as it has higher accuracy and better precision, recall and F1 score for predicting the presence of gestational diabetes mellitus in pregnant women.

The coefficient values show the impact of each feature on the log odds of the outcome (presence of gestational diabetes). Glucose has the highest positive coefficient, indicating that higher glucose levels are associated with an increased likelihood of gestational diabetes. BMI also has a strong positive influence on the likelihood of gestational diabetes. Age and the diabetes pedigree function also have positive coefficients, suggesting that older age and a family history of diabetes increase the likelihood of gestational diabetes. Conversely, features such as blood pressure and insulin have negative coefficients, indicating a negative relationship with the likelihood of gestational diabetes.
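To make the "impact on the log odds" reading concrete, the sketch below fits a logistic regression in R on simulated data and converts each coefficient to an odds ratio with exp(). It is only an illustration under stated assumptions: the predictors `glucose` and `bmi` are simulated standardized values, the true coefficients (1.2 and 0.6) are chosen arbitrarily, and none of the numbers come from the Pima dataset or from the report's fitted model.

```r
# Simulated data: hypothetical standardized predictors, not the Pima data
set.seed(1)
n <- 500
glucose <- rnorm(n)
bmi <- rnorm(n)

# Assumed true relationship on the log-odds scale (arbitrary illustrative values)
log_odds <- -0.5 + 1.2 * glucose + 0.6 * bmi
outcome <- rbinom(n, size = 1, prob = plogis(log_odds))

# Fit the logistic regression and extract the coefficients (log-odds scale)
fit <- glm(outcome ~ glucose + bmi, family = binomial)
coefs <- coef(fit)

# exp(coefficient) is the multiplicative change in the odds of the outcome
# for a one-unit increase in that predictor, holding the others fixed
odds_ratios <- exp(coefs)

print(round(cbind(coefficient = coefs, odds_ratio = odds_ratios), 3))
```

A positive coefficient corresponds to an odds ratio above 1 (higher odds of the outcome), and a negative coefficient to an odds ratio below 1. Because the features are standardized before fitting, a "one-unit increase" here means one standard deviation.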
Here are the results of the coefficients:

Features with positive coefficients have a positive relationship with the likelihood of gestational diabetes, while features with negative coefficients have a negative relationship.

Random forest classifiers do not have coefficients in the same sense as logistic regression. Instead, we use feature importance values to indicate the relative importance of each feature. A higher importance value suggests a greater influence on the model's prediction. Glucose is identified as the most important feature by the random forest classifier, indicating that it has strong predictive power for gestational diabetes. BMI and age also have significant importance, suggesting that they contribute to predicting gestational diabetes. Features such as insulin and skin thickness have relatively lower importance in the random forest model compared to logistic regression. Below is a screenshot of the same:

Appendix

model_walmartSale.R

library(caret)
library(ggplot2)

# Step 1: Data Understanding and Preprocessing
# Load data
data <- read.csv("walmart_sale.csv")

# Data Exploration
summary(data)
str(data)
unique(data$Holiday_Flag)
colSums(is.na(data))

# Data Preprocessing
# Convert 'Date' column to Date class. as.Date() expects ISO dates (yyyy-mm-dd) by default;
# if the file stores dates differently (e.g. dd-mm-yyyy), pass format = "%d-%m-%Y".
data$Date <- as.Date(data$Date)

# Feature Engineering
data$Day_of_Week <- as.numeric(format(data$Date, "%w"))
data$Month <- as.numeric(format(data$Date, "%m"))
data$Year <- as.numeric(format(data$Date, "%Y"))

# Step 2: Feature Selection
# Correlate all columns except the first two with each other
correlation_matrix <- cor(data[, -c(1, 2)])
# cor() returns a matrix, so the Weekly_Sales column is indexed by name rather than with `$`
sales_correlations <- sort(correlation_matrix[, "Weekly_Sales"], decreasing = TRUE)
# Drop the first entry, which is Weekly_Sales itself (correlation of 1)
relevant_features <- names(sales_correlations)[-1]

# Step 3: Model Selection
# Split data into features (X) and target variable (y)
X <- data[, relevant_features]
y <- data$Weekly_Sales

# Split data into training (80%) and testing (20%) sets
set.seed(42)
train_idx <- sample.int(n = nrow(data), size = floor(0.8 * nrow(data)), replace = FALSE)
X_train <- X[train_idx, ]
X_test <- X[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]

# Model Selection and Evaluation
# Assumed definition of printMetrics (not shown in the report): prints MAE, MSE and R-squared
printMetrics <- function(actual, predicted) {
  mae <- mean(abs(actual - predicted))
  mse <- mean((actual - predicted)^2)
  r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
  cat("MAE:", mae, "\nMSE:", mse, "\nR-squared:", r2, "\n")
}

# Model Building
# 5-fold cross-validation, shared by all three models
cv_control <- trainControl(method = "cv", number = 5)
train_data <- data.frame(Weekly_Sales = y_train, X_train)

# Fit models
set.seed(42)
model_lr <- train(Weekly_Sales ~ ., data = train_data, method = "lm", trControl = cv_control)
model_rf <- train(Weekly_Sales ~ ., data = train_data, method = "rf", trControl = cv_control)
model_gb <- train(Weekly_Sales ~ ., data = train_data, method = "gbm", trControl = cv_control)

# Model Evaluation on the held-out test set
pred_lr <- predict(model_lr, newdata = X_test)
pred_rf <- predict(model_rf, newdata = X_test)
pred_gb <- predict(model_gb, newdata = X_test)

# Evaluation Metrics
cat("\nLinear Regression:\n")
printMetrics(y_test, pred_lr)
cat("\nRandom Forest:\n")
printMetrics(y_test, pred_rf)
cat("\nGradient Boosting:\n")
printMetrics(y_test, pred_gb)

# Feature Importances (for Random Forest model)
feature_importances <- varImp(model_rf)
importance_df <- data.frame(Feature = rownames(feature_importances$importance),
                            Importance = feature_importances$importance$Overall)
importance_df <- importance_df[order(-importance_df$Importance), ]

# Visualization of Feature Importances
ggplot(importance_df, aes(x = Importance, y = Feature)) +
  geom_bar(stat = "identity") +
  ggtitle("Feature Importances") +
  xlab("Importance") +
  ylab("Feature") +
  theme_minimal()

model_pimaDiabetes.R

library(tidyverse)
library(caret)
library(randomForest)
library(corrplot)  # needed for corrplot() below
library(ROCR)

# Load the dataset
diabetes_data <- read.csv('pima_diabetes.csv')

# Display the first few rows of the dataset
print(head(diabetes_data))

# Get summary statistics
print(summary(diabetes_data))

# Check for missing values
print(colSums(is.na(diabetes_data)))

# Visualize distributions and relationships
# Colour index 0 is the background colour, so shift the 0/1 outcome to colours 1 and 2
pairs(diabetes_data[, -9], col = diabetes_data$Outcome + 1)

# Visualize the correlation matrix
correlation_matrix <- cor(diabetes_data[, -9])
corrplot(correlation_matrix, method = 'color', type = 'upper', order = 'hclust',
         tl.col = 'black', tl.srt = 45)

# Split features and target variable
X <- diabetes_data[, -9]
# Treat the outcome as a factor so that both models perform classification
y <- factor(diabetes_data$Outcome)

# Split data into training and testing sets
set.seed(42)
train_indices <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]

# Standardize features using statistics from the training set only
scaler <- preProcess(X_train, method = c('center', 'scale'))
X_train_scaled <- predict(scaler, X_train)
X_test_scaled <- predict(scaler, X_test)

# Train the logistic regression model on the scaled training data only,
# so that the test rows are not seen during fitting
train_scaled <- data.frame(X_train_scaled, Outcome = y_train)
model <- glm(Outcome ~ ., data = train_scaled, family = binomial)
summary(model)

# Predictions: classify as 1 when the predicted probability exceeds 0.5
prob <- predict(model, newdata = X_test_scaled, type = 'response')
y_pred <- factor(ifelse(prob > 0.5, 1, 0), levels = levels(y_test))

# Evaluate model
accuracy <- mean(y_pred == y_test)
print(paste("Accuracy:", accuracy))

# Classification report (class 1 is treated as the positive class)
print(confusionMatrix(y_pred, y_test, positive = "1"))

# Train Random Forest Classifier on the training data only
train_rf <- data.frame(X_train, Outcome = y_train)
rf_model <- randomForest(Outcome ~ ., data = train_rf, importance = TRUE)

# Predictions
rf_y_pred <- predict(rf_model, newdata = X_test)

# Evaluate Random Forest Classifier
rf_accuracy <- mean(rf_y_pred == y_test)
print(paste("Random Forest Classifier Accuracy:", rf_accuracy))
print(confusionMatrix(rf_y_pred, y_test, positive = "1"))

# Interpret Random Forest Feature Importance
random_forest_feature_importance <- data.frame(
  Feature = names(X_train),
  Importance = rf_model$importance[, "MeanDecreaseGini"]
) %>%
  arrange(desc(Importance))
print("Random Forest Feature Importance:")
print(random_forest_feature_importance)