Dayananda Sagar University
School of Engineering
Department of CSE (Data Science)
Fundamentals of Data Science Lab Manual
Course code: 21DS2402

1. Write a program to import the iris dataset and display its head, tail, summary, interquartile range and structure.
a. Using Python, display head, tail, summary, info and interquartile range
b. Using R, display head, tail, summary, structure and interquartile range

a. Using Python, display head, tail, summary, info and interquartile range

import pandas as pd

df = pd.read_csv(r'F:\ML\Iris.csv')
print(df.head())
print(df.tail())
df.info()                 # structure / data types of the columns
print(df.shape)
# Quartiles and interquartile range (IQR = Q3 - Q1)
q1 = df.quantile(q=0.25, numeric_only=True)
q3 = df.quantile(q=0.75, numeric_only=True)
print(q1)
print(q3)
print(q3 - q1)
print(df.describe())

b. Using R, display head, tail, summary, structure and interquartile range

library(tidyverse)
df <- iris
print("Head")
print("--------------------")
print(head(df))
print("-----------------------")
print("Tail")
print(tail(df))
str(df)                              # structure of the data
print(summary(df))                   # summary of the data
print("----------------------------------------------------")
print(quantile(df$Sepal.Length))     # quartiles of Sepal.Length
print(IQR(df$Sepal.Length))          # interquartile range of Sepal.Length

# Subsets of the data
df <- iris
print(summary(df))
print(df[1:3])
print(subset(df, Species == "versicolor"))
print(subset(df, Sepal.Length == 6.2))
print(aggregate(df$Sepal.Width, list(df$Species), FUN = mean))

Output:

2. Write a program for data cleaning and finding missing values in the iris dataset.
a. Using Python, perform data cleaning, find missing values and drop unused attributes
b. Using R, perform data cleaning, find missing values and drop unused attributes

a. Using Python, perform data cleaning, find missing values and drop unused attributes

import pandas as pd

df = pd.read_csv(r'F:\ML\Iris.csv')
print(df.isnull())
print(df.isnull().sum())
df.drop(columns="Id", inplace=True)   # drop the unused Id column
print(df.isnull().sum())
print(df.duplicated())
df_filled = df.fillna(0)   # fillna() returns a new DataFrame; assign it to keep the result
df_clean = df.dropna()     # dropna() likewise returns a new DataFrame
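The fillna(0) call above simply replaces every missing value with zero, which distorts the measurements. As an optional Python counterpart to the per-species median imputation carried out in the R part below, the following sketch fills each missing value with the median of the same column within the same species; it assumes the Kaggle column names SepalLengthCm, SepalWidthCm, PetalLengthCm and PetalWidthCm.

import pandas as pd

df = pd.read_csv(r'F:\ML\Iris.csv')
df.drop(columns="Id", inplace=True)

# Assumed Kaggle column names; adjust if your CSV differs
numeric_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# Replace each missing value with the median of the same column within the same species
for col in numeric_cols:
    df[col] = df.groupby("Species")[col].transform(lambda s: s.fillna(s.median()))

print(df.isnull().sum())   # should now report zero missing values per column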
b. Using R, perform data cleaning, find missing values and drop unused attributes

data("iris")
str(iris)

# First we create a copy of our dataset
iris_copy <- iris

# We introduce several missing values in some columns
iris_copy$Sepal.Length[c(15, 20, 50, 67, 97, 118)] <- NA
iris_copy$Sepal.Width[c(4, 80, 97, 106)] <- NA
iris_copy$Petal.Length[c(5, 17, 35, 49)] <- NA

# Now we see that there are missing values in some columns
summary(iris_copy)
length(which(is.na(iris_copy)))
# We can check that we introduced 14 missing values in the table

# First we will use the complete.cases function (see ?complete.cases for details).
# It returns only the rows without NAs; putting ! in front of it we get only the rows with NAs.
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# We see that we have 13 rows with missing values in them

# Another way is to search for TRUE values in the is.na function
iris_NA <- iris_copy[rowSums(is.na(iris_copy)) > 0, ]
iris_NA

# The first choice can be to just remove the rows containing NAs
complete.cases(iris_copy)
iris_clean <- iris_copy[complete.cases(iris_copy), ]
length(which(is.na(iris_clean)))

# Alternatively, impute each missing value with the median of its own species
iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "setosa"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 3 NAs. Only 11 left

iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "versicolor"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "versicolor")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 2 NAs. Only 9 left

iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "virginica"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "virginica")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 1 NA. Only 8 left

iris_copy[is.na(iris_copy$Sepal.Width) & (iris_copy$Species == "setosa"), "Sepal.Width"] <-
  median(iris_copy$Sepal.Width[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 1 NA. Only 7 left

iris_copy[is.na(iris_copy$Petal.Length) & (iris_copy$Species == "setosa"), "Petal.Length"] <-
  median(iris_copy$Petal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA

Output:
(The rows of iris_copy that still contain NA values, printed after each imputation step.)

3. Write a program for exploratory data analysis (EDA) on a dataset to identify patterns, trends, outliers and relationships between variables, and visualize the iris dataset using a box plot, scatter plot and histogram.
a. Using Python, perform EDA on the iris dataset and visualize it using the above techniques
b. Using R, perform EDA on the iris dataset and visualize it using the above techniques

a. Using Python, perform EDA on the iris dataset and visualize it using the above techniques

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r'F:\ML\Iris.csv')

# Group-wise summary statistics
print(df.groupby('Species').agg(['mean', 'median']))

# Distribution of petal width (histogram)
sns.histplot(df['PetalWidthCm'], bins=40, color='b')
plt.title('Petal width distribution plot')
plt.show()

# Count of observations of each species
sns.countplot(x='Species', data=df)
plt.show()

# Pairwise relationships between variables
sns.pairplot(df, hue='Species')
plt.show()

# Scatter plot comparing the species on sepal length and width
plt.figure(figsize=(17, 9))
plt.title("Comparison between various species based on sepal length and width")
sns.scatterplot(data=df, x='SepalLengthCm', y='SepalWidthCm', hue='Species', s=50)
plt.show()
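The seaborn calls above cover the histogram and scatter plot that the task asks for, but not the box plot. A minimal sketch of that missing piece, assuming the same CSV and Kaggle column names used above:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r'F:\ML\Iris.csv')

# Box plot of sepal length per species: shows medians, quartiles and outliers
sns.boxplot(x='Species', y='SepalLengthCm', data=df)
plt.title('Sepal length by species')
plt.show()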
b. Using R, perform EDA on the iris dataset and visualize it using the above techniques

data(iris)
head(iris, 4)
plot(iris)

sepal_length <- iris$Sepal.Length
hist(sepal_length)
hist(sepal_length, main="Histogram of Sepal Length", xlab="Sepal Length",
     xlim=c(4,8), col="blue", freq=FALSE)

sepal_width <- iris$Sepal.Width
hist(sepal_width, main="Histogram of Sepal Width", xlab="Sepal Width",
     xlim=c(2,5), col="darkorchid", freq=FALSE)

irisVer <- subset(iris, Species == "versicolor")
irisSet <- subset(iris, Species == "setosa")
irisVir <- subset(iris, Species == "virginica")

par(mfrow=c(1,3), mar=c(6,3,2,1))
boxplot(irisVer[,1:4], main="Versicolor, Rainbow Palette", ylim=c(0,8), las=2, col=rainbow(4))
boxplot(irisSet[,1:4], main="Setosa, Heat color Palette", ylim=c(0,8), las=2, col=heat.colors(4))
boxplot(irisVir[,1:4], main="Virginica, Topo colors Palette", ylim=c(0,8), las=2, col=topo.colors(4))

Output:

4. Write a program to plot the correlation matrix and covariance for the iris dataset.
a. Using R, plot the correlation matrix between sepal width and sepal length for the iris dataset
b. Using Python, plot the correlation matrix for the mtcars dataset

a. Using R, plot the correlation matrix between sepal width and sepal length for the iris dataset

library(datasets)
data(iris)
summary(iris)

# Correlation plot
plot(iris)

# Correlation matrix
print(cor(iris[, 1:4]))

# Scatter plot of sepal width against sepal length
plot(iris$Sepal.Width, iris$Sepal.Length)

b. Using Python, plot the correlation matrix for the mtcars dataset

import pandas as pd

df = pd.read_csv("path-to-the-file/mtcars.csv")
df_num = df.select_dtypes(include="number")   # keep only the numeric columns

# Correlation matrix (Pearson)
print(df_num.corr(method="pearson"))

# Covariance matrix
print(df_num.cov())

Output:

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
       Species
 setosa    :50
 versicolor:50
 virginica :50

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
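The Python part of program 4 only prints the correlation matrix. If a plot is wanted as well, one rough sketch (assuming the same mtcars.csv file) is to render the matrix as an annotated seaborn heatmap:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("path-to-the-file/mtcars.csv")
corr = df.select_dtypes(include="number").corr(method="pearson")

# Heatmap of the correlation matrix with the coefficients written in each cell
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("mtcars correlation matrix")
plt.show()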
5. Write a program for dimensionality reduction using principal component analysis (PCA) on the iris dataset.
a. Using R, implement PCA
b. Using Python, implement PCA

a. Using R, implement PCA

library(datasets)
data("iris")
str(iris)

library(caret)
set.seed(100)
ind <- createDataPartition(iris$Species, p = 0.80, list = FALSE)
train <- iris[ind, ]
test <- iris[-ind, ]
dim(train)
dim(test)

library(psych)
pairs.panels(train[,-5], gap=0, bg=c("red","blue","yellow")[train$Species], pch=21)

pc <- prcomp(train[,-5], center = TRUE, scale. = TRUE)
pc
summary(pc)
pairs.panels(pc$x, gap=0, bg=c("red","blue","yellow")[train$Species], pch=21)

b. Using Python, implement PCA

from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
# %matplotlib inline
%matplotlib notebook

iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='FlowerType')

plt.figure(2, figsize=(8, 6))
plt.clf()
# Plot the training points
plt.scatter(X['sepal length (cm)'], X['sepal width (cm)'], s=35, c=y, cmap=plt.cm.brg)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Sepal length vs. Sepal width')
plt.show()

# Fit PCA once, inspect it, then project the data onto the first three components
pca_iris = PCA(n_components=3).fit(iris.data)
print(pca_iris.explained_variance_ratio_)
print(pca_iris.components_)
iris_reduced = pca_iris.transform(iris.data)

fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
ax.view_init(elev=-150, azim=110)
for k in range(3):
    ax.scatter(iris_reduced[y == k, 0], iris_reduced[y == k, 1], iris_reduced[y == k, 2],
               label=iris.target_names[k])
ax.set_title("First three P.C.")
ax.set_xlabel("P.C. 1")
ax.set_ylabel("P.C. 2")
ax.set_zlabel("P.C. 3")
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
plt.legend(numpoints=1)
plt.show()

Output:

Standard deviations (1, .., p=4):
[1] 1.7161062 0.9378761 0.3941760 0.1413965

Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5201525 -0.37173825 -0.7181697 -0.2747443
Sepal.Width  -0.2864515 -0.92354821  0.2239707  0.1218250
Petal.Length  0.5777030 -0.04107397  0.1319176  0.8044687
Petal.Width   0.5600413 -0.08474854  0.6454976 -0.5123517

> summary(pc)
Importance of components:
                          PC1    PC2     PC3    PC4
Standard deviation     1.7161 0.9379 0.39418 0.1414
Proportion of Variance 0.7363 0.2199 0.03884 0.0050
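A common follow-up to PCA, before fixing n_components, is to check how much variance the leading components retain. A short sketch using the same scikit-learn objects as part (b):

import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

iris = datasets.load_iris()
pca = PCA().fit(iris.data)   # keep all four components

# Cumulative share of variance explained by the first k components
cumulative = np.cumsum(pca.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"first {k} component(s): {share:.3f} of the variance")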
6. Implement a program to select the required features from the iris dataset using the chi-square test.
a. Using Python, implement the chi-square test
b. Using R, implement the chi-square test

a. Using Python, implement the chi-square test

# Load packages
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# Load iris data
iris_dataset = load_iris()

# Create features and target
X = iris_dataset.data
y = iris_dataset.target

# Convert to categorical data by converting the data to integers
X = X.astype(int)

# The two features with the highest chi-squared statistics are selected
chi2_features = SelectKBest(chi2, k=2)
X_kbest_features = chi2_features.fit_transform(X, y)

# Reduced features
print('Original feature number:', X.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])

b. Using R, implement the chi-square test

dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length), "small", "big")
table(dat$Species, dat$size)

library(ggplot2)
ggplot(dat) +
  aes(x = Species, fill = size) +
  geom_bar() +
  scale_fill_hue() +
  theme_minimal()

test <- chisq.test(table(dat$Species, dat$size))
test
test$statistic   # test statistic
## X-squared
##  86.03451
test$p.value     # p-value
## [1] 2.078944e-19

# Second method: summary of the contingency table
summary(table(dat$Species, dat$size))
## Number of cases in table: 150
## Number of factors: 2
## Test for independence of all factors:
##   Chisq = 86.03, df = 2, p-value = 2.079e-19

# Third method:
library(vcd)
assocstats(table(dat$Species, dat$size))

test$observed    # observed counts
test$expected    # expected counts under independence

Output:
Original feature number: 4
Reduced feature number: 2

X-squared
 86.03451
[1] 2.078944e-19
Number of cases in table: 150
Number of factors: 2
Test for independence of all factors:
  Chisq = 86.03, df = 2, p-value = 2.079e-19

7. Build a Python program for the iris dataset using different classification algorithms.
a. Use a Support Vector Machine (SVM) to build a model
b. Use K-Nearest Neighbours (KNN) to build a model
c. Use a decision tree classifier to build a model
d. Use a Random Forest (RF) classifier to build a model

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv(r'F:\ML\Iris.csv')
data_points = data.iloc[:, 1:5]
labels = data.iloc[:, 5]

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(data_points, labels, test_size=0.2)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

Standard_obj = StandardScaler()
Standard_obj.fit(x_train)
x_train_std = Standard_obj.transform(x_train)
x_test_std = Standard_obj.transform(x_test)

a. Use a Support Vector Machine (SVM) to build a model

from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=0, gamma=0.10, C=1.0)
svm.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(svm.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(svm.score(x_test_std, y_test)*100))
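cross_val_score is imported in the shared preprocessing block above but never used. As an optional check, it can give a less split-dependent accuracy estimate for the SVM on the same standardized training data (five folds assumed here):

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 5-fold cross-validated accuracy on the standardized training data
svm_cv = SVC(kernel='rbf', random_state=0, gamma=0.10, C=1.0)
scores = cross_val_score(svm_cv, x_train_std, y_train, cv=5)
print('Cross-validated accuracy {:.2f} (+/- {:.2f})'.format(scores.mean()*100, scores.std()*100))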
b. Use K-Nearest Neighbours (KNN) to build a model

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=7, p=2, metric='minkowski')
knn.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(knn.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(knn.score(x_test_std, y_test)*100))

c. Use a decision tree classifier to build a model

from sklearn import tree
decision_tree = tree.DecisionTreeClassifier(criterion='gini')
decision_tree.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(decision_tree.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(decision_tree.score(x_test_std, y_test)*100))

d. Use a Random Forest (RF) classifier to build a model

from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
random_forest.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(random_forest.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(random_forest.score(x_test_std, y_test)*100))

Output:
Training data accuracy 95.83
Testing data accuracy 100.00
Training data accuracy 96.67
Testing data accuracy 100.00
Training data accuracy 100.00
Testing data accuracy 96.67
Training data accuracy 100.00
Testing data accuracy 96.67

8a) Write a Python program for simple linear regression.

import numpy as np
import matplotlib.pyplot as mtplt

def estimate_coeff(p, q):
    # Total number of observations
    n1 = np.size(p)
    # Means of the p and q vectors
    m_p = np.mean(p)
    m_q = np.mean(q)
    # Cross-deviation and deviation about p
    SS_pq = np.sum(q * p) - n1 * m_q * m_p
    SS_pp = np.sum(p * p) - n1 * m_p * m_p
    # Regression coefficients
    b_1 = SS_pq / SS_pp
    b_0 = m_q - b_1 * m_p
    return (b_0, b_1)

def plot_regression_line(p, q, b):
    # Plot the actual observations as a scatter plot
    mtplt.scatter(p, q, color="m", marker="o", s=30)
    # Predicted response vector
    q_pred = b[0] + b[1] * p
    # Plot the regression line
    mtplt.plot(p, q_pred, color="g")
    # Axis labels
    mtplt.xlabel('p')
    mtplt.ylabel('q')
    # Show the plot
    mtplt.show()

def main():
    # Observation points
    p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])
    # Estimate the coefficients
    b = estimate_coeff(p, q)
    print("Estimated coefficients are:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # Plot the regression line
    plot_regression_line(p, q, b)

if __name__ == "__main__":
    main()

8b) Write a Python program for multiple linear regression.

import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)

x, y = generate_dataset(200)

mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 1], x[:, 2], y, label='regression', s=5)
ax.legend()
ax.view_init(45, 0)
plt.show()
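The 8b program above only generates and plots the synthetic data; it never estimates the regression coefficients. A minimal sketch that fits them with np.linalg.lstsq, reusing the x and y arrays returned by generate_dataset (note that x already carries a leading column of ones for the intercept):

import numpy as np

# x has columns [1, x1, x2], so lstsq returns [intercept, b1, b2]
coeffs, residuals, rank, sv = np.linalg.lstsq(x, y, rcond=None)
print("Estimated coefficients (intercept, b1, b2):", coeffs)

# Predicted values and a simple goodness-of-fit check
y_pred = x @ coeffs
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("R^2 =", 1 - ss_res / ss_tot)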
9a) Write a Python program for one-way ANOVA.

Install the SciPy module:
pip3 install scipy

Step 1: Create the data groups.
The very first step is to create four arrays that hold the performance of the cars when each of the four engine oils is applied.

# Performance when each of the engine oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

Step 2: Conduct the one-way ANOVA.
Python provides the f_oneway() function from the SciPy library, with which we can conduct the one-way ANOVA.

# Importing library
from scipy.stats import f_oneway

# Performance when each of the engine oils is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]

# Conduct the one-way ANOVA
print(f_oneway(performance1, performance2, performance3, performance4))

Step 3: Analyse the result.
The F statistic and p-value turn out to be 4.625 and 0.016336498 respectively. Since the p-value is less than 0.05, we reject the null hypothesis. This implies that we have sufficient evidence to say that there is a difference in performance among the four engine oils.

9b) Write a Python program for two-way ANOVA.

Step 1: Import the libraries.

# Importing libraries
import numpy as np
import pandas as pd

Step 2: Enter the data.
Let us create a pandas DataFrame that consists of the following three variables:
- Fertilizer: how frequently each plant was fertilized, that is, daily or weekly.
- Watering: how frequently each plant was watered, that is, daily or weekly.
- height: the height of each plant (in inches) after six months.

# Importing libraries
import numpy as np
import pandas as pd

# Create a dataframe
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11, 14, 15,
                                     16, 16, 17, 18, 14, 13, 14, 14, 14, 15,
                                     16, 16, 17, 18, 14, 13, 14, 14, 14, 15]})

Step 3: Conduct the two-way ANOVA.
To perform the two-way ANOVA, the statsmodels library provides the anova_lm() function. Its syntax is given below.

Syntax: sm.stats.anova_lm(model, typ=2)
Parameters:
- model: the fitted model statistics
- typ: the type of ANOVA test to perform, that is, 1, 2 or 3 (I, II or III)

# Importing libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
sm.stats.anova_lm(model, typ=2)

Step 4: Combining all the steps.

# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a dataframe
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.repeat(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11, 14, 15,
                                     16, 16, 17, 18, 14, 13, 14, 14, 14, 15,
                                     16, 16, 17, 18, 14, 13, 14, 14, 14, 15]})

# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
result = sm.stats.anova_lm(model, typ=2)

# Print the result
print(result)

Interpreting the result. The p-values for each of the factors in the output are:
- Fertilizer: p-value equal to 0.913305
- Watering: p-value equal to 0.990865
- Fertilizer * Watering (interaction): p-value equal to 0.904053
The p-values for Fertilizer and Watering are both greater than 0.05, which implies that neither factor on its own has a statistically significant effect on plant height.
The p-value for the interaction effect (0.904053) is also greater than 0.05, which indicates that there is no significant interaction effect between fertilizer frequency and watering frequency.

10. Write a Python program on K-Means clustering.

Identification of customers based on their choices is an important strategy in any organization. This identification may help in approaching customers with specific offers. An organization with a large number of customers may find it difficult to identify and keep a record of each customer individually. A large amount of data processing and automated techniques are involved in extracting insights from the information collected on customers. Clustering methods can help to group customers based on their key characteristics.

Dataset: https://www.kaggle.com/datasets/shwetabh123/mall-customers

The first step is importing the required libraries.

import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')

dataset = pd.read_csv("Mall_Customers.csv", sep=",")
dataset.head()
dataset.info()

data_x = dataset.iloc[:, 3:5]
data_x.head()

x_array = np.array(data_x)
print(x_array)

scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
x_scaled

Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)

plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()

After finding the optimal number of clusters, fit the K-Means clustering model to the dataset and then predict the cluster for each data element.

numerics = dataset[['Annual Income (k$)', 'Spending Score (1-100)']]

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
for i in numerics:
    scaler.fit(dataset[[i]])
    dataset[i] = scaler.transform(dataset[[i]])

km = KMeans(n_clusters=5)
y_predicted = km.fit_predict(dataset[['Annual Income (k$)', 'Spending Score (1-100)']])
y_predicted

After we get the cluster of each data point, we add a new column named "Cluster".

dataset["Cluster"] = y_predicted
dataset.head(10)

The last step is visualizing the result of the clustering. For a better representation, we give each of the clusters a unique colour and name.

plt.figure(figsize=(12, 8))
df1 = dataset[dataset.Cluster == 0]
df2 = dataset[dataset.Cluster == 1]
df3 = dataset[dataset.Cluster == 2]
df4 = dataset[dataset.Cluster == 3]
df5 = dataset[dataset.Cluster == 4]
plt.scatter(df1['Annual Income (k$)'], df1['Spending Score (1-100)'], color='green', label='Target Group')
plt.scatter(df2['Annual Income (k$)'], df2['Spending Score (1-100)'], color='magenta', label='Sensible')
plt.scatter(df3['Annual Income (k$)'], df3['Spending Score (1-100)'], color='orange', label='Careless')
plt.scatter(df4['Annual Income (k$)'], df4['Spending Score (1-100)'], color='red', label='Careful')
plt.scatter(df5['Annual Income (k$)'], df5['Spending Score (1-100)'], color='blue', label='Standard')
plt.title('Clustering Result', fontweight='bold', fontsize=20)
plt.xlabel('Annual Income (k$)', fontsize=15)
plt.ylabel('Spending Score (1-100)', fontsize=15)
plt.legend(fontsize=15)
plt.grid(True)
plt.show()
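KElbowVisualizer is imported at the top of this program but never used. As an optional alternative to the hand-written elbow loop, yellowbrick can draw the same elbow curve directly; a sketch assuming the x_scaled array from above:

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Fit KMeans for k = 1..14, plot distortion against k and mark the elbow
visualizer = KElbowVisualizer(KMeans(), k=(1, 15))
visualizer.fit(x_scaled)
visualizer.show()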
11. Write a Python program on association rule mining.

Apriori algorithm: in this algorithm, groups of products that frequently occur together are found first, and then strong relationships between these products and other products are sought.

Dataset: https://www.kaggle.com/datasets/shazadudwadia/supermarket

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

df = pd.read_csv('GroceryStoreDataSet.csv', names=['products'], sep=',')
df.head()
df.shape

data = list(df["products"].apply(lambda x: x.split(",")))
data

# Let's transform the list with one-hot encoding
from mlxtend.preprocessing import TransactionEncoder
a = TransactionEncoder()
a_data = a.fit(data).transform(data)
df = pd.DataFrame(a_data, columns=a.columns_)
df = df.replace(False, 0)
df

# Set a threshold for the support value and calculate the support of the itemsets
df = apriori(df, min_support=0.2, use_colnames=True)
df
# Processing 42 combinations | Sampling itemset size 3

# Let's view the interpretation values using the association_rules function
df_ar = association_rules(df, metric="confidence", min_threshold=0.6)
df_ar

For example, if we examine the rule at index 1:
- The probability of seeing sugar sales (support of sugar) is 30%.
- The probability of seeing bread sales (support of bread) is 65%.
- The support of both of them together is 20%.
- 67% of the customers who buy sugar buy bread as well (confidence).
- The lift is 1.05, so customers who buy sugar are about 5% more likely to buy bread than customers in general.

Visualizing the results

1. Support vs Confidence

plt.scatter(df_ar['support'], df_ar['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()

2. Support vs Lift

plt.scatter(df_ar['support'], df_ar['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()

3. Lift vs Confidence

fit = np.polyfit(df_ar['lift'], df_ar['confidence'], 1)
fit_fn = np.poly1d(fit)
plt.plot(df_ar['lift'], df_ar['confidence'], 'yo',
         df_ar['lift'], fit_fn(df_ar['lift']))
plt.xlabel('lift')
plt.ylabel('confidence')
plt.title('Lift vs Confidence')
plt.show()
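Rather than reading the whole df_ar table, the strongest rules can be pulled out by sorting the DataFrame returned by association_rules; a small illustrative sketch:

# Show the five rules with the highest lift, strongest first
top_rules = df_ar.sort_values('lift', ascending=False).head(5)
print(top_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])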