Dayananda Sagar University
School of Engineering
Department of CSE (Data Science)
Fundamentals of Data Science Lab Manual
Course code: 21DS2402
1. Write a program to import the iris dataset and display its head, tail, summary, interquartile range and structure.
a. Using Python, display head, tail, summary, info and interquartile range
b. Using R, display head, tail, summary, structure and interquartile range
a. Using Python, display head, tail, summary, info and interquartile range
import pandas as pd
df = pd.read_csv(r'F:\ML\Iris.csv')   # raw string so the backslashes are not treated as escapes
df.head()
df.tail()
df.shape
print(df.quantile(q=0.5, numeric_only=True))    # median (Q2)
print(df.quantile(q=0.75, numeric_only=True))   # third quartile (Q3)
print(df.describe())
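The task also asks for the interquartile range, which the code above does not compute directly; a minimal sketch, assuming the Kaggle Iris.csv column layout used above:
# Interquartile range (IQR) = Q3 - Q1, computed per numeric column
q1 = df.quantile(q=0.25, numeric_only=True)
q3 = df.quantile(q=0.75, numeric_only=True)
print(q3 - q1)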
b. Using R, display head, tail, summary, structure and interquartile range
library(tidyverse)
df <- iris
print("Head")
print("--------------------")
print(head(df))
print('-----------------------')
print("Tail")
print(tail(df))
print(str(df)) # structure of data
print(summary(df)) # summary of data
print("----------------------------------------------------")
print(quantile(df$Sepal.Length)) # quartiles of Sepal.Length (iris has no mpg column)
df <- iris
# Subset of data
print(summary(df))
print(df[1:3])
print(subset(df,Species == "versicolor"))
print(subset(df,Sepal.Length == 6.2))
print(aggregate(df$Sepal.Width, list(df$Species), FUN = mean)) # mean Sepal.Width per species
Output:
2. Write a program for data cleaning and find missing values in the iris dataset.
a. Using Python, perform data cleaning, find missing values and drop unused attributes
b. Using R, perform data cleaning, find missing values and drop unused attributes
a. Using Python, perform data cleaning, find missing values and drop unused attributes
import pandas as pd
df = pd.read_csv(r'F:\ML\Iris.csv')
df.isnull()                           # boolean mask of missing values
df.isnull().sum()                     # missing-value count per column
df.drop(columns="Id", inplace=True)   # drop the unused Id attribute
df.isnull().sum()
df.duplicated()                       # flag duplicate rows
df.fillna(0)                          # returns a copy with missing values replaced by 0
df.dropna()                           # returns a copy with rows containing missing values dropped
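The R section below fills missing values with the per-species median; a comparable Python sketch, assuming it is applied to a fresh copy of the data (before the fillna/dropna calls above) that still has its Species column:
imputed = pd.read_csv(r'F:\ML\Iris.csv')              # fresh copy of the data
num_cols = imputed.select_dtypes('number').columns
# Fill each numeric column's NaNs with the median of that column within the same species
imputed[num_cols] = imputed.groupby('Species')[num_cols].transform(lambda s: s.fillna(s.median()))
print(imputed.isnull().sum())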
b. Using R, perform data cleaning, find missing values and drop unused attributes
data("iris")
str(iris)
# First we create a copy of our dataset
iris_copy <- iris
# We introduce several missing values in some columns
iris_copy$Sepal.Length[c(15, 20, 50, 67, 97, 118)] <- NA
iris_copy$Sepal.Width[c(4, 80, 97, 106)] <- NA
iris_copy$Petal.Length[c(5, 17, 35, 49)] <- NA
# Now we see that there are missing values in some columns
summary(iris_copy)
length(which(is.na(iris_copy)))
# We can check that we introduced 14 missing values in the table
# First we will use the complete.cases function (check ?complete.cases for information)
# This function returns TRUE for rows without NA's. Putting ! in front of it selects only the rows with NA's
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# We see that we have 13 rows with missing values in them
# Another way is to search for TRUE values in the is.na function
iris_NA <- iris_copy[rowSums(is.na(iris_copy)) > 0, ]
iris_NA
# The first choice can be to just remove the rows containing NA's
complete.cases(iris_copy)
iris_clean <- iris_copy[complete.cases(iris_copy), ]
length(which(is.na(iris_clean)))
iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "setosa"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 3 NA's. Only 11 left
iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "versicolor"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "versicolor")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 2 NA's. Only 9 left
iris_copy[is.na(iris_copy$Sepal.Length) & (iris_copy$Species == "virginica"), "Sepal.Length"] <-
  median(iris_copy$Sepal.Length[which(iris_copy$Species == "virginica")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 1 NA's. Only 8 left
iris_copy[is.na(iris_copy$Sepal.Width) & (iris_copy$Species == "setosa"), "Sepal.Width"] <-
  median(iris_copy$Sepal.Width[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
# Now we have removed 1 NA's. Only 7 left
iris_copy[is.na(iris_copy$Petal.Length) & (iris_copy$Species == "setosa"), "Petal.Length"] <-
  median(iris_copy$Petal.Length[which(iris_copy$Species == "setosa")], na.rm = TRUE)
iris_NA <- iris_copy[!complete.cases(iris_copy), ]
iris_NA
Output:
(abridged) The intermediate listings of iris_NA show the rows of iris_copy that still contain missing values after each imputation step. After imputing the per-species medians, only three rows remain, each with a missing Sepal.Width:
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
80          5.70          NA          3.5         1.0 versicolor
97          5.95          NA          4.2         1.3 versicolor
106         7.60          NA          6.6         2.1  virginica
3. Write a program for exploratory data analysis (EDA) on a dataset to identify patterns, trends, outliers and relationships between variables, and visualize the iris dataset using a box plot, scatter plot and histogram.
a. Using Python, perform EDA on the iris dataset and visualize it using the above techniques.
b. Using R, perform EDA on the iris dataset and visualize it using the above techniques.
a. Using Python, perform EDA on the iris dataset and visualize it using the above techniques.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv(r'F:\ML\Iris.csv')
df.groupby('Species').agg(['mean', 'median'])    # passing a list of recognized strings
df.groupby('Species').agg([np.mean, np.median])  # passing a list of NumPy functions
## Distribution of petal width
sns.histplot(df['PetalWidthCm'], bins=40, kde=True, color='b')  # distplot is deprecated in recent seaborn
plt.title('Petal width distribution plot')
## Count of observations of each species
sns.countplot(x='Species', data=df)
sns.pairplot(df, hue='Species')
plt.figure(figsize=(17, 9))
plt.title("Comparison between various species based on sepal length and width")
sns.scatterplot(data=df, x='SepalLengthCm', y='SepalWidthCm', hue='Species', s=50)
plt.show()
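The task also calls for a box plot and a histogram, which the Python section above does not draw; a minimal sketch, assuming the Kaggle column names used above:
# Box plot of sepal length grouped by species, plus a histogram of sepal length
sns.boxplot(data=df, x='Species', y='SepalLengthCm')
plt.title('Sepal length by species')
plt.show()
df['SepalLengthCm'].plot(kind='hist', bins=20, title='Histogram of sepal length')
plt.xlabel('Sepal length (cm)')
plt.show()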
b. Using R, perform EDA on the iris dataset and visualize it using the above techniques.
data(iris)
head(iris, 4)
plot(iris)
sepal_length <- iris$Sepal.Length
hist(sepal_length)
hist(sepal_length, main="Histogram of Sepal Length", xlab="Sepal Length", xlim=c(4,8),
col="blue", freq=FALSE)
sepal_width <- iris$Sepal.Width
hist(sepal_width, main="Histogram of Sepal Width", xlab="Sepal Width", xlim=c(2,5),
col="darkorchid", freq=FALSE)
irisVer <- subset(iris, Species == "versicolor")
irisSet <- subset(iris, Species == "setosa")
irisVir <- subset(iris, Species == "virginica")
par(mfrow=c(1,3),mar=c(6,3,2,1))
boxplot(irisVer[,1:4], main="Versicolor, Rainbow Palette",ylim = c(0,8),las=2, col=rainbow(4))
boxplot(irisSet[,1:4], main="Setosa, Heat color Palette",ylim = c(0,8),las=2, col=heat.colors(4))
boxplot(irisVir[,1:4], main="Virginica, Topo colors Palette",ylim = c(0,8),las=2, col=topo.colors(4))
Output:
4. Write a program to plot the correlation matrix and covariance for the iris dataset.
a. Using R, plot the correlation matrix for the iris dataset, including sepal width against sepal length.
b. Using Python, plot the correlation matrix for the mtcars dataset.
a. Using R, plot the correlation matrix for the iris dataset, including sepal width against sepal length.
library(datasets)
data(iris)
summary(iris)
# Correlation Plot
plot(iris)
# Correlation Matrix
print(cor(iris[, 1:4]))
# Diff Correlations
plot(iris$Sepal.Width, iris$Sepal.Length)
b. Using Python, plot the correlation matrix for the mtcars dataset.
import pandas as pd
df = pd.read_csv("path-to-the-file/mtcars.csv")
num = df.select_dtypes('number')   # drop the car-name column before computing statistics
# Correlation
print(num.corr(method="pearson"))
# Covariance
print(num.cov())
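The task asks to plot the correlation matrix, while the code above only prints it; a minimal sketch using seaborn, assuming the same mtcars.csv file:
import seaborn as sns
import matplotlib.pyplot as plt
# Heatmap of the pairwise Pearson correlations between the numeric columns
corr = df.select_dtypes('number').corr(method="pearson")
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("mtcars correlation matrix")
plt.show()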
Output:
 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
5. Write a program for dimensionality reduction using principal component analysis (PCA) for the iris dataset.
a. Using R, implement PCA
b. Using Python, implement PCA
a. Using R, implement PCA
library(datasets)
data("iris")
str(iris)
library(caret)
set.seed(100)
ind <- createDataPartition(iris$Species,p=0.80,list = F)
train <- iris[ind,]
test <- iris[-ind,]
dim(train)
dim(test)
library(psych)
pairs.panels(train[,-5],gap=0,bg=c("red","blue","yellow")[train$Species],
pch=21)
pc <- prcomp(train[,-5],center = T,scale. = T)
pc
summary(pc)
pairs.panels(pc$x,gap=0,bg=c("red","blue","yellow")[train$Species],
pch = 21)
b. Using Python, implement PCA
from sklearn import datasets
from sklearn.decomposition import PCA
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
# %matplotlib inline
%matplotlib notebook
iris = datasets.load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='FlowerType')
plt.figure(2, figsize=(8, 6))
plt.clf()
# Plot the training points
plt.scatter(X['sepal length (cm)'], X['sepal width (cm)'], s=35, c=y, cmap=plt.cm.brg)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.title('Sepal length vs. Sepal width')
plt.show()
pca_iris = PCA(n_components=3).fit(iris.data)
pca_iris.explained_variance_ratio_
pca_iris.transform(iris.data)
iris_reduced = PCA(n_components=3).fit(iris.data)
iris_reduced.components_
iris_reduced = PCA(n_components=3).fit_transform(iris.data)
fig = plt.figure(1, figsize=(8, 6))
ax = fig.add_subplot(projection='3d')   # Axes3D(fig, ...) is deprecated in recent matplotlib
ax.view_init(elev=-150, azim=110)
X_reduced = PCA(n_components=3).fit_transform(iris.data)   # same projection as iris_reduced above
ax.scatter(iris_reduced[:, 0], iris_reduced[:, 1], iris_reduced[:, 2],
           cmap=plt.cm.Paired, c=iris.target)
for k in range(3):
    ax.scatter(iris_reduced[y == k, 0],
               iris_reduced[y == k, 1],
               iris_reduced[y == k, 2],
               label=iris.target_names[k])
ax.set_title("First three P.C.")
ax.set_xlabel("P.C. 1")
ax.xaxis.set_ticklabels([])   # w_xaxis is deprecated in recent matplotlib
ax.set_ylabel("P.C. 2")
ax.yaxis.set_ticklabels([])
ax.set_zlabel("P.C. 3")
ax.zaxis.set_ticklabels([])
plt.legend(numpoints=1)
plt.show()
Output:
Standard deviations (1, .., p=4):
[1] 1.7161062 0.9378761 0.3941760 0.1413965
Rotation (n x k) = (4 x 4):
                    PC1         PC2        PC3        PC4
Sepal.Length  0.5201525 -0.37173825 -0.7181697 -0.2747443
Sepal.Width  -0.2864515 -0.92354821  0.2239707  0.1218250
Petal.Length  0.5777030 -0.04107397  0.1319176  0.8044687
Petal.Width   0.5600413 -0.08474854  0.6454976 -0.5123517
> summary(pc)
Importance of components:
PC1 PC2 PC3 PC4
Standard deviation 1.7161 0.9379 0.39418 0.1414
Proportion of Variance 0.7363 0.2199 0.03884 0.0050
6. Implement a program to select the required features from the iris dataset using the chi-square test.
a. Use Python code to implement the chi-square test
b. Use R code to implement the chi-square test
a. Use Python code to implement the chi-square test
# Load packages
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Load iris data
iris_dataset = load_iris()
# Create features and target
X = iris_dataset.data
y = iris_dataset.target
# Convert to categorical data by converting data to integers
X = X.astype(int)
# Two features with highest chi-squared statistics are selected
chi2_features = SelectKBest(chi2, k = 2)
X_kbest_features = chi2_features.fit_transform(X, y)
# Reduced features
print('Original feature number:', X.shape[1])
print('Reduced feature number:', X_kbest_features.shape[1])
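To see which two features the selector kept, the chi-square scores can be inspected; a short sketch continuing from the fitted chi2_features object above:
# Chi-square score per feature; higher scores indicate stronger dependence on the target
for name, score in zip(iris_dataset.feature_names, chi2_features.scores_):
    print(f'{name}: {score:.2f}')
# Names of the selected features
selected = chi2_features.get_support(indices=True)
print('Selected:', [iris_dataset.feature_names[i] for i in selected])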
b. Use R code to implement the chi-square test
dat <- iris
dat$size <- ifelse(dat$Sepal.Length < median(dat$Sepal.Length),
"small", "big"
)
table(dat$Species, dat$size)
library(ggplot2)
ggplot(dat) +
aes(x = Species, fill = size) +
geom_bar() +
scale_fill_hue() +
theme_minimal()
test <- chisq.test(table(dat$Species, dat$size))
test
test$statistic # test statistic
## X-squared
## 86.03451
test$p.value # p-value
## [1] 2.078944e-19
summary(table(dat$Species, dat$size))
## Number of cases in table: 150
## Number of factors: 2
## Test for independence of all factors:
## Chisq = 86.03, df = 2, p-value = 2.079e-19
# third method:
library(vcd)
assocstats(table(dat$Species, dat$size))
test$observed # observed counts
test$expected # expected counts under independence
Output:
Original feature number: 4
Reduced feature number: 2
X-squared
86.03451
p-value
[1] 2.078944e-19
Number of cases in table: 150
Number of factors: 2
Test for independence of all factors:
Chisq = 86.03, df = 2, p-value = 2.079e-19
7. Build a Python program for the iris dataset using different classification algorithms.
a. Use Support Vector Machine (SVM) to build a model
b. Use K-Nearest Neighbour (KNN) to build a model
c. Use Decision Tree classifier to build a model
d. Use Random Forest (RF) classifier to build a model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_csv(r'F:\ML\Iris.csv')
data_points = data.iloc[:, 1:5]
labels = data.iloc[:, 5]
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data_points,labels,test_size=0.2)
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
Standard_obj = StandardScaler()
Standard_obj.fit(x_train)
x_train_std = Standard_obj.transform(x_train)
x_test_std = Standard_obj.transform(x_test)
a. Use Support Vector Machine (SVM) to build a model
from sklearn.svm import SVC
svm = SVC(kernel='rbf', random_state=0, gamma=.10, C=1.0)
svm.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(svm.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(svm.score(x_test_std, y_test)*100))
b. Use K-Nearest Neighbour (KNN) to build a model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7, p = 2, metric='minkowski')
knn.fit(x_train_std,y_train)
print('Training data accuracy {:.2f}'.format(knn.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(knn.score(x_test_std, y_test)*100))
c. Use Decision Tree classifier to build a model
from sklearn import tree
decision_tree = tree.DecisionTreeClassifier(criterion='gini')
decision_tree.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(decision_tree.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(decision_tree.score(x_test_std, y_test)*100))
d. Use Random Forest (RF) classifier to build a model.
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
random_forest.fit(x_train_std, y_train)
print('Training data accuracy {:.2f}'.format(random_forest.score(x_train_std, y_train)*100))
print('Testing data accuracy {:.2f}'.format(random_forest.score(x_test_std, y_test)*100))
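cross_val_score is imported above but never used; a short sketch of how it could give a more robust accuracy estimate for any of these models (shown here for the KNN classifier):
# 5-fold cross-validation accuracy; the same call works for svm, decision_tree and random_forest
scores = cross_val_score(knn, Standard_obj.transform(data_points), labels, cv=5)
print('Cross-validated accuracy {:.2f} +/- {:.2f}'.format(scores.mean()*100, scores.std()*100))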
Output:
Training data accuracy 95.83
Testing data accuracy 100.00
Training data accuracy 96.67
Testing data accuracy 100.00
Training data accuracy 100.00
Testing data accuracy 96.67
Training data accuracy 100.00
Testing data accuracy 96.67
8a. Write a Python program for Simple Linear Regression
import numpy as np
import matplotlib.pyplot as mtplt

def estimate_coeff(p, q):
    # Here, we will estimate the total number of points or observations
    n1 = np.size(p)
    # Now, we will calculate the means of the p and q vectors
    m_p = np.mean(p)
    m_q = np.mean(q)
    # Here, we will calculate the cross deviation and deviation about p
    SS_pq = np.sum(q * p) - n1 * m_q * m_p
    SS_pp = np.sum(p * p) - n1 * m_p * m_p
    # Here, we will calculate the regression coefficients
    b_1 = SS_pq / SS_pp
    b_0 = m_q - b_1 * m_p
    return (b_0, b_1)

def plot_regression_line(p, q, b):
    # Now, we will plot the actual points or observations as a scatter plot
    mtplt.scatter(p, q, color="m", marker="o", s=30)
    # Here, we will calculate the predicted response vector
    q_pred = b[0] + b[1] * p
    # Here, we will plot the regression line
    mtplt.plot(p, q_pred, color="g")
    # Here, we will put the labels
    mtplt.xlabel('p')
    mtplt.ylabel('q')
    # Here, we will show the plot
    mtplt.show()

def main():
    # Entering the observation points or data
    p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
    q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])
    # Now, we will estimate the coefficients
    b = estimate_coeff(p, q)
    print("Estimated coefficients are :\nb_0 = {} \nb_1 = {}".format(b[0], b[1]))
    # Now, we will plot the regression line
    plot_regression_line(p, q, b)

if __name__ == "__main__":
    main()
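As a sanity check on the hand-computed coefficients, the same fit can be obtained with np.polyfit; a minimal sketch using the same p and q arrays:
# np.polyfit returns the slope and intercept of the least-squares line
p = np.array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])
q = np.array([11, 13, 12, 15, 17, 18, 18, 19, 20, 22])
slope, intercept = np.polyfit(p, q, deg=1)
print("b_1 =", slope, "b_0 =", intercept)   # should match estimate_coeff(p, q)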
8b) Write a Python program for Multiple Linear Regression
import numpy as np
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
def generate_dataset(n):
    x = []
    y = []
    random_x1 = np.random.rand()
    random_x2 = np.random.rand()
    for i in range(n):
        x1 = i
        x2 = i/2 + np.random.rand()*n
        x.append([1, x1, x2])
        y.append(random_x1 * x1 + random_x2 * x2 + 1)
    return np.array(x), np.array(y)
x, y = generate_dataset(200)
mpl.rcParams['legend.fontsize'] = 12
fig = plt.figure()
ax = fig.add_subplot(projection ='3d')
ax.scatter(x[:, 1], x[:, 2], y, label ='regression', s = 5)
ax.legend()
ax.view_init(45, 0)
plt.show()
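The code above only generates and plots the data; a minimal sketch of the actual multiple-regression fit using least squares via np.linalg.lstsq (x already contains a column of ones for the intercept):
# Solve for the coefficient vector beta in y = X @ beta
beta, residuals, rank, sv = np.linalg.lstsq(x, y, rcond=None)
print("Intercept:", beta[0])
print("Coefficients for x1, x2:", beta[1], beta[2])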
9a) Write a Python program for One-Way ANOVA
Install the scipy Module
pip3 install scipy
Step 1: Creating data groups.
The very first step is to create four arrays that hold the performance of the cars recorded when each of the four engine oils is applied.
# Performance when each of the engine
# oil is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
Step 2: Conduct the one-way ANOVA:
SciPy provides the f_oneway() function, which we can use to conduct the one-way ANOVA.
# Importing library
from scipy.stats import f_oneway
# Performance when each of the engine
# oil is applied
performance1 = [89, 89, 88, 78, 79]
performance2 = [93, 92, 94, 89, 88]
performance3 = [89, 88, 89, 93, 90]
performance4 = [81, 78, 81, 92, 82]
# Conduct the one-way ANOVA
f_oneway(performance1, performance2, performance3, performance4)
Step 3: Analyse the result:
The F statistic and p-value turn out to be 4.625 and 0.016336498 respectively. Since the p-value is less than 0.05, we reject the null hypothesis. This implies that we have sufficient evidence to say that there is a difference in performance among the four engine oils.
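The decision rule in Step 3 can also be coded directly; a short sketch continuing from the f_oneway call above:
# Capture the F statistic and p-value, then compare against a 0.05 significance level
result = f_oneway(performance1, performance2, performance3, performance4)
print("F =", result.statistic, "p =", result.pvalue)
if result.pvalue < 0.05:
    print("Reject the null hypothesis: at least one engine oil differs in mean performance")
else:
    print("Fail to reject the null hypothesis")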
9b) Write a Python program for Two-Way ANOVA
Step 1: Import libraries.
# Importing libraries
import numpy as np
import pandas as pd
Step 2: Enter the data.
Let us create a pandas DataFrame that consists of the following three variables:
· fertilizer: how frequently each plant was fertilized, that is, daily or weekly.
· watering: how frequently each plant was watered, that is, daily or weekly.
· height: the height of each plant (in inches) after six months.
# Importing libraries
import numpy as np
import pandas as pd
# Create a dataframe
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
'Watering': np.repeat(['daily', 'weekly'], 15),
'height': [14, 16, 15, 15, 16, 13, 12, 11, 14,15, 16, 16, 17, 18, 14, 13, 14, 14,14, 15, 16, 16, 17, 18,
14, 13, 14,14, 14, 15]})
Step 3: Conduct the two-way ANOVA:
To perform the two-way ANOVA, the Statsmodels library provides us with anova_lm() function.
The syntax of the function is given below,
Syntax:
sm.stats.anova_lm(model, typ=2)
Parameters:
· model: It represents the fitted model statistics
· typ: It represents the type of ANOVA test to perform, that is { I or II or III or 1 or 2 or 3 }
# Importing libraries
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
sm.stats.anova_lm(model, typ=2)
Step 4: Combining all the steps.
# Importing libraries
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create a dataframe
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),'Watering':
np.repeat(['daily', 'weekly'], 15),'height': [14, 16, 15, 15, 16, 13, 12, 11,14, 15, 16, 16, 17, 18, 14,
13,14, 14, 14, 15, 16, 16, 17, 18,14, 13, 14, 14, 14, 15]})
# Performing two-way ANOVA
model = ols('height ~ C(Fertilizer) + C(Watering) +\
C(Fertilizer):C(Watering)',
data=dataframe).fit()
result = sm.stats.anova_lm(model, typ=2)
# Print the result
print(result)
Interpreting the result:
Following are the p-values for each of the factors in the output:
· The Fertilizer p-value is equal to 0.913305
· The Watering p-value is equal to 0.990865
· The Fertilizer * Watering interaction p-value is equal to 0.904053
The p-values for Fertilizer and Watering are both greater than 0.05, which implies that neither factor has a statistically significant effect on plant height. The p-value for the interaction effect (0.904053) is also greater than 0.05, which indicates that there is no significant interaction effect between fertilizer frequency and watering frequency.
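These p-values can also be pulled out of the returned table programmatically; a short sketch continuing from the result DataFrame above:
# result has one row per term and a "PR(>F)" column of p-values (the Residual row is NaN and is dropped)
for term, p in result["PR(>F)"].dropna().items():
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"{term}: p = {p:.6f} ({verdict})")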
10. Write a Python Program on K-Means Clustering
Identification of customers based on their choices is an important strategy in any organization.
This identification can help in approaching customers with specific offers. An organization with
a large number of customers may find it difficult to identify and keep a record of each customer.
Extracting insights from the large amount of information collected about customers involves heavy
data processing and automated techniques. Clustering methods can help identify customers based
on their key characteristics.
Dataset: https://www.kaggle.com/datasets/shwetabh123/mall-customers
The first step is importing the required libraries.
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from yellowbrick.cluster import KElbowVisualizer
from matplotlib import pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
%matplotlib inline
sns.set_style('whitegrid')
plt.style.use('fivethirtyeight')
dataset = pd.read_csv("Mall_Customers.csv", sep=",")
dataset.head()
dataset.info()
data_x = dataset.iloc[:, 3:5]   # Annual Income and Spending Score columns
data_x.head()
x_array = np.array(data_x)
print(x_array)
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
x_scaled
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('SSE')
plt.title('Elbow Method For Optimal k')
plt.show()
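KElbowVisualizer is imported above but never used; a short sketch showing how yellowbrick can produce the same elbow plot and mark the detected elbow automatically:
# The visualizer fits KMeans for k = 2..14 and highlights the elbow it detects
visualizer = KElbowVisualizer(KMeans(), k=(2, 15))
visualizer.fit(x_scaled)
visualizer.show()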
After finding the optimal number of clusters, fit the K-Means clustering model to the dataset and
then predict clusters for each of the data elements.
numerics = dataset[['Annual Income (k$)', 'Spending Score (1-100)']]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
for i in numerics:
    scaler.fit(dataset[[i]])
    dataset[i] = scaler.transform(dataset[[i]])
km = KMeans(n_clusters=5)
y_predicted = km.fit_predict(dataset[['Annual Income (k$)', 'Spending Score (1-100)']])
y_predicted
After we get the cluster of each data point, we add a new column named "Cluster".
dataset["Cluster"] = y_predicted
dataset.head(10)
And the last step is visualizing the result of the clustering. For better representation, we need to
give each of the clusters a unique color and name.
plt.figure(figsize=(12,8))
df1 = dataset[dataset.Cluster==0]
df2 = dataset[dataset.Cluster==1]
df3 = dataset[dataset.Cluster==2]
df4 = dataset[dataset.Cluster==3]
df5 = dataset[dataset.Cluster==4]
plt.scatter(df1['Annual Income (k$)'], df1['Spending Score (1-100)'], color='green', label='Target Group')
plt.scatter(df2['Annual Income (k$)'], df2['Spending Score (1-100)'], color='magenta', label='Sensible')
plt.scatter(df3['Annual Income (k$)'], df3['Spending Score (1-100)'], color='orange', label='Careless')
plt.scatter(df4['Annual Income (k$)'], df4['Spending Score (1-100)'], color='red', label='Careful')
plt.scatter(df5['Annual Income (k$)'], df5['Spending Score (1-100)'], color='blue', label='Standard')
plt.title('Clustering Result', fontweight='bold',fontsize=20)
plt.xlabel('Annual Income (k$)',fontsize=15)
plt.ylabel('Spending Score (1-100)',fontsize=15)
plt.legend(fontsize=15)
plt.grid(True)
plt.show()
11. Write a Python program on Association Rule Mining
Apriori Algorithm: the algorithm first finds itemsets of products that frequently occur together, and then looks for strong association rules between these products and other products.
DATASET: https://www.kaggle.com/datasets/shazadudwadia/supermarket
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt
df = pd.read_csv('GroceryStoreDataSet.csv', names = ['products'], sep = ',')
df.head()
df.shape
data = list(df["products"].apply(lambda x:x.split(",") ))
data
#Let's transform the list, with one-hot encoding
from mlxtend.preprocessing import TransactionEncoder
a = TransactionEncoder()
a_data = a.fit(data).transform(data)
df = pd.DataFrame(a_data,columns=a.columns_)
df = df.replace(False,0)
df
#set a threshold value for the support value and calculate the support value.
df = apriori(df, min_support = 0.2, use_colnames = True)
df
#Processing 42 combinations | Sampling itemset size 3
#Let's view our interpretation values using the association_rules function.
df_ar = association_rules(df, metric = "confidence", min_threshold = 0.6)
df_ar
For example, if we examine our 1st index value:
● The probability of seeing sugar sales is seen as 30%.
● Bread intake is seen as 65%.
● We can say that the support of both of them together is measured as 20%.
● 67% of those who buy sugar buy bread as well.
● Users who buy sugar will likely consume 3% more bread than users who don't buy sugar.
● Their correlation (lift) with each other is seen as 1.05.
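The figures quoted in these bullets come straight from the columns of df_ar; a short sketch printing them for the 1st index value:
# Supports, confidence and lift of the rule at index 1 of the association_rules output
print(df_ar.loc[1, ['antecedent support', 'consequent support', 'support', 'confidence', 'lift']])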
Visualizing results
1. Support vs Confidence
plt.scatter(df_ar['support'],df_ar['confidence'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('confidence')
plt.title('Support vs Confidence')
plt.show()
2. Support vs Lift
plt.scatter(df_ar['support'],df_ar['lift'], alpha=0.5)
plt.xlabel('support')
plt.ylabel('lift')
plt.title('Support vs Lift')
plt.show()
3. Lift vs Confidence
fit = np.polyfit(df_ar['lift'],df_ar['confidence'], 1)
fit_fn = np.poly1d(fit)
plt.plot(df_ar['lift'], df_ar['confidence'], 'yo', df_ar['lift'],
fit_fn(df_ar['lift']))