Skyline University College, Sharjah, United Arab Emirates
Big Data Analytics
Course Code: BIT4118

Student Name:
Student ID:
Remarks:

EXPERIMENT NO: 1

Name of experiment: CLUSTERING

Goal: Understanding the intuition of clustering using K-means clustering

Theory:

Imagine a dataset consisting of several points spread over an n-dimensional space. To find patterns among data points in an n-dimensional space without labels, we use unsupervised methods. One of the most popular unsupervised methods is clustering. Clustering is the task of grouping a set of objects such that objects in the same cluster are more similar to each other than to objects in other clusters.

Clustering algorithms can be categorized by their cluster model, in other words by how they form clusters or groups. Some of the prominent families are connectivity-based clustering, centroid-based clustering, distribution-based clustering and density-based clustering.

In this exercise, centroid-based clustering is implemented. In this type of clustering, each cluster is represented by a central vector, or centroid. The centroid is not necessarily a member of the dataset. K-means is an iterative algorithm of this type, where the notion of similarity is derived from how close a data point is to the center of its cluster. In this exercise, we will be working on mall customers data.

Software Tools: R-Studio

Procedure:

1. First, we randomly select k points as the initial cluster centers. These k points are the means.
2. We use Euclidean distance to assign each data point to the cluster whose center is closest to it.
3. Then we calculate the mean of all the points in each cluster, which gives the new centroid of that cluster.
4. We iteratively repeat step 2 and step 3 until the cluster assignments no longer change.

CODE

library(tidyverse)
library(arules)
library(arulesViz)
library(knitr)
library(gridExtra)
library(lubridate)
library(readr)
library(cluster)
library(factoextra)

# Load the mall customers data and preview the first rows
dataset <- read.csv('Mall_Customers.csv')
head(dataset)

# kmeans() needs complete, numeric input: drop rows with missing values and
# keep only the numeric columns (an ID column, if present, should also be dropped)
features <- na.omit(dataset)
features <- features[sapply(features, is.numeric)]

# Fix the seed so the random initial centers are reproducible
set.seed(123)
kmeans2 <- kmeans(features, centers = 5)
str(kmeans2)

# Plot the clusters, then the within-cluster sum of squares against k (elbow method)
fviz_cluster(kmeans2, data = features)
fviz_nbclust(features, kmeans, method = "wss")

OUTPUT

Summary of data
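
To make the Procedure concrete, the steps can be sketched by hand in base R. This is only an illustrative sketch, not the implementation behind the built-in kmeans() function: the helper name simple_kmeans, its max_iter argument and the synthetic two-blob data are assumptions introduced here, and the loop repeats the assignment and update steps until no point changes cluster.

```r
# Minimal k-means sketch in base R (simple_kmeans is a hypothetical helper,
# written for illustration; it is not part of any package).
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 1: pick k distinct rows at random as the initial centers
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # Step 2: squared Euclidean distance from every point to every center,
    # then assign each point to its nearest center
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")
    # Step 4: stop once no point changes cluster
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Step 3: move each center to the mean of the points assigned to it
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers, iterations = iter)
}

set.seed(42)
# Synthetic data, made up for illustration: two blobs of 20 points each
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.5), ncol = 2)
)
fit <- simple_kmeans(pts, k = 2)
table(fit$cluster)  # number of points in each cluster
fit$centers         # final centroids, one row per cluster
```

Because the initial centers are chosen at random, different seeds can give different cluster labels and occasionally different clusters, which is why a seed is set before the call. For real work the built-in kmeans() function is the better choice, since it supports several random restarts through its nstart argument.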