VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANASANGAMA, BELAGAVI, KARNATAKA-590018

An Internship Report on
“DATA PRE-PROCESSING OF MALL CUSTOMERS DATASET”

Submitted in partial fulfillment towards award of the degree of
BACHELOR OF ENGINEERING in Computer Science and Engineering

Submitted by
Deepthi Shekar K 4GW18CS022

Internship carried out at
Tequed Labs
No 10, 3rd A Cross, Anjaneya Nagar, BSK 3rd Stage, Bangalore-560085

Internal Guide: Dr. Gururaj K S (Professor)
External Guide: Supreeth Y S (CEO, Tequed Labs)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(Accredited by NBA, New Delhi, Validity: 01.07.2017 – 30.06.2020 & 01.07.2020 – 30.06.2023)
GSSS INSTITUTE OF ENGINEERING & TECHNOLOGY FOR WOMEN
(Affiliated to VTU, Belagavi, Approved by AICTE, New Delhi & Govt. of Karnataka)
(Accredited with Grade ‘A’ by NAAC)
K.R.S Road, Metagalli, Mysuru-570016, Karnataka
2021-2022

Geetha Shishu Shikshana Sangha (R)
GSSS INSTITUTE OF ENGINEERING & TECHNOLOGY FOR WOMEN
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

CERTIFICATE

This is to certify that the 8th Semester Internship titled “DATA PRE-PROCESSING OF MALL CUSTOMERS DATASET” is a bonafide work carried out by Deepthi Shekar K (4GW18CS022), in partial fulfillment for the award of Degree of Bachelor of Engineering in Computer Science and Engineering of the Visvesvaraya Technological University, Belagavi, during the year 2021-22. The Internship Report has been approved as it satisfies the academic requirements with respect to the Internship work prescribed for Bachelor of Engineering Degree.

Signature of Guide: Dr. Gururaj K S, Professor
Signature of HOD: Dr. S Meenakshi Sundaram, Professor and Head
Signature of the Principal: Dr. Shivakumar M, Principal

Examiners
Internal Examiner: Name ………………………………. Signature ………………………………
External Examiner: Name ………………………………. Signature ………………………………

ACKNOWLEDGEMENT

I sincerely owe my gratitude to all the persons who helped and guided me to carry out the internship.

I am thankful to Mrs. Vanaja B Pandit, Honorary Secretary, GSSSIETW, Mysuru, for having supported me in my academic endeavors.

I am thankful to Dr. Shivakumar M, Principal, GSSSIETW, Mysuru, for all the support he has rendered.

I thank Dr. S Meenakshi Sundaram, Professor and Head, Department of Computer Science and Engineering, for his constant support and encouragement throughout the tenure of this internship work.

I would like to thank Mr. Aditya S K, who guided the internship work at the company.

I would like to sincerely thank my guide Dr. Gururaj K S, Professor, Department of Computer Science and Engineering, for providing relevant information, valuable guidance and encouragement to complete this internship work.

I am extremely pleased to thank my parents, family members and friends for their continuous support, inspiration, encouragement and helping hand. Last but not least, I thank everyone who supported the internship work directly or indirectly.

Deepthi Shekar K [4GW18CS022]

INTERNSHIP COMPLETION CERTIFICATE

ABSTRACT

Customer segmentation is the separation of a market into multiple distinct groups of consumers who share similar characteristics. Segmenting the market is an effective way to define and meet customer needs. Unsupervised machine learning techniques, in particular the K-Means clustering algorithm, are used to perform Market Basket Analysis. Market Basket Analysis is carried out to identify, among all the customers, the target customers who can be converted most easily. This allows the marketing team to plan a strategy for marketing new products that match the interests of the target customers.
Managing and maintaining customer relationships has always played a vital role in providing business intelligence that helps organizations build, manage and develop valuable long-term customer relationships. Treating customers as an organization's main asset is becoming increasingly important in the present day. Organizations have an interest in investing in customer acquisition, maintenance and development strategies. Business intelligence plays a vital role in allowing companies to use technical expertise to gain better customer knowledge and to build outreach programs.

Key words: Target Customers, Clusters, Unsupervised Learning, K-Means, Market Basket Analysis.

TABLE OF CONTENTS
Acknowledgement
Company Certificate
Abstract
List of Figures
1 COMPANY PROFILE
1.1 About the company
1.2 History of the company
1.3 Founders of the company
1.4 Activities organized by the company
1.5 Services offered by company
1.6 Organization of the Report
2 INTRODUCTION
2.1 Objectives
2.2 Problem Statement
2.3 Proposed Solution
3 AREAS OF LEARNING
4 ABOUT THE PROJECT
4.1 Overview of the Project
4.2 System Requirement Specification
4.3 Architecture
4.4 Task: Data Acquisition and Cleaning
4.5 Implementation Code
5 RESULTS AND DISCUSSION
SNAPSHOTS
CONCLUSION
REFERENCES

LIST OF FIGURES
FIGURE NUMBER    DESCRIPTION
1.1    Company Logo
4.1    Model Architecture
5.1    Loading Data
5.2    Renaming Columns
5.3    Display of first five rows of dataset
5.4    Dropping off Irrelevant Columns
5.5    Retrieving Description of Dataset
5.6    Checking missing values
5.7    Graph showing Data is cleaned and no missing values found
5.8    Retrieving the data types of individual columns

Data Pre-Processing of Mall Customers Dataset

Chapter 1
COMPANY PROFILE

1.1 About the Company
Tequed Labs Private Limited is a private company incorporated on 22 January 2018.
It is classified as a non-government company and is registered with the Registrar of Companies, Bangalore.

Figure 1.1 Company Logo

Tequed Labs is a research and development center and educational institute based in Bangalore. It runs a project consultancy that undertakes projects from a wide range of companies, assists them technically, builds products and provides services to them. The company is continuously involved in research on futuristic technologies and in finding ways to simplify them for its clients. It is also involved in the distribution and sale of the latest electronic innovation products, developed all over the globe, to its customers.

1.2 History of the Company
Tequed Labs Private Limited was incorporated on 22 January 2018 as a private, non-government company registered with the Registrar of Companies, Bangalore. The company is recognized for several innovative projects, including a women's safety device that sends signals to the nearby police station; this project was highly appreciated by the Government of Karnataka.

Dept. of CSE 1 GSSSIETW, Mysuru

1.2.1 Vision
To be a world-class research and development organization committed to enhancing stakeholders' value.

1.2.2 Mission
To build the best products, which are socially innovative and of high quality, and to provide excellent education to all.

1.2.3 Values
Zeal to excel and zest for change.
Integrity and fairness in all matters.
Respect for dignity and potential of individuals.
Strict adherence to commitments.
Ensure speed of response.
Faster learning, creativity and team-work.
Loyalty and pride in the company.

1.3 Founders of the Company
Tequed Labs is a research and development center and educational institute based in Bangalore, started by Mr Aditya S K and Mr Supreeth Y S. They focus on providing quality education on the latest technologies and on developing products that are of great need to society.

1.4 Activities Organized by the Company
The company provides quality education on the latest technologies and develops products that are of great need to society. It provides education in the following domains:
Artificial intelligence and machine learning
Internet of things
Cyber security and ethical hacking
Full stack web development

1.4.1 Technology Consulting
Consulting
Customization
Branding
Technology Migration

1.5 Services offered by the Company
Tequed Labs Pvt Ltd is a one-stop partner for all technology needs of tier II cities. An in-depth knowledge of various technology areas enables us to provide end-to-end solutions and services with our web of participation. We maximize the benefits of our depth, diversity and delivery capability, ensuring adaptability to individual needs, and thus bringing out the most innovative solution in every business and technology domain.

1.5.1 Domain of working
Artificial Intelligence and Machine Learning.
Internet of Things.
Cyber Security and Ethical Hacking.
Full Stack Web Development.

1.6 Organization of the Report
The report is organized in the following manner:
Chapter 1 focuses on the company profile: about the company, history of the company, founders of the company, activities organized by the company and services offered by the company.
Chapter 2 focuses on the introduction, objectives, problem statement and proposed solution.
Chapter 3 focuses on the areas of learning, namely data science, clustering, unsupervised learning and k-means clustering.
Chapter 4 focuses on the overview of the project, the system requirements, the architecture, data acquisition and cleaning, and the implementation code.
Chapter 5 provides the results and discussion.

Chapter 2
INTRODUCTION

Managing and maintaining customer relationships has always played a vital role in providing business intelligence that helps organizations build, manage and develop valuable long-term customer relationships. Treating customers as an organization's main asset is becoming increasingly important in the present day. Organizations have an interest in investing in customer acquisition, maintenance and development strategies. Business intelligence plays a vital role in allowing companies to use technical expertise to gain better customer knowledge and to build outreach programs. By using clustering techniques like k-means, customers with similar characteristics are clustered together.

2.1 Objective
The main objective of this project is to understand the need for customer data segmentation. The customer-organization relationship plays a major role in the development of a business and its products. The key reasons for understanding customer data segmentation are:
It supports the management and maintenance of customer relationships.
It helps in developing marketing strategies.
It helps in understanding customer needs and expectations.
It leads to business growth and production.

2.2 Problem Statement
Customer segmentation is a well-known application of unsupervised learning. Using clustering, companies identify segments of customers in order to focus on the potential client base. They divide customers into groups according to common characteristics like gender, age, interests and spending habits so that they can market to each group effectively. The task is to use K-means clustering and also visualize the gender and age distributions.
Then the customers' annual incomes and spending scores are analyzed. The first part of the problem description elaborates on the issues that arise when an organization has no defined customer segments. One of the biggest challenges with customer segmentation is data quality: inaccurate data in the source system leads to poor grouping.

2.3 Proposed Solution
Earlier, the segmentation process was done manually. Because the previous models were built on static data, the system needs updated values and methods. Machine learning approaches are a powerful tool for analyzing customer data and finding insights and patterns. Artificially intelligent models are powerful tools for decision-makers. They can precisely identify customer segments, which is much harder to do manually or with conventional analytical methods. There are many machine learning algorithms, each suitable for a specific type of problem. One very common machine learning algorithm that is suitable for customer segmentation problems is the k-means clustering algorithm.

Chapter 3
Areas of Learning

3.1 DATA SCIENCE:
Data science is the in-depth study of massive amounts of data. It involves extracting meaningful insights from raw, structured and unstructured data, which is processed using the scientific method, different technologies and algorithms. It is a multidisciplinary field that uses tools and techniques to manipulate data so that one can find something new and meaningful. Data science relies on powerful hardware, programming systems and efficient algorithms to solve data-related problems. It is closely tied to the future of artificial intelligence.

3.2 CLUSTERING:
Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset. It can be defined as "A way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group." It works by finding similar patterns in the unlabelled dataset, such as shape, size, color or behavior, and divides the data according to the presence or absence of those patterns. It is an unsupervised learning method, so no supervision is provided to the algorithm, and it deals with unlabelled data. After the clustering technique is applied, each cluster or group is given a cluster ID. An ML system can use this ID to simplify the processing of large and complex datasets.

3.3 UNSUPERVISED LEARNING:
As the name suggests, unsupervised learning is a machine learning technique in which models are not supervised using a labelled training dataset. Instead, the models themselves find the hidden patterns and insights in the given data. It can be compared to the learning that takes place in the human brain while learning new things. It can be defined as: Unsupervised learning is a type of machine learning in which models are trained using an unlabelled dataset and are allowed to act on that data without any supervision. Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.

3.4 K-MEANS CLUSTERING:
K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning and data science.
It groups the unlabelled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2 there will be two clusters, for K=3 there will be three clusters, and so on. The algorithm takes the unlabelled dataset as input, divides the dataset into K clusters, and repeats the process until the cluster assignments stop improving. The value of K must be chosen before the algorithm is run.

Chapter 4
ABOUT THE PROJECT

4.1 Overview of the Project
Data science is a technology that makes it easier to analyze generated data, make decisions, understand business strategies and make future predictions. Mall customer data segmentation is necessary and is one of the most important steps in understanding customers and their expectations. Treating customers as an organization's main asset is becoming increasingly important in the present day. Organizations have an interest in investing in customer acquisition, maintenance and development strategies. Business intelligence plays a vital role in allowing companies to use technical expertise to gain better customer knowledge and to build outreach programs. By using clustering techniques like k-means, customers with similar characteristics are clustered together.

Customer segmentation helps the marketing team to recognize and expose different customer segments that think differently and follow different purchasing strategies. It helps in identifying customers who vary in terms of preferences, expectations, desires and attributes. The main purpose of performing customer segmentation is to group people who have similar interests so that the marketing team can converge on an effective marketing plan.
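
The k-means procedure that this grouping relies on (assign every point to its nearest centroid, recompute each centroid as the mean of its members, repeat until the assignments stop changing) can be sketched in a few lines of plain Python. This is only an illustrative sketch and not the code used in the internship: the function name `kmeans`, the sample points and the choice of the first K points as initial centroids are all assumptions made for the example.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal k-means (Lloyd's algorithm) sketch on a list of numeric tuples.

    Assumption: the first k points are used as initial centroids so the
    result is reproducible; practical implementations usually choose the
    initial centroids at random or with k-means++.
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [None] * len(points)

    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )

    return labels, centroids


# Hypothetical (annual income, spending score) pairs, not taken from the dataset.
sample = [(15, 80), (16, 78), (17, 82), (85, 15), (88, 12), (90, 18)]
labels, centroids = kmeans(sample, k=2)
print(labels)
print(centroids)
```

On these made-up points the low-income, high-spending customers end up in one cluster and the high-income, low-spending customers in the other. With real data the outcome depends on the initial centroids, which is why library implementations typically run several initializations.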

Clustering is an iterative process of knowledge discovery from vast amounts of raw and unorganized data. Clustering is a type of exploratory data mining that is used in many applications, such as machine learning, classification and pattern recognition.

4.2 System Requirements
System requirements should describe functional and non-functional requirements so that they are understandable by system users who do not have detailed technical knowledge. User requirements are described using natural language, tables and diagrams.

Software Requirements:
1. Language: Python
2. Operating System: Windows 7 and above
3. IDE: Jupyter Notebook
4. Libraries: Pandas, NumPy, Matplotlib, Seaborn

Hardware Requirements:
1. Processor: Intel i5 2.39GHz
2. Hard disk: 500GB
3. RAM: 8GB

4.3 Architecture
The model architecture of the Customer Segmentation is as shown in the following figure:

Figure 4.1 Model Architecture

4.4 Task: Data Acquisition and Cleaning

4.4.1 Selection of dataset
Feature selection is the process of reducing the number of input variables when developing a predictive model. It is desirable to reduce the number of input variables both to reduce the computational cost of modeling and, in some cases, to improve the performance of the model. Statistical-based feature selection methods involve evaluating the relationship between each input variable and the target variable using statistics and selecting those input variables that have the strongest relationship with the target variable. These methods can be fast and effective, although the choice of statistical measures depends on the data type of both the input and output variables.

4.4.2 Pre-Processing
Pre-processing refers to the transformations applied to our data before feeding it to the algorithm.
Data preprocessing is a technique that is used to convert raw data into a clean dataset. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not suitable for analysis.

4.4.3 Data Cleaning
Data cleaning, or data cleansing, is the important process of correcting or removing incorrect, incomplete or duplicate data within a dataset. Data cleaning should be the first step in our workflow. When working with large datasets and combining various data sources, there is a strong possibility that data gets duplicated or mislabelled. If the data is inaccurate or incorrect, it loses its quality, and our algorithms and outcomes become unreliable.

4.5 Implementation Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Load the dataset (path as used on the author's machine)
df = pd.read_csv('C:/Users/Admin/Downloads/Works/Mall_Customers_dataset.csv')
df

# Rename the 'Genre' column to 'Gender'
df.rename(columns={'Genre': 'Gender'}, inplace=True)
df

df.head()  # Print the first 5 rows of the table

# Drop the irrelevant 'CustomerID' column
df.drop(['CustomerID'], axis=1, inplace=True)
df

df.shape  # Number of rows and columns
df.describe()  # Summary statistics of the numeric columns
df.isnull().sum()  # Check whether there are any null values in the dataset
sns.heatmap(df.isnull())  # Graph showing the data is clean and no missing values are present
df.dtypes  # Data types of the individual columns

Chapter 5
RESULTS AND DISCUSSION

5.1 TECHNICAL OUTCOMES
The internship at Tequed Labs has been a success. It was a fully online internship that took us from a beginner level to an advanced level, where we were able to write our own programs and understand programs written by others. Data science is one of the leading technologies that help in managing and storing data and in making future predictions.
Using the techniques of data science, it is easy to classify and group the collected data and make predictions for the growth of a business.

5.2 SKILLS DEVELOPED

Problem Solving
Problem solving is the act of defining a problem, determining the cause of the problem, identifying, prioritizing and selecting alternatives for a solution, and implementing a solution. Data scientists should have a rigorous, data-driven problem-solving approach to their thinking. Top data scientists are able to discern which problems are important to solve and then model what is critical to solving the problem. There is no template for solving a data science problem. The path to solving a business problem changes with every new dataset. In addition, the practice of data science is riddled with challenges like missing data values, uncooperative stakeholders and coding bugs.

Communication
Along with being able to create great visualizations to communicate results to end users, data scientists must possess persuasive communication skills and strong interpersonal skills to see a project through from start to finish. In their role, they may have to interact with a variety of personalities and stakeholders, from technical IT staff and software engineers to marketing managers, other functional staff and C-suite managers. Certainly, to progress in the ranks as a data scientist, communication skills need to be strong.

5.3 DRAWBACKS
The k-means algorithm is good at capturing the structure of the data if the clusters have a roughly spherical shape. It always tries to construct a spherical shape around the centroid. This means that as soon as the clusters have complicated geometric shapes, k-means does a poor job of clustering the data. The k-means algorithm does not let data points that are far away from each other share the same cluster, even when they obviously belong to the same cluster.

SNAPSHOTS
Figure 5.1 Loading Data
Figure 5.2 Renaming Columns
Figure 5.3 Display of first five rows of dataset
Figure 5.4 Dropping off Irrelevant columns
Figure 5.5 Retrieving Description of Dataset
Figure 5.6 Checking missing values
Figure 5.7 Graph showing Data is cleaned and no missing values found
Figure 5.8 Retrieving the data types of individual columns

CONCLUSION
This report uses the Mall Customers dataset, which contains information about people visiting the mall. The dataset has gender, customer ID, age, annual income and spending score. The work collects insights from the data and groups customers based on their behavior. Before starting the predictions, the report gives a brief summary of model evaluation, explaining the most common metrics used in categorical problems in machine learning. In data preparation, the training and testing sets are created, and they are used during model building. In data exploration and visualization, we look for features that may provide good prediction results. The best predictors have little overlap between their distributions and low correlation among them. Model building starts by taking two features at a time and clustering them with k-means. This is done three times: first with age and spending score, second with annual income and spending score, and finally with all three features combined. The report begins with very simple models and gradually moves to more complex ones. There is a brief explanation of some of the models used in this report.
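
The three feature combinations named in the conclusion can be prepared from the cleaned DataFrame as in the short pandas sketch below. It is an illustration under stated assumptions rather than the code used in the internship: the column names `Annual Income (k$)` and `Spending Score (1-100)` are assumed to match the Kaggle CSV, the four rows are hypothetical, and the z-score standardization in the helper `standardize` is an optional step added here, since the report does not say whether the features were scaled.

```python
import pandas as pd

# Hypothetical rows; the column names are assumed to match the Kaggle CSV
# after the 'Genre' -> 'Gender' rename and the removal of 'CustomerID'.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 21, 35, 52],
    "Annual Income (k$)": [15, 16, 60, 88],
    "Spending Score (1-100)": [39, 81, 50, 13],
})

# The three feature combinations named in the conclusion.
feature_sets = {
    "age_spending": ["Age", "Spending Score (1-100)"],
    "income_spending": ["Annual Income (k$)", "Spending Score (1-100)"],
    "all_three": ["Age", "Annual Income (k$)", "Spending Score (1-100)"],
}


def standardize(frame):
    """Return a z-scored copy so each column has mean 0 and unit variance."""
    return (frame - frame.mean()) / frame.std(ddof=0)


# One standardized NumPy array per feature set, ready for a clustering step.
prepared = {
    name: standardize(df[cols]).to_numpy()
    for name, cols in feature_sets.items()
}

for name, matrix in prepared.items():
    print(name, matrix.shape)
```

Scaling matters for k-means because it relies on distances: without it, a feature with a larger numeric range, such as annual income, could dominate the cluster assignment.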

REFERENCES

IEEE/Journal Papers
[1] M. Thirunavakarasu, Kuncham Pavan Kumar Reddy, G. Srinivasa Teja, “Customer Segmentation In Shopping Mall Using Clustering In Machine Learning”.

WEBSITES
https://quanthub.com/data-science-skills/
https://www.javatpoint.com/data-science
https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python?resource=download
https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientist-article