Uploaded by yasasal337

026146633

advertisement
VISVESVARAYA TECHNOLOGICAL UNIVERSITY
JNANASANGAMA, BELAGAVI, KARNATAKA-590018
An Internship Report
on
“DATA PRE-PROCESSING OF MALL CUSTOMERS
DATASET”
Submitted in partial fulfillment towards award of the degree of
BACHELOR OF ENGINEERING
in
Computer Science and Engineering
Submitted by
Deepthi Shekar K
4GW18CS022
Internship carried out at
Tequed Labs
rd
No 10, 3 A Cross, Anjaneya Nagar, BSK 3rd stage
Bangalore-560085
Internal Guide
External Guide
Dr. Gururaj K S(Professor)
Supreeth Y S (CEO, Tequed Labs)
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(Accredited by NBA, New Delhi, Validity: 01.07.2017 – 30.06.2020 & 01.07.2020 – 30.06.2023)
GSSS INSTITUTE OF ENGINEERING & TECHNOLOGY FOR WOMEN
(Affiliated to VTU, Belagavi, Approved by AICTE, New Delhi & Govt. of Karnataka)
(Accredited with Grade ‘A’ by NAAC)
K.R.S Road, Metagalli, Mysuru-570016, Karnataka
2021-2022
Geetha Shishu Shikshana Sangha (R)
GSSS INSTITUTE OF ENGINEERING & TECHNOLOGY FOR WOMEN
(Affiliated to VTU, Belagavi, Approved by AICTE -New Delhi & Govt. of Karnataka)
K.R.S Road, Mysuru-570016, Karnataka
Accredited with Grade ‘A’ by NACC
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
(Accredited by NBA, New Delhi, Validity 01.07.2017 to 30.06.2020 & 01.07.2020 to 30.06.2023)
CERTIFICATE
This is to certify that the 8th Semester Internship titled “DATA PRE-PROCESSING OF MALL
CUSTOMERS DATASET” is a bonafide work carried out by Deepthi Shekar
K(4GW18CS022), in partial fulfillment for the award of Degree of Bachelor of Engineering
in Computer Science and Engineering of the Visvesvaraya Technological University,
Belagavi, during the year 2021-22. The Internship Report has been approved as it satisfies the
academic requirements with respect to the Internship work prescribed for Bachelor of
Engineering Degree.
Signature of Guide
Dr. Gururaj K S
Signature of HOD
Signature of the Principal
Dr. S Meenakshi Sundaram
Designation
Professor and Head
Dr. Shivakumar M
Principal
Examiners
Internal Examiner
External Examiner
……………………………….
…………………………………
Signature: ………………………………
………………………………...
Name:
ACKNOWLEDGEMENT
I sincerely owe my gratitude to all the persons who helped and guided me to
carry out the internship.
I am thankful to Mrs. Vanaja B Pandit, Honorary Secretary, GSSSIETW,
Mysuru, for having supported in my academic endeavors.
I am thankful to Dr. Shivakumar M, Principal, GSSSIETW, Mysuru, for all
the support he has rendered.
I thank Dr. S Meenakshi Sundaram, Professor and Head, Department of
Computer Science and Engineering, for his constant support and encouragement
throughout the tenure of this seminar work.
I would like to thank Mr. Aditya S K, who guided the internship work at
company
I would like to sincerely thank my guide Dr. Gururaj K S, Designation,
Department of Computer Science and Engineering, for providing relevant information,
valuable guidance and encouragement to complete this seminar work.
I am extremely pleased to thank my parents, family members and friends for
their continuous support, inspiration and encouragement, for their helping hand and
also last but not the least, I thank all the members who supported directly or indirectly
in the seminar work process.
Deepthi Shekar K
[4GW18CS022]
i
INTERNSHIP COMPLETION CERTIFICATE
ABSTRACT
Customer segmentation is a separation of a market into multiple distinct groups of
consumers who share the similar characteristics. Segmentation of market is an effective
way to define and meet customer needs. Unsupervised Machine Learning Techniques, KMeans Clustering Algorithm are used to perform Market Basket Analysis. Market Basket
Analysis is carried out to predict the target customers who can be easily converged, among
all the customers. In order to allow the marketing team to plan the strategy to market the
newproducts to the target customers which are like their interests.
Management and maintain of customer relationship have always played a vital role to
provide business intelligence to organizations to build, manage and develop valuable long
term customer relationships. The importance of treating customers as an organizations
main asset is increasing in value in present day and era. Organizations have an interest to
invest in the development of customer acquisition, maintenance and development
strategies. The business intelligence has a vital role to play in allowing companies to use
technical expertise to gain better customer knowledge and Programs for outreach.
Key words: Target Customers, Clusters, Unsupervised Learning, K-Means, Market Basket
Analysis.
ii
TABLE OF CONTENTS
Acknowledgement
i
Company Certificate
ii
Abstract
iii
List of Figures
iv
COMPANY PROFILE
1
1.1 About the company
1
1.2 History of the company
1
1.3 Founders of the company
2
1.4 Activities organized by the company
3
1.5 Services offered by company
3
1.6 Organization of the Report
3
INTRODUCTION
5
2.1 Objectives
5
2.2 Problem Statement
5
2.3 Proposed Solution
6
3
AREAS OF LEARNING
7
4
ABOUT THE PROJECT
9
4.1 Overview of the Project
9
4.2 System Requirement Specification
9
4.3 Architecture
10
4.4 Task: Data Acquisition and Cleaning
11
4.5 Implementation Code
12
RESULTS AND DISCUSSION
13
1
3
5
SNAPSHOTS
15
CONCLUSION
19
REFERNCES
20
iv
LIST OF FIGURES
FIGURE
NUMBER
DESCRIPTION
PAGE
NUMBER
1.1
Company Logo
1
4.1
Model Architecture
10
5.1
Loading Data
15
5.2
Renaming Columns
15
5.3
Display of first five rows of dataset
16
5.4
Dropping off Irrelevant Columns
16
5.5
Retrieving Description of Dataset
17
5.6
Checking missing values
17
5.7
5.8
Graph showing Data is cleaned and no missing
values found
Retrieving the data types of individual columns
iv
18
18
Data Pre-Processing of Mall Customers Dataset
Chapter 1
COMPANY PROFILE
1.1 About the Company
Tequed Labs Private Limited is a Private incorporated on 22 January 2018. It is
classified as Non-govt Company and is registered at Registrar of Companies, Bangalore.
Figure 1.1 Company Logo
Tequed Labs is a research and development center and educational institute based in
Bangalore. They run a project consultancy where they undertake various projects from
wide range of companies and assist them technically and build products and provide
services to them. They are continuously involved in research about futuristic technologies
and finding ways to simplify them for their clients. They also involved in distribution and
sales of latest electronic innovation products developed all over the globe to their
customers.
1.2 History of the Company
Tequed Labs is a research and development center and educational institute based in
Bangalore. Tequed Labs Private Limited is a Private incorporated on 22 January 2018. It
is classified as Non-govt Company and is registered at Registrar of Companies,
Bangalore. They are continuously involved in research about futuristic technologies and
finding ways to simplify them for their clients. They run a project consultancy where they
undertake various projects from wide range of companies and assist them technically and
build products and provide services to them. They are recognized by many of their
innovative projects which includes ‘women’s safety device’ which sends signals to the
nearby police station, this project is highly appreciated by the Government of Karnataka.
Dept. of CSE
1
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
1.2.1 Vision
To be a world-class research and development organization committed to enhancing
stakeholder’s value.
1.2.2 Mission
To build best products that is socially innovative with high quality attributes and provides
excellent education to all.
1.2.3 Values
Zeal to excel and zest for change.
Integrity and fairness in all matters.
Respect for dignity and potential of individuals
Strict adherence to commitments.
Ensure speed of response.
Faster learning, creativity and team-work.
Loyalty and pride in the company.
1.3 Founders of the Company
Tequed Labs is a research and development center and educational institute based in
Bangalore started by Mr Aditya S K and Mr Supreeth Y S. They were focused on
providing quality education on latest technologies and develop products which are of
great need to the society.
Dept. of CSE
2
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
1.4 Activities Organized by the Company
They provide quality education on latest technologies and develop products which are of
great need to the society. They provide education based on the following domains.
Artificial intelligence and machine learning, Internet of things, Cyber security and ethical
hacking, Full stack web development.
1.4.1 Technology Consulting
Consulting
Customization
Branding
Technology Migration.
1.5 Services offered by the Company
Tequed lab Pvt Ltd, is one stop partner for all technology needs of tier II cities. An indepth knowledge of various technology areas enables us to provide end to end solutions
and services with our web of participation. We maximize the benefits of our depth,
diversity and delivery capability, ensuring adaptability to individual needs, and thus
bringing out the most innovative solution in every business and technology domain.
1.5.1 Domain of working
Artificial Intelligence and Machine Learning.
Internet of Things.
Cyber Security and Ethical Hacking.
Full Stack Web Development.
1.6 Organization of the Report
The report is organized in the following manner:
Chapter 1 focused on the Company profile that is about the About the company, history
of the company, founders of the company, activities organized by the company, services
Dept. of CSE
3
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
offered by company.
Chapter 2 focused on the Introduction, objectives, problem statement and proposed
solution.
Chapter 3 focuses on the Area of Learning that is full stack web development.
Chapter 4 focuses on the Overview of the Project, system requirement specification,
implementation, testing.
Chapter 5 provides the results and discussion
********
Dept. of CSE
4
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Chapter 2
INTRODUCTION
Management and maintain of customer relationship have always played a vital role to
provide business intelligence to organizations to build, manage and develop valuable long
term customer relationships. The importance of treating customers as an organizations main
asset is increasing in value in present day and era. Organizations have an interest to invest in
the development of customer acquisition, maintenance and development strategies. The
business intelligence has a vital role to play in allowing companies to use technical expertise
to gain better customer knowledge and Programs for outreach. By using clustering
techniques like k-means, customers with similar means are clustered together.
2.1
Objective
The main objective of this project is to understand the need of Customer Data Segmentation.
The customer- organization relation plays a major role in the development of business and
its products. Following are the key reasons for understanding Customer Data Segmentation:
Management and Maintenance of customer relationship.
Customer Data Segmentation helps in developing marketing strategies.
Helps in understanding customer needs and expectation.
Leads to business growth and production.
2.2
Problem Statement
Customer Segmentation is a famous application of unsupervised learning. Using clustering,
identify segments of customers to focus on the potential client base. They divide customers
into groups according to common characteristics like gender, age, interests, and spending
habits they can market to each group effectively. Utilize K-means clustering and
furthermore envision the orientation and age difference. Then, at that point, examine their
yearly earnings and division is that it centers around working on the relations with the client
spending scores. The initial segment of the issue portrayal explains on the issues behind the
idea of not having characterized client fragments inside an organization. One of the biggest
challenges with customer segmentation is data quality. Inaccurate data in source system
leads to poor grouping.
Dept. of CSE
5
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
2.3
Proposed Solution
Earlier, the segmentation process done by manually in before, since the previous models are
predicted by constant data, the system needs the updated values and methods. Machine
learning approaches are an incredible instrument for dissecting customer information and
tracking down bits of knowledge and examples. Misleadingly wise models are useful assets
for chiefs. They can exactly recognize client fragments, which is a lot harder to do
physically or with ordinary logical techniques. There are many machine learning algorithms,
each reasonable for a particular sort of issue. One extremely normal AI calculation that is
appropriate for client division issues is the k-means clustering algorithm.
Dept. of CSE
6
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Chapter 3
Areas of Learning
3.1 DATA SCIENCE:
Data science is a deep study of the massive amount of data, which involves extracting
meaningful insights from raw, structured, and unstructured data that is processed using the
scientific method, different technologies, and algorithms. It is a multidisciplinary field that
uses tools and techniques to manipulate the data so that one can find something new and
meaningful. Data science uses the most powerful hardware, programming systems, and most
efficient algorithms to solve the data related problems. It is the future of artificial
intelligence.
3.2 CLUSTERING:
Clustering or cluster analysis is a machine learning technique, which groups the unlabelled
dataset. It can be defined as "A way of grouping the data points into different clusters,
consisting of similar data points. The objects with the possible similarities remain in a group
that has less or no similarities with another group." It does it by finding some similar
patterns in the unlabelled dataset such as shape, size, color, behavior, etc., and divides them
as per the presence and absence of those similar patterns. It is an unsupervised
learning method, hence no supervision is provided to the algorithm, and it deals with the
unlabeled dataset. After applying this clustering technique, each cluster or group is provided
with a cluster-ID. ML system can use this id to simplify the processing of large and complex
datasets.
3.3 UNSUPERVISED LEARNING:
As the name suggests, unsupervised learning is a machine learning technique in which
models are not supervised using training dataset. Instead, models itself find the hidden
patterns and insights from the given data. It can be compared to learning which takes place
in the human brain while learning new things. It can be defined as: Unsupervised learning is
a type of machine learning in which models are trained using unlabeled dataset and are
allowed to act on that data without any supervision. Unsupervised learning cannot be
Dept. of CSE
7
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
directly applied to a regression or classification problem because unlike supervised learning,
we have the input data but no corresponding output data. The goal of unsupervised learning
is to find the underlying structure of dataset, group that data according to similarities, and
represent that dataset in a compressed format.
3.4 K-MEANS CLUSTERING:
K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science. K-Means Clustering is
an Unsupervised Learning algorithm, which groups the unlabeled dataset into different
clusters. Here K defines the number of pre-defined clusters that need to be created in the
process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and
so on. The algorithm takes the unlabeled dataset as input, divides the dataset into k-number
of clusters, and repeats the process until it does not find the best clusters. The value of k
should be predetermined in this algorithm.
Dept. of CSE
8
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Chapter 4
ABOUT THE PROJECT
4.1
Overview of the Project
Data Science is one technology using which it is easy to analyze the data generated, make
decisions, and understand business strategies and make future predictions. Mall Customers
Data Segmentation is a necessary and one of the most important aspects in order to
understand customers and their expectations.
The importance of treating customers as an organizations main asset is increasing in value in
present day and era. Organizations have an interest to invest in the development of customer
acquisition, maintenance and development strategies. The business intelligence has a vital
role to play in allowing companies to use technical expertise to gain better customer
knowledge and Programs for outreach. By using clustering techniques like k-means,
customers with similar means are clustered together. Customer segmentation helps the
marketing team to recognize and expose different customer segments that think differently
and follow differentpurchasing strategies.
Customer segmentation helps in figuring out the customers who vary in terms of
preferences, expectations, desires and attributes. The main purpose of performing customer
segmentation is to group people, who have similar interest so that the marketing team can
converge in an effective marketing plan. Clustering is an iterative process of knowledge
discovery from vast amounts of raw and unorganized data. Clustering is a type of
exploratory data mining that is used in many applications, such as machine learning,
classification and pattern recognition.
4.2
System Requirements
System requirements should describe functional and non-functional requirements so that
they are understandable by system users who do not have detailed technical knowledge.
User requirements are described using natural language, tables and diagrams.
Dept. of CSE
9
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Software Requirements:
1. Language: Python
2. Operating System: Windows 7 and above.
3. IDE: Jupyter Notebook
4. Libraries: Pandas
NumPy
Matplotlib
Seaborn.
Hardware Requirements:
1. Processor: Intel i5 2.39GHz
2. Hard disk: 500GB
3. RAM: 8GB
4.3
Architecture
The model architecture of the Customer Segmentation is as shown in the following figure:
Figure 4.1 Model Architecture
Dept. of CSE
10
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
4.4
Task: Data Acquisition and Cleaning
4.4.1 Selection of dataset
Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables to both reduce the
computational cost of modeling and, in some cases, to improve the performance of the
model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable. These methods can be fast and
effective, although the choice of statistical measures depends on the data type of both the
input and output variables.
4.4.2 Pre-Processing
Pre-processing refers to the transformations applied to our data before feeding it to the
algorithm. Data Preprocessing is a technique that is used to convert the raw data into a clean
data set. In other words, whenever the data is gathered from different sources it is collected
in raw format which is not feasiblefor the analysis.
4.4.3 Data Cleaning
Data cleaning, or data cleansing, is the important process of correcting or removing
incorrect, incomplete, or duplicate data within a dataset. Data cleaning should be the first
step in our workflow. When working with large datasets and combining various data
sources, there’s a strong possibility one may duplicate or mislabel data. If we have
inaccurate or incorrect data, it will lose its quality, and our algorithms and outcomes become
unreliable.
Dept. of CSE
11
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
4.5
Implementation Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
df=pd.read_csv('C:/Users/Admin/Downloads/Works/Mall_Customers_dataset.csv')
df
df.rename(columns={'Genre':'Gender'},inplace=True) #Renaming column
df
df.head()# Printing first 5 rows of the table
df.drop(['CustomerID'],axis=1,inplace=True) #Dropping off irrelevant columns
df
df.shape # To get the number of rows and columns
df.describe()
df.isnull().sum() #To check if there is any null values in datset
sns.heatmap(df.isnull()) # Graph showing Data is cleaned and no missing values are
present
df.dtypes
Dept. of CSE
12
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Chapter 5
RESULTS AND DISCUSSIONS
5.1
TECHNICAL OUTCOMES
The internship at Tequed Labs has been a success. This was a complete online internship,
which has taken from zero to an advanced level, where we were able to create own programs
and understand other programs as well. Data Science is one of the most leading technologies
that help in managing and storing data and also to make future predictions. Using the
techniques of Data Science, it is easy to classify and group the data collected and make
predictions for the growth of business.
5.2
SKILLS DEVLOPED
Problem Solving
Problem solving is the act of defining a problem, determining the cause of the problem,
identifying, prioritizing, and selecting alternatives for a solution, and Implementing a
solution. Data Scientists should have a rigorous data-driven problem-solving approach to
their thinking. Top Data Scientists are able to discern which problems are important to solve
and then model what is critical to solving the problem. There’s no template for solving a
data science problem. The path to solving a business problem changes with every new
dataset. In addition, the practice of data science is riddled with challenges like missing data
values, uncooperative stakeholders and coding bugs.
Communication
Along with being able to create great visualizations to communicate results to end users,
Data Scientists must possess persuasive communication skills and strong interpersonal skills
to see a project from start to finish. In their role, they may have to interact with a variety of
personalities and stakeholders from technical IT and software engineers to marketing
managers and other functional staff to C-suite managers. Certainly, to progress in the ranks
as a Data Scientist, communication skills need to be strong.
5.3
DRAWBACKS
K-means algorithm is good in capturing structure of the data if clusters have a spherical-like
Dept. of CSE
13
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
shape. It always tries to construct a nice spherical shape around the centroid. That means,
the minute the clusters have a complicated geometric shapes, k-means does a poor job in
clustering the data. K-means algorithm doesn’t let data points that are far-away from each
other share the same cluster even though they obviously belong to the same cluster.
Dept. of CSE
14
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
SNAPSHOTS
Figure 5.1 Loading Data
Figure 5.2 Renaming Columns
Dept. of CSE
15
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Figure 5.3 Display of first five rows of dataset
Figure 5.4 Dropping off Irrelevant columns
Dept. of CSE
16
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Figure 5.5 Retrieving Description of Dataset
Figure 5.6 Checking missing values
Dept. of CSE
17
GSSSIETW, Mysuru
Data Pre-Processing of Mall Customers Dataset
Figure 5.7 Graph showing Data is cleaned and no missing values found
Figure 5.8 Retrieving the data types of individual columns
Dept. of CSE
18
GSSSIETW, Mysuru
CONCLUSION
This report uses Mall customer datasets contains information about people visiting the mall.
The dataset has gender, customer id, age, annual income, and spending score. It collects
insights from the data and group customers based on their behaviors. Before starting the
predictions, the report makes a brief summary of model evaluation, explaining the most
common metricsused in categorical problems in machine learning.
In data preparation, the training and testing sets are created and they will be used during the
model building.
In data exploration and visualization, we look for features that may provide good
prediction results. The best predictors have low distribution overlapping area and low
correlation among them.
Model building starts with taking two features at a time and make clusters using K means
clustering it happens almost three time first age and spending score ,second annual income
and spending score at last we combine all three features and populate the cluster. explaining
very simple models and gradually moves to more complex ones. There’s a brief explanation
on some of the models used in this report.
XX
XX
REFERENCES
IEEE/Journal Papers
[1]“ Customer Segmentation In Shopping Mall Using Clustering In Machine
Learning”, M.Thirunavakarasu1, Kuncham Pavan Kumar Reddy2, G.Srinivasa Teja3
WEBSITES

https://quanthub.com/data-science-skills/

https://www.javatpoint.com/data-science

https://www.kaggle.com/datasets/vjchoudhary7/customer-segmentation-tutorialin-python?resource=download

https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientistarticle

https://www.simplilearn.com/what-skills-do-i-need-to-become-a-data-scientistarticle
********
xx
Download