Final internship report compressed

DATA SCIENCE
A SUMMER INTERNSHIP REPORT
Submitted by
HEM GONDALIYA
190050131019
In partial fulfilment for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
BABARIA INSTITUTE OF TECHNOLOGY
FEB-APRIL 2023
BABARIA INSTITUTE OF TECHNOLOGY
On national highway 8 varnama vadodara
CERTIFICATE
This is to certify that the project report submitted along with the project
entitled DATA SCIENCE has been carried out by HEM GONDALIYA
(190050131019) under my guidance in partial fulfillment for the degree of
Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING,
8th Semester of Gujarat Technological University,
Ahmedabad during the academic year 2022-23.
Mrs. Apoorva Shah
DR.NITESH SUREJA
Internal Guide
Head of the Department
TO WHOM IT MAY CONCERN
This is to certify that Hem Gondaliya, a student of Babaria Institute of Technology,
has successfully completed the project titled “customer segmentation” using
Python in our company in fulfillment of the
requirements of Degree Engineering. He took training in our company
from 1st February 2023 to 30th April 2023. During his stay at the company, he was
found sincere and hardworking.
During the period of his internship program with us, he was exposed to
different processes and was found diligent, hardworking and inquisitive.
We wish him the very best in all his future endeavors.
With Best Regards,
Divya Dharani
HR,
Teachnook Technologies Pvt. Ltd.
Bengaluru (Karnataka -INDIA)
Teachnook
No. 592, 3rd Block, Koramangala,
Bengaluru, Karnataka 560068
Info_hr@teachnook.com
Mob. +91-63600 93009
TEACHNOOK
592, 3rd Block,
Koramangala, Bengaluru, Karnataka
560068
Re: Internship Acceptance letter
Dear Hem Gondaliya ,
We are pleased to offer you, Mr. Hem Gondaliya, student of the B.E. (CSE) Department, Babaria Institute of
Technology, Vadodara, an internship in Introduction to Python with Data Science with our company,
Teachnook, in collaboration with Wissenaire (IIT Bhubaneswar). This is an Internship and Training Program.
Our goal is for you to learn more about the domain, to get real industrial knowledge & experience.
As we discussed, your internship is expected to last for 3 months from February,2023 to April,2023.
[However, at the sole discretion of the Company, the duration of the internship may be extended or
shortened with or without advance notice. During the Internship no leaves will be provided.]
As an intern, you will not be a Company employee. Therefore, you will not receive a salary, wages, or
other compensation. In addition, you will not be eligible for any benefits that the Company offers its
employees, including, but not limited to, health benefits, holiday pay, vacation pay, sick leave, retirement
benefits. You understand that participation in the internship program is not an offer of employment, and
successful completion of the internship does not entitle you to employment with the Company.
During your internship, you may have access to confidential, proprietary, and/or trade secret information
belonging to the Company. You agree that you will keep all this information strictly
confidential and refrain from using it for your own purposes or from disclosing it to anyone outside the
Company. In addition, you agree that, upon conclusion of the internship, you will immediately return to
the Company all its property, equipment, and documents, including electronically stored information.
By accepting this offer, you agree that you will follow all of the Company's policies that apply to non-employee interns, including the Company's anti-harassment policy.
This letter constitutes the complete understanding between you and the Company regarding your
internship and supersedes all prior discussions or agreements. This letter may only be modified by a
written agreement signed by both of us.
I hope that your internship with the Company will be successful and rewarding. Please
indicate your acceptance of this offer by signing below and returning it to our company desk.
If you have any questions, please do not hesitate to contact us.
Very truly yours,
Saumya Tiwari
HR - Manager
TEACHNOOK
I accept the internship with the Company on the terms and conditions set out in this letter.
Date : 04/01/2023
Signature
JOINING LETTER
COMPLETION CERTIFICATE
BABARIA INSTITUTE OF TECHNOLOGY
On national highway 8 varnama vadodara
DECLARATION
I hereby declare that the Internship report submitted along with the Project
entitled DATA SCIENCE, submitted in partial fulfilment for the degree of
Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING to
Gujarat Technological University, Ahmedabad, is a bonafide record of
original project work carried out by me at Brainy Beams Pvt Ltd. under the
supervision of Mrs. Apoorva Shah and that no part of this report has been
directly copied from any student's report or taken from any other source
without providing due reference.
Name of Student
HEM GONDALIYA
Project id: 300622
ACKNOWLEDGEMENT
I would sincerely like to thank my internal faculty mentor, Mrs. Apoorva
Shah, for guiding me and giving me this opportunity to proceed with the Data
Science internship at Teachnook, which helped me to brush up my skills to
match industry requirements. I would like to extend my gratitude
to Teachnook, who saw the right candidate in me for this internship and gave
me this opportunity to be a part of data science and machine learning. I also
appreciate the guidance given by the developer at Teachnook, Jinal Modi,
as well as the internship panels, who advised and guided me
at every moment of the internship.
HEM GONDALIYA
(190050131019)
ABSTRACT
Data science is an interdisciplinary field that uses scientific methods,
processes, algorithms, and systems to extract insights and knowledge from
various forms of data. The field encompasses a broad range of techniques
and tools from statistics, mathematics, and computer science, including data
mining, machine learning, and artificial intelligence. The goal of data
science is to enable organizations to make data-driven decisions, improve
efficiency and productivity, and gain a competitive advantage in the
marketplace.
Data science involves several stages, including data collection,
data cleaning, data transformation, data modeling, and data visualization.
These stages aim to convert raw data into actionable insights that can be
used to drive business decisions. The field has a wide range of applications,
including but not limited to business intelligence, health informatics, finance,
social sciences, and more.
As the amount of data generated continues to grow
exponentially, data science has become an essential tool for organizations
across industries. The field's importance lies in its ability to provide valuable
insights that enable organizations to make informed decisions, improve
operations, and innovate new products and services. This abstract provides a
comprehensive overview of the scope and significance of data science in
today's world.
LIST OF FIGURES
Figure 1 BUSINESS ANALYSIS
Figure 2 BOX PLOT
Figure 3 SCATTER PLOT
Figure 4 ALGO. SELECTION
Figure 5 LINEAR REGRESSION GRAPH
Figure 6 LOGISTIC REGRESSION
Figure 7 K-MEANS CLUSTERING
Figure 8 DATASET
Figure 9 DATA FLOW DIAGRAM
Figure 10 MACHINE LEARNING PIPELINE
Figure 11 IMPORTING LIBRARY
Figure 12 PIE CHART
Figure 13 BAR GRAPH
Figure 14 ACCURACY TEST
Figure 15 CONFUSION MATRIX
Figure 16 TESTING AND PREDICTION
TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
CHAPTER-1 COMPANY PROFILE
1.1 ABOUT US
1.2 VISION
1.3 SPECIALTIES
CHAPTER-2 INTRODUCTION TO DATA SCIENCE
2.1 PYTHON
2.2 SPECTRUM OF BUSINESS ANALYSIS
2.3 APPLICATION OF DATA SCIENCE
2.4 PYTHON INTRODUCTION
CHAPTER-3 STATISTICS
3.1 DESCRIPTIVE STATISTICS
3.2 OUTLIERS
3.3 HISTOGRAMS
3.4 INFERENTIAL
3.5 HYPOTHESIS TESTING
3.6 T TEST
3.7 Z SCORE
3.8 CHI SQUARED TEST
CHAPTER-4 PREDICTIVE MODELING
4.1 STAGE OF PREDICTIVE MODELING
4.2 PROBLEM DEFINITION
4.3 HYPOTHESIS GENERATION
4.4 DATA EXTRACTION
4.5 DATA EXPLORATION AND TRANSFORMATION
4.6 UNIVARIATE AND BIVARIATE
4.7 GRAPHICAL METHOD
CHAPTER-5 MODEL BUILDING
5.1 STEPS IN MODEL BUILDING
5.2 ALGORITHM SELECTION
5.3 TYPES OF MACHINE LEARNING ALGORITHM
CHAPTER-6 DATA AND MODEL SELECTION
6.1 DATASET
6.2 ALGORITHMS
CHAPTER-7 VALIDATION
7.1 DATA FLOW DIAGRAM
7.2 MODULES DESCRIPTION
CHAPTER-8
8.1 IMPORTING LIBRARY AND DATASET
8.2 DATA CLEANING AND VISUALIZATION
8.3 DIVIDING DATASET INTO TRAINING AND TESTING SET
8.4 DATA NORMALIZATION
8.5 TESTING AND PREDICTION
CHAPTER-9 LEARNING OUTCOME
CONCLUSION
REFERENCE
CHAPTER-1 COMPANY PROFILE
1.1 About us:
Teachnook is at the forefront of innovation, using cutting-edge techniques to create
intelligent solutions that drive business success. We specialize in developing custom
machine learning algorithms that can extract insights from complex data sets and enable
businesses to make data-driven decisions. Our team of data scientists, engineers, and
developers has deep expertise in a wide range of machine learning techniques, including
deep learning, natural language processing, and computer vision. With our solutions,
businesses can optimize their operations, improve customer experiences, and gain a
competitive edge in the marketplace. Our company is committed to delivering
high-quality, scalable, and efficient machine learning solutions that meet our clients'
unique needs.
1.2 VISION:
To become the most trusted and preferred offshore IT solutions partner for
Startups, SMBs and Enterprises through innovation and technology leadership.
Understanding your ambitious vision, honing in on its essence, creating a
design strategy, and knowing how to technically execute it is what we do best.
Our promise? The integrity of your vision will be maintained and we'll enhance
it to best reach your target customers. With our primary focus on creating
amazing user experiences, we'll help you understand the tradeoffs, prioritize
features, and distill valuable functionality. It's an art form we care about getting
right.
1.3 SPECIALTIES:
Machine learning, Data science, Deep learning, Artificial Intelligence, Android
Development, iOS Development, Windows Development, Web Development,
IOT Development, Cross Platform Development, Mobile App Development,
Enterprise Solutions, Database Administration, UI Design and Development,
Database Handling, Web Services, Python, and R.
Gujarat Technological University
BITS VADODARA
CHAPTER-2 INTRODUCTION TO DATA SCIENCE
2.1 PYTHON:
Python is a computer programming language often used to build websites and
software, automate tasks, and conduct data analysis. Python is a general-purpose
language, meaning it can be used to create a variety of different programs and isn't
specialized for any specific problems.
2.2 SPECTRUM OF BUSINESS ANALYSIS:
Reporting / Management Information System
To track what is happening in the organization.
Detective Analysis
Asking questions based on the data we are seeing, such as: why did something happen?
Dashboard / Business Intelligence
Utopia of reporting. Every action in the business is reflected on the screen.
Predictive Modelling
Using past data to predict what will happen at a granular level.
Big Data
The stage where the complexity of handling data goes beyond traditional systems.
This can be caused by the volume, variety or velocity of the data. Specific tools are needed to
analyse data at this scale.
2.3 APPLICATION OF DATA SCIENCE:
• Recommendation System
Example: On Amazon, recommendations differ from user to user according to their past searches.
• Social Media
1. Recommendation Engine
2. Ad placement
3. Sentiment Analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products on e-commerce sites
1. Recommendation System
2. Past Data Searched
3. Discount Price Optimization
• How do Google and other search engines know which results are most relevant to our search query?
1. Apply ML and Data Science
2. Fraud Detection
2.4 PYTHON INTRODUCTION:
Python is an interpreted, high-level, general-purpose programming language. It has efficient
high-level data structures and a simple but effective approach to object-oriented programming.
Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal
language for scripting and rapid application development in many areas on most platforms.
Python for Data science:
Why Python?
1. Python is an open source language.
2. Syntax as simple as English.
3. Very large and collaborative developer community.
4. Extensive packages.
• UNDERSTANDING OPERATORS:
Theory of operators: Operators are symbolic representations of mathematical tasks.
• VARIABLES AND DATATYPES:
Variables are names bound to objects. Basic data types in Python include int (integer), float, bool (Boolean) and str (string).
• CONDITIONAL STATEMENTS:
If-else statements (single condition)
If-elif-else statements (multiple conditions)
• LOOPING CONSTRUCTS:
For loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be called as many times as needed.
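The constructs listed above can be seen together in one short sketch. The student names, marks and grade boundaries are hypothetical values chosen only for illustration.

```python
def grade(marks):
    """User-defined function: map a numeric mark to a letter grade."""
    if marks >= 75:        # if-elif-else handles multiple conditions
        return "A"
    elif marks >= 50:
        return "B"
    else:
        return "C"


# Variables are names bound to objects (here a dict of str -> int).
student_marks = {"Asha": 82, "Ravi": 61, "Meera": 43}  # hypothetical data

grades = {}
for name, marks in student_marks.items():  # for loop over the dictionary
    grades[name] = grade(marks)            # the same function is reused each time

print(grades)
print(len(student_marks))  # len() is a built-in function
```

The function grade() is defined once and called for every student, which is the reuse that functions are meant to provide.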
CHAPTER-3 STATISTICS
3.1 DESCRIPTIVE STATISTICS:
Mode
It is the value that occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")        # read data from the CSV file
print(data.head())                    # print the first five rows
mode_data = data['Subject'].mode()    # mode of the Subject column
print(mode_data)
Mean
It is the arithmetic average of the values in the data series.
import pandas as pd
data = pd.read_csv("mean.csv")            # read data from the CSV file
print(data.head())                        # print the first five rows
mean_data = data['Overallmarks'].mean()   # mean of the Overallmarks column
print(mean_data)
Median
It is the middle value of the sorted data set.
import pandas as pd
data = pd.read_csv("data.csv")                # read data from the CSV file
print(data.head())                            # print the first five rows
median_data = data['Overallmarks'].median()   # median of the Overallmarks column
print(median_data)
Types of variables
• Continuous - Takes continuous numeric values. Eg: marks
• Categorical - Takes discrete values. Eg: gender
• Ordinal - Ordered categorical variables. Eg: teacher feedback
3.2 OUTLIERS
Any value that falls far outside the typical range of the data is termed an outlier. Eg: 9700 instead of 97.
Reasons of Outliers
• Typos - Errors made during collection. Eg: adding an extra zero by mistake.
• Measurement Error - Outliers in data due to the measurement instrument or operator being faulty.
• Intentional Error - Errors which are induced intentionally. Eg: claiming a smaller amount of alcohol consumed than actual.
• Legit Outlier - Values which are not actually errors but are in the data for legitimate reasons. Eg: a CEO's salary might actually be high compared to other employees.
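One common way to flag candidate outliers is the interquartile range (IQR) rule. The report does not say which rule was used, so the rule, the 1.5 multiplier and the marks below are assumptions for illustration; the 9700 entry mirrors the typo example given earlier.

```python
import statistics

# Hypothetical marks; 9700 stands in for a typo of 97.
marks = [92, 95, 97, 98, 99, 101, 103, 9700]

# Quartiles from the standard library (default "exclusive" method).
q1, _, q3 = statistics.quantiles(marks, n=4)
iqr = q3 - q1

# IQR rule (an assumption, not the report's stated method):
# values more than 1.5 * IQR beyond the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [m for m in marks if m < lower or m > upper]

print(outliers)
```

A flagged value still needs a human decision: a typo should be corrected, while a legitimate outlier such as a CEO's salary may be kept.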
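3.3 HISTOGRAMS
Histograms depict the underlying frequency distribution of a set of discrete or continuous data that are
measured on an interval scale.
import pandas as pd
import matplotlib.pyplot as plt
histogram = pd.read_csv("histogram.csv")     # read data from the CSV file
# In a Jupyter notebook, the magic "%matplotlib inline" shows the plot inline.
plt.hist(x='Overall Marks', data=histogram)  # histogram of the Overall Marks column
plt.show()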
3.4 INFERENTIAL
Inferential statistics allow us to make inferences about the population from the sample data.
3.5 HYPOTHESIS TESTING
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting
data, and then examining what the data tells us about how to proceed. The hypothesis to be
tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against
an alternative hypothesis, which is given the symbol Ha.
3.6 T TEST
Used when we have only a sample and not the population statistics.
The sample standard deviation is used to estimate the population standard deviation.
The t test is more prone to errors, because we have only samples.
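As a minimal sketch of the idea, a one-sample t statistic can be computed by hand from the sample mean and the sample standard deviation. The marks and the hypothesised population mean of 65 are hypothetical; one_sample_t is an illustrative helper, not code from the project.

```python
import math
import statistics


def one_sample_t(sample, population_mean):
    """t = (sample mean - hypothesised mean) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    sample_std = statistics.stdev(sample)      # sample std (n - 1 denominator)
    standard_error = sample_std / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


marks = [72, 68, 75, 70, 65]        # hypothetical sample
t_stat = one_sample_t(marks, 65)    # H0: population mean is 65
print(round(t_stat, 3))
```

The statistic is then compared with a t distribution with n - 1 degrees of freedom to decide whether to reject H0.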
3.7 Z SCORE
The z score, or standard score, is the distance of the observed value from the mean,
expressed in number of standard deviations.
+Z: the value is above the mean.
-Z: the value is below the mean.
A distribution converted to z scores always keeps the same shape as the original
distribution.
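A short sketch of the conversion, using hypothetical marks and the population standard deviation (the report does not say whether the sample or population form was used):

```python
import statistics


def z_scores(values):
    """Convert each value to its distance from the mean in standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)   # population standard deviation (assumption)
    return [(v - mean) / std for v in values]


marks = [50, 60, 70, 80, 90]          # hypothetical data
scores = z_scores(marks)
print([round(z, 3) for z in scores])
```

Values below the mean get negative scores, values above it get positive scores, and the order and relative spacing of the data are unchanged.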
3.8 CHI SQUARED TEST
Used to test categorical variables.
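A minimal sketch of the chi-squared statistic for a contingency table of two categorical variables follows. The counts are hypothetical, and chi_squared is an illustrative helper that computes only the statistic (no continuity correction, no p-value).

```python
def chi_squared(observed):
    """Sum of (observed - expected)^2 / expected over all cells of the table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic


# Hypothetical 2x2 table: rows = gender, columns = bought / did not buy.
table = [[30, 10],
         [20, 40]]
chi2 = chi_squared(table)
print(round(chi2, 3))
```

The statistic is compared with a chi-squared distribution with (rows - 1) x (columns - 1) degrees of freedom; a table whose rows have identical proportions gives a statistic of 0.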
CHAPTER-4 PREDICTIVE MODELING
Predictive modelling makes use of past data and its attributes to predict the future.
Eg:
Past: horror movies watched
Future: unwatched horror movies
Predicting stock price movement
1. Analysing past stock prices.
2. Analysing similar stocks.
3. Future stock price required.
Types
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the
training dataset) to make predictions. The training dataset includes input data and
response values.
• Regression - the response takes continuous values. Eg: marks
• Classification - the response takes a fixed set of class labels, often just two. Eg: cancer prediction is
either 0 or 1.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither
classified nor labelled. Here the task of the machine is to group unsorted information
according to similarities, patterns and differences without any prior training on the
data.
• Clustering: A clustering problem is where you want to discover the
inherent groupings in the data, such as grouping customers by purchasing
behaviour.
• Association: An association rule learning problem is where you want to
discover rules that describe large portions of your data, such as people that
buy X also tend to buy Y.
4.1 STAGE OF PREDICTIVE MODELING:
1. Problem Definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Deployment/Implementation
4.2 PROBLEM DEFINITION:
Identify the right problem statement and, ideally, formulate the problem mathematically.
4.3 HYPOTHESIS GENERATION:
List all possible variables which might influence the problem objective. These
variables should be free from personal bias and preferences.
The quality of the model is directly proportional to the quality of the hypotheses.
4.4 DATA EXTRACTION:
Collect data from different sources and combine it for exploration and model
building. While looking at the data we might come across new hypotheses.
4.5 DATA EXPLORATION AND TRANSFORMATION:
Once the data has been retrieved from the various sources, it is explored and
transformed before further processing.
Steps of Data Exploration
• Reading the data. Eg: from a CSV file
• Variable identification
• Univariate Analysis
• Bivariate Analysis
• Missing value treatment
• Outlier treatment
• Variable Transformation
Variable Identification
It is the process of identifying whether a variable is
1. Independent or dependent variable
2. Continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques apply to categorical and continuous data.
Categorical variable: stored as object.
Continuous variable: stored as int or float.
4.6 UNIVARIATE AND BIVARIATE:
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• Used when two variables are studied together for their empirical relationship.
• Used when you want to see whether the two variables are associated with each other.
• It helps in prediction and in detecting anomalies.
Missing Value Treatment
Reasons of missing value
1. Non-response. Eg: when you collect data on people's income and many
choose not to answer.
2. Error in data collection. Eg: faulty data
3. Error in data reading.
Different methods to deal with missing values
1. Imputation
Continuous: impute with the mean, the median or a regression model.
Categorical: impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. However, this leads to loss of data.
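Both approaches can be sketched with pandas. The small table is hypothetical, and the choice of the median for the continuous column is only one of the options listed above.

```python
import pandas as pd

# Hypothetical data with one missing value in each column.
raw = pd.DataFrame({
    "income": [30000.0, None, 50000.0, 40000.0],   # continuous
    "gender": ["F", "M", None, "M"],               # categorical
})

# 1. Imputation: median for the continuous column, mode for the categorical one.
imputed = raw.copy()
imputed["income"] = imputed["income"].fillna(imputed["income"].median())
imputed["gender"] = imputed["gender"].fillna(imputed["gender"].mode()[0])

# 2. Deletion: drop every row that contains a missing value.
deleted = raw.dropna()

print(imputed)
print(len(raw), "rows before deletion,", len(deleted), "rows after")
```

Imputation keeps all four rows, while row-wise deletion keeps only the two complete rows, which shows the loss of data mentioned above.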
Outlier Treatment
Reasons of Outliers
1. Data entry errors
2. Measurement errors
3. Processing errors
4. Change in underlying population
4.7 GRAPHICAL METHOD
• Box Plot
Figure 2. Box plot
• Scatter Plot
Figure 3. Scatter plot
CHAPTER-5 MODEL BUILDING
5.1 STEPS IN MODEL BUILDING
Model building is the process of creating a mathematical model for estimating / predicting the future based
on past data.
Eg: A retailer wants to know the default behaviour of its credit card customers. They want to
predict the probability of default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume every customer starts with a 10% default rate.
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes from past
information.
A customer with a volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over recent years has a low chance of default (closer to
0).
Steps in Model Building
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring
5.2 ALGORITHM SELECTION
Figure 4. Algo. Selection
Eg: Predict whether the customer will buy a product or not.
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest
Training Model
It is the process of learning the relationship / correlation between the independent and dependent
variables.
We use the dependent variable of the train data set to predict/estimate.
Dataset
• Train
Past data (known dependent variable).
Used to train the model.
• Test
Future data (unknown dependent variable).
Used to score predictions
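When only past data is available, a portion of it is usually held back to play the role of the test set. The helper below is a minimal standard-library sketch; split_rows, the 25% test fraction and the seed are illustrative assumptions, not the project's actual code.

```python
import random


def split_rows(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]                        # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)     # fixed seed for a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))                        # stand-in for 20 customer records
train, test = split_rows(rows)
print(len(train), "training rows,", len(test), "test rows")
```

The model is fitted on train only, and test is used afterwards to score its predictions on records it has not seen.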
5.3 TYPES OF MACHINE LEARNING ALGORITHM:
Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent
variable and a given set of independent variables.
It is assumed that the two variables are linearly related. Hence, we try to find a linear
function that predicts the response value (y) as accurately as possible as a function of the
feature or independent variable (x).
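The equation of the regression line is represented as:
h(x_i) = \beta_0 + \beta_1 x_i
where \beta_0 is the intercept and \beta_1 is the slope of the line.
The squared error or cost function, J, is:
J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} (h(x_i) - y_i)^2
Figure 5. Linear regression graph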
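For a single feature, the line that minimises the squared error has a closed-form solution. The sketch below fits an intercept b0 and slope b1 and evaluates the cost as the sum of squared errors divided by 2n; that scaling is one common convention and is an assumption here, as are the data points and helper names.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    ss_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = ss_xy / ss_xx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def cost(xs, ys, b0, b1):
    """Squared-error cost: sum of squared residuals divided by 2n."""
    n = len(xs)
    return sum((b0 + b1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)


xs = [0, 1, 2, 3, 4]          # hypothetical feature values
ys = [1, 3, 5, 7, 9]          # generated from y = 1 + 2x
b0, b1 = fit_line(xs, ys)
print(b0, b1, cost(xs, ys, b0, b1))
```

Because the example points lie exactly on a line, the fitted cost is zero; with real data the cost measures how far the points scatter around the line.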
Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to
model a binary dependent variable, although many more complex extensions exist.
Figure 6. logistic regression
C = -(y log(p) + (1 - y) log(1 - p))
where y is the true label (0 or 1) and p is the predicted probability that y = 1.
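A small sketch of the two pieces involved: the logistic (sigmoid) function that turns a score into a probability, and the log-loss cost for a single example. The input values are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, p):
    """Cost for one example: y is the true label (0 or 1), p is P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


print(sigmoid(0))            # a score of 0 gives probability 0.5
print(log_loss(1, 0.9))      # confident and correct: small cost
print(log_loss(1, 0.1))      # confident and wrong: large cost
```

The cost grows sharply when the model assigns a low probability to the true class, which is what pushes the fitted probabilities towards the correct labels.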
K-Means Clustering (Unsupervised learning)
K-means clustering is a type of unsupervised learning, which is used when you have
unlabelled data (i.e., data without defined categories or groups). The goal of this
algorithm is to find groups in the data, with the number of groups represented by the
variable K. The algorithm works iteratively to assign each data point to one of K groups
based on the features that are provided. Data points are clustered based on feature
similarity.
Figure 7. K-means clustering
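The iterative assignment described above can be sketched in a few lines of plain Python. The points, the starting centroids and the fixed number of iterations are assumptions for illustration; in practice the starting centroids are usually chosen at random and a library implementation would be used.

```python
def k_means(points, centroids, iterations=10):
    """Assign points to the nearest centroid, then move each centroid to its cluster mean."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Squared Euclidean distance from p to every centroid.
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical customers described by two features, K = 2.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

Each resulting cluster can be read as one customer segment, with its centroid describing the typical member of that segment.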
CHAPTER-6 DATA AND MODEL SELECTION
The internship is a platform where trainees are assigned specific tasks.
In the initial days of the internship, I was trained on the following:
• Python Programming
• Machine Learning Algorithms
6.1 DATA SET
This section describes, in brief, the data that has been used for the project.
Data on e-commerce users' purchases was used in this project. Most of the
data was extracted from the public website Kaggle (Kaggle.com); data regarding
reviews and links was obtained from e-commerce sites. Data from these sources
was integrated to form a staging data-set. The aim is to anticipate the purchases
that a new customer will make during the following year, starting from their
first purchase, by assigning them to an appropriate cluster/segment.
Figure 8 shows the different types of reviews present in the data-set.
Figure 8. Dataset
6.2 ALGORITHMS
• Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It
performs a regression task. Regression models a target prediction value based on
independent variables. It is mostly used for finding out the relationship between
variables and for forecasting. Regression models differ in the kind of
relationship between dependent and independent variables they consider, and in
the number of independent variables used.
• SVC
The Linear Support Vector Classifier (SVC) method applies a linear kernel
function to perform classification and it performs well with a large number of
samples. Compared with the SVC model, the Linear SVC has additional
parameters such as the penalty normalization, which applies 'L1' or 'L2', and the loss
function.
• Pipeline
A pipeline is a means of automating the machine learning workflow by enabling
data to be transformed and correlated into a model that can then be analyzed
to achieve outputs. This type of ML pipeline makes the process of inputting
data into the ML model fully automated.
Another type of ML pipeline is the art of splitting up your machine learning
workflows into independent, reusable, modular parts that can then be
pipelined together to create models. This type of ML pipeline makes building
models more efficient and simplified, cutting out redundant work.
This goes hand-in-hand with the recent push for microservices architectures,
branching off the main idea that by splitting your application into basic and
siloed parts you can build more powerful software over time. Operating
systems like Linux and Unix are also founded on this principle. Basic
functions like ‘grep’ and ‘cat’ can create impressive functions when they are
pipelined together.
During my three-month internship I went through three phases:
• Training Phase
• Designing and Development Phase
• Testing and Maintenance Phase
CHAPTER-7 VALIDATION
7.1 DATA FLOW DIAGRAM
Figure 9. Data flow diagram
1. Exploratory Data Analysis in Machine Learning
2. Data Visualization
3. Training and Testing
4. Train and Evaluate Linear Support Vector Classifier
5. Train and Evaluate pipeline in machine learning
7.2 MODULES DESCRIPTION
Exploratory Data Analysis: Performed initial investigations on the data to
discover patterns, spot anomalies, test hypotheses and check assumptions
with the help of summary statistics and graphical representations.
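A minimal sketch of such an initial investigation with pandas is shown below. The customer table, its column names and its values are hypothetical and made up for illustration; the interquartile-range rule used to flag anomalies is one common choice, not necessarily the one used in the project.

```python
import pandas as pd

# Hypothetical customer table (all names and values are made up).
df = pd.DataFrame({
    "age": [23, 35, 41, None, 29, 35],
    "annual_income": [32000, 58000, 61000, 45000, 1000000, 58000],
    "segment": ["A", "B", "B", "A", "C", "B"],
})

summary = df.describe()                      # summary statistics per numeric column
missing = df.isna().sum()                    # missing values per column
segment_counts = df["segment"].value_counts()  # how common each category is

# Flag anomalies with the interquartile-range (IQR) rule.
q1, q3 = df["annual_income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[df["annual_income"] > q3 + 1.5 * iqr]

print(summary)
print(missing)
print(segment_counts)
print(outliers)
```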
Data Visualization: Using data visualization, I summarized the data with graphs,
pictures and maps so that it is easier to process and understand. Data
visualization is useful for both small and large data sets, but especially for
large ones, where it is impossible to see all of the data, let alone process
and understand it manually.
Training and Testing: In this project, the dataset is split into two subsets. The
first subset is the training data: a portion of the actual dataset that is fed
into the machine learning model so that it can discover and learn patterns. In
this way, it trains the model. The other subset is the testing data, which is
held back to check how well the trained model performs on data it has not seen.
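The split can be sketched with scikit-learn's `train_test_split`. The toy arrays and the 80/20 ratio are assumptions for illustration; the report does not state the ratio that was actually used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 10 samples, 2 features, 2 classes.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold back 20% of the rows for testing; stratify keeps the class balance
# the same in both subsets, and random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```

The model is fitted only on `X_train` and `y_train`; `X_test` and `y_test` are used afterwards to measure its performance.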
Train and Evaluate Linear Support Vector Classifier (SVC): The Linear Support
Vector Classifier (SVC) method applies a linear kernel function to perform
classification, and it performs well with a large number of samples. Compared
with the general SVC model, Linear SVC has additional parameters: the penalty
norm, which can be 'L1' or 'L2', and the loss function.
Train and Evaluate Pipeline in machine learning:
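As a minimal sketch of what training and evaluating such a pipeline can look like, the example below fits a scaler plus Linear SVC on a training split and reports accuracy and a confusion matrix on the test split. The synthetic data, the 75/25 split and the choice of steps are assumptions for illustration and do not reproduce the project's code or results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic two-class data (assumed for illustration).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Normalization and classification chained into one estimator.
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000, random_state=1))
model.fit(X_train, y_train)  # the scaler is fitted on the training data only

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # fraction of correct predictions
matrix = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted

print(accuracy)
print(matrix)
```

The diagonal of the confusion matrix counts correct predictions, so the accuracy equals the sum of the diagonal divided by the number of test samples.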
Figure 10. Machine learning pipeline
CHAPTER-8 END TO END MACHINE LEARNING MODEL
8.1 IMPORTING LIBRARY AND RAW DATA
Figure 11. Importing library
8.2 DATA CLEANING AND VISUALIZATION
Figure 12. Pie chart
Figure 13. Bar graph
8.3 DIVIDING DATASET INTO TRAINING AND TESTING SET
Figure 14. Accuracy test
8.4 DATA NORMALIZATION
Figure 15. Confusion matrix
8.5 TESTING AND PREDICTION
9. LEARNING OUTCOME
After completing the training, I am able to:
• Develop relevant programming abilities.
• Demonstrate proficiency with statistical analysis of data.
• Develop the skill to build and assess data-based models.
• Execute statistical analysis with professional statistical software.
• Demonstrate skill in data management.
• Apply data science concepts and methods to solve problems in real-world
contexts and communicate these solutions effectively.
CONCLUSION
Data science has become an essential tool for businesses and organizations across
industries, enabling them to make data-driven decisions, improve efficiency and
productivity, and gain a competitive advantage. The field encompasses a broad range of
techniques and tools from statistics, mathematics, and computer science, including data
mining, machine learning, and artificial intelligence.
Data science has numerous applications, including but not limited to business
intelligence, health informatics, finance, and social sciences. It has a significant impact on
the economy, public policy, and society as a whole. Additionally, data science is essential
in tackling some of the world's biggest challenges, such as climate change, healthcare,
and cybersecurity.
The importance of data science is only expected to grow in the coming years as the
amount of data generated continues to increase. As such, there is a significant demand for
data scientists who can extract insights and knowledge from the vast amounts of data
available. Overall, data science is a fascinating and challenging field with vast potential
for innovation, and it offers an exciting career path for those with a passion for data and
analytics.