DATA SCIENCE
A SUMMER INTERNSHIP REPORT
Submitted by
HEM GONDALIYA
190050131019
In partial fulfilment for the award of the degree of
BACHELOR OF ENGINEERING
in
COMPUTER SCIENCE AND ENGINEERING
BABARIA INSTITUTE OF TECHNOLOGY
FEB-APRIL 2023

BABARIA INSTITUTE OF TECHNOLOGY
On National Highway 8, Varnama, Vadodara

CERTIFICATE
This is to certify that the project report submitted along with the project entitled DATA SCIENCE has been carried out by HEM GONDALIYA (190050131019) under my guidance in partial fulfilment for the degree of Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING, 8th Semester of Gujarat Technological University, Ahmedabad, during the academic year 2022-23.

Mrs. Apoorva Shah, Internal Guide
Dr. Nitesh Sureja, Head of the Department

TO WHOM IT MAY CONCERN
This is to certify that Hem Gondaliya, student of Babaria Institute of Technology, has successfully completed the project titled “customer segmentation” using Python in our company, with reference to the complete fulfilment of the requirements of the Degree of Engineering. He took training in our company from 1st February 2023 to 30th April 2023. During his stay at the company, he was found sincere and hardworking. During the period of his internship program with us, he was exposed to different processes and was found diligent, hardworking and inquisitive. We wish him the very best in all his future endeavors.

With Best Regards,
Divya Dharani
HR, Teachnook Technologies Pvt. Ltd.
Bengaluru (Karnataka - INDIA)
Teachnook, No. 592, 3rd Block, Koramangala, Bengaluru, Karnataka 560068
Info_hr@teachnook.com
Mob. +91-63600 93009

TEACHNOOK
592, 3rd Block, Koramangala, Bengaluru, Karnataka 560068
Re: Internship Acceptance Letter
Dear Hem Gondaliya,
We are pleased to offer you, Mr. Hem Gondaliya, Student of B.E.
(CSE) Department, Babaria Institute of Technology, Vadodara, an internship in Introduction to Python with Data Science with our company Teachnook, in collaboration with Wissenaire (IIT Bhubaneswar). This is an Internship and Training Program. Our goal is for you to learn more about the domain and to gain real industrial knowledge and experience.

As we discussed, your internship is expected to last for 3 months, from February 2023 to April 2023. [However, at the sole discretion of the Company, the duration of the internship may be extended or shortened with or without advance notice. During the Internship no leaves will be provided.]

As an intern, you will not be a Company employee. Therefore, you will not receive a salary, wages, or other compensation. In addition, you will not be eligible for any benefits that the Company offers its employees, including, but not limited to, health benefits, holiday pay, vacation pay, sick leave, and retirement benefits. You understand that participation in the internship program is not an offer of employment, and successful completion of the internship does not entitle you to employment with the Company.

During your internship, you may have access to confidential, proprietary, and/or trade secret information belonging to the Company. You agree that you will keep all this information strictly confidential and refrain from using it for your own purposes or from disclosing it to anyone outside the Company. In addition, you agree that, upon conclusion of the internship, you will immediately return to the Company all its property, equipment, and documents, including electronically stored information.

By accepting this offer, you agree that you will follow all of the Company's policies that apply to nonemployee interns, including the Company's anti-harassment policy. This letter constitutes the complete understanding between you and the Company regarding your internship and supersedes all prior discussions or agreements.
This letter may only be modified by a written agreement signed by both of us. I hope that your internship with the Company will be successful and rewarding. Please indicate your acceptance of this offer by signing below and returning it to our company desk. If you have any questions, please do not hesitate to contact us.

Very truly yours,
Saumya Tiwari
HR - Manager
TEACHNOOK

I accept the internship with the Company on the terms and conditions set out in this letter.
Date: 04/01/2023
Signature

JOINING LETTER

COMPLETION CERTIFICATE

BABARIA INSTITUTE OF TECHNOLOGY
On National Highway 8, Varnama, Vadodara

DECLARATION
I hereby declare that the Internship report submitted along with the Project entitled DATA SCIENCE, submitted in partial fulfilment for the degree of Bachelor of Engineering in COMPUTER SCIENCE AND ENGINEERING to Gujarat Technological University, Ahmedabad, is a bonafide record of original project work carried out by me at Brainy Beams Pvt Ltd. under the supervision of Mrs. Apoorva Shah, and that no part of this report has been directly copied from any students’ reports or taken from any other source without providing due reference.

Name of Student: HEM GONDALIYA
Project id: 300622

ACKNOWLEDGEMENT
I would sincerely like to thank my internal faculty mentor, Mrs. Apoorva Shah, for guiding me and giving me the opportunity to pursue a Data Science internship at Teachnook, which helped me brush up my skills to match industry requirements. I would like to extend my gratitude to Teachnook, who saw the right candidate in me for this internship and gave me the opportunity to be a part of data science and machine learning. I also appreciate the guidance given by the developer at Teachnook, Jinal Modi, as well as the internship panels, who advised and guided me at every moment of the internship.
HEM GONDALIYA (190050131019)

ABSTRACT
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract insights and knowledge from various forms of data. The field encompasses a broad range of techniques and tools from statistics, mathematics, and computer science, including data mining, machine learning, and artificial intelligence. The goal of data science is to enable organizations to make data-driven decisions, improve efficiency and productivity, and gain a competitive advantage in the marketplace.

Data science involves several stages, including data collection, data cleaning, data transformation, data modeling, and data visualization. These stages aim to convert raw data into actionable insights that can be used to drive business decisions. The field has a wide range of applications, including but not limited to business intelligence, health informatics, finance, and the social sciences.

As the amount of data generated continues to grow exponentially, data science has become an essential tool for organizations across industries. The field's importance lies in its ability to provide valuable insights that enable organizations to make informed decisions, improve operations, and develop new products and services. This abstract provides an overview of the scope and significance of data science in today's world.

LIST OF FIGURES
Figure 1 BUSINESS ANALYSIS
Figure 2 BOX PLOT
Figure 3 SCATTER PLOT
Figure 4 ALGO. SELECTION
Figure 5 LINEAR REGRESSION GRAPH
Figure 6 LOGISTIC REGRESSION
Figure 7 K-MEANS CLUSTERING
Figure 8 DATASET
Figure 9 DATA FLOW DIAGRAM
Figure 10 MACHINE LEARNING PIPELINE
Figure 11 IMPORTING LIBRARY
Figure 12 PIE CHART
Figure 13 BAR GRAPH
Figure 14 ACCURACY TEST
Figure 15 CONFUSION MATRIX
Figure 16 TESTING AND PREDICTION

TABLE OF CONTENTS
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
CHAPTER-1 COMPANY PROFILE
    1.1 ABOUT US
    1.2 VISION
    1.3 SPECIALTIES
CHAPTER-2 INTRODUCTION TO DATA SCIENCE
    2.1 PYTHON
    2.2 SPECTRUM OF BUSINESS ANALYSIS
    2.3 APPLICATION OF DATA SCIENCE
    2.4 PYTHON INTRODUCTION
CHAPTER-3 STATISTICS
    3.1 DESCRIPTIVE STATISTICS
    3.2 OUTLIERS
    3.3 HISTOGRAMS
    3.4 INFERENTIAL STATISTICS
    3.5 HYPOTHESIS TESTING
    3.6 T TEST
    3.7 Z SCORE
    3.8 CHI SQUARED TEST
CHAPTER-4 PREDICTIVE MODELING
    4.1 STAGES OF PREDICTIVE MODELING
    4.2 PROBLEM DEFINITION
    4.3 HYPOTHESIS GENERATION
    4.4 DATA EXTRACTION
    4.5 DATA EXPLORATION AND TRANSFORMATION
    4.6 UNIVARIATE AND BIVARIATE
    4.7 GRAPHICAL METHOD
CHAPTER-5 MODEL BUILDING
    5.1 STEPS IN MODEL BUILDING
    5.2 ALGORITHM SELECTION
    5.3 TYPES OF MACHINE LEARNING ALGORITHM
CHAPTER-6 DATA AND MODEL SELECTION
    6.1 DATASET
    6.2 ALGORITHMS
CHAPTER-7 VALIDATION
    7.1 DATA FLOW DIAGRAM
    7.2 MODULES DESCRIPTION
CHAPTER-8 END TO END MACHINE LEARNING MODEL
    8.1 IMPORTING LIBRARY AND DATASET
    8.2 DATA CLEANING AND VISUALIZATION
    8.3 DIVIDING DATASET INTO TRAINING AND TESTING SET
    8.4 DATA NORMALIZATION
    8.5 TESTING AND PREDICTION
CHAPTER-9 LEARNING OUTCOME
CONCLUSION
REFERENCE

CHAPTER-1 COMPANY PROFILE

1.1 ABOUT US:
Teachnook is at the forefront of innovation, using cutting-edge techniques to create intelligent solutions that drive business success. We specialize in developing custom machine learning algorithms that can extract insights from complex data sets and enable businesses to make data-driven decisions. Our team of data scientists, engineers, and developers has deep expertise in a wide range of machine learning techniques, including deep learning, natural language processing, and computer vision. With our solutions, businesses can optimize their operations, improve customer experiences, and gain a competitive edge in the marketplace. Our company is committed to delivering high-quality, scalable, and efficient machine learning solutions that meet our clients' unique needs.

1.2 VISION:
To become the most trusted and preferred offshore IT solutions partner for Startups, SMBs and Enterprises through innovation and technology leadership. Understanding your ambitious vision, honing in on its essence, creating a design strategy, and knowing how to technically execute it is what we do best. Our promise?
The integrity of your vision will be maintained, and we'll enhance it to best reach your target customers. With our primary focus on creating amazing user experiences, we'll help you understand the tradeoffs, prioritize features, and distill valuable functionality. It's an art form we care about getting right.

1.3 SPECIALTIES:
Machine Learning, Data Science, Deep Learning, Artificial Intelligence, Android Development, iOS Development, Windows Development, Web Development, IoT Development, Cross Platform Development, Mobile App Development, Enterprise Solutions, Database Administration, UI Design and Development, Database Handling, Web Services, Python, and R.

Gujarat Technological University, BITS VADODARA

CHAPTER-2 INTRODUCTION TO DATA SCIENCE

2.1 PYTHON:
Python is a computer programming language often used to build websites and software, automate tasks, and conduct data analysis. Python is a general-purpose language, meaning it can be used to create a variety of different programs and isn't specialized for any specific problem.

2.2 SPECTRUM OF BUSINESS ANALYSIS:
• Reporting / Management Information System: to track what is happening in the organization.
• Detective Analysis: asking questions based on the data we are seeing, such as why something happened.
• Dashboard / Business Intelligence: the utopia of reporting, where every action in the business is reflected on screen.
• Predictive Modelling: using past data to predict what will happen at a granular level.
• Big Data: the stage where the complexity of handling data goes beyond traditional systems, because of the volume, variety or velocity of the data. Specific tools are used to analyse data at such scale.

2.3 APPLICATION OF DATA SCIENCE:
• Recommendation systems. Eg- On Amazon, recommendations differ from user to user according to their past searches.
• Social media
  1. Recommendation engine
  2. Ad placement
  3. Sentiment analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products on e-commerce sites
  1. Recommendation system
  2. Past data searched
  3. Discount price optimization
• How Google and other search engines decide which results are most relevant to a search query
  1. Apply ML and data science
  2. Fraud detection

2.4 PYTHON INTRODUCTION:
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Python for Data Science: Why Python?
1. Python is an open source language.
2. Its syntax is close to plain English.
3. It has a very large and collaborative developer community.
4. It offers extensive packages.

• UNDERSTANDING OPERATORS: Operators are symbols that represent mathematical and logical operations.
• VARIABLES AND DATATYPES: Variables are names bound to objects. Basic data types in Python include int (integer), float, bool (Boolean) and str (string).
• CONDITIONAL STATEMENTS: if-else statements (single condition) and if-elif-else statements (multiple conditions).
• LOOPING CONSTRUCTS: for loop.
• FUNCTIONS: Functions are reusable pieces of code, created to solve a specific problem. There are two types: built-in functions and user-defined functions. Once defined, a function can be called as many times as needed.

CHAPTER-3 STATISTICS

3.1 DESCRIPTIVE STATISTICS:
Mode
It is the value which occurs most frequently in the data series. It is robust and is generally not affected much by the addition of a couple of new values.
Code
    import pandas as pd
    data = pd.read_csv("Mode.csv")        # reads data from the CSV file
    data.head()                           # returns the first five rows
    mode_data = data['Subject'].mode()    # mode of the Subject column
    print(mode_data)

Mean
The arithmetic average of the values in the data series.
    import pandas as pd
    data = pd.read_csv("mean.csv")        # reads data from the CSV file
    data.head()                           # returns the first five rows
    mean_data = data['Overallmarks'].mean()    # mean of the Overallmarks column
    print(mean_data)

Median
The middle value of the data set once the values are sorted.
    import pandas as pd
    data = pd.read_csv("data.csv")        # reads data from the CSV file
    data.head()                           # returns the first five rows
    median_data = data['Overallmarks'].median()    # median of the Overallmarks column
    print(median_data)

Types of variables
• Continuous: takes continuous numeric values. Eg- marks
• Categorical: takes discrete values. Eg- gender
• Ordinal: ordered categorical variables. Eg- teacher feedback

3.2 OUTLIERS
Any value which falls far outside the typical range of the data is termed an outlier. Eg- 9700 instead of 97.

Reasons for outliers
• Typos: made during collection. Eg- adding an extra zero by mistake.
• Measurement error: outliers caused by a faulty measuring instrument or operator.
• Intentional error: errors which are introduced intentionally. Eg- claiming a smaller amount of alcohol consumed than actual.
• Legitimate outlier: values which are not actually errors but appear in the data for legitimate reasons. Eg- a CEO’s salary might actually be high compared to other employees.

3.3 HISTOGRAMS
Histograms depict the underlying frequency distribution of a set of discrete or continuous data measured on an interval scale.
    import pandas as pd
    import matplotlib.pyplot as plt
    # Jupyter notebook magic; omit this line when running as a plain script
    %matplotlib inline
    histogram = pd.read_csv("histogram.csv")
    plt.hist(x='Overall Marks', data=histogram)
    plt.show()

3.4 INFERENTIAL STATISTICS
Inferential statistics allows us to make inferences about the population from the sample data.
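As a minimal sketch of the sample-to-population idea in section 3.4, the snippet below estimates a population mean from a small sample and attaches an approximate 95% confidence interval. The marks are hypothetical values made up for illustration (they are not from the project data), and the normal critical value 1.96 is an assumption of convenience; for a sample this small, a t critical value would usually be more appropriate.

```python
import math
import statistics

# Hypothetical sample of overall marks (illustrative values only)
sample_marks = [62, 71, 55, 80, 68, 74, 59, 66, 77, 70]

n = len(sample_marks)
sample_mean = statistics.mean(sample_marks)
sample_sd = statistics.stdev(sample_marks)   # sample standard deviation (n - 1 denominator)
standard_error = sample_sd / math.sqrt(n)    # estimated spread of the sample mean

# Assumption: normal approximation, 1.96 is the critical value for about 95% confidence
z_critical = 1.96
ci_low = sample_mean - z_critical * standard_error
ci_high = sample_mean + z_critical * standard_error

print(f"sample mean = {sample_mean:.2f}")
print(f"approx. 95% CI for the population mean = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval widens when the sample is small or highly variable, which is the practical meaning of being less certain about the population.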
3.5 HYPOTHESIS TESTING
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then examining what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha.

3.6 T TEST
Used when we have only a sample and not the population statistics. The sample standard deviation is used to estimate the population standard deviation. Because it relies on a sample alone, the t test carries more uncertainty than a test based on known population parameters.

3.7 Z SCORE
The z score (standard score) is the distance, in number of standard deviations, between an observed value and the mean.
+Z: the value is above the mean.
-Z: the value is below the mean.
Converting a distribution to z scores does not change its shape.

3.8 CHI SQUARED TEST
Used to test categorical variables.

CHAPTER-4 PREDICTIVE MODELING
Predictive modeling makes use of past data and attributes to predict the future.
Eg-
• Past: horror movies. Future: unwatched horror movies.
• Predicting stock price movement
  1. Analysing past stock prices.
  2. Analysing similar stocks.
  3. Future stock price required.

Types
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values.
• Regression: the target takes continuous values. Eg- marks
• Classification: the target takes discrete class values. Eg- cancer prediction is either 0 or 1.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor labelled. Here the task of the machine is to group unsorted information according to similarities, patterns and differences without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent groupings in the data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules that describe large portions of your data, such as people that buy X also tend to buy Y.

4.1 STAGES OF PREDICTIVE MODELING:
1. Problem Definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation

4.2 PROBLEM DEFINITION:
Identify the right problem statement and, ideally, formulate the problem mathematically.

4.3 HYPOTHESIS GENERATION:
List all possible variables which might influence the problem objective. These variables should be free from personal bias and preferences. The quality of the model is directly proportional to the quality of the hypotheses.

4.4 DATA EXTRACTION:
Collect data from different sources and combine it for exploration and model building. While looking at the data we might come across new hypotheses.

4.5 DATA EXPLORATION AND TRANSFORMATION:
Once the data has been retrieved from the various sources, it is explored and transformed for further processing.
Steps of Data Exploration
• Reading the data. Eg- from a csv file
• Variable identification
• Univariate analysis
• Bivariate analysis
• Missing value treatment
• Outlier treatment
• Variable transformation

Variable Identification
It is the process of identifying whether a variable is
1. an independent or dependent variable
2. a continuous or categorical variable
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques apply to categorical and continuous data.
Categorical variable- Stored as object.
Continuous variable- Stored as int or float.

4.6 UNIVARIATE AND BIVARIATE:
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• Two variables are studied together for their empirical relationship.
• Used to see whether the two variables are associated with each other.
• It helps in prediction and in detecting anomalies.

Missing Value Treatment
Reasons for missing values
1. Non-response. Eg- when you collect data on people’s income and many choose not to answer.
2. Error in data collection. Eg- faulty data.
3. Error in data reading.
Different methods to deal with missing values
1. Imputation
   Continuous: impute with the mean, the median or a regression model.
   Categorical: impute with the mode or a classification model.
2. Deletion
   Row-wise or column-wise deletion. However, it leads to loss of data.

Outlier Treatment
Reasons for outliers
1. Data entry errors
2. Measurement errors
3. Processing errors
4. Change in underlying population

4.7 GRAPHICAL METHOD
• Box Plot
Figure 2. Box plot
• Scatter Plot
Figure 3. Scatter plot

CHAPTER-5 MODEL BUILDING

5.1 STEPS IN MODEL BUILDING
Model building is the process of creating a mathematical model for estimating or predicting the future based on past data.
Eg- A retailer wants to know the default behaviour of its credit card customers. It wants to predict the probability of default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume every customer starts with a 10% default rate: probability of default for each customer in the next 3 months = 0.1.
The model moves this probability towards one of the extremes based on attributes from past information. A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over recent years has a low chance of default (closer to 0).

Steps in Model Building
1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

5.2 ALGORITHM SELECTION
Figure 4. Algo. selection
Eg- Predict whether the customer will buy the product or not.
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between independent and dependent variables. We use the dependent variable of the train data set to predict/estimate.
Dataset
• Train: past data (known dependent variable). Used to train the model.
• Test: future data (unknown dependent variable). Used to score predictions.

5.3 TYPES OF MACHINE LEARNING ALGORITHM:
Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent variable and a given set of independent variables. It is assumed that the two variables are linearly related. Hence, we try to find a linear function that predicts the response value (y) as accurately as possible as a function of the feature or independent variable (x).
The equation of the regression line is represented as:
$$\hat{y} = \beta_0 + \beta_1 x$$
The squared error or cost function, J, is:
$$J(\beta_0, \beta_1) = \frac{1}{2n} \sum_{i=1}^{n} \left( \hat{y}_i - y_i \right)^2$$
Figure 5. Linear regression graph

Logistic Regression
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist.
Figure 6. Logistic regression
$$C = -\left[ y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \right]$$
where y is the true label (0 or 1) and \hat{y} is the predicted probability.

K-Means Clustering (Unsupervised learning)
K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data (i.e., data without defined categories or groups).
The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
Figure 7. K-means clustering

CHAPTER-6 DATA AND MODEL SELECTION
The internship is a platform where trainees are assigned specific tasks. In the initial days of the internship, I was trained on the following:
• Python Programming
• Machine Learning Algorithms

6.1 DATASET
This section briefly describes the data used for the research. Data on e-commerce users’ purchases was used in this project. The major part of the data was extracted from the public website Kaggle (Kaggle.com), and data regarding reviews and links was obtained from e-commerce sites. Data from these sources was integrated to form a staging data-set. The aim is to anticipate the purchases that a new customer will make during the following year, starting from their first purchase, by assigning them to an appropriate cluster/segment. The table below shows the different types of reviews present in the data-set.
Figure 8. Dataset

6.2 ALGORITHMS
Linear Regression
Linear regression is a machine learning algorithm based on supervised learning. It performs a regression task: it models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between dependent and independent variables, and in the number of independent variables used.
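To make the regression idea concrete, here is a minimal sketch of simple linear regression with one independent variable, written in plain Python so it runs without third-party packages. It is an illustration only, not the code used in the project: the data values and the function names `fit_simple_linear_regression` and `squared_error_cost` are hypothetical, and the cost uses the 1/(2n) convention, which is one common choice.

```python
# Hypothetical data: hours studied (x) and marks obtained (y); illustrative values only
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]


def fit_simple_linear_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of cross-deviations / sum of squared deviations of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def squared_error_cost(xs, ys, intercept, slope):
    """Squared-error cost J = (1 / 2n) * sum((prediction - actual)^2)."""
    n = len(xs)
    return sum((intercept + slope * xi - yi) ** 2 for xi, yi in zip(xs, ys)) / (2 * n)


intercept, slope = fit_simple_linear_regression(x, y)
cost = squared_error_cost(x, y, intercept, slope)
prediction = intercept + slope * 6   # forecast for an unseen value x = 6

print(f"fitted line: y = {intercept:.2f} + {slope:.2f} * x")
print(f"cost J = {cost:.3f}, prediction at x = 6: {prediction:.2f}")
```

With more than one independent variable the same idea applies, but in practice a library such as scikit-learn would normally be used instead of hand-written formulas.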
SVC

The Linear Support Vector Classifier (Linear SVC) method applies a linear kernel function to perform classification, and it performs well with a large number of samples. Compared with the SVC model, Linear SVC has additional parameters, such as the penalty norm ('L1' or 'L2') and the loss function.

Pipeline
• A pipeline is a means of automating the machine learning workflow by enabling data to be transformed and correlated into a model that can then be analyzed to achieve outputs. This type of ML pipeline makes the process of inputting data into the ML model fully automated.
• Another type of ML pipeline is the art of splitting up machine learning workflows into independent, reusable, modular parts that can then be pipelined together to create models. This type of ML pipeline makes building models more efficient and simpler, cutting out redundant work.
• This goes hand-in-hand with the recent push for microservices architectures, which rest on the idea that by splitting an application into basic, siloed parts you can build more powerful software over time. Operating systems like Linux and Unix are also founded on this principle. Basic commands like ‘grep’ and ‘cat’ can create impressive functions when they are pipelined together.

In my three-month internship I went through three phases:
• Training Phase
• Designing and Development Phase
• Testing and Maintenance Phase

CHAPTER-7 VALIDATION

7.1 DATA FLOW DIAGRAM

Figure 9. Data flow diagram

1. Exploratory Data Analysis in Machine Learning
2. Data Visualization
3. Training and Testing
4. Train and Evaluate Linear Support Vector Classifier
5. Train and Evaluate Pipeline in Machine Learning

7.2 MODULES DESCRIPTION

Exploratory Data Analysis: I performed initial investigations on the data to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

Data Visualization: Using data visualization, I summarized the data with graphs, pictures and maps, so that the human mind has an easier time processing and understanding it. Data visualization plays a significant role in the representation of both small and large data sets, but it is especially useful for large data sets, where it is impossible to see all of the data, let alone process and understand it manually.

Training and Testing: In this project, the dataset is split into two subsets. The first subset is known as the training data. It is a portion of the actual dataset that is fed into the machine learning model so that the model can discover and learn patterns; in this way, it trains the model. The other subset is known as the testing data; it is held back from training and used to evaluate the model.

Train and Evaluate Linear Support Vector Classifier (SVC): The Linear Support Vector Classifier (Linear SVC) method applies a linear kernel function to perform classification, and it performs well with a large number of samples. Compared with the SVC model, Linear SVC has additional parameters, such as the penalty norm ('L1' or 'L2') and the loss function.

Train and Evaluate Pipeline in Machine Learning:

Figure 10. Machine learning pipeline

CHAPTER-8 END TO END MACHINE LEARNING MODEL

8.1 IMPORTING LIBRARY AND RAW DATA

Figure 11. Importing library

8.2 DATA CLEANING AND VISUALIZATION

Figure 12. Pie chart

Figure 13. Bar graph

8.3 DIVIDING DATASET INTO TRAINING AND TESTING SET

Figure 14. Accuracy test

8.4 DATA NORMALIZATION

Figure 15. Confusion matrix

8.5 TESTING AND PREDICTION

9. LEARNING OUTCOME

After completing the training, I am able to:
• Develop relevant programming abilities.
• Demonstrate proficiency with statistical analysis of data.
• Develop the skill to build and assess data-based models.
• Execute statistical analysis with professional statistical software.
• Demonstrate skill in data management.
• Apply data science concepts and methods to solve problems in real-world contexts and communicate these solutions effectively.

CONCLUSION

Data science has become an essential tool for businesses and organizations across industries, enabling them to make data-driven decisions, improve efficiency and productivity, and gain a competitive advantage. The field encompasses a broad range of techniques and tools from statistics, mathematics, and computer science, including data mining, machine learning, and artificial intelligence.

Data science has numerous applications, including but not limited to business intelligence, health informatics, finance, and social sciences. It has a significant impact on the economy, public policy, and society as a whole. Additionally, data science is essential in tackling some of the world's biggest challenges, such as climate change, healthcare, and cybersecurity.
The importance of data science is only expected to grow in the coming years as the amount of data generated continues to increase. As such, there is a significant demand for data scientists who can extract insights and knowledge from the vast amounts of data available. Overall, data science is a fascinating and challenging field with vast potential for innovation, and it offers an exciting career path for those with a passion for data and analytics.