1. INTRODUCTION

MACHINE LEARNING:
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to "self-learn" from training data and improve over time, without being explicitly programmed. Machine learning algorithms are able to detect patterns in data and learn from them in order to make their own predictions. In short, machine learning algorithms and models learn through experience. Machine learning is the ability of machines (i.e., computers, or more precisely computer programs) to learn from past behaviour or data and to predict future outcomes without being explicitly programmed to do so. Machine learning algorithms constantly analyse and learn from data to improve their future predictions and outcomes automatically.

TYPES OF MACHINE LEARNING:
The types of machine learning algorithms differ in their approach, the type of data they take as input and produce as output, and the type of task or problem that they are intended to solve. Broadly, machine learning can be categorized into four categories:
I. Supervised Learning
II. Unsupervised Learning
III. Reinforcement Learning
IV. Semi-supervised Learning
Machine learning enables analysis of massive quantities of data. While it generally delivers faster, more accurate results for identifying profitable opportunities or dangerous risks, it may also require additional time and resources to train the models properly.

Supervised Learning: Supervised learning is a type of learning in which we are given a data set and we already know what the correct output should look like, with the idea that there is a relationship between the input and the output. Basically, it is the task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labelled training data consisting of a set of training examples (a minimal code sketch is given at the end of this section).

Unsupervised Learning: Unsupervised learning is a type of learning that allows us to approach problems with little or no idea of what the output should look like. We can derive structure by clustering the data based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results. Basically, it is a type of self-organized learning that helps in finding previously unknown patterns in a data set without pre-existing labels.

Reinforcement Learning: Reinforcement learning is a learning method that interacts with its environment by producing actions and discovering errors or rewards. Trial-and-error search and delayed reward are the most relevant characteristics of reinforcement learning. This method allows machines and software agents to automatically determine the ideal behaviour within a specific context in order to maximize performance. Simple reward feedback is required for the agent to learn which action is best.

Semi-Supervised Learning: Semi-supervised learning falls somewhere in between supervised and unsupervised learning, since it uses both labelled and unlabelled data for training, typically a small amount of labelled data and a large amount of unlabelled data. Systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when acquiring labelled data requires skilled and relevant resources in order to train on it or learn from it, whereas acquiring unlabelled data generally does not require additional resources.
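As a concrete illustration of supervised learning, the minimal sketch below fits a classifier on a handful of labelled input-output pairs and then predicts the label of a new, unseen input. The data values are invented for the example, and the scikit-learn library is an assumption (it is not introduced in this text):

# A minimal supervised-learning sketch (illustrative values; scikit-learn assumed to be available).
from sklearn.tree import DecisionTreeClassifier

# Labelled training examples: [hours studied, classes attended] -> pass (1) / fail (0)
X_train = [[2, 10], [8, 30], [1, 5], [9, 28]]
y_train = [0, 1, 0, 1]

model = DecisionTreeClassifier()
model.fit(X_train, y_train)        # learn the mapping from inputs to labels
print(model.predict([[7, 25]]))    # predict the label for an unseen example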
MACHINE LEARNING MODELS:
A machine learning model can be understood as a program that has been trained to find patterns within new data and make predictions. These models are represented as a mathematical function that takes requests in the form of input data, makes predictions on that input data, and then provides an output in response. First, these models are trained over a set of data; an algorithm then reasons over the data, extracts patterns from it and learns from it. Once trained, a model can be used to make predictions on unseen data.

2. DATA INTRODUCTION:
Data is a crucial component in the field of machine learning. It refers to the set of observations or measurements that can be used to train a machine-learning model. The quality and quantity of data available for training and testing play a significant role in determining the performance of a machine-learning model. Data can come in various forms such as numerical, categorical, or time-series data, and can come from various sources such as databases, spreadsheets, or APIs. Machine learning algorithms use data to learn patterns and relationships between input variables and target outputs, which can then be used for prediction or classification tasks. Data is typically divided into two types (a small illustration follows this list):
i. Labeled Data
ii. Unlabeled Data
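The sketch below illustrates the difference: labeled data carries a target column that we want to predict, while unlabeled data contains only the input features. The column names and values are hypothetical, and pandas (introduced later in this text) is used purely for illustration:

import pandas as pd

# Labeled data: each row has input features and a known target ("defaulted").
labeled = pd.DataFrame({
    "age":       [25, 32, 47],
    "income":    [30000, 52000, 81000],
    "defaulted": [0, 0, 1],        # the label / target column
})

# Unlabeled data: the same features, but no target column to learn from.
unlabeled = labeled.drop(columns=["defaulted"])
print(labeled)
print(unlabeled)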
LIMITATIONS OF TRADITIONAL DATA ANALYSIS:
Data analytics is used in virtually every industry to improve the performance of companies, much like a cricket match where the captain and coach analyse the performance of the opposing team and then devise strategies for winning the match. If we began to list the areas where data analytics is used, the list would be very long; some of the major industries include banking, finance, insurance, telecom, healthcare, aerospace, retail, internet, and e-commerce companies. Now, let us talk about the reasons why traditional analytics paved the way for machine learning.

Limitations of traditional analytics:

Involved Huge Efforts, Time Consuming
Traditional data analytics required huge effort from statisticians and was very time consuming, since it involved a lot of human intervention and each job had to be done manually. This became a liability for companies because of the additional costs of the statisticians, and the analyses often failed to provide accurate results.

Static Data Model
The models these statisticians created were static in nature, so the models needed to be revamped and re-calibrated periodically to make better predictions. This became an additional cost for companies, and they struggled with the performance of the models.

Struggle to Manage the Exponential Growth in the Volume, Velocity, and Variability of Data
With the exponential growth in the volume, velocity, and variability of data, traditional analytics struggled to manage, incorporate and integrate the data using traditional methods. As a result, the models struggled with performance.

Lack of Skilled Resources
A lack of skilled resources was one of the major reasons for the downfall of traditional analytics. Companies found it difficult to find good resources, and those they had often lacked knowledge of the advanced tools with which data analytics could be done easily.

Rise of Machine Learning: Human Intervention Reduced, Dynamic Model
With the advent of machine learning, the challenges faced by traditional analytics were addressed. Machine learning uses complex statistical methods and new-age tools to provide better and more accurate solutions to problems. Using modern techniques and statistical methods, we can give machines the capability to learn and make good, accurate predictions using complex machine learning algorithms. Modern machine learning algorithms help companies make better, more accurate data-driven decisions with very little human effort involved. The models created by these techniques are dynamic and adapt to changing variables.

For example, with traditional analytics, statisticians had to perform each step of their analysis manually using the available analytical tools. The process was not streamlined, and as a result they had to spend too much time going back and forth over the initial steps. The models created were also not dynamic: suppose a model was built on the available data; then after each month, or once a quarter, the statisticians had to check the performance of the model on new data and make changes to it. This proved to be an additional cost for companies, as it involved hiring a data analyst to update and maintain the model. With machine learning, the steps involved in creating a model are pre-defined, and for each step there are different algorithms. The models built from these algorithms are more powerful and more accurate, as they are built using complex statistical techniques. Human intervention is greatly reduced and, because the process is streamlined, it saves a lot of time and money for companies. Additionally, machine learning uses new-age tools which make better predictions and produce accurate results in much less time.

Managing the Exponential Growth in the Volume, Velocity, and Variability of Data
One more area where traditional analytics faltered was working with large volumes of data. With the increase in volume, velocity, and variability of data, the old tools (e.g. Minitab, MATLAB, IBM SPSS, etc.) failed to manage the data. If the data was structured, the tools were able to manage and analyse it to some extent (i.e. up to a certain volume), but beyond that the tools faltered. For unstructured data, the tools had no answer at all, and very few techniques or methods were known to statisticians. With modern-day tools (e.g. Python, R, KNIME, Julia, etc.) and techniques, we can manage mammoth amounts of data, be it structured or unstructured; machine learning has a solution for each of these problems. This is what the evolution of data analytics is all about, and computing is getting better and better with time. We have talked about the advantages of machine learning and how it impacts our lives today; now let us talk about the challenges of machine learning.

Limitations of Machine Learning:

Large Data Requirement
Fetching relevant data for training a model is a challenge. To make the model produce better predictions, we have to train it on enough data that is relevant to the problem. If sufficient data is not available, the model built will not be able to perform correctly.

Lack of Trained Resources
Using different ML techniques we get results, but the interpretation and understanding of those results vary from person to person. This can become a major challenge if we do not have a trained resource: even if the chosen algorithm is correct according to that person's judgement and understanding, the resulting solution might not be correct.
Additionally, the selection of the correct algorithm depends solely on the data scientist's decision. If that person is not a trained resource, the solution might not be correct because the proper algorithm was not selected. Companies do face the challenge of finding a good resource who is sound both technically and statistically.

Reliance on the Results Obtained by the ML Techniques
We tend to rely on the results obtained by ML techniques more than on our own judgement and experience. Suppose we build a model using the variables selected by the ML algorithm, and the algorithm does not consider a variable that, according to the data scientist, is critical for the model. Such situations do occur if we rely completely on ML techniques.

Treatment of Missing Fields
Lastly, when we have missing fields in our data, we use machine learning to replace those missing values with some alternate values. Imputing those missing values may at times bias our data and hence impact our models and, in turn, the results. The conclusions and inferences might change if we leave the data missing, so we treat it using ML techniques selected by the data scientist. The selection of the imputation technique depends on the data scientist's judgement and can become a disadvantage for the model.

We have talked about missing data, but what do we mean by that?

Missing data: Whenever the data is not available or not present for a field in the data, we say that the data is missing for that field. It is sometimes represented by a "Blank" or "N/A" or "n.a." (Not Available) or even "-". This can be the case for both structured and unstructured data. For example, when we look at the bowling statistics for M.S. Dhoni in T20 internationals, we get missing data. [Screenshot of the T20 international bowling statistics omitted.] The data is not available since Dhoni hasn't bowled in any T20 matches, but when we structure that data we get missing values for Dhoni.

There might be any number of reasons why some of the values for a variable are missing; it might be the unavailability of data, or that the data was not captured, etc. As data scientists, when we build a model using such datasets, we always have to replace those missing values with some other value in such a way that it does not impact, or has very little impact on, our model. There are different techniques available for imputing or replacing those missing values, which we are going to discuss later; a small sketch of two common approaches follows below.
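As a hedged illustration of what imputation can look like in practice, the sketch below fills missing values first with pandas (mean imputation) and then with scikit-learn's SimpleImputer (median imputation). The column names and values are invented for the example, and scikit-learn is an assumed dependency not covered in this text:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values (np.nan marks a missing field).
df = pd.DataFrame({
    "overs":   [4.0, np.nan, 3.0, 4.0],
    "economy": [7.5, np.nan, 6.2, 8.1],
})

filled_mean = df.fillna(df.mean())                                   # replace NaN with each column's mean
filled_median = SimpleImputer(strategy="median").fit_transform(df)   # median imputation

print(filled_mean)
print(filled_median)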
After ML, what next?
As we make progress in data science and computing, new-age technologies are making breakthroughs. Researchers are constantly working on next-level techniques that involve minimal human intervention and deliver better, more accurate results. Deep learning, and now AI more broadly, is a breakthrough in the world of data science. It keeps getting better, with deep-learning-based techniques reducing the time and cost of data analysis. [Diagram summarizing the progression from traditional analytics to machine learning and deep learning omitted.]

3. INTRODUCTION TO PYTHON

INTRODUCTION:
Python is a high-level programming language with high-level built-in data structures and dynamic binding. It is an interpreted, object-oriented programming language. Python distinguishes itself from other programming languages with its easy-to-write and easy-to-understand syntax, which makes it appealing to beginners and experienced developers alike. The extensive applicability and library support of Python allow highly versatile and scalable software and products to be built on top of it in the real world.

DATA TYPES:
[Table of Python's built-in data types omitted.]

ARITHMETIC OPERATORS:
[Worked examples of the arithmetic operators omitted.]

BASIC LIBRARIES IN PYTHON:
The basic libraries in Python are as follows:
I. Pandas
II. NumPy
III. Matplotlib

PANDAS:
Pandas is a BSD (Berkeley Software Distribution) licensed open-source library. This popular library is widely used in the field of data science, primarily for data analysis, manipulation and cleaning. Pandas allows simple data modelling and data analysis operations without the need to switch to another language such as R. Usually, it works with the following types of data:
Tabular data in a dataset.
Time series containing both ordered and unordered data.
Matrix data with labelled rows and columns.
Unlabeled data.
Any other type of statistical data.
Pandas can perform a wide range of tasks, including:
Slicing a data frame.
Joining and merging data frames.
Concatenating columns from two data frames.
Changing the index values in a data frame.
Changing the headers of a column.
Converting data into various forms, and many more.
(See the short sketch after the NumPy subsection below.)

NUMPY:
NumPy is one of the most widely used open-source Python libraries, focusing on scientific computation. It features built-in mathematical functions for matrices and multi-dimensional data. The name NumPy stands for "Numerical Python". It can be used for linear algebra, as a multi-dimensional container for generic data, and as a random number generator, among other things. Some of the important functions in NumPy are arcsin(), arccos(), tan(), radians(), etc. A NumPy array is a Python object which defines an N-dimensional array with rows and columns. In Python, a NumPy array is preferred over a list because it takes up less memory and is faster and more convenient to use.

Features:
1. Interactive: NumPy is a very interactive and user-friendly library.
2. Mathematics: NumPy simplifies the implementation of difficult mathematical equations.
3. Intuitive: It makes coding and understanding topics a breeze.
4. A lot of interaction: Because it is widely used, there is a lot of open-source contribution around it.
The NumPy interface can be used to represent images, sound waves, and other binary raw streams as an N-dimensional array of real values for visualization. NumPy knowledge is required for full-stack developers to implement this library for machine learning. (Short sketches of typical pandas and NumPy usage follow below.)
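First, a pandas sketch demonstrating a few of the operations listed above (slicing, merging, renaming headers and changing the index). The data frames, column names and values are made up for the example:

import pandas as pd

sales = pd.DataFrame({"id": [1, 2, 3], "amount": [250, 400, 150]})
customers = pd.DataFrame({"id": [1, 2, 3], "city": ["Pune", "Delhi", "Chennai"]})

print(sales[sales["amount"] > 200])                         # slicing the data frame by a condition
merged = sales.merge(customers, on="id")                    # joining / merging two data frames
merged = merged.rename(columns={"amount": "sale_amount"})   # changing a column header
merged = merged.set_index("id")                             # changing the index values
print(merged)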
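And a minimal NumPy sketch covering arrays, element-wise mathematics, the trigonometric functions mentioned above, and random number generation (all values are illustrative):

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-dimensional NumPy array (2 rows, 3 columns)
print(a.shape)                          # (2, 3)
print(a * 10)                           # element-wise arithmetic
print(a.mean())                         # a built-in mathematical function

angles = np.radians([0, 30, 90])        # degrees converted to radians
print(np.sin(angles))                   # trigonometric functions
print(np.random.rand(3))                # NumPy as a random number generator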
CONDITIONAL STATEMENTS:

if Statement
The if statement is a conditional statement in Python that is used to determine whether a block of code will be executed or not. If the program finds the condition defined in the if statement to be true, it will go ahead and execute the code block inside the if statement.
Syntax:
if condition:
    # execute code block

if-else Statement
As discussed above, the if statement executes the code block when the condition is true. Similarly, the else statement works in conjunction with the if statement to execute a code block when the defined if condition is false.
Syntax:
if condition:
    # execute code if condition is true
else:
    # execute code if condition is false

if-elif-else Ladder
The elif statement is used to check for multiple conditions and execute the corresponding code block if any of the conditions evaluate to be true. The elif statement is similar to the else statement in that it is optional, but unlike the else statement, there can be multiple elif statements in a code block following an if statement.
Syntax:
if condition1:
    # execute this statement
elif condition2:
    # execute this statement
...
else:
    # if none of the above conditions evaluate to True, execute this statement

Nested if Statements
A nested if statement is an if statement placed within another if statement (or statements). These are generally used to check for multiple conditions.
Syntax:
if condition1:
    if condition2:
        # execute code if both condition1 and condition2 are True

DESCRIPTIVE STATISTICS:
Descriptive statistics is a means of describing features of a data set by generating summaries about data samples. It is often presented as a summary that explains the contents of the data. For example, a population census may include descriptive statistics regarding the ratio of men and women in a specific city.

TYPES OF DESCRIPTIVE STATISTICS:
All descriptive statistics are either measures of central tendency or measures of variability, also known as measures of dispersion.

Central Tendency
Measures of central tendency focus on the average or middle values of data sets, whereas measures of variability focus on the dispersion of data. These two kinds of measures use graphs, tables and general discussion to help people understand the meaning of the analysed data. Measures of central tendency describe the centre position of a distribution for a data set. A person analyses the frequency of each data point in the distribution and describes it using the mean, median, or mode, which measure the most common patterns of the analysed data set.

Measures of Variability
Measures of variability (or measures of spread) help in analysing how dispersed the distribution is for a set of data. For example, while the measures of central tendency may give a person the average of a data set, they do not describe how the data is distributed within the set. So while the average of the data may be 65 out of 100, there can still be data points at both 1 and 100. Measures of variability help communicate this by describing the shape and spread of the data set. Range, quartiles, absolute deviation, and variance are all examples of measures of variability. Consider the following data set: 5, 19, 24, 62, 91, 100. The range of that data set is 95, which is calculated by subtracting the lowest number (5) in the data set from the highest (100).

Distribution
Distribution (or frequency distribution) refers to the number of times a data point occurs or, alternatively, the number of times a data point fails to occur. Consider the data set: male, male, female, female, female, other. The distribution of this data can be described as follows:
The number of males in the data set is 2.
The number of females in the data set is 3.
The number of individuals identifying as other is 1.
The number of non-males is 4.
(A short sketch computing these descriptive statistics in Python follows below.)
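A minimal sketch, using Python's built-in statistics module and the two small data sets quoted above, of how these descriptive statistics can be computed:

import statistics
from collections import Counter

data = [5, 19, 24, 62, 91, 100]
print(statistics.mean(data))       # measure of central tendency: mean
print(statistics.median(data))     # measure of central tendency: median
print(max(data) - min(data))       # range: 100 - 5 = 95
print(statistics.variance(data))   # measure of variability: variance

genders = ["male", "male", "female", "female", "female", "other"]
print(Counter(genders))            # frequency distribution: 2 male, 3 female, 1 other
print(statistics.mode(genders))    # most common value: 'female'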
4. DATA EXPLORATION AND PREPROCESSING

TARGET VARIABLE:
The target variable is the feature of a dataset that you want to understand more clearly. It is the variable that the user would want to predict using the rest of the dataset. In most situations, a supervised machine learning algorithm is used to predict the target variable. Such an algorithm uses historical data to learn patterns and uncover relationships between other parts of the dataset and the target. Target variables may vary depending on the goal and the available data.

DATA EXPLORATION:
Data exploration, also known as exploratory data analysis (EDA), is a process where users look at and understand their data with statistical and visualization methods. This step helps in identifying patterns and problems in the dataset, as well as in deciding which model or algorithm to use in subsequent steps.
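To make this concrete, the sketch below runs a few typical exploration steps with pandas and Matplotlib. The file name "customers.csv" and the target column "churned" are hypothetical placeholders; the steps themselves (inspecting rows, summary statistics, missing-value counts, and the distribution of the target variable) are the ones described above:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")          # hypothetical dataset

print(df.head())                           # look at the first few rows
print(df.describe())                       # summary statistics for numeric columns
print(df.isnull().sum())                   # count of missing values per column
print(df["churned"].value_counts())        # distribution of the target variable

df.hist(figsize=(10, 6))                   # quick visual check of each numeric column
plt.show()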