UNIT-I Learning a class from examples, Vapnik-Chervonenkis (VC) Dimension, Probably Approximately Correct (PAC) learning, Noise, learning multiple classes. Regression: Simple linear regression, multiple linear regression, model selection and generalization, Dimensions of supervised Machine Learning algorithms, Bayesian classification.
……………………………………………………………………………………………………………………………..
Need For Machine Learning
Ever since the technical revolution, we have been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day, and it was estimated that by 2020, 1.7 MB of data would be created every second for every person on earth. With the availability of so much data, it is finally possible to build predictive models that can study and analyze complex data to find useful insights and deliver more accurate results. Top-tier companies such as Netflix and Amazon build such Machine Learning models by using tons of data in order to identify profitable opportunities and avoid unwanted risks.
Here's a list of reasons why Machine Learning is so important:
• Increase in Data Generation: Due to the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be used to make better business decisions. For example, Machine Learning is used to forecast sales, predict downfalls in the stock market, identify risks and anomalies, etc.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights from data is the most essential part of Machine Learning. By building predictive models and using statistical techniques, Machine Learning allows you to dig beneath the surface and explore the data at a minute scale. Understanding data and extracting patterns manually would take days, whereas Machine Learning algorithms can perform such computations in less than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to building self-driving cars, Machine Learning can be used to solve the most complex problems.
To give you a better understanding of how important Machine Learning is, let's list down a couple of Machine Learning applications:
• Netflix's Recommendation Engine: The core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
• Facebook's Auto-tagging feature: The logic behind Facebook's DeepFace face verification system is Machine Learning and Neural Networks. DeepFace studies the facial features in an image to tag your friends and family.
• Amazon's Alexa: Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced virtual assistant that does more than just play songs from your playlist. It can book you an Uber, connect with the other IoT devices at home, track your health, etc.
• Google's Spam Filter: Gmail makes use of Machine Learning to filter out spam messages. It uses Machine Learning algorithms and Natural Language Processing to analyze emails in real time and classify them as either spam or non-spam.
Introduction To Machine Learning
The term Machine Learning was first coined by Arthur Samuel in the year 1959.
Looking back, that year was probably the most significant in terms of technological advancements. If you browse through the net about ‘what is Machine Learning’, you’ll get at least 100 different definitions. However, the very first formal definition was given by Tom M. Mitchell: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” In simple terms, Machine learning is a subset of Artificial Intelligence (AI) which provides machines the ability to learn automatically & improve from experience without being explicitly programmed to do so. In the sense, it is the practice of getting Machines to solve problems by gaining the ability to think. 2 But wait, can a machine think or make decisions? Well, if you feed a machine a good amount of data, it will learn how to interpret, process and analyze this data by using Machine Learning Algorithms, in order to solve real-world problems. Machine Learning Definitions Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used to learn patterns from data and draw significant information from it. It is the logic behind a Machine Learning model. An example of a Machine Learning algorithm is the Linear Regression algorithm. Model: A model is the main component of Machine Learning. A model is trained by using a Machine Learning Algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in order to get the correct output. Predictor Variable: It is a feature(s) of the data that can be used to predict the output. Response Variable: It is the feature or the output variable that needs to be predicted by using the predictor variable(s). Training Data: The Machine Learning model is built using the training data. The training data helps the model to identify key trends and patterns essential to predict the output. Testing Data: After the model is trained, it must be tested to evaluate how accurately it can predict an outcome. This is done by the testing data set. To sum it up, take a look at the above figure. A Machine Learning process begins by feeding the machine lots of data, by using this data the machine is trained to detect hidden insights and trends. These insights are then used to build a Machine Learning Model by using an algorithm in order to solve a problem. The next topic in this Introduction to Machine Learning blog is the Machine Learning Process. Machine Learning Process The Machine Learning process involves building a Predictive model that can be used to find a solution for a Problem Statement. To understand the Machine Learning process let’s assume that you have been given a problem that needs to be solved by using Machine Learning. The problem is to predict the occurrence of rain in your local area by using Machine Learning. The below steps are followed in a Machine Learning process: Step 1: Define the objective of the Problem Statement At this step, we must understand what exactly needs to be predicted. In our case, the objective is to predict the possibility of rain by studying weather conditions. At this stage, it is also essential to take mental notes on 3 what kind of data can be used to solve this problem or the type of approach you must follow to get to the solution. Step 2: Data Gathering At this stage, you must be asking questions such as, • What kind of data is needed to solve this problem? • Is the data available? 
• How can I get the data? Once you know the types of data that is required, you must understand how you can derive this data. Data collection can be done manually or by web scraping. However, if you’re a beginner and you’re just looking to learn Machine Learning you don’t have to worry about getting the data. There are 1000s of data resources on the web, you can just download the data set and get going. Coming back to the problem at hand, the data needed for weather forecasting includes measures such as humidity level, temperature, pressure, locality, whether or not you live in a hill station, etc. Such data must be collected and stored for analysis. Step 3: Data Preparation The data you collected is almost never in the right format. You will encounter a lot of inconsistencies in the data set such as missing values, redundant variables, duplicate values, etc. Removing such inconsistencies is very essential because they might lead to wrongful computations and predictions. Therefore, at this stage, you scan the data set for any inconsistencies and you fix them then and there. Step 4: Exploratory Data Analysis Grab your detective glasses because this stage is all about diving deep into data and finding all the hidden data mysteries. EDA or Exploratory Data Analysis is the brainstorming stage of Machine Learning. Data Exploration involves understanding the patterns and trends in the data. At this stage, all the useful insights are drawn and correlations between the variables are understood. For example, in the case of predicting rainfall, we know that there is a strong possibility of rain if the temperature has fallen low. Such correlations must be understood and mapped at this stage. Step 5: Building a Machine Learning Model 4 All the insights and patterns derived during Data Exploration are used to build the Machine Learning Model. This stage always begins by splitting the data set into two parts, training data, and testing data. The training data will be used to build and analyze the model. The logic of the model is based on the Machine Learning Algorithm that is being implemented. In the case of predicting rainfall, since the output will be in the form of True (if it will rain tomorrow) or False (no rain tomorrow), we can use a Classification Algorithm such as Logistic Regression. Choosing the right algorithm depends on the type of problem you’re trying to solve, the data set and the level of complexity of the problem. In the upcoming sections, we will discuss the different types of problems that can be solved by using Machine Learning. Step 6: Model Evaluation & Optimization After building a model by using the training data set, it is finally time to put the model to a test. The testing data set is used to check the efficiency of the model and how accurately it can predict the outcome. Once the accuracy is calculated, any further improvements in the model can be implemented at this stage. Methods like parameter tuning and cross-validation can be used to improve the performance of the model. Step 7: Predictions Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a Categorical variable (eg. True or False) or it can be a Continuous Quantity (eg. the predicted value of a stock). In our case, for predicting the occurrence of rainfall, the output will be a categorical variable. So that was the entire Machine Learning process. Now it’s time to learn about the different ways in which Machines can learn. 
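Before moving on, the seven steps above can be tied together in a minimal, hedged sketch of the rain-prediction workflow using scikit-learn. The file name weather.csv and the columns humidity, temperature, pressure and rain are illustrative assumptions, not a real dataset; Logistic Regression is used because the output is True/False, as discussed in Step 5.
# A minimal sketch of the rain-prediction process (Steps 2-7).
# Assumption: a file 'weather.csv' with columns humidity, temperature,
# pressure and a binary label rain. These names are illustrative only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Steps 2-3: gather the data and remove rows with missing values
data = pd.read_csv('weather.csv').dropna()
X = data[['humidity', 'temperature', 'pressure']]   # predictor variables
y = data['rain']                                    # response variable (True/False)

# Step 5: split into training and testing data, then train the classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)

# Step 6: evaluate the model on the unseen test set
print('Test accuracy:', accuracy_score(y_test, model.predict(X_test)))

# Step 7: predict for a new observation (humidity, temperature, pressure)
new_day = pd.DataFrame([[0.85, 18.0, 1002.0]], columns=X.columns)
print('Rain tomorrow?', model.predict(new_day)[0])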
Machine Learning Types A machine can learn to solve a problem by following any one of the following three approaches. These are the ways in which a machine can learn: 1. Supervised Learning 2. Unsupervised Learning 3. Reinforcement Learning Supervised Learning Supervised learning is a technique in which we teach or train the machine using data which is well labeled. To understand Supervised Learning let’s consider an analogy. As kids we all needed guidance to solve math problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of supervised learning as a type of Machine Learning that involves a guide. The labeled data set is the teacher that will train you to understand patterns in the data. The labeled data set is nothing but the training data set. 5 Consider the above figure. Here we’re feeding the machine images of Tom and Jerry and the goal is for the machine to identify and classify the images into two groups (Tom images and Jerry images). The training data set that is fed to the model is labeled, as in, we’re telling the machine, ‘this is how Tom looks and this is Jerry’. By doing so you’re training the machine by using labeled data. In Supervised Learning, there is a well-defined training phase done with the help of labeled data. Unsupervised Learning Unsupervised learning involves training by using unlabeled data and allowing the model to act on that information without guidance. Think of unsupervised learning as a smart kid that learns without any guidance. In this type of Machine Learning, the model is not fed with labeled data, as in the model has no clue that ‘this image is Tom and this is Jerry’, it figures out patterns and the differences between Tom and Jerry on its own by taking in tons of data. For example, it identifies prominent features of Tom such as pointy ears, bigger size, etc, to understand that this image is of type 1. Similarly, it finds such features in Jerry and knows that this image is of type 2. Therefore, it classifies the images into two different classes without knowing who Tom is or Jerry is. Reinforcement Learning Reinforcement Learning is a part of Machine learning where an agent is put in an environment and he learns to behave in this environment by performing certain actions and observing the rewards which it gets from those actions. 6 This type of Machine Learning is comparatively different. Imagine that you were dropped off at an isolated island! What would you do? Panic? Yes, of course, initially we all would. But as time passes by, you will learn how to live on the island. You will explore the environment, understand the climate condition, the type of food that grows there, the dangers of the island, etc. This is exactly how Reinforcement Learning works, it involves an Agent (you, stuck on the island) that is put in an unknown environment (island), where he must learn by observing and performing actions that result in rewards. Reinforcement Learning is mainly used in advanced Machine Learning areas such as self-driving cars, AplhaGo, etc. 1. LEARNING FROM CLASS EXAMPLES Let us say we want to learn the class, C, of a “family car.” We have a set of examples of cars, and we have a group of people that we survey to whom we show these cars. The people look at the cars and label them; the cars that they believe are family cars are positive examples, and the other cars are negative examples. 
Class learning is finding a description that is shared by all the positive examples and none of the negative examples. Doing this, we can make a prediction: given a car that we have not seen before, by checking with the description learned, we will be able to say whether it is a family car or not. Or we can do knowledge extraction.
After some discussions with experts in the field, let us say that we reach the conclusion that among all the features a car may have, the features that separate a family car from other types of cars are the price and engine power. These two attributes are the inputs to the class recognizer. Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as the second attribute x2 (e.g., engine volume in cubic centimetres). Thus, we represent each car using two numeric values, x = [x1, x2], together with its label r, where r = 1 if the car is a positive example (a family car) and r = 0 if it is a negative example. The training set contains N such pairs, X = {x^t, r^t}, t = 1, ..., N.
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t (see figure 2.1).
After further discussions with the expert and the analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each lie in a certain range:
(p1 <= price <= p2) AND (e1 <= engine power <= e2)
for suitable values of p1, p2, e1, and e2. This expression (equation 2.4) assumes C to be a rectangle in the price-engine power space (see figure 2.2). Equation 2.4 fixes H, the hypothesis class from which we believe C is drawn, namely the set of axis-aligned rectangles. The learning algorithm then finds the particular hypothesis, h ∈ H, specified by a particular quadruple (ph1, ph2, eh1, eh2), to approximate C as closely as possible. Though we choose H, we do not know which particular h ∈ H is equal, or closest, to C. But once we restrict our attention to this hypothesis class, learning the class reduces to the easier problem of finding the four parameters that define h. The aim is to find h ∈ H that is as similar as possible to C. The hypothesis h makes a prediction for an instance x such that
h(x) = 1 if h classifies x as a positive example, and h(x) = 0 if h classifies x as a negative example.
In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x.
Error
The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is
E(h | X) = (1/N) * sum over t = 1, ..., N of 1(h(x^t) != r^t)
where 1(a != b) is 1 if a != b and is 0 if a = b (see figure 2.3).
Most specific and most general hypothesis:
In our example, the hypothesis class H is the set of all possible rectangles. Each quadruple (ph1, ph2, eh1, eh2) defines one hypothesis, h, from H, and we need to choose the best one; in other words, we need to find the values of these four parameters, given the training set, so as to include all the positive examples and none of the negative examples. One possibility is to find the most specific hypothesis, S, which is the tightest rectangle that includes all the positive examples and none of the negative examples (see figure 2.4). This gives us one hypothesis, h = S, as our induced class. The most general hypothesis, G, is the largest rectangle we can draw that includes all the positive examples and none of the negative examples (figure 2.4).
False positive and false negative
• C is the actual class and h is our induced hypothesis.
• The point where C is 1 but h is 0 is a false negative, and the point where C is 0 but h is 1 is a false positive.
• Other points, namely true positives and true negatives, are correctly classified.
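The rectangle hypothesis and the empirical error E(h|X) can be computed with a few lines of Python. This is a small illustrative sketch: the (price, engine power) pairs and the quadruple (p1, p2, e1, e2) below are invented numbers, not the data of figure 2.1.
# h(x) = 1 if (p1 <= price <= p2) AND (e1 <= engine power <= e2), else 0.
def h(x1, x2, p1, p2, e1, e2):
    return 1 if (p1 <= x1 <= p2) and (e1 <= x2 <= e2) else 0

# Training set X = {(x1^t, x2^t, r^t)}: price in USD, engine volume in cc, label r.
X = [(17000, 1600, 1), (21000, 1800, 1), (26000, 1700, 1), (19000, 1500, 1),
     (9000, 1000, 0), (45000, 3500, 0), (30000, 2000, 0)]

# One candidate hypothesis from the class of axis-aligned rectangles.
p1, p2, e1, e2 = 15000, 25000, 1200, 2000

# Empirical error: the proportion of training instances where h(x^t) != r^t.
mistakes = sum(1 for x1, x2, r in X if h(x1, x2, p1, p2, e1, e2) != r)
print('E(h|X) =', mistakes / len(X))   # here 1/7: the car at (26000, 1700) is a false negative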
Version space
• Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent with the training set, and such h make up the version space.
• Given another training set, S, G, the version space, the parameters, and thus the learned hypothesis h can be different.
• Actually, depending on X and H, there may be several Si and Gj which respectively make up the S-set and the G-set. Every member of the S-set is consistent with all the instances, and there are no consistent hypotheses that are more specific. Similarly, every member of the G-set is consistent with all the instances, and there are no consistent hypotheses that are more general.
• These two make up the boundary sets, and any hypothesis between them is consistent and is part of the version space.
• There is an algorithm called candidate elimination that incrementally updates the S- and G-sets as it sees training instances one by one.
Margin
• The margin is the distance between the boundary and the instances closest to it.
• We choose the hypothesis with the largest margin, for best separation.
• The shaded instances are those that define (or support) the margin; other instances can be removed without affecting h.
2. VAPNIK-CHERVONENKIS DIMENSION
Let us say we have a dataset containing N points. These N points can be labelled in 2^N ways as positive and negative. Therefore, 2^N different learning problems can be defined by N data points. If for any of these problems we can find a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters N points. That is, any learning problem definable by N examples can be learned with no error by a hypothesis drawn from H. The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, is denoted VC(H), and measures the capacity of H.
The VC dimension of a classifier is defined by Vapnik and Chervonenkis to be the cardinality (size) of the largest set of points that the classification algorithm can shatter. Shattering is the ability of a model to classify a set of points perfectly; that is, for every possible labeling of the points, the model can realize a function that separates the two classes.
Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by H, then VC(H) = ∞.
In figure 2.6, we see that an axis-aligned rectangle can shatter four points in two dimensions. Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four. In calculating the VC dimension, it is enough that we find four points that can be shattered; it is not necessary that we be able to shatter any four points in two dimensions. For example, four points placed on a line cannot be shattered by rectangles. However, we cannot place five points in two dimensions anywhere such that a rectangle can separate the positive and negative examples for all possible labelings.
The VC dimension may seem pessimistic. It tells us that, using a rectangle as our hypothesis class, we can learn only datasets containing four points and not more. A learning algorithm that can learn datasets of four points is not very useful. However, this is because the VC dimension is independent of the probability distribution from which instances are drawn.
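Because the hypotheses are just axis-aligned rectangles, the shattering claims above can be checked by brute force: a labeling of a point set is realizable exactly when the tightest rectangle S around its positive points contains no negative point. The sketch below uses illustrative coordinates (not the exact points of figure 2.6): a "diamond" of four points is shattered, four collinear points are not, and adding a fifth interior point breaks shattering.
# Brute-force shattering check for axis-aligned rectangles.
# A labeling is realizable iff the bounding box of the positive points
# (the tightest rectangle S) excludes every negative point.
from itertools import product

def rectangle_shatters(points):
    for labels in product([0, 1], repeat=len(points)):     # all 2^N labelings
        pos = [p for p, r in zip(points, labels) if r == 1]
        neg = [p for p, r in zip(points, labels) if r == 0]
        if not pos:
            continue                                        # an empty rectangle realizes this labeling
        xmin = min(p[0] for p in pos); xmax = max(p[0] for p in pos)
        ymin = min(p[1] for p in pos); ymax = max(p[1] for p in pos)
        if any(xmin <= x <= xmax and ymin <= y <= ymax for (x, y) in neg):
            return False                                    # this labeling cannot be realized
    return True

diamond = [(0, 1), (1, 0), (0, -1), (-1, 0)]       # one point extreme in each direction
collinear = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(rectangle_shatters(diamond))                 # True: these 4 points are shattered
print(rectangle_shatters(collinear))               # False: 4 points on a line are not
print(rectangle_shatters(diamond + [(0, 0)]))      # False: these 5 points cannot be shattered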
In real life, the world is smoothly changing, instances close by most of the time have the same labels, and we need not worry about all possible labelings. There are a lot of datasets containing many more data points than four that are learnable by our hypothesis class (figure 2.1). So even hypothesis classes with small VC dimension are applicable and are preferred over those with large VC dimension, for example, a lookup table, which has infinite VC dimension.
Example 2: Vapnik-Chervonenkis (VC) Dimension
• The VC (Vapnik-Chervonenkis) dimension is a measure of the capacity or complexity of a space of functions (hypotheses) that can be learned by a classification algorithm.
• The basic definition of the VC dimension is the capacity of a classification algorithm, defined as the maximum cardinality of a set of points that the algorithm is able to shatter.
Linear classifier with two data points
A binary classifier, with a positive class 'A' and a negative class 'B', and two data points. The number of possible labelings of N data points is 2^N.
• In our case 2^2 = 4, i.e., (++, +-, -+, --).
• In all four cases, a linear classifier (a line) can separate the positive and negative data points.
Linear classifier with three data points
• Binary classification with three data points (in 2D space).
• The 3 points can each take class A (+) or class B (-), which gives us 2^3 = 8 possible labelings (or learning problems).
• A line can shatter 3 points (in general position).
Linear classifier with four data points
• Now, for the case of 4 points, we have a maximum of 2^4 = 16 possible labelings.
• As the figure shows, for at least one of these labelings the line is unable to separate the two classes.
• So, we can say that the linear classifier can shatter at most 3 points.
Rectangle classifier
For a suitably chosen set of four data points, the rectangle classifier can shatter them in all possible ways.
• Given such 4 points, we assign them the {+, -} labels in all possible ways. For each labeling there must exist a rectangle which produces such an assignment, i.e., such a classification.
• Our classifier: points inside the rectangle are positive examples and points outside are negative examples, respectively.
• Given 4 suitably placed points (as in figure 2.6: one leftmost, one rightmost, one topmost and one bottommost), we have the following assignments:
a) All points are "+" ⇒ use a rectangle that includes them.
b) All points are "-" ⇒ use an empty rectangle.
c) 3 points "-" and 1 "+" ⇒ use a small rectangle centered on the "+" point.
d) 3 points "+" and one "-" ⇒ we can always find a rectangle which excludes the "-" point.
e) 2 points "+" and 2 points "-" ⇒ we can define a rectangle which includes the 2 "+" and excludes the 2 "-".
Rectangle classifier with five data points
For any 5-point set, we can define a rectangle which has the outermost points on its boundary.
• If we assign to such boundary points the "+" label and to the internal point the "-" label, there will not be any rectangle which reproduces such an assignment.
Vapnik-Chervonenkis dimension (VC dim)
• A dataset contains N points.
• These N points can be labeled in 2^N ways as positive and negative.
• If for every such labeling there is a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters the N points.
• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, is denoted VC(H), and measures the capacity of H.
Textbook example: a rectangle can shatter four points.
• An axis-aligned rectangle can shatter four points in two dimensions.
• Then VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four.
• A rectangle can separate the positive and negative examples of these four points for every possible labeling.
• Only the rectangles covering two points, for all such labelings, are shown in the diagram.
3. PROBABLY APPROXIMATELY CORRECT (PAC) LEARNING
In computational learning theory, probably approximately correct (PAC) learning is a framework for the mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant. In this framework, the learner receives samples and must select a generalization function (called the hypothesis) from a certain class of possible functions. The goal is that, with high probability (the "probably" part), the selected function will have low generalization error (the "approximately correct" part). The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success, or distribution of the samples.
The PAC model belongs to the class of learning models characterized by learning from examples. In these models, if f is the target function to be learnt, the learner is provided with some random examples (these examples may come from some probability distribution over the input space) in the form of (X, f(X)), where X is a binary input instance, say of length n, and f(X) is the value (boolean TRUE or FALSE) of the target function at that instance. Based on these examples, the learner must succeed in deducing the target function f, which we can now express as f : {0, 1}^n → {0, 1}.
Here ε gives an upper bound on the error with which h approximates f, and δ gives the probability of failure in achieving this accuracy. Using both these quantities, we can express the definition of a PAC algorithm with more mathematical clarity: to qualify for PAC learnability, the learner must find, with probability of at least 1 − δ, a concept h such that the error between h and f is at most ε.
Formal definition of PAC learnability
• We would like to find an h such that error_D(h) = 0. This is not possible because (1) unless every possible instance of X is in the training set, there might be multiple hypotheses consistent with the training data, and (2) there is a small chance that the training examples will be misleading.
• Therefore, we will require that error_D(h) < ε.
• We will also require that the probability of failure on a sequence of randomly drawn training examples be bounded by δ.
Example: Using the tightest rectangle, S, as our hypothesis, we would like to find how many examples we need. We would like our hypothesis to be approximately correct, namely, that the error probability be bounded by some value. We also would like to be confident in our hypothesis, in that we want to know that our hypothesis will be correct most of the time (if not always); so we want to be probably correct as well (by a probability we can specify).
In probably approximately correct (PAC) learning, given a class, C, and examples drawn from some unknown but fixed probability distribution, p(x), we want to find the number of examples, N, such that with probability at least 1 − δ the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0:
P{C Δ h ≤ ε} ≥ 1 − δ
where C Δ h is the region of difference between C and h.
• In our case, because S is the tightest possible rectangle, the error region between C and h = S is the sum of four rectangular strips (see figure 2.7).
• We would like to make sure that the probability of a positive example falling in this error region (and causing an error) is at most ε.
• For any of these strips, if we can guarantee that the probability is upper bounded by ε/4, the total error is at most 4(ε/4) = ε.
• The probability that a randomly drawn example misses one strip is 1 − ε/4. The probability that all N independent draws miss the strip is (1 − ε/4)^N, and the probability that all N independent draws miss any of the four strips is at most 4(1 − ε/4)^N, which we would like to be at most δ.
• We have the inequality (1 − x) ≤ e^(−x), so 4(1 − ε/4)^N ≤ 4 e^(−Nε/4). Requiring 4 e^(−Nε/4) ≤ δ and solving for N gives N ≥ (4/ε) log(4/δ) (natural logarithm).
Therefore, provided that we take at least (4/ε) log(4/δ) independent examples from C and use the tightest rectangle as our hypothesis h, with confidence probability at least 1 − δ, a given point will be misclassified with error probability at most ε. For example, with ε = 0.1 and δ = 0.05, the bound asks for N ≥ 40 log(80), that is, about 176 examples.
Issues of PAC learnability
The computational limitation also imposes a polynomial constraint on the training set size, since a learner can process at most a polynomial amount of data in polynomial time.
• How to prove PAC learnability:
– First, prove that the sample complexity of learning C using H is polynomial.
– Second, prove that the learner can train on a polynomial-sized data set in polynomial time.
• To be PAC-learnable, there must be a hypothesis in H with arbitrarily small error for every concept in C; in general C ⊆ H.
4. NOISE
Noise is any unwanted anomaly in the data. Due to noise, the class may be more difficult to learn, and zero error may be infeasible with a simple hypothesis class (see figure 2.8). There are several interpretations of noise:
• There may be imprecision in recording the input attributes, which may shift the data points in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the label of an instance. Such attributes may be hidden or latent in that they may be unobservable. The effect of these neglected attributes is thus modelled as a random component and is included in "noise."
As can be seen in figure 2.8, when there is noise, there is not a simple boundary between the positive and negative instances, and to separate them one needs a complicated hypothesis that corresponds to a hypothesis class with larger capacity. Using the simple rectangle (unless its training error is much bigger) makes more sense because of the following:
1. It is a simple model to use. It is easy to check whether a point is inside or outside a rectangle, and we can easily check, for a future data instance, whether it is a positive or a negative instance.
2. It is a simple model to train and has fewer parameters. It is easier to find the corner values of a rectangle than the control points of an arbitrary shape. With a small training set, when the training instances differ a little bit, we expect the simpler model to change less than a complex model.
3. It is a simple model to explain. A rectangle simply corresponds to defining intervals on the two attributes. By learning a simple model, we can extract information from the raw data given in the training set.
4. If indeed there is mislabeling or noise in the input, and the actual class is really a simple model like the rectangle, then the simple rectangle, because it has less variance and is less affected by single instances, will be a better discriminator than the wiggly shape, although the simple one may make slightly more errors on the training set.
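A quick way to see this last point is to fit a deliberately simple model and a very flexible ("wiggly") model to the same noisy data and compare their training and test accuracies. The sketch below is a synthetic illustration, not from the textbook: the true class is a rectangle in two attributes, 10% of the labels are flipped as teacher noise, and decision trees stand in for the simple and complex hypothesis classes (a shallow tree makes a few axis-aligned splits; an unrestricted tree can carve out an arbitrarily wiggly boundary). On most runs the unrestricted tree fits the training set almost perfectly but does worse on the test set.
# Synthetic illustration: simple vs. complex hypothesis under label noise.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(600, 2))                  # two attributes, e.g. rescaled price and engine power
y = ((X[:, 0] > 0.3) & (X[:, 0] < 0.7) &
     (X[:, 1] > 0.3) & (X[:, 1] < 0.7)).astype(int)   # the true class C is a rectangle
flip = rng.rand(len(y)) < 0.10                        # 10% teacher noise: flip these labels
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

simple_model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
wiggly_model = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_train, y_train)

for name, model in [('simple (depth 4)', simple_model), ('wiggly (no limit)', wiggly_model)]:
    print(name,
          'train accuracy:', round(model.score(X_train, y_train), 2),
          'test accuracy:', round(model.score(X_test, y_test), 2))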
Given comparable empirical error, we say that a simple (but not too simple) model would generalize better than a complex model. This principle is known as Occam’s razor, which states that simpler explanations are more plausible and any unnecessary complexity should be shaved off. Occam’s razor argues that the simplest explanation is the one most likely to be correct. How is Occam’s Razor Relevant in Machine Learning? Occam’s Razor is one of the principles that guides us when we are trying to select the appropriate model for a particular machine learning problem. If the model is too simple, it will make useless predictions. If the model is too complex (loaded with attributes), it will not generalize well. 5. Learning multiple classes In our example of learning a family car, we have positive examples belonging to the class family car and the negative examples belonging to all other cars. This is a two-class problem. In the general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one and exactly one of them. The training set is now of the form 17 An example is given in figure 2.9 with instances from three classes: family car, sports car, and luxury sedan. In machine learning for classification, we would like to learn the boundary separating the instances of one class from the instances of all other classes. Thus, we view a K-class classification problem as K two-class problems. The training examples belonging to Ci are the positive instances of hypothesis hi and the examples of all other classes are the negative instances of hi . Thus in a K-class problem, we have K hypotheses to learn such that For a given x, ideally only one of hi(x), i = 1, . . . , K is 1 and we can choose a class. But when no, or two or more, hi(x) is 1, we cannot choose a class, and this is the case of doubt and the classifier rejects such cases. In our example of learning a family car, we used only one hypothesis and only modeled the positive examples. Any negative example outside is not a family car. Alternatively, sometimes we may prefer to build two hypotheses, one for the positive and the other for the negative instances. 18 This assumes a structure also for the negative instances that can be covered by another hypothesis. Separating family cars from sports cars is such a problem; each class has a structure of its own. The advantage is that if the input is a luxury sedan, we can have both hypotheses decide negative and reject the input. If in a dataset, we expect to have all classes with similar distribution— shapes in the input space—then the same hypothesis class can be used for all classes. For example, in a handwritten digit recognition dataset, we would expect all digits to have similar distributions. But in a medical diagnosis dataset, for example, where we have two classes for sick and healthy people, we may have completely different distributions for the two classes; there may be multiple ways for a person to be sick, reflected differently in the inputs: All healthy people are alike; each sick person is sick in his or her own way. Second Examples: Let us understand the concept in-depth, 1. What is Multi-class Classification? When we solve a classification problem having only two class labels, then it becomes easy for us to filter the data, apply any classification algorithm, train the model with filtered data, and predict the outcomes. 
But when we have more than two class instances in input train data, then it might get complex to analyze the data, train the model, and predict relatively accurate results. To handle these multiple class instances, we use multi-class classification. Multi-class classification is the classification technique that allows us to categorize the test data into multiple class labels present in trained data as a model prediction. There are mainly two types of multi-class classification techniques: • One vs. All (one-vs-rest) • One vs. One 19 2. Binary classification vs. multi-class classification Binary Classification • Only two class instances are present in the dataset. • It requires only one classifier model. • Confusion Matrix is easy to derive and understand. • Example: - Check email is spam or not, predicting gender based on height and weight. Multi-class Classification • Multiple class labels are present in the dataset. • The number of classifier models depends on the classification technique we are applying to. • One vs. All:- N-class instances then N binary classifier models • One vs. One:- N-class instances then N* (N-1)/2 binary classifier models • The Confusion matrix is easy to derive but complex to understand. • Example:- Check whether the fruit is apple, banana, or orange. 3. One vs. All (One-vs-Rest) In one-vs-All classification, for the N-class instances dataset, we have to generate the N-binary classifier models. The number of class labels present in the dataset and the number of generated binary classifiers must be the same. 20 As shown in the above image, consider we have three classes, for example, type 1 for Green, type 2 for Blue, and type 3 for Red. Now, as I told you earlier that we have to generate the same number of classifiers as the class labels are present in the dataset, So we have to create three classifiers here for three respective classes. • Classifier 1:- [Green] vs [Red, Blue] • Classifier 2:- [Blue] vs [Green, Red] • Classifier 3:- [Red] vs [Blue, Green] Now to train these three classifiers, we need to create three training datasets. So let’s consider our primary dataset is as follows, Figure 5: Primary Dataset You can see that there are three class labels Green, Blue, and Red present in the dataset. Now we have to create a training dataset for each class. Here, we created the training datasets by putting +1 in the class column for that feature value, which is aligned to that particular class only. For the costs of the remaining features, we put -1 in the class column. 21 Figure 6: Training dataset for Green class Figure 7: Training dataset for Blue class and Red class Let’s understand it by an example, • Consider the primary dataset, in the first row; we have x1, x2, x3 feature values, and the corresponding class value is G, which means these feature values belong to G class. So we put +1 value in the class column for the correspondence of green type. Then we applied the same for the x10, x11, x12 input train data. • For the rest of the values of the features which are not in correspondence with the Green class, we put -1 in their class column. I hope that you understood the creation of training datasets. Now, after creating a training dataset for each classifier, we provide it to our classifier model and train the model by applying an algorithm. 22 After the training model, when we pass input test data to the model, then that data is considered as input for all generated classifiers. 
If there is any possibility that our input test data belongs to a particular class, then the classifier created for that class gives a positive response in the form of +1, and all other classifier models provide an adverse reaction in the way of -1. Similarly, binary classifier models predict the probability of correspondence with concerning classes. By analyzing the probability scores, we predict the result as the class index having a maximum probability score. • Let’s understand with one example by taking three test features values as y1, y2, and y3, respectively. • We passed test data to the classifier models. We got the outcome in the form of a positive rating derived from the Green class classifier with a probability score of (0.9). • Again We got a positive rating from the Blue class with a probability score of (0.4) along with a negative classification score from the remaining Red classifier. • Hence, based on the positive responses and decisive probability score, we can say that our test input belongs to the Green class. 23 4. One vs. One (OvO) In One-vs-One classification, for the N-class instances dataset, we have to generate the N* (N-1)/2 binary classifier models. Using this classification approach, we split the primary dataset into one dataset for each class opposite to every other class. Taking the above example, we have a classification problem having three types: Green, Blue, and Red (N=3). We divide this problem into N* (N-1)/2 = 3 binary classifier problems: • Classifier 1: Green vs. Blue • Classifier 2: Green vs. Red • Classifier 3: Blue vs. Red Each binary classifier predicts one class label. When we input the test data to the classifier, then the model with the majority counts is concluded as a result. 24 Chapter-II Regression: Simple linear regression, multiple linear regression, model selection and generalization, Dimensions of supervised Machine learning algorithm, Bayesian classification. …………………………………………………………………………………………………………………………….. Regression analysis is a statistical method to model the relationship between a dependent (target) and independent (predictor) variables with one or more independent variables. More specifically, Regression analysis helps us to understand how the value of the dependent variable is changing corresponding to an independent variable when other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc. We can understand the concept of regression analysis using the below example: Example: Suppose there is a marketing company A, who does various advertisement every year and get sales on that. The below list shows the advertisement made by the company in the last 5 years and the corresponding sales: Now, the company wants to do the advertisement of $200 in the year 2019 and wants to know the prediction about the sales for this year. So, to solve such type of prediction problems in machine learning, we need regression analysis. Regression is a supervised learning technique, which helps in finding the correlation between variables and enables us to predict the continuous output variable based on the one or more predictor variables. It is mainly used for prediction, forecasting, time series modeling, and determining the causal-effect relationship between variables. In Regression, we plot a graph between the variables which best fits the given datapoints, using this plot, the machine learning model can make predictions about the data. 
In simple words, "Regression shows a line or curve that passes through all the datapoints on target-predictor graph in such a way that the vertical distance between the datapoints and the regression line is minimum." The distance between datapoints and line tells whether a model has captured a strong relationship or not. Some examples of regression can be as: • Prediction of rain using temperature and other factors • Determining Market trends • Prediction of road accidents due to rash driving. 25 Terminologies Related to the Regression Analysis: • Dependent Variable: The main factor in Regression analysis which we want to predict or understand is called the dependent variable. It is also called target variable. • Independent Variable: The factors which affect the dependent variables or which are used to predict the values of the dependent variables are called independent variable, also called as a predictor. • Outliers: Outlier is an observation which contains either very low value or very high value in comparison to other observed values. An outlier may hamper the result, so it should be avoided. • Multicollinearity: If the independent variables are highly correlated with each other than other variables, then such condition is called Multicollinearity. It should not be present in the dataset, because it creates problem while ranking the most affecting variable. • Underfitting and Overfitting: If our algorithm works well with the training dataset but not well with test dataset, then such problem is called Overfitting. And if our algorithm does not perform well even with training dataset, then such problem is called underfitting. Why do we use Regression Analysis? As mentioned above, Regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need some future predictions such as weather condition, sales prediction, marketing trends, etc., for such case we need some technology which can make predictions more accurately. So for such case we need Regression analysis which is a statistical method and used in machine learning and data science. Below are some other reasons for using Regression analysis: • Regression estimates the relationship between the target and the independent variable. • It is used to find the trends in data. • It helps to predict real/continuous values. • By performing the regression, we can confidently determine the most important factor, the least important factor, and how each factor is affecting the other factors. Types of Regression Algorithms There are various types of regressions which are used in data science and machine learning. Each type has its own importance on different scenarios, but at the core, all the regression methods analyze the effect of the independent variable on dependent variables. Here we are discussing some important types of regression which are given below: 1. Linear regression Linear Regression is an ML algorithm used for supervised learning. Linear regression performs the task to predict a dependent variable(target) based on the given independent variable(s). So, this regression technique finds out a linear relationship between a dependent variable and the other given independent variables. Hence, the name of this algorithm is Linear Regression. In the figure above, on X-axis is the independent variable and on Y-axis is the output. The regression line is the best fit line for a model. And our main objective in this algorithm is to find this best fit line. 
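For a single input, the best-fit line y = a0 + a1x can be computed directly from the training data with the least-squares formulas a1 = sum((x − mean(x)) * (y − mean(y))) / sum((x − mean(x))^2) and a0 = mean(y) − a1 * mean(x). The short sketch below does this for some made-up advertisement/sales figures (in the spirit of the company-A example earlier; the numbers are illustrative, not the original table) and checks the result against scikit-learn.
# Least-squares fit of y = a0 + a1*x by hand; the advertisement/sales
# numbers are invented for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([90.0, 120.0, 150.0, 100.0, 130.0])        # advertisement spend
y = np.array([1000.0, 1300.0, 1800.0, 1150.0, 1400.0])  # corresponding sales

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print('slope a1 =', round(a1, 2), ', intercept a0 =', round(a0, 2))

# Prediction for a new advertisement budget of $200
print('predicted sales:', round(a0 + a1 * 200, 2))

# The same coefficients found by scikit-learn, for comparison
reg = LinearRegression().fit(x.reshape(-1, 1), y)
print('sklearn:', round(reg.intercept_, 2), round(reg.coef_[0], 2))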
Pros:
• Linear Regression is simple to implement.
• Less complexity compared to other algorithms.
• Linear Regression may lead to over-fitting, but this can be avoided using dimensionality reduction techniques, regularization techniques, and cross-validation.
Cons:
• Outliers affect this algorithm badly.
• It over-simplifies real-world problems by assuming a linear relationship among the variables, hence it is not recommended for many practical use-cases.
2. Decision Tree
Decision tree models can be applied to any data that contains numerical features and categorical features. Decision trees are good at capturing non-linear interaction between the features and the target variable. Decision trees somewhat match human-level thinking, so it is very intuitive to understand the data. For example, if we are classifying how many hours a kid plays in particular weather, then the decision tree looks somewhat like the one shown in the image above. So, in short, a decision tree is a tree where each node represents a feature, each branch represents a decision, and each leaf represents an outcome (a numerical value in the case of regression).
Pros:
• Easy to understand and interpret, visually intuitive.
• It can work with numerical and categorical features.
• Requires little data preprocessing: no need for one-hot encoding, dummy variables, etc.
Cons:
• It tends to overfit.
• A small change in the data tends to cause a big difference in the tree structure, which causes instability.
3. Support Vector Regression
You must have heard about SVM, i.e., the Support Vector Machine. SVR uses the same idea as SVM, but here it tries to predict real values. This algorithm uses hyperplanes to segregate the data. In case this separation is not possible, it uses the kernel trick, where the dimension is increased so that the data points become separable by a hyperplane.
In the figure above, the blue line is the hyperplane and the red lines are the boundary lines; all the data points are within the boundary lines. The main objective of SVR is to consider the points that are within the boundary lines.
Pros:
• Robust to outliers.
• Excellent generalization capability.
• High prediction accuracy.
Cons:
• Not suitable for large datasets.
• It does not perform very well when the data set has more noise.
4. Lasso Regression
• LASSO stands for Least Absolute Shrinkage and Selection Operator. Shrinkage is basically defined as a constraint on attributes or parameters.
• The algorithm operates by finding and applying a constraint on the model attributes that causes the regression coefficients for some variables to shrink toward zero.
• Variables with a regression coefficient of zero are excluded from the model.
• So, lasso regression analysis is basically a shrinkage and variable selection method, and it helps to determine which of the predictors are most important.
Pros:
• It avoids overfitting.
Cons:
• LASSO will select only one feature from a group of correlated features.
• Selected features can be highly biased.
5. Random Forest Regressor
Random Forests are an ensemble (combination) of decision trees. It is a Supervised Learning algorithm used for classification and regression. The input data is passed through multiple decision trees. It executes by constructing a number of decision trees at training time and outputting the class that is the mode of the classes (for classification) or the mean prediction (for regression) of the individual trees.
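To see the ensemble idea in code, the minimal sketch below fits a random forest (an average over 100 decision trees) to a small synthetic, non-linear dataset. The data and the hyperparameter choices are illustrative assumptions only, not taken from the source.
# Random forest regression on synthetic non-linear data (illustrative only).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.uniform(-3, 3, size=(300, 1))                     # a single input feature
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=300)   # non-linear target plus noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestRegressor(n_estimators=100, random_state=0)  # 100 trees; their predictions are averaged
forest.fit(X_train, y_train)
print('R^2 on the test set:', round(forest.score(X_test, y_test), 3))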
Pros: • Good at learning complex and non-linear relationships • Very easy to interpret and understand Cons: • They are prone to overfitting • Using larger random forest ensembles to achieve higher performance slows down their speed and then they also need more memory. 29 Linear Regression in Machine Learning Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables such as sales, salary, age, product price, etc. Linear regression algorithm shows a linear relationship between a dependent (y) and one or more independent (y) variables, hence called as linear regression. Since linear regression shows the linear relationship, which means it finds how the value of the dependent variable is changing according to the value of the independent variable. The linear regression model provides a sloped straight line representing the relationship between the variables. Consider the below image: Mathematically, we can represent a linear regression as: y= a0+a1x+ ε Here, Y= Dependent Variable (Target Variable) X= Independent Variable (predictor Variable) a0= intercept of the line (Gives an additional degree of freedom) a1 = Linear regression coefficient (scale factor to each input value). ε = random error The values for x and y variables are training datasets for Linear Regression model representation. Types of Linear Regression Linear regression can be further divided into two types of the algorithm: • • Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression. Multiple Linear regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression. 30 Linear Regression Line A linear line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship: • Positive Linear Relationship: If the dependent variable increases on the Y-axis and independent variable increases on X-axis, then such a relationship is termed as a Positive linear relationship. • Negative Linear Relationship: If the dependent variable decreases on the Y-axis and independent variable increases on the X-axis, then such a relationship is called a negative linear relationship. Assumptions of Linear Regression Below are some important assumptions of Linear Regression. These are some formal checks while building a Linear Regression model, which ensures to get the best possible result from the given dataset. • • Linear relationship between the features and target: Linear regression assumes the linear relationship between the dependent and independent variables. Small or no multicollinearity between the features: Multicollinearity means high-correlation between the independent variables. Due to multicollinearity, it may difficult to find the true relationship between the predictors and target variables. Or we can say, it is difficult to determine which predictor variable is affecting the target variable and which is 31 • • • not. So, the model assumes either little or no multicollinearity between the features or independent variables. 
Homoscedasticity Assumption: Homoscedasticity is a situation when the error term is the same for all the values of independent variables. With homoscedasticity, there should be no clear pattern distribution of data in the scatter plot. Normal distribution of error terms: Linear regression assumes that the error term should follow the normal distribution pattern. If error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulties in finding coefficients. It can be checked using the q-q plot. If the plot shows a straight line without any deviation, which means the error is normally distributed. No autocorrelations: The linear regression model assumes no autocorrelation in error terms. If there will be any correlation in the error term, then it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between residual errors. 1. Simple linear regression Simple Linear Regression is a type of Regression algorithms that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear or a sloped straight line, hence it is called Simple Linear Regression. The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values. Simple Linear regression algorithm has mainly two objectives: • Model the relationship between the two variables. Such as the relationship between Income and expenditure, experience and Salary, etc. • Forecasting new observations. Such as Weather forecasting according to temperature, Revenue of a company according to the investments in a year, etc. Simple Linear Regression Model: The Simple Linear Regression model can be represented using the below equation: y= a0+a1x+ ε Where, a0= It is the intercept of the Regression line (can be obtained putting x=0) a1= It is the slope of the regression line, which tells whether the line is increasing or decreasing. ε = The error term. (For a good model it will be negligible) Implementation of Simple Linear Regression Algorithm using Python Problem Statement example for Simple Linear Regression: Here we are taking a dataset that has two variables: salary (dependent variable) and experience (Independent variable). The goals of this problem is: • We want to find out if there is any correlation between these two variables • We will find the best fit line for the dataset. • How the dependent variable is changing by changing the independent variable. 32 Here, we will create a Simple Linear Regression model to find out the best fitting line for representing the relationship between these two variables. To implement the Simple Linear regression model in machine learning using Python, we need to follow the below steps: Step-1: Data Pre-processing The first step for creating the Simple Linear Regression model is data pre-processing. We have already done it earlier in this tutorial. But there will be some changes, which are given in the below steps: a) First, we will import the three important libraries, which will help us for loading the dataset, plotting the graphs, and creating the Simple Linear Regression model. # Importing the libraries import numpy as np import matplotlib.pyplot as plt import pandas as pd b) Next, we will load the dataset into our code. 
After that, we need to extract the dependent and independent variables from the given dataset. The independent variable is years of experience, and the dependent variable is salary. # Importing the dataset dataset = pd.read_csv('Salary_Data.csv') X = dataset.iloc[:, :-1].values y = dataset.iloc[:, 1].values In the above lines of code, for x variable, we have taken -1 value since we want to remove the last column from the dataset. For y variable, we have taken 1 value as a parameter, since we want to extract the second column and indexing starts from the zero. c) Next, we will split both variables into the test set and training set. We have 30 observations, so we will take 20 observations for the training set and 10 observations for the test set. We are splitting our dataset so that we can train our model using a training dataset and then test the model using a test dataset. # Splitting the dataset into the Training set and Test set from sklearn.model_selection import train_test_split X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0) Step-2: Fitting the Simple Linear Regression to the Training Set: Now the second step is to fit our model to the training dataset. To do so, we will import the LinearRegression class of the linear_model library from the scikit learn. After importing the class, we are going to create an object of the class named as a regressor. # Fitting Simple Linear Regression to the Training set from sklearn.linear_model import LinearRegression regressor = LinearRegression() regressor.fit(X_train, y_train) 33 In the above code, we have used a fit() method to fit our Simple Linear Regression object to the training set. In the fit() function, we have passed the x_train and y_train, which is our training dataset for the dependent and an independent variable. We have fitted our regressor object to the training set so that the model can easily learn the correlations between the predictor and target variables. Step: 3. Prediction of test set result: dependent (salary) and an independent variable (Experience). So, now, our model is ready to predict the output for the new observations. In this step, we will provide the test dataset (new observations) to the model to check whether it can predict the correct output or not. We will create a prediction vector y_pred, which will contain predictions of test dataset, and prediction of training set respectively. # Predicting the Test set results y_pred = regressor.predict(X_test) Step: 4. visualizing the Training set results: Now in this step, we will visualize the training set result. To do so, we will use the scatter() function of the pyplot library, which we have already imported in the pre-processing step. The scatter () function will create a scatter plot of observations. In the x-axis, we will plot the Years of Experience of employees and on the y-axis, salary of employees. In the function, we will pass the real values of training set, which means a year of experience x_train, training set of Salaries y_train, and color of the observations. Here we are taking a green color for the observation, but it can be any color as per the choice. Now, we need to plot the regression line, so for this, we will use the plot() function of the pyplot library. In this function, we will pass the years of experience for training set, predicted salary for training set x_pred, and color of the line. Next, we will give the title for the plot. 
Step-4: Visualizing the Training set results:
In this step, we visualize the training set results. To do so, we use the scatter() function of the pyplot library, which we already imported in the pre-processing step. The scatter() function creates a scatter plot of the observations, with years of experience on the x-axis and salary on the y-axis. We pass the real values of the training set, i.e. the years of experience X_train, the training salaries y_train, and a colour for the observations (red here, but it can be any colour of your choice). Next, we plot the regression line using the plot() function of the pyplot library, passing the years of experience of the training set, the predicted salaries for the training set regressor.predict(X_train), and a colour for the line. Then we give the plot a title using the title() function of the pyplot library ("Salary vs Experience (Training set)") and label the axes using the xlabel() and ylabel() functions.
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
In the resulting plot, the real observations appear as red dots and the predicted values lie along the blue regression line. The regression line shows the correlation between the dependent and independent variable. The goodness of fit can be judged from the differences between actual and predicted values; since most of the observations lie close to the regression line, our model fits the training set well.

Step-5: Visualizing the Test set results:
In the previous step, we visualized the performance of our model on the training set. Now we do the same for the test set. The code stays the same, except that we use X_test and y_test for the scatter plot instead of X_train and y_train (the regression line is still drawn from the training fit).
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

2. Multiple linear regression
Multiple Linear Regression (MLR) is an important regression algorithm that models the linear relationship between a single continuous dependent variable and more than one independent variable.
Example: Prediction of CO2 emission based on engine size and the number of cylinders in a car.
Some key points about MLR:
• For MLR, the dependent or target variable (y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
• Each feature variable must model a linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data points.
The multiple regression equation takes the following form:
y = b0 + b1x1 + b2x2 + … + bnxn
Where,
y = output/response variable
b0, b1, b2, …, bn = coefficients of the model (b0 is the intercept)
x1, x2, …, xn = the independent/feature variables
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the target and predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the independent variables) in the data.
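The third assumption (little or no multicollinearity) can be checked quickly with a correlation matrix of the predictors. The sketch below uses a small made-up DataFrame purely for illustration; in practice you would run it on the numeric columns of the actual dataset.

import pandas as pd

# Made-up numeric predictors, in the same spirit as the start-up data used below.
X_demo = pd.DataFrame({
    'R&D Spend':       [100000, 120000, 90000, 150000, 110000],
    'Administration':  [130000, 125000, 140000, 120000, 135000],
    'Marketing Spend': [400000, 450000, 380000, 500000, 420000],
})

# Pairwise correlations: values close to +1 or -1 between two predictors signal multicollinearity.
print(X_demo.corr())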
Implementation of Multiple Linear Regression model using Python:
To implement MLR using Python, we have the following problem.
Problem Description: We have a dataset of 50 start-up companies. This dataset contains five main columns: R&D Spend, Administration Spend, Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that can predict the profit of a company and determine which factor affects the profit the most. Since we need to find the Profit, it is the dependent variable, and the other four variables are independent variables. Below are the main steps of deploying the MLR model:
1. Data pre-processing
2. Fitting the MLR model to the training set
3. Predicting the result of the test set

Step-1: Data Pre-processing Step:
The very first step is data pre-processing, which we have already discussed in this tutorial. This process contains the steps below:
• Importing libraries: Firstly, we import the libraries which will help in building the model. Below is the code for it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
• Importing the dataset: Now we import the dataset (50_Startups.csv), which contains all the variables, and extract the dependent and independent variables from it.
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4]
• Convert the categorical State column into dummy variables:
states = pd.get_dummies(X['State'], drop_first = True)
• Drop the State column:
X = X.drop('State', axis = 1)
• Concatenate the dummy variables:
X = pd.concat([X, states], axis = 1)
• Now we split the dataset into a training set and a test set.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

Step-2: Fitting our MLR model to the Training set:
Now that the dataset is prepared, we fit our regression model to the training set. This is similar to what we did for the Simple Linear Regression model.
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

Step-3: Prediction of Test set results:
The last step is checking the performance of the model. We do this by predicting the test set results. For prediction, we create a y_pred vector.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
Finally, we check the results, for example by computing the R² score on both sets with regressor.score(X_train, y_train) and regressor.score(X_test, y_test). The scores tell us that our model is about 95% accurate on the training dataset and 93% accurate on the test dataset.
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression:
• Measuring the effectiveness of the independent variables on the prediction.
• Predicting the impact of changes in the independent variables.

3. MODEL SELECTION AND GENERALIZATION
Let us assume that we are learning a Boolean function from examples. In a Boolean function, all inputs and the output are binary. There are 2^d possible ways to write d binary values and therefore, with d inputs, the training set has at most 2^d examples. Each distinct training example removes half the hypotheses, namely, those whose guesses are wrong. We start with all possible hypotheses, and as we see more training examples, we remove those hypotheses that are not consistent with the training data. In the case of a Boolean function, to end up with a single hypothesis we need to see all 2^d training examples. If the training set we are given contains only a small subset of all possible instances, as it generally does (that is, if we know what the output should be for only a small percentage of the cases), the solution is not unique. After seeing N example cases, there remain 2^(2^d − N) possible functions. This is an example of an ill-posed problem, where the data by itself is not sufficient to find a unique solution. So, because learning is ill-posed and data by itself is not sufficient to find the solution, we should make some extra assumptions to have a unique solution with the data we have.
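To make these counts concrete, here is a small illustrative check (not from the source text): with d = 2 binary inputs there are 2^2 = 4 possible input rows and 2^(2^2) = 16 distinct Boolean functions; after seeing N = 2 labelled rows, 2^(2^d − N) = 4 functions remain consistent with the data.

from itertools import product

d = 2
rows = list(product([0, 1], repeat=d))            # the 4 possible binary input rows
train = {(0, 0): 0, (1, 1): 1}                    # N = 2 labelled examples (made up)

consistent = 0
for labels in product([0, 1], repeat=len(rows)):  # each labelling of the 4 rows is one Boolean function
    f = dict(zip(rows, labels))
    if all(f[x] == y for x, y in train.items()):  # keep only functions that agree with the training data
        consistent += 1

print(consistent)                                 # prints 4 = 2 ** (2 ** d - N)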
The set of assumptions we make to make learning possible is called the inductive bias of the learning algorithm. Thus, learning is not possible without inductive bias, and the question now is how to choose the right bias. This is called model selection, which is choosing between possible hypothesis classes H. How well a model trained on the training set predicts the right output for new instances is called generalization. Model selection is the process of choosing one of the models as the final model that addresses the problem.

For the best generalization, we should match the complexity of the hypothesis class H with the complexity of the function underlying the data. If H is less complex than the function, we have underfitting. But if H is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis h ∈ H, for example, when fitting two rectangles to data sampled from one rectangle. Or, if there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise in the data and make a bad fit, for example, when fitting a sixth-order polynomial to noisy data sampled from a third-order polynomial. This is called overfitting.

In all learning algorithms that are trained from example data, there is a trade-off between three factors:
• the complexity of the hypothesis we fit to the data, namely, the capacity of the hypothesis class,
• the amount of training data, and
• the generalization error on new examples.
As the amount of training data increases, the generalization error decreases.
To estimate the generalization error, we need data unseen during training. We split the data as:
• Training set (50%)
• Validation set (25%)
• Test (publication) set (25%)
Resampling is used when there is little data.

4. Dimensions of a Supervised Machine Learning Algorithm
Let us now recapitulate and generalize. We have a sample X = {x^t, r^t}, t = 1, …, N. The sample is independent and identically distributed (iid); t indexes one of the N instances, x^t is the arbitrary-dimensional input, and r^t is the associated desired output. The aim is to build a good and useful approximation to r^t using the model g(x^t | θ). In doing this, there are three decisions we must make:
1. The model we use in learning, denoted g(x | θ), where g(·) is the model, x is the input, and θ are the parameters. g(·) defines the hypothesis class H, and a particular value of θ instantiates one hypothesis h ∈ H. For example, in class learning, we took a rectangle as our model, whose four coordinates make up θ.
2. The loss function, L(·), to compute the difference between the desired output, r^t, and our approximation to it, g(x^t | θ), given the current value of the parameters θ. The approximation error, or loss, is the sum of losses over the individual instances:
E(θ | X) = Σ_t L(r^t, g(x^t | θ))
For example, in class learning where outputs are 0/1, L(·) checks whether the prediction equals the desired output.
3. The optimization procedure to find θ* that minimizes the total error:
θ* = arg min_θ E(θ | X)
where arg min returns the argument that minimizes the error.
For this setting to work well, the following conditions should be satisfied:
1. The hypothesis class of g(·) should be large enough.
2. There should be enough training data to allow us to pinpoint the correct (or a good enough) hypothesis from the hypothesis class.
3. We should have a good optimization method that finds the correct hypothesis given the training data.
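The following is a minimal sketch of these three ingredients for a one-dimensional regression model, using made-up data; a crude grid search stands in for a proper optimizer purely to illustrate the arg min step.

import numpy as np

# Made-up training sample (x^t, r^t).
X = np.array([1.0, 2.0, 3.0, 4.0])
r = np.array([2.1, 3.9, 6.2, 8.1])

def g(x, theta):                       # 1. the model g(x | theta), here a straight line
    a0, a1 = theta
    return a0 + a1 * x

def total_loss(theta):                 # 2. squared-error loss summed over the instances
    return np.sum((r - g(X, theta)) ** 2)

# 3. optimization: exhaustive grid search for theta* = arg min E(theta | X)
grid = [(a0, a1) for a0 in np.linspace(-2, 2, 41) for a1 in np.linspace(0, 4, 41)]
theta_star = min(grid, key=total_loss)
print(theta_star)                      # should land near a0 ≈ 0, a1 ≈ 2 for this data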
5. Bayesian classification
Bayes' theorem is one of the most popular concepts in machine learning; it helps to calculate the probability of one event occurring, with uncertain knowledge, given that another event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X given event Y:
• According to the product rule, the probability of events X and Y occurring together can be expressed as:
P(X ∩ Y) = P(X | Y) P(Y)    (equation 1)
• Similarly, in terms of the probability of event Y given event X:
P(X ∩ Y) = P(Y | X) P(X)    (equation 2)
Equating the right-hand sides of the two equations and dividing by P(Y), we get:
P(X | Y) = [P(Y | X) P(X)] / P(Y)
Note that this holds for any two events X and Y with P(Y) > 0; it does not require the events to be independent. The above equation is called Bayes' Rule or Bayes' Theorem.
• P(X | Y) is called the posterior, which is what we need to calculate. It is the updated probability after considering the evidence.
• P(Y | X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
• P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
• P(Y) is called the marginal probability. It is the probability of the evidence under any consideration.
Hence, Bayes' theorem can be written as: posterior = likelihood * prior / evidence.

Prerequisites for Bayes Theorem
While studying the Bayes theorem, we need to understand a few important concepts. These are as follows:
1. Experiment
An experiment is a planned operation carried out under controlled conditions, such as tossing a coin, drawing a card, or rolling a die.
2. Sample Space
The results we can get from an experiment are called its possible outcomes, and the set of all possible outcomes of an experiment is known as the sample space. For example, if we are rolling a die, the sample space will be:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording its outcome, the sample space will be:
S2 = {Head, Tail}
3. Event
An event is a subset of the sample space of an experiment; in other words, it is a set of outcomes. Assume that in our experiment of rolling a die there are two events A and B such that:
A = the event that an even number is obtained = {2, 4, 6}
B = the event that a number greater than 4 is obtained = {5, 6}
• Probability of event A:
P(A) = number of favourable outcomes / total number of possible outcomes = 3/6 = 1/2 = 0.5
• Similarly, probability of event B:
P(B) = number of favourable outcomes / total number of possible outcomes = 2/6 = 1/3 ≈ 0.333
• Union of events A and B: A ∪ B = {2, 4, 5, 6}
• Intersection of events A and B: A ∩ B = {6}
• Disjoint events: If the intersection of events A and B is the empty set, then A and B are known as disjoint or mutually exclusive events.
4. Random Variable
A random variable is a real-valued function that maps the sample space of an experiment onto the real line. It takes on values, each with some probability. Despite its name, it is neither random nor a variable; it behaves as a function, and it can be discrete, continuous, or a combination of both.
5. Exhaustive Events
As the name suggests, a set of events is exhaustive if at least one of them must occur in the experiment. Two events A and B are exhaustive and mutually exclusive when exactly one of them must occur; for example, when tossing a coin, the outcome is either a Head or a Tail.
6. Independent Events
Two events are said to be independent when the occurrence of one event does not affect the occurrence of the other. In simple words, the probability of the outcome of one event does not depend on the other. Mathematically, two events A and B are independent if:
P(A ∩ B) = P(AB) = P(A) * P(B)
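A quick numerical check of the die example above (illustrative only), using Python sets and exact fractions:

from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                    # sample space for rolling a die
A = {2, 4, 6}                             # event: an even number
B = {5, 6}                                # event: a number greater than 4

P = lambda E: Fraction(len(E), len(S))    # probability under equally likely outcomes
print(P(A))                               # 1/2
print(P(B))                               # 1/3
print(A | B, P(A | B))                    # union {2, 4, 5, 6}, probability 2/3
print(A & B, P(A & B))                    # intersection {6}, probability 1/6
print(P(A & B) == P(A) * P(B))            # True: these two events happen to be independent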
7. Conditional Probability
Conditional probability is the probability of an event A given that another event B has already occurred (i.e. A conditional on B). It is denoted by P(A | B) and defined as:
P(A | B) = P(A ∩ B) / P(B)
8. Marginal Probability
Marginal probability is the probability of an event A occurring irrespective of any other event B. It is also viewed as the probability of the evidence under any consideration:
P(A) = P(A | B) * P(B) + P(A | ~B) * P(~B)
Here ~B represents the event that B does not occur.

How to apply Bayes Theorem or Bayes rule in Machine Learning?
Bayes' theorem lets us calculate the term P(B | A) in terms of P(A | B), P(B), and P(A). This rule is very helpful in scenarios where we have good estimates of P(A | B), P(B), and P(A) and need to determine the fourth term, P(B | A). The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is used in classification algorithms to assign data to classes quickly and with good accuracy.
Let's understand the use of Bayes' theorem in machine learning with the example below.
Suppose we have a feature vector A with i attributes, A = (A1, A2, A3, …, Ai). Further, we have n classes represented as C1, C2, C3, …, Cn. Given these, our classifier has to predict the best possible class for A. With the help of Bayes' theorem, we can write:
P(Ci | A) = [P(A | Ci) * P(Ci)] / P(A)
Here, P(A) is a class-independent quantity: it remains constant across all classes, so it can be ignored when comparing classes. Therefore, to maximize P(Ci | A), we only have to maximize the term P(A | Ci) * P(Ci). If we further assume that each of the n classes is equally likely a priori, then P(C1) = P(C2) = P(C3) = … = P(Cn), and the comparison reduces to maximizing P(A | Ci), which reduces the computation cost and time. This is how Bayes' theorem plays a significant role in machine learning. The Naïve Bayes assumption simplifies the conditional probability computation further, without greatly affecting precision, by treating the attributes as conditionally independent given the class:
P(A | Ci) = P(A1 | Ci) * P(A2 | Ci) * P(A3 | Ci) * … * P(Ai | Ci)
Hence, by using Bayes' theorem in machine learning, we can describe the probability of a complex event in terms of the probabilities of smaller events.
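A minimal sketch of this decision rule, choosing the class Ci that maximizes P(Ci) * P(A1 | Ci) * … * P(Ai | Ci); all probability values below are made up purely for illustration:

# Pick the class with the largest prior * product of per-attribute likelihoods.
priors = {'C1': 0.5, 'C2': 0.5}
likelihoods = {             # P(Ak = observed value | Ci) for two attributes (made-up numbers)
    'C1': [0.8, 0.3],
    'C2': [0.2, 0.6],
}

def posterior_score(c):
    score = priors[c]
    for p in likelihoods[c]:
        score *= p
    return score

best = max(priors, key=posterior_score)
print(best)                 # 'C1' here, since 0.5*0.8*0.3 = 0.12 > 0.5*0.2*0.6 = 0.06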
2. Naive Bayes learning algorithm
• The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The name Naïve Bayes combines the two words Naïve and Bayes, which can be described as:
• Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the other features.
• Bayes: It is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
• Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis with prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is:
P(A | B) = [P(B | A) * P(A)] / P(B)
Where,
P(A | B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B | A) is the likelihood probability: the probability of the evidence given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of the Naïve Bayes' Classifier:
The working of the Naïve Bayes' classifier can be understood with the help of the example below.
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play or not on a particular day according to the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the weather dataset, then build:
Frequency table for the weather conditions (counts of Yes/No for each outlook).
Likelihood table for the weather conditions (probabilities derived from the frequency table).
Applying Bayes' theorem:
P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
P(Sunny | Yes) = 3/10 = 0.3
P(Sunny) = 0.35
P(Yes) = 0.71
So P(Yes | Sunny) = 0.3 * 0.71 / 0.35 ≈ 0.60
P(No | Sunny) = P(Sunny | No) * P(No) / P(Sunny)
P(Sunny | No) = 2/4 = 0.5
P(No) = 0.29
P(Sunny) = 0.35
So P(No | Sunny) = 0.5 * 0.29 / 0.35 ≈ 0.41
Since P(Yes | Sunny) > P(No | Sunny), on a sunny day the player can play the game.
Advantages of the Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to other algorithms.
• It is the most popular choice for text classification problems.
Disadvantages of the Naïve Bayes Classifier:
• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
Applications of the Naïve Bayes Classifier:
• It is used for credit scoring.
• It is used in medical data classification.
• It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.
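As a practical illustration (not from the source text), a Naïve Bayes classifier can also be trained with scikit-learn. The toy data below is a made-up, integer-encoded outlook feature, not the weather table used in the worked example above:

import numpy as np
from sklearn.naive_bayes import CategoricalNB

# 0 = Sunny, 1 = Overcast, 2 = Rainy (Outlook is the only feature here, for brevity; data is made up).
X = np.array([[0], [0], [1], [2], [2], [1], [0], [2]])
y = np.array([0, 0, 1, 1, 0, 1, 1, 0])        # 1 = Play, 0 = Don't play

clf = CategoricalNB()
clf.fit(X, y)
print(clf.predict([[0]]))                     # predicted class for a Sunny day
print(clf.predict_proba([[0]]))               # posterior probabilities P(class | Sunny)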