UNIT – III - CHAPTER – II
SUPERVISED LEARNING : CLASSIFICATION

In supervised learning, the labelled training data provide the basis for learning. According to the definition of machine learning, this labelled training data is the experience or prior knowledge. Some more examples of supervised learning are as follows:
• Prediction of the results of a game based on the analysis of past results
• Predicting whether a tumor is malignant or benign on the basis of the analysis of data
• Price prediction in domains such as real estate, stocks, etc.

Training data is past information with a known value of the class field or ‘label’. Hence, we say that the ‘training data is labelled’ in the case of supervised learning (refer Fig.). Contrary to this, there is no labelled training data for unsupervised learning. Semi-supervised learning, as depicted in the figure, uses a small amount of labelled data along with unlabelled data for training.

CLASSIFICATION MODEL

When we are trying to predict a categorical or nominal variable, the problem is known as a classification problem. A classification problem is one where the output variable is a category such as ‘red’ or ‘blue’, or ‘malignant tumor’ or ‘benign tumor’. In contrast, when we are trying to predict a numerical variable such as ‘price’ or ‘weight’, the problem falls under the category of regression. Supervised machine learning is only as good as the data used to train it: if the training data is poor in quality, the prediction will also be far from precise.

The following figure depicts the typical process of classification, where a classification model is obtained from the labelled training data by a classifier algorithm. On the basis of the model, a class label (e.g. ‘Intel’, as in the case of the test data referred to in the figure) is assigned to the test data (Fig.).
Fig. Classification model

Classification is a type of supervised learning where a target feature, which is of categorical type, is predicted for test data on the basis of the information imparted by the training data. The target categorical feature is known as the class. Some typical classification problems include the following:
• Image classification
• Disease prediction
• Win–loss prediction of games
• Prediction of natural calamity such as earthquake, flood, etc.
• Handwriting recognition

CLASSIFICATION LEARNING STEPS

Problem Identification
• The problem needs to be a well-formed problem, i.e. a problem with well-defined goals and benefit, which has a long-term impact.

Identification of Required Data
• The data set that represents the problem needs to be identified/evaluated.
• For example, if the problem is to predict whether a tumor is malignant or benign, then the corresponding patient data sets related to tumors are to be identified.

Data Preprocessing
• The data gathered is in a raw format and is not ready for immediate analysis.
• All the unnecessary/irrelevant data elements are removed.
• It ensures that the data is ready to be fed into the ML algorithm.

Definition of Training Data Set
• The user should decide what kind of data set is to be used as the training set. A set of data inputs (X) and corresponding outputs (Y) is gathered either from human experts or experiments.
• For example, in signature analysis, the training data set might be a single handwritten alphabet, a handwritten word or an entire line.

Algorithm Selection
• This involves determining the structure of the learning function and the corresponding learning algorithm. On the basis of various parameters, the best algorithm for a given problem is chosen.

Training
• The learning algorithm identified in the previous step is run on the training set for further fine tuning.
Evaluation with the Test Data Set
• The test data is run on the trained model, and its performance is measured here. If a suitable result is not obtained, further tuning of parameters may be required.

COMMON CLASSIFICATION ALGORITHMS

Following are the most common classification algorithms:
1. k-Nearest Neighbour (k-NN)
2. Decision tree
3. Random forest
4. Support Vector Machine (SVM)
5. Naïve Bayes classifier

k-Nearest Neighbour (k-NN)

The unknown and unlabelled data which comes for a prediction problem is judged on the basis of the training data set elements which are similar to the unknown element. So, the class label of the unknown element is assigned on the basis of the class labels of the similar training data set elements (these can be considered as neighbours of the unknown element).

Input: Training data set, test data set (or data points), value of ‘k’ (i.e. the number of nearest neighbours to be considered)
Steps:
Do for all test data points
  Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
  Find the closest ‘k’ training data points, i.e. the training data points whose distances are least from the test data point.
  If k = 1
    Then assign the class label of the closest training data point to the test data point
  Else
    Whichever class label is predominantly present among the ‘k’ training data points, assign that class label to the test data point
End do

Let us consider a very simple Student data set as depicted in the figure. It consists of 15 students studying in a class. Each of the students has been assigned a score on a scale of 10 on two performance parameters – ‘Aptitude’ and ‘Communication’. Also, a class value is assigned to each student based on the following criteria:
1. Students having good communication skills as well as a good level of aptitude have been classified as ‘Leader’
2. Students having good communication skills but not so good a level of aptitude have been classified as ‘Speaker’
3. Students having not so good communication skills but a good level of aptitude have been classified as ‘Intel’

While building a classification model, a part of the labelled input data is retained as test data. The remaining portion of the input data is used to train the model – hence known as training data. The motivation to retain a part of the data as test data is to evaluate the performance of the model. The performance of the classification model is measured by the number of correct classifications made by the model when applied to an unknown data set. As depicted in the figure, the record of the student named Josh is assumed to be the test data. Now that we have the training data and test data identified, we can start with the modelling.

In the k-NN algorithm, the class label of the test data elements is decided by the class label of the training data elements which are neighbouring, i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity, or when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test data element?

To answer the first question, though there are many measures of similarity, the most common approach adopted by k-NN to measure similarity between two data elements is Euclidean distance. Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data elements d1 and d2 can be measured by

distance(d1, d2) = sqrt( (f11 − f12)² + (f21 − f22)² )

where f11 = value of feature f1 for data element d1
f21 = value of feature f2 for data element d1
f12 = value of feature f1 for data element d2
f22 = value of feature f2 for data element d2

The training data points of the Student data set, considering only the features ‘Aptitude’ and ‘Communication’, can be represented as dots in a two-dimensional feature space.
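The distance formula and the k-NN voting steps above can be sketched in Python. The (Aptitude, Communication) scores below are illustrative stand-ins, not the actual Student data set from the figure:

```python
from math import sqrt
from collections import Counter

def euclidean(d1, d2):
    """Euclidean distance between two points with features (f1, f2)."""
    return sqrt((d1[0] - d2[0]) ** 2 + (d1[1] - d2[1]) ** 2)

def knn_classify(training, test_point, k):
    """Assign the class label held by the majority of the k nearest training points."""
    neighbours = sorted(training, key=lambda rec: euclidean(rec[0], test_point))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical labelled records: ((Aptitude, Communication), class)
training = [
    ((8, 7), "Leader"),
    ((3, 8), "Speaker"),
    ((8, 2), "Intel"),
    ((7, 3), "Intel"),
    ((2, 7), "Speaker"),
]
print(knn_classify(training, (7, 2), k=3))  # → Intel
```

With k = 1, the same function simply returns the label of the single closest record, matching the If/Else branch in the pseudocode above.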
The reason for considering a two-dimensional data space is that we are considering just the two features of the Student data set, i.e. ‘Aptitude’ and ‘Communication’, for doing the classification. The feature ‘Name’ is ignored because, as we can understand, it has no role to play in deciding the class value. The test data point for student Josh is represented as an asterisk in the same space. To find out the closest or nearest neighbours of the test data point, the Euclidean distance of the different dots needs to be calculated from the asterisk. Then, the class value of the closest neighbours helps in assigning the class value of the test data element.

The answer to the second question lies in the value of ‘k’, which is a user-defined parameter given as an input to the algorithm. In the k-NN algorithm, the value of ‘k’ indicates the number of neighbours that need to be considered. For example, if the value of k is 3, only the three nearest neighbours, i.e. the three training data elements closest to the test data element, are considered. Out of the three data elements, the class which wins the majority vote is considered as the class label to be assigned to the test data. In case the value of k is 1, only the closest training data element is considered, and the class label of that data element is directly assigned to the test data element.

When the value of k is taken as 1, only one training data point needs to be considered. The training record for student Gouri comes as the closest one to the test record of Josh, with a distance value of 1.118. Gouri has class value ‘Intel’, so the test data point is also assigned the class label ‘Intel’. If the value of k is taken as 3, the closest neighbours of Josh in the training data set are Gouri, Susant, and Bobby, with distances being 1.118, 1.414, and 1.5, respectively. Gouri and Bobby have class value ‘Intel’, while Susant has class value ‘Leader’. In this case, the class value of Josh is decided by majority voting.
Because the class value ‘Intel’ is held by the majority of the neighbours, the class value of Josh is assigned as ‘Intel’.

It is difficult to decide the value of k. The reasons are as follows:
• If the value of k is very large, the class label of the majority class of the training data set will be assigned to the test data regardless of the class labels of the neighbours nearest to the test data.
• If the value of k is very small, the class value of a noisy data point or outlier in the training data set which is the nearest neighbour to the test data will be assigned to the test data.
The best k value is somewhere between these two extremes:
• One common practice is to set k equal to the square root of the number of training records.
• An alternative approach is to test several k values on a variety of test data sets and choose the one that delivers the best performance.
• Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the votes of close neighbours are considered more influential than the votes of distant neighbours.

Why is the k-NN algorithm called a lazy learner? The k-NN algorithm stores the training data and directly applies the philosophy of nearest-neighbourhood finding to arrive at the classification when the problem comes. So, for k-NN, there is no learning happening in the real sense. Therefore, k-NN falls under the category of lazy learner.

Strengths of the k-NN algorithm
1. Extremely simple algorithm – easy to understand
2. Very effective in certain situations, e.g. for recommender system design
3. Very fast, or almost no time required, for the training phase

Weaknesses of the k-NN algorithm
1. Does not learn anything in the real sense. Classification is done completely on the basis of the training data. So, it has a heavy reliance on the training data.
2. Because there is no model trained in the real sense and the classification is done completely on the basis of the training data, the classification process is very slow.
3. A large amount of computational space is required to load the training data for classification.

Applications of the k-NN algorithm: recommender systems; searching documents/contents similar to a given document/content. The latter is an area under information retrieval and is known as concept search.

Decision tree

• Decision tree learning is one of the most widely adopted algorithms for classification.
• As the name indicates, it builds a model in the form of a tree structure.
• A decision tree is used for multi-dimensional analysis with multiple classes.
• It is characterized by fast execution time and ease in the interpretation of the rules.
• The goal of decision tree learning is to create a model (based on past data, i.e. past feature vectors) that predicts the value of the output variable based on the input variables in the feature vector.

Each node (or decision node) of a decision tree corresponds to one of the features in the feature vector. From every node, there are edges to children, wherein there is an edge for each of the possible values (or range of values) of the feature associated with the node. The tree terminates at different leaf nodes (or terminal nodes), where each leaf node represents a possible value for the output variable. The output variable is determined by following a path that starts at the root and is guided by the values of the input variables.

A decision tree is usually represented in the format depicted in the figure. Each internal node (represented by boxes) tests an attribute (represented as ‘A’/‘B’ within the boxes). Each branch corresponds to an attribute value (T/F in this case). Each leaf node assigns a classification. The first node is called the ‘Root’ node, and nodes branching from the root node are called ‘Branch’ nodes. Here ‘A’ is the root node (first node) and ‘B’ is a branch node.
‘T’ & ‘F’ are leaf nodes. Thus, a decision tree consists of three types of nodes:
• Root node
• Branch node
• Leaf node

Building a decision tree

Decision trees are built corresponding to the training data following an approach called recursive partitioning. The approach splits the data into subsets on the basis of the feature values. It starts from the root node, which is nothing but the entire data set. It first selects the feature which best predicts the target class and partitions the data, with the data in each partition having a distinct value for the feature based on which the partitioning has happened. This is the first set of branches. Likewise, the algorithm continues splitting the nodes on the basis of the feature which helps in the best partition. This continues till a stopping criterion is reached. The usual stopping criteria are:
1. All or most of the examples at a particular node have the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit

According to the table, the job offer condition (i.e. the outcome) is FALSE for all the cases where Aptitude = Low, irrespective of other conditions. So, the feature Aptitude can be taken up as the first node of the decision tree. For Aptitude = High, the job offer condition is TRUE for all the cases where Communication = Good. For cases where Communication = Bad, the job offer condition is TRUE for all the cases where CGPA = High.

Searching a decision tree

Using the decision tree depicted in the figure, we need to predict whether Chandra might get a job offer for the given parameter values: CGPA = High, Communication = Bad, Aptitude = High, Programming skills = Bad. There are multiple ways to search through the trained decision tree for a solution to the given prediction problem.

Exhaustive search
1. Place the item in the first group (class). Recursively examine solutions with the item in the first group (class).
2. Place the item in the second group (class).
Recursively examine solutions with the item in the second group (class).
3. Repeat the above steps until the solution is reached.
Exhaustive search travels through the decision tree exhaustively, but it will take much time when the decision tree is big, with multiple leaves and multiple attribute values.

To overcome the problem in exhaustive search, the following technique can be adopted:

Branch and bound search
It uses an existing best solution to sidestep searching the entire decision tree in full. When the algorithm starts, the best solution is defined to have the worst possible value; thus, any solution it finds is an improvement. This makes the algorithm initially run down to the left-most branch of the tree, even though that is unlikely to produce a realistic result. In the partitioning problem, that solution corresponds to putting every item in one group, and it is an unacceptable solution. A program can speed up the process by using a fast heuristic to find an initial solution, which can then be used as an input for branch and bound. If the heuristic is right, the savings can be substantial.

Fig. Decision tree based on the training data (depicting a sample path)

The figure depicts a sample path (thick line) for the conditions CGPA = High, Communication = Bad, Aptitude = High and Programming skills = Bad. According to the decision tree, the prediction can be made that Chandra will get the job offer.

There are many implementations of decision trees, the most prominent ones being the C5.0, CART (Classification and Regression Tree), CHAID (Chi-square Automatic Interaction Detector) and ID3 (Iterative Dichotomiser 3) algorithms. The biggest challenge of a decision tree algorithm is to find out which feature to split upon. The main driver for identifying the feature is that the data should be split in such a way that the partitions created by the split contain examples belonging to a single class.
If that happens, the partitions are considered to be pure. Entropy is a measure of the impurity of an attribute or feature, adopted by many algorithms such as ID3 and C5.0. The information gain is calculated on the basis of the decrease in entropy after a data set is split according to a particular attribute (A). Constructing a decision tree is all about finding an attribute that returns the highest information gain (i.e. the most homogeneous branches).

Entropy of a decision tree
Let us say S is the sample set of training examples. Then, the entropy of S, measuring its impurity, is defined as

Entropy(S) = − Σ (i = 1 to c) p_i · log2(p_i)

where c is the number of different class labels and p_i refers to the proportion of values falling into the i-th class label.

For example, with respect to the training data in the figure, we have two values for the target class ‘Job Offered?’ – Yes and No. The value of p for class value ‘Yes’ is 0.44 (i.e. 8/18) and that for class value ‘No’ is 0.56 (i.e. 10/18). So, we can calculate the entropy as

Entropy(S) = −0.44 · log2(0.44) − 0.56 · log2(0.56) ≈ 0.99

Information gain of a decision tree
Information gain for a particular feature A is calculated as the difference between the entropy before the split (Sbs) and the entropy after the split (Sas):

Information Gain (S, A) = Entropy(Sbs) − Entropy(Sas)

Random Forest Algorithm

Random forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression problems in ML. It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model. As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
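The entropy and information-gain formulas can be checked numerically. The sketch below reproduces the 0.99 entropy figure for the 8-Yes/10-No split of the ‘Job Offered?’ class; the helper names are illustrative, not from the text:

```python
from math import log2

def entropy(class_counts):
    """Entropy(S) = -sum over classes of p_i * log2(p_i)."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, child_splits):
    """Entropy before the split minus the weighted entropy after it."""
    total = sum(parent_counts)
    after = sum(sum(child) / total * entropy(child) for child in child_splits)
    return entropy(parent_counts) - after

# 'Job Offered?' target: 8 Yes and 10 No out of 18 records
print(round(entropy([8, 10]), 2))  # → 0.99

# A split into perfectly pure partitions recovers all of the parent's entropy
print(round(information_gain([8, 10], [[8, 0], [0, 10]]), 2))  # → 0.99
```

A pure partition (all records in one class) has entropy 0, which is why the perfect split above attains the maximum possible information gain.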
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of the predictions, predicts the final output. A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.

For example, consider a fruit basket as the data, as shown in the figure. Now n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is decided by majority voting: since the majority of the decision trees give the output ‘apple’ rather than ‘banana’, the final output is taken as ‘apple’.

The following steps explain the working of the random forest algorithm:
Step 1: Select random samples from the given data or training set.
Step 2: Construct a decision tree for every sample.
Step 3: Each decision tree produces a prediction; the predictions are combined by voting (majority voting for classification, averaging for regression).
Step 4: Finally, select the most voted prediction result as the final prediction result.
This combination of multiple models is called an ensemble. Two common ensemble techniques are:
1. Bagging
2. Boosting

Bagging
Bagging, also known as Bootstrap Aggregation, is the ensemble technique used by random forest. Bagging chooses a random sample/random subset from the entire data set; this random sample is called a bootstrap sample. Each bootstrap sample (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) is taken from the actual data with replacement, which means that the same record may appear more than once in a sample. The models (Model 01, Model 02, and Model 03) obtained from these bootstrap samples are trained independently, and each model generates a result. The final output is based on majority voting after combining the results of all models.
This step, which involves combining all the results and generating the output based on majority voting, is known as aggregation.

Boosting
Boosting is an ensemble technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series. Firstly, a model is built from the training data. Then a second model is built which tries to correct the errors present in the first model. This procedure is continued, and models are added, until either the complete training data set is predicted correctly or the maximum number of models is reached.

Why use random forest?
• It takes less training time as compared to other algorithms.
• It predicts output with high accuracy; even for a large dataset it runs efficiently.
• It can also maintain accuracy when a large proportion of the data is missing.

Applications of random forest
There are mainly four sectors where random forest is mostly used:
1. Banking: The banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, risks of disease can be identified.
3. Land use: We can identify areas of similar land use with this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of random forest
• It is capable of performing both classification and regression tasks.
• It is capable of handling large datasets with high dimensionality.
• It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of random forest
• Although random forest can be used for both classification and regression tasks, it is not as well suited for regression tasks.

Support Vector Machines

SVM is a model which can do linear classification as well as regression. Primarily, it is used for classification problems in machine learning.

Two-dimensional feature space
In SVM, a model is built to discriminate the data instances belonging to different classes.
Let us assume, for the sake of simplicity, that the data instances are linearly separable. In this case, when mapped in a two-dimensional space, the data instances belonging to different classes fall on different sides of a straight line drawn in the two-dimensional space, as depicted in the figure. If the same concept is extended to a multi-dimensional feature space, the straight line dividing the data instances belonging to different classes transforms into a hyperplane, as depicted in the figure.

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate the n-dimensional space into classes, so that we can easily put a new data point in the correct category in the future. This best decision boundary is called a hyperplane. SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM. The dimensions of the hyperplane depend on the number of features present in the dataset: if there are 2 features (as shown in the image), then the hyperplane will be a straight line, and if there are 3 features, then the hyperplane will be a two-dimensional plane. We always create the hyperplane that has the maximum margin, i.e. the maximum distance to the data points.

Support vectors: The data points or vectors that are closest to the hyperplane and which affect the position of the hyperplane are termed support vectors. Since these vectors support the hyperplane, they are called support vectors. Consider the diagram in which two different categories are classified using a decision boundary or hyperplane.

The working of the SVM algorithm can be understood by using an example.
Suppose we have a dataset that has two tags (green and blue), and the dataset has two features, x1 and x2. We want a classifier that can classify the pair (x1, x2) of coordinates as either green or blue. Consider the image: as it is a 2-d space, by just using a straight line we can easily separate these two classes. But there can be multiple lines that can separate these classes. The SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the points of both classes that are closest to the candidate lines. These points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
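The margin-maximization idea can be illustrated with a small sketch: given a few candidate separating lines of the form w·x + b = 0, pick the one whose closest point (its support vector) is farthest away. The data points and the two candidate lines below are hypothetical and hand-picked; a real SVM solves for the optimal w and b directly rather than comparing a fixed list of candidates:

```python
from math import hypot

def margin(w, b, points):
    """Smallest perpendicular distance from any point to the line w[0]*x1 + w[1]*x2 + b = 0."""
    return min(abs(w[0] * x1 + w[1] * x2 + b) for x1, x2 in points) / hypot(*w)

# Illustrative linearly separable data: blue points near the origin, green farther out
blue = [(1, 1), (2, 1), (1, 2)]
green = [(4, 4), (5, 3), (4, 5)]

# Two candidate separating lines x1 + x2 + b = 0 (both separate the classes)
candidates = [((1, 1), -5.0), ((1, 1), -4.0)]
best_w, best_b = max(candidates, key=lambda c: margin(c[0], c[1], blue + green))
print(best_w, best_b)  # the candidate with the larger margin wins
```

A full SVM implementation (e.g. scikit-learn's SVC with a linear kernel) searches the continuous space of all (w, b) and returns the maximum-margin hyperplane together with its support vectors.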