School of Computing Science and Engineering
Course Code: CSAI2030
Course Name: Predictive Analytics
Introduction to ML Classification Techniques
Name of the Faculty: VIKASH KUMAR MISHRA

Traditional Programming
Input (x1, x2) -> Process (y = x1 + x2, or more generally y = f(x1, x2)) -> Output (y)
In traditional programming the programmer writes the rule f and the computer applies it to the inputs.

Machine Learning
x1 (Input):  2  3  4  5  6  7  8
y (Output):  4  6  8 10 12 14 16
Rule learned from the data: y = 2 × x1

Machine Learning
Input (x1, x2) and Output (y) -> Machine Learning Algorithm -> learned function y = f(x1, x2)
In machine learning the algorithm is given example inputs and outputs and learns the rule f itself.

What is required for a Machine Learning problem?
A data set consisting of inputs (x1, x2, ...) and the corresponding output (y).

From where to get data?
• UCI Repository: https://archive.ics.uci.edu
• Kaggle: https://www.kaggle.com/datasets
• United Nations: http://data.un.org/
• India: https://data.gov.in/

What kind of problem can Machine Learning solve?
Training: the training input (X) and training output (y) are fed to a Machine Learning Algorithm, which produces a Machine Learning Model, y = f(x1, x2).
Prediction: a new input (X) is fed to the trained Machine Learning Model, which produces the predicted output (y).

What kind of problem can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  price
1       585   3         1          2        no       4200000
2       400   2         1          1        no       3850000
3       306   3         1          1        no       4950000
4       665   3         1          2        yes      6050000
5       636   2         1          1        no       6100000
6       416   3         1          1        yes      6600000
7       388   3         2          2        no       6600000
8       416   3         1          3        no       6900000
9       480   3         1          1        yes      8380000
10      550   3         2          4        yes      8850000
11      720   3         2          1        no       9000000
Predicting the price (a continuous value) from the other columns is called a Regression problem.

Regression
• Calculates an approximation of a continuous variable.
• For example:
  - Estimated household income for loan processing or credit card limits
  - Estimated crop production in an area
  - Estimated traffic at a place
  - Estimated future requirement for saving/investment

What kind of problem can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  Price > 6500000
1       585   3         1          2        no       No
2       400   2         1          1        no       No
3       306   3         1          1        no       No
4       665   3         1          2        yes      No
5       636   2         1          1        no       No
6       416   3         1          1        yes      Yes
7       388   3         2          2        no       Yes
8       416   3         1          3        no       Yes
9       480   3         1          1        yes      Yes
10      550   3         2          4        yes      Yes
11      720   3         2          1        no       Yes
Predicting a yes/no label is called a Classification problem.

What kind of problem can Machine Learning solve?
S. No.  Area  bedrooms  bathrooms  stories  parking  House Cost
1       585   3         1          2        no       Low
2       400   2         1          1        no       Low
3       306   3         1          1        no       Low
4       665   3         1          2        yes      Medium
5       636   2         1          1        no       Medium
6       416   3         1          1        yes      Medium
7       388   3         2          2        no       Medium
8       416   3         1          3        no       High
9       480   3         1          1        yes      High
10      550   3         2          4        yes      High
11      720   3         2          1        no       High
Predicting a category such as Low/Medium/High is also a Classification problem.

Classification
• Examining the features of a newly presented object and assigning it to one of the predefined classes.
• For example:
  - Classifying loan applications as Low, Medium or High risk
  - Assigning products to categories and sub-categories
  - Classifying people as BPL, etc.

Feature and Outcome
S. No.  Area  bedrooms  bathrooms  Floor  parking  House Cost
1       585   3         1          2      no       Low
2       400   2         1          1      no       Low
3       665   3         1          2      yes      Medium
4       636   2         1          1      no       Medium
5       416   3         1          3      no       High
6       480   3         1          1      yes      High
Features (X): Area, bedrooms, bathrooms, Floor, parking (predictors, independent variables)
Outcome (y): House Cost (dependent variable, result)
A short worked example of both problem types, using these features, follows.
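As an illustration of the regression and classification problems above, here is a minimal Python sketch using scikit-learn (assumed to be installed) on the toy house table; the feature values come from the slides, but the choice of LinearRegression and DecisionTreeClassifier is just one reasonable option among many.

```python
# Minimal sketch: regression vs. classification on the toy house data above.
# scikit-learn is assumed; the model choices are illustrative only.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Features (X): Area, bedrooms, bathrooms, stories, parking (no=0, yes=1)
X = [
    [585, 3, 1, 2, 0], [400, 2, 1, 1, 0], [306, 3, 1, 1, 0],
    [665, 3, 1, 2, 1], [636, 2, 1, 1, 0], [416, 3, 1, 1, 1],
    [388, 3, 2, 2, 0], [416, 3, 1, 3, 0], [480, 3, 1, 1, 1],
    [550, 3, 2, 4, 1], [720, 3, 2, 1, 0],
]

# Regression target: a continuous price
price = [4200000, 3850000, 4950000, 6050000, 6100000, 6600000,
         6600000, 6900000, 8380000, 8850000, 9000000]

# Classification target: a discrete label (Low / Medium / High)
cost_class = ["Low", "Low", "Low", "Medium", "Medium", "Medium",
              "Medium", "High", "High", "High", "High"]

new_house = [[500, 3, 1, 2, 1]]   # a new, unseen input X

reg = LinearRegression().fit(X, price)             # learns y = f(X), y a number
print("Estimated price:", reg.predict(new_house)[0])

clf = DecisionTreeClassifier().fit(X, cost_class)  # learns y = f(X), y a class
print("Predicted cost class:", clf.predict(new_house)[0])
```

The same feature matrix X is used in both cases; only the type of output (a number versus a class label) changes which family of model is appropriate.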
Affinity Grouping
• Which things go together?
• Used to understand the purchase behaviour of customers.
• Market Basket Analysis. For example: if someone buys a book on Data Science, it is likely that he will also buy a book on Python; if someone buys milk, he may also buy bread or cornflakes.

Clustering
• Segmenting a heterogeneous population into more homogeneous sub-groups or clusters.
• For example:
  - Customer segmentation according to buying behaviour
  - Creating clusters of patients with similar symptoms to identify diseases

Abstract
In the present scenario, data is being generated rapidly. In such a voluminous data world, finding the correct data is critical and important. On the basis of some criteria or pattern, data can be sorted into classes; this is known as classification. The process of grouping data into classes can be either supervised or unsupervised.

History of Classification
• Living organisms were first simply classified as plants or animals.
• Plants were further classified as herbs, trees and shrubs.
• Animals were grouped into herbivores, carnivores and omnivores.
• Modern classification systems take into account the evolutionary relationships between living organisms.

History of Classification
• In 1758 the Swedish botanist Carl Linnaeus (Carolus Linnaeus) developed a system that is still used today to classify species.
Figure: the seven taxonomic units of classification.

Classification
• It is the process of arranging data into homogeneous (similar) groups according to their common characteristics.
• Raw data cannot be easily understood and is not fit for further analysis and interpretation; arranging the data helps users in comparison and analysis.
• For example, the population of a town can be grouped according to sex, age, marital status, etc.

Stages of Classification
Classification is performed in two stages: Model Construction and Model Usage.
Model Construction (training data):
  Pointer  Result
  5        Fail
  8        Pass
  4        Fail
  9        Pass
  3        Fail
  8        Pass
Learned rule: if Pointer ≤ 5 then Fail, else Pass.
Model Usage: for a new input Pointer = 7, the model predicts Pass.

Objectives of Classification
The primary objectives of data classification are:
• To consolidate the volume of data in such a way that similarities and differences can be quickly understood.
• To aid comparison.
• To point out the important characteristics of the data at a glance.
• To give prominence to the important data collected while separating out the optional elements.
• To allow statistical treatment of the material gathered.

The Basis of Classification
Data can be classified on the following bases:
• Geographical or spatial classification
• Chronological or temporal classification
• Qualitative classification
• Quantitative classification

Classification Steps
• Define why you want a classified image and how it will be used; decide whether you really need a classified image.
• Define the study area.
• Select or develop a classification scheme.
• Select imagery and prepare it for classification.
• Collect ancillary data.
• Choose a classification method and classify.

Classification Criteria
• The minimum level of interpretation accuracy should be at least 85 percent.
• The accuracy of interpretation for the several categories should be about equal.
• The classification system should be applicable over extensive areas.
• The classification system should be suitable for temporal data.
• Effective use of subcategories should be possible.
• Aggregation of categories must be possible.
• Comparison should be possible.

Classification
Typical applications:
• Sentiment Analysis
• Email Spam Classification
• Document Classification
• Image Segmentation
• Speech Recognition
• DNA Sequence Classification

Statistical Descriptors
Statistical classification requires some descriptors to measure the similarity of the data being classified:
• Mean
• Distance
• Range
• Standard deviation

Types of Classification
Classification can be done mainly in three ways (a small contrast of the first two is sketched below):
1. Supervised: training, classification and testing.
2. Unsupervised: classification and testing.
3. Hybrid: a mixture of both methods.
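To make the supervised/unsupervised distinction concrete, here is a minimal Python sketch (scikit-learn assumed; the tiny two-feature data set and the choice of k-nearest-neighbours and K-Means are invented purely for illustration): the supervised model is trained with labels, while the clustering algorithm receives only the inputs.

```python
# Minimal sketch: supervised vs. unsupervised learning on invented 2-D points.
# scikit-learn is assumed; data and model choices are illustrative only.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],      # points near (1, 1)
     [5.0, 5.2], [4.8, 5.1], [5.2, 4.9]]      # points near (5, 5)
y = ["A", "A", "A", "B", "B", "B"]            # labels exist only for the supervised case

# Supervised: both X and y are given to the algorithm.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print("Supervised prediction:", clf.predict([[4.9, 5.0]])[0])   # expected "B"

# Unsupervised: only X is given; the number of clusters must be chosen by us.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_)     # two groups, with arbitrary ids 0/1
```

Note that the clustering output is just group indices; unlike the supervised model it cannot attach meaningful class names without extra information, which mirrors the "number of classes is not known" row in the comparison that follows.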
Comparison of Supervised, Unsupervised and Hybrid Learning
Topic | Supervised | Unsupervised | Hybrid
Process | Input and output variables are given. | Only input data is given. | A mixture of both methods.
Input Data | Algorithms are trained using labelled data. | Algorithms are used against data that is not labelled. | Can establish state-of-the-art results on any task.
Algorithms Used | Support Vector Machine, Neural Network, Linear and Logistic Regression, Random Forest, Classification Trees. | Unsupervised algorithms fall into different categories, such as clustering algorithms, K-Means, Hierarchical Clustering. | Q-learning, SARSA (State-Action-Reward-State-Action), Deep Q Network.
Computational Complexity | A simpler method. | Computationally complex. | Complex computation.
Use of Data | Uses training data to learn a link between the inputs and the outputs. | Does not use output data. | Uses both input and output data.
Accuracy of Results | Highly accurate and trustworthy method. | Less accurate and trustworthy method. | Highly accurate results.
Real-Time Learning | Learning takes place offline. | Learning takes place in real time. | Learning takes place offline.
Number of Classes | The number of classes is known. | The number of classes is not known. |
Main Drawback | Classifying big data can be a real challenge in supervised learning. | We cannot get precise information regarding data sorting, and the output is not known because the data used in unsupervised learning is not labelled. |
Past Knowledge | Allows you to collect data or produce an output from previous experience. | Helps you find all kinds of unknown patterns in the data. | Learns from past knowledge.
Type | Regression and Classification are two types of supervised learning. | Clustering and Association are two types of unsupervised learning. | Decision making.
Uses | Image recognition, speech recognition, weather forecasting. | Pre-processing the data, pre-training a supervised learning algorithm. | Warehouses, inventory management, delivery management, power systems, financial systems.

Supervised Classification

Sampling/Training
• Sampling is the process of selecting some data points from the whole set in such a way that the whole set can be predicted on the basis of the selected data points.
• In remote sensing classification, some data from each existing class are needed so that spectral information can be converted to LULC (land use / land cover) information.
• Selection of samples is a very important and critical part of classification; inaccurate sampling may lead to misclassification of the remote sensing imagery.

Sampling Rules
One should take care of the following facts to avoid inaccurate sampling (a small worked computation follows this list):
i. The reference data and the satellite data should belong to the same time period (the temporal resolution should be the same).
ii. We should have a sufficient number of samples (η), which can be decided through Equation 2.1:
        η = (z² × p × q) / e²
   where z = 2 × (standard deviation); the value z = 2 is generalized from the standard normal deviate of 1.96 for the 95-percent two-sided confidence level; p = expected accuracy, i.e. between 85% and 95%; q = 1 − p; and e = allowable error ≈ 100% − z.
iii. If the area is large (more than a million acres) or there are more than 12 land use categories, the number of samples should be increased to 75 to 100.
iv. More samples can be taken in more important categories and fewer in less important categories.
v. The minimum number of pixel samples by class can be obtained by the equation:
        Samples = 30 × N × C
   where N is the number of discriminating variables and C is the number of classes.
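As a worked illustration of the two sample-size rules above, here is a small Python sketch; the numeric values chosen for p, e, N and C are made-up examples, not values taken from the slides.

```python
# Worked example of the sampling-size rules above (illustrative values only).

def samples_from_equation_2_1(z: float, p: float, e: float) -> float:
    """Equation 2.1 from the slides: eta = z^2 * p * q / e^2, with q = 1 - p."""
    q = 1.0 - p
    return (z ** 2) * p * q / (e ** 2)

def minimum_pixel_samples(n_variables: int, n_classes: int) -> int:
    """Rule of thumb from the slides: Samples = 30 * N * C."""
    return 30 * n_variables * n_classes

# Example: z = 2 (about 95% confidence), expected accuracy p = 0.90,
# allowable error e = 0.10 (10%).  These numbers are assumptions for the demo.
eta = samples_from_equation_2_1(z=2.0, p=0.90, e=0.10)
print(f"Required samples (Equation 2.1): about {eta:.0f}")   # 4 * 0.9 * 0.1 / 0.01 = 36

# Example: N = 4 spectral bands (discriminating variables), C = 6 LULC classes.
print("Minimum pixel samples:", minimum_pixel_samples(n_variables=4, n_classes=6))  # 720
```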
Sampling Methods
Samples from a remote sensing image can be selected through any of the following methods:
• Simple Random Sampling (RS)
• Stratified Random Sampling (SRS)
• Systematic Sampling (SS)
• Cluster Sampling (CS)

Classification Methods

Minimum Distance to Mean
The following steps are recommended to implement this scheme:
1. Define the possible number of classes and label them.
2. Select a training data set for each class.
3. Calculate the mean value of each class on the basis of its training data.
4. Take an unclassified pixel, find its distance to the mean value of each class, and classify it into the class with the minimum distance to the mean.
5. Repeat step 4 for all unclassified pixels and classify them into the appropriate class.
6. After classification, give a name to each class.

Parallelepiped
The following steps are recommended to implement this scheme:
1. Define the possible number of classes and select a training data set for each class.
2. Based on the training data set, find the range, i.e. [minimum spectral value, maximum spectral value], for each class.
3. Classify a pixel into a class if its spectral value belongs to the range of that class.
4. Repeat step 3 for all other pixels to classify them.

Classification

K-Means
The following steps are followed:
1. Define the number of classes, say K.
2. From the data set, choose K random values as initial mean values.
3. Classify the remotely sensed data into K classes on the basis of the randomly chosen means, using minimum distance to mean.
4. Compute the K means of the newly formed K classes.
5. Classify the original remotely sensed data set into K classes again on the basis of the means computed in step 4.
6. Repeat steps 4 and 5 iteratively until the means of the K classes stop changing.

ISODATA
The following steps are recommended to implement this scheme:
1. Class centres (means) are placed randomly.
2. Pixels are assigned to classes by the minimum-distance-to-mean method.
3. The standard deviation within each cluster and the distance between cluster centres are computed.
   3a. A class is split if its standard deviation is greater than the user-defined threshold.
   3b. Two classes are merged if the distance between their centres is less than the user-defined threshold.
4. The next iteration is performed with the new class centres.
5. Iterations continue until the average inter-centre distance falls below the user-defined threshold or the maximum number of iterations is reached.

Classification Accuracy Assessment
• An N × N confusion matrix is generated.
• The diagonal entries represent sites classified correctly according to the reference data.
• The off-diagonal entries represent sites classified incorrectly according to the reference data.
• Total accuracy = number of correctly classified plots / total number of plots.
A small sketch of the minimum-distance classifier and this accuracy computation follows.
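To tie the minimum-distance-to-mean steps and the accuracy assessment together, here is a minimal NumPy sketch; the two-band pixel values, class names and reference labels are invented for illustration, and real imagery would have far more pixels and bands.

```python
# Minimal sketch: minimum-distance-to-mean classification plus a confusion
# matrix and total accuracy.  NumPy assumed; all pixel values are invented.
import numpy as np

# Training samples per class (two spectral bands per pixel).
training = {
    "water":      np.array([[10, 12], [12, 14], [11, 13]]),
    "vegetation": np.array([[40, 80], [42, 78], [39, 82]]),
    "urban":      np.array([[90, 60], [88, 62], [92, 58]]),
}

# Step 3: mean vector of each class from its training data.
class_names = list(training)
means = np.array([training[c].mean(axis=0) for c in class_names])

def classify(pixels: np.ndarray) -> list:
    """Steps 4-5: assign each pixel to the class with the nearest mean."""
    dists = np.linalg.norm(pixels[:, None, :] - means[None, :, :], axis=2)
    return [class_names[i] for i in dists.argmin(axis=1)]

# Unclassified pixels and their reference (ground-truth) labels.
pixels = np.array([[11, 13], [41, 79], [89, 61], [38, 81], [13, 15]])
reference = ["water", "vegetation", "urban", "vegetation", "water"]
predicted = classify(pixels)

# Accuracy assessment: N x N confusion matrix and total accuracy.
n = len(class_names)
confusion = np.zeros((n, n), dtype=int)
for ref, pred in zip(reference, predicted):
    confusion[class_names.index(ref), class_names.index(pred)] += 1

print(confusion)                      # diagonal = correctly classified plots
total_accuracy = np.trace(confusion) / confusion.sum()
print("Total accuracy:", total_accuracy)   # correct plots / total plots
```

On real imagery the confusion matrix is rarely perfectly diagonal; the same trace-over-total computation still gives the total accuracy defined above.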
References
• https://monkeylearn.com/blog/classification-algorithms/
• https://online.stat.psu.edu/stat508/lesson/1a/1a.5
• T. Lillesand, R. W. Kiefer, and J. Chipman, Remote Sensing and Image Interpretation, 4th ed., John Wiley & Sons, pp. 1-52 and 470-568, 2014.
• P. M. Mather and M. Koch, Computer Processing of Remotely-Sensed Images: An Introduction, 4th ed., John Wiley & Sons, pp. 1-26 and 229-283, 2011.
• J. R. Anderson, A Land Use and Land Cover Classification System for Use with Remote Sensor Data, US Government Printing Office, 1976.
• Z. Du, W. Li, D. Zhou, et al., 'Analysis of Landsat-8 OLI imagery for land surface water mapping', Remote Sensing Letters, 2014, 5(7), pp. 672-681.
• K. Jia, X. Wei, X. Gu, et al., 'Land cover classification using Landsat 8 Operational Land Imager data in Beijing, China', Geocarto International, 2014, 29(8), pp. 941-951.
• T. Kondraju, V. R. B. Mandla, R. S. Mahendra, et al., 'Evaluation of various image classification techniques on Landsat to identify coral reefs', Geomatics, Natural Hazards and Risk, 2014, 5(2), pp. 173-184.
• D. Lu and Q. Weng, 'A survey of image classification methods and techniques for improving classification performance', International Journal of Remote Sensing, 2007, 28(5), pp. 823-870.