Book Description

Have you thought about a career in data science? It's where the money is right now, and it's only going to become more widespread as the world evolves. Machine learning is a big part of data science, and for those who already have experience in programming, it's the next logical step.

Machine learning is a subfield of AI (Artificial Intelligence) and computer science that uses data and algorithms to imitate human thinking and learning. Through constant learning, machine learning gradually improves its accuracy, eventually providing the optimal results for the problem it has been assigned. It is one of the most important parts of data science and, as big data continues to expand, so too will the need for machine learning and AI.

Here's what you will learn in this quick guide to machine learning with Python for beginners:

What machine learning is
Why Python is the best computer programming language for machine learning
The different types of machine learning
How linear regression works
The different types of classification
How to use SVMs (Support Vector Machines) with Scikit-Learn
How Decision Trees work with classification
How K-Nearest Neighbors works
How to find patterns in data with unsupervised learning algorithms

You will also find plenty of code examples to help you understand how everything works. If you are ready to take your programming further, scroll up, click Buy Now, and find out why machine learning is the next logical step.

Python Machine Learning for Beginners
All You Need to Know about Machine Learning with Python

© Copyright 2022 - All rights reserved. Alex Campbell.

The contents of this book may not be reproduced, duplicated, or transmitted without direct written permission from the author. Under no circumstances will any legal responsibility or blame be held against the publisher for any reparation, damages, or monetary loss due to the information herein, either directly or indirectly.

Legal Notice: You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content within this book without the consent of the author.

Disclaimer Notice: Please note the information contained within this document is for educational and entertainment purposes only. No warranties of any kind are expressed or implied. Readers acknowledge that the author is not engaging in the rendering of legal, financial, medical, or professional advice. Please consult a licensed professional before attempting any techniques outlined in this book. By reading this document, the reader agrees that under no circumstances is the author responsible for any losses, direct or indirect, incurred as a result of the use of the information contained within this document, including, but not limited to, errors, omissions, or inaccuracies.

Table of Contents

Introduction
Prerequisites
Chapter 1: What Is Machine Learning?
  How It Works
  Machine Learning Features
  Why We Need Machine Learning
  The History of Machine Learning
  Machine Learning Today
Chapter 2: The Different Types of Machine Learning
  Which Algorithms to Use
Chapter 3: Linear Regression
  What Is Regression?
  Linear Regression
  Implementing Linear Regression in Python
Chapter 4: The Different Types of Classification
  Binary Classification
  Multi-Class Classification
  Multi-Label Classification
  Imbalanced Classification
Chapter 5: Support Vector Machines with Scikit-Learn
  How Does SVM Work?
  Classifier Building in Scikit-learn
Chapter 6: Using Decision Trees
  Decision Tree Algorithm
  Attribute Selection Measure
  Building a Decision Tree Classifier
  Visualizing Decision Trees
  Optimizing Decision Tree Performance
Chapter 7: K-Nearest Neighbors
  Implementing KNN Algorithm With Scikit-Learn
Chapter 8: Finding Patterns in Data
  The Difference between Supervised and Unsupervised Learning
  Preparing the Data
  Clustering
Conclusion
References

Introduction

Machine Learning (ML) and Artificial Intelligence (AI) are not just the latest buzzwords. They are now among the most important parts of the world we live in, and far more important and useful than science fiction would have you believe. Without them, we simply couldn't process the huge amounts of data we produce, at least not effectively or efficiently. Without them, more people would be stuck doing repetitive, mundane jobs instead of putting their skills to better use. Without them, companies couldn't make good business decisions or draw up effective strategies and solutions.

While the human brain can process large amounts of data, it can only absorb so much at any one time. AI doesn't have these limitations, and it is far more accurate, free from human error. However, it isn't the easiest technology to develop, and it requires the right programming language. That language is Python, for several reasons:

1. It has a wide range of libraries: modules that include pre-written code for certain functions or actions, which means developers don't have to start from scratch every time. Some of the best Python libraries are:
   Scikit-Learn – handles regression, clustering, classification, and other ML algorithms
   Pandas – provides higher-level data structures and analysis, helps with data merging and filtering, and gathers data from external sources
   Keras – a deep learning library for prototyping and calculations
   TensorFlow – another deep learning library that helps set up, train, and use artificial neural networks with vast datasets
   Matplotlib – creates visualizations, like histograms, 2D plots, charts, etc.
2. It is easy to learn, with intuitive syntax, which means it can be used for ML with little effort.
3. It is a flexible language, offering a choice of scripting or OOP, with no requirement to recompile source code, which allows changes to be implemented quickly. It can also be combined with other languages where the need arises.
4. It isn't dependent on any one platform and can be used on more than 20 of them. It also isn't too difficult to transfer code from one platform to another.
5. It is simple to read, so all developers can understand anyone's code and change it if they need to.
6. It offers a great choice of visualization tools so that data can be presented in a human-readable format.
7. It has one of the largest communities of any programming language, with developers and others waiting to help and provide resources.

Prerequisites

This book is aimed at those who already have programming experience with Python. If you are completely new to programming, you need to learn at least the basics before attempting any of the coding and examples in this book. If you are experienced and ready to take your knowledge up a notch, let's dive in and learn all about machine learning using Python.

Chapter 1: What Is Machine Learning?

The real world is full of humans, whose brains have a vast capacity to learn from their experiences, and machines or computers, which work from human instruction.
One question that has long been asked is, "Can a computer learn from past data or experiences the same way humans do?" That question is answered with machine learning. One of the fastest-growing technologies, machine learning is all about teaching computers to learn from past data. It does this by using algorithms that build mathematical models and use historical information or data to make predictions. You already use machine learning in your everyday life, most likely without even knowing it. It is used to filter spam emails from your inbox, for image recognition and auto-tagging on Facebook, for speech recognition, for recommender systems on Netflix, Amazon, and the like, and so much more.

Machine learning is a subset of Artificial Intelligence, and the term was first coined in 1959 by Arthur Samuel. It can be defined as enabling machines to learn automatically from data, use past experiences to improve their performance, and make predictions without needing to be explicitly programmed. We use samples of historical data, called training data, to teach machine learning algorithms how to build mathematical models that make decisions or predictions. This field combines statistics and computer science to build predictive models that use or construct algorithms to learn from the data. The more data we provide, the better the machine learning model's performance.

How It Works

When we provide a machine learning system with sufficient historical data, it builds a prediction model. When we give it new data, it will use that model to predict the output for that data. The accuracy of that output depends heavily on the amount and quality of the data we provide – generally, the more data we give, the more accurate the output will be.

Let's say we are dealing with a complex problem, and we need some predictions. Rather than writing code from scratch, we can use pre-built, generic algorithms. We give the data to those algorithms, and they use that data to build the logic and provide the predicted outputs. In short, machine learning changes how we think about problem-solving.

Machine Learning Features

Machine learning offers plenty of features:

It uses data to find patterns in datasets
It learns from past data and uses it to improve automatically
The technology is purely data-driven
Machine learning can be seen as similar to data mining in that both deal with vast amounts of data

Why We Need Machine Learning

Machine learning is fast becoming a requirement for everyday life, and the need for it increases as each day passes. Why do we need it so badly? For a start, it can take the place of humans. That isn't a bad thing – some jobs are incredibly mundane and time-consuming, and allowing machine learning to take over means freeing up time that is better spent elsewhere. Conversely, some jobs are far too complex for humans to do – we have our limitations, and there is no practical way to manually work through the large amounts of data these jobs require. That's where computer systems, specifically machine learning, come into the picture. We can give this vast amount of data to a machine learning algorithm, which explores it, builds its model, and predicts the output. The amount of data isn't the only thing that affects a model's performance – the choice of cost function matters too – but done well, machine learning can save us significant money and time.

We can also understand just how important machine learning is by looking at its use cases.
Some prominent uses are cyber fraud detection, self-driving cars, Facebook friend suggestions, facial and speech recognition, and spam email filtering. Plus, major companies like Amazon and Netflix use it to analyze user preferences and provide product recommendations. To recap the importance of machine learning, it can:

Analyze and learn from ever-increasing amounts of data
Solve problems too complex for humans
Make decision making more efficient in many sectors
Find patterns hidden in the data and extract information from it

The History of Machine Learning

Until 40 or 50 years ago, machine learning was the stuff of science fiction. Today, it is a prominent part of our lives, making things much easier for us, from self-driving cars to product recommendations and virtual assistants (think Siri, Alexa, Cortana, etc.). However, while machine learning is still relatively new, the idea behind it has been around for many years. Here are some of the more important milestones in its history:

1834 – The father of the computer, Charles Babbage, came up with the idea of a device that could be programmed with punch cards. The device was never built, but modern computers rely on its logical structure.

1936 – Alan Turing devised the theory that machines can learn a set of instructions and execute them.

1940s – This decade saw the invention of ENIAC, the first manually operated, general-purpose electronic computer. It led to EDSAC (1949) and EDVAC (1951), among other stored-program computers.

1943 – 1950 – 1943 saw the first modeling of a human neural network with an electrical circuit. Scientists began applying this idea in 1950, analyzing how human neurons might work. In 1950, Alan Turing also published a seminal paper on artificial intelligence, "Computing Machinery and Intelligence," which asked an important question: can machines think?

1952 – The pioneer of machine learning, Arthur Samuel, developed a program to help an IBM computer play checkers. The more it played, the better it got.

1959 – Arthur Samuel coined the term "machine learning." In the same year, a real-world problem became the subject of a neural network application for the first time: adaptive filters were used to remove echoes from phone lines.

1974 – 1980 – This was a tough era for ML and AI researchers, which became known as the "AI Winter." Machine translation efforts failed, interest in AI began to wane, and governments reduced research funding.

1985 – Charles Rosenberg and Terry Sejnowski invented NETtalk, a neural network that taught itself to pronounce 20,000 words correctly in just seven days.

1997 – The Deep Blue intelligent computer from IBM beat Garry Kasparov, the Russian chess grandmaster, at his own game, becoming the first computer ever to beat a reigning human world champion at chess.

2006 – The computer scientist Geoffrey Hinton rebranded neural net research as "deep learning." Today it is one of the top-trending technologies.

2012 – Google developed a deep neural network that could recognize images of cats and humans from videos on YouTube.

2014 – A chatbot called Eugene Goostman passed the Turing test, becoming the first chatbot to convince a substantial share of the human judges on the panel – 33% of them – that it was human, not a machine. In the same year, Facebook created its own deep neural network, called DeepFace, claiming it matched human precision in recognizing specific people.
2016 – A computer program called AlphaGo beat Lee Sedol, one of the world's top Go players, in a game of Go. The following year, it went on to beat Ke Jie, the world's number one player.

2017 – Alphabet's Jigsaw team built an intelligent system that learned to recognize online trolling. By reading millions upon millions of comments from different sites, it learned how to help stop trolling.

Machine Learning Today

These days, machine learning has come a long way, and it continues to advance thanks to ongoing research. Modern ML models are used to predict diseases and the weather, analyze the stock market, and much more. In the next chapter, we will delve into the different types of machine learning we can use today.

Chapter 2: The Different Types of Machine Learning

Like many things, there is more than one way to train a machine learning algorithm, and each way comes with its own set of pros and cons. To understand those pros and cons, we first need to look at the type of data the algorithms use. There are two types of data in machine learning – labeled and unlabeled.

Labeled Data – both the input and output parameters are available in machine-readable form, but a significant amount of human labor is required to label the data in the first place.
Unlabeled Data – at most one of the parameters is in machine-readable form. This removes the need for human labeling, but the solutions become far more complex.

Machine learning algorithms are separated into four different types:

Supervised Learning

In supervised learning, the algorithm learns by example. The algorithm is given a known dataset, which contains the desired inputs and outputs, and it's down to the algorithm to work out how to get from those inputs to those outputs. The operator already knows the right answer to the problem; the algorithm identifies patterns in the data, learns from its observations, and uses them to make predictions. If a prediction is wrong, the operator corrects it, and the process is repeated until the algorithm achieves the highest possible level of accuracy and performance.

Supervised learning tasks include:

Classification – in these tasks, the machine learning models draw conclusions from observed values and select the best categories for new data. For example, a program that determines whether an email is spam or not must look at existing data and learn how to filter the emails.
Regression – in these tasks, the models must understand and estimate relationships between variables. Regression analysis focuses on one dependent variable and a series of other, changing variables, making regression one of the best tools for forecasting and prediction.
Forecasting – in these tasks, predictions are made about the future based on present and past data. This is commonly used in trend analysis.

Semi-Supervised Learning

Semi-supervised learning differs from supervised learning only in that it uses both labeled and unlabeled data. The labeled data has tags that allow the algorithm to understand the data, while the unlabeled data doesn't have any. Using a combination of the two means the algorithm learns how to put labels on unlabeled data.

Unsupervised Learning

In unsupervised learning, the algorithm examines the data looking for patterns, with no human instruction and no key to learn from. Instead, it analyzes the data given to it and determines relationships and correlations.
Unsupervised learning leaves the machine to interpret vast amounts of data and determine how to deal with it by organizing the data in a way that describes its structure. This could mean clustering it or arranging it in another way that makes it easier to read. The more data an unsupervised learning algorithm accesses, the better its decision-making gets.

Unsupervised learning tasks include:

Clustering – sets of data are grouped by similarity, based on predefined criteria. This is useful when data needs to be segmented into multiple groups and analysis performed on each one to find patterns.
Dimension Reduction – this reduces how many variables need to be considered to find the required information.

Reinforcement Learning

Reinforcement learning revolves around controlled learning processes, where the algorithm is provided with a specific set of actions, along with the parameters and the required outputs. Because the rules are pre-defined, the algorithm can explore the possibilities and options, monitoring and evaluating each result to determine the optimal one. Reinforcement learning is all about trial and error: past experiences are studied, and the algorithm continually adapts its approach until the best result is achieved.

Which Algorithms to Use

Choosing the right algorithm depends on a few factors, such as:

Size of the data
Quality of the data
Diversity of the data
The answers required to derive useful insights from the data
Algorithm accuracy
How long the algorithm takes to train
The required parameters
Data points

This is not an exhaustive list, and choosing the right algorithm is a combination of specification, business need, time available, and experimentation. Even the best data scientists in the world cannot tell you the best algorithm to use right off the bat – it requires experimentation. Below is a list of the most popular machine learning algorithms:

Naïve Bayes Classifier (Supervised learning, classification) – based on Bayes' Theorem, this algorithm classifies every value independently. It uses probability to predict a category or class based on a provided set of features. It may be a simple algorithm, but it works very well and tends to be used a lot because it often outperforms more sophisticated algorithms.
K-Means Clustering (Unsupervised learning, clustering) – this algorithm places unlabeled data into categories. It searches the data and finds groups, with the number of groups represented by the variable K, then iteratively assigns each data point to one of the K groups based on the provided features.
Support Vector Machine (Supervised learning, classification) – these algorithms are used in regression and classification analysis. The algorithm is given a set of training examples, each belonging to one of two categories, and it builds a model that can take new data and assign it to one of those categories.
Linear Regression (Supervised learning, regression) – this is regression at its most basic level, allowing us to understand the relationships between continuous variables.
Logistic Regression (Supervised learning, classification) – this type of regression estimates an event's probability based on previous data. It covers binary dependent variables, where there can only be two values, 1 and 0, representing the outcomes.
Artificial Neural Networks (Supervised, unsupervised, and reinforcement learning) – ANNs are made up of units arranged in layers, with each layer connecting to the layers on either side of it. The inspiration for these comes from the brain and other biological systems and how they process information.
Essentially, they are processing elements, all interconnected and working together to solve a problem.
Decision Trees (Supervised learning, classification and regression) – decision trees are flow charts with a tree structure. A branching method is used to illustrate all possible outcomes of a decision, with each node representing a test on a variable and each branch representing that test's outcome.
Random Forests (Supervised learning, classification and regression) – these come under the ensemble learning methods, where several algorithms are combined to get better results for regression and classification tasks, among others. Each individual tree is weak on its own but, combined with the others, can give excellent results. Each tree begins with an input at the top; the algorithm traverses the tree, segmenting the data into ever smaller sets based on certain variables.
K-Nearest Neighbors (Supervised learning) – this algorithm is used to estimate the likelihood of a data point belonging to one group or another. It examines the data points surrounding a single point to see what group that point is in. For example, if a point is on a grid and KNN wants to determine whether it belongs to group A or group B, it looks at the nearest data points and sees which group most of them are in.

As you can see, choosing the right algorithm is quite involved. To help you out, we will go into more detail on some of these algorithms in the coming chapters, starting with linear regression.

Chapter 3: Linear Regression

Linear regression is one of the basic techniques anyone new to machine learning and statistical techniques should study before moving on to more complex methods. First, let's take a look at what regression is.

What Is Regression?

Regression is a technique that looks for relationships between variables. For example, you could look at details of the employees in a specific company and determine how their salary relates to features like education, experience, age, and where they live. Each employee's data represents one observation in this type of regression problem. There is a presumption that the features are independent, while the salary depends on them. In the same way, you could examine house prices to determine a mathematical dependence on features such as the number of bedrooms, the living area, and how close they are to the city center.

Typically, regression analysis considers some phenomenon of interest and has several observations, each with at least two features. Following the assumption that at least one feature depends on the others, we try to find a relationship between them. In other words, we need a function that maps some features or variables to others sufficiently well.

Dependent features – these are the dependent variables, otherwise known as the outputs or the responses
Independent features – these are the independent variables, otherwise known as the inputs or the predictors

Typically, a regression problem has one dependent variable, which is continuous and unbounded. However, the inputs may be discrete, continuous, or even categorical data, like brand, nationality, gender, etc. Best practice recommends using y to denote the output and x for the inputs. For two or more independent variables, the vector x = (x₁, …, xᵣ) can be used, where r denotes the number of inputs.

When Is Regression Needed?
Regression is usually used to answer questions about whether and how one thing influences another, or how several variables are related. For example, regression can be used to work out whether gender or experience affects salaries, and to what extent. It is also used to forecast responses using new predictors. For example, given the time of day, the external temperature, and the number of people in a household, you could try to predict the household's electricity consumption for the next hour.

Many different fields make use of regression, including computer science, economics, the social sciences, and more. It becomes more important every day as more data becomes available and we become more aware of how to use that data.

Linear Regression

One of the most widely used techniques in regression, and possibly the most important, is linear regression. It is one of the easiest regression methods to use, and it has the advantage that its results are easy to interpret.

Problem Formulation

Let's say we want to implement linear regression of a dependent variable y on the set of independent variables x = (x₁, …, xᵣ), where r denotes the number of predictors. We assume the relationship between x and y is linear:

y = β₀ + β₁x₁ + ⋯ + βᵣxᵣ + ε

That is the regression equation. The regression coefficients are β₀, β₁, …, βᵣ, while ε is the random error. Linear regression calculates estimators of the regression coefficients, also called the predicted weights, denoted b₀, b₁, …, bᵣ. These weights define the estimated regression function f(x) = b₀ + b₁x₁ + ⋯ + bᵣxᵣ, which should capture the dependencies between the inputs and the output well.

For each observation i = 1, …, n, the predicted or estimated response f(xᵢ) should be as near as possible to yᵢ, the actual corresponding response. The differences yᵢ - f(xᵢ) for all observations i = 1, …, n are known as the residuals. Regression is all about working out the predicted weights that correspond to the smallest residuals.

So how do you get the best weights? The SSR (sum of squared residuals) for all the observations i = 1, …, n must be minimized:

SSR = Σᵢ(yᵢ - f(xᵢ))²

This is known as the method of ordinary least squares.

Regression Performance

The actual responses yᵢ, i = 1, …, n, vary according to their dependence on the predictors xᵢ. However, we also consider the output's inherent variance. The coefficient of determination, R², indicates how much of the variation in y can be explained by the dependence on x using the specific regression model. The larger R² is, the better the fit, and the better the model can explain the output variation for different inputs. R² = 1 corresponds to SSR = 0 and tells you that you have the best fit, because the predicted and actual response values match perfectly.

Implementing Linear Regression in Python

Now you know what linear regression is all about, let's look at how to implement it in Python. It is really nothing more difficult than importing the right libraries and their classes and functions.

Linear Regression Packages

A fundamental package is NumPy, a scientific package that lets you work with high-performance single- and multi-dimensional arrays. It is open-source and also offers plenty of useful mathematical routines. Scikit-Learn is another useful machine learning package, built on top of NumPy and other packages. Scikit-Learn gives you what you need to preprocess data, reduce dimensionality, and implement regression, clustering, classification, and much more. It, too, is open-source.
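Because the steps below lean on NumPy's .reshape(), here is a quick, self-contained sketch of what that call does. It uses only the two packages just described, and the sample values are arbitrary:

import numpy as np

# A one-dimensional array of six inputs.
x = np.array([5, 15, 25, 35, 45, 55])
print(x.shape)       # (6,)

# Scikit-Learn expects inputs as a 2-D array: one row per observation,
# one column per feature. reshape((-1, 1)) keeps the row count flexible
# (-1 means "infer it") and fixes exactly one column.
x_2d = x.reshape((-1, 1))
print(x_2d.shape)    # (6, 1)

Because -1 tells NumPy to infer the number of rows, the same call works for any number of observations.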
Simple Linear Regression

Let's dive in with simple linear regression. When you implement any linear regression, you need to follow five steps:

1. Import the correct packages and classes
2. Provide the model with data and do the required transformations
3. Build a regression model, fitting it with the existing data
4. Check the model fitting results so you know if you have the right model
5. Apply the model to get your predictions

Step 1: Import the correct packages and classes

First, you need to import the NumPy package and the LinearRegression class from sklearn.linear_model:

import numpy as np
from sklearn.linear_model import LinearRegression

That gives you everything you need to implement the linear regression. NumPy's fundamental data type is numpy.ndarray, the array type. For the remainder of this chapter, we will use 'array' to refer to all instances of numpy.ndarray. We use the sklearn.linear_model.LinearRegression class to do both linear and polynomial regression and to make the predictions accordingly.

Step 2: Provide the data

Next, we need to define the data we are working with. The inputs (the regressors, x) and the output (the response, y) must be arrays or similar objects – this is the easiest way to provide the data required for the regression:

x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])

You should now have two arrays – the input x and the output y. The input array must be two-dimensional, with one column and however many rows are required, so we call .reshape() on x. The argument (-1, 1) tells .reshape() to infer the number of rows (-1) and use one column (1). x and y now look like this:

>>> print(x)
[[ 5]
 [15]
 [25]
 [35]
 [45]
 [55]]
>>> print(y)
[ 5 20 14 32 22 38]

x has two dimensions and x.shape is (6, 1), while y has a single dimension and y.shape is (6,).

Step 3: Build your model and fit it

The next step is to create a linear regression model and fit it using the existing data. First, we create an instance of the LinearRegression class to represent our model:

model = LinearRegression()

The variable model is an instance of LinearRegression, which can take several parameters, all optional:

fit_intercept – a Boolean with a default of True. It determines whether the intercept b₀ should be calculated (True) or considered equal to zero (False)
normalize – a Boolean with a default of False. It determines whether the input variables should be normalized (True) or not (False)
copy_X – a Boolean with a default of True. It determines whether the input variables should be copied (True) or not (False)
n_jobs – an integer or None, which is the default. It represents how many jobs are used in parallel computation. None indicates one job, while -1 means all processors are used

Our example will use the defaults for all the parameters. Now we need to use our model. First, .fit() needs to be called:

model.fit(x, y)

This call calculates the optimal values of the weights b₀ and b₁, using the existing input and output, x and y, as the arguments. Simply put, .fit() fits our model.
The variable model is returned as self, which is why the last two statements can be replaced with one:

model = LinearRegression().fit(x, y)

This is just a shorter version of the other two statements that does exactly the same thing.

Step 4: Get the results

Once the model has been fitted, the results can be obtained to tell you whether the model works well. We call .score() on model to get R², the coefficient of determination:

>>> r_sq = model.score(x, y)
>>> print('coefficient of determination:', r_sq)
coefficient of determination: 0.715875613747954

When you apply .score(), the arguments are the predictor x and the response y, and the return value is R². The model's attributes are .intercept_, representing b₀ (the intercept), and .coef_, representing b₁ (the slope):

>>> print('intercept:', model.intercept_)
intercept: 5.633333333333329
>>> print('slope:', model.coef_)
slope: [0.54]

The code shows you how to get b₀ and b₁. Note that .intercept_ is a scalar, while .coef_ is an array. The value b₀ = 5.63 shows that the model predicts a response of 5.63 when x is zero, while b₁ = 0.54 means that the predicted response rises by 0.54 when x increases by one.

Also note that y may be provided as a two-dimensional array, and a similar result would be obtained. It might look like this:

>>> new_model = LinearRegression().fit(x, y.reshape((-1, 1)))
>>> print('intercept:', new_model.intercept_)
intercept: [5.63333333]
>>> print('slope:', new_model.coef_)
slope: [[0.54]]

This example is much like the last one but, here, .intercept_ is a one-dimensional array with the single element b₀, while .coef_ is a two-dimensional array with the single element b₁.

Step 5: Predict the response

When you are satisfied with your model, you can use it with existing or new data to make predictions. To get the predicted response, use .predict():

>>> y_pred = model.predict(x)
>>> print('predicted response:', y_pred, sep='\n')
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333 35.33333333]

When you apply .predict(), you pass the regressor as the argument and get back the corresponding predicted response. A nearly identical way to get the same result is this:

>>> y_pred = model.intercept_ + model.coef_ * x
>>> print('predicted response:', y_pred, sep='\n')
predicted response:
[[ 8.33333333]
 [13.73333333]
 [19.13333333]
 [24.53333333]
 [29.93333333]
 [35.33333333]]

Here, each element of x is multiplied by model.coef_, and model.intercept_ is added to the product. The only difference from the previous example is in the dimensions of the output: the first predicted response was one-dimensional, while this one is two-dimensional. If you reduced the number of dimensions of x to one, both examples would give the same result. To do this, replace x with one of the following when you multiply it by model.coef_:

x.reshape(-1)
x.flatten()
x.ravel()

In practice, we typically use regression for forecasting, which means fitted models can be used to calculate outputs based on new inputs:

>>> x_new = np.arange(5).reshape((-1, 1))
>>> print(x_new)
[[0]
 [1]
 [2]
 [3]
 [4]]
>>> y_new = model.predict(x_new)
>>> print(y_new)
[5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]

In this example, we applied .predict() to the new regressor x_new, and the result is y_new. The array of elements from 0 (inclusive) to 5 (exclusive) – that is, 0, 1, 2, 3, 4 – is generated with np.arange().
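The problem formulation earlier allowed for r inputs, but the worked example used only one. As a hedged sketch – the data below is invented purely for illustration – the same five steps carry over unchanged to multiple linear regression, because LinearRegression accepts any number of feature columns:

import numpy as np
from sklearn.linear_model import LinearRegression

# Two input columns per observation (r = 2), made-up values.
x = np.array([[0, 1], [5, 1], [15, 2], [25, 5],
              [35, 11], [45, 15], [55, 34], [60, 35]])
y = np.array([4, 5, 20, 14, 32, 22, 38, 43])

model = LinearRegression().fit(x, y)

print('R²:', model.score(x, y))            # coefficient of determination
print('intercept (b0):', model.intercept_)
print('weights:', model.coef_)             # one weight per input column

# Forecast the response for two new observations.
x_new = np.array([[10, 2], [20, 5]])
print('predictions:', model.predict(x_new))

Note that .coef_ now holds one weight per input column, matching the b₁, …, bᵣ of the estimated regression function.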
Let's now look at the different classification types.

Chapter 4: The Different Types of Classification

Classification is a common type of machine learning task, used to assign class labels to examples. A model can then determine whether an example is of one type or another. Perhaps the most common example of this is filtering spam emails, where an email is classified as spam or not spam. Throughout your journey, you will come across plenty of challenges, and there are several approaches in terms of the model type that fits each challenge.

Classification Predictive Modeling

Typically, classification refers to problems where the predicted result is a class label obtained from the provided data. Some of the more popular types of challenge include:

Spam email classification – determining whether an email is spam or not
Handwriting classification – determining whether a handwritten character is a known one or not
User behavior classification – determining whether recent behavior indicates churn or not

All classification models need a training dataset containing plenty of input and output examples, which the model uses to train itself. The data must cover all possible scenarios for the specific problem, and each label must be represented by enough data for the model to learn from. Often, the class labels are string values, which means they must be encoded into integers – for example, 0 to represent spam and 1 to represent not spam.

The only way to determine the best model for a problem is to experiment and work out which configuration and algorithm provide the best performance for that problem. In predictive modeling, the algorithms are all compared on their results. One of the best metrics for evaluating a model's performance on class label predictions is classification accuracy. It may not be the perfect measure, but it is certainly one of the best places to start in most classification tasks. Rather than a class label, some tasks might predict class membership probabilities for specified inputs. In cases like this, one of the most helpful indicators of model quality is the ROC curve.

In your machine learning journey, you will probably come across four classification task types, and thus four different predictive model types:

Binary classification
Multi-class classification
Multi-label classification
Imbalanced classification

Let's look into each one, with code examples to show you how they work.

Binary Classification

Binary classification covers tasks that output one of exactly two class labels. Typically, one class label is the normal state, while the other is the abnormal state. We can understand this better with the following examples:

Spam detection – normal state = not spam, while abnormal state = spam
Churn prediction – normal state = not churned, while abnormal state = churned
Conversion prediction – normal state = purchased an item, while abnormal state = didn't purchase an item
Cancer detection – normal state = no cancer detected, while abnormal state = cancer detected

The notation typically followed is that the normal state is assigned 0, while the abnormal state is assigned 1. A model can also be created to predict a Bernoulli probability distribution for the output. Simply put, instead of returning a hard 0 or 1, the model returns the probability that the example belongs to class 1, which can then be mapped to one of the two values.
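To make the probability idea concrete, here is a minimal sketch; the dataset and the choice of LogisticRegression (an inherently binary classifier) are illustrative assumptions rather than part of the original example:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two-class toy dataset, of the kind used throughout this chapter.
X, y = make_blobs(n_samples=100, centers=2, random_state=1)

clf = LogisticRegression().fit(X, y)

# predict() returns hard 0/1 labels; predict_proba() returns the
# Bernoulli-style probability of each class for every example.
print(clf.predict(X[:3]))
print(clf.predict_proba(X[:3]))  # each row sums to 1: [P(class 0), P(class 1)]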
The commonly used binary classification algorithms are:

K-Nearest Neighbors
Logistic Regression
Support Vector Machine
Decision Trees
Naive Bayes

Some of these algorithms, including Logistic Regression and Support Vector Machines, are designed specifically for binary classification and do not natively support more than two class types.

To show you how binary classification works, we will create a dataset and apply the classification to it. We will generate a binary classification dataset using the make_blobs() function from Scikit-Learn. In our example, we have a dataset containing 5,000 examples, each belonging to one of the two present classes and having two input features:

from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot

X, y = make_blobs(n_samples=5000, centers=2, random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Output:
(5000, 2) (5000,)
Counter({1: 2500, 0: 2500})
[-11.5739555 -3.2062213] 1
[0.05752883 3.60221288] 0
[-1.03619773 3.97153319] 0
[-8.22983437 -3.54309524] 1
[-10.49210036 -4.70600004] 1
[-10.74348914 -5.9057007 ] 1
[-3.20386867 4.51629714] 0
[-1.98063705 4.9672959 ] 0
[-8.61268072 -3.6579652 ] 1
[-10.54840697 -2.91203705] 1

In this example, a dataset is created with 5,000 samples, divided into input X and output y. The resulting distribution shows that any instance can belong to class 0 or class 1, each containing approximately 50% of the instances. The first 10 examples are displayed, with numeric input values and an integer target value representing class membership. The input variables are then shown on a scatter plot, with points color-coded by class value.

Multi-Class Classification

As the name indicates, these problems do not have two fixed labels; instead, they can have many. Some of the most common multi-class classification tasks are:

Facial classification
Plant species classification
Optical character classification

There is no abnormal or normal outcome; the result belongs to one of multiple known classes. There may also be a huge number of labels, such as predicting how closely an image resembles one of potentially thousands of faces in a facial recognition system. You could also treat predicting the next word in a sequence as a multi-class classification problem; in that scenario, every word in the vocabulary defines a possible class, and the number of classes could run into the millions.

Where binary classification models a Bernoulli distribution, multi-class models typically use a categorical distribution. In categorical distributions, events can have several results or endpoints, and the models predict the probability of the input belonging to each individual output label.

The following algorithms are commonly used for multi-class classification:

K-Nearest Neighbors
Naive Bayes
Decision Trees
Gradient Boosting
Random Forest

The binary classification algorithms can also be used for multi-class classification based on the notion of one vs. rest (one class vs. all the others) or one vs. one (one model for each pair of classes):

One vs. Rest – one binary model is fit for each class against all the others
One vs. One – a single binary model is defined for each pair of classes
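As a hedged sketch of the one-vs.-rest idea, Scikit-Learn's OneVsRestClassifier wrapper turns a binary-only learner into a multi-class one; the blob data mirrors the synthetic datasets used throughout this chapter:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Four classes, while LogisticRegression is natively binary.
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)

# One-vs.-rest fits one binary classifier per class (4 in total),
# each separating its own class from all the others.
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))  # 4 underlying binary models
print(ovr.predict(X[:5]))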
As in binary classification, we will use the make_blobs() function, this time with four centers:

from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot

X, y = make_blobs(n_samples=1000, centers=4, random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Output:
(1000, 2) (1000,)
Counter({1: 250, 2: 250, 0: 250, 3: 250})
[-10.45765533 -3.30899488] 1
[-5.90962043 -7.80717036] 2
[-1.00497975 4.35530142] 0
[-6.63784922 -4.52085249] 3
[-6.3466658 -8.89940182] 2
[-4.67047183 -3.35527602] 3
[-5.62742066 -1.70195987] 3
[-6.91064247 -2.83731201] 3
[-1.76490462 5.03668554] 0
[-8.70416288 -4.39234621] 1

Here, it's clear that we have more than two classes, and the examples can be separated into all the different types.

Multi-Label Classification

Multi-label classification covers tasks where two or more class labels may be assigned to, and predicted for, each example. One example is the classification of photos, where a single image may contain multiple objects, such as fruit, an animal, a person, etc. The biggest difference lies in the fact that these models can predict more than one label at once. Standard multi-class and binary classification models cannot be used directly for multi-label classification; the algorithms have to be modified to output multiple classes, making this more challenging than a simple yes-or-no decision.

Some of the algorithms commonly used in multi-label classification are:

Multi-label Decision Trees
Multi-label Random Forests
Multi-label Gradient Boosting

You could also use a different approach: a separate classification algorithm to predict the label for each class type. In our example, the multi-label classification dataset is generated using Scikit-Learn, and the code shows multi-label classification working with 1,000 samples, three features, and four class labels:

from sklearn.datasets import make_multilabel_classification

X, y = make_multilabel_classification(n_samples=1000, n_features=3, n_classes=4, n_labels=4, random_state=1)
print(X.shape, y.shape)
for i in range(10):
    print(X[i], y[i])

Output:
(1000, 3) (1000, 4)
[ 8. 11. 13.] [1 1 0 1]
[ 5. 15. 21.] [1 1 0 1]
[15. 30. 14.] [1 0 0 0]
[ 3. 15. 40.] [0 1 0 0]
[ 7. 22. 14.] [1 0 0 1]
[12. 28. 15.] [1 0 0 0]
[ 7. 30. 24.] [1 1 0 1]
[15. 30. 14.] [1 1 1 1]
[10. 23. 21.] [1 1 1 1]
[10. 19. 16.] [1 1 0 1]

Imbalanced Classification

Imbalanced classification is used for tasks with an uneven distribution of examples across the classes. Typically, these are binary classification tasks where a large percentage of the training set belongs to the normal class and the rest to the abnormal class. This type of classification is commonly needed in:

Fraud detection
Medical diagnosis
Outlier detection

Special techniques are used to handle the imbalance before treating these problems as ordinary binary classification tasks. You can over-sample the minority class or under-sample the majority class; SMOTE oversampling and random undersampling are two of the best-known examples. When fitting the model on the training dataset, you can also use specialized modeling algorithms that pay more attention to the minority class. This includes cost-sensitive algorithms, such as:

Cost-Sensitive Logistic Regression
Cost-Sensitive Decision Trees
Cost-Sensitive Support Vector Machines

Once the model is chosen, we need to assess and score it. We can do that using the Precision, Recall, or F-Measure scores.
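In Scikit-Learn, one common way to get cost-sensitive behavior is the class_weight parameter accepted by most of the classifiers just listed. This is a minimal sketch, with the 99:1 imbalance chosen to mirror the dataset we generate next:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Heavily imbalanced two-class problem (roughly 99% class 0, 1% class 1).
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.99, 0.01], random_state=1)

# class_weight='balanced' scales each class's errors inversely to its
# frequency, so mistakes on the rare class cost more during training.
clf = LogisticRegression(class_weight='balanced').fit(X, y)
print(clf.predict(X[:10]))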
We need a dataset suited to the problem, so we'll generate a synthetic, imbalanced binary classification dataset containing 1,000 samples:

from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot

X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_classes=2, n_clusters_per_class=1, weights=[0.99,0.01], random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()

Output:
(1000, 2) (1000,)
Counter({0: 983, 1: 17})
[0.86924745 1.18613612] 0
[1.55110839 1.81032905] 0
[1.29361936 1.01094607] 0
[1.11988947 1.63251786] 0
[1.04235568 1.12152929] 0
[1.18114858 0.92397607] 0
[1.1365562 1.17652556] 0
[0.46291729 0.72924998] 0
[0.18315826 1.07141766] 0
[0.32411648 0.53515376] 0

Here, the label distribution can be seen, along with a serious class imbalance: 17 examples belong to one class and the remaining 983 to the other. As expected, the majority of examples belong to class 0. In the next chapter, we'll look at SVMs, or Support Vector Machines, using Scikit-Learn.

Chapter 5: Support Vector Machines with Scikit-Learn

This chapter will take an in-depth look at a popular algorithm used in supervised machine learning: Support Vector Machines. SVM tends to be highly accurate, often more so than logistic regression, decision trees, and other similar classifiers. Perhaps its best-known feature is the kernel trick, which lets it handle non-linear input spaces. It is used in several applications, including intrusion detection, facial recognition, classification of emails, web pages, and news articles, gene classification, and handwriting recognition.

SVM is one of the most exciting algorithms, built on simple concepts. The classifier separates the data points with the hyperplane that has the biggest margin, finding the optimal hyperplane for classifying new data points.

Support Vector Machines

Typically, a Support Vector Machine is seen as an approach to classification problems, but it can just as easily be employed for regression. It also handles multiple categorical and continuous variables with ease. The hyperplane is constructed in multidimensional space to keep the different classes separate; the optimal hyperplane is generated iteratively so as to minimize the risk of error. The core idea behind the SVM is to find the MMH (Maximum Marginal Hyperplane) that most efficiently splits the dataset into classes.

Support Vectors – these are the data points nearest to the hyperplane. They determine the margins, making the separating line better defined, and they are the points relevant to the classifier's construction.
Hyperplane – this is a decision plane separating a set of objects with different class memberships.
Margin – this is the gap between the two lines through the nearest points of the two classes, calculated as the perpendicular distance between the separating line and the closest points (the support vectors). The larger the margin between the classes, the better; smaller margins are considered bad.
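To see these definitions in code, here is a small sketch – the toy points are invented – that fits a linear SVM and inspects the support vectors Scikit-Learn exposes:

import numpy as np
from sklearn import svm

# Two small, linearly separable groups of 2-D points (made up).
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = svm.SVC(kernel='linear').fit(X, y)

# The support vectors are the training points nearest the hyperplane;
# only these points determine where the separating line ends up.
print(clf.support_vectors_)

# For a linear kernel, the hyperplane is w·x + b = 0.
print(clf.coef_, clf.intercept_)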
How Does SVM Work?

The SVM's primary objective is to split the dataset in the most efficient way possible. The distance between the nearest points of the classes is called the margin, and the idea is to choose the hyperplane with the biggest possible margin between the dataset's support vectors.

Non-Linear and Inseparable Planes

A linear hyperplane cannot be used to solve every problem. In situations where it cannot, a kernel trick is used to transform the input space into a higher-dimensional space. The data points can then be plotted along the new dimensions (say, the x-axis and z-axis), allowing linear separation to segregate the points.

SVM Kernels

In practice, the SVM algorithm is implemented using a kernel, which transforms the input data space into the required form. This is done using the technique known as the 'kernel trick', where a low-dimensional input space is turned into a higher-dimensional space. In simple terms, extra dimensions are added so that a non-separable problem is converted into a separable one. This is incredibly useful in problems revolving around non-linear separation, and it helps construct more accurate classifiers.

Linear Kernel
A linear kernel is a normal dot product of two given observations: the product of the two vectors is the sum of each input value pair multiplied together.

K(x, xi) = sum(x * xi)

Polynomial Kernel
These are generalized forms of the linear kernel, and they can tell the difference between non-linear and curved input spaces.

K(x, xi) = 1 + sum(x * xi)^d

Here, d indicates the degree of the polynomial. d = 1 is much the same as the linear transformation, and the degree must be specified manually in the algorithm.

Radial Basis Function Kernel
This is one of the most popular kernel functions in SVM classification, and it can map input spaces into an infinite-dimensional space.

K(x, xi) = exp(-gamma * sum((x – xi)^2))

Here, the parameter is gamma, which typically ranges from 0 to 1. Higher gamma values fit the training set too closely, resulting in over-fitting. A good default value is gamma = 0.1, and, as with the polynomial kernel, the gamma value must be specified manually in the algorithm.

Classifier Building in Scikit-learn

Now you know the theory behind SVMs, it's time to look at how to implement one in Python using Scikit-Learn. We'll be using the well-known breast cancer dataset, a popular binary classification problem computed from digitized images of FNAs (fine needle aspirates) of breast masses, describing characteristics of the cell nuclei in the images. There are 30 features in the dataset:

mean radius, mean texture, mean perimeter, mean area, mean smoothness, mean compactness, mean concavity, mean concave points, mean symmetry, mean fractal dimension
radius error, texture error, perimeter error, area error, smoothness error, compactness error, concavity error, concave points error, symmetry error, fractal dimension error
worst radius, worst texture, worst perimeter, worst area, worst smoothness, worst compactness, worst concavity, worst concave points, worst symmetry, worst fractal dimension

It also has a target: the type of cancer. There are two types – malignant and benign. We want to build a model to classify the cancer type. The dataset can be downloaded from the Scikit-Learn library or the UCI Machine Learning Repository.
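Before building the classifier, note that the kernel formulas above translate directly into NumPy. This is purely an illustrative sketch of the math – Scikit-Learn computes kernels internally, so none of this code is needed to use SVC:

import numpy as np

def linear_kernel(x, xi):
    # K(x, xi) = sum(x * xi), the ordinary dot product
    return np.sum(x * xi)

def polynomial_kernel(x, xi, d=2):
    # K(x, xi) = (1 + sum(x * xi))^d; d = 1 is close to the linear case
    return (1 + np.sum(x * xi)) ** d

def rbf_kernel(x, xi, gamma=0.1):
    # K(x, xi) = exp(-gamma * sum((x - xi)^2))
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x = np.array([1.0, 2.0])
xi = np.array([2.0, 0.5])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi))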
Step One – Loading the Data

First, we need to load the dataset – we'll get it from the Scikit-Learn library:

#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()

Step Two – Exploring the Data

Once the dataset is loaded, we can explore it to find out more about it. We'll look at the 30 feature names and the target names:

# print the names of the 30 features
print("Features: ", cancer.feature_names)

# print the labels, the types of cancer ('malignant', 'benign')
print("Labels: ", cancer.target_names)

Output:
Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area' 'mean smoothness' 'mean compactness' 'mean concavity' 'mean concave points' 'mean symmetry' 'mean fractal dimension' 'radius error' 'texture error' 'perimeter error' 'area error' 'smoothness error' 'compactness error' 'concavity error' 'concave points error' 'symmetry error' 'fractal dimension error' 'worst radius' 'worst texture' 'worst perimeter' 'worst area' 'worst smoothness' 'worst compactness' 'worst concavity' 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']

Let's look a bit deeper and check the dataset's shape:

# print the data (feature) shape
cancer.data.shape

Output:
(569, 30)

And now we can look at the first five records of the features:

# print the cancer data features (first five records)
print(cancer.data[0:5])

Output:
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01 1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02 6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01 1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02 7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01 5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01 2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01 1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01 6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01 2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01 3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414e-01 1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00 2.723e+01 9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03 1.491e+01 2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01 2.575e-01 6.638e-01 1.730e-01]
 [2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01 1.980e-01 1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00 9.444e+01 1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03 2.254e+01 1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01 1.625e-01 2.364e-01 7.678e-02]]

And the target set:
# print the cancer labels (0: malignant, 1: benign)
print(cancer.target)

Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 0 0 0 0 0 1]

Step Three – Splitting the Data

The best way to understand how the model performs is to split the dataset into a training set and a test set. We do this with a function called train_test_split(), which takes three parameters – the features, the target, and the test set size. You can also use random_state to select records randomly:

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3, random_state=109) # 70% training and 30% test

Step Four – Generating the Model

Now we can build our SVM. First, we import the SVM module, then we create a support vector classifier object by passing the kernel type – here, a linear kernel – as the kernel argument of the SVC() function. Then, we use fit() to fit the model on the training set and predict() to make predictions on the test set:

#Import svm model
from sklearn import svm

#Create the SVM classifier
clf = svm.SVC(kernel='linear') # Linear Kernel

#Train the model on the training set
clf.fit(X_train, y_train)

#Predict the response for the test dataset
y_pred = clf.predict(X_test)

Step Five – Evaluating the Model

The final step is to estimate how accurately the model predicts breast cancer. To do this, we compare the actual values from the test set with the predicted values:

#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output:
Accuracy: 0.9649122807017544

The classification accuracy is a very good 96.49%. However, we can take things further by checking the model's precision and recall:

# Model Precision: what percentage of positive predictions are correct?
print("Precision:", metrics.precision_score(y_test, y_pred))

# Model Recall: what percentage of actual positives are correctly labeled?
print("Recall:", metrics.recall_score(y_test, y_pred))

Output:
Precision: 0.9811320754716981
Recall: 0.9629629629629629

This time, we get even better scores – precision is 98%, and recall is 96%.

Tuning the Hyperparameters

The kernel's primary function is to transform the input data of the provided dataset into the required form. There are three main kernel types – linear, polynomial, and RBF (radial basis function). RBF and polynomial kernels are best for non-linear hyperplanes, both computing the separation lines in higher dimensions. Some applications require one of these more complex kernels to separate non-linear or curved classes, leading to more accurate classifiers. Two hyperparameters matter most:

Regularization – in Scikit-Learn, the penalty parameter C, representing the error or misclassification term, maintains regularization. It determines how much misclassification the SVM optimization will accept, and so controls the trade-off between the misclassification term and the decision boundary. A small C value results in a larger-margin hyperplane that tolerates more misclassifications, while a large C value results in a smaller-margin hyperplane.
Gamma – a lower gamma value fits the training set loosely, while a higher value fits it exactly, risking over-fitting. In simple terms, a low gamma value considers only the nearest points when calculating the separation line, while a higher value takes all the data points into account.
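One way to search for good C and gamma values is Scikit-Learn's GridSearchCV, which cross-validates every combination in a grid. A hedged sketch, reusing X_train and y_train from the steps above; the grid values are arbitrary starting points, not recommendations:

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# Candidate values for the two hyperparameters discussed above.
param_grid = {
    'C': [0.1, 1, 10, 100],
    'gamma': [0.001, 0.01, 0.1, 1],
    'kernel': ['rbf'],
}

# 5-fold cross-validation over every C/gamma combination.
grid = GridSearchCV(svm.SVC(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)  # best combination found
print(grid.best_score_)   # its mean cross-validated accuracy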
Advantages

SVM classifiers offer good accuracy and faster prediction than the Naïve Bayes algorithm
They use less memory because only a subset of the training points (the support vectors) is used in the decision function
SVMs work very well in high-dimensional spaces and where there is a clear margin of separation

Disadvantages

SVM classifiers don't work well with large datasets because training takes a long time, longer than for Naïve Bayes
They also don't work very well with overlapping classes
They are sensitive to the type of kernel used

In the next chapter, we will look at using decision trees.

Chapter 6: Using Decision Trees

Let's say you are a marketing manager who wants to find the customers most likely to buy your products. Decision trees can help you find that audience and make the best use of your marketing budget. Or you could be a loan manager who needs to weed out risky applications to achieve a better loan default rate. In either case, a decision tree lets you classify applicants into two groups – potential and non-potential clients, or safe and risky applications – giving you a better chance of making the right decisions.

Classification requires two steps – the learning step and the prediction step. In the first, we use existing training data to develop the model; in the second, we use the model to make predictions on new data. The decision tree is a popular classification algorithm and certainly one of the easiest to understand, and it isn't restricted to classification problems, either – you can also use it for regression.

Decision Tree Algorithm

A decision tree is a flowchart-like structure in which internal nodes represent attributes or features, branches represent decision rules, and leaf nodes represent outcomes. The top node in the tree is the root node, and the tree learns to partition the data recursively on attribute values. The results are visualized in a flowchart-style diagram that mimics human-level thinking, which makes decision trees incredibly easy to interpret. They are described as "white box" algorithms because they share the internal logic behind their decisions, something "black box" algorithms such as neural networks do not. They also train faster than neural networks, with a time complexity that is a function of the number of attributes and records in the data. Decision trees are classified as non-parametric or distribution-free, which means they don't depend on assumptions about the data's probability distribution, and they produce good accuracy on high-dimensional data.
How Decision Trees Work

The basic idea behind the algorithm is:

Use an Attribute Selection Measure (ASM) to choose the best attribute for splitting the records
Turn that attribute into a decision node and split the dataset into smaller subsets
Repeat the process recursively for every child node until one of the following conditions is met:
- All the tuples belong to the same attribute value
- There are no instances left
- There are no attributes left

Attribute Selection Measure

An attribute selection measure is a heuristic for choosing the splitting criterion that partitions the data in the best possible way. It is also called "splitting rules" because it helps us work out the breakpoints for the tuples at a given node. The ASM gives each attribute or feature a rank based on the dataset, and the attribute with the best score is selected as the splitting attribute. For continuous-valued attributes, we also need to define split points for the branches. The most popular selection measures are Information Gain, Gain Ratio, and Gini Index – let's look at these in more detail.

Information Gain

Claude Shannon developed the concept of entropy in his 1948 paper, "A Mathematical Theory of Communication." In mathematics and physics, entropy is the impurity or randomness in a system, while in information theory, it is the impurity present in a set of examples. Information gain measures a decrease in entropy: it is the difference between the entropy before the split and the average entropy after the dataset has been split on the values of a specific attribute. The entropy of dataset D is:

Info(D) = -Σ p_i log2(p_i), summed over the m classes

where p_i is the probability that an arbitrary tuple in D belongs to class Ci, and Info(D) is the average amount of information needed to identify the class label of a tuple in D. After splitting D on attribute A into v partitions, the information still required to classify a tuple is:

Info_A(D) = Σ (|Dj|/|D|) x Info(Dj), summed over the v partitions

where |Dj|/|D| acts as the weight of the jth partition. The information gain is then:

Gain(A) = Info(D) - Info_A(D)

The attribute A with the highest Gain(A) is picked as the splitting attribute at node N.

Gain Ratio

Information gain is biased toward attributes with many possible outcomes, meaning attributes with many distinct values are preferred. For example, splitting on an attribute that acts as a unique identifier (customer_ID, say) produces a pure partition for every record, so Info_A(D) = 0 and the information gain is maximized, even though the partitioning is useless for classification. C4.5, an improvement on ID3, addresses this bias with an extension of information gain called the gain ratio, which normalizes the information gain using split information:

SplitInfo_A(D) = -Σ (|Dj|/|D|) log2(|Dj|/|D|), summed over the v partitions

where v is the number of discrete values of attribute A and |Dj|/|D| again acts as the jth partition's weight. We can then define the gain ratio as:

GainRatio(A) = Gain(A) / SplitInfo_A(D)

The attribute with the highest gain ratio is picked as the splitting attribute.

Gini Index

CART (Classification and Regression Trees) is a decision tree algorithm that uses the Gini Index to create split points:

Gini(D) = 1 - Σ p_i², summed over the m classes

where p_i is the probability that a tuple in D belongs to class Ci. The Gini Index considers a binary split for each attribute, and a weighted sum of each partition's impurity can be computed. If a binary split on attribute A divides D into D1 and D2, the Gini Index of D given that split is:

Gini_A(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)

For a discrete-valued attribute, the subset giving the minimum Gini Index is selected as the splitting subset. For continuous-valued attributes, each pair of adjacent values is considered as a possible split point, and the one with the smallest Gini Index becomes the actual split point.
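To make Info(D) and Gini(D) concrete, here is a small sketch (an illustration of the formulas above, with helper names of my own choosing) that computes both measures for a toy set of class labels:

import numpy as np

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the classes present in D
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the classes present in D
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])  # toy class labels, 3 of class 0 and 5 of class 1
print(entropy(labels))  # about 0.954 bits
print(gini(labels))     # about 0.469

A perfect 50/50 split gives the maximum entropy of 1 bit and a Gini Index of 0.5, while a pure node scores 0 on both measures.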
Building a Decision Tree Classifier

Let's build our decision tree classifier.

Step One – Import the Libraries

The first step is to import the libraries we need:

# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import the Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import the train_test_split function
from sklearn import metrics # Import the scikit-learn metrics module for accuracy calculation

Step Two – Load the Data

We are using the Pima Indians Diabetes dataset for this, which we import with the read_csv() function in Pandas. The dataset is widely available online.

col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)

If you want to see what the dataset looks like, execute this:

pima.head()

When you execute this, you will see the first five rows (the age and label columns are cut off in this view):

   pregnant  glucose  bp  skin  insulin   bmi  pedigree  ...
0         6      148  72    35        0  33.6     0.627  ...
1         1       85  66    29        0  26.6     0.351  ...
2         8      183  64     0        0  23.3     0.672  ...
3         1       89  66    23       94  28.1     0.167  ...
4         0      137  40    35      168  43.1     2.288  ...

Step Three - Feature Selection

Now the columns need to be divided into two variable types – the dependent variable (the target) and the independent variables (the features):

#split the dataset into features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age', 'glucose', 'bp', 'pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable

Step Four – Splitting the Data

Understanding how the model performs requires dividing the dataset into a training set and a test set. We'll use the train_test_split() function again, passing three parameters – features, target, and test set size:

# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test

Step Five – Building the Model

We'll use Scikit-Learn to build the decision tree classifier model:

# Create the Decision Tree classifier object
clf = DecisionTreeClassifier()

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

#Predict the response for the test dataset
y_pred = clf.predict(X_test)

Step Six - Evaluating the Model

Now we can estimate how accurately the model predicts diabetes by comparing the actual values from the test set with the predicted values:

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output: Accuracy: 0.6753246753246753

We got an accuracy of 67.53%, which we can improve by tuning the algorithm's parameters.

Visualizing Decision Trees

Scikit-Learn has a useful function called export_graphviz that lets us visualize the tree in a Jupyter notebook. You also need pydotplus and graphviz to plot it:

pip install graphviz
pip install pydotplus

The export_graphviz function converts the classifier into a dot file, which pydotplus then converts into a PNG or another form displayable in a Jupyter notebook.
from sklearn.tree import export_graphviz
from io import StringIO # older tutorials import this from sklearn.externals.six, which has since been removed
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

In the resulting chart, each internal node has a decision rule used to split the data. Gini, the measure we met earlier, indicates the node's impurity; a node is pure when all the records in it share the same class, as in a leaf node. This first tree is unpruned, which makes it difficult to explain or understand, so we need to prune and optimize it.

Optimizing Decision Tree Performance

In Scikit-Learn, we optimize the decision tree classifier by pre-pruning it, and a tree's maximum depth can be used as one of the control variables for this. In the example below, the decision tree is built on the same data with max_depth=3. The main pre-pruning parameters are:

criterion – optional (default='gini'), the attribute selection measure. The supported criteria are 'gini' for the Gini Index and 'entropy' for information gain.
splitter – string, optional (default='best'), the split strategy. This parameter decides how each node is split.
max_depth – int or None, optional (default=None), the maximum depth of the tree. If None, nodes are expanded until all the leaves are pure or contain fewer than min_samples_split samples. Higher maximum depth values increase the chance of overfitting, while lower values risk underfitting.

As well as setting these parameters, we can switch to entropy or another attribute selection measure:

# Create the Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)

# Train the Decision Tree classifier
clf = clf.fit(X_train, y_train)

#Predict the response for the test dataset
y_pred = clf.predict(X_test)

# Model Accuracy: how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))

Output: Accuracy: 0.7705627705627706

The classifier produced an accuracy of 77.05%, much better than the previous model.

Visualizing Decision Trees

from io import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True, special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())

The resulting pruned model is less complex, can be easily explained, and is much easier to understand than the unpruned version.

Pros

Decision trees are simple to visualize and interpret
They can capture non-linear patterns easily
They require little data preprocessing, i.e., columns don't need to be normalized
They can be used in feature engineering, for example for selecting variables or predicting missing values
The algorithm is non-parametric and makes no assumptions about the data's distribution

Cons

Decision trees are highly sensitive to noisy data and can overfit it
A small variance in the data can produce a completely different tree, although bagging and boosting algorithms can reduce this, as the sketch after this list shows
They are biased with imbalanced datasets, so the dataset should be balanced before the tree is created
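To illustrate the bagging point, here is a minimal sketch (my addition, reusing X_train, X_test, y_train, and y_test from this chapter) that averages the votes of many trees trained on bootstrap samples of the training data:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# 50 trees, each trained on a bootstrap sample of the training data;
# the parameter is named estimator in recent Scikit-Learn versions
# (base_estimator in older ones), and the settings are illustrative
bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=3),
    n_estimators=50,
    random_state=1,
)
bag.fit(X_train, y_train)

# Averaging many trees reduces the variance of any single tree
y_pred_bag = bag.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred_bag))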
Next, we move on to KNN, or K-Nearest Neighbors.

Chapter 7: K-Nearest Neighbors

KNN is a supervised learning algorithm that, in its most basic form, is easy to implement, yet it can perform incredibly complex classification tasks. It is considered a lazy algorithm because there is no special training phase; instead, it uses all the provided data for training while classifying a new instance or data point. Like the decision tree, it is a non-parametric algorithm that assumes nothing about the underlying data. This is one of its most useful features because real-world data doesn't tend to follow uniform distribution, linear separability, or any other theoretical assumption. This chapter will dive into KNN and how to implement it with Scikit-Learn, starting with the theory behind it.

The Theory Behind KNN

KNN is one of the simplest algorithms in the supervised learning family. All it does is calculate the distances from a data point to all the other points; the distance may be Euclidean, Manhattan, or any other measure. Once the distances are calculated, it chooses the K nearest points, where K is an integer, and finally assigns the data point to the class to which most of those K points belong.

For example, say we have a dataset containing two variables, and the algorithm needs to classify a new data point, X, into one of two classes – red or blue. The data point has coordinates x=45 and y=50, and we'll assume K has a value of 3. The algorithm's first step is to calculate the distance from X to all the other points. Next, it finds the three points nearest to X. Finally, it assigns X to the class most of those three points belong to: if two of them are in the red class and the third is in blue, the new point is classified as red.
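Before turning to Scikit-Learn, here is a minimal from-scratch sketch of that distance-and-vote idea (the training coordinates and class labels are made-up illustrative values; only the new point at x=45, y=50 and K=3 come from the example above):

import numpy as np
from collections import Counter

# Made-up training points with two features each, and their classes
points = np.array([[40, 45], [48, 52], [50, 47], [10, 12], [12, 9]])
classes = ['red', 'red', 'blue', 'blue', 'blue']
new_point = np.array([45, 50])
k = 3

# Step 1: Euclidean distance from the new point to every training point
distances = np.linalg.norm(points - new_point, axis=1)

# Step 2: the indices of the K nearest points
nearest = np.argsort(distances)[:k]

# Step 3: majority vote among the K nearest neighbors
votes = Counter(classes[i] for i in nearest)
print(votes.most_common(1)[0][0])  # 'red' - two of the three nearest are red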
Implementing KNN Algorithm With Scikit-Learn

Now we will look at how to implement KNN using Scikit-Learn in no more than 20 lines of code.

Step One – Import the Dataset

As always, we need to import the libraries and the dataset first. We're using the iris dataset, with its four attributes:

sepal-width
sepal-length
petal-width
petal-length

These attributes indicate the specific iris plant type, and our task is to predict the class each plant belongs to. There are three classes in the dataset:

Iris-setosa
Iris-versicolor
Iris-virginica

The dataset is hosted by the UCI Machine Learning Repository, and the code below loads it directly from there.

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Importing the Dataset

Now the dataset needs to be imported and loaded into a Pandas DataFrame:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

# Read the dataset into a pandas dataframe
dataset = pd.read_csv(url, names=names)

If you want to see what the dataset looks like, execute this:

dataset.head()

You will see the first five rows of the dataset:

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Step Two – Data Preprocessing

Next, the dataset needs to be divided into attributes and labels:

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

The X variable holds the dataset's first four columns (the attributes), while y holds the labels.

Step Three – Train Test Split

We don't want to risk overfitting, so the dataset should be split into training and test sets. That way, the algorithm learns on one set of data and is tested on previously unseen data, exactly what would happen in a real-world application. Here's how to create the two sets:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

This script puts 80% of the dataset into the training set and the remaining 20% into the test set. The dataset contains 150 records, so 120 go to training and 30 to testing.

Step Four – Feature Scaling

Before the algorithm makes any predictions, we should scale the features so they are all evaluated uniformly. Why do we need to do this? The raw data contains values with widely varying ranges, and some algorithms cannot work properly without normalization. For example, most classifiers calculate the distance between a pair of points as the Euclidean distance, and if one feature has a much wider range of values than the others, that feature will dominate the distance. The range of every feature should therefore be normalized so each one contributes proportionately to the resulting distance. (The gradient descent algorithm, commonly used to train neural networks among other models, also converges much faster when the features are normalized.) Here's how to do the feature scaling:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Step Five - Training and Predictions

Training the KNN algorithm and making predictions with it is straightforward, particularly when Scikit-Learn is used:

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)

First, the KNeighborsClassifier class is imported from the sklearn.neighbors module. Second, the class is initialized with a single parameter, n_neighbors, which is the value of K. There is no ideal value for K; it is chosen after testing and evaluation, but 5 is the most commonly used value.
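The distance function mentioned in the theory section is also configurable here: KNeighborsClassifier accepts a metric parameter (the default is Minkowski with p=2, which is the Euclidean distance). A quick sketch, reusing the scaled training and test sets from above:

from sklearn.neighbors import KNeighborsClassifier

# The same classifier, but with Manhattan distance instead of Euclidean
manhattan_clf = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
manhattan_clf.fit(X_train, y_train)
print(manhattan_clf.score(X_test, y_test))  # accuracy under Manhattan distance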
Now we need to use our test data to make predictions:

y_pred = classifier.predict(X_test)

Step Six - Evaluating the Algorithm

The metrics most commonly used in algorithm evaluation are the confusion matrix, precision, recall, and F1 score. We can use two methods from sklearn.metrics to calculate them – confusion_matrix and classification_report:

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Output:

[[11  0  0]
 [ 0 13  0]
 [ 0  0  6]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30

These results show that KNN classified all 30 test set records with an accuracy of 100% – it doesn't get better than that. However, while the algorithm showed excellent accuracy on this dataset, this won't be the case with every application; KNN isn't known for working well with categorical features or high-dimensional data.

Compare the Error Rate with the K Value

As mentioned earlier, there is no way of knowing upfront which K value will give the best results the first time around. We chose 5 somewhat arbitrarily, and luckily it gave us 100% accuracy. One way of finding the best K value is to plot the K values against the dataset's corresponding error rates. Our next step is to plot the mean error of the test set predictions for K values between 1 and 39. First, we calculate the mean errors:

error = []
# Calculating the error for K values between 1 and 39
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))

Here, a loop runs over K values from 1 to 39. Each iteration calculates the mean error of the predictions and appends it to the error list. Next, we plot the error values against the values of K:

plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o', markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()

The output graph shows that when the value of K is between 5 and 18, the mean error is zero. Play about with the value of K to see how it affects the prediction accuracy.
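Another way to choose K (an addition to the walkthrough, not from the original text) is to cross-validate on the training set only, keeping the test set untouched until the final evaluation:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Mean 5-fold cross-validation accuracy for each K from 1 to 39
scores = []
for k in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores.append(cross_val_score(knn, X_train, y_train, cv=5).mean())

best_k = int(np.argmax(scores)) + 1  # +1 because the range starts at 1
print("Best K:", best_k)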
Pros

KNN is very easy to implement.
As a lazy learner, it needs no training before making real-time predictions, which makes it faster than SVM, linear regression, and the other algorithms that must be trained first.
It only requires two parameters – the value of K and your chosen distance function (Manhattan, Euclidean, etc.).
Because no training is needed, new data can be added seamlessly.

Cons

KNN isn't effective with high-dimensional data because having many dimensions makes it harder to calculate meaningful distances.
The bigger the dataset, the higher the prediction cost, because the distance from a new data point to every existing point has to be calculated.
KNN isn't good with categorical features because it is difficult to define distances between them.

KNN is one of the simpler algorithms, yet it is incredibly powerful. Training is one of the hardest parts of any ML algorithm, and KNN doesn't require it. It has been used in pattern recognition, finding similarities between documents, preprocessing and dimensionality reduction in computer vision (especially facial recognition), and developing recommender systems. In our final chapter, we will look at using unsupervised learning techniques to find patterns in data.

Chapter 8: Finding Patterns in Data

Unsupervised learning analyzes unlabeled datasets and clusters them using machine learning algorithms that find hidden patterns in the data without needing a human helping hand. We use unsupervised learning models for three primary problems:

Clustering – this data mining technique groups unlabeled data based on similarities or differences. For example, the K-Means clustering algorithm places similar data points into groups, where K indicates the granularity and size of the grouping. It works well for image compression, market segmentation, and similar tasks.
Association – this unsupervised learning method finds relationships between the variables in a given dataset. It does this by applying different rules and is usually used in recommendation engines and market basket analysis. For example, Netflix recommends movies and TV series based on what you watched previously.
Dimensionality Reduction – this technique is typically used when a given dataset has a high number of dimensions or features. It reduces the data inputs to a more manageable number while preserving the data's integrity, and it is often used in data preprocessing, for example when noise is removed from visual data to improve picture quality.

The Difference between Supervised and Unsupervised Learning

The primary difference is that supervised learning uses labeled data while unsupervised learning doesn't. A supervised learning algorithm learns from a training dataset, making iterative predictions and adjusting until it gets the right answers. Supervised models are more accurate than unsupervised ones, but humans must first label the data. For example, we can use supervised learning to predict the length of a commute based on weather conditions, time of day, and so on, but the model first has to be trained to know that bad weather extends commuting time.

In contrast, unsupervised learning models work alone, without human intervention, to discover the structure of unlabeled data. They do still require a certain amount of human involvement to validate the output variables. For example, an unsupervised learning model can determine that some online shoppers buy groups of products together, but a data analyst would need to check that it makes sense for a recommendation engine to group baby clothes with applesauce, diapers, and sippy cups.

Preparing the Data

We will use the iris dataset again to make our predictions. To recap, it has 150 records, four attributes – petal length, petal width, sepal length, and sepal width – and three classes – setosa, versicolor, and virginica. The four features are fed into the algorithm, which predicts the class each flower belongs to. Scikit-Learn is used to load the dataset, and Matplotlib is used for visualization.
The code below shows you how to explore the dataset:

# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt

# Loading dataset
iris_df = datasets.load_iris()

# Available methods on dataset
print(dir(iris_df))

# Features
print(iris_df.feature_names)

# Targets
print(iris_df.target)

# Target Names
print(iris_df.target_names)

# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length (column 2 of the data)

# Plotting
label = {0: 'red', 1: 'blue', 2: 'green'} # defined in the original tutorial but not used below
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

Output:

['DESCR', 'data', 'feature_names', 'target', 'target_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['setosa' 'versicolor' 'virginica']

The resulting plot shows three colors – violet representing setosa, green representing versicolor, and yellow representing virginica.

Clustering

Clustering divides the data into groups based on the similarity of their traits. When an input is provided for prediction, the algorithm finds the cluster it should belong to, based on its features, and makes the prediction. Let's look at a few types of clustering algorithms.

K-Means Clustering

K-Means is an iterative clustering algorithm that moves toward a locally optimal grouping with each iteration. To start with, we choose the desired number of clusters; because we know this example has three classes, we program the algorithm to group the data into three clusters by passing the n_clusters parameter. Three points are chosen at random to seed the three clusters, and the distance from each remaining point to the cluster centroids is used to assign it to the nearest cluster. After that, the centroids of all the clusters are recomputed and the process repeats. A centroid is the set of feature values that defines a group; examining a cluster's centroid lets us interpret what kind of group that cluster represents.

The K-Means algorithm is imported from Scikit-Learn, the features are fitted, and the predictions are made:

# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans

# Loading dataset
iris_df = datasets.load_iris()

# Declaring Model
model = KMeans(n_clusters=3)

# Fitting Model
model.fit(iris_df.data)

# Predicting a single input
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])

# Prediction on the entire data
all_predictions = model.predict(iris_df.data)

# Printing Predictions
print(predicted_label)
print(all_predictions)

Output:

[0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2
 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]
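The centroids the algorithm converges on can be inspected directly, and when the number of clusters isn't known in advance, one common heuristic (my addition, not from the original text) is to look for the "elbow" in the model's inertia:

from sklearn import datasets
from sklearn.cluster import KMeans

iris_df = datasets.load_iris()

# Each row of cluster_centers_ is one centroid in the four-feature space
model = KMeans(n_clusters=3, n_init=10).fit(iris_df.data)
print(model.cluster_centers_)

# Inertia (within-cluster sum of squares) for K = 1..6; the K where the
# decrease levels off - the 'elbow' - is a reasonable cluster count
for k in range(1, 7):
    print(k, KMeans(n_clusters=k, n_init=10).fit(iris_df.data).inertia_)

Note, too, that K-Means cluster numbers are arbitrary: cluster 0 doesn't automatically correspond to target class 0, which is why the predictions above don't line up label-for-label with iris_df.target.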
Hierarchical Clustering

As the name implies, the hierarchical algorithm builds a hierarchy of clusters. It starts by assigning every data point to a cluster of its own, then joins the two nearest clusters into one, and repeats until only a single cluster is left. Here's an example using a grain seeds dataset, which the code below loads directly from a URL:

# Importing Modules
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import pandas as pd

# Reading the DataFrame
seeds_df = pd.read_csv("https://raw.githubusercontent.com/vihar/unsupervised-learning-with-python/master/seeds-less-rows.csv")

# Remove the grain species from the DataFrame, save for later
varieties = list(seeds_df.pop('grain_variety'))

# Extract the measurements as a NumPy array
samples = seeds_df.values

"""
Perform hierarchical clustering on samples using the linkage() function
with the method='complete' keyword argument. Assign the result to mergings.
"""
mergings = linkage(samples, method='complete')

"""
Plot a dendrogram using the dendrogram() function on mergings, specifying
the keyword arguments labels=varieties, leaf_rotation=90, and leaf_font_size=6.
"""
dendrogram(mergings, labels=varieties, leaf_rotation=90, leaf_font_size=6)
plt.show()

The result is shown as a dendrogram plot.

K-Means vs. Hierarchical Clustering

There are a few differences worth mentioning:

K-Means can handle big data efficiently, while hierarchical clustering cannot: K-Means has linear time complexity, O(n), while hierarchical clustering has quadratic time complexity, O(n²).
K-Means starts from an arbitrary choice of clusters, so the results can differ each time the algorithm is run, whereas the results of hierarchical clustering are reproducible.
K-Means works well when the clusters are hyper-spherical in shape, like a circle in 2D or a sphere in 3D.
K-Means doesn't tolerate noisy data well, while hierarchical clustering can use it.

T-SNE Clustering

One of the best unsupervised learning algorithms for visualization is t-SNE, or t-distributed stochastic neighbor embedding. It maps higher-dimensional space to a two- or three-dimensional space so the data can be visualized. More specifically, each high-dimensional object is modeled by a two- or three-dimensional point in such a way that similar objects land on nearby points while dissimilar objects land, with high probability, on distant points. Here's t-SNE implemented on the iris dataset:

# Importing Modules
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Loading dataset
iris_df = datasets.load_iris()

# Defining Model
model = TSNE(learning_rate=100)

# Fitting Model
transformed = model.fit_transform(iris_df.data)

# Plotting 2d t-SNE
x_axis = transformed[:, 0]
y_axis = transformed[:, 1]
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()

The resulting plot has three colors: violet to represent setosa, green to represent versicolor, and yellow to represent virginica. The iris dataset has four features, so it is four-dimensional; here we transformed it into a two-dimensional figure. You can apply t-SNE to any dataset with n features in the same way.

DBSCAN Clustering

DBSCAN, or "density-based spatial clustering of applications with noise," is one of the more popular algorithms used in place of K-Means in predictive analysis tasks. It doesn't require the number of clusters as an input, but two parameters do need tuning. The Scikit-Learn implementation provides defaults for the min_samples and eps parameters, but you will generally need to adjust them. The min_samples parameter is the minimum number of data points a neighborhood must contain to be treated as a cluster, while the eps parameter is the maximum distance between two data points for them to be considered part of the same neighborhood.
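One common heuristic for choosing eps (my addition, not from the original text) is to plot each point's distance to its k-th nearest neighbor in sorted order and look for the sharp bend, or "knee," in the curve:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import NearestNeighbors

iris = load_iris()
k = 5  # a value in the region of the intended min_samples

# Distances to each point's k nearest neighbors (the first column is
# the point itself, at distance zero)
neighbors = NearestNeighbors(n_neighbors=k).fit(iris.data)
distances, _ = neighbors.kneighbors(iris.data)

# Sort the distances to the k-th neighbor and look for the bend
plt.plot(np.sort(distances[:, k - 1]))
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to 5th nearest neighbor')
plt.show()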
Here is a DBSCAN clustering implementation:

# Importing Modules
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Load Dataset
iris = load_iris()

# Declaring Model
dbscan = DBSCAN()

# Fitting
dbscan.fit(iris.data)

# Transforming Using PCA
pca = PCA(n_components=2).fit(iris.data)
pca_2d = pca.transform(iris.data)

# Plot based on cluster label
for i in range(0, pca_2d.shape[0]):
    if dbscan.labels_[i] == 0:
        c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
    elif dbscan.labels_[i] == 1:
        c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif dbscan.labels_[i] == -1:
        c3 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='b', marker='*')

plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
plt.title('DBSCAN finds 2 clusters and Noise')
plt.show()

The output is a plot visualizing the clusters and the noise points.

Conclusion

First, I want to thank you for taking the time to read my beginner's guide to machine learning with Python. Machine learning is a huge subject, and it wouldn't have been possible to cover every single topic in this guide. Instead, I chose the basic subjects that anyone who wants to get into machine learning should learn. As well as learning what machine learning is and how it all came about, you also learned some of the basic algorithms and tasks involved. You learned:

The different types of machine learning
How linear regression works
What the different types of machine learning classification are
What SVM is
How decision trees work for classification
What KNN is
How to use unsupervised learning to detect patterns in data

Once you understand the topics covered in this guide, you will be ready to take the next step and compound your learning with more complex subjects. There are plenty of guides and courses available on the internet – sign up and improve your skills. Thank you, once again, for choosing my guide. I hope it helped you and that you now have a better understanding of machine learning.

References

"5 Types of Regression Analysis and When to Use Them." 2021. Appier. January 15, 2021. https://www.appier.com/blog/5-types-of-regression-analysis-and-when-to-use-them/.
"A Complete Guide to Understand Classification in Machine Learning." 2021. Analytics Vidhya. September 9, 2021. https://www.analyticsvidhya.com/blog/2021/09/a-complete-guide-to-understand-classification-in-machine-learning/.
Navlani, Avinash. 2018a. "Decision Tree Classification in Python." DataCamp Community. 2018. https://www.datacamp.com/community/tutorials/decision-tree-classification-python.
———. 2018b. "Support Vector Machines in Scikit-Learn." DataCamp Community. 2018. https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python.
"Decision Tree." 2017. GeeksforGeeks. October 16, 2017. https://www.geeksforgeeks.org/decision-tree/.
"How to Use Unsupervised Learning with Python to Find Patterns in Data." 2019. Built In. 2019. https://builtin.com/data-science/unsupervised-learning-python.
"Machine Learning Tutorial | Machine Learning with Python." n.d. Javatpoint. https://www.javatpoint.com/machine-learning.
Real Python. n.d. "The K-Nearest Neighbors (KNN) Algorithm in Python."
Realpython.com. Accessed January 15, 2022. https://realpython.com/knn-python/.
Real Python. 2019. "Linear Regression in Python." Realpython.com. April 15, 2019. https://realpython.com/linear-regression-in-python/.
Robinson, Scott. 2018. "K-Nearest Neighbors Algorithm in Python and Scikit-Learn." Stack Abuse. February 15, 2018. https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-and-scikit-learn/.
Wakefield, Katrina. 2019. "A Guide to Machine Learning Algorithms and Their Applications." Sas.com. 2019. https://www.sas.com/en_gb/insights/articles/analytics/machine-learning-algorithms.html.
"What Is Machine Learning: Definition, Types, Applications and Examples." 2019. Potentia Analytics. December 19, 2019. https://www.potentiaco.com/what-is-machine-learning-definition-types-applications-and-examples/.