Book Description
Have you thought about a career in data science? It's where the money is right now,
and it's only going to become more widespread as the world evolves. Machine
learning is a big part of data science, and for those who already have
experience in programming, it's the next logical step.
Machine learning is a subfield of AI (Artificial Intelligence) and computer
science that uses data and algorithms to imitate human thinking and learning.
Through constant learning, a machine learning model gradually improves its
accuracy, eventually providing the optimal results for the problem it has
been assigned.
It is one of the most important parts of data science and, as big data continues to
expand, so too will the need for machine learning and AI.
Here's what you will learn in this quick guide to machine learning with Python for
beginners:
What machine learning is
Why Python is the best computer programming language for machine
learning
The different types of machine learning
How linear regression works
The different types of classification
How to use SVMs (Support Vector Machines) with Scikit-Learn
How Decision Trees work with Classification
How K-Nearest Neighbors works
How to find patterns in data with unsupervised learning algorithms
You will also find plenty of code examples to help you understand how everything
works.
If you are ready to take your programming further, scroll up, click Buy Now, and
find out why machine learning is the next logical step.
Python Machine Learning for
Beginners
All You Need to Know about Machine
Learning with Python
© Copyright 2022 - All rights reserved. Alex Campbell.
The contents of this book may not be reproduced, duplicated, or transmitted without direct written
permission from the author.
Under no circumstances will any legal responsibility or blame be held against the publisher for any
reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Legal Notice:
You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content within this book
without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. No warranties of any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical, or professional advice. Please
consult a licensed professional before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of the information contained
within this document, including, but not limited to, errors, omissions, or inaccuracies.
Table of Contents
Introduction
Prerequisites
Chapter 1: What Is Machine Learning?
How It Works
Machine Learning Features
Why We Need Machine Learning
The History of Machine Learning
Machine Learning Today
Chapter 2: The Different Types of Machine Learning
Which Algorithms to Use
Chapter 3: Linear Regression
What Is Regression?
Linear Regression
Implementing Linear Regression in Python
Chapter 4: The Different Types of Classification
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Chapter 5: Support Vector Machines with Scikit-Learn
How Does SVM Work?
Classifier Building in Scikit-learn
Chapter 6: Using Decision Trees
Decision Tree Algorithm
Attribute Selection Measure
Building a Decision Tree Classifier
Visualizing Decision Trees
Optimizing Decision Tree Performance
Chapter 7: K-Nearest Neighbors
Implementing KNN Algorithm With Scikit-Learn
Chapter 8: Finding Patterns in Data
The Difference between Supervised and Unsupervised Learning
Preparing the Data
Clustering
Conclusion
References
Introduction
Machine Learning (ML) and Artificial Intelligence (AI) are not just the latest
buzz words. They are now the most important part of the world we live in
and far more important and useful than science fiction would have you
believe. Without them, we simply couldn't process the huge amounts of data
we produce, at least not effectively or efficiently. Without them, more people
would be stuck doing repetitive, mundane jobs instead of putting their skills
to better use. Without them, companies couldn't make good business
decisions or draw up effective strategies and solutions.
While the human brain can process large amounts of data, it can only absorb
so much data at any one time. AI doesn't have these limitations, and it is far
more accurate, free from human error. However, it isn't the easiest
technology to develop and requires the right programming language. That
language is Python, for several reasons:
1. It has a wide range of libraries, modules that include pre-written code
for certain functions or actions. This means developers don't have to
start from scratch every time. Some of the best Python libraries are:
Scikit-Learn - handles regression, clustering, classification, and
other ML algorithms
Pandas – for higher-level analysis and data structures; helps with
merging and filtering data, and with gathering data from external
sources
Keras – a deep learning library for prototyping and calculations
TensorFlow – another deep learning library that helps set up,
train, and use artificial neural networks with vast datasets
Matplotlib – creates visualizations, like histograms, 2D plots,
charts, etc.
2. It is easy to learn, with intuitive syntax, which means it can be
used for ML with little effort.
3. It is a flexible language, offering a choice of scripting or OOP, with
no requirement to recompile source code, which allows changes to be
implemented quickly. It can also be combined with other languages
where the need arises.
4. It is platform-independent and can be used on more than 20 platforms. It
also isn't too difficult to transfer code from one platform to another.
5. It is simple to read, so all developers can understand anyone's code and
change it if they need to.
6. It offers a great choice of visualization tools so that data can be
presented in a human-readable format.
7. It also has one of the largest communities of any programming
language, with developers and others waiting to help and provide
resources.
Prerequisites
This book is aimed at those who already have programming experience with
Python. If you are completely new to programming, you really need to go and
learn the basics at the very least before attempting any of the coding and
examples in this book.
If you are experienced and are ready to take your knowledge up a notch, let's
dive in and learn all about machine learning using Python.
Chapter 1: What Is Machine Learning?
The real world is full of humans whose brains have a vast capacity to learn
from their experiences and machines or computers that work from human
instruction. One question that has long been asked is, "can a computer learn
from past data or experiences the same way humans do?" That question is
answered with machine learning.
One of the fastest-growing technologies, machine learning is all about
teaching computers to learn from past data. It does this by using algorithms
that build mathematical models and use historical information or data to
make predictions.
You already use machine learning in your everyday life, most likely without
even knowing it. It is used to filter spam emails from your inbox, image
recognition, auto-tagging in Facebook, speech recognition, recommender
systems in Netflix, Amazon, etc., and so much more.
Machine learning is a subset of Artificial Intelligence, and the term was first
coined in 1959 by Arthur Samuel. It can be defined as enabling machines to
learn automatically from data, use past experiences to improve their
performance, and make predictions without needing to be explicitly
programmed.
We use samples of historical data, called training data, to teach machine
learning algorithms how to build mathematical models that make decisions or
predictions. This branch of computer science combines statistics and
computer science to build predictive models, and they use or construct
algorithms to learn from the data. The more data we provide, the better the
machine learning model's performance.
How It Works
When we provide a machine learning system with sufficient historical data, it
builds a prediction model. When we give it new data, it will use that model to
predict the output for that data. The accuracy of that output is entirely
dependent on the amount of data we provide – the more data we give, the
more accurate the output will be.
Let's say we are dealing with a complex problem, and we need some
predictions. Rather than writing code from scratch, we can use pre-built,
generic algorithms. We give the data to those algorithms, and they use that
data to build the logic and provide the predicted outputs. In short, machine
learning changes how we think about problem-solving.
Machine Learning Features
Machine learning offers plenty of features:
It uses data to find patterns in datasets
It learns from past data and uses it to improve automatically
The technology is purely data-driven
Machine learning can be seen as similar to data mining in that both
deal with vast amounts of data
Why We Need Machine Learning
Machine learning is fast becoming a requirement for everyday life, and the
need for it increases as each day passes. Why do we need it so badly? For a
start, it can take the place of humans. That isn't a bad thing – some jobs are
incredibly mundane and time-consuming, and allowing machine learning to
take over means freeing up time that is better spent elsewhere. Conversely,
some jobs are far too complex for humans to do – we do have some
limitations and don't have a way to manually access the large amounts of data
needed for these jobs. That's where computer systems, specifically machine
learning, come into the picture.
We can give this vast amount of data to machine learning algorithms. They
explore the data, build their models, and predict the output. But the amount
of data we give these models isn't the only thing that affects their
performance – it also comes down to the cost function, and machine learning
can save us significant money and time.
We can also understand just how important machine learning is by looking at
its use cases. Some prominent uses are cyber fraud detection, self-driving
cars, Facebook friend suggestions, facial and speech recognition, and spam
email filtering. Plus, major companies like Amazon and Netflix use it to
analyze user preferences and provide product recommendations.
To recap the importance of machine learning, it can:
Analyze and learn from ever-increasing amounts of data
Solve problems too complex for humans
Make decision making more efficient in many sectors
Find patterns hidden in the data and extract information from it
The History of Machine Learning
Until 40 or 50 years ago, machine learning was the stuff of science fiction.
Today, it is a prominent part of our lives, making things much easier for us,
from self-driving cars to product recommendations and virtual assistants
(think Siri, Alexa, Cortana, etc.) However, while machine learning is still
relatively new, the idea behind it has been around for many years. Here are
some of the more important milestones in its history:
1834
The father of the computer, Charles Babbage, came up with the idea of a
device that could easily be programmed with punch cards. The device was
never built, but modern computers rely on its logical structure.
1936
Alan Turing devised the theory that machines can learn a set of instructions
and execute them.
1940s
This decade saw the invention of ENIAC, the first general-purpose
electronic computer, which was manually operated. It led to stored-program
computers such as EDSAC (1949) and EDVAC (1951), among others.
1943 - 1950
1943 saw the first modeling of a neural network with an electrical
circuit. Scientists began putting this idea to work in 1950, analyzing how
human neurons might work.
In 1950, Alan Turing also published a seminal paper on artificial
intelligence. The paper, "Computing Machinery and Intelligence," asked an
important question: can machines think?
1952
The pioneer of machine learning, Arthur Samuel, developed a program to
help an IBM computer play checkers. The more it played, the better it got.
1959
Arthur Samuel coined the term "machine learning."
That same year, a real-world problem became the subject of a neural network
application for the first time: adaptive filters were used to remove echoes
from phone lines.
1974 – 1980
This was a tough era for ML and AI researchers, which became known as the
"AI Winter." Machine translations failed, interest in AI began to wane, and
governments reduced research funding.
1985
Charles Rosenberg and Terry Sejnowski invented NETtalk, a neural network
that could teach itself to pronounce 20,000 words correctly in just seven days.
1997
The Deep Blue intelligent computer from IBM beat Garry Kasparov, the
Russian Chess Grandmaster, at his own game, becoming the first computer
ever to beat a reigning world champion at chess under tournament conditions.
2006
A computer scientist called Geoffrey Hinton renamed neural net research,
calling it 'deep learning.' Today it is one of the top-trending technologies.
2012
Google developed a deep neural network that could recognize images of cats
and humans from videos on YouTube.
2014
A chatbot called Eugene Goostman passed the Turing test, becoming the first
chatbot ever to convince human judges that it was a human, not a machine; it
fooled 33% of the judges on the panel.
In the same year, Facebook created its own deep neural network called
DeepFace, claiming it had the same precision as humans in recognizing
specific people.
2016
A computer program called AlphaGo beat Lee Sedol, the second-best Go player
in the world, in a game of Go. The following year, it would go on to beat Ke
Jie, the world's number one player.
2017
An intelligent system was built by Alphabet's Jigsaw team to learn to detect
online trolling. By reading millions and millions of comments from different
sites, it learned how to identify and stop online trolling.
Machine Learning Today
These days, machine learning has come a long way, and it continues to
advance thanks to research. Modern ML models are now used to predict
diseases and the weather, analyze the stock market, and much more. In the next
chapter, we will delve into the different types of machine learning we can use
today.
Chapter 2: The Different Types of Machine Learning
Like many things, there is more than one way to train a machine learning
algorithm, and each way comes with its own set of pros and cons. To
understand those pros and cons, we first need to look at the type of data they
use. There are two types of data in machine learning – labeled and unlabeled.
Labeled Data – contains input and output parameters in a pattern only
readable by a machine. However, a significant amount of human labor
is required to read that data to start with.
Unlabeled Data – at most one parameter is in machine-readable
form, which means human labor is not required to prepare it, but the
solutions are far more complex.
Machine learning algorithms are separated into four different types:
Supervised Learning
In supervised learning, the algorithm learns by example. The algorithm is
given a known dataset, which contains the desired inputs and outputs. It’s
down to the algorithm to work out how to map those inputs to the
outputs. The operator already knows the right answer to the
problem, but the algorithm will identify specific patterns in the data. Then it
will learn from its observations and use that to make predictions. If the
prediction is wrong, the operator corrects it, and the process is repeated until
the algorithm has achieved the highest possible level of accuracy and
performance.
Supervised learning tasks include:
Classification – in these tasks, the machine learning models draw
conclusions from observed values and select the best categories for
new data. For example, a program that determines whether an email is
spam or not must look at existing data and learn how to filter the
emails.
Regression – in these tasks, the models must understand and estimate
relationships between variables. Regression analysis focuses on one
dependent variable and a series of changing independent variables,
making regression one of the best tools for forecasting
and prediction.
Forecasting – in these tasks, predictions are made about the future
based on present and past data. This is commonly used in trend
analysis.
Semi-Supervised Learning
Semi-supervised learning differs from supervised learning only in that it
uses both labeled and unlabeled data. The labeled data has tags that allow the algorithm
to understand the data, while the unlabeled data doesn’t have any
information. Using a combination of labeled and unlabeled data means that
the algorithms learn how to put labels on unlabeled data.
Unsupervised Learning
In unsupervised learning, the algorithm examines the data looking for
patterns with no human instruction and no key to learn from. Instead, it
analyzes the data given to it and determines relationships and correlations.
Unsupervised learning leaves the machine to interpret vast data and
determine how to deal with it by organizing the data to describe its structure.
This could be clustering or arranging it in another way that makes it easier to
read. The more data an unsupervised learning algorithm accesses, the better
its decision-making gets.
Unsupervised learning tasks include:
Clustering - sets of data are grouped by similarity, based on predefined criteria. This is useful when data needs to be segmented into
multiple groups and analysis performed on each one to find the
patterns.
Dimension Reduction – this reduces how many variables need to be
considered to find the required information.
Reinforcement Learning
Reinforcement learning revolves around controlled learning processes where
a specific set of actions is provided to the algorithm, with the parameters and
the required outputs. Because the rules are pre-defined, the algorithm can
explore the possibilities and options, monitoring each result and evaluating
them to determine the optimal one. Reinforcement learning is all about trial
and error. Past experiences are studied, and the algorithm continually adapts
its approach until the best result is achieved.
Which Algorithms to Use
Choosing the right algorithm depends on a few factors,
such as:
Size of data
Quality of data
Diversity of data
The answers required to derive useful insights from the data
Algorithm accuracy
How long it takes to train
The required parameters
Data points
This is not an exhaustive list, and choosing the right one is a combination
of specification, business need, time available, and experimentation. Even
the best data scientists in the world cannot tell you the best algorithm to
use right off the bat; it requires experimentation. Below is a list of the
most popular machine learning algorithms (a quick comparison sketch follows
the list):
Naïve Bayes Classifier (Supervised learning, classification) – based on
Bayes' Theorem, this algorithm classifies every value independently.
This algorithm uses probability to help us predict categories or classes
based on a provided feature set. It may be a simple algorithm, but it
works very well and tends to be used a lot because it outperforms many
of the more sophisticated algorithms.
K-Means Clustering (Unsupervised learning, clustering) – this
algorithm places unlabeled data into categories. It searches the data and
finds groups, with the number of groups represented by the variable K. It
iteratively assigns each data point to one of the K groups based on the
provided features.
Support Vector Machine (Supervised learning, classification) – these
algorithms are used in regression and classification analysis. The
algorithm is given a set of training data, with each set belonging to one
of two categories. The algorithm builds a model that can take new data
and assign it to one of these categories.
Linear Regression (Supervised learning, regression) – this is
regression at its most basic level, allowing us to understand existing
relationships between continuous variables.
Logistic Regression (Supervised learning, classification) – this type of
regression estimates an event’s probability based on previous data. It
covers binary dependent variables, where there can only be two values,
1 and 0, to represent the outcomes.
Artificial Neural Networks (Reinforcement learning) – ANNs are
made up of units in layers. Each layer connects to those on either side
of it. The inspiration for these comes from the brain and other
biological systems and how they process information. Essentially, they
are processing elements, all interconnected and working together to
solve a problem.
Decision Trees (Supervised learning, classification and regression) –
decision trees are flow charts with a tree structure. A branching method
is used to illustrate all possible outcomes of a decision, with each node
representing a test on a variable. Each branch is that test’s outcome.
Random Forests (Supervised learning, classification and regression) –
these come under the ensemble learning methods, where several
algorithms are combined to get better results for regression and
classification tasks, among others. Each classification algorithm is
weak on its own but, combined with others, it can give excellent
results. It begins with a decision tree with an input at the top. The
algorithm traverses the tree, segmenting the data into ever smaller sets
based on certain variables.
K-Nearest Neighbors (Supervised learning) – this algorithm is used to
estimate the likelihood of a data point belonging to one group or
another. It examines the data points surrounding a single point to see
what group it is in. For example, a point is on a grid, and KNN wants
to determine whether it belongs to group A or group B. It looks at the
nearest data points to see which group most of the points are in.
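Since choosing among these algorithms comes down to experimentation, one practical approach is to compare several candidates with cross-validation. The following is a minimal sketch, not from the original text, using a synthetic dataset as a stand-in for real data; the scores it prints are purely illustrative:
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
# Synthetic dataset standing in for your own data
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
# Three of the candidate classifiers described above
candidates = {
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}
# 5-fold cross-validation gives a quick, rough accuracy comparison
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())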
As you can see, choosing the right algorithm is quite involved, and to help
you out, we will go into more detail in the coming chapters for some of these
algorithms, starting with linear regression.
Chapter 3: Linear Regression
Linear regression is one of the basic techniques anyone new to machine
learning and statistical techniques should study before moving on to more
complex methods. First, let’s take a look at what regression is.
What Is Regression?
Regression is a technique that looks for relationships between variables. For
example, you could look at details of employees in a specific company and
determine the relationship between their salary and things like education,
experience, age, where they live, etc. These are known as features.
Each employee’s data represents an observation in this type of regression
problem. There is a presumption that the features are classed as independent
while the salary is dependent on the features. In the same way, you could
examine house prices to determine a mathematical dependence based on their
features, such as the number of bedrooms, living area, how close they are to
the city center, etc.
Typically, regression analysis considers some phenomenon of interest and has
several observations, each with at least two features. If we follow
the assumption that at least one feature depends on the others, we
try to find a relationship between them. In other words, we need a
function that maps some features or variables to the others sufficiently well.
Dependent features – these are the dependent variables, otherwise
known as the outputs or the responses
Independent features – these are the independent variables, otherwise
known as the inputs or the predictors
Typically, a regression problem will have one dependent variable that is
continuous and unbounded. The inputs, however, may be discrete,
continuous, or even categorical data, like brand, nationality,
gender, etc. Convention recommends using y to denote the outputs and x for
the inputs. For two or more independent variables, the vector x =
(x₁, …, xᵣ) can be used, where r denotes the number of inputs.
When Is Regression Needed?
Regression is usually used to solve problems asking whether and how something
influences something else, or when trying to find the relationship
between multiple variables. For example, regression can be used to work out
if gender or experience affects salaries, and to what extent.
It is also used to forecast responses using new predictors. For example, given
the time of day, external temperature, and the number of people in a
household, you could try to predict a household’s electricity consumption for
the next hour.
Many different fields make use of regression, including computer science,
economics, the social sciences, etc. Every day, it becomes more important as
more data becomes available and we become more aware of how to use
it.
Linear Regression
One of the most widely used techniques in regression, and possibly the most
important, is linear regression. It is one of the easiest regression methods
to use, with the added advantage that its results are easy to interpret.
Problem Formulation
Let's say we want to implement linear regression of a dependent variable y
on x = (x₁, …, xᵣ), a set of independent variables where r
denotes the number of predictors. We assume the relationship between y
and x is linear:
y = β₀ + β₁x₁ + ⋯ + βᵣxᵣ + ε
That is the regression equation. The regression coefficients are β₀,
β₁, …, βᵣ, while the random error is ε.
Linear regression calculates the estimators of the regression coefficients,
also called the predicted weights and denoted b₀, b₁, …, bᵣ. These weights
define the estimated regression function f(x) = b₀ + b₁x₁ + ⋯ + bᵣxᵣ, which
should capture the dependencies between the inputs and outputs well.
Each observation i = 1, …, n has a predicted or estimated response
f(xᵢ), which should be as near as possible to yᵢ, the actual
corresponding response. The differences yᵢ − f(xᵢ) for all the observations
i = 1, …, n are known as the residuals.
Regression is all about working out the best predicted weights, i.e., the
weights that correspond to the smallest residuals.
So how do you get the best weights? The SSR (sum of squared residuals) for
all the observations i = 1, …, n must be minimized:
SSR = Σᵢ (yᵢ − f(xᵢ))²
This is known as the method of ordinary least squares (OLS).
Regression Performance
The actual responses yᵢ, i = 1, …, n vary according to their
dependence on the predictors xᵢ. However, we also consider the output's
inherent variance.
The coefficient of determination, R², indicates how much of the variation in
y can be explained by the dependence on x using the specific regression
model. The larger R² is, the better the fit, and the better the model can
explain the output's variation with different inputs.
R² = 1 corresponds to SSR = 0. This tells you that you have the best fit
because the predicted and actual response values fit perfectly.
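To make the method concrete, here is a minimal NumPy sketch, not part of the original text, that computes the OLS weights b₀ and b₁ for one input variable via the normal equation and then derives SSR and R² exactly as defined above. It uses the same sample data as the Scikit-Learn example later in this chapter, so the printed R² should match the .score() value shown there:
import numpy as np
# The same sample data used in the Scikit-Learn example below
x = np.array([5.0, 15.0, 25.0, 35.0, 45.0, 55.0])
y = np.array([5.0, 20.0, 14.0, 32.0, 22.0, 38.0])
# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones_like(x), x])
# Normal equation: solve (X^T X) b = X^T y, which minimizes the SSR
b = np.linalg.solve(X.T @ X, X.T @ y)
print('weights b0, b1:', b)
# Residuals, SSR, and the coefficient of determination R^2
residuals = y - X @ b
ssr = np.sum(residuals ** 2)
sst = np.sum((y - y.mean()) ** 2)
print('R^2:', 1 - ssr / sst)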
Implementing Linear Regression in Python
Now you know what linear regression is all about, let's look at how to
implement it in Python. It is really nothing more difficult than importing
the right libraries and using their classes and functions.
Linear Regression Packages
A fundamental package is NumPy, a scientific package that lets you work with
high-performance single- and multi-dimensional arrays. It is open-source and
also offers plenty of useful mathematical routines.
Scikit-Learn is another useful machine learning package built on NumPy and
other packages. Scikit-Learn gives you what you need to preprocess the data,
reduce the dimensionality, implement the regression, clustering or
classification, and much more. It is also an open-source package.
Simple Linear Regression
Let's dive in with simple linear regression. When you implement any linear
regression, you need to follow five steps:
1. Import the correct packages and classes
2. Provide the model with data and do the required transformations
3. Build a regression model, fitting it with existing data
4. Check the model fitting results so you know if you have the right one
5. Apply the model to get your predictions.
Step 1: Import the correct packages and classes
First, you need to import the NumPy package and the LinearRegression class
from sklearn.linear_model:
import numpy as np
from sklearn.linear_model import LinearRegression
That gives you everything you need to implement the linear regression.
NumPy's fundamental data type is numpy.ndarray, which is the array type.
For the remainder of this chapter, we will use 'array' to refer to all
instances of numpy.ndarray.
We use the sklearn.linear_model.LinearRegression class to do both linear and
polynomial regression, making the predictions accordingly.
Step 2: Provide the data
Next, we need to define the data we are working with. The inputs
(the regressors, x) and the output (the response, y) must be arrays or
similar objects – this is the easiest way to provide the data required for the
regression:
x = np.array([5, 15, 25, 35, 45, 55]).reshape((-1, 1))
y = np.array([5, 20, 14, 32, 22, 38])
You should now have two arrays – input x and output y. The input array must
be two-dimensional, which means having one column and however many rows are
required, so we call .reshape() on x. The argument (-1, 1) in .reshape() asks
for as many rows as needed and exactly one column.
x and y now look like this:
>>>
>>> print(x)
[[ 5]
[15]
[25]
[35]
[45]
[55]]
>>> print(y)
Output:
[ 5 20 14 32 22 38]
x has two dimensions, and x.shape is (6, 1), while y has one dimension, and
y.shape is (6,).
Step 3: Build your model and fit it
Next, we need to build our linear regression model and use the existing data
to fit it. First, an instance of the LinearRegression class needs to be created to
represent our model:
model = LinearRegression()
The variable called model is created as an instance of LinearRegression,
which can take several parameters, all optional:
fit_intercept – a Boolean with a default of True. It determines
whether the intercept b₀ should be calculated (True) or considered
equal to zero (False)
normalize – a Boolean with a default of False. It determines whether
the input variables should be normalized (True) or not (False)
copy_X – a Boolean with a default of True. It determines whether the
input variables should be copied (True) or not (False)
n_jobs – an integer or None, which is the default. It represents how
many jobs are used in parallel computation. None indicates one job,
while -1 indicates all processors used.
Our example will use the defaults for all the parameters.
Now we need to use our model. First, .fit() needs to be called:
model.fit(x, y)
Once this has been called, we can calculate the best weight values for b₀
and b₁. To do this, x and y (the existing input and output) are used as the
arguments. Simply put, .fit() fits our model and returns self (the model
itself), which is why the last two statements can be replaced with
one:
model = LinearRegression().fit(x, y)
This is just a shorter version of the other two statements, but it does exactly
the same.
Step 4: Get the results
Once the model has been fitted, the results can be obtained to tell you if the
model works well. We call .score() on the model to get R², the
coefficient of determination:
>>>
>>> r_sq = model.score(x, y)
>>> print('coefficient of determination:', r_sq)
Output:
coefficient of determination: 0.715875613747954
When you apply .score(), the arguments are the regressor x and the response
y, and the return value is R².
The model’s attributes are .intercept_, representing b₀ (the intercept),
and .coef_, representing b₁ (the slope):
>>>
>>> print('intercept:', model.intercept_)
Output:
intercept: 5.633333333333329
>>> print('slope:', model.coef_)
Output:
slope: [0.54]
The code shows how to get b₀ and b₁. Note that .coef_ is an
array and .intercept_ is a scalar.
The value b₀ = 5.63 shows that the model predicts a response of 5.63 when x
is zero, while b₁ = 0.54 indicates that the predicted response rises by 0.54
when x increases by one.
Also, note that y may be provided as a two-dimensional array, and a similar
result would be obtained. It might look like this:
>>>
>>> new_model = LinearRegression().fit(x, y.reshape((-1, 1)))
>>> print('intercept:', new_model.intercept_)
intercept: [5.63333333]
>>> print('slope:', new_model.coef_)
Output:
slope: [[0.54]]
You can see that this example is much like the last one but, here,
.intercept_ is a one-dimensional array containing the single element b₀,
while .coef_ is a two-dimensional array containing the single element b₁.
Step 5: Predict the response
When you are satisfied with your model, you use existing or new data to
make predictions. To get the predicted response, you use .predict():
>>>
>>> y_pred = model.predict(x)
>>> print('predicted response:', y_pred, sep='\n')
Output:
predicted response:
[ 8.33333333 13.73333333 19.13333333 24.53333333 29.93333333
35.33333333]
When you apply .predict(), the regressor is passed as an argument, and you
get the predicted response that corresponds to it.
>>>
>>> y_pred = model.intercept_ + model.coef_ * x
>>> print('predicted response:', y_pred, sep='\n')
Output:
predicted response:
[[ 8.33333333]
[13.73333333]
[19.13333333]
[24.53333333]
[29.93333333]
[35.33333333]]
In this example, each element of x is multiplied by model.coef_, and
model.intercept_ is added to the product.
Here, the only difference in the output from the last example is in the
dimensions: in the first example, the predicted response was
one-dimensional, while in this one, it is two-dimensional.
If the number of dimensions of x were reduced to one, both examples would
give the same result. To do this, replace x with one of the following when
you multiply it by model.coef_:
x.reshape(-1)
x.flatten(), or
x.ravel().
In practice, we typically use regression for forecasting, which means fitted
models can be used to calculate outputs based on new inputs.
>>>
>>> x_new = np.arange(5).reshape((-1, 1))
>>> print(x_new)
[[0]
[1]
[2]
[3]
[4]]
>>> y_new = model.predict(x_new)
>>> print(y_new)
Output:
[5.63333333 6.17333333 6.71333333 7.25333333 7.79333333]
In this example, we applied .predict() to x_new (a regressor), and the
result is y_new. The array, containing elements from 0 (inclusive) to 5
(exclusive), is generated using np.arange(5); it holds 0, 1, 2, 3, 4.
Let’s now look at the different classification types.
Chapter 4: The Different Types of Classification
Classification is a common type of machine learning task used to assign
class labels to examples, determining whether an example is of
one type or another. Perhaps the most common example of this is filtering
spam emails, where an email is classified as spam or not spam. Throughout
your journey, you will come across plenty of challenges, and there are several
approaches in terms of the model type that fits each challenge.
Classification Predictive Modeling
Typically, classification refers to problems where the predicted result is a
type of class label obtained from the provided data. Some of the more
popular types of challenges include:
Spam email classification – determining if an email is spam or not
Handwriting classification – determining if a handwritten character is a
known one or not
User behavior classification – determining whether recent behavior
indicates churn or not
All classification models need a training dataset containing plenty of input
and output examples. The model uses this dataset to train itself. The data
must have all possible scenarios for the specific problem, and each label must
be represented by enough data for the model to learn from and train itself.
Often, the class labels are returned as string values, which means they must
be encoded into an integer—for example, 0 to represent spam and 1 to
represent not spam.
The only way to determine the best model for the problem is to experiment
and work out the best configuration and algorithm to provide the best
performance for the problem. In predictive modeling, the algorithms are all
compared against their results. One of the best metrics used to evaluate a
model’s performance on class label predictions is classification accuracy. It
may not be the best parameter, but it is certainly one of the best places to start
from in most classification tasks.
Rather than a class label, some tasks might predict class membership
probabilities of specified inputs. In cases like this, one of the most helpful
indicators of model accuracy is the ROC curve. In your machine learning
journey, you will probably come across these four types of classification
tasks and predictive models:
Binary classification
Multi-label classification
Multi-class classification
Imbalanced classification
Let’s look into each one, with code examples to show you how they work.
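As a quick illustration of the ROC-based evaluation mentioned above, here is a minimal hedged sketch; the synthetic dataset and the choice of logistic regression as the probabilistic model are my own assumptions, not examples from the book:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
# Synthetic binary dataset for illustration only
X, y = make_classification(n_samples=1000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Fit a probabilistic classifier and predict class membership probabilities
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # probability of class 1
# ROC AUC summarizes performance across all classification thresholds
print('ROC AUC:', roc_auc_score(y_test, proba))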
Binary Classification
Binary classification covers tasks that can provide an output of one of two
class labels. Typically, one class label is the normal state, while the other is
the abnormal state. We can understand this better with the following
examples:
Detecting spam emails – normal state = not spam, while abnormal
state = spam
Churn prediction – normal state = not churned, while abnormal
state = churned
Conversion prediction – normal state = purchased an item, while
abnormal state = didn’t purchase an item
Medical test prediction – normal state = no cancer detected, while
abnormal state = cancer detected
The notation typically followed is that the normal state is assigned 0, while
the abnormal state is assigned 1. A model can also be created to predict a
Bernoulli probability for the output, i.e., the probability that an example
belongs to the abnormal state, which is then converted to a discrete 1 or 0.
The commonly used binary classification algorithms are:
K-Nearest Neighbors
Logistic Regression
Support Vector Machine
Decision Trees
Naive Bayes
Some of these algorithms are designed specifically for binary classification
and do not have native support for any more than two class types. This
includes Logistic Regression and Support Vector Machines.
To show you how binary classification works, we will create a dataset and
apply the classification to it. We will generate a binary classification
dataset using the make_blobs() function from Scikit-Learn. In our example,
we have a dataset containing 5,000 examples, each belonging to one of the two
classes and described by two input features:
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
X, y = make_blobs(n_samples=5000, centers=2, random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(5000, 2) (5000,)
Counter({1: 2500, 0: 2500})
[-11.5739555 -3.2062213] 1
[0.05752883 3.60221288] 0
[-1.03619773 3.97153319] 0
[-8.22983437 -3.54309524] 1
[-10.49210036 -4.70600004] 1
[-10.74348914 -5.9057007 ] 1
[-3.20386867 4.51629714] 0
[-1.98063705 4.9672959 ] 0
[-8.61268072 -3.6579652 ] 1
[-10.54840697 -2.91203705] 1
In this example, a dataset is created with 5,000 samples divided into two
elements – input X and output y. The resulting distribution shows
that any instance can belong to class 0 or class 1, and each class holds
approximately 50% of the instances.
The first 10 examples are displayed with numeric input values and a target
value of an integer representing class membership.
The input variables are then shown on a scatter plot, with color-coded points
based on the class values.
Multi-Class Classification
As the name indicates, these problems do not have two fixed labels; instead,
they can have multiple labels. Some of the most common multi-class
classification types are:
Facial classification
Plant species classification
Optical character classification
There is no abnormal or normal outcome; the result belongs to any one of
multiple known classes. There may also be a huge number of labels, such as
predicting how closely an image resembles one of
potentially thousands of faces in a facial recognition system.
You could also consider a challenge whereby the next word in a sequence
needs to be predicted as a multi-class classification problem. In a scenario
like this, all the words define every possible number of classes and could run
into the millions.
Categorical distribution is typically used for these types of models, whereas
Bernoulli is used for binary classification. In categorical distributions, events
can have several results or endpoints, and the models make predictions on the
input probability regarding the individual output labels.
The following algorithms are commonly used for multi-class classification:
K-Nearest Neighbors
Naive Bayes
Decision trees
Gradient Boosting
Random Forest
The binary classification algorithms can also be used with multi-class
classification based on the notion of one vs. rest – one class vs. all the others
– or one vs. one – one model for a pair of classes.
One vs. Rest – one binary model is fit for each class
against all the others
One vs. One – one binary model is defined for
each pair of classes (see the sketch below)
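To show what these two strategies look like in practice, here is a minimal sketch, not from the original text, wrapping a binary SVM in Scikit-Learn's one-vs-rest and one-vs-one meta-classifiers on the same kind of blob dataset used in the example that follows:
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC
# Four-class blob dataset, as in the example below
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)
# One vs. Rest: one binary SVC per class, each fit against all the others
ovr = OneVsRestClassifier(SVC()).fit(X, y)
# One vs. One: one binary SVC per pair of classes
ovo = OneVsOneClassifier(SVC()).fit(X, y)
print(ovr.predict(X[:5]), ovo.predict(X[:5]))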
As in binary classification, we will use the make_blobs() function:
from numpy import where
from collections import Counter
from sklearn.datasets import make_blobs
from matplotlib import pyplot
X, y = make_blobs(n_samples=1000, centers=4, random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(1000, 2) (1000,)
Counter({1: 250, 2: 250, 0: 250, 3: 250})
[-10.45765533 -3.30899488] 1
[-5.90962043 -7.80717036] 2
[-1.00497975 4.35530142] 0
[-6.63784922 -4.52085249] 3
[-6.3466658 -8.89940182] 2
[-4.67047183 -3.35527602] 3
[-5.62742066 -1.70195987] 3
[-6.91064247 -2.83731201] 3
[-1.76490462 5.03668554] 0
[-8.70416288 -4.39234621] 1
Here, it’s clear that we have more than two types of classes, and they can be
separately classified into all the different types.
Multi-Label Classification
Multi-label classification covers classification tasks where two or more class
labels need to be assigned. These are class labels predicted for each example.
One example would be the classification of photos, where one image may
contain multiple objects, such as fruit, an animal, a person, etc. The biggest
difference lies in the fact that these models can predict more than one label.
Multi-class and binary classification models cannot be used directly for
multi-label classification; the algorithm must be modified to predict
multiple labels, making this more challenging than a simple yes-or-no
decision.
Some of the algorithms commonly used in multi-label classification are:
Multi-label Random Forests
Multi-label Decision trees
Multi-label Gradient Boosting
You could also take a different approach, where a separate classification
algorithm predicts the labels for each class type. In our example, the
multi-label classification dataset is generated with Scikit-Learn's
make_multilabel_classification() function; the code below creates the dataset
with 1,000 samples and 4 class types:
from sklearn.datasets import make_multilabel_classification
X, y = make_multilabel_classification(n_samples=1000, n_features=3,
                                      n_classes=4, n_labels=4, random_state=1)
print(X.shape, y.shape)
for i in range(10):
    print(X[i], y[i])
Output:
(1000, 3) (1000, 4)
[ 8. 11. 13.] [1 1 0 1]
[ 5. 15. 21.] [1 1 0 1]
[15. 30. 14.] [1 0 0 0]
[ 3. 15. 40.] [0 1 0 0]
[ 7. 22. 14.] [1 0 0 1]
[12. 28. 15.] [1 0 0 0]
[ 7. 30. 24.] [1 1 0 1]
[15. 30. 14.] [1 1 1 1]
[10. 23. 21.] [1 1 1 1]
[10. 19. 16.] [1 1 0 1]
Imbalanced Classification
Imbalanced classification is used for tasks with an uneven distribution of
examples in each class. Typically, these are binary classification tasks where
a large percentage of the training set is classified as normal, and the rest are
abnormal.
This type of classification is commonly used in:
Fraud detection
Medical diagnosis
Outlier detection
Special techniques are used to handle these problems as binary
classification tasks. You can choose between over-sampling the minority
class or under-sampling the majority class; SMOTE over-sampling and random
under-sampling are two of the best-known examples.
When you are fitting the model on the training dataset, you can also use
special modeling algorithms that focus more on the smaller class. This
includes cost-sensitive algorithms (a short sketch follows this list), such as:
Cost-Sensitive Logistic Regression
Cost-Sensitive Decision Trees
Cost-Sensitive Support Vector Machines
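As promised, here is a short sketch of the cost-sensitive idea. It is an assumption on my part that Scikit-Learn's class_weight parameter is a reasonable stand-in for the cost-sensitive algorithms listed above; it makes errors on the minority class cost more during training:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
# Imbalanced synthetic dataset: roughly 99% class 0, 1% class 1
X, y = make_classification(n_samples=1000, weights=[0.99, 0.01],
                           random_state=1)
# class_weight='balanced' penalizes mistakes on the rare class more heavily,
# a simple form of cost-sensitive logistic regression
clf = LogisticRegression(class_weight='balanced').fit(X, y)
print(clf.predict(X[:10]))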
Once the model is chosen, we need to assess and score it. We can do that
using the Precision, Recall, or F-Measure scores. We need a dataset developed
for the problem, so we’ll generate a synthetic, imbalanced binary
classification dataset containing 1,000 samples:
from numpy import where
from collections import Counter
from sklearn.datasets import make_classification
from matplotlib import pyplot
X, y = make_classification(n_samples=1000, n_features=2,
                           n_informative=2, n_redundant=0, n_classes=2,
                           n_clusters_per_class=1, weights=[0.99, 0.01],
                           random_state=1)
print(X.shape, y.shape)
counter = Counter(y)
print(counter)
for i in range(10):
    print(X[i], y[i])
for label, _ in counter.items():
    row_ix = where(y == label)[0]
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], label=str(label))
pyplot.legend()
pyplot.show()
Output:
(1000, 2) (1000,)
Counter({0: 983, 1: 17})
[0.86924745 1.18613612] 0
[1.55110839 1.81032905] 0
[1.29361936 1.01094607] 0
[1.11988947 1.63251786] 0
[1.04235568 1.12152929] 0
[1.18114858 0.92397607] 0
[1.1365562 1.17652556] 0
[0.46291729 0.72924998] 0
[0.18315826 1.07141766] 0
[0.32411648 0.53515376] 0
Here, the label distribution can be seen, along with a serious class
imbalance, with 17 samples belonging to one class and the remaining 983
belonging to the other. As expected, the majority belongs to class 0.
In the next chapter, we’ll look at SVMs or Support Vector Machines using
Scikit-Learn.
Chapter 5: Support Vector Machines with Scikit-Learn
This chapter will take an in-depth look at a popular algorithm used in
supervised machine learning - Support Vector Machines.
SVM is often highly accurate, frequently more so than logistic regression,
decision trees, and other similar classifiers. Perhaps its best-known feature
is the kernel trick, which can handle non-linear input spaces. It is used in
several applications, including intrusion detection, facial recognition,
email, web page, and news article classification, gene classification, and
handwriting recognition.
SVM is one of the most exciting algorithms with simple concepts. The
classifier uses a hyperplane with the biggest margin to separate the data
points, finding the optimal hyperplane for classifying new data points.
Support Vector Machines
Typically, a Support Vector Machine is considered a classification problem
approach but can easily be employed in regression problems. It also handles
multiple categorical and continuous variables with ease.
The hyperplane is constructed in multidimensional space to keep the different
classes separate. An optimal hyperplane is generated iteratively and used for
minimizing the risk of an error. The primary idea behind the SVM is to find
the MMH (Maximum Marginal Hyperplane) that efficiently splits the dataset
into classes.
Support Vectors - These are the data points nearest to the hyperplane,
and they calculate margins to ensure the separating line is better
defined. The support vectors are relevant to the classifier's
construction.
Hyperplane – This is a decision plane separating a set of objects with
different class memberships
Margin – This is the gap between a pair of lines on the nearest class
points and is calculated as the perpendicular distance between the line
and the closest points or support vectors. The larger the margin
between the classes, the better, while smaller margins are considered
bad.
How Does SVM Work?
The SVM's primary objective is to split the dataset as efficiently as
possible. The distance between the nearest points is called the margin, and
the idea is to choose the hyperplane with the biggest possible margin between
the dataset's support vectors.
Non-Linear and Inseparable Planes
A linear hyperplane cannot be used to solve every problem. In situations
where it cannot, a kernel trick is used to turn the input space into a
higher-dimensional space. The data points can then be plotted on, say, the
x-axis and the z-axis, allowing linear separation to segregate the points.
SVM Kernels
In practice, a kernel is used to implement the SVM algorithm, transforming
the input data space into the required form. This is done using a technique
known as the 'kernel trick,' where the low-dimensional input space is turned
into a higher-dimensional space. In simple terms, more dimensions are added
to a non-separable problem, converting it into a separable one. This is
incredibly useful in problems revolving around non-linear separation, and it
helps construct more accurate classifiers.
Linear Kernel
You can use a linear kernel as a normal dot product of two given
observations: the product of the two vectors is the sum of each input value
pair multiplied together.
K(x, xi) = sum(x * xi)
Polynomial Kernel
These are generalized forms of linear kernels, and they can distinguish
curved or otherwise non-linear input spaces.
K(x,xi) = 1 + sum(x * xi)^d
In this, d indicates the degree of the polynomial. d = 1 is much the same as
the linear transformation, and the degree must be specified manually in the
algorithm.
Radial Basis Function Kernel
This is one of the most popular kernel functions in SVM classification and
can map input spaces in infinite-dimensional space.
K(x, xi) = exp(-gamma * sum((x – xi)^2))
Here, the parameter is gamma, ranging from 0 to 1. Higher gamma values fit
the training set more closely, risking over-fitting. A common default value
is gamma=0.1, and, as with the polynomial kernel, the gamma value must be
specified manually in the algorithm.
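The following minimal sketch, not taken from the book's examples, shows how each of these kernels is selected in Scikit-Learn; the breast cancer dataset (used again in the next section) simply gives the classifiers something to fit, and the training-set scores are illustrative only:
from sklearn import svm
from sklearn.datasets import load_breast_cancer
# Small real dataset, used again in the next section
X, y = load_breast_cancer(return_X_y=True)
# Linear kernel: K(x, xi) = sum(x * xi)
linear_clf = svm.SVC(kernel='linear').fit(X, y)
# Polynomial kernel of degree d (the degree must be given manually)
poly_clf = svm.SVC(kernel='poly', degree=3).fit(X, y)
# RBF kernel with a manually specified gamma value
rbf_clf = svm.SVC(kernel='rbf', gamma=0.1).fit(X, y)
print(linear_clf.score(X, y), poly_clf.score(X, y), rbf_clf.score(X, y))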
Classifier Building in Scikit-learn
Now you know the theory behind SVMs, it's time to look at how to
implement it in Python using Scikit-Learn.
We'll be using the well-known breast cancer dataset, a popular binary
classification problem. Its features are computed from digitized images of
FNA (fine needle aspirates) of breast masses and describe the characteristics
of the cell nuclei in the images.
There are 30 features in the dataset:
mean radius
mean texture
mean perimeter
mean area
mean smoothness
mean compactness
mean concavity
mean concave points
mean symmetry
mean fractal dimension
radius error
texture error
perimeter error
area error
smoothness error
compactness error
concavity error
concave points error
symmetry error
fractal dimension error
worst radius
worst texture
worst perimeter
worst area
worst smoothness
worst compactness
worst concavity
worst concave points
worst symmetry
worst fractal dimension
It also has a target, which is the type of cancer. There are two types –
malignant and benign. We want to build a model to classify the cancer types
– the dataset can be downloaded from the Scikit-Learn library or the UCI
Machine Learning Library.
Step One - Loading Data
First, we need to load the dataset – we'll get it from the Scikit-Learn library:
#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
cancer = datasets.load_breast_cancer()
Step Two - Exploring Data
Once the dataset is loaded, we can look at it to see more information about it.
We'll look at the 30 features and the target names:
# print the names of the 30 features
print("Features: ", cancer.feature_names)
# print the label types of cancer ('malignant', 'benign')
print("Labels: ", cancer.target_names)
Output:
Features: ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']
Let's look a bit deeper and check the dataset's shape:
# print data(feature)shape
cancer.data.shape
Output:
(569, 30)
And now we can look at the first five records in the features:
# print the cancer data features (first five records)
print(cancer.data[0:5])
Output:
[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01
3.001e-01
1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00
1.534e+02
6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03
2.538e+01
1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01
2.654e-01
4.601e-01 1.189e-01]
[2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02
8.690e-02
7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00
7.408e+01
5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03
2.499e+01
2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01
1.860e-01
2.750e-01 8.902e-02]
[1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01
1.974e-01
1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00
9.403e+01
6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03
2.357e+01
2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01
2.430e-01
3.613e-01 8.758e-02]
[1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01
2.414e-01
1.052e-01 2.597e-01 9.744e-02 4.956e-01 1.156e+00 3.445e+00
2.723e+01
9.110e-03 7.458e-02 5.661e-02 1.867e-02 5.963e-02 9.208e-03
1.491e+01
2.650e+01 9.887e+01 5.677e+02 2.098e-01 8.663e-01 6.869e-01
2.575e-01
6.638e-01 1.730e-01]
[2.029e+01 1.434e+01 1.351e+02 1.297e+03 1.003e-01 1.328e-01
1.980e-01
1.043e-01 1.809e-01 5.883e-02 7.572e-01 7.813e-01 5.438e+00
9.444e+01
1.149e-02 2.461e-02 5.688e-02 1.885e-02 1.756e-02 5.115e-03
2.254e+01
1.667e+01 1.522e+02 1.575e+03 1.374e-01 2.050e-01 4.000e-01
1.625e-01
2.364e-01 7.678e-02]]
And the target set:
# print the cancer labels (0:malignant, 1:benign)
print(cancer.target)
Output:
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
1 1 1 1 1 1 0 1 0 1 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0
1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 0 0 0 0 0 0 1]
Step Three - Splitting the Data
The best way to understand how the model performs is to split the dataset
into a training set and a test set. We do this with a function called
train_test_split(), which takes three parameters – the features, the target,
and the test set size. You can also select records randomly using random_state:
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data,
    cancer.target, test_size=0.3, random_state=109)  # 70% training and 30% test
Step Four - Generating the Model
Now we can build our SVM. First, we import the svm module and
then pass the kernel – here, the linear kernel – as an argument to the SVC()
function to create our support vector classifier object.
Then, we use fit() to fit our model on the training set and predict() to make
the predictions on the test set.
#Import svm model
from sklearn import svm
#Create the SVM Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model on the training set
clf.fit(X_train, y_train)
#Predict the response for the test dataset
y_pred = clf.predict(X_test)
Step Five - Evaluating the Model
The final step is to estimate the model's accuracy in predicting breast cancer.
To do this, we compare the actual values from the test set with the predicted
values:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.9649122807017544
The classification accuracy is a very good 96.49%. However, we can
take things further by checking the model's precision and recall:
# Model Precision: what percentage of positive tuples are labeled correctly?
print("Precision:",metrics.precision_score(y_test, y_pred))
# Model Recall: what percentage of positive tuples are labeled correctly?
print("Recall:",metrics.recall_score(y_test, y_pred))
Output:
Precision: 0.9811320754716981
Recall: 0.9629629629629629
This time, we got even better scores – precision is 98%, and recall is 96%.
Tuning the Hyperparameters
The kernel's primary function is to transform the dataset's input data into
the required form. There are three common kernel types – linear, polynomial,
and RBF (radial basis function). RBF and polynomial kernels are best for
non-linear hyperplanes, both computing the separation lines in higher
dimensions. Some applications need this more complex transformation to
separate non-linear or curved classes, which can lead to highly accurate
classifiers:
Regularization - the penalty parameter in Scikit-Learn is C, representing
the error or misclassification term. It determines the maximum amount of
error acceptable to the SVM optimization and is how the trade-off is
controlled between the misclassification term and the decision
boundary. A small C value tolerates more misclassification and results
in a larger-margin hyperplane, while a large value penalizes errors
heavily and results in a smaller-margin hyperplane.
Gamma – a low Gamma value fits the training set loosely, while a
high value fits it exactly, which risks over-fitting. In simple terms, a
low Gamma value lets even distant points influence the separation line,
while a high value means only the points nearest the line are considered.
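Here is a brief sketch of passing these hyperparameters to SVC(); the C and
gamma values below are arbitrary illustrations, not tuned values for the
cancer dataset:
from sklearn import svm
# Create an RBF-kernel SVM with explicit C and gamma (illustrative values)
clf_rbf = svm.SVC(kernel='rbf', C=1.0, gamma=0.1)
clf_rbf.fit(X_train, y_train)
print("RBF accuracy:", clf_rbf.score(X_test, y_test))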
Advantages
SVM classifiers have better accuracy and are faster at prediction than
the Naïve Bayes algorithm
They are memory-efficient because only a subset of the training points
(the support vectors) is used in the decision function.
SVMs work very well in high-dimensional spaces and when there is a
clear margin of separation.
Disadvantages
SVM classifiers don't work well with large datasets because they take
too long to train, more so than Naïve Bayes.
They also don't work very well with overlapping classes.
They are sensitive to the kernel type used.
In the next chapter, we will look at using decision trees.
Chapter 6: Using Decision Trees
Let's say you are a marketing manager who wants to find the customers most
likely to buy your products. Decision trees help you find your audience and
make the best use of your marketing budget. Or you could be a loans
manager who needs to weed out the risky applications to get a better loan
default rate. Using a decision tree, you can classify the applicants into two
groups – potential clients and non-potential clients, or safe applications and
risky applications – giving you a better chance of making the right decisions.
Classification requires two steps – the learning step and the prediction step.
In the first step, we use existing training data to develop the model, while in
the second step, we use the model to make predictions on given data. The
decision tree is a popular classification algorithm and certainly one of the
easiest to understand. It isn't restricted to classification problems, either; you
can also use it for regression problems.
Decision Tree Algorithm
Decision trees are similar to flowcharts: a structure whose internal nodes
represent attributes or features, whose branches represent decision rules,
and whose leaf nodes represent the outcomes. The top node in the tree is the
root node, and the tree learns to partition the data recursively based on
attribute values. This structure helps you make better decisions. The results
are visualized in a diagram similar to a flowchart that mimics human-level
thinking, making decision trees incredibly easy to interpret.
Decision trees are described as "white box" algorithms because they share
their internal decision-making logic, something you won't find in neural
networks and other "black box" algorithms. They also train faster than neural
networks, with a time complexity that is a function of the number of
attributes and records in the data. Decision trees are classified as
non-parametric or distribution-free, which means they do not depend on
assumptions about probability distributions. They also produce good accuracy
with high-dimensional data.
How Decision Trees Work
The idea behind the algorithm is:
Use ASM (Attribute Selection Measures) to choose the best attribute to
split the records
Transform the attribute into a decision node and split the dataset into
subsets
Repeat the process recursively for every child node to build the tree
until one of the following conditions is matched:
Every tuple is part of the same attribute value
There aren't any more instances
There aren't any more attributes
Attribute Selection Measure
This is a heuristic for choosing the splitting criterion that partitions the
data most efficiently. It is also called "splitting rules" because it helps
us work out the breakpoints for tuples at a specified node. ASM gives each
attribute or feature a rank based on the dataset, and the attribute with the
best score is selected as the splitting attribute. Where continuous-valued
attributes are concerned, we also need to define split points for the branches.
Some of the more popular methods of measure selection are Information
Gain, Gain Ratio, and Gini Index – let's look at these in more detail.
Information Gain
Claude Shannon developed the concept of entropy in a 1948 paper, "A
Mathematical Theory of Communication." In mathematics and physics,
entropy is the impurity or randomness in a system, while in information
theory, it is the impurity present in a set of examples. Information gain
indicates a decrease in entropy and computes the differences between entropy
before being split, and the average entropy after the database has been split
based on the values of specific attributes.
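In standard notation, entropy is defined as:

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)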
In this equation, p_i indicates the probability that an arbitrary tuple from
D belongs to class Ci.
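The information still needed after splitting on attribute A, and the
resulting gain, are:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Gain(A) = Info(D) - Info_A(D)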
In these equations, Info(D) is the average amount of information needed to
identify the class label of a tuple in D, and |Dj|/|D| acts as the weight of
the jth partition. InfoA(D) is the information required to classify a tuple
from D after partitioning on A. The attribute A with the highest Gain(A) –
the most information gain – is picked as the splitting attribute at node N.
Gain Ratio
Information gain is somewhat biased toward attributes with many potential
outcomes, which means an attribute with many distinct values is preferred.
For example, an attribute acting as a unique identifier (customer_ID, say)
produces a pure partition for every record, which drives InfoA(D) to zero and
maximizes information gain while creating nothing more than useless
partitioning.
C4.5 improves on ID3 by using an extension of information gain called the
gain ratio. This uses Split Info to normalize the information gain, thus
taking care of the bias issue.
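Split Info is defined as:

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)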
In this equation, |Dj|/|D| acts as the weight of the jth partition, while v
is the number of discrete values of attribute A.
We can define the gain ratio as:
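GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}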
The attribute with the highest gain ratio is picked as the splitting attribute.
Gini Index
CART (Classification and Regression Tree) is a decision tree algorithm that
uses Gini to create the split points:
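Gini(D) = 1 - \sum_{i=1}^{m} p_i^2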
In this equation, p_i indicates the probability that a tuple from D belongs
to class Ci.
The Gini Index considers a binary split for each attribute, and we can
compute a weighted sum of each partition's impurity. Where a binary split on
attribute A divides data D into D1 and D2, the Gini Index of D is:
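Gini_A(D) = \frac{|D_1|}{|D|} \, Gini(D_1) + \frac{|D_2|}{|D|} \, Gini(D_2)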
Where we have a discrete-valued attribute, the subset that gives the minimum
Gini Index for the chosen attribute is selected as the splitting attribute.
Where we have continuous-valued attributes, each pair of adjacent values is
considered as a potential split point, and the one with the smallest Gini
Index becomes the actual split point.
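To make these measures concrete, here is a minimal sketch (written for
illustration, not taken from Scikit-Learn) that computes the entropy and
Gini index of an array of class labels:
import numpy as np

def entropy(labels):
    # Info(D) = -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini(D) = 1 - sum(p_i^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

labels = np.array([0, 0, 0, 1, 1, 1, 1, 1])
print(entropy(labels)) # ~0.954
print(gini(labels))    # ~0.469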
Building a Decision Tree Classifier
Let's build our decision tree classifier.
Step One – Import the Libraries
The first step is to import the libraries we need:
# Load libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics # Import scikit-learn metrics module for accuracy calculation
Step Two – Load the Data
We are using the Pima Indians Diabetes dataset for this, so we need to import
it using the read_csv() function in Pandas. The dataset can be downloaded from
here.
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
# load dataset
pima = pd.read_csv("pima-indians-diabetes.csv", header=None, names=col_names)
If you want to see what the dataset looks like, execute this:
pima.head()
When you execute this, you will see the first five rows:

   pregnant  glucose  bp  skin  insulin   bmi  pedigree  age  label
0         6      148  72    35        0  33.6     0.627   50      1
1         1       85  66    29        0  26.6     0.351   31      0
2         8      183  64     0        0  23.3     0.672   32      1
3         1       89  66    23       94  28.1     0.167   21      0
4         0      137  40    35      168  43.1     2.288   33      1
Step Three - Feature Selection
Now the columns need to be divided into two variable types – the dependent
variable (the target) and the independent variables (the features).
#split dataset in features and target variable
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Step Four – Splitting the Data
Understanding how the model performs requires dividing the dataset into two
– training and test sets. We'll use a function called train_test_split() to do this,
passing three parameters – features, target, and test_set size.
# Split the dataset into a training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=1) # 70% training and 30% test
Step Five – Building the Model
We'll use Scikit-Learn to build the decision tree classifier model:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier()
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
Step Six - Evaluating the Model
Now we can estimate how accurately the model predicts diabetes. We compute
the accuracy by comparing the actual values from the test set with the
predicted values:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Output:
Accuracy: 0.6753246753246753
We got an accuracy of 67.53%, which is reasonable, but we can improve it by
tuning the algorithm's parameters.
Visualizing Decision Trees
Scikit-learn has a useful function called export_graphviz that lets us
visualize the tree in a Jupyter notebook. You also need the pydotplus and
graphviz packages to plot the tree:
pip install graphviz
pip install pydotplus
export_graphviz converts the classifier into a dot file, which pydotplus
then converts into a PNG, or another form that can be displayed in a
Jupyter notebook.
from sklearn.tree import export_graphviz
from io import StringIO # sklearn.externals.six has been removed from recent Scikit-Learn versions
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
In the resulting chart, each internal node has a decision rule used to split
the data. Gini, the Gini index, measures the node's impurity; a node is pure
when all the records in it share the same class, as in a leaf node.
The resulting tree is unpruned, making it difficult to explain or understand,
so we need to prune and optimize it.
Optimizing Decision Tree Performance
When we use Scikit-Learn, we can only optimize the decision tree classifier
model by pre-pruning it. The maximum depth of the tree can be used as one of
the control variables for this. In the example below, a decision tree is
trained on the same data using max_depth=3.
The pre-pruning parameters are:
criterion – optional (default="gini"), the attribute selection
measure. This parameter lets us choose a different attribute selection
measure; the supported criteria are 'gini' for the Gini index and
'entropy' for information gain.
splitter – a string, optional (default="best"), the split strategy.
Using this parameter, we can decide how each node is split.
max_depth – int or None, optional (default=None), the maximum
depth of the tree. If this is None, nodes are expanded until all the
leaves are pure or until every leaf contains fewer than
min_samples_split samples. A higher maximum depth increases the chance
of overfitting, while a lower one risks underfitting.
Other than using these parameters, we can also use entropy or other attribute
selection measures:
# Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3)
# Train Decision Tree Classifier
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Accuracy: 0.7705627705627706
The classifier produced an accuracy of 77.05%, much better than the previous
model.
Visualizing Decision Trees
from io import StringIO # as before, replacing the removed sklearn.externals.six
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols,
                class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('diabetes.png')
Image(graph.create_png())
The resulting pruned model is less complex, can be explained easily, and is
much easier to understand than the unpruned version.
Pros
Decision trees are simple to visualize and interpret
They can capture non-linear patterns easily
Not so much data preprocessing is required, i.e., columns don't need to
be normalized
We can use decision trees in feature engineering, for example, to
select variables or predict missing values
The algorithm is non-parametric and doesn't make assumptions about
distribution
Cons
Decision trees are highly sensitive to noisy data and can overfit it
Small variance in the data can result in a completely different tree;
however, boosting and bagging algorithms can reduce this
They are biased with imbalanced datasets, so the dataset should be
balanced before the tree is created.
Next, we move on to KNN or K-Nearest Neighbors.
Chapter 7: K-Nearest Neighbors
KNN is a supervised learning algorithm and, in its most basic form, it is easy
to implement, yet it can perform incredibly complex classification. Because
there is no special training phase, it is considered a lazy algorithm;
instead, it uses all the provided data when classifying a new instance or data
point. Like the decision tree, it is a non-parametric algorithm and assumes
nothing about the underlying data. This is one of its most useful features
because real-world data rarely follows uniform distribution, linear
separability, or any other theoretical assumption.
This chapter will dive into KNN and how to use Scikit-Learn to implement it.
We'll start with the theory behind it.
The Theory Behind KNN
KNN is one of the easiest algorithms in the supervised learning family: all
it does is calculate the distance from a data point to all the others. The
distance may be Euclidean, Manhattan, or any other type. Once the distances
are calculated, it chooses the K nearest points, where K is an integer.
Lastly, the data point is assigned to the class that most of those K points
belong to.
For example, let's say we have a dataset containing two variables, and the
algorithm needs to classify a new data point X into one of two classes – Blue
or Red. The data point has coordinates x=45 and y=50, and we'll assume K has
a value of 3. The algorithm's first step is to calculate the distance from X
to all the other points. Next, it finds the three points with the least
distance to X. Lastly, it assigns the new data point to the class that most
of the three nearest points belong to. We'll assume two of those points are
in the red class while the third is in blue, so the new point is classified
as red.
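As a minimal sketch of those three steps, here is a plain-NumPy, hand-rolled
version (the toy coordinates below are made up for illustration):
import numpy as np

# Toy training data: (x, y) coordinates and their classes
points = np.array([[44, 52], [46, 49], [47, 51], [10, 12]])
classes = np.array(['red', 'red', 'blue', 'blue'])
new_point = np.array([45, 50])
K = 3

# Step 1: Euclidean distance from the new point to every training point
distances = np.linalg.norm(points - new_point, axis=1)
# Step 2: indices of the K nearest points
nearest = np.argsort(distances)[:K]
# Step 3: majority vote among the K nearest classes
labels, counts = np.unique(classes[nearest], return_counts=True)
print(labels[np.argmax(counts)]) # 'red' - two red neighbors vs. one blue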
Implementing KNN Algorithm With Scikit-Learn
Now we will look at how to implement KNN using Scikit-Learn and no more
than 20 lines of code.
Step One – Import the Dataset
As always, we need to import the libraries and the dataset first. We're using
the iris dataset, with its four attributes:
sepal-width
sepal-length
petal-width
petal-length
These attributes indicate the specific iris plant types, and our task is to predict
the class the plants belong to. There are three classes in the dataset:
Iris-setosa
Iris-versicolor
Iris-virginica
You can download the dataset here.
Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Importing the Dataset
Now the dataset needs to be imported and loaded into the Pandas DataFrame:
url = "https://archive.ics.uci.edu/ml/machine-learningdatabases/iris/iris.data"
# Assign column names to the dataset
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width',
'Class']
# Read dataset to pandas dataframe
dataset = pd.read_csv(url, names=names)
If you want to see what the dataset looks like, execute this:
dataset.head()
You will see the first five rows of the dataset:

   sepal-length  sepal-width  petal-length  petal-width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Step Two – Data Preprocessing
Next, the dataset needs to be divided into attributes and labels:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
The X variable holds the dataset's first four columns (the attributes), while
y holds the labels.
Step Three – Train Test Split
We don't want to risk overfitting, so the dataset should be split into training
and test sets. This will tell us what the algorithm's performance was during
testing. By splitting the data, the algorithm learns on one set of data and is
tested on previously unseen data, exactly what would happen in a real-world
application.
Here's how to create the two sets:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
This script divides the dataset, putting 80% into a training set and the
remaining 20% in the test set. The dataset contains 150 records, so 120 go for
training and 30 for testing.
Step Four – Feature Scaling
Before the algorithm can make any predictions, we should scale the features
to ensure they are all evaluated uniformly. Why do we need to do this? The
raw data contains a range of widely varying values, and in some algorithms,
the functions cannot work as they should without normalization. For
example, most classifiers use Euclidean distance to calculate the distance
between a pair of points. If one feature has a wide range of values, this
specific feature governs the distance. As such, the range of every feature must
be normalized to ensure each one contributes to the resulting distance
proportionately.
The gradient descent algorithm is commonly used in training neural
networks, among other algorithms, and it will converge much faster when the
features are normalized.
Here’s how to do the feature scaling:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Step Five - Training and Predictions
Training the KNN algorithm to make predictions is straightforward,
particularly when Scikit-Learn is used:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
First, the KNeighborsClassifier class is imported from the sklearn.neighbors
module. Second, the class is initialized with a single parameter,
n_neighbors, which is the value of K. There is no ideal value for K; it is
chosen after the testing and evaluation stages, but 5 is the most commonly
used starting value.
Now we need to use our test data to make predictions:
y_pred = classifier.predict(X_test)
Step Six - Evaluating the Algorithm
A few metrics commonly used in algorithm evaluation are confusion matrix,
precision, recall, and F1 score. We can use two methods from sklearn.metrics
to calculate the metrics – confusion_matrix and classification_report.
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Output:
[[11  0  0]
 [ 0 13  0]
 [ 0  1  6]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      1.00      1.00        13
 Iris-virginica       1.00      1.00      1.00         6

    avg / total       1.00      1.00      1.00        30
These results indicate that KNN classified all 30 test set records with an
accuracy of 100% - it doesn't get better than that. However, while the
algorithm showed excellent accuracy on this dataset, this won't be the case
with every application. KNN isn't known for working well with categorical
features or high-dimensional data.
Compare the Error Rate with K Value
As mentioned earlier, there is no way of knowing upfront which K value will
give us the best results the first time around. We chose 5 arbitrarily, and
luckily it worked, providing 100% accuracy. One way of finding the best K
value is to plot the K values against the dataset's corresponding error rates.
Our next step is plotting the mean error of the test set predictions for K
values between 1 and 40. First, we calculate the mean error for each K:
error = []

# Calculating error for K values between 1 and 40
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error.append(np.mean(pred_i != y_test))
Here, a loop is executed from 1 to 40. Each iteration calculates the predicted
value's mean error, appending the result to the error list.
Next, we need to plot the error values against the values of K:
plt.figure(figsize=(12, 6))
plt.plot(range(1, 40), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
The output graph shows that when K's value is between 5 and 18, the mean
error is zero. Play about with the value of K to see how it affects the
prediction accuracy.
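If you would rather pick K programmatically than read it off the graph, a
quick sketch (reusing the error list calculated above) is:
import numpy as np

# K values start at 1, so add 1 to the index of the smallest mean error
best_k = int(np.argmin(error)) + 1
print("Best K:", best_k)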
Pros
KNN Is very easy to implement.
It is considered lazy learning and doesn't need any training before
making real-time predictions. This means it is faster than SVM, linear
regression, and all the other algorithms that need to be trained first.
KNN only requires two parameters - the value of K and your chosen
distance function (Manhattan, Euclidean, etc.).
Because no training is needed, we can add new data seamlessly.
Cons
KNN isn't effective with high-dimensional data because too many
dimensions make it difficult to calculate the distance in each one.
The bigger the dataset, the higher the prediction cost, because the
distance to every existing data point must be calculated for each
new one.
KNN isn't good with categorical features because it is difficult to
define a meaningful distance between categories.
KNN is one of the simpler algorithms, yet it is incredibly powerful. Training
is one of the hardest parts of any ML algorithm, and KNN doesn't require
this. It has been used in pattern recognition, finding similarities between
documents, preprocessing and dimensionality reduction in computer vision,
especially facial recognition, and developing recommender systems.
In our final chapter, we will look at using unsupervised learning techniques to
find patterns in data.
Chapter 8: Finding Patterns in Data
Unsupervised learning analyzes unlabeled datasets and clusters them using
different machine learning algorithms, which find hidden patterns in the data
without needing a human helping hand. We use unsupervised learning
models for three primary problems:
Clustering – this data mining technique groups unlabeled data based
on differences or similarities. For example, the K-Means clustering
algorithm places similar data points into groups, where K indicates
the number of groups and so the granularity of the grouping. This
works well in image compression, market segmentation, and other
similar tasks.
Association – this unsupervised learning method finds relationships in
a given dataset between the different variables. It does this by using
different rules and is usually used in recommendation engines and
market basket analysis. For example, Netflix recommends movies and
TV series based on what you watched previously.
Dimensionality Reduction – this unsupervised learning technique is
typically used when a given dataset has a high number of dimensions
or features. The data inputs are reduced to a more manageable number
while the data integrity is preserved. This technique is often used in
data preprocessing, for example, when noise is removed from visual
data to improve quality.
The Difference between Supervised and Unsupervised Learning
The primary difference is that supervised learning uses labeled data while
unsupervised learning doesn't. A supervised learning algorithm uses a
training dataset to learn from, making iterative predictions and adjusting to
get the right answer. They are more accurate than unsupervised learning
algorithms, but humans must first label the data. For example, we can use
supervised learning to predict the length of a commute based on weather
conditions, time, etc., but the model has to be trained first to know that bad
weather extends commuting time.
In contrast, unsupervised learning models work alone, without human
intervention, to discover the structure of unlabeled data. They do still
require a certain
amount of human intervention in the output variable validation. For example,
we can use unsupervised learning to determine that some online shoppers buy
groups of products together. However, this would need to be validated by a
data analyst to ensure it is right for a recommendation engine to put baby
clothes in the same group as applesauce, diapers, and sippy cups.
Preparing the Data
We will use the iris dataset to make our predictions. To recap, it has 150
records, four attributes – petal length, petal width, sepal length, and sepal
width – and three classes – setosa, versicolor, and virginica. The four features
will be fed into the algorithm to predict the classes each flower belongs to.
Scikit-Learn is used to load the dataset, and Matplotlib is used for
visualization. The code below shows you how to explore the dataset:
# Importing Modules
from sklearn import datasets
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Available methods on dataset
print(dir(iris_df))
# Features
print(iris_df.feature_names)
# Targets
print(iris_df.target)
# Target Names
print(iris_df.target_names)
Output:
['DESCR', 'data', 'feature_names', 'target', 'target_names']
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['setosa' 'versicolor' 'virginica']
We can now plot two of the features against each other:
label = {0: 'red', 1: 'blue', 2: 'green'}
# Dataset Slicing
x_axis = iris_df.data[:, 0] # Sepal Length
y_axis = iris_df.data[:, 2] # Petal Length
# Plotting
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
The resultant plot will show three colors – violet representing setosa, green
representing versicolor, and yellow representing virginica.
Clustering
Clustering divides the data into groups based on trait similarity. When an
input is provided for prediction, the algorithm uses its features to find the
cluster it belongs to and makes the prediction accordingly.
Let's look at a few types of clustering algorithms.
K-Means Clustering
K-Means is an iterative algorithm that refines its clusters in every
iteration until it settles on a locally optimal solution. To start with, we
choose the number of clusters we want. We know we have three classes for this
example, so we program the algorithm to group the data into three clusters by
passing the "n_clusters" parameter. Three points are initially chosen at
random as cluster centers, each remaining point is assigned to the cluster
with the nearest centroid, and then the centroids for all clusters are
computed again. The process repeats until the assignments stop changing.
A centroid is a group of feature values used to define the groups. We can
interpret the type of group represented by each cluster by examining its
respective centroid.
The K-Means algorithm is imported from Scikit-Learn, the features are fitted,
and the predictions are made:
# Importing Modules
from sklearn import datasets
from sklearn.cluster import KMeans
# Loading dataset
iris_df = datasets.load_iris()
# Declaring Model
model = KMeans(n_clusters=3)
# Fitting Model
model.fit(iris_df.data)
# Predicting a single input
predicted_label = model.predict([[7.2, 3.5, 0.8, 1.6]])
# Prediction on the entire data
all_predictions = model.predict(iris_df.data)
# Printing Predictions
print(predicted_label)
print(all_predictions)
Output:
[0]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 2
2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1 1 2]
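We can also inspect what the algorithm learned: the fitted centroids of the
model above are available through KMeans's cluster_centers_ attribute:
# Each row is one cluster's centroid in the four-feature space
print(model.cluster_centers_)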
Hierarchical Clustering
The hierarchical algorithm builds a hierarchy of clusters, as the name implies.
At the start, each data point is assigned to its own cluster. Then the two
nearest clusters are merged into one, and this repeats until only a single
cluster is left.
Here's an example using a grain dataset, which can be downloaded from here.
# Importing Modules
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
import pandas as pd
# Reading the DataFrame
seeds_df = pd.read_csv(
    "https://raw.githubusercontent.com/vihar/unsupervised-learning-with-python/master/seeds-less-rows.csv")
# Remove the grain species from the DataFrame, save for later
varieties = list(seeds_df.pop('grain_variety'))
# Extract the measurements as a NumPy array
samples = seeds_df.values
"""
Perform hierarchical clustering on samples using the
linkage() function with the method='complete' keyword argument.
Assign the result to mergings.
"""
mergings = linkage(samples, method='complete')
"""
Plot a dendrogram using the dendrogram() function on mergings,
specifying the keyword arguments labels=varieties, leaf_rotation=90,
and leaf_font_size=6.
"""
dendrogram(mergings,
labels=varieties,
leaf_rotation=90,
leaf_font_size=6,
)
plt.show()
The result will be shown as a dendrogram plot.
K-Means vs. Hierarchical Clustering
There are a couple of differences worth mentioning:
K-Means can handle big data efficiently, while hierarchical clustering
cannot. This is because K-Means has linear time complexity, i.e., O(n),
while hierarchical clustering has quadratic time complexity, i.e., O(n²)
K-Means starts with an arbitrary choice of clusters, so the results
will likely differ when the algorithm is run several times. In
hierarchical clustering, the results are reproducible.
K-Means works well with hyperspherical cluster shapes, like a sphere
in 3D or a circle in 2D
K-Means is sensitive to noisy data, while hierarchical clustering can
use a noisy dataset directly.
T-SNE Clustering
One of the best unsupervised learning algorithms for visualization is t-SNE,
otherwise known as t-distributed stochastic neighbor embedding. This
algorithm maps higher-dimensional space into two-dimensional or
three-dimensional space so it can be visualized. More specifically, each
high-dimensional object is modeled by a two-dimensional or three-dimensional
point in such a way that similar objects are modeled by nearby points, while
dissimilar objects are modeled, with high probability, by distant points.
Here's t-SNE implemented on the iris dataset:
# Importing Modules
from sklearn import datasets
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
# Loading dataset
iris_df = datasets.load_iris()
# Defining Model
model = TSNE(learning_rate=100)
# Fitting Model
transformed = model.fit_transform(iris_df.data)
# Plotting 2d t-Sne
x_axis = transformed[:, 0]
y_axis = transformed[:, 1]
plt.scatter(x_axis, y_axis, c=iris_df.target)
plt.show()
The resulting plot has three colors: violet to represent setosa, green to
represent versicolor, and yellow to represent virginica.
There are four features, so the dataset is four-dimensional, and we
transformed it into a two-dimensional figure. You can also apply t-SNE to
datasets with any number of features.
DBSCAN Clustering
DBSCAN, otherwise known as "density-based spatial clustering of
applications with noise," is one of the more popular algorithms used in place
of the K-Means algorithm in predictive analysis tasks. It does not require
the number of clusters as an input, but two parameters need tuning.
Again, we can implement this from Scikit-Learn, and defaults are provided
for the min_samples and eps parameters. However, you do need to tune these.
The min_samples parameter indicates the minimum number of data points
required in a neighborhood to be treated as a cluster, while the eps parameter
indicates the maximum distance between a pair of data points to consider
them in the same neighborhood.
Here is a DBSCAN clustering implementation:
# Importing Modules
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
# Load Dataset
iris = load_iris()
# Declaring Model
dbscan = DBSCAN()
# Fitting
dbscan.fit(iris.data)
# Transforming Using PCA
pca = PCA(n_components=2).fit(iris.data)
pca_2d = pca.transform(iris.data)
# Plot based on Class
for i in range(0, pca_2d.shape[0]):
    if dbscan.labels_[i] == 0:
        c1 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='r', marker='+')
    elif dbscan.labels_[i] == 1:
        c2 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='g', marker='o')
    elif dbscan.labels_[i] == -1:
        c3 = plt.scatter(pca_2d[i, 0], pca_2d[i, 1], c='b', marker='*')
plt.legend([c1, c2, c3], ['Cluster 1', 'Cluster 2', 'Noise'])
plt.title('DBSCAN finds 2 clusters and Noise')
plt.show()
The output is a plot visualizing the data.
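In current Scikit-Learn versions, the defaults are eps=0.5 and min_samples=5,
and they rarely suit every dataset. As a quick sketch, reusing the DBSCAN
import and iris data from above (the values below are illustrative, not tuned
for the iris data), tuned parameters are passed like this:
# Illustrative values only; eps and min_samples should be tuned for your data
dbscan_tuned = DBSCAN(eps=0.8, min_samples=10)
dbscan_tuned.fit(iris.data)
print(set(dbscan_tuned.labels_)) # -1 marks the points DBSCAN treats as noise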
Conclusion
First, I want to thank you for taking the time to read my beginner's guide to
machine learning with Python.
Machine learning is a huge subject, and it wouldn't have been possible to
cover everything in this guide. Instead, I chose the basic topics that anyone
who wants to get into machine learning should learn.
As well as learning what machine learning is and how it all came about, you
also learned some of the basic algorithms and tasks involved. You learned:
The different types of machine learning
How linear regression works
What the different types of machine learning classification are
What SVM is
How decision trees work for classification
What KNN is
How to use unsupervised learning to detect patterns in data
Once you understand the topics covered in this guide, you will be ready to
take the next step and compound your learning with more complex subjects.
There are plenty of guides available and lots of courses on the internet – sign
up and improve your skills.
Thank you, once again, for choosing my guide. I hope it helped you, and you
now have a better understanding of machine learning.
References
"5 Types of Regression Analysis and When to Use Them | Appier." 2021. Appier. January 15, 2021.
https://www.appier.com/blog/5-types-of-regression-analysis-and-when-to-use-them/.
"A Complete Guide to Understand Classification in Machine Learning." 2021. Analytics Vidhya.
September 9, 2021. https://www.analyticsvidhya.com/blog/2021/09/a-complete-guide-to-understandclassification-in-machine-learning/.
Avinash Navlani. 2018a. "Decision Tree Classification in Python." DataCamp Community. 2018.
https://www.datacamp.com/community/tutorials/decision-tree-classification-python.
———. 2018b. "Support Vector Machines in Scikit-Learn." DataCamp Community. 2018.
https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python.
"Decision Tree." 2017. GeeksforGeeks. October 16, 2017. https://www.geeksforgeeks.org/decisiontree/?ref=lbp.
"How to Use Unsupervised Learning with Python to Find Patterns in Data." 2019. Built In. 2019.
https://builtin.com/data-science/unsupervised-learning-python.
"Machine Learning Tutorial | Machine Learning with Python - Javatpoint." n.d. Www.javatpoint.com.
https://www.javatpoint.com/machine-learning.
Python, Real. n.d. "The K-Nearest Neighbors (KNN) Algorithm in Python – Real Python."
Realpython.com. Accessed January 15, 2022. https://realpython.com/knn-python/.
Real Python. 2019. "Linear Regression in Python." Realpython.com. Real Python. April 15, 2019.
https://realpython.com/linear-regression-in-python/.
Robinson, Scott. 2018. "K-Nearest Neighbors Algorithm in Python and Scikit-Learn." Stack Abuse.
Stack Abuse. February 15, 2018. https://stackabuse.com/k-nearest-neighbors-algorithm-in-python-andscikit-learn/.
Wakefield, Katrina. 2019. "A Guide to Machine Learning Algorithms and Their Applications."
Sas.com. 2019. https://www.sas.com/en_gb/insights/articles/analytics/machine-learningalgorithms.html.
"What Is Machine Learning: Definition, Types, Applications and Examples." 2019. Potentia Analytics.
December 19, 2019. https://www.potentiaco.com/what-is-machine-learning-definition-typesapplications-and-examples/#:~:text=These%20are%20three%20types%20of.