
NEC ML UNIT-I Final

UNIT-I
Learning a class from examples, Vapnik-Chervonenkis (VC) dimension, Probably Approximately Correct (PAC) learning, Noise, Learning multiple classes.
Regression: Simple linear regression, multiple linear regression, model selection and generalization,
Dimensions of supervised Machine learning algorithm, Bayesian classification.
……………………………………………………………………………………………………………………………..
Need For Machine Learning
Ever since the technological revolution, we've been generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day! It was estimated that by 2020, 1.7 MB of data would be created every second for every person on earth.
With the availability of so much data, it is finally possible to build predictive models that can study and analyze
complex data to find useful insights and deliver more accurate results.
Top Tier companies such as Netflix and Amazon build such Machine Learning models by using tons of data in
order to identify profitable opportunities and avoid unwanted risks.
Here’s a list of reasons why Machine Learning is so important:
• Increase in Data Generation: Due to the excessive production of data, we need a method that can be used to structure, analyze and draw useful insights from data. This is where Machine Learning comes in. It uses data to solve problems and find solutions to the most complex tasks faced by organizations.
• Improve Decision Making: By making use of various algorithms, Machine Learning can be used to make better business decisions. For example, Machine Learning is used to forecast sales, predict downfalls in the stock market, identify risks and anomalies, etc.
• Uncover patterns & trends in data: Finding hidden patterns and extracting key insights from data is the most essential part of Machine Learning. By building predictive models and using statistical techniques, Machine Learning allows you to dig beneath the surface and explore the data at a minute scale. Understanding data and extracting patterns manually would take days, whereas Machine Learning algorithms can perform such computations in less than a second.
• Solve complex problems: From detecting the genes linked to the deadly ALS disease to building self-driving cars, Machine Learning can be used to solve the most complex problems.
To give you a better understanding of how important Machine Learning is, let’s list down a couple of Machine
Learning Applications:
• Netflix's Recommendation Engine: The core of Netflix is its famous recommendation engine. Over 75% of what you watch is recommended by Netflix, and these recommendations are made by implementing Machine Learning.
• Facebook's Auto-tagging feature: The logic behind Facebook's DeepFace face verification system is Machine Learning and Neural Networks. DeepFace studies the facial features in an image to tag your friends and family.
• Amazon's Alexa: Alexa, which is based on Natural Language Processing and Machine Learning, is an advanced Virtual Assistant that does more than just play songs on your playlist. It can book you an Uber, connect with the other IoT devices at home, track your health, etc.
• Google's Spam Filter: Gmail makes use of Machine Learning to filter out spam messages. It uses Machine Learning algorithms and Natural Language Processing to analyze emails in real time and classify them as either spam or not spam.
Introduction To Machine Learning
The term Machine Learning was first coined by Arthur Samuel in the year 1959. Looking back, that year was
probably the most significant in terms of technological advancements.
If you browse through the net about ‘what is Machine Learning’, you’ll get at least 100 different definitions.
However, the very first formal definition was given by Tom M. Mitchell:
“A computer program is said to learn from experience E with respect to some class of tasks T and performance
measure P if its performance at tasks in T, as measured by P, improves with experience E.”
In simple terms, Machine Learning is a subset of Artificial Intelligence (AI) which provides machines the ability to learn automatically and improve from experience without being explicitly programmed to do so. In essence, it is the practice of getting machines to solve problems by gaining the ability to think.
But wait, can a machine think or make decisions? Well, if you feed a machine a good amount of data, it will
learn how to interpret, process and analyze this data by using Machine Learning Algorithms, in order to solve
real-world problems.
Machine Learning Definitions
Algorithm: A Machine Learning algorithm is a set of rules and statistical techniques used to learn patterns from
data and draw significant information from it. It is the logic behind a Machine Learning model. An example of
a Machine Learning algorithm is the Linear Regression algorithm.
Model: A model is the main component of Machine Learning. A model is trained by using a Machine Learning
Algorithm. An algorithm maps all the decisions that a model is supposed to take based on the given input, in
order to get the correct output.
Predictor Variable: It is a feature(s) of the data that can be used to predict the output.
Response Variable: It is the feature or the output variable that needs to be predicted by using the predictor
variable(s).
Training Data: The Machine Learning model is built using the training data. The training data helps the model
to identify key trends and patterns essential to predict the output.
Testing Data: After the model is trained, it must be tested to evaluate how accurately it can predict an outcome.
This is done by the testing data set.
To sum it up: a Machine Learning process begins by feeding the machine lots of data; by using this data, the machine is trained to detect hidden insights and trends. These insights are then used to build a Machine Learning model by using an algorithm in order to solve a problem.
The next topic is the Machine Learning process.
Machine Learning Process
The Machine Learning process involves building a Predictive model that can be used to find a solution for a
Problem Statement. To understand the Machine Learning process let’s assume that you have been given a
problem that needs to be solved by using Machine Learning.
The problem is to predict the occurrence of rain in your local area by using Machine Learning.
The below steps are followed in a Machine Learning process:
Step 1: Define the objective of the Problem Statement
At this step, we must understand what exactly needs to be predicted. In our case, the objective is to predict
the possibility of rain by studying weather conditions. At this stage, it is also essential to take mental notes on
what kind of data can be used to solve this problem or the type of approach you must follow to get to the
solution.
Step 2: Data Gathering
At this stage, you must be asking questions such as:
• What kind of data is needed to solve this problem?
• Is the data available?
• How can I get the data?
Once you know the type of data that is required, you must understand how you can derive this data. Data collection can be done manually or by web scraping. However, if you're a beginner and you're just looking to learn Machine Learning, you don't have to worry about getting the data: there are thousands of data resources on the web, and you can just download a data set and get going.
Coming back to the problem at hand, the data needed for weather forecasting includes measures such as
humidity level, temperature, pressure, locality, whether or not you live in a hill station, etc. Such data must be
collected and stored for analysis.
Step 3: Data Preparation
The data you collected is almost never in the right format. You will encounter a lot of inconsistencies in the
data set such as missing values, redundant variables, duplicate values, etc. Removing such inconsistencies is very
essential because they might lead to wrongful computations and predictions. Therefore, at this stage, you scan
the data set for any inconsistencies and you fix them then and there.
Step 4: Exploratory Data Analysis
Grab your detective glasses because this stage is all about diving deep into data and finding all the hidden data
mysteries. EDA or Exploratory Data Analysis is the brainstorming stage of Machine Learning. Data Exploration
involves understanding the patterns and trends in the data. At this stage, all the useful insights are drawn and
correlations between the variables are understood.
For example, in the case of predicting rainfall, we know that there is a strong possibility of rain if the
temperature has fallen low. Such correlations must be understood and mapped at this stage.
Step 5: Building a Machine Learning Model
All the insights and patterns derived during Data Exploration are used to build the Machine Learning Model.
This stage always begins by splitting the data set into two parts, training data, and testing data. The training
data will be used to build and analyze the model. The logic of the model is based on the Machine Learning
Algorithm that is being implemented.
In the case of predicting rainfall, since the output will be in the form of True (if it will rain tomorrow) or False
(no rain tomorrow), we can use a Classification Algorithm such as Logistic Regression.
Choosing the right algorithm depends on the type of problem you’re trying to solve, the data set and the level
of complexity of the problem. In the upcoming sections, we will discuss the different types of problems that
can be solved by using Machine Learning.
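As a rough illustration of this step (an illustration, not part of the original text), the sketch below assumes a hypothetical file 'weather.csv' with numeric columns humidity, temperature and pressure, and a 0/1 label column rain:

# A minimal sketch of Step 5 for the rain-prediction example.
# 'weather.csv' and its column names are assumptions for illustration only.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

data = pd.read_csv('weather.csv')
X = data[['humidity', 'temperature', 'pressure']].values
y = data['rain'].values

# Split the data set into training data and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train a classification model (Logistic Regression) on the training data
classifier = LogisticRegression()
classifier.fit(X_train, y_train)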
Step 6: Model Evaluation & Optimization
After building a model by using the training data set, it is finally time to put the model to a test. The testing
data set is used to check the efficiency of the model and how accurately it can predict the outcome. Once the
accuracy is calculated, any further improvements in the model can be implemented at this stage. Methods like
parameter tuning and cross-validation can be used to improve the performance of the model.
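Continuing the hypothetical rain example sketched above, the following shows what this evaluation and optimization step might look like (an illustration, not from the original text):

# Evaluation and optimization, continuing the sketch from Step 5 above
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Accuracy of the trained classifier on the unseen test data
print("Test accuracy:", accuracy_score(y_test, classifier.predict(X_test)))

# Cross-validation: average accuracy over 5 different train/validation splits
print("CV accuracy:", cross_val_score(LogisticRegression(), X_train, y_train, cv=5).mean())

# Parameter tuning: search over the regularization strength C
grid = GridSearchCV(LogisticRegression(), param_grid={'C': [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)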
Step 7: Predictions
Once the model is evaluated and improved, it is finally used to make predictions. The final output can be a
Categorical variable (eg. True or False) or it can be a Continuous Quantity (eg. the predicted value of a stock).
In our case, for predicting the occurrence of rainfall, the output will be a categorical variable.
So that was the entire Machine Learning process. Now it’s time to learn about the different ways in which
Machines can learn.
Machine Learning Types
A machine can learn to solve a problem by following any one of the following three approaches. These are
the ways in which a machine can learn:
1. Supervised Learning
2. Unsupervised Learning
3. Reinforcement Learning
Supervised Learning
Supervised learning is a technique in which we teach or train the machine using data which is well labeled.
To understand Supervised Learning let’s consider an analogy. As kids we all needed guidance to solve math
problems. Our teachers helped us understand what addition is and how it is done. Similarly, you can think of
supervised learning as a type of Machine Learning that involves a guide. The labeled data set is the teacher
that will train you to understand patterns in the data. The labeled data set is nothing but the training data set.
As an example, suppose we feed the machine images of Tom and Jerry, and the goal is for the machine to identify and classify the images into two groups (Tom images and Jerry images). The training data set that is fed to the model is labeled, as in, we're telling the machine, 'this is how Tom looks and this is Jerry'. By doing so, you're training the machine using labeled data. In Supervised Learning, there is a well-defined training phase done with the help of labeled data.
Unsupervised Learning
Unsupervised learning involves training by using unlabeled data and allowing the model to act on that
information without guidance.
Think of unsupervised learning as a smart kid that learns without any guidance. In this type of Machine
Learning, the model is not fed with labeled data, as in the model has no clue that ‘this image is Tom and this
is Jerry’, it figures out patterns and the differences between Tom and Jerry on its own by taking in tons of
data.
For example, it identifies prominent features of Tom such as pointy ears, bigger size, etc, to understand that
this image is of type 1. Similarly, it finds such features in Jerry and knows that this image is of type 2. Therefore,
it classifies the images into two different classes without knowing who Tom is or Jerry is.
Reinforcement Learning
Reinforcement Learning is a part of Machine learning where an agent is put in an environment and he learns
to behave in this environment by performing certain actions and observing the rewards which it gets from
those
actions.
This type of Machine Learning is comparatively different. Imagine that you were dropped off at an isolated
island! What would you do?
Panic? Yes, of course, initially we all would. But as time passes by, you will learn how to live on the island.
You will explore the environment, understand the climate condition, the type of food that grows there, the
dangers of the island, etc. This is exactly how Reinforcement Learning works, it involves an Agent (you, stuck
on the island) that is put in an unknown environment (island), where he must learn by observing and
performing actions that result in rewards.
Reinforcement Learning is mainly used in advanced Machine Learning areas such as self-driving cars, AlphaGo, etc.
1. LEARNING A CLASS FROM EXAMPLES
Let us say we want to learn the class, C, of a “family car.” We have a set of examples of cars, and we have a
group of people that we survey to whom we show these cars. The people look at the cars and label them; the
cars that they believe are family cars are positive examples, and the other cars are negative examples.
Class learning is finding a description that is shared by all the positive examples and none of the negative
examples.
Doing this, we can make a prediction: Given a car that we have not seen before, by checking with the
description learned, we will be able to say whether it is a family car or not. Or we can do knowledge
extraction.
After some discussions with experts in the field, let us say that we reach the conclusion that among all features
a car may have, the features that separate a family car from other type of cars are the price and engine power.
These two attributes are the inputs to the class recognizer.
Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as the second
attribute x2 (e.g., engine volume in cubic centimetres). Thus, we represent each car using two numeric values, x = (x1, x2).
Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t (see figure 2.1).
After further discussions with the expert and the analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should be in a certain range:

(p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2)

for suitable values of p1, p2, e1, and e2. Equation 2.4 thus assumes C to be a rectangle in the price-engine power space (see figure 2.2).
Equation 2.4 fixes H, the hypothesis class from which we believe C is drawn, namely, the set of rectangles.
The learning algorithm then finds the particular hypothesis, h ∈ H, specified by a particular quadruple of (p1^h, p2^h, e1^h, e2^h), to approximate C as closely as possible. Though the expert defines the hypothesis class, the values of the parameters are not known; that is, we do not know which particular h ∈ H is equal, or closest, to C. But once we restrict our attention to this hypothesis class, learning the class reduces to the easier problem of finding the four parameters that define h.
The aim is to find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a prediction for an instance x such that

h(x) = 1 if h classifies x as a positive example, and h(x) = 0 if h classifies x as a negative example.

In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x.
Error
The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is

E(h | X) = (1/N) Σ_{t=1}^{N} 1(h(x^t) ≠ r^t)

where 1(a ≠ b) is 1 if a ≠ b and is 0 if a = b (see figure 2.3).
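As a small illustration of this formula (not from the textbook; the rectangle parameters and data below are made-up values), the empirical error of a rectangle hypothesis can be computed as follows:

# Computing E(h | X) = (1/N) * sum_t 1(h(x^t) != r^t) for a rectangle hypothesis
import numpy as np

def h(x, p1, p2, e1, e2):
    # Rectangle hypothesis: positive (1) if price and engine power fall inside the ranges
    price, power = x
    return int(p1 <= price <= p2 and e1 <= power <= e2)

X = np.array([[15000, 1400], [22000, 1800], [45000, 3000], [9000, 1000]])  # (price, engine power)
r = np.array([1, 1, 0, 0])                                                 # required labels r^t

predictions = np.array([h(x, 12000, 30000, 1200, 2200) for x in X])
empirical_error = np.mean(predictions != r)
print(empirical_error)   # fraction of training instances misclassified by h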
Most specific and Most general Hypothesis:
In our example, the hypothesis class H is the set of all possible rectangles. Each quadruple (p1^h, p2^h, e1^h, e2^h) defines one hypothesis, h, from H, and we need to choose the best one; in other words, we need to find the values of these four parameters, given the training set, so as to include all the positive examples and none of the negative examples.
One possibility is to find the most specific hypothesis, S, that is the tightest rectangle that includes all the
positive examples and none of the negative examples (see figure 2.4). This gives us one hypothesis, h = S, as
our induced class.
The most general hypothesis, G, is the largest rectangle we can draw that includes all the positive examples
and none of the negative examples (figure 2.4).
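A minimal sketch (not from the textbook, reusing the made-up data from the earlier snippet) of how S, the tightest rectangle, can be found from the positive examples alone:

# Most specific hypothesis S: the tightest axis-aligned rectangle around the positives
import numpy as np

X = np.array([[15000, 1400], [22000, 1800], [45000, 3000], [9000, 1000]])
r = np.array([1, 1, 0, 0])

positives = X[r == 1]
p1, e1 = positives.min(axis=0)    # lower corners of S
p2, e2 = positives.max(axis=0)    # upper corners of S
print((p1, p2, e1, e2))           # S = (15000, 22000, 1400, 1800)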
False positive and False Negative
• C is the actual class and h is our induced hypothesis.
• The point where C is 1 but h is 0 is a false negative, and
• the point where C is 0 but h is 1 is a false positive.
• Other points, namely true positives and true negatives, are correctly classified.
Version space
• Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent with the training set, and such h make up the version space.
• Given another training set, S, G, the version space, the parameters, and thus the learned hypothesis, h, can be different.
• Actually, depending on X and H, there may be several Si and Gj which respectively make up the S-set and the G-set.
• Every member of the S-set is consistent with all the instances, and there are no consistent hypotheses that are more specific. Similarly, every member of the G-set is consistent with all the instances, and there are no consistent hypotheses that are more general.
• These two make up the boundary sets, and any hypothesis between them is consistent and is part of the version space.
• There is an algorithm called candidate elimination that incrementally updates the S- and G-sets as it sees training instances one by one.
Margin
• The margin is the distance between the boundary and the instances closest to it.
• We choose the hypothesis with the largest margin, for best separation.
• The shaded instances are those that define (or support) the margin.
• Other instances can be removed without affecting h.
2. VAPNIK-CHERVONENKIS DIMENSION
Let us say we have a dataset containing N points. These N points can be labelled in 2^N ways as positive and negative. Therefore, 2^N different learning problems can be defined by N data points.
If for any of these problems, we can find a hypothesis h∈H that separates the positive examples from
the negative, then we say H shatters N points.
That is, any learning problem definable by N examples can be learned with no error by a hypothesis
drawn from H.
The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC)
dimension of H, is denoted as VC(H), and measures the capacity of H.
The VC dimension of a classifier is defined by Vapnik and Chervonenkis to be the cardinality (size) of
the largest set of points that the classification algorithm can shatter.
Shattering is the ability of a model to classify a set of points perfectly. More generally, the model can
create a function that can divide the points into two distinct classes without overlapping
Definition: The Vapnik-Chervonenkis Dimension, VC(H), of hypothesis space H defined over instance space X
is the size of the largest finite subset of X shattered by H. If arbitrarily large finite sets of X can be shattered by
H, then VC(H) = ∞
In figure 2.6, we see that an axis-aligned rectangle can shatter four points in two dimensions. Then VC(H),
when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four.
In calculating the VC dimension, it is enough that we find four points that can be shattered; it is not
necessary that we be able to shatter any four points in two dimensions. For example, four points placed on a
line cannot be shattered by rectangles.
However, we cannot place five points in two dimensions anywhere such that a rectangle can separate the
positive and negative examples for all possible labelings.
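A brute-force sketch (an illustration, not from the textbook) that checks whether axis-aligned rectangles shatter a given point set: for each of the 2^N labelings, the tightest rectangle around the positive points must not contain any negative point.

import itertools
import numpy as np

def shattered_by_rectangles(points):
    # Returns True if axis-aligned rectangles can realize every labeling of the points
    points = np.asarray(points, dtype=float)
    for labels in itertools.product([0, 1], repeat=len(points)):
        labels = np.array(labels)
        pos, neg = points[labels == 1], points[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue   # all-negative (empty rectangle) and all-positive cases are trivial
        lo, hi = pos.min(axis=0), pos.max(axis=0)           # tightest rectangle around positives
        if np.all((neg >= lo) & (neg <= hi), axis=1).any():
            return False                                    # some negative falls inside it
    return True

print(shattered_by_rectangles([(0, 1), (1, 0), (2, 1), (1, 2)]))           # True: these 4 points can be shattered
print(shattered_by_rectangles([(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]))   # False: five points cannot be shattered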
The VC dimension may seem pessimistic. It tells us that, using a rectangle as our hypothesis class, we can learn only datasets containing four points and not more.
A learning algorithm that can learn datasets of four points is not very useful. However, this is because the VC
dimension is independent of the probability distribution from which instances are drawn.
In real life, the world is smoothly changing, instances close by most of the time have the same labels, and we
need not worry about all possible labelings.
There are a lot of datasets containing many more data points than four that are learnable by our hypothesis
class (figure 2.1).
So even hypothesis classes with small VC dimensions are applicable and are preferred over those with large
VC dimensions, for example, a lookup table that has infinite VC dimension.
Example-2: Vapnik-Chervonenkis (VC) Dimension
• VC (Vapnik-Chervonenkis) dimension is a measure of the capacity or complexity of a space of functions
that can be learned by a classification algorithm (more specifically, hypothesis).
• The basic definition of VC dimension is the capacity of a classification algorithm, and is defined as the
maximum cardinality of the points that the algorithm is able to shatter.
Linear Classifier with two data points
• Consider a binary classifier with a positive class 'A' and a negative class 'B', and two data points. The possible labelings of the data points number 2^N; in our case 2² = 4, i.e. (++, +−, −+, −−).
• In all of these cases, a linear classifier can separate the positive and negative data points.
Linear Classifier with three data points
• Binary classification with three data points (in 2D space).
• The 3 points can take either class A (+) or class B (−), which gives us 2³ = 8 possible combinations (or learning problems).
• A line can shatter 3 points (in general position).
Linear Classifier with four data points
• Now, for the case of 4 points, we can have a maximum of 2⁴ = 16 possible combinations.
• As the figure shows, there are labelings (such as the one where diagonally opposite points share a class) that a line is unable to separate.
• So, we can say that a linear classifier can shatter at most 3 points.
Rectangle Classifier
With a set of four data points, the rectangle classifier can shatter all possible labelings.
• Given such 4 points, we assign them the {+, −} labels in all possible ways.
• For each labeling there must exist a rectangle which produces such an assignment, i.e. such a classification.
• Our classifier: examples inside the rectangle are positive and examples outside are negative, respectively.
• Given 4 points (in general position), we have the following assignments:
a) All points are "+" ⇒ use a rectangle that includes them all.
b) All points are "−" ⇒ use an empty rectangle.
c) 3 points "−" and 1 point "+" ⇒ use a small rectangle centered on the "+" point.
d) 3 points "+" and one "−" ⇒ we can always find a rectangle which excludes the "−" point.
e) 2 points "+" and 2 points "−" ⇒ we can define a rectangle which includes the 2 "+" and excludes the 2 "−".
Rectangle Classifier with five data points
For any 5-point set, we can define a rectangle which has the outermost points as vertices.
• If we assign the "+" label to such vertices and the "−" label to the internal point, there will not be any rectangle which reproduces such an assignment.
Vapnik-Chervonenkis dimension (VC dim).
• A dataset contains N points.
• These N points can be labeled in 2^N ways as positive and negative.
• If for every such labeling there is a hypothesis h ∈ H that separates the positive examples from the negative, then we say H shatters the N points.
• The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, is denoted as VC(H), and measures the capacity of H.
Text Book Example: A rectangle can shatter four points.
• An axis-aligned rectangle can shatter four points in two dimensions.
• Therefore VC(H), when H is the hypothesis class of axis-aligned rectangles in two dimensions, is four.
• A rectangle can separate the positive and negative examples for all possible labelings of these four points.
• Only the rectangles covering two points, out of all possible labelings, are shown in the diagram.
3. PROBABLY APPROXIMATELY CORRECT (PAC) LEARNING
In computational learning theory, probably approximately correct (PAC) learning is a framework for
mathematical analysis of machine learning. It was proposed in 1984 by Leslie Valiant.
In this framework, the learner receives samples and must select a generalization function (called the
hypothesis) from a certain class of possible functions.
The goal is that, with high probability (the "probably" part), the selected function will have low
generalization error (the "approximately correct" part).
The learner must be able to learn the concept given any arbitrary approximation ratio, probability of success,
or distribution of the samples.
The PAC model belongs to the class of learning models characterized by learning from examples. In these models, if f is the target function to be learned, the learner is provided with some random examples (these examples may come from some probability distribution over the input space) in the form of (X, f(X)), where X is a binary input instance, say of length n, and f(X) is the value (boolean TRUE or FALSE) of the target function at that instance. Based on these examples, the learner must succeed in deducing the target function f, which we can now express as f : {0,1}^n → {0,1}.
ε gives an upper bound on the error in accuracy with which h approximated f and δ gives the probability of
failure in achieving this accuracy.
Using both these quantities, we can express the definition of a PAC Algorithm with more mathematical
clarity.
Consequently, we can say that, to qualify for PAC Learnability, the learner must find with probability of at
least 1 − δ, a concept h such that the error between h and f is at most ε
Formal Definition of PAC-Learnable
PAC Learnability
• We would like to find an h such that error_D(h) = 0. This is not possible because (1) unless every possible instance of X is in the training set, there might be multiple hypotheses consistent with the training data, and (2) there is a small chance that the training examples will be misleading.
• Therefore, we will require that error_D(h) < ε.
• We will also require that the probability of failure on a sequence of randomly drawn training examples be bounded by δ.
Example:
Using the tightest rectangle, S, as our hypothesis, we
would like to find how many examples we need.
We would like our hypothesis to be approximately
correct, namely, that the error probability be bounded
by some value.
We also would like to be confident in our hypothesis
in that we want to know that our hypothesis will be
correct most of the time (if not always); so we want
to be probably correct as well (by a probability we
can specify).
In probably approximately correct (PAC) learning, given a class, C, and examples drawn from some unknown but fixed probability distribution, p(x), we want to find the number of examples, N, such that with probability at least 1 − δ, the hypothesis h has error at most ε, for arbitrary δ ≤ 1/2 and ε > 0:

P{C Δ h ≤ ε} ≥ 1 − δ

where C Δ h is the region of difference between C and h.
• In our case, because S is the tightest possible rectangle, the error region between C and h = S is the sum of four rectangular strips (see figure 2.7).
• We would like to make sure that the probability of a positive example falling in here (and causing an error) is at most ε.
• For any of these strips, if we can guarantee that the probability is upper bounded by ε/4, the error is at most 4(ε/4) = ε.
• The probability that a randomly drawn example misses one strip is 1 − ε/4.
• The probability that all N independent draws miss the strip is (1 − ε/4)^N, and
• the probability that all N independent draws miss any of the four strips is at most 4(1 − ε/4)^N, which we would like to be at most δ. We have the inequality

4(1 − ε/4)^N ≤ δ

Using the inequality (1 − x) ≤ e^(−x), we can write 4 e^(−Nε/4) ≤ δ, which rearranges to N ≥ (4/ε) log(4/δ).

Therefore, provided that we take at least (4/ε) log(4/δ) independent examples from C and use the tightest rectangle as our hypothesis h, with confidence probability at least 1 − δ, a given point will be misclassified with error probability at most ε.
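A quick worked example of this bound (the ε and δ values below are chosen purely for illustration):

# Evaluating the PAC bound N >= (4/epsilon) * log(4/delta)
# (log here is the natural logarithm, as in the derivation above)
import math

epsilon, delta = 0.05, 0.05                 # want error <= 5% with probability >= 95%
N = (4 / epsilon) * math.log(4 / delta)
print(math.ceil(N))                         # about 351 examples suffice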
Issues of PAC Learnability
The computational limitation also imposes a polynomial constraint on the training set size, since a learner
can process at most polynomial data in polynomial time.
• How to prove PAC learnability:
– First prove sample complexity of learning C using H is polynomial.
– Second prove that the learner can train on a polynomial-sized data set in polynomial time.
• To be PAC-learnable, there must be a hypothesis in H with arbitrarily small error for every concept in C; generally C ⊆ H.
4. NOISE
Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to learn and
zero error may be infeasible with a simple hypothesis class (see figure 2.8). There are several interpretations
of noise:
• There may be imprecision in recording the input attributes, which may shift the data points in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the label of an instance. Such attributes may be hidden or latent in that they may be unobservable. The effect of these neglected attributes is thus modelled as a random component and is included in "noise."
As can be seen in figure 2.8, when there is noise, there is not a simple boundary between the positive and
negative instances and to separate them, one needs a complicated hypothesis that corresponds to a
hypothesis class with larger capacity.
Using the simple rectangle (unless its training error is much bigger) makes more sense because of the
following:
1. It is a simple model to use. It is easy to check whether a point is inside or outside a rectangle and we can
easily check, for a future data instance, whether it is a positive or a negative instance.
2. It is a simple model to train and has fewer parameters. It is easier to find the corner values of a rectangle
than the control points of an arbitrary shape. With a small training set when the training instances differ a
little bit, we expect the simpler model to change less than a complex model.
3. It is a simple model to explain. A rectangle simply corresponds to defining intervals on the two attributes. By learning a simple model, we can extract information from the raw data given in the training set.
4. If indeed there is mislabeling or noise in the input and the actual class is really a simple model like the rectangle, then the simple rectangle, because it has less variance and is less affected by single instances, will be a better discriminator than the wiggly shape, although the simple one may make slightly more errors on the training set.
Given comparable empirical error, we say that a simple (but not too simple) model would generalize better
than a complex model. This principle is known as Occam’s razor, which states that simpler explanations are
more plausible and any unnecessary complexity should be shaved off.
Occam’s razor argues that the simplest explanation is the one most likely to be correct.
How is Occam’s Razor Relevant in Machine Learning?
Occam’s Razor is one of the principles that guides us when we are trying to select the appropriate model for
a particular machine learning problem. If the model is too simple, it will make useless predictions. If the
model is too complex (loaded with attributes), it will not generalize well.
5. Learning multiple classes
In our example of learning a family car, we have positive examples belonging to the class family car and the
negative examples belonging to all other cars. This is a two-class problem. In the general case, we have K
classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one and exactly one of them. The
training set is now of the form

X = {x^t, r^t}, t = 1, . . . , N

where r^t is a K-dimensional label vector with r_i^t = 1 if x^t ∈ Ci and r_i^t = 0 otherwise.

An example is given in figure 2.9 with instances from three classes: family car, sports car, and luxury sedan.
In machine learning for classification, we would like to learn the boundary separating the instances of one
class from the instances of all other classes. Thus, we view a K-class classification problem as K two-class
problems. The training examples belonging to Ci are the positive instances of hypothesis hi and the examples
of all other classes are the negative instances of hi . Thus in a K-class problem, we have K hypotheses to learn
such that

hi(x^t) = 1 if x^t ∈ Ci, and hi(x^t) = 0 if x^t ∈ Cj, j ≠ i.

For a given x, ideally only one of hi(x), i = 1, . . . , K, is 1 and we can choose a class. But when no, or two or more, hi(x) is 1, we cannot choose a class; this is the case of doubt, and the classifier rejects such cases.
In our example of learning a family car, we used only one hypothesis and only modeled the positive
examples. Any negative example outside is not a family car.
Alternatively, sometimes we may prefer to build two hypotheses, one for the positive and the other for the
negative instances.
This assumes a structure also for the negative instances that can be covered by another hypothesis. Separating
family cars from sports cars is such a problem; each class has a structure of its own.
The advantage is that if the input is a luxury sedan, we can have both hypotheses decide negative and reject
the input.
If in a dataset we expect all classes to have a similar distribution (shape in the input space), then the same hypothesis class can be used for all classes.
For example, in a handwritten digit recognition dataset, we would expect all digits to have similar distributions.
But in a medical diagnosis dataset, for example, where we have two classes for sick and healthy people, we
may have completely different distributions for the two classes; there may be multiple ways for a person to
be sick, reflected differently in the inputs: All healthy people are alike; each sick person is sick in his or her own
way.
Second Example:
Let us understand the concept in depth.
1. What is Multi-class Classification?
When we solve a classification problem having only two class labels, then it becomes easy for us to filter the
data, apply any classification algorithm, train the model with filtered data, and predict the outcomes. But
when we have more than two class instances in input train data, then it might get complex to analyze the
data, train the model, and predict relatively accurate results. To handle these multiple class instances, we use
multi-class classification.
Multi-class classification is the classification technique that allows us to categorize the test data into multiple
class labels present in trained data as a model prediction.
There are mainly two types of multi-class classification techniques:
• One vs. All (one-vs-rest)
• One vs. One
2. Binary classification vs. multi-class classification
Binary Classification
• Only two class instances are present in the dataset.
• It requires only one classifier model.
• The confusion matrix is easy to derive and understand.
• Example: checking whether an email is spam or not; predicting gender based on height and weight.
Multi-class Classification
• Multiple class labels are present in the dataset.
• The number of classifier models depends on the classification technique we are applying.
• One vs. All: for N class instances, N binary classifier models.
• One vs. One: for N class instances, N*(N-1)/2 binary classifier models.
• The confusion matrix is easy to derive but complex to understand.
• Example: checking whether a fruit is an apple, banana, or orange.
3. One vs. All (One-vs-Rest)
In One-vs-All classification, for a dataset with N class instances, we have to generate N binary classifier models. The number of class labels present in the dataset and the number of generated binary classifiers must be the same.
As shown in the above image, consider that we have three classes, for example, type 1 for Green, type 2 for Blue, and type 3 for Red.
Now, as mentioned earlier, we have to generate the same number of classifiers as there are class labels in the dataset, so we have to create three classifiers here for the three respective classes.
• Classifier 1: [Green] vs [Red, Blue]
• Classifier 2: [Blue] vs [Green, Red]
• Classifier 3: [Red] vs [Blue, Green]
Now to train these three classifiers, we need to create three training datasets. So let’s consider our primary
dataset is as follows,
Figure 5: Primary Dataset
You can see that there are three class labels Green, Blue, and Red present in the dataset. Now we have to
create a training dataset for each class.
Here, we create the training datasets by putting +1 in the class column for every row whose feature values belong to that particular class. For the rows of the remaining classes, we put -1 in the class column.
Figure 6: Training dataset for Green class
Figure 7: Training dataset for Blue class and Red class
Let’s understand it by an example,
• Consider the primary dataset: in the first row we have the feature values x1, x2, x3, and the corresponding class value is G, which means these feature values belong to the Green class. So we put +1 in the class column for the rows corresponding to the Green type. The same is then applied for the x10, x11, x12 input training data.
• For the rest of the feature rows, which do not correspond to the Green class, we put -1 in their class column.
I hope that you understood the creation of training datasets.
Now, after creating a training dataset for each classifier, we provide it to our classifier model and train the
model by applying an algorithm.
After training the model, when we pass input test data to it, that data is considered as input for all the generated classifiers. If our input test data belongs to a particular class, then the classifier created for that class gives a positive response in the form of +1, and all the other classifier models give a negative response in the form of -1. Similarly, the binary classifier models predict the probability of correspondence with their respective classes.
By analyzing the probability scores, we predict the result as the class index having a maximum probability
score.
• Let's understand with one example, taking three test feature values y1, y2, and y3, respectively.
• We pass the test data to the classifier models. We get a positive rating from the Green class classifier with a probability score of 0.9.
• We also get a positive rating from the Blue class classifier with a probability score of 0.4, along with a negative classification score from the remaining Red classifier.
• Hence, based on the positive responses and the deciding probability scores, we can say that our test input belongs to the Green class.
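A minimal sketch of one-vs-all with scikit-learn; the feature values and class labels below are made up to mirror the Green/Blue/Red example and are not the dataset shown above:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical training data with three features (x1, x2, x3) and labels G, B, R
X = np.array([[1.0, 2.0, 0.5], [0.9, 2.1, 0.4], [3.0, 0.2, 1.5],
              [3.1, 0.1, 1.4], [0.2, 0.3, 3.0], [0.1, 0.4, 3.2]])
y = np.array(['G', 'G', 'B', 'B', 'R', 'R'])

# One binary Logistic Regression classifier is trained per class (three here)
ova = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(ova.predict([[1.0, 1.9, 0.5]]))         # predicted class label
print(ova.predict_proba([[1.0, 1.9, 0.5]]))   # per-class probability scores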
4. One vs. One (OvO)
In One-vs-One classification, for a dataset with N class instances, we have to generate N*(N-1)/2 binary classifier models. Using this classification approach, we split the primary dataset into one dataset for each pair of classes.
Taking the above example, we have a classification problem having three types: Green, Blue, and Red
(N=3).
We divide this problem into N* (N-1)/2 = 3 binary classifier problems:
• Classifier 1: Green vs. Blue
• Classifier 2: Green vs. Red
• Classifier 3: Blue vs. Red
Each binary classifier predicts one class label. When we input the test data to the classifiers, the class that receives the majority of the votes is concluded as the result.
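A minimal sketch of one-vs-one with scikit-learn on the same kind of made-up Green/Blue/Red data (an illustration, not from the original text):

import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0, 2.0, 0.5], [0.9, 2.1, 0.4], [3.0, 0.2, 1.5],
              [3.1, 0.1, 1.4], [0.2, 0.3, 3.0], [0.1, 0.4, 3.2]])
y = np.array(['G', 'G', 'B', 'B', 'R', 'R'])

# N*(N-1)/2 = 3 pairwise binary classifiers are trained; the majority vote decides
ovo = OneVsOneClassifier(LogisticRegression()).fit(X, y)
print(len(ovo.estimators_))              # 3 pairwise classifiers
print(ovo.predict([[0.2, 0.3, 3.1]]))    # predicted class label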
Chapter-II
Regression: Simple linear regression, multiple linear regression, model selection and generalization,
Dimensions of supervised Machine learning algorithm, Bayesian classification.
……………………………………………………………………………………………………………………………..
Regression analysis is a statistical method to model the relationship between a dependent (target) variable and one or more independent (predictor) variables. More specifically, regression analysis helps us to understand how the value of the dependent variable changes corresponding to an independent variable when the other independent variables are held fixed. It predicts continuous/real values such as temperature, age, salary, price, etc.
We can understand the concept of regression analysis using the below example:
Example: Suppose there is a marketing company A which does various advertisements every year and gets sales accordingly. The list below shows the advertisements made by the company in the last 5 years and the corresponding sales.
Now, the company wants to run an advertisement of $200 in the year 2019 and wants to know the prediction about the sales for this year. To solve such prediction problems in machine learning, we need regression analysis.
Regression is a supervised learning technique, which helps in finding the correlation between variables and
enables us to predict the continuous output variable based on the one or more predictor variables. It is mainly
used for prediction, forecasting, time series modeling, and determining the causal-effect relationship between
variables.
In Regression, we plot a graph between the variables which best fits the given data points; using this plot, the machine learning model can make predictions about the data. In simple words, "Regression shows a line or curve that passes through the data points on the target-predictor graph in such a way that the vertical distance between the data points and the regression line is minimum." The distance between the data points and the line tells whether the model has captured a strong relationship or not.
Some examples of regression can be as:
• Prediction of rain using temperature and other factors
• Determining market trends
• Prediction of road accidents due to rash driving
Terminologies Related to the Regression Analysis:
• Dependent Variable: The main factor in regression analysis which we want to predict or understand is called the dependent variable. It is also called the target variable.
• Independent Variable: The factors which affect the dependent variable, or which are used to predict the values of the dependent variable, are called independent variables, also called predictors.
• Outliers: An outlier is an observation which contains either a very low value or a very high value in comparison to the other observed values. An outlier may hamper the result, so it should be avoided.
• Multicollinearity: If the independent variables are highly correlated with each other, then this condition is called multicollinearity. It should not be present in the dataset, because it creates problems when ranking the most important variables.
• Underfitting and Overfitting: If our algorithm works well with the training dataset but not with the test dataset, then the problem is called overfitting. And if our algorithm does not perform well even with the training dataset, then the problem is called underfitting.
Why do we use Regression Analysis?
As mentioned above, regression analysis helps in the prediction of a continuous variable. There are various scenarios in the real world where we need future predictions, such as weather conditions, sales, marketing trends, etc. For such cases we need a technique that can make predictions accurately, and regression analysis is such a statistical method, used in machine learning and data science. Below are some other reasons for using regression analysis:
• Regression estimates the relationship between the target and the independent variables.
• It is used to find trends in data.
• It helps to predict real/continuous values.
• By performing regression, we can confidently determine the most important factor, the least important factor, and how each factor affects the other factors.
Types of Regression Algorithms
There are various types of regressions which are used in data science and machine learning. Each type has its
own importance on different scenarios, but at the core, all the regression methods analyze the effect of the
independent variable on dependent variables. Here we are discussing some important types of regression
which are given below:
1. Linear regression
Linear Regression is an ML algorithm used for
supervised learning. Linear regression performs the task of predicting a dependent variable (target) based on the given independent variable(s). So, this
regression technique finds out a linear relationship
between a dependent variable and the other given
independent variables. Hence, the name of this
algorithm is Linear Regression.
In the figure above, on X-axis is the independent
variable and on Y-axis is the output. The regression
line is the best fit line for a model. And our main
objective in this algorithm is to find this best fit line.
Pros:
• Linear Regression is simple to implement.
• Less complexity compared to other algorithms.
• Linear Regression may lead to over-fitting, but it can be avoided using some dimensionality reduction techniques, regularization techniques, and cross-validation.
Cons:
• Outliers affect this algorithm badly.
• It over-simplifies real-world problems by assuming a linear relationship among the variables, hence not recommended for practical use-cases.
2. Decision Tree
Decision tree models can be applied to data that contains numerical features and categorical features. Decision trees are good at capturing non-linear interactions between the features and the target variable. Decision trees somewhat match human-level thinking, so it is very intuitive to understand the data.
For example, if we are predicting how many hours a kid plays in particular weather, then the decision tree might look somewhat like the one shown in the image above.
So, in short, a decision tree is a tree where each node represents a feature, each branch represents a decision, and each leaf represents an outcome (a numerical value in the case of regression).
Pros:
• Easy to understand and interpret, visually intuitive.
• It can work with numerical and categorical features.
• Requires little data preprocessing: no need for one-hot encoding, dummy variables, etc.
Cons:
• It tends to overfit.
• A small change in the data tends to cause a big difference in the tree structure, which causes instability.
3. Support Vector Regression
You must have heard about SVM, i.e., the Support Vector Machine. SVR uses the same idea as SVM, but here it tries to predict real values. This algorithm uses hyperplanes to segregate the data. In case this separation is not possible, it uses the kernel trick, where the dimension is increased and the data points then become separable by a hyperplane.
In the figure above, the blue line is the hyperplane and the red lines are the boundary lines. All the data points are within the boundary lines. The main objective of SVR is to consider the points that are within the boundary lines.
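A minimal sketch of SVR with scikit-learn on toy data; the kernel and parameter values below are just illustrative choices, not prescribed by the original text:

import numpy as np
from sklearn.svm import SVR

X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(40)    # noisy non-linear target

# epsilon sets the width of the boundary (tube) around the regression curve;
# points outside it become support vectors
model = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print(model.predict([[2.5]]))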
Pros:
• Robust to outliers.
• Excellent generalization capability.
• High prediction accuracy.
Cons:
• Not suitable for large datasets.
• They do not perform very well when the data set has more noise.
4. Lasso Regression
• LASSO stands for Least Absolute Shrinkage and Selection Operator. Shrinkage is basically defined as a constraint on attributes or parameters.
• The algorithm operates by finding and applying a constraint on the model attributes that causes the regression coefficients for some variables to shrink toward zero.
• Variables with a regression coefficient of zero are excluded from the model.
• So, lasso regression analysis is basically a shrinkage and variable selection method, and it helps to determine which of the predictors are most important (see the short sketch at the end of this subsection).
Pros:
• It avoids overfitting.
Cons:
• LASSO will select only one feature from a group of correlated features.
• The selected features can be highly biased.
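A minimal sketch of the shrinkage behaviour described above, using scikit-learn on made-up data where only two of five features actually matter:

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.randn(100)   # only the first two features matter

lasso = Lasso(alpha=0.1).fit(X, y)
print(lasso.coef_)   # coefficients of the three irrelevant features shrink to (near) zero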
5. Random Forest Regressor
Random Forests are an ensemble(combination) of decision trees. It is a Supervised Learning algorithm used for
classification and regression. The input data is passed through multiple decision trees. It executes by constructing
a different number of decision trees at training time and outputting the class that is the mode of the classes (for
classification) or mean prediction (for regression) of the individual trees.
Pros:
• Good at learning complex and non-linear relationships.
• Very easy to interpret and understand.
Cons:
• They are prone to overfitting.
• Using larger random forest ensembles to achieve higher performance slows down their speed, and they also need more memory.
Linear Regression in Machine Learning
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a statistical method
that is used for predictive analysis. Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent (y) variable and one or more independent (x) variables, hence it is called linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
The linear regression model provides a sloped straight line representing the relationship between the variables.
Consider the below image:
Mathematically, we can represent a linear regression as:
y= a0+a1x+ ε
Here,
Y= Dependent Variable (Target Variable)
X= Independent Variable (predictor Variable)
a0= intercept of the line (Gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor to each input value).
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.
Types of Linear Regression
Linear regression can be further divided into two types of the algorithm:
• Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
• Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
Linear Regression Line
A straight line showing the relationship between the dependent and independent variables is called a regression line. A regression line can show two types of relationship:
• Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is termed a positive linear relationship.
• Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, then such a relationship is called a negative linear relationship.
Assumptions of Linear Regression
Below are some important assumptions of Linear Regression. These are formal checks to make while building a Linear Regression model, which ensure that we get the best possible result from the given dataset.
• Linear relationship between the features and target: Linear regression assumes a linear relationship between the dependent and independent variables.
• Small or no multicollinearity between the features: Multicollinearity means high correlation between the independent variables. Due to multicollinearity, it may be difficult to find the true relationship between the predictors and the target variable; in other words, it is difficult to determine which predictor variable is affecting the target variable and which is not. So, the model assumes either little or no multicollinearity between the features or independent variables (a quick check is sketched after this list).
• Homoscedasticity assumption: Homoscedasticity is a situation where the error term is the same for all values of the independent variables. With homoscedasticity, there should be no clear pattern in the distribution of the data in the scatter plot.
• Normal distribution of error terms: Linear regression assumes that the error terms follow the normal distribution. If the error terms are not normally distributed, then confidence intervals will become either too wide or too narrow, which may cause difficulties in finding coefficients. This can be checked using a Q-Q plot: if the plot shows a straight line without any deviation, the errors are normally distributed.
• No autocorrelation: The linear regression model assumes no autocorrelation in the error terms. If there is any correlation in the error terms, it will drastically reduce the accuracy of the model. Autocorrelation usually occurs if there is a dependency between the residual errors.
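As a quick way to check the multicollinearity assumption above, one can inspect the correlation matrix of the predictors; the data frame below uses made-up columns purely for illustration:

import pandas as pd

# Hypothetical predictor columns; in practice, use your own feature columns
df = pd.DataFrame({'experience': [1, 2, 3, 4, 5],
                   'age':        [22, 24, 25, 28, 30],
                   'test_score': [55, 60, 58, 70, 65]})
print(df.corr())   # correlations near +/-1 between predictors indicate multicollinearity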
1. Simple linear regression
Simple Linear Regression is a type of regression algorithm that models the relationship between a dependent variable and a single independent variable. The relationship shown by a Simple Linear Regression model is linear (a sloped straight line), hence it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent variable must be a continuous/real value. However, the independent variable can be measured on continuous or categorical values.
The Simple Linear Regression algorithm has mainly two objectives:
• Model the relationship between the two variables, such as the relationship between income and expenditure, experience and salary, etc.
• Forecast new observations, such as weather forecasting according to temperature, the revenue of a company according to the investments in a year, etc.
Simple Linear Regression Model:
The Simple Linear Regression model can be represented using the below equation:
y= a0+a1x+ ε
Where,
a0 = the intercept of the regression line (can be obtained by putting x = 0)
a1 = the slope of the regression line, which tells whether the line is increasing or decreasing
ε = the error term (for a good model it will be negligible)
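The coefficients a1 and a0 can be estimated from data by least squares; a small sketch with made-up experience/salary values (an illustration, not from the original text):

# Least-squares estimates for simple linear regression:
#   a1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))**2),   a0 = mean(y) - a1 * mean(x)
import numpy as np

x = np.array([1.1, 2.0, 3.2, 4.5, 5.1])             # years of experience (made up)
y = np.array([39000, 43000, 60000, 61000, 66000])   # salary (made up)

a1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a0 = y.mean() - a1 * x.mean()
print(a0, a1)    # intercept and slope of the best fit line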
Implementation of Simple Linear Regression Algorithm using Python
Problem Statement example for Simple Linear Regression:
Here we are taking a dataset that has two variables: salary (dependent variable) and experience (independent variable). The goals of this problem are:
• To find out whether there is any correlation between these two variables.
• To find the best fit line for the dataset.
• To see how the dependent variable changes as the independent variable changes.
Here, we will create a Simple Linear Regression model to find out the best fitting line for representing the
relationship between these two variables.
To implement the Simple Linear regression model in machine learning using Python, we need to follow the
below steps:
Step-1: Data Pre-processing
The first step for creating the Simple Linear Regression model is data pre-processing. We have already done it
earlier in this tutorial. But there will be some changes, which are given in the below steps:
a) First, we will import the three important libraries, which will help us for loading the dataset, plotting the
graphs, and creating the Simple Linear Regression model.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
b) Next, we will load the dataset into our code. After that, we need to extract the dependent and independent
variables from the given dataset. The independent variable is years of experience, and the dependent variable
is salary.
# Importing the dataset
dataset = pd.read_csv('Salary_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values
In the above lines of code, for the x variable we have used -1, since we want to drop the last column from the dataset. For the y variable we have used 1 as the index, since we want to extract the second column and indexing starts from zero.
c) Next, we will split both variables into the test set and training set. We have 30 observations, so we will take
20 observations for the training set and 10 observations for the test set. We are splitting our dataset so that
we can train our model using a training dataset and then test the model using a test dataset.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)
Step-2: Fitting the Simple Linear Regression to the Training Set:
Now the second step is to fit our model to the training dataset. To do so, we import the LinearRegression
class from scikit-learn's linear_model module. After importing the class, we create an object of it named
regressor.
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
In the above code, we used the fit() method to fit our Simple Linear Regression object to the training set.
We passed X_train and y_train, the training data for the independent and dependent variables respectively.
Fitting the regressor to the training set lets the model learn the correlation between the predictor and target
variables.
Step-3: Prediction of test set results:
Our model is now trained on the dependent (salary) and independent (experience) variables, so it is ready to
predict outputs for new observations. In this step, we feed the test dataset (new observations) to the model
and check whether it predicts the correct output.
We create a prediction vector y_pred, which will contain the model's predictions for the test dataset.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
Step-4: Visualizing the Training set results:
Now we visualize the training set results. To do so, we use the scatter() function of the pyplot library, which
we already imported in the pre-processing step. The scatter() function creates a scatter plot of the observations.
On the x-axis we plot the employees' years of experience and on the y-axis their salary. We pass the real values
of the training set, X_train and y_train, along with a colour for the observations; here we use red, but it can be
any colour.
Next, we plot the regression line using the plot() function of the pyplot library, passing the years of experience
of the training set, the predicted salaries for the training set (regressor.predict(X_train)), and a colour for the
line.
Then we give the plot a title using the title() function, passing "Salary vs Experience (Training set)".
After that, we label the axes using the xlabel() and ylabel() functions.
# Visualising the Training set results
plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
In the resulting plot, the real observations appear as red dots and the predicted values lie on the blue
regression line. The regression line shows the correlation between the dependent and independent variables.
The quality of the fit can be judged from the differences between the actual and predicted values; since most
observations lie close to the regression line, the model fits the training set well.
Step-5: Visualizing the Test set results:
In the previous step, we visualized the model's performance on the training set. Now we do the same for the
test set. The code is the same as before, except that we use X_test and y_test for the scatter plot; the regression
line itself is unchanged, since it was fitted on the training data.
# Visualising the Test set results
plt.scatter(X_test, y_test, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()
2. Multiple linear regression
Multiple Linear Regression is one of the important regression algorithms which models the linear relationship
between a single dependent continuous variable and more than one independent variable.
Example:
Prediction of CO2 emission based on engine size and number of cylinders in a car.
Some key points about MLR:
• For MLR, the dependent or target variable (Y) must be continuous/real, but the predictor or independent variables may be continuous or categorical.
• Each feature variable must model a linear relationship with the dependent variable.
• MLR tries to fit a regression line through a multidimensional space of data points.
The multiple regression equation explained above takes the following form:
Y = b0 + b1x1 + b2x2 + … + bnxn
Where,
Y = output/response variable,
b0, b1, b2, …, bn = coefficients of the model (b0 is the intercept),
x1, x2, …, xn = the independent/feature variables.
Assumptions for Multiple Linear Regression:
• A linear relationship should exist between the target and the predictor variables.
• The regression residuals must be normally distributed.
• MLR assumes little or no multicollinearity (correlation between the independent variables) in the data; a quick check is sketched below.
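One simple way to check the multicollinearity assumption is to look at the pairwise correlation matrix of the predictors. The sketch below uses a few hypothetical rows shaped like the start-up dataset used later; in practice X.corr() would be called on the real feature DataFrame:
# Checking multicollinearity via pairwise correlations (hypothetical predictor values)
import pandas as pd
X = pd.DataFrame({
    'RD_Spend':       [165349, 162598, 153442, 144372, 142107],
    'Administration': [136898, 151378, 101146, 118672, 91392],
    'Marketing':      [471784, 443899, 407935, 383200, 366168],
})
print(X.corr())   # correlations near +1 or -1 between predictors signal multicollinearity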
Implementation of Multiple Linear Regression model using Python:
To implement MLR using Python, we have below problem:
Problem Description:
We have a dataset of 50 start-up companies. It contains five fields: R&D Spend, Administration Spend,
Marketing Spend, State, and Profit for a financial year. Our goal is to create a model that predicts a company's
profit and shows which factor affects the profit the most.
Since we need to find the Profit, it is the dependent variable, and the other four variables are independent
variables. Below are the main steps for deploying the MLR model:
1. Data Pre-processing Steps
2. Fitting the MLR model to the training set
3. Predicting the result of the test set
Step-1: Data Pre-processing Step:
The very first step is data pre-processing, which we have already discussed in this tutorial. This process contains the below steps:
• Importing libraries: first, we import the libraries that will help in building the model. Below is the code for it:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
• Importing the dataset: now we import the dataset (50_Startups.csv), which contains all the variables, and extract the dependent and independent variables from it.
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1]
y = dataset.iloc[:, 4]
• Encoding the categorical variable: the State column is categorical, so we convert it into dummy variables (drop_first=True drops one dummy column to avoid the dummy variable trap):
states = pd.get_dummies(X['State'], drop_first=True)
• Dropping the original State column:
X = X.drop('State', axis=1)
• Concatenating the dummy variables back onto X:
X = pd.concat([X, states], axis=1)
• Now we split the dataset into training and test sets.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
Step-2: Fitting the MLR model to the Training set:
Now that the dataset is prepared, we fit the regression model to the training set, just as we did for the
Simple Linear Regression model.
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Step-3: Prediction of Test set results:
The last step is checking the performance of the model. We do this by predicting the test set results, stored
in a y_pred vector.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
Finally, we check the results by scoring the model on both the training and test sets. The score returned by
LinearRegression is the R² coefficient of determination; for this dataset it is about 0.95 on the training data
and 0.93 on the test data, which indicates a good fit.
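A sketch of the scoring calls, using the regressor and the data splits defined above:
# score() of LinearRegression returns R^2 (coefficient of determination), not a classification accuracy
print('Train score:', regressor.score(X_train, y_train))
print('Test score: ', regressor.score(X_test, y_test))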
Applications of Multiple Linear Regression:
There are mainly two applications of Multiple Linear Regression:
• Measuring the effectiveness of the independent variables on the prediction.
• Predicting the impact of changes in the independent variables.
3. Model Selection and Generalization
Let us assume that our model is learning a Boolean function from examples. In a Boolean function, all
inputs and the output are binary. There are 2^d possible ways to write d binary values and therefore, with d
inputs, the training set has at most 2^d examples.
Each distinct training example removes half of the remaining hypotheses, namely, those whose guesses are wrong.
We start with all possible hypotheses and, as we see more training examples, we remove those hypotheses
that are not consistent with the training data. In the case of a Boolean function, to end up with a single
hypothesis we need to see all 2^d training examples.
If the training set we are given contains only a small subset of all possible instances, as it generally does (that
is, if we know what the output should be for only a small percentage of the cases), the solution is not unique.
After seeing N example cases, there remain 2^(2^d - N) possible functions. This is an example of an ill-posed
problem, where the data by itself is not sufficient to find a unique solution.
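The counting argument above can be checked by brute force for a small d. The sketch below (with d = 2 and two hypothetical training examples) enumerates all 2^(2^d) Boolean functions and keeps only those consistent with the examples; 2^(2^d - N) of them remain:
# Enumerate all Boolean functions of d inputs and count those consistent with N examples
from itertools import product

d = 2
rows = list(product([0, 1], repeat=d))                     # the 2^d possible inputs
all_functions = list(product([0, 1], repeat=len(rows)))    # the 2^(2^d) possible functions

training = [((0, 0), 0), ((0, 1), 1)]                      # N = 2 hypothetical labelled examples

def consistent(f):
    # f assigns one output per input row; check it agrees with every training example
    return all(f[rows.index(x)] == r for x, r in training)

remaining = [f for f in all_functions if consistent(f)]
print(len(all_functions))    # 16 = 2^(2^2)
print(len(remaining))        # 4  = 2^(2^2 - 2)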
So, because learning is ill-posed, and data by itself is not sufficient to find the solution, we should make some
extra assumptions to have a unique solution with the data we have. The set of assumptions we make to have
learning possible is called the inductive bias of the learning algorithm.
Thus, learning is not possible without inductive bias, and now the question is how to choose the right bias.
This is called model selection, which is choosing between possible H.
How well a model trained on the training set predicts the right output for new instances is called
generalization.
Model selection is the process of choosing one of the models as the final model that addresses the problem.
For best generalization, we should match the complexity of the hypothesis class H with the complexity of
the function underlying the data. If H is less complex than the function, we have underfitting, for example,
when trying to fit a line to data sampled from a third-order polynomial.
But if H is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis,
h ∈ H, for example, when fitting two rectangles to data sampled from one rectangle. And if there is noise, an
overcomplex hypothesis may learn not only the underlying function but also the noise in the data and make
a bad fit, for example, when fitting a sixth-order polynomial to noisy data sampled from a third-order
polynomial. This is called overfitting, and it is illustrated in the sketch below.
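The polynomial example can be reproduced with a few lines of NumPy. This is only a sketch with synthetic data: points are sampled from a third-order polynomial plus noise and then fitted with polynomials of order 1, 3, and 6:
# Underfitting vs. overfitting on synthetic data from a third-order polynomial
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 20)
y = 1 + 2 * x - 3 * x ** 2 + 4 * x ** 3 + rng.normal(0, 0.3, size=x.shape)

for order in (1, 3, 6):
    coeffs = np.polyfit(x, y, order)                       # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)  # error on the training points
    print(order, round(train_err, 4))
# Training error keeps dropping as the order grows, but the order-6 fit is modelling
# the noise: its error on fresh samples from the same curve would be larger.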
In all learning algorithms that are trained from example data, there is a trade-off between three factors:
• the complexity of the hypothesis we fit to the data, namely, the capacity of the hypothesis class,
• the amount of training data, and
• the generalization error on new examples.
As the amount of training data increases, the generalization error decreases.
To estimate generalization error, we need data unseen during training. We therefore split the data as:
• Training set (50%)
• Validation set (25%)
• Test (publication) set (25%)
When there is little data, we use resampling (for example, cross-validation) instead of a single fixed split.
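A sketch of such a 50/25/25 split with scikit-learn, assuming a feature matrix X and targets y are already loaded:
# Split data into 50% training, 25% validation, 25% test
from sklearn.model_selection import train_test_split

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
# X_val and X_test are each 25% of the original data; the validation set is used for
# model selection and the test set is kept aside for the final generalization estimate.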
4. Dimensions of a Supervised Machine Learning Algorithm
Let us now recapitulate and generalize. We have a sample
X = {xt, rt}, t = 1, …, N.
The sample is independent and identically distributed (iid);
t indexes one of the N instances,
xt is the arbitrary-dimensional input, and
rt is the associated desired output.
The aim is to build a good and useful approximation to rt using the model g(xt|θ). In doing this, there are
three decisions we must make:
1. The model we use in learning, denoted
g(x|θ)
where g(·) is the model, x is the input, and θ are the parameters.
g(·) defines the hypothesis class H, and a particular value of θ instantiates one hypothesis h ∈ H.
For example, in class learning we took a rectangle as our model, whose four coordinates make up θ.
2. The loss function, L(·), to compute the difference between the desired output, rt, and our approximation
to it, g(xt|θ), given the current value of the parameters θ. The approximation error, or loss, is the sum of
losses over the individual instances:
E(θ|X) = Σt L(rt, g(xt|θ))
For example, in class learning where outputs are 0/1, L(·) checks whether the prediction equals the desired output or not.
3. The optimization procedure to find θ* that minimizes the total error,
θ* = arg minθ E(θ|X),
where arg min returns the argument that minimizes its operand.
For these choices to work well, the following conditions should be satisfied:
1. the hypothesis class of g(·) should be large enough, that is, have enough capacity to contain the unknown function that generated the data,
2. there should be enough training data to allow us to pinpoint the correct (or a good enough) hypothesis from the hypothesis class, and
3. we should have a good optimization method that finds the correct hypothesis given the training data.
The three decisions themselves are illustrated in the sketch below.
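To make the three decisions concrete, the sketch below (synthetic data, a deliberately simple setup) picks a linear model g(x|θ) = θ1·x + θ0, a squared-error loss E(θ|X), and plain gradient descent as the optimization procedure:
# The three decisions: a model g(x|theta), a loss E(theta|X), and an optimizer for theta*
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
r = 3.0 * x + 5.0 + rng.normal(0, 1, size=50)     # desired outputs, with noise

def g(x, theta):                                  # 1. the model (hypothesis class: lines)
    return theta[0] * x + theta[1]

def E(theta):                                     # 2. the loss: sum of squared errors
    return np.sum((r - g(x, theta)) ** 2)

theta = np.zeros(2)                               # 3. the optimizer: gradient descent
lr = 1e-4
for _ in range(5000):
    err = g(x, theta) - r
    grad = np.array([np.sum(2 * err * x), np.sum(2 * err)])
    theta = theta - lr * grad

print(theta)                                      # close to the true values (3.0, 5.0)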
5. Bayesian classification
Bayes' theorem is one of the most popular concepts in machine learning. It is used to calculate the probability
of one event occurring, under uncertain knowledge, given that another related event has already occurred.
Bayes' theorem can be derived using the product rule and the conditional probability of event X with respect
to a known event Y:
• According to the product rule, the probability of X and Y occurring together, in terms of the conditional probability of X given Y, is:
P(X ∩ Y) = P(X|Y) P(Y)        {equation 1}
• Likewise, in terms of the conditional probability of Y given X:
P(X ∩ Y) = P(Y|X) P(X)        {equation 2}
Equating the right-hand sides of the two equations and dividing by P(Y) gives:
P(X|Y) = P(Y|X) P(X) / P(Y)
The above equation is called Bayes' Rule or Bayes' Theorem. Note that X and Y need not be independent;
the theorem simply relates the two conditional probabilities P(X|Y) and P(Y|X).
• P(X|Y) is called the posterior, which is what we need to calculate. It is the updated probability after considering the evidence.
• P(Y|X) is called the likelihood. It is the probability of the evidence given that the hypothesis is true.
• P(X) is called the prior probability, the probability of the hypothesis before considering the evidence.
• P(Y) is called the marginal probability. It is the probability of the evidence under any consideration.
Hence, Bayes' Theorem can be written as:
posterior = likelihood * prior / evidence
Prerequisites for Bayes Theorem
To study Bayes' theorem, we need to understand a few important concepts. These are as follows:
1. Experiment
An experiment is a planned operation carried out under controlled conditions, such as tossing a coin, drawing
a card, or rolling a die.
2. Sample Space
The results we can get from an experiment are called its possible outcomes, and the set of all possible
outcomes is known as the sample space. For example, if we roll a die, the sample space is:
S1 = {1, 2, 3, 4, 5, 6}
Similarly, if our experiment is tossing a coin and recording the outcome, the sample space is:
S2 = {Head, Tail}
3. Event
An event is a subset of the sample space of an experiment; in other words, it is a set of outcomes.
Assume that in our experiment of rolling a die there are two events A and B such that:
A = the event that an even number is obtained = {2, 4, 6}
B = the event that a number greater than 4 is obtained = {5, 6}
• Probability of event A: P(A) = number of favourable outcomes / total number of possible outcomes = 3/6 = 1/2 = 0.5
• Similarly, probability of event B: P(B) = number of favourable outcomes / total number of possible outcomes = 2/6 = 1/3 ≈ 0.33
• Union of events A and B: A ∪ B = {2, 4, 5, 6}
• Intersection of events A and B: A ∩ B = {6}
• Disjoint events: if the intersection of events A and B is the empty set, then A and B are called disjoint, or mutually exclusive, events.
4. Random Variable:
A random variable is a real-valued function that maps the sample space of an experiment onto the real line.
It takes on values, each with some probability. Despite the name, it is neither random nor a variable; it
behaves as a function, and it can be discrete, continuous, or a combination of both.
5. Exhaustive Event:
As the name suggests, a set of events is exhaustive if at least one of them must occur whenever the experiment
is performed.
Thus, two events A and B are exhaustive if either A or B definitely occurs; if they are also mutually exclusive,
exactly one of them occurs at a time, for example, a coin toss results in either a Head or a Tail.
6. Independent Event:
Two events are said to be independent when the occurrence of one event does not affect the occurrence of
the other. In simple words, the outcome of one event does not change the probability of the other.
Mathematically, two events A and B are independent if:
P(A ∩ B) = P(AB) = P(A)*P(B)
7. Conditional Probability:
Conditional probability is defined as the probability of an event A given that another event B has already
occurred (i.e., A conditioned on B). It is denoted P(A|B) and defined as:
P(A|B) = P(A ∩ B) / P(B)
8. Marginal Probability:
Marginal probability is the probability of an event A occurring irrespective of the outcome of any other event B;
in the Bayes setting, it is the probability of the evidence under any consideration. By the law of total probability:
P(A) = P(A|B)*P(B) + P(A|~B)*P(~B)
Here ~B represents the event that B does not occur.
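To make the definitions above concrete, the dice events A and B defined earlier can be checked by simple enumeration:
# Checking the dice probabilities by enumeration
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # even number
B = {5, 6}           # number greater than 4

def P(E):
    return Fraction(len(E), len(sample_space))

print(P(A))                              # 1/2
print(P(B))                              # 1/3
print(P(A & B))                          # 1/6, since the intersection is {6}
print(P(A & B) / P(B))                   # conditional probability P(A|B) = 1/2
not_B = sample_space - B
marginal_A = (P(A & B) / P(B)) * P(B) + (P(A & not_B) / P(not_B)) * P(not_B)
print(marginal_A)                        # law of total probability gives P(A) = 1/2 again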
How to apply Bayes Theorem or Bayes rule in Machine Learning?
Bayes' theorem lets us calculate the single term P(B|A) in terms of P(A|B), P(B), and P(A). This rule is very
helpful in scenarios where we have good estimates of P(A|B), P(B), and P(A) and need to determine the
fourth term, P(B|A).
The Naïve Bayes classifier is one of the simplest applications of Bayes' theorem; it is a fast and simple
classification algorithm that assigns data points to classes.
Let's understand the use of Bayes' theorem in machine learning with the example below.
Suppose we have a feature vector A with i attributes:
A = (A1, A2, A3, …, Ai)
Further, we have n classes represented as C1, C2, C3, …, Cn.
Given these, our machine learning classifier has to choose, for an input A, the most probable class. With the
help of Bayes' theorem, we can write:
P(Ci|A) = [ P(A|Ci) * P(Ci) ] / P(A)
Here,
P(A) is independent of the class: it remains constant across all classes and does not affect which class
maximizes the posterior. Therefore, to maximize P(Ci|A) we only have to maximize the term P(A|Ci) * P(Ci).
If, with n classes, we further assume that every class is equally likely a priori, then:
P(C1) = P(C2) = P(C3) = … = P(Cn),
and we only need to maximize P(A|Ci). The naïve (conditional independence) assumption lets us factor this
likelihood over the individual attributes:
P(A|Ci) = P(A1|Ci) * P(A2|Ci) * … * P(Ai|Ci)
This factorization greatly reduces computation cost and time. This is how Bayes' theorem plays a significant
role in machine learning: the naïve assumption simplifies the conditional probability computations without
greatly affecting precision, letting us express the probability of a complex event in terms of the probabilities
of smaller, simpler events.
2. Naive Bayes learning algorithm
• The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems.
• It is mainly used in text classification, which involves high-dimensional training datasets.
• The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an object belonging to each class.
• Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.
Why is it called Naïve Bayes?
The name Naïve Bayes is made up of two words, Naïve and Bayes, which can be described as:
• Naïve: it is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of the other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
• Bayes: it is called Bayes because it depends on the principle of Bayes' theorem.
Bayes' Theorem:
• Bayes' theorem, also known as Bayes' rule or Bayes' law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.
• The formula for Bayes' theorem is:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this
dataset, we need to decide whether or not to play on a particular day according to the weather conditions.
To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
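Steps 1 and 2 amount to cross-tabulating the data. A sketch with a shortened, hypothetical set of weather records (the full 14-day dataset is used in the worked example below):
# Building frequency and likelihood tables with pandas (hypothetical, shortened records)
import pandas as pd

df = pd.DataFrame({
    'Weather': ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Sunny', 'Overcast', 'Rainy'],
    'Play':    ['No',    'Yes',   'Yes',      'No',    'Yes',   'Yes',      'Yes'],
})

freq = pd.crosstab(df['Weather'], df['Play'])       # step 1: frequency table
likelihood = freq.div(freq.sum(axis=0), axis=1)     # step 2: P(Weather | Play) per class
priors = df['Play'].value_counts(normalize=True)    # P(Play)
print(freq)
print(likelihood)
print(priors)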
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, consider a dataset of 14 days of weather conditions with the corresponding Play decision,
from which we build the frequency and likelihood tables.
Frequency table for the weather condition:
Weather          Play = Yes   Play = No   Total
Sunny            3            2           5
All conditions   10           4           14
Likelihood table for the weather condition:
P(Sunny) = 5/14 ≈ 0.35,  P(Yes) = 10/14 ≈ 0.71,  P(No) = 4/14 ≈ 0.29
P(Sunny|Yes) = 3/10 = 0.3,  P(Sunny|No) = 2/4 = 0.5
Applying Bayes' theorem:
P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35
P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No)= 0.29
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
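The same calculation in Python, using the counts from the frequency table above (the small difference from 0.41 in the text comes from rounding P(No) to 0.29):
# Posterior probabilities for playing on a sunny day, from the table counts
p_yes, p_no, p_sunny = 10 / 14, 4 / 14, 5 / 14
p_sunny_given_yes = 3 / 10
p_sunny_given_no = 2 / 4

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny   # = 0.60
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny      # = 0.40
print(round(p_yes_given_sunny, 2), round(p_no_given_sunny, 2))
print('Play' if p_yes_given_sunny > p_no_given_sunny else "Don't play")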
Advantages of Naïve Bayes Classifier:
• Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
• It can be used for binary as well as multi-class classification.
• It performs well in multi-class predictions compared to many other algorithms.
• It is a popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
• Naïve Bayes assumes that all features are independent or unrelated, so it cannot learn relationships between features.
Applications of Naïve Bayes Classifier:
• It is used for credit scoring.
• It is used in medical data classification.
• It can be used for real-time predictions because the Naïve Bayes classifier is an eager learner.
• It is used in text classification, such as spam filtering and sentiment analysis.
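As an illustration of the text-classification use case, here is a minimal, self-contained sketch of a Naïve Bayes spam filter with scikit-learn; the messages and labels are made up for the example:
# Toy spam filter: bag-of-words features + multinomial Naive Bayes
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "lowest price, buy now",
            "meeting at 10 am tomorrow", "lunch with the project team"]
labels = ["spam", "spam", "ham", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free prize tomorrow"]))   # expected to come out as 'spam' here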