
ML SEM

Unit-1:
1. Write the four main challenges in Machine Learning.
2. Write short notes on AI, ML, and DL.
3. What are the different types of Machine Learning systems?
4. Write a note on training loss vs. testing loss.
5. What are the different tradeoffs in statistical learning? Explain.
6. Write the procedure for estimating the sampling distribution of an estimator.
7. Write about the statistics used in supervised learning and unsupervised learning.
1.
In machine learning, data is analyzed to build or train models. It is now everywhere: from Amazon product recommendations to self-driving cars, it holds great value throughout. As per recent market research, the global machine learning market was expected to grow by about 43% by 2024. This revolution has greatly increased the demand for machine learning professionals; AI and machine learning jobs have seen significant growth (around 75% over the past four years), and the industry continues to grow. A career in the machine learning domain offers job satisfaction, excellent growth, and a high salary, but it is a complex and challenging process.
Machine learning professionals face many challenges when building ML skills and creating an application from scratch. What are these challenges? Below are seven major challenges faced by machine learning professionals.
1. Poor Quality of Data
Data plays a significant role in the machine learning process. One of the
significant issues that machine learning professionals face is the absence of
good quality data. Unclean and noisy data can make the whole process
extremely exhausting. We don’t want our algorithm to make inaccurate or faulty
predictions. Hence, the quality of data is essential for a good output. Therefore, we need to ensure that data preprocessing, which includes removing outliers, handling missing values, and dropping unwanted features, is done carefully.
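For example, a minimal preprocessing sketch in Python with pandas might look like the following; the tiny DataFrame, the column names, and the 1.5*IQR outlier rule are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with a missing value, an extreme outlier, and an unused column.
df = pd.DataFrame({
    "income": [35000, 42000, np.nan, 39000, 1_000_000],
    "age": [25, 32, 41, np.nan, 29],
    "customer_id": [1, 2, 3, 4, 5],
})

df = df.drop(columns=["customer_id"])               # remove an unwanted feature
df["age"] = df["age"].fillna(df["age"].median())    # fill a missing value
df = df.dropna(subset=["income"])                   # drop rows with missing income

# Remove outliers with the 1.5 * IQR rule (this drops the extreme 1,000,000 income).
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]
print(df)
```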
2. Underfitting of Training Data
Underfitting occurs when a model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture the underlying relationship in the data. To overcome this issue:
● Increase the training time of the model
● Increase the complexity of the model
● Add more features to the data
● Reduce the regularization parameters
3. Overfitting of Training Data
Overfitting occurs when a model fits its training data too closely, including the noise and bias in it, which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals: an algorithm trained on noisy or biased data will suffer in overall performance. Let's understand this with the help of an example. Consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger. The training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. Because the data is heavily skewed towards rabbits, there is a considerable probability that the model will identify a cat as a rabbit. In this example we had a vast amount of data, but it was biased, and hence the predictions were negatively affected.
We can tackle this issue by:
● Analyzing the data with the utmost level of care
● Using data augmentation techniques
● Removing outliers from the training set
● Selecting a simpler model with fewer features (see the sketch below)
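One way to see the last remedy in action is to compare a highly complex model with a simpler one on held-out data. The sketch below uses scikit-learn on a synthetic dataset, so the exact accuracies are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some noisy, uninformative features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree can memorize the training set (overfitting) ...
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# ... while a depth-limited (simpler) tree usually generalizes better.
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, model in [("deep tree", deep), ("shallow tree", shallow)]:
    print(name,
          "train acc:", round(model.score(X_train, y_train), 2),
          "test acc:", round(model.score(X_test, y_test), 2))
```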
4. Machine Learning is a Complex Process
The machine learning industry is young and continuously changing, and rapid trial-and-error experimentation is the norm. Because the process keeps transforming, there is a high chance of error, which makes learning it complex. It includes analyzing the data, removing data bias, training the model, applying complex mathematical calculations, and much more. Hence it is a genuinely complicated process, and another big challenge for machine learning professionals.
5. Lack of Training Data
The most important task in the machine learning process is to train the model on enough data to achieve accurate output. Too little training data will produce inaccurate or overly biased predictions. Let us understand this with the help of an example: training a machine learning algorithm is similar to teaching a child. Suppose you decide to explain to a child how to distinguish between an apple and a watermelon. You show him both fruits and point out the differences in their colour, shape, and taste, and soon he learns to tell them apart. A machine learning algorithm, on the other hand, needs a lot of data to make the same distinction; for complex problems it may even require millions of examples. Therefore, we need to ensure that machine learning algorithms are trained with sufficient amounts of data.
6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models can provide accurate results, but training them and producing those results can take a tremendous amount of time. Slow programs, data overload, and excessive resource requirements mean it often takes a long time to obtain accurate results. Further, models require constant monitoring and maintenance to deliver the best output.
7. Imperfections in the Algorithm When Data Grows
Suppose you have found quality data, trained your model well, and the predictions are accurate. But there is a twist: the model may become less useful in the future as the data grows and changes. The best model of the present may become inaccurate in the future and require retraining or rearrangement. So you need regular monitoring and maintenance to keep the algorithm working well. This is one of the most exhausting issues faced by machine learning professionals.
Conclusion: Machine learning is set to bring a major transformation in technology. It is one of the most rapidly growing technologies, used in medical diagnosis, speech recognition, robot training, product recommendations, video surveillance, and much more. This continuously evolving domain offers immense job satisfaction, excellent opportunities, global exposure, and a very high salary. It is a high-risk, high-return technology. Before starting your machine learning journey, ensure that you carefully examine the challenges mentioned above. To learn this technology, you need to plan carefully, stay patient, and maximize your efforts. Once you overcome these challenges, you can conquer the future of work and land your dream job.
2.
Artificial Intelligence (AI):
Artificial Intelligence is basically the mechanism of incorporating human intelligence into machines through a set of rules (algorithms). AI is a combination of two words: "Artificial", meaning something made by humans or non-natural, and "Intelligence", meaning the ability to understand or think. Another definition is that "AI is basically the study of training machines (computers) to mimic a human brain and its thinking capabilities".
AI focuses on three major aspects (skills): learning, reasoning, and self-correction, in order to obtain the maximum efficiency possible.
Machine Learning:
Machine Learning is basically the study/process that enables a system (computer) to learn automatically from experience and improve accordingly, without being explicitly programmed. ML is an application or subset of AI. ML focuses on developing programs that can access data and use it to learn for themselves. The entire process makes observations on the data to identify possible patterns and make better future decisions based on the examples provided. The major aim of ML is to allow systems to learn by themselves from experience, without any kind of human intervention or assistance.
Deep Learning:
Deep Learning is basically a sub-part of the broader family of Machine Learning which makes use of neural networks (similar to the neurons working in our brain) to mimic human brain-like behaviour. DL algorithms focus on information-processing mechanisms to identify patterns, much as the human brain does, and classify the information accordingly. DL works on larger sets of data than typical ML methods, and the prediction mechanism is largely self-administered by the machine.
Below is a brief comparison of Artificial Intelligence, Machine Learning, and Deep Learning, based on the definitions above:
Artificial Intelligence | Machine Learning | Deep Learning
AI is the broad mechanism of incorporating human intelligence into machines through rules (algorithms). | ML is a subset of AI in which systems learn automatically from data and experience without being explicitly programmed. | DL is a subset of ML that uses neural networks to mimic human brain-like behaviour.
AI focuses on learning, reasoning, and self-correction. | ML focuses on learning patterns from data to make better decisions. | DL focuses on information-processing patterns over large datasets, with prediction largely self-administered by the machine.
3.
Types of Machine Learning
Machine learning is a subset of AI, which enables the machine to automatically learn
from data, improve performance from past experiences, and make predictions.
Machine learning contains a set of algorithms that work on a huge amount of data. Data
is fed to these algorithms to train them, and on the basis of training, they build the
model & perform a specific task.
These ML algorithms help to solve different business problems such as regression, classification, forecasting, clustering, and association.
Based on the methods and way of learning, machine learning is divided into mainly four
types, which are:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
In this topic, we will provide a detailed description of the types of Machine Learning
along with their respective algorithms:
1. Supervised Machine Learning
As its name suggests, Supervised machine learning is based on supervision. It
means in the supervised learning technique, we train the machines using the
"labelled" dataset, and based on the training, the machine predicts the output.
Here, the labelled data specifies that some of the inputs are already mapped to the corresponding outputs. More precisely, we first train the machine with the input and corresponding output, and then we ask the machine to predict the output on a test dataset.
Let's understand supervised learning with an example. Suppose we have an
input dataset of cat and dog images. First, we will train the machine to recognize the images using features such as the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), etc.
After completion of training, we input the picture of a cat and ask the machine to
identify the object and predict the output. Now, the machine is well trained, so it
will check all the features of the object, such as height, shape, colour, eyes, ears,
tail, etc., and find that it's a cat. So, it will put it in the Cat category. This is the
process of how the machine identifies the objects in Supervised Learning.
The main goal of the supervised learning technique is to map the input
variable(x) with the output variable(y). Some real-world applications of
supervised learning are Risk Assessment, Fraud Detection, Spam filtering, etc.
Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems,
which are given below:
○ Classification
○ Regression
a) Classification
Classification algorithms are used to solve the classification problems in which
the output variable is categorical, such as "Yes" or No, Male or Female, Red or
Blue, etc. The classification algorithms predict the categories present in the
dataset. Some real-world examples of classification algorithms are Spam
Detection, Email filtering, etc.
Some popular classification algorithms are given below:
○ Random Forest Algorithm
○ Decision Tree Algorithm
○ Logistic Regression Algorithm
○ Support Vector Machine Algorithm
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is continuous and depends on the input variables. These are used to predict continuous output values, such as market trends, weather predictions, etc.
Some popular Regression algorithms are given below:
○ Simple Linear Regression Algorithm
○ Multivariate Regression Algorithm
○ Decision Tree Algorithm
○ Lasso Regression
Advantages and Disadvantages of Supervised Learning
Advantages:
○ Since supervised learning works with a labelled dataset, we can have an exact idea about the classes of objects.
○ These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
○ These algorithms struggle to solve very complex tasks.
○ They may predict the wrong output if the test data is different from the training data.
○ They require a lot of computational time for training.
Applications of Supervised Learning
Some common applications of Supervised Learning are given below:
○ Image Segmentation:
Supervised Learning algorithms are used in image segmentation. In this process,
image classification is performed on different image data with pre-defined labels.
○ Medical Diagnosis:
Supervised algorithms are also used in the medical field for diagnosis. This is done by using medical images and historical data labelled with disease conditions. With such a process, the machine can identify a disease for new patients.
○ Fraud Detection - Supervised Learning classification algorithms are used for
identifying fraud transactions, fraud customers, etc. It is done by using historic
data to identify the patterns that can lead to possible fraud.
○ Spam detection - In spam detection & filtering, classification algorithms are
used. These algorithms classify an email as spam or not spam. The spam emails
are sent to the spam folder.
○ Speech Recognition - Supervised learning algorithms are also used in speech
recognition. The algorithm is trained with voice data, and various identifications
can be done using the same, such as voice-activated passwords, voice
commands, etc.
2. Unsupervised Machine Learning
Unsupervised learning is different from the Supervised learning technique; as its name
suggests, there is no need for supervision. It means, in unsupervised machine learning,
the machine is trained using the unlabeled dataset, and the machine predicts the output
without any supervision.
In unsupervised learning, the models are trained with the data that is neither classified
nor labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and the task of the machine is to find the patterns and categories of the objects.
So the machine will discover its own patterns and differences, such as colour difference and shape difference, and predict the output when it is tested with the test dataset.
Categories of Unsupervised Machine Learning
Unsupervised Learning can be further classified into two types, which are given below:
○ Clustering
○ Association
1) Clustering
The clustering technique is used when we want to find the inherent groups from the
data. It is a way to group the objects into a cluster such that the objects with the most
similarities remain in one group and have fewer or no similarities with the objects of
other groups. An example of the clustering algorithm is grouping the customers by their
purchasing behaviour.
Some of the popular clustering algorithms are given below:
○ K-Means Clustering algorithm
○ Mean-shift algorithm
○ DBSCAN Algorithm
○ Principal Component Analysis
○ Independent Component Analysis
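As a minimal illustration of clustering, the sketch below groups synthetic two-dimensional points with scikit-learn's k-means; the data and the choice of three clusters are assumptions for the example. (PCA and ICA in the list above are usually described as dimensionality-reduction techniques that are often used alongside clustering.)

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points forming three loose groups (illustrative data only).
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Fit k-means with k = 3 and inspect the discovered groups.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print("cluster centers:\n", kmeans.cluster_centers_)
print("labels of first 10 points:", kmeans.labels_[:10])
```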
2) Association
Association rule learning is an unsupervised learning technique, which finds interesting
relations among variables within a large dataset. The main aim of this learning
algorithm is to find the dependency of one data item on another data item and map
those variables accordingly so that it can generate maximum profit. This algorithm is
mainly applied in Market Basket analysis, Web usage mining, continuous production,
etc.
Some popular algorithms of Association rule learning are Apriori Algorithm, Eclat,
FP-growth algorithm.
Advantages and Disadvantages of Unsupervised Learning
Algorithm
Advantages:
○ These algorithms can be used for more complicated tasks compared to supervised ones, because they work on unlabeled data.
○ Unsupervised algorithms are preferable for many tasks, as an unlabeled dataset is easier to obtain than a labelled one.
Disadvantages:
○ The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output in advance.
○ Working with unsupervised learning is more difficult because it uses an unlabelled dataset that does not map to a known output.
Applications of Unsupervised Learning
○ Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
○ Recommendation Systems: Recommendation systems widely use unsupervised
learning techniques for building recommendation applications for different web
applications and e-commerce websites.
○ Anomaly Detection: Anomaly detection is a popular application of unsupervised
learning, which can identify unusual data points within the dataset. It is used to
discover fraudulent transactions.
○ Singular Value Decomposition: Singular Value Decomposition or SVD is used to
extract particular information from the database. For example, extracting
information of each user located at a particular location.
3. Semi-Supervised Learning
Semi-Supervised learning is a type of Machine Learning algorithm that lies between
Supervised and Unsupervised machine learning. It represents the intermediate ground
between Supervised (With Labelled training data) and Unsupervised learning (with no
labelled training data) algorithms and uses the combination of labelled and unlabeled
datasets during the training period.
Although semi-supervised learning is the middle ground between supervised and unsupervised learning, the data it operates on contains only a few labels and consists mostly of unlabeled examples. Labels are costly to obtain, so in practice an organisation may have only a small number of them. This distinguishes semi-supervised learning from supervised and unsupervised learning, which are defined by the complete presence or absence of labels.
The concept of semi-supervised learning was introduced to overcome the drawbacks of supervised and unsupervised learning. Its main aim is to make effective use of all the available data, rather than only the labelled data as in supervised learning. Typically, similar data points are first grouped using an unsupervised (clustering) algorithm, and this grouping is then used to propagate labels to the unlabeled data, because labelled data is comparatively more expensive to acquire than unlabeled data.
We can imagine these algorithms with an example. Supervised learning is like a student studying under the supervision of an instructor at home and at college. If the student analyses the same concepts on their own without any help from the instructor, that corresponds to unsupervised learning. Semi-supervised learning is where the student first studies a concept under the guidance of an instructor at college and then revises it by themselves.
Advantages and disadvantages of Semi-supervised Learning
Advantages:
○ The algorithm is simple and easy to understand.
○ It is highly efficient.
○ It helps overcome the drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
○ The results of successive iterations may not be stable.
○ These algorithms cannot be applied directly to network-level data.
○ Accuracy is comparatively low.
4. Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent gets rewarded for each good action and punished for each bad action; hence, the goal of a reinforcement learning agent is to maximize the rewards.
In reinforcement learning there is no labelled data as in supervised learning; agents learn from their experience only.
The reinforcement learning process is similar to how a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the agent's moves at each step define the states, and the goal of the agent is to obtain a high score. The agent receives feedback in terms of rewards and punishments.
Due to the way it works, reinforcement learning is employed in different fields such as game theory, operations research, information theory, and multi-agent systems.
A reinforcement learning problem can be formalized using Markov Decision
Process(MDP). In MDP, the agent constantly interacts with the environment and
performs actions; at each action, the environment responds and generates a new state.
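A minimal, illustrative sketch of this agent-environment loop is shown below: tabular Q-learning on a tiny made-up three-state chain, where the environment, rewards, and hyperparameters are all assumptions chosen only to show the update rule.

```python
import numpy as np

# A tiny, made-up 3-state chain MDP: action 1 moves right, action 0 moves left.
# The agent receives a reward of +1 whenever it arrives in the rightmost state (2).
n_states, n_actions = 3, 2
def step(state, action):
    next_state = min(state + 1, 2) if action == 1 else max(state - 1, 0)
    reward = 1.0 if next_state == 2 else 0.0
    return next_state, reward

Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount factor, exploration rate
rng = np.random.default_rng(0)

state = 0
for _ in range(2000):
    # Epsilon-greedy: mostly take the currently best action, sometimes explore.
    action = rng.integers(n_actions) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Q-learning update: nudge Q(s, a) towards reward + discounted best future value.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print("Learned Q-table (rows = states, columns = actions):\n", Q.round(2))
```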
Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
○ Positive Reinforcement Learning: Positive reinforcement learning specifies
increasing the tendency that the required behaviour would occur again by adding
something. It enhances the strength of the behaviour of the agent and positively
impacts it.
○ Negative Reinforcement Learning: Negative reinforcement learning works
exactly opposite to the positive RL. It increases the tendency that the specific
behaviour would occur again by avoiding the negative condition.
Real-world Use cases of Reinforcement Learning
○ Video Games:
RL algorithms are very popular in gaming applications, where they are used to achieve super-human performance. Well-known systems that use RL include AlphaGo and AlphaGo Zero.
○ Resource Management:
The paper "Resource Management with Deep Reinforcement Learning" showed how RL can be used to automatically learn to schedule computer resources among waiting jobs in order to minimize the average job slowdown.
○ Robotics:
RL is widely being used in Robotics applications. Robots are used in the industrial
and manufacturing area, and these robots are made more powerful with
reinforcement learning. There are different industries that have their vision of
building intelligent robots using AI and Machine learning technology.
○ Text Mining
Text-mining, one of the great applications of NLP, is now being implemented with
the help of Reinforcement Learning by Salesforce company.
Advantages and Disadvantages of Reinforcement Learning
Advantages
○ It helps in solving complex real-world problems which are difficult to solve with conventional techniques.
○ The learning model of RL is similar to how human beings learn; hence highly accurate results can be obtained.
○ It helps in achieving long-term results.
Disadvantages
○ RL algorithms are not preferred for simple problems.
○ RL algorithms require huge amounts of data and computation.
○ Too much reinforcement can lead to an overload of states, which can weaken the results.
○ The curse of dimensionality limits reinforcement learning for real physical systems.
4.
Introduction: When training a machine learning model, it is crucial to monitor its
performance using various evaluation metrics. Among these metrics, training
loss and testing loss play a fundamental role in assessing the model's learning
progress and generalization capabilities. This note aims to shed light on the
concepts of training loss and testing loss, their differences, and their significance
in the model development process.
Training Loss: During the training phase, a machine learning model is exposed to
a labeled dataset to learn patterns and relationships between the input data and
the desired output. Training loss, also known as the empirical or objective loss,
measures the error or discrepancy between the predicted output of the model
and the actual ground truth labels within the training data. The training loss
quantifies how well the model is fitting the training data and how effectively it is
minimizing the error during the optimization process.
Testing Loss: Once the model is trained, it needs to be evaluated on unseen data
to assess its performance on examples that were not part of the training set.
Testing loss, also called validation loss or generalization loss, measures the
model's performance on this unseen data. It calculates the error or discrepancy
between the predicted outputs and the true labels from the testing dataset. The
testing loss serves as an estimate of how well the model will perform on new,
unseen data and reflects its ability to generalize and make accurate predictions
beyond the training set.
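A minimal sketch of computing both losses with scikit-learn is shown below; the synthetic data and the mean-squared-error loss are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data: y depends on x plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

# Hold out a test set that the model never sees during fitting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# Training loss: error on the data the model was fitted on.
train_loss = mean_squared_error(y_train, model.predict(X_train))
# Testing loss: error on unseen data, an estimate of generalization.
test_loss = mean_squared_error(y_test, model.predict(X_test))
print(f"training loss (MSE): {train_loss:.3f}, testing loss (MSE): {test_loss:.3f}")
```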
Key Differences:
1. Data Usage: Training loss is computed using the training dataset, which
the model has already seen and learned from. Testing loss, on the other
hand, is computed using a separate dataset that was not used during the
training phase and represents real-world scenarios.
2. Purpose: Training loss is primarily used to guide the model's learning
process and optimize its parameters by minimizing the error. Testing loss
provides an evaluation of the model's performance and serves as an
estimate of how it will perform in real-world scenarios.
3. Overfitting Detection: Monitoring both losses together is essential for detecting overfitting. Overfitting occurs when a model becomes excessively complex and starts to memorize the training data, resulting in a low training loss but a high testing loss. A growing gap between training loss and testing loss indicates overfitting, suggesting that the model is not generalizing well to unseen data.
Significance:
1. Model Optimization: Training loss serves as the primary feedback signal
during the model's optimization process, guiding the learning algorithm to
adjust the model's parameters to minimize the error on the training data.
2. Generalization Assessment: Testing loss provides insights into how well
the model is performing on unseen data, indicating its ability to generalize
beyond the training set. It helps evaluate and compare different models or
model configurations.
3. Hyperparameter Tuning: Testing loss is often used to tune the
hyperparameters of a model, such as learning rate or regularization
strength, to improve its generalization performance. By monitoring the
testing loss, one can find the optimal hyperparameter settings that result in
the lowest testing error.
Conclusion: Training loss and testing loss are crucial metrics for assessing the
performance of machine learning models. While training loss indicates how well
a model fits the training data, testing loss provides insights into its generalization
capabilities. Monitoring both loss values allows for detecting overfitting,
optimizing models, and making informed decisions during the development
process. By striking a balance between minimizing training loss and achieving
low testing loss, one can create models that learn effectively and perform well on
new, unseen data.
5.
Tradeoffs in Statistical Learning
In statistical learning, various tradeoffs arise when developing and deploying machine
learning models. These tradeoffs involve factors such as model performance,
complexity, interpretability, and computational resources. Understanding these tradeoffs
is essential for making informed decisions and achieving the desired balance. Here are
some common tradeoffs in statistical learning:
1. Bias-variance tradeoff: The bias-variance tradeoff is about finding the right
balance between capturing underlying patterns in the data and avoiding
overfitting or underfitting. Models with high bias oversimplify the data, resulting
in underfitting, while models with high variance are overly complex and fit noise,
leading to overfitting. Striking the right balance between bias and variance is
crucial for optimal predictive performance.
2. Model complexity vs. interpretability: Increasing model complexity often
improves predictive accuracy by capturing intricate patterns. However, complex
models, such as deep neural networks, may be difficult to interpret. Simpler
models, like linear regression, are more interpretable but may have limited
predictive power. The tradeoff between model complexity and interpretability
depends on the application and the importance of interpretability in
decision-making.
3. Training time vs. model performance: Some models, like deep neural networks,
require substantial computational resources and longer training times to achieve
high performance. Simpler models, such as decision trees or linear models, can
be trained quickly but may have limited predictive capabilities. The tradeoff
between training time and model performance depends on available
computational resources, time constraints, and specific application
requirements.
4. Underfitting vs. overfitting: Underfitting occurs when a model is too simple to
capture underlying patterns, resulting in poor performance on both training and
testing data. Overfitting happens when a model becomes overly complex and fits
noise or random variations, leading to poor performance on unseen data. The
tradeoff involves finding the right level of model complexity that minimizes both
underfitting and overfitting.
5. Feature selection vs. feature dimensionality: Feature selection involves choosing
relevant features for the model's predictive performance. Including irrelevant or
redundant features increases model complexity, training time, and the risk of
overfitting. However, removing informative features may result in a loss of
valuable information. The tradeoff lies in selecting a subset of features that
balances model performance and complexity.
6. Model robustness vs. computational efficiency: Complex models may be more
robust to noise and data variations, but they require increased computational
resources, memory, and time. Simpler models may be computationally efficient
but more sensitive to noise. The tradeoff involves finding a balance between
model robustness and computational efficiency based on available resources
and desired performance level.
Understanding and managing these tradeoffs are essential for developing effective and
efficient statistical learning models. Consider the problem requirements, available data,
computational resources, interpretability needs, and desired performance level to make
informed decisions and strike an appropriate balance.
6.
Estimating the Sampling Distribution of an Estimator
The sampling distribution of an estimator provides insights into the behavior and
variability of the estimator's values when repeatedly sampling from a population.
It allows us to make inferences about the estimator's accuracy and precision.
Here is a procedure for estimating the sampling distribution of an estimator:
1. Define the Estimator: Begin by clearly defining the estimator you want to
study. This could be the sample mean, sample proportion, regression
coefficient, or any other statistic used to estimate a population parameter.
2. Define the Population: Specify the characteristics of the population you are
interested in. This includes identifying the population distribution, any
assumed parameters, and the sampling method.
3. Simulate Sampling: Simulate the process of sampling from the population
to generate multiple samples. The number of samples and the sample size
may vary depending on the desired level of precision.
4. Calculate the Estimator: For each simulated sample, calculate the value of
the estimator. This involves applying the estimator formula to the sample
data. Record the estimator's value for each sample.
5. Repeat Steps 3 and 4: Repeat the process of simulating sampling and
calculating the estimator multiple times. The more repetitions you perform,
the more accurate the estimation of the sampling distribution will be.
6. Analyze the Results: Once you have generated multiple samples and
obtained the corresponding estimator values, analyze the results to
estimate the sampling distribution. Commonly used methods include
calculating descriptive statistics such as the mean, standard deviation, and
confidence intervals of the estimator values.
7. Visualize the Sampling Distribution: To gain a better understanding of the
sampling distribution, plot a histogram or a density plot of the estimator
values. This visual representation can help identify the shape, center, and
spread of the sampling distribution.
8. Interpret the Results: Analyze the estimated sampling distribution to draw
conclusions about the estimator's behavior. Look for characteristics such
as bias, efficiency, consistency, or other relevant properties. Compare the
estimated sampling distribution to theoretical expectations if applicable.
9. Validate and Refine: Assess the validity of the estimated sampling
distribution by comparing it with theoretical properties, if available. If
necessary, refine the simulation process by adjusting the sample size, the
number of repetitions, or other parameters to improve the accuracy and
reliability of the estimation.
By following this procedure, you can estimate the sampling distribution of an
estimator. This provides valuable insights into the behavior and variability of the
estimator's values when sampling from a population. These estimates can guide
decision-making, hypothesis testing, and other statistical inference tasks.
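A minimal simulation sketch of this procedure is shown below, estimating the sampling distribution of the sample mean for an assumed exponential population; the population, sample size, and number of repetitions are choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: estimator = sample mean; population = exponential with mean 2 (assumed).
sample_size = 50
n_repetitions = 5000

# Steps 3-5: repeatedly draw samples and record the estimator's value for each.
estimates = np.array([
    rng.exponential(scale=2.0, size=sample_size).mean()
    for _ in range(n_repetitions)
])

# Step 6: summarize the estimated sampling distribution.
print("mean of estimates:", estimates.mean().round(3))            # close to 2 (unbiased)
print("std of estimates (standard error):", estimates.std().round(3))
print("95% interval:", np.percentile(estimates, [2.5, 97.5]).round(3))
# Step 7 would plot a histogram of `estimates` to inspect its shape.
```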
7.
Statistics in Supervised Learning and Unsupervised Learning
Supervised learning and unsupervised learning are two main categories of machine
learning techniques. Both approaches utilize various statistical methods to analyze and
make inferences from data. Here's an overview of the role of statistics in supervised and
unsupervised learning:
Supervised Learning: In supervised learning, the goal is to learn a mapping between
input data and corresponding output labels based on labeled training examples.
Statistics plays a crucial role in different aspects of supervised learning, including:
1. Descriptive Statistics: Descriptive statistics are used to summarize and
understand the characteristics of the input features and output labels. This
involves calculating measures such as mean, median, variance, and correlation
coefficients to gain insights into the data distribution and relationships.
2. Inferential Statistics: Inferential statistics are employed to make inferences about
the population based on the sample data. Techniques such as hypothesis testing
and confidence intervals can help assess the significance of relationships
between input features and output labels and determine if they are statistically
meaningful.
3. Regression Analysis: Regression analysis is commonly used in supervised
learning to model the relationship between input features and continuous output
variables. Statistical methods like linear regression, polynomial regression, or
more advanced techniques like ridge regression or lasso regression are
employed to estimate the parameters of the regression models.
4. Classification Analysis: Classification analysis is used when the output variable is
categorical. Statistical techniques such as logistic regression, decision trees,
random forests, or support vector machines are employed to build classification
models that can predict the class labels of new, unseen data based on the input
features.
Unsupervised Learning: In unsupervised learning, the goal is to explore and discover
patterns or structures within the data without labeled examples. Statistics plays a vital
role in several aspects of unsupervised learning, including:
1. Clustering Analysis: Clustering algorithms are employed to group similar data
points together based on their feature similarities. Statistical methods like
k-means clustering, hierarchical clustering, or Gaussian mixture models are
utilized to identify clusters and estimate cluster centers and boundaries.
2. Dimensionality Reduction: Dimensionality reduction techniques are used to
reduce the complexity of high-dimensional data by transforming it into a
lower-dimensional representation. Statistical methods such as principal
component analysis (PCA), factor analysis, or t-distributed stochastic neighbor
embedding (t-SNE) are employed to capture the most informative features or
project the data onto a lower-dimensional space.
3. Association Rule Mining: Association rule mining is used to discover interesting
relationships or patterns in transactional or categorical data. Statistical
techniques like Apriori algorithm or frequent itemset mining are applied to
identify associations or dependencies among different items or variables.
4. Outlier Detection: Outlier detection aims to identify unusual or anomalous data
points that deviate significantly from the norm. Statistical methods such as
z-score, Mahalanobis distance, or robust statistical measures like median
absolute deviation (MAD) are utilized to detect outliers based on their deviations
from the expected statistical patterns.
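For instance, a minimal z-score outlier check might look like the sketch below; the data and the threshold are assumptions for the example.

```python
import numpy as np

# Illustrative one-dimensional data with one extreme value.
values = np.array([10.2, 9.8, 10.5, 9.9, 10.1, 10.3, 24.0, 10.0])

z_scores = (values - values.mean()) / values.std()
# Flag points far from the mean; thresholds of 2 or 3 are common (2 is used
# here because the sample is tiny and a single outlier inflates the std).
outliers = values[np.abs(z_scores) > 2.0]
print("z-scores:", z_scores.round(2))
print("detected outliers:", outliers)
```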
In both supervised and unsupervised learning, statistical techniques and principles
provide the foundation for data analysis, model building, and interpretation. They enable
researchers and practitioners to extract meaningful insights, assess the significance of
relationships, and make informed decisions based on the data at hand.
Unit-2:
1. Explain KNN with an example.
2. Explain Naïve Bayes classification with an example.
3. Explain linear and logistic regression with examples.
4. Explain binary classification with an example in machine learning.
5. What is a decision tree? Explain the procedure to construct a decision tree.
6. Write a short note on various distance-based methods of classification/regression.
1.
K-Nearest Neighbor (KNN) Algorithm for Machine Learning
○ K-Nearest Neighbour is one of the simplest Machine Learning algorithms based
on Supervised Learning technique.
○ The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
○ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
○ K-NN algorithm can be used for Regression as well as for Classification but
mostly it is used for the Classification problems.
○ K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
○ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and, at the time of classification, performs an action on it.
○ At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category most similar to the new data.
○ Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to those of the cat and dog images and, based on the most similar features, place it in either the cat or the dog category.
Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1; in which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
○ Step-1: Select the number K of the neighbors
○ Step-2: Calculate the Euclidean distance of K number of neighbors
○ Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
○ Step-4: Among these k neighbors, count the number of the data points in each
category.
○ Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
○ Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
Consider the below image:
○ Firstly, we will choose the number of neighbors, so we will choose the k=5.
○ Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the straight-line distance between two points, which we have already studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as d = sqrt((x2 - x1)^2 + (y2 - y1)^2).
○ By calculating the Euclidean distance we got the nearest neighbors, as three
nearest neighbors in category A and two nearest neighbors in category B.
Consider the below image:
○ As we can see, 3 of the 5 nearest neighbors are from category A; hence this new data point must belong to category A.
Advantages of KNN Algorithm:
○ It is simple to implement.
○ It is robust to the noisy training data
○ It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
○ We always need to determine the value of K, which may sometimes be complex.
○ The computation cost is high because the distance to every training sample must be calculated for each new data point.
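A minimal K-NN sketch with scikit-learn is given below; the two-dimensional toy data and k = 5 are assumptions chosen to mirror the example above.

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: class "A" clusters near the origin, class "B" near (5, 5).
X_train = [[1, 1], [1, 2], [2, 1], [2, 2], [0, 1],
           [5, 5], [6, 5], [5, 6], [6, 6], [7, 5]]
y_train = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]

# Steps 1-2: choose k = 5 and use Euclidean distance (the default metric).
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)          # "training" here just stores the dataset

# Steps 3-5: the majority class among the 5 nearest neighbours is the prediction.
new_point = [[2, 3]]
print("predicted category:", knn.predict(new_point)[0])   # expected: "A"
```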
2.
Naïve Bayes classification is a popular and simple probabilistic machine learning
algorithm used for classification tasks. It is based on Bayes' theorem and assumes that
the features are conditionally independent given the class. Despite its simplicity, Naïve
Bayes can be highly effective, especially in text classification and spam filtering. Here's
an explanation of Naïve Bayes classification with an example:
Bayes' Theorem:
○ Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
○ The formula for Bayes' theorem is given as:
P(A|B) = (P(B|A) * P(A)) / P(B)
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Example: Let's consider a simple example of classifying emails as either "spam" or "not
spam" based on certain words that appear in the email.
1. Training Phase: In the training phase, we collect a labeled dataset consisting of
emails and their corresponding class labels (spam or not spam). We also
preprocess the data by tokenizing the emails into words and removing any
irrelevant information.
Suppose we have the following training data:
Email 1: "Get a free gift!" Class: Spam
Email 2: "Meeting at 3 pm." Class: Not spam
Email 3: "Claim your prize now!" Class: Spam
Email 4: "Reminder: Project deadline tomorrow." Class: Not spam
2. Calculating Class Priors: First, we calculate the prior probabilities of each class
(spam and not spam). The prior probability of a class is the probability of an
email being in that class without considering any features.
Let's assume that in our training data, we have 2 spam emails and 2 not spam emails.
Therefore, the class priors are as follows:
P(Spam) = 2/4 = 0.5 P(Not Spam) = 2/4 = 0.5
3. Building Feature Models: Next, we build feature models by estimating the
likelihood probabilities of each feature (word) given the class. In Naïve Bayes, we
assume that the features (words) are conditionally independent given the class.
This is known as the "naïve" assumption.
For simplicity, let's assume that our feature set consists of only three words: "free,"
"meeting," and "claim."
To calculate the likelihood probabilities, we count the number of occurrences of each word in each class and divide by the total number of words in that class (here, 8 words in the spam emails and 8 in the not-spam emails):
P(free|Spam) = 1/8      P(free|Not Spam) = 0/8 = 0
P(meeting|Spam) = 0/8 = 0      P(meeting|Not Spam) = 1/8
P(claim|Spam) = 1/8      P(claim|Not Spam) = 0/8 = 0
4. Classifying New Emails: Now, let's suppose we have a new email: "Claim your
free gift now!" and we want to classify it as spam or not spam using the Naïve
Bayes classifier.
To classify the new email, we calculate the posterior probability for each class given the
features (words) in the email. The posterior probability is obtained by multiplying the
class prior and the likelihood probabilities of each feature.
For the new email (which contains the feature words "claim" and "free"):
P(Spam|new email) ∝ P(Spam) * P(claim|Spam) * P(free|Spam) = 0.5 * (1/8) * (1/8) ≈ 0.0078
P(Not Spam|new email) ∝ P(Not Spam) * P(claim|Not Spam) * P(free|Not Spam) = 0.5 * 0 * 0 = 0
Since the posterior probability for spam is higher than that for not spam, the Naïve Bayes classifier labels the new email as spam. (In practice, zero likelihoods such as P(claim|Not Spam) = 0 are avoided by applying Laplace smoothing, i.e., adding one to every word count.)
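A minimal sketch of the same idea with scikit-learn's MultinomialNB is shown below; the four-email corpus is taken from the example above, and alpha = 1.0 applies the Laplace smoothing mentioned earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# The four training emails and their labels from the example above.
emails = [
    "Get a free gift!",
    "Meeting at 3 pm.",
    "Claim your prize now!",
    "Reminder: Project deadline tomorrow.",
]
labels = ["spam", "not spam", "spam", "not spam"]

# CountVectorizer builds the word-count features; MultinomialNB applies Bayes' theorem
# with the naive independence assumption (alpha=1.0 adds Laplace smoothing).
model = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
model.fit(emails, labels)

new_email = ["Claim your free gift now!"]
print("prediction:", model.predict(new_email)[0])
print("class probabilities:",
      dict(zip(model.classes_, model.predict_proba(new_email)[0].round(3))))
```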
3.
Linear Regression vs Logistic Regression
Linear Regression and Logistic Regression are two well-known machine learning algorithms that come under the supervised learning technique. Since both algorithms are supervised in nature, they use labelled datasets to make predictions. The main difference between them is how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems. A description of both algorithms is given below, along with a difference table.
Linear Regression:
○ Linear Regression is one of the simplest machine learning algorithms. It comes under the supervised learning technique and is used for solving regression problems.
○ It is used for predicting a continuous dependent variable with the help of independent variables.
○ The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable.
○ If a single independent variable is used for prediction, it is called Simple Linear Regression, and if there is more than one independent variable, it is called Multiple Linear Regression.
○ By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables, and this relationship should be linear in nature.
○ The output of linear regression should only be continuous values such as price, age, salary, etc. The relationship between the dependent variable and the independent variable can be shown as in the image below:
In the image, the dependent variable (salary) is on the Y-axis and the independent variable (experience) is on the X-axis. The regression line can be written as:
y = a0 + a1*x + ε
where a0 and a1 are the coefficients and ε is the error term.
Logistic Regression:
○ Logistic regression is one of the most popular machine learning algorithms. It comes under the supervised learning technique.
○ It can be used for classification as well as for regression problems, but it is mainly used for classification problems.
○ Logistic regression is used to predict a categorical dependent variable with the help of independent variables.
○ The output of a logistic regression problem can only be between 0 and 1.
○ Logistic regression can be used where probabilities between two classes are required, such as whether it will rain today or not, 0 or 1, true or false, etc.
○ Logistic regression is based on the concept of Maximum Likelihood Estimation. According to this estimation, the observed data should be the most probable.
○ In logistic regression, we pass the weighted sum of inputs through an activation function that maps values to between 0 and 1. This activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. Consider the image below:
○ The equation for logistic regression is:
log(y / (1 - y)) = b0 + b1*x1 + b2*x2 + ... + bn*xn
Difference between Linear Regression and Logistic Regression:
Linear Regression | Logistic Regression
Linear regression is used to predict the continuous dependent variable using a given set of independent variables. | Logistic regression is used to predict the categorical dependent variable using a given set of independent variables.
Linear regression is used for solving regression problems. | Logistic regression is used for solving classification problems.
In linear regression, we predict the value of continuous variables. | In logistic regression, we predict the values of categorical variables.
In linear regression, we find the best-fit line, by which we can easily predict the output. | In logistic regression, we find the S-curve, by which we can classify the samples.
The least squares estimation method is used for estimation of accuracy. | The maximum likelihood estimation method is used for estimation of accuracy.
The output of linear regression must be a continuous value, such as price, age, etc. | The output of logistic regression must be a categorical value, such as 0 or 1, Yes or No, etc.
In linear regression, the relationship between the dependent variable and the independent variables must be linear. | In logistic regression, a linear relationship between the dependent and independent variables is not required.
In linear regression, there may be collinearity between the independent variables. | In logistic regression, there should not be collinearity between the independent variables.
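A minimal sketch of both algorithms with scikit-learn is shown below; the tiny experience/salary and hours-studied/pass datasets are invented purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict a continuous value (salary) from experience (years).
experience = np.array([[1], [2], [3], [4], [5], [6]])
salary = np.array([30, 35, 41, 46, 50, 57])            # in thousands
lin = LinearRegression().fit(experience, salary)
print(f"predicted salary for 7 years: {lin.predict([[7]])[0]:.1f}")
print(f"coefficients a1={lin.coef_[0]:.2f}, a0={lin.intercept_:.2f}")

# Logistic regression: predict a categorical value (pass = 1 / fail = 0) from hours studied.
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])
log = LogisticRegression().fit(hours, passed)
print(f"P(pass | 4.5 hours): {log.predict_proba([[4.5]])[0, 1]:.2f}")
print("predicted class for 4.5 hours:", log.predict([[4.5]])[0])
```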
4.
Binary classification is a machine learning task that involves classifying data instances
into one of two possible classes. The goal is to build a model that can learn from
labeled training data to accurately predict the class of unseen instances. Here's an
explanation of binary classification with an example:
Example: Let's consider a binary classification problem of predicting whether a bank
loan applicant is "approved" or "rejected" based on certain attributes such as income,
credit score, and loan amount.
1. Dataset: We have a labeled dataset that contains information about previous loan
applicants along with their loan approval status. Each instance in the dataset
consists of input features (income, credit score, loan amount) and the
corresponding class label (approved or rejected).
2. Training Phase: In the training phase, we use the labeled dataset to train a binary
classification model. The model learns to identify patterns and relationships
between the input features and the corresponding class labels.
For example, the model may learn that applicants with a higher income and a good
credit score are more likely to be approved for a loan, while those with a low income and
a poor credit score are more likely to be rejected.
3. Model Building: Based on the training data, we select an appropriate algorithm
for binary classification, such as logistic regression, support vector machines
(SVM), or decision trees.
For instance, if we choose logistic regression, the algorithm will estimate the
coefficients for the input features to create a decision boundary that separates the
approved and rejected classes. This boundary will be based on the probabilities of class
membership.
4. Feature Engineering: Before training the model, we may perform feature
engineering to preprocess and transform the input features. This can include
steps like normalization, handling missing values, or encoding categorical
variables.
For example, we may normalize the income and loan amount values to a standardized
scale and encode categorical variables like employment type or education level using
one-hot encoding.
5. Model Training: During the training phase, the binary classification model adjusts
its parameters using an optimization algorithm to minimize the prediction error.
The model learns to distinguish between the two classes based on the provided
training data.
The training process iteratively updates the model's parameters until it reaches a point
where it minimizes the difference between predicted and actual class labels for the
training instances.
6. Evaluation and Prediction: After training, we evaluate the performance of the
binary classification model using a separate set of labeled test data. The model
predicts the class labels for the test instances based on their input features, and
we compare the predicted labels with the true labels to assess accuracy,
precision, recall, and other performance metrics.
For example, we can evaluate how accurately the model predicts loan approval status
for the test instances by calculating metrics like accuracy (the proportion of correctly
predicted instances), precision (the proportion of true positives among predicted
positives), recall (the proportion of true positives identified), and F1 score (a combined
metric of precision and recall).
7. Model Deployment: Once the binary classification model has been trained and
evaluated, it can be deployed to make predictions on new, unseen instances. It
can classify new loan applicants as either approved or rejected based on their
input features.
This allows banks or financial institutions to automate the loan approval process,
enabling faster and more efficient decision-making.
Binary classification is a fundamental task in machine learning with numerous
applications across various domains, including finance, healthcare, marketing, and
more.
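A minimal sketch of the training and evaluation steps described above, using scikit-learn with a synthetic stand-in for the loan-applicant data; all data and parameter choices are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for applicant features (income, credit score, loan amount, ...).
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Feature scaling + logistic regression as the binary classifier.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)                 # training phase
y_pred = model.predict(X_test)              # predict approved (1) / rejected (0)

print("accuracy :", round(accuracy_score(y_test, y_pred), 3))
print("precision:", round(precision_score(y_test, y_pred), 3))
print("recall   :", round(recall_score(y_test, y_pred), 3))
print("f1 score :", round(f1_score(y_test, y_pred), 3))
```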
5.
A decision tree is a popular supervised machine learning algorithm that is used for both
classification and regression tasks. It creates a tree-like model of decisions and their
possible consequences. Each internal node in the tree represents a feature or attribute,
each branch represents a decision rule, and each leaf node represents an outcome or
class label. Decision trees are easy to interpret and visualize, making them widely used
in various domains. Here's an explanation of the procedure to construct a decision tree:
1. Dataset:
Start with a labeled dataset that contains instances with input features and their
corresponding class labels. Each instance should have a set of features and a known
class label. For example, consider a dataset of patients with attributes like age, gender,
symptoms, and a binary class label indicating whether they have a certain disease or
not.
2. Select a Root Node:
Choose an attribute from the dataset that will act as the root node of the decision tree.
The attribute is selected based on various criteria, such as information gain, Gini
impurity, or gain ratio. These criteria evaluate the effectiveness of an attribute in
splitting the data and creating distinct classes.
3. Splitting the Data:
Divide the dataset into subsets based on the values of the selected attribute (root
node). Each subset represents a different branch from the root node. For example, if the
root node is the "age" attribute, the data might be split into subsets for different age
ranges.
4. Attribute Selection:
Repeat the attribute selection process for each subset or branch. Choose the next
attribute that best splits the data and creates pure or homogeneous subsets. This
process is typically based on the same criteria used for selecting the root node.
5. Recursive Construction:
Continue recursively splitting the data based on attribute selection until a stopping
criterion is met. Stopping criteria can include reaching a maximum depth for the tree,
having a minimum number of instances per leaf, or achieving a pure subset (where all
instances belong to the same class).
6. Handling Missing Values:
Decide how to handle instances with missing attribute values. This can be done by
either ignoring those instances, replacing the missing values with the most common
value of the attribute, or using more advanced imputation techniques.
7. Pruning (Optional):
Pruning is an optional step to prevent overfitting and improve the generalization ability
of the decision tree. Pruning involves removing branches or nodes from the tree that do
not significantly contribute to its accuracy. This helps simplify the tree and reduces the
risk of overfitting the training data.
8. Assigning Class Labels:
Assign class labels to the leaf nodes of the decision tree based on the majority class in
each leaf or by using probability-based rules.
9. Visualization and Interpretation:
Visualize the constructed decision tree to gain insights into the decision-making
process. The tree structure can be displayed graphically, showing the attributes,
decision rules, and class labels at each node. This visualization aids in understanding
the decision logic and provides a clear interpretation of the model's behavior.
10. Prediction and Evaluation:
Finally, use the constructed decision tree to make predictions on new, unseen instances.
Traverse the tree by following the decision rules based on the input features of the
instance until reaching a leaf node, which provides the predicted class label. Evaluate
the performance of the decision tree model using appropriate evaluation metrics such
as accuracy, precision, recall, or F1 score.
The procedure outlined above outlines the basic steps involved in constructing a
decision tree. Various algorithms, such as ID3, C4.5, and CART, implement these steps
with slight variations. Each algorithm may have different attribute selection measures,
splitting rules, or pruning techniques, but the general idea remains the same: recursively
split the data based on attribute selection until a stopping criterion is met, and assign
class labels to the leaf nodes.
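As a rough illustration of this procedure, here is a minimal sketch using scikit-learn's DecisionTreeClassifier (the Iris dataset, the Gini criterion, and max_depth=3 are example choices, not part of the procedure above):

# Minimal decision-tree sketch with scikit-learn; dataset and settings are illustrative
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()                                    # 1. labeled dataset
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

# 2-7. attribute selection, splitting, and stopping are handled internally;
#      criterion="gini" is the splitting measure, max_depth=3 a stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=42)
tree.fit(X_train, y_train)

# 9. textual visualization of the learned decision rules
print(export_text(tree, feature_names=list(iris.feature_names)))

# 10. prediction and evaluation
print("Accuracy:", accuracy_score(y_test, tree.predict(X_test)))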
6.
Distance-based methods are a class of algorithms used for classification and
regression tasks. These methods rely on the concept of distance to make predictions.
Here's a concise explanation of various distance-based methods that you can use in
your semester exam:
1. k-Nearest Neighbors (k-NN):
k-Nearest Neighbors is a simple yet effective algorithm. It calculates the distance
between a new data instance and the instances in the training set. The k nearest
neighbors, determined by the smallest distances, are considered. Their class labels or
target values are then used to predict the class or estimate the target variable of the
new instance. The value of k determines the number of neighbors to consider.
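A minimal k-NN sketch with scikit-learn (k = 3, Euclidean distance, and the toy points below are illustrative choices):

# k-Nearest Neighbors sketch; k=3 and the toy data are illustrative
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_train = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn.fit(X_train, y_train)

# The new instance gets the majority label of its 3 nearest neighbors
print(knn.predict([[1.2, 1.9]]))   # expected output: [0]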
2. Euclidean Distance:
Euclidean distance is a commonly used metric in distance-based methods. It measures
the straight-line distance between two data points in a multidimensional feature space.
The Euclidean distance between two points (x1, y1) and (x2, y2) in a two-dimensional
space is calculated as the square root of the sum of the squared differences along each
dimension: sqrt((x2 - x1)^2 + (y2 - y1)^2). It can be extended to higher dimensions.
3. Manhattan Distance:
Manhattan distance, also known as city block distance or L1 distance, calculates the
distance between two points by summing the absolute differences between their
corresponding coordinates. It is called Manhattan distance because it measures the
distance as if traveling along the city blocks in a grid-like pattern. The Manhattan
distance between two points (x1, y1) and (x2, y2) is calculated as the sum of the
absolute differences: |x2 - x1| + |y2 - y1|. It can also be extended to higher dimensions.
4. Minkowski Distance:
Minkowski distance is a generalized form of distance that includes both Euclidean and
Manhattan distances as special cases. It is defined as the p-th root of the sum of the
p-th powers of the absolute differences between the coordinates of two points. The
Minkowski distance between two points (x1, y1) and (x2, y2) is calculated as the p-th
root of the sum of the p-th powers of the absolute differences: (|x2 - x1|^p + |y2 - y1|^p)^(1/p). The value of p determines the shape of the distance measure: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
5. Mahalanobis Distance:
Mahalanobis distance takes into account the correlation and covariance structure of the
data. It measures the distance between a point and a distribution or a group of points.
The Mahalanobis distance between a point and a distribution is calculated as the
square root of the sum of the squared differences between the point's coordinates and
the mean of the distribution, weighted by the inverse covariance matrix. It is useful
when dealing with correlated features.
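These distance measures can be computed directly, for example with SciPy (the sample points and the small dataset used for the covariance estimate are made up for illustration):

# Computing the distance measures described above with SciPy/NumPy
import numpy as np
from scipy.spatial import distance

p1, p2 = np.array([1.0, 2.0]), np.array([4.0, 6.0])

print(distance.euclidean(p1, p2))        # sqrt((4-1)^2 + (6-2)^2) = 5.0
print(distance.cityblock(p1, p2))        # |4-1| + |6-2| = 7.0 (Manhattan)
print(distance.minkowski(p1, p2, p=3))   # Minkowski distance with p = 3

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
VI = np.linalg.inv(np.cov(data.T))
print(distance.mahalanobis(p1, p2, VI))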
These are some important distance-based methods used in classification and
regression. They provide a flexible framework for analyzing data based on the concept
of proximity. Understanding these methods will help you tackle questions related to
distance-based algorithms in your semester exam.
Unit -3:1. Explain about AdaBoost ensemble and Gradient
Boosting ensemble?
2. Differences between bagging and Boosting?
3. What is the difference between hard and soft voting
classifiers?
4. Differences between decision tree and random
forest?
5. Explain the Stacking?
6. Write a note on SVM regression?
7. Write about Naive Bayes Classifiers Vs SVM in text
classification.
1.
AdaBoost (Adaptive Boosting)
AdaBoost is a boosting ensemble model and works especially
well with decision trees. A boosting model's key is learning from the previous mistakes, e.g., misclassified data points.
AdaBoost learns from the mistakes by increasing the weight of
misclassified data points.
Let’s illustrate how AdaBoost adapts.
Step 0: Initialize the weights of the data points. If the training set has 100 data points, then each point's initial weight should be 1/100 = 0.01.
Step 1: Train a decision tree
Step 2: Calculate the weighted error rate (e) of the decision tree. The weighted error rate (e) is the proportion of wrong predictions out of the total, where each wrong prediction counts according to its data point's weight. The higher the weight, the more the corresponding error contributes to the calculation of (e).
Step 3: Calculate this decision tree's weight in the ensemble:
the weight of this tree = learning rate * log((1 - e) / e)
● the higher the weighted error rate of a tree, 😫, the less decision power the tree will be given during the later voting
● the lower the weighted error rate of a tree, 😃, the more decision power the tree will be given during the later voting
Step 4: Update the weights of the data points:
● if the model got this data point correct, its weight stays the same
● if the model got this data point wrong, the new weight of this point = old weight * np.exp(weight of this tree)
Note: The higher the weight of the tree (more accurate this tree
performs), the more boost (importance) the misclassified data
point by this tree will get. The weights of the data points are
normalized after all the misclassified points are updated.
Step 5: Repeat Step 1 (until the number of trees we set to train is reached)
Step 6: Make the final prediction
AdaBoost makes a new prediction by adding up the weight (of each tree) multiplied by the prediction (of each tree). Obviously, the tree with a higher weight will have more power to influence the final decision.
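The whole loop above is available off the shelf; a minimal sketch with scikit-learn's AdaBoostClassifier (the synthetic dataset, 100 estimators, and learning rate 0.5 are illustrative settings; the default weak learner is a depth-1 decision tree):

# AdaBoost sketch; dataset and hyperparameters are illustrative
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=0)

ada = AdaBoostClassifier(n_estimators=100,   # Step 5: number of trees to train
                         learning_rate=0.5,  # scales each tree's weight (Step 3)
                         random_state=0)
ada.fit(X, y)            # Steps 0-4 (weighting and reweighting) happen inside fit
print(ada.score(X, y))   # Step 6: weighted vote of all trees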
Gradient Boosting
Gradient Boosting is another boosting model. Remember, a boosting model's key is learning from the previous mistakes. Gradient Boosting learns from the mistakes by fitting the residual errors directly, rather than updating the weights of the data points.
Let’s illustrate how Gradient Boost learns.
Step 1: Train a decision tree
Step 2: Apply the decision tree just trained to predict
Step 3: Calculate the residuals of this decision tree and save the residual errors as the new y
Step 4: Repeat Step 1 (until the number of trees we set to train
is reached)
Step 5: Make the final prediction
Gradient Boosting makes the final prediction by simply adding up the predictions of all the trees.
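The residual-fitting loop can be sketched by hand; the code below is a simplified illustration for regression (three shallow trees, a learning rate of 0.1, and the synthetic sine data are arbitrary choices, not a full gradient boosting implementation):

# Simplified gradient-boosting sketch: each new tree fits the residuals
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

prediction = np.full_like(y, y.mean())   # start from a constant prediction
learning_rate = 0.1
trees = []

for _ in range(3):                        # Step 4: repeat for a fixed number of trees
    residual = y - prediction             # Step 3: residual errors become the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)   # Steps 1-2
    prediction += learning_rate * tree.predict(X)                # Step 5: add up predictions
    trees.append(tree)

print("MSE after 3 trees:", np.mean((y - prediction) ** 2))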
2.
Bagging vs Boosting in Machine Learning
As we know, Ensemble learning helps improve machine learning results by
combining several models. This approach allows the production of better
predictive performance compared to a single model. The basic idea is to learn a set
of classifiers (experts) and to allow them to vote. Bagging and Boosting are
two types of Ensemble Learning. These two decrease the variance of a single
estimate as they combine several estimates from different models. So the result
may be a model with higher stability. Let’s understand these two terms in a
glimpse.
1. Bagging: It is a homogeneous weak learners’ model that learns from
each other independently in parallel and combines them for
determining the model average.
2. Boosting: It is also a homogeneous weak learners’ model but works
differently from Bagging. In this model, learners learn sequentially and
adaptively to improve model predictions of a learning algorithm.
Let’s look at both of them in detail and understand the Difference between
Bagging and Boosting.
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble
meta-algorithm designed to improve the stability and accuracy of machine
learning algorithms used in statistical classification and regression. It decreases
the variance and helps to avoid overfitting. It is usually applied to decision tree
methods. Bagging is a special case of the model averaging approach.
Description of the Technique
Given a set D of d tuples, at each iteration i a training set Di of d tuples is selected via row sampling with replacement from D (i.e., the bootstrap: the same tuple can be selected more than once). Then a classifier model Mi is learned for each training set Di. Each classifier Mi returns its class prediction. The bagged classifier M* counts the votes and assigns the class with the most votes to X (the unknown sample).
Implementation Steps of Bagging
● Step 1: Multiple subsets are created from the original data set with
equal tuples, selecting observations with replacement.
● Step 2: A base model is created on each of these subsets.
● Step 3: Each model is learned in parallel with each training set and
independent of each other.
● Step 4: The final predictions are determined by combining the
predictions from all the models.
An illustration for the concept of bootstrap aggregating (Bagging)
Example of Bagging
The Random Forest model uses Bagging, where decision tree models with
higher variance are present. It makes random feature selection to grow trees.
Several random trees make a Random Forest.
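A minimal bagging sketch with scikit-learn's BaggingClassifier (by default the base model is a decision tree; the synthetic dataset and 10 estimators are example choices):

# Bagging sketch: 10 base models, each trained on a bootstrap sample (Steps 1-4)
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, random_state=0)

bag = BaggingClassifier(n_estimators=10,   # number of bootstrap subsets / models
                        bootstrap=True,    # row sampling with replacement
                        random_state=0)
bag.fit(X, y)                              # models are trained independently, in parallel
print(bag.score(X, y))                     # final prediction = majority vote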
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model by
using weak models in series. Firstly, a model is built from the training data. Then
the second model is built which tries to correct the errors present in the first
model. This procedure is continued and models are added until either the
complete training data set is predicted correctly or the maximum number of
models is added.
Boosting Algorithms
There are several boosting algorithms. The original ones, proposed by Robert
Schapire and Yoav Freund were not adaptive and could not take full advantage
of the weak learners. Schapire and Freund then developed AdaBoost, an
adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost
was the first really successful boosting algorithm developed for the purpose of
binary classification. AdaBoost is short for Adaptive Boosting and is a very
popular boosting technique that combines multiple “weak classifiers” into a
single “strong classifier”.
Algorithm:
1. Initialise the dataset and assign equal weight to each of the data point.
2. Provide this as input to the model and identify the wrongly classified
data points.
3. Increase the weight of the wrongly classified data points and decrease
the weights of correctly classified data points. And then normalize the
weights of all data points.
4. If the required results have been obtained, go to step 5; otherwise, go to step 2.
5. End
An illustration presenting the intuition behind the boosting algorithm, consisting of the parallel learners and
weighted dataset.
Similarities Between Bagging and Boosting
Bagging and Boosting, both being the commonly used methods, have a universal
similarity of being classified as ensemble methods. Here we will explain the
similarities between them.
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or taking the majority of them, i.e., Majority Voting).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting
1. Bagging is the simplest way of combining predictions that belong to the same type, whereas Boosting is a way of combining predictions that belong to different types.
2. Bagging aims to decrease variance, not bias, whereas Boosting aims to decrease bias, not variance.
3. In Bagging, each model receives equal weight, whereas in Boosting, models are weighted according to their performance.
4. In Bagging, each model is built independently, whereas in Boosting, new models are influenced by the performance of previously built models.
5. In Bagging, different training data subsets are selected using row sampling with replacement (random sampling) from the entire training dataset, whereas in Boosting, every new subset contains the elements that were misclassified by previous models.
6. Bagging tries to solve the over-fitting problem, whereas Boosting tries to reduce bias.
7. If the classifier is unstable (high variance), apply Bagging; if the classifier is stable and simple (high bias), apply Boosting.
8. In Bagging, base classifiers are trained in parallel, whereas in Boosting, base classifiers are trained sequentially.
9. Example: the Random Forest model uses Bagging, whereas AdaBoost uses Boosting techniques.
3.
Hard and soft voting classifiers are two approaches used in ensemble
learning, where multiple individual classifiers are combined to make
predictions. The main difference between hard and soft voting classifiers
lies in how they aggregate the predictions of the individual classifiers.
Here's a breakdown of each approach:
1. Hard Voting Classifier:
In a hard voting classifier, the final prediction is made by taking a majority
vote among the predictions of the individual classifiers. Each classifier in
the ensemble contributes one vote, and the class label that receives the
most votes is selected as the final prediction.
For example, consider an ensemble of three classifiers: Classifier A
predicts class 1, Classifier B predicts class 2, and Classifier C predicts
class 1. In hard voting, the majority class is determined by counting the
votes, and the final prediction would be class 1 since it received two out of
three votes.
Hard voting classifiers are suitable when the individual classifiers are
equally weighted, and the focus is on the majority opinion of the ensemble.
It can be effective in situations where the individual classifiers have
complementary strengths and weaknesses, leading to a more robust and
accurate prediction.
2. Soft Voting Classifier:
In a soft voting classifier, the final prediction is made based on the average
or weighted average of the predicted probabilities for each class from the
individual classifiers. Rather than considering just the class labels, soft
voting takes into account the confidence or probability estimates associated
with each class.
For example, suppose we have an ensemble of three classifiers that
provide predicted probabilities for class 1 and class 2. Classifier A predicts
(0.8, 0.2), Classifier B predicts (0.6, 0.4), and Classifier C predicts (0.7,
0.3). In soft voting, the probabilities are averaged for each class, resulting
in (0.7, 0.3). The class with the highest average probability, in this case, is
class 1, so it would be selected as the final prediction.
Soft voting classifiers consider more nuanced information from the
individual classifiers, allowing them to capture confidence levels and subtle
differences in probabilities. This approach can be beneficial when the
individual classifiers provide probability estimates, and the goal is to make
more informed and calibrated predictions.
In summary, the main difference between hard and soft voting classifiers is
that hard voting relies on majority voting based on class labels, while soft
voting considers the predicted probabilities of the individual classifiers to
make a more nuanced decision. The choice between these approaches
depends on the specific problem and the characteristics of the individual
classifiers in the ensemble.
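Both schemes are available through scikit-learn's VotingClassifier; a minimal sketch (the three base classifiers and the synthetic dataset are arbitrary example choices):

# Hard vs soft voting sketch; base classifiers and data are illustrative
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
estimators = [("lr", LogisticRegression(max_iter=1000)),
              ("rf", RandomForestClassifier(random_state=0)),
              ("nb", GaussianNB())]

hard = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority of class labels
soft = VotingClassifier(estimators, voting="soft").fit(X, y)  # average of predicted probabilities
print(hard.score(X, y), soft.score(X, y))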
4.
Random Forest vs Decision Tree:
1. Random Forest: While building a random forest, the number of rows is selected randomly. Decision Tree: Whereas, it builds several decision trees and finds out the output.
2. Random Forest: It combines two or more decision trees together. Decision Tree: Whereas a decision tree is a collection of variables or a data set or attributes.
3. Random Forest: It gives accurate results. Decision Tree: Whereas it gives less accurate results.
4. Random Forest: By using multiple trees, it reduces the chances of overfitting. Decision Tree: On the other hand, a decision tree has the possibility of overfitting, which is an error that occurs due to variance or due to bias.
5. Random Forest: Random forest is more complicated to interpret. Decision Tree: Whereas, the decision tree is simple, so it is easy to read and understand.
6. Random Forest: In a random forest, we need to generate, process, and analyze trees, so the process is slow; it may take an hour or even days. Decision Tree: The decision tree is less accurate, but it processes fast, which means it is fast to implement.
7. Random Forest: It has more computation because it has n number of decision trees; more decision trees mean more computation. Decision Tree: Whereas it has less computation.
8. Random Forest: It has complex visualization, but it plays an important role in showing hidden patterns behind the data. Decision Tree: On the other hand, it is simple to visualize because we just need to fit the decision tree model.
9. Random Forest: Classification and regression problems can be solved by using random forest. Decision Tree: Whereas a decision tree is also used to solve classification and regression problems.
10. Random Forest: It uses the random subspace method and bagging during tree construction, which has built-in feature importance. Decision Tree: Whereas a decision is made based on the selected sample's features; decision tree learning is a process of finding the optimal value for each internal tree node.
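To see the two side by side in code, a minimal sketch comparing a single decision tree with a random forest (the synthetic dataset and settings are illustrative):

# Decision tree vs random forest sketch; dataset and settings are illustrative
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_informative=10, random_state=0)

tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100 bagged trees

print("decision tree:", cross_val_score(tree, X, y, cv=5).mean())
print("random forest:", cross_val_score(forest, X, y, cv=5).mean())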
5.
Stacking in Machine Learning
There are many ways to ensemble models in machine learning, such as Bagging,
Boosting, and stacking. Stacking is one of the most popular ensemble machine learning
techniques used to predict multiple nodes to build a new model and improve model
performance. Stacking enables us to train multiple models to solve similar problems,
and based on their combined output, it builds a new model with improved performance.
In this topic, "Stacking in Machine Learning", we will discuss a few important concepts
related to stacking, the general architecture of stacking, important key points to
implement stacking, and how stacking differs from bagging and boosting in machine
learning. Before starting this topic, first, understand the concepts of the ensemble in
machine learning. So, let's start with the definition of ensemble learning in machine
learning.
What is Ensemble learning in Machine Learning?
Ensemble learning is one of the most powerful machine learning techniques that use
the combined output of two or more models/weak learners and solve a particular
computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of
various decision trees combined.
Ensemble learning is primarily used to improve the model performance, such as
classification, prediction, function approximation, etc. In simple words, we can
summarise the ensemble learning as follows:
"An ensembled model is a machine learning model that combines the predictions from two or more models.”
There are 3 most common ensemble learning methods in machine learning. These are as follows:
○ Bagging
○ Boosting
○ Stacking
However, we will mainly discuss Stacking on this topic.
1. Bagging
Bagging is a method of ensemble modeling, which is primarily used to solve supervised machine learning problems.
It is generally completed in two steps as follows:
○ Bootstrapping: It is a random sampling method that is used to derive samples
from the data using the replacement procedure. In this method, first, random
data samples are fed to the primary model, and then a base learning algorithm is
run on the samples to complete the learning process.
○ Aggregation: This is a step that involves the process of combining the output of
all base models and, based on their output, predicting an aggregate result with
greater accuracy and reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are ensembled in parallel. Further, in regression problems, we use the average of these predictions to get the final output, whereas, in classification problems, the majority-voted class is taken as the final prediction.
2. Boosting
Boosting is an ensemble method that enables each member to learn from the preceding member's mistakes and
make better predictions for the future. Unlike the bagging method, in boosting, all base learners (weak) are arranged
in a sequential format so that they can learn from the mistakes of their preceding learner. Hence, in this way, all weak
learners get turned into strong learners and make a better predictive model with significantly improved performance.
We have a basic understanding of ensemble techniques in machine learning and their two common methods, i.e.,
bagging and boosting. Now, let's discuss a different paradigm of ensemble learning, i.e., Stacking.
3. Stacking
Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in a parallel manner in such a way that, by combining them with a meta learner, we can produce better predictions. This ensemble technique works by feeding the combined predictions of multiple weak learners to a meta learner so that a better output prediction model can be achieved.
In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine the
input predictions to make a better output prediction.
Stacking is also known as stacked generalization and is an extended form of the Model Averaging Ensemble technique, in which the sub-models contribute to the combined prediction according to their performance and build a new model with better predictions. This new model is stacked on top of the others; this is the reason why it is named stacking.
Architecture of Stacking
The architecture of the stacking model is designed in such a way that it consists of two or more base (learner) models and a meta-model that combines the predictions of the base models. These base models are called level 0
models, and the meta-model is known as the level 1 model. So, the Stacking ensemble method includes original
(training) data, primary level models, primary level predictions, a secondary level model, and the final prediction. The basic architecture of stacking consists of the following components:
○ Original data: This data is divided into n-folds and is also considered test data or
training data.
○ Base models: These models are also referred to as level-0 models. These models
use training data and provide compiled predictions (level-0) as an output.
○ Level-0 Predictions: Each base model is triggered on some training data and
provides different predictions, which are known as level-0 predictions.
○ Meta Model: The architecture of the stacking model consists of one meta-model,
which helps to best combine the predictions of the base models. The
meta-model is also known as the level-1 model.
○ Level-1 Prediction: The meta-model learns how to best combine the predictions
of the base models and is trained on different predictions made by individual
base models, i.e., data not used to train the base models are fed to the
meta-model, predictions are made, and these predictions, along with the
expected outputs, provide the input and output pairs of the training dataset used
to fit the meta-model.
Steps to implement Stacking models:
There are some important steps to implementing stacking models in machine learning.
These are as follows:
○ Split the training dataset into n folds using RepeatedStratifiedKFold, as this is the most common approach to preparing training datasets for meta-models.
○ The base model is fitted on the first n-1 folds and makes predictions for the nth fold.
○ The prediction made in the above step is added to the x1_train list.
○ Repeat steps 2 & 3 for the remaining folds, which gives an x1_train array of size n.
○ Now, the model is trained on all the n parts, and it will make predictions for the sample (test) data.
○ Add this prediction to the y1_test list.
○ In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using Models 2 and 3 for training, respectively, to get Level 2 predictions.
○ Now train the meta-model on the level 1 predictions, where these predictions will be used as features for the model.
○ Finally, the meta learner can be used to make predictions on test data in the stacking model.
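In practice, scikit-learn's StackingClassifier carries out this fold-based procedure internally; a minimal sketch (the two level-0 models, the logistic-regression meta-model, and the synthetic dataset are example choices):

# Stacking sketch: two level-0 models plus a logistic-regression meta-model
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svc", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # the level-1 / meta-model
    cv=5)                                  # out-of-fold predictions train the meta-model
stack.fit(X, y)
print(stack.score(X, y))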
Summary of Stacking Ensemble
Stacking is an ensemble method in which a meta-model learns how to combine the predictions given by the learner models and prepares a final model with more accurate predictions. The main benefit of a stacking ensemble is that it can harness the capabilities of a range of well-performing models to solve classification and regression problems. Further, it helps to prepare a better model having better predictions than all the
individual models. In this topic, we have learned various ensemble techniques and their
definitions, the stacking ensemble method, the architecture of stacking models, and
steps to implement stacking models in machine learning.
6.
7.
Naive Bayes classifiers and Support Vector Machines (SVM)
are two popular machine learning algorithms commonly used
for text classification tasks. While both approaches can be
effective, they have distinct characteristics and operate based
on different principles. Here's a comparison of Naive Bayes
classifiers and SVM in the context of text classification:
Naive Bayes Classifiers:
- Naive Bayes classifiers are based on the probabilistic
principle of Bayes' theorem.
- They assume that the features (words or tokens) in the input
text are conditionally independent, given the class label.
- Naive Bayes classifiers are computationally efficient and
require a relatively small amount of training data.
- They perform well even in situations with high-dimensional
feature spaces, such as text classification tasks.
- Naive Bayes classifiers are known for their simplicity and
interpretability, as they provide clear probabilistic predictions.
- However, they may struggle when the independence
assumption is violated or when dealing with rare or unseen
words.
Support Vector Machines (SVM):
- SVM is a discriminative algorithm that aims to find an
optimal hyperplane to separate data points of different
classes.
- SVMs map the input text into a high-dimensional feature space, where the data may become linearly separable.
- They work well in situations where the number of features is
large and the data is not linearly separable in the original
space.
- SVMs can handle both linear and non-linear decision
boundaries through the use of different kernel functions.
- They are effective in dealing with high-dimensional text data
and can handle large-scale text classification tasks.
- SVMs are less interpretable than Naive Bayes classifiers, as
they do not provide direct probability estimates.
- However, they can be computationally expensive, especially
when dealing with large datasets.
Choosing between Naive Bayes classifiers and SVMs for text
classification depends on various factors:
- Dataset Size: Naive Bayes classifiers can work well with
small training datasets, while SVMs can handle larger
datasets.
- Data Characteristics: If the independence assumption of
Naive Bayes is reasonable and the classes are
well-separated, it can be a good choice. If the data is complex
or nonlinear, SVMs may be more suitable.
- Interpretability: If interpretability is important, Naive Bayes
classifiers provide clear probabilistic predictions, whereas
SVMs focus on maximizing classification performance.
- Computational Efficiency: Naive Bayes classifiers are
generally faster to train and require less computational
resources compared to SVMs.
In summary, Naive Bayes classifiers are simple, interpretable,
and computationally efficient, making them suitable for text
classification tasks with smaller datasets. SVMs, on the other
hand, are powerful models that can handle large-scale text
classification tasks and nonlinear data, but they are less
interpretable and can be computationally expensive. The
choice between the two depends on the specific requirements
and characteristics of the text classification problem at hand.
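A minimal text-classification sketch contrasting the two (the tiny toy corpus, TF-IDF features, and the specific estimators MultinomialNB and LinearSVC are illustrative choices):

# Naive Bayes vs SVM on a toy text dataset; corpus and models are illustrative
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

texts = ["free money offer now", "meeting scheduled tomorrow",
         "win a free prize today", "project report attached"]
labels = [1, 0, 1, 0]                       # 1 = spam, 0 = not spam (toy labels)

X = TfidfVectorizer().fit_transform(texts)  # sparse, high-dimensional features

nb = MultinomialNB().fit(X, labels)         # probabilistic, fast, interpretable
svm = LinearSVC().fit(X, labels)            # margin-based, no direct probabilities

print(nb.predict_proba(X[:1]))              # Naive Bayes gives class probabilities
print(svm.decision_function(X[:1]))         # SVM gives a signed margin score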
Unit -4:1. Write about DBSCAN and Gaussian Mixtures?
2. Write the Main Approaches for Dimensionality
Reduction?
3. Explain K-Means clustering with an Example?
4. How to implement PCA using Sci-Kit learn?
5. Write a note on Randomized PCA and Kernel
PCA?
1.
Here is a detailed explanation of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and Gaussian Mixtures:
DBSCAN:
DBSCAN is a density-based clustering algorithm that groups data
points based on their density and proximity. It is particularly effective
for discovering clusters of arbitrary shape within a dataset. Here's how
DBSCAN works:
1. Density-Based Clustering:
DBSCAN defines clusters as dense regions of data points separated
by regions of lower density. It considers two important parameters:
epsilon (ε), which represents the radius within which neighboring
points are considered, and minPts, which is the minimum number of
points required to form a dense region.
2. Core Points, Border Points, and Noise:
DBSCAN identifies three types of points:
- Core Points: A data point is considered a core point if it has at least
minPts points within a distance of ε.
- Border Points: A data point is considered a border point if it has
fewer than minPts points within ε but is within ε of a core point.
- Noise Points: A data point is considered noise if it is neither a core
point nor a border point.
3. Clustering Process:
The clustering process of DBSCAN involves the following steps:
- Initially, all points are marked as unvisited.
- A core point is randomly chosen, and its ε-neighborhood is
explored.
- If the ε-neighborhood contains at least minPts points, a new cluster
is created.
- The ε-neighborhood points are added to the cluster, and their
ε-neighborhoods are further explored recursively until no more points
can be added.
- The process repeats with unvisited core points until all points are
visited.
4. Resulting Clusters:
DBSCAN outputs clusters as connected components of core points
and their reachable neighbors. Border points may belong to more than
one cluster, while noise points do not belong to any cluster.
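A minimal DBSCAN sketch with scikit-learn (eps and min_samples below are arbitrary values that would normally be tuned to the dataset):

# DBSCAN sketch: eps (ε) and min_samples (minPts) are arbitrary example values
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                          # cluster index per point; -1 marks noise
print("clusters found:", set(labels) - {-1})
print("noise points:", list(labels).count(-1))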
Gaussian Mixtures:
Gaussian Mixtures, also known as Gaussian Mixture Models (GMM),
is a probabilistic model that assumes the data is generated from a
mixture of Gaussian distributions. It is a parametric approach to
clustering that models the underlying probability distribution of the
data. Here's how Gaussian Mixtures work:
1. Probability Distribution Modeling:
Gaussian Mixtures model the data as a weighted sum of multiple
Gaussian distributions, where each distribution represents a
component or cluster. The model assumes that each data point is
generated by one of these Gaussian components.
2. Parameter Estimation:
The parameters of a Gaussian Mixture model include the means,
covariances, and weights of the Gaussian components. These
parameters are estimated using the Expectation-Maximization (EM)
algorithm, which iteratively maximizes the likelihood of the data.
3. Soft Assignments:
Unlike DBSCAN, which assigns data points to discrete clusters,
Gaussian Mixtures provide soft assignments. Each data point is
assigned probabilities indicating the likelihood of belonging to each
cluster.
4. Cluster Assignment:
To assign data points to clusters, a threshold can be set on the
probabilities. Points with high probabilities for a particular cluster are
assigned to that cluster. Alternatively, the most likely cluster can be
determined based on the maximum probability.
5. Flexibility and Complexity:
Gaussian Mixtures can model complex and overlapping clusters due
to the flexibility of the underlying Gaussian distributions. However, the
number of components or clusters needs to be specified in advance,
which can be a limitation.
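A minimal Gaussian Mixture sketch with scikit-learn (three components, full covariances, and the synthetic blobs are example settings); predict_proba returns the soft assignments described above:

# Gaussian Mixture sketch; n_components=3 and the synthetic data are illustrative
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                        # parameters estimated with the EM algorithm

print(gmm.means_)                 # estimated component means
print(gmm.predict_proba(X[:3]))   # soft assignments: probability per component
print(gmm.predict(X[:3]))         # hard assignment = most likely component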
In summary, DBSCAN is a density-based clustering algorithm that
identifies dense regions in the data, while Gaussian Mixtures model
the data as a mixture of Gaussian distributions. DBSCAN discovers
clusters based on density and proximity, whereas Gaussian Mixtures
assume data is generated from Gaussian components. DBSCAN is
particularly useful for finding clusters of arbitrary shape, while
Gaussian Mixtures can handle complex and overlapping clusters. The
choice between the two depends on the characteristics of the dataset and
the specific requirements of the clustering task at hand.
2.
Introduction to Dimensionality Reduction Technique
What is Dimensionality Reduction?
The number of input features, variables, or columns present in a given dataset is known
as dimensionality, and the process to reduce these features is called dimensionality
reduction.
A dataset contains a huge number of input features in various cases, which makes the predictive modeling task more complicated. Because it is very difficult to visualize or make predictions for a training dataset with a high number of features, dimensionality reduction techniques are required in such cases.
Dimensionality reduction technique can be defined as, "It is a way of converting the
higher dimensions dataset into lesser dimensions dataset ensuring that it provides
similar information." These techniques are widely used in machine learning for obtaining
a better fit predictive model while solving the classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data
visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality
Handling the high-dimensional data is very difficult in practice, commonly known as the
curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples required to generalize well also increases proportionally, and the chance of overfitting also increases. If a machine learning model is trained on high-dimensional data, it becomes overfitted and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Benefits of applying Dimensionality Reduction
Some benefits of applying dimensionality reduction technique to the given dataset are
given below:
○ By reducing the dimensions of the features, the space required to store the
dataset also gets reduced.
○ Less computation and training time are required with reduced feature dimensions.
○ Reduced dimensions of features of the dataset help in visualizing the data
quickly.
○ It removes the redundant features (if present) by taking care of multicollinearity.
Disadvantages of dimensionality Reduction
There are also some disadvantages of applying the dimensionality reduction, which are
given below:
○ Some data may be lost due to dimensionality reduction.
○ In the PCA dimensionality reduction technique, sometimes the number of principal components to retain is not known in advance.
Approaches of Dimension Reduction
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and
leaving out the irrelevant features present in a dataset to build a model of high accuracy.
In other words, it is a way of selecting the optimal features from the input dataset.
Three methods are used for the feature selection:
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant
features is taken. Some common techniques of filters method are:
○ Correlation
○ Chi-Square Test
○ ANOVA
○ Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated. The performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filtering method but is more complex to work with. Some common techniques of wrapper methods are:
○ Forward Selection
○ Backward Selection
○ Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:
○ LASSO
○ Elastic Net
○ Ridge Regression, etc.
Feature Extraction:
Feature extraction is the process of transforming the space containing many
dimensions into space with fewer dimensions. This approach is useful when we want to
keep the whole information but use fewer resources while processing the information.
Some common feature extraction techniques are:
a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis
Common techniques of Dimensionality Reduction
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
Principal Component Analysis (PCA)
Principal Component Analysis is a statistical process that converts the observations of
correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling.
PCA works by considering the variance of each attribute, because an attribute with high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels.
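As a quick reference (also relevant to question 4 of this unit), a minimal PCA sketch with scikit-learn, assuming the Iris dataset and a reduction to two components (both illustrative choices):

# PCA sketch: Iris data and n_components=2 are illustrative choices
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)        # project onto the principal components

print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # variance captured by each component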
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing Linear
Regression or Logistic Regression model. Below steps are performed in this technique
to reduce the dimensionality or in feature selection:
○ In this technique, firstly, all the n variables of the given dataset are taken to train
the model.
○ The performance of the model is checked.
○ Now we will remove one feature each time and train the model on n-1 features for
n times, and will compute the performance of the model.
○ We will check the variable that has made the smallest or no change in the
performance of the model, and then we will drop that variable or features; after
that, we will be left with n-1 features.
○ Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum tolerable error rate, we can define the optimal number of features required for the machine learning algorithm.
Forward Feature Selection
Forward feature selection follows the inverse process of the backward elimination
process. It means, in this technique, we don't eliminate the feature; instead, we will find
the best features that can produce the highest increase in the performance of the
model. Below steps are performed in this technique:
○ We start with a single feature only, and progressively we will add each feature at
a time.
○ Here we will train the model on each feature separately.
○ The feature with the best performance is selected.
○ The process is repeated until adding another feature no longer gives a significant increase in the performance of the model.
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as they do not
carry much useful information. To perform this, we can set a threshold level, and if a
variable has missing values more than that threshold, we will drop that variable. The
higher the threshold value, the more efficient the reduction.
Low Variance Filter
As with the missing value ratio technique, data columns with little variation carry less information. Therefore, we need to calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped, because low-variance features will not affect the target variable.
High Correlation Filter
High Correlation refers to the case when two variables carry approximately similar
information. Due to this factor, the performance of the model can be degraded. This
correlation between the independent numerical variable gives the calculated value of
the correlation coefficient. If this value is higher than the threshold value, we can remove
one of the variables from the dataset. We can consider those variables or features that
show a high correlation with the target variable.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine
learning. This algorithm contains an in-built feature importance package, so we do not
need to program it separately. In this technique, we need to generate a large set of trees
against the target variable, and with the help of usage statistics of each attribute, we
need to find the subset of features.
The random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using one-hot encoding.
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its correlation with other variables; variables within a group can have a high correlation among themselves, but they have a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, Income and Spend. These two variables have a high correlation, which means people with high income spend more, and vice versa. So, such variables are put into a group, and that
group is known as the factor. The number of these factors will be reduced as compared
to the original dimension of the dataset.
Auto-encoders
One of the popular methods of dimensionality reduction is the auto-encoder, which is a type of artificial neural network (ANN) whose main aim is to copy the inputs to its outputs. In this, the input is compressed into a latent-space representation, and the output is reconstructed from this representation. It has mainly two parts:
○ Encoder: The function of the encoder is to compress the input to form the
latent-space representation.
○ Decoder: The function of the decoder is to recreate the output from the
latent-space representation.
3.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the
clustering problems in machine learning or data science. In this topic, we will learn what
is K-means clustering algorithm, how the algorithm works, along with the Python
implementation of k-means clustering.
What is K-Means Algorithm?
K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled
dataset into different clusters. Here K defines the number of pre-defined clusters that
need to be created in the process, as if K=2, there will be two clusters, and for K=3, there
will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group having similar properties. It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The
main aim of this algorithm is to minimize the sum of distances between the data point
and their corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
○ Determines the best value for K center points or centroids by an iterative process.
○ Assigns each data point to its closest k-center. Those data points which are near
to the particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
The working of the K-Means algorithm is explained in the below steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassigning each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH.
Step-7: The model is ready.
Let's understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
○ Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them
into different clusters. It means here we will try to group these datasets into two
different clusters.
○ We need to choose some random k points or centroid to form the cluster. These
points can be either the points from the dataset or any other point. So, here we
are selecting the below two points as k points, which are not the part of our
dataset. Consider the below image:
○ Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied
to calculate the distance between two points. So, we will draw a median between
both the centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are near the K1 or blue centroid, and points to the right of the line are close to the yellow centroid. Let's color them blue and yellow for clear visualization.
○ As we need to find the closest cluster, so we will repeat the process by choosing
a new centroid. To choose the new centroids, we will compute the center of
gravity of these centroids, and will find new centroids as below:
○ Next, we will reassign each datapoint to the new centroid. For this, we will repeat
the same process of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line and two blue points are to the right of the line. So, these three points will be assigned to new centroids.
As reassignment has taken place, so we will again go to the step-4, which is finding new
centroids or K-points.
○ We will repeat the process by finding the center of gravity of centroids, so the
new centroids will be as shown in the below image:
○ As we got the new centroids so again will draw the median line and reassign the
data points. So, the image will be:
○ We can see in the above image; there are no dissimilar data points on either side
of the line, which means our model is formed. Consider the below image:
As our model is ready, so we can now remove the assumed centroids, and the two final
clusters will be as shown in the below image:
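The same procedure in code, a minimal scikit-learn sketch with K=2 on synthetic two-variable data (mirroring the walkthrough above; the dataset and settings are otherwise arbitrary):

# K-Means sketch: K=2 on two synthetic variables, mirroring the example above
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=2, n_features=2, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0)   # K is chosen in advance
labels = km.fit_predict(X)        # iterative assignment and centroid updates

print(km.cluster_centers_)        # the two final centroids
print(labels[:10])                # cluster index of the first ten points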
4.
5.
Randomized PCA:
Randomized PCA is a variant of Principal Component Analysis (PCA) that
provides an efficient approximation to the traditional PCA algorithm. It is
particularly useful when dealing with large datasets where the
computational cost of traditional PCA becomes prohibitive. Randomized
PCA speeds up the computation by using randomization techniques while
still preserving the most important components.
The main steps involved in Randomized PCA are as follows:
1. Randomized Sampling:
Instead of processing the entire dataset, Randomized PCA starts by
randomly selecting a subset of the data. This step significantly reduces the
computational complexity.
2. Matrix Approximation:
Next, the algorithm approximates the covariance matrix of the selected
data subset using various matrix approximation techniques, such as
randomized SVD (Singular Value Decomposition) or random projection.
These approximation methods allow for faster computations without
sacrificing much accuracy.
3. Traditional PCA on Approximated Matrix:
The reduced and approximated covariance matrix is then used as input for
the traditional PCA algorithm. The remaining steps of PCA, including
eigendecomposition or singular value decomposition, are performed on this
matrix to compute the principal components.
Randomized PCA provides a good approximation of the principal
components with a lower computational cost compared to traditional PCA.
However, it may introduce a small amount of error in the results, which is
generally acceptable for many practical applications.
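In scikit-learn, randomized PCA is exposed through the svd_solver parameter of PCA; a minimal sketch (the random 1000 x 500 matrix and 50 components are purely illustrative):

# Randomized PCA sketch; the random data and 50 components are illustrative
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(1000, 500)

rpca = PCA(n_components=50, svd_solver="randomized", random_state=0)
X_reduced = rpca.fit_transform(X)   # randomized SVD approximates the top components

print(X_reduced.shape)              # (1000, 50)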
Kernel PCA:
Kernel PCA is a nonlinear extension of traditional PCA that can capture
complex patterns and structures in the data by using kernel functions.
Unlike linear PCA, which operates in the original feature space, Kernel
PCA maps the data into a higher-dimensional feature space, where linear
PCA is then applied.
The key steps involved in Kernel PCA are as follows:
1. Kernel Function:
Kernel PCA begins by selecting an appropriate kernel function, such as the
Gaussian kernel or polynomial kernel. The kernel function measures the
similarity between pairs of data points in the original feature space.
2. Kernel Matrix:
A kernel matrix is constructed using the kernel function, which quantifies
the pairwise similarities between all data points in the original feature
space. This matrix captures the nonlinear relationships among the data
points.
3. Eigendecomposition:
The kernel matrix is then eigendecomposed to obtain the eigenvectors and
eigenvalues. These eigenvectors represent the principal components in the
higher-dimensional feature space.
4. Projection:
Finally, the data is projected onto the principal components obtained from
the eigendecomposition. The projected data can then be used for further
analysis or visualization.
Kernel PCA is particularly useful when dealing with nonlinear relationships
and complex data structures. It can capture nonlinear patterns that may be
missed by linear PCA. However, it is important to note that Kernel PCA
involves additional computational costs compared to linear PCA, as it
requires the computation of the kernel matrix and eigendecomposition in
the higher-dimensional feature space.
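A minimal Kernel PCA sketch with scikit-learn, assuming an RBF (Gaussian) kernel and the two-circles toy dataset (the kernel, gamma value, and dataset are illustrative):

# Kernel PCA sketch: RBF kernel and gamma=15 are illustrative choices
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15)
X_kpca = kpca.fit_transform(X)   # nonlinear projection via the kernel matrix

print(X_kpca.shape)              # (300, 2)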
Both Randomized PCA and Kernel PCA are powerful techniques that
extend the capabilities of traditional PCA. Randomized PCA offers
computational efficiency for large datasets, while Kernel PCA allows for
nonlinear dimensionality reduction and capturing complex patterns in the
data. The choice between these techniques depends on the specific
requirements of the analysis task and the nature of the dataset.
Unit -5:1. Write about different ways of installing TensorFlow 2?
2. Explain the procedure of Loading and preprocessing
Data with TensorFlow?
3. What are the various ways to implement MLP’s with
Keras? Explain.
1.
2.
3.
There are three main ways to implement MLP models using Keras: the Sequential API, the Functional API, and the Subclassing API. Each approach has its own benefits and suitability for different use cases. The choice depends on the complexity of the model architecture and the level of flexibility required.
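As a quick illustration, a minimal MLP sketch using the Keras Sequential API (the layer sizes, activations, toy data, and training settings are arbitrary example choices; the Functional and Subclassing APIs would build the same network via tf.keras.Model):

# Minimal MLP with the Keras Sequential API; architecture and data are illustrative
import numpy as np
import tensorflow as tf

X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("int32")             # toy binary target

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(32, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))               # [loss, accuracy]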