Module 1
Introduction to Machine Learning
What We Talk About When We Talk About "Learning"
• Learning is "a process that leads to change, which occurs as a result of experience and increases the potential for improved performance and future learning" (Ambrose et al., 2010, p. 3).
• The change in the learner may happen at the level of knowledge, attitude
or behavior to see concepts, ideas, and/or the world differently.
• Knowledge is expensive and scarce; data is cheap and abundant (data warehouses, data marts).
• Example in retail: Customer transactions to consumer behavior:
People who bought “Da Vinci Code” also bought “The Five People You Meet in
Heaven” (www.amazon.com)
• Build a model that is a good and useful approximation to the data.
• Writing software is the bottleneck. So let the data do the work instead!
Data Economy
1 zettabyte (ZB) = 1,024 exabytes (approximately one billion terabytes)
1 exabyte (EB) = 1,024 petabytes (approximately one billion gigabytes)
1 petabyte (PB) = 1,024 terabytes
1 terabyte (TB) = 1,024 gigabytes
1 gigabyte (GB) = 1,024 megabytes
1 megabyte (MB) = 1,024 kilobytes
1 kilobyte (KB) = 1,024 bytes (2^10 bytes)
1 byte = 8 bits
Why “Learn” ?
Learning is used when:
❑ Human expertise does not exist
➢ navigating on Mars
➢ industrial/manufacturing control
➢ mass spectrometer analysis, drug design, astronomical discovery
❑ Humans are unable to explain their expertise (black-box expertise)
➢ face/handwriting/speech recognition
➢ driving a car, flying a plane
❑ Rapidly changing phenomena (the solution changes over time)
➢ routing on a computer network
➢ credit scoring, financial modelling
➢ diagnosis, fraud detection
❑ Need for customization/personalization (the solution must be adapted to particular cases)
➢ user biometrics
➢ personalized news readers
➢ movie/book recommendation
Machine Learning Definitions
• Machine learning is programming computers to optimize a
performance criterion using example data or past experience.
• Giving computers the capability to learn from data without being explicitly programmed.
• A branch of artificial intelligence, concerned with the design and
development of algorithms that allow computers to evolve behaviors
based on empirical data.
• Machine learning is an application of artificial intelligence in which algorithms automatically analyse data and make decisions without human intervention.
What is Necessary to Understand Machine Learning (ML)?
➢Role of Statistics: Inference from a sample
➢Role of Computer science: Efficient algorithms for
✓ Solving the optimization problem
✓ Representing and evaluating the model for inference
Traditional Programming vs. Machine Learning
• Traditional programming: Data + Program → Computer → Output
• Machine learning: Data + Output → Computer → Model/Program
Difference Between Machine Learning And
Artificial Intelligence
Artificial Intelligence is the concept of creating intelligent machines that simulate human behaviour,
whereas
Machine learning is a subset of artificial intelligence that allows machines to learn from data (mostly structured data) without being explicitly programmed.
(Figure: nested sets, with DL inside ML inside AI.)
Additional knowledge:
Deep Learning is a subset of machine learning based on highly complex neural networks that mimic the way the human brain works to detect patterns in large, unstructured data sets.
Machine learning and Data Science go hand
in hand
Advantages of Machine Learning
• Fast, accurate, efficient
• Automation of most applications
• Wide range of real-life applications
• Enhanced cyber security and spam detection
• No human intervention needed
• Handles multi-dimensional data
Disadvantages of Machine Learning
• It is very difficult to identify and rectify errors
• Data acquisition
• Interpretation of results
• Requires more time and space
What algorithms should be used?
• Selecting the right algorithm for a machine learning task is a critical decision and depends
on various factors. Here are some common issues related to algorithm selection in
machine learning:
1. Type of Problem:
• Classification vs. regression vs. clustering: the choice of algorithm depends on the type of problem you are trying to solve. Classification algorithms are used for categorical outcomes, regression for continuous outcomes, and clustering for grouping similar data points.
2. Size and Nature of Data:
• Data size: the size of the dataset can influence the choice of algorithm. Some algorithms work well with small datasets, while others require large amounts of data for effective training.
• Data distribution: the distribution of the data, including its linearity and the presence of outliers, can impact the performance of certain algorithms. If the training data is biased, the model may learn and perpetuate those biases, which can lead to unfair or discriminatory outcomes.
3. Computational Resources: training complex models, especially deep neural networks, can be computationally expensive and may require specialized hardware, particularly for large datasets or complex models.
4. Interpretability: Many advanced machine learning models, such as deep neural
networks, are often viewed as black boxes, making it challenging to interpret their
decision-making processes. Depending on the application, you may need an
interpretable model. Some algorithms, such as decision trees, are more interpretable
than others, like complex neural networks.
5. Feature Engineering: Naive Bayes assumes feature independence, which may not
hold in all cases. Understanding the relationships between features can help choose
an algorithm that captures the underlying patterns in the data. Poor feature selection
can lead to suboptimal performance.
6. Model Complexity: Different algorithms have varying levels of bias and variance.
Finding the right balance is crucial to avoid underfitting or overfitting the data.
• Overfitting: occurs when a model learns the training data too well, including its noise and outliers, and performs poorly on new, unseen data.
• Underfitting: happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance.
7. Handling Non-linearity: Linear models may not capture non-linear relationships in
the data. For complex, non-linear patterns, algorithms like decision trees or neural
networks may be more suitable.
8. Robustness to Outliers: Some algorithms are sensitive to outliers and may be
heavily influenced by them. Robust algorithms, like support vector machines,
may be more appropriate in such cases.
9. Ensemble Methods: In many cases, combining the predictions of multiple
models (ensemble methods) can lead to better performance than using a single
model. Random forests and gradient boosting are popular ensemble methods.
10. Domain Knowledge: Knowledge about the specific domain and problem at
hand can guide the selection of algorithms. Certain algorithms may perform
better in specific industries or applications. Prior knowledge is useful even
when it is approximately correct.
11. Scalability: Some algorithms are more scalable than others. Deploying and
maintaining machine learning models at scale can pose challenges, especially
when dealing with large datasets or high inference loads. Addressing these
issues often requires a multidisciplinary approach, involving expertise in
machine learning, statistics, computer science, and domain-specific knowledge.
12. Availability of Libraries and Tools: the availability of libraries and tools for a specific algorithm in the chosen programming language can influence the decision. Popular libraries like scikit-learn, TensorFlow, and PyTorch provide implementations of many algorithms (see the sketch after this list).
13. Ethical and Legal Concerns:
• Bias and Fairness: Machine learning models may inadvertently reflect and
perpetuate biases present in the training data. Addressing fairness and
ethical concerns is an ongoing challenge.
• Privacy: Models trained on sensitive data may inadvertently reveal private
information. Striking a balance between utility and privacy is crucial.
14. Security:
• Adversarial Attacks: Some models are vulnerable to adversarial attacks,
where small, intentional modifications to the input data can lead to incorrect
predictions.
15. Continuous Learning:
• Adapting to Change: Many machine learning models are trained on static
datasets, making it challenging for them to adapt to changing patterns in the
data.
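As a concrete illustration of item 12 above, here is a minimal sketch of how a library such as scikit-learn exposes many algorithms behind one common fit/score interface; the Iris dataset and the two estimators chosen are illustrative assumptions, not part of the original slides.

    # Minimal sketch: comparing two scikit-learn classifiers on a toy dataset.
    # The dataset and model choices are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    for model in (DecisionTreeClassifier(), SVC()):
        model.fit(X_train, y_train)                 # same interface for every estimator
        print(type(model).__name__, model.score(X_test, y_test))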
ML in a Nutshell
• Tens of thousands of machine learning algorithms exist
• Hundreds of new ones appear every year
• Every machine learning algorithm has three components:
i. Representation
ii. Evaluation
iii. Optimization
1. Representation
Representation refers to how data is transformed or encoded into a format that is suitable
for a learning algorithm.
It involves the conversion of raw input data into a format that the algorithm can effectively
work with, allowing it to learn patterns, relationships, and features from the data.
The choice of representation is crucial because it can significantly impact the performance
of the machine learning model.
• Decision trees: A decision tree is a graphical representation of a decision-making
process
• Sets of rules / Logic programs
• Instances
• Graphical models (Bayes/Markov nets)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
Representation of image data to Neural Network
SOURCE: https://ml4a.github.io/ml4a/looking_inside_neural_nets/
2. Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / Utility
• Margin
• Entropy
• K-L divergence
• Etc.
3. Optimization
• Combinatorial optimization, e.g., greedy search
• Convex optimization, e.g., gradient descent (see the sketch below)
• Constrained optimization, e.g., linear programming
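As referenced in the convex-optimization item above, here is a minimal gradient-descent sketch for least-squares linear regression; the synthetic data, learning rate, and iteration count are all illustrative assumptions.

    # Minimal sketch: gradient descent minimizing mean squared error for y = w*x + b.
    # Synthetic data, learning rate, and iteration count are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 100)
    y = 3.0 * x + 2.0 + rng.normal(0, 1, 100)   # true w = 3, b = 2, plus noise

    w, b, lr = 0.0, 0.0, 0.01
    for _ in range(2000):
        y_hat = w * x + b
        grad_w = 2 * np.mean((y_hat - y) * x)   # d(MSE)/dw
        grad_b = 2 * np.mean(y_hat - y)         # d(MSE)/db
        w -= lr * grad_w
        b -= lr * grad_b

    print(f"learned w={w:.2f}, b={b:.2f}")      # should end up close to 3 and 2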
Applications of Machine Learning:
• Image tagging, image recognition, self-driving cars, OCR
• Human simulation, humanoid robots, industrial robots
• Anomaly detection, grouping and prediction, association rules
• Pokémon Go, AlphaGo, chess, Super Mario
• Spam and malware filtering, information extraction, sentiment analysis
• Drug discovery, nutrition, lifestyle management and monitoring, medical imaging, risk analysis, bioinformatics
Additional Slide: Discussion
Need for AI in healthcare
• Accuracy & Efficiency: AI improves accuracy and efficiency in diagnosing diseases.
• Large Volume: AI can handle large volumes of healthcare data for processing and analysis.
• Automation: AI automates routine administrative tasks in healthcare.
• Predictions: AI has predictive capabilities for early disease detection and prevention.
• Treatment decisions: AI assists doctors in treatment decisions by considering available information.
• Personalized Medicine: AI enables personalized medicine by tailoring treatment to individual patient characteristics.
• Drug Discovery: AI speeds up drug discovery and development processes.
• Telemedicine: AI supports telemedicine services and remote patient monitoring.
AI in Healthcare (areas)
• Disease diagnosis
• Disease treatment
• Predicting diseases
• Medical imaging
• Drug discovery
• Personalized treatment plans
• Health monitoring
• Wearable technology
• Hospital administration
• Ethics
Data mining example: Learning Associations
• Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products or services.
Example: P(chips | beer) = 0.7
P(pav | vada) > P(pav | tea)
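A minimal sketch of how such a conditional probability can be estimated from transaction data; the tiny transaction list is a made-up assumption for illustration.

    # Minimal sketch: estimating P(Y | X) from market-basket transactions.
    # The transaction list is a made-up illustrative assumption.
    transactions = [
        {"beer", "chips"},
        {"beer", "chips", "salsa"},
        {"beer", "nuts"},
        {"chips"},
        {"beer", "chips"},
    ]

    def conditional_prob(transactions, x, y):
        """Estimate P(y | x) = count(x and y) / count(x)."""
        with_x = [t for t in transactions if x in t]
        if not with_x:
            return 0.0
        return sum(1 for t in with_x if y in t) / len(with_x)

    print(conditional_prob(transactions, "beer", "chips"))  # 3/4 = 0.75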
Many other applications of Machine Learning:
• Traffic prediction
• Virtual personal assistants
• Speech recognition
• Natural language processing
• Astronomy
• Anatomy
• Agriculture
• Banking
• Online advertising
• Computer vision
Steps of developing a Machine Learning
Application
(Figure: an eight-step pipeline, from Step 1, defining the objective of what we want and how we want it, through Step 8, deploying the finally built model for the application.)
2. Collect the data
• Given the problem we want to solve, we will have to investigate and
obtain data that we will use to feed our machine.
• The quality and quantity of information we get are very important
since it will directly impact how well or badly our model will work.
• We may have the information in an existing database or we must
create it from scratch.
• If it is a small project we can create a spreadsheet that will later be
easily exported as a CSV file.
• It is also common to use the web scraping technique to automatically
collect information from various sources such as APIs.
3. Prepare the data
•This is a good time to visualize our data and check if there are
correlations between the different characteristics that we
obtained.
• It will be necessary to make a selection of features, since the ones we choose will directly impact execution times and results.
• We can also reduce dimensionality by applying PCA if necessary.
Prepare the data continued…
• Additionally, we must balance the amount of data we have for each result (class) so that it is significant, as learning may otherwise be biased towards one type of response; when our model tries to generalize, it will fail.
• At this stage, we can also pre-process our data by normalizing,
eliminating duplicates, and making error corrections.
• We must also separate the data into two groups: one for training and the other for model evaluation. The split is commonly about 80/20, but it can vary depending on the case and the volume of data we have.
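A minimal sketch of the preparation steps just described, normalization plus an 80/20 train/test split, using scikit-learn; the random data is an illustrative assumption.

    # Minimal sketch: normalize features and make an 80/20 train/test split.
    # The random data here is an illustrative assumption.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(100, 3)          # 100 samples, 3 features
    y = np.random.randint(0, 2, 100)    # binary labels

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)   # the common 80/20 rule

    scaler = StandardScaler().fit(X_train)      # fit scaling on training data only,
    X_train = scaler.transform(X_train)         # to avoid leaking test statistics
    X_test = scaler.transform(X_test)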
4. Choose the model
• There are several models we can choose from according to our objective: algorithms for classification, prediction, and linear regression; clustering, e.g., k-means; K-Nearest Neighbors; Deep Learning, e.g., neural networks; Bayesian methods; etc.
• There are various models to be used depending on the data we are going to process, such as images, sound, text, or numerical values.
• (Table: some models and their sample applications.)
5. Train the model
• We will need to train the model on the dataset and watch for an incremental improvement in its prediction rate.
• It is necessary to initialize the weights of our model randomly (the weights are the values that multiply or weight the relationships between the inputs and outputs); these will be adjusted automatically by the selected algorithm the more we train.
6. Evaluate the model
• We will have to check the created model against our evaluation data set, which contains inputs the model has not seen, and verify the precision of the already trained model.
•If the accuracy is less than or equal to 50%, that model will
not be useful since it would be like tossing a coin to make
decisions.
•If we reach 90% or more, we can have good confidence in the
results that the model gives us.
Parameter Tuning during evaluation
• If during evaluation we do not obtain good predictions and our precision is not the minimum desired, we may have to return to the training step and try a new configuration of parameters for our model.
• We can increase the number of times we iterate over our training data (termed epochs).
• We can also indicate the maximum error allowed for our model.
• Training can go from taking a few minutes to hours, or even days.
• These parameters are often called hyperparameters. This "tuning" is still more of an art than a science and will improve as we experiment.
• There are usually many parameters to adjust, and their combinations can quickly multiply the options to try; getting good results takes effort and patience. (A grid-search sketch follows below.)
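A minimal sketch of hyperparameter tuning via cross-validated grid search in scikit-learn; the model choice and parameter grid are illustrative assumptions.

    # Minimal sketch: tuning hyperparameters with cross-validated grid search.
    # Model choice and parameter grid are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

    search = GridSearchCV(SVC(), param_grid, cv=5)   # 5-fold CV per combination
    search.fit(X, y)
    print(search.best_params_, search.best_score_)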
7. Prediction or Inference
• We are now ready to use our machine learning model to infer results in real-life scenarios.
Types of Machine Learning
➢Supervised (inductive) learning
✓ Training data includes desired outputs
➢Unsupervised learning
✓ Training data does not include desired outputs
➢Semi-supervised learning
✓ Training data includes a few desired outputs
➢Reinforcement learning
✓ Rewards from sequence of actions
Supervised Learning
• Supervised Learning is a type of machine learning used to train
models from labeled training data.
• It allows you to predict output for future or unseen data
Supervised Learning Flow (figure)
Unsupervised Learning
• The data has no labels. The machine just looks for whatever patterns
it can find.
• Unsupervised learning can be used for anomaly detection as well as
clustering.
(Figures: unlabeled data, with examples of clustering and anomaly detection.)
Classification, Regression and Clustering
1. Concept of Classification
• A machine learning task that identifies the class to which an instance belongs.
• If the target variable is categorical (classes), use a classification algorithm.
• In other words, classification is applied when the output has finite, discrete values.
• Popular algorithms: Naïve Bayes, SVM, Random Forest, Decision Tree
• Example: predict the class of a car given features like horsepower, mileage, weight, colour, etc.
• The classifier bases its prediction on these features.
• The analysis has three potential outcomes: Sedan, SUV, or Hatchback.
Classification Example
• Example: Credit scoring
• Differentiating between low-risk and high-risk customers from their income and savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
• Sometimes misclassification may occur if the model does not learn well.
Classification: Applications
• Pattern recognition
• Face recognition: Pose, lighting, occlusion (glasses, beard), make-up,
hair style
• Character recognition: Different handwriting styles.
• Speech recognition: Temporal dependency.
• Use of a dictionary or the syntax of the language.
• Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic signals for speech
• Medical diagnosis: From symptoms to illnesses
2. Concept of Regression (for prediction)
• If the target variable is a continuous numeric variable (e.g., in the range 100–2000), use a regression algorithm.
• Example: predict the price of a house given its sq. ft. area, location, number of bedrooms, etc.
• A simple regression model is given below:
  y = w*x + b
• This shows the relationship between price (y) and sq. ft. area (x), where price is any number within a defined range.
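A minimal sketch fitting the y = w*x + b model above with scikit-learn; the areas and prices are made-up numbers for illustration.

    # Minimal sketch: fitting y = w*x + b (price vs. area) with linear regression.
    # The areas and prices are made-up illustrative numbers.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    area = np.array([[500], [750], [1000], [1250], [1500]])   # sq. ft.
    price = np.array([150, 210, 270, 330, 400])               # in thousands

    model = LinearRegression().fit(area, price)
    print(f"w={model.coef_[0]:.3f}, b={model.intercept_:.1f}")
    print(model.predict([[1100]]))   # predicted price for an 1100 sq. ft. house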
Regression Example
• Example: price of a used car
• x: car attributes; y: price
  y = g(x | θ), where g(·) is the model and θ are its parameters
  e.g., the linear model y = w*x + w0
Regression: Applications
• Navigating a car: angle of the steering wheel (CMU NavLab)
• Kinematics of a robot arm: given a target position (x, y), learn the joint angles
  α1 = g1(x, y)
  α2 = g2(x, y)
• Response surface design
3. Concept of Clustering
• Grouping objects based on the information found in data
that describes the objects or their relationship
• Need of Clustering
✓ To determine the intrinsic grouping in a set of unlabeled data
✓ To organize data into clusters showing internal structure of the
data
✓ To partition the data points
✓ To understand and extract value from large sets of structured and
unstructured data
• Popular Algorithms: K-means, C-means.
Example (figure): grouping objects based on the information found in data that describes the objects or their relationships.
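A minimal k-means sketch with scikit-learn; the synthetic blobs are an assumption for illustration.

    # Minimal sketch: clustering unlabeled points with k-means.
    # The synthetic blobs are an illustrative assumption.
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=200, centers=3, random_state=42)  # labels ignored

    km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(km.cluster_centers_)      # discovered group centers
    print(km.labels_[:10])          # cluster assignment of first 10 points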
Supervised vs. Unsupervised vs. Reinforcement Learning (entries listed in that order)
• Type of data: labelled / unlabelled / not predefined
• Learning approach: maps labelled input to known output and forecasts outcomes / discovers underlying patterns in unlabelled input to group input samples / follows trial and error and learns a series of actions
• Type of problem: regression, classification / association, clustering / maze solving
• Training: external supervision / no supervision / no supervision
• Methodology during training: direct feedback needed / no feedback / rewards are used
• Popular algorithms: linear regression, logistic regression, KNN, SVM, random forest / K-means, Apriori, C-means / Q-learning, SARSA
• Applications: risk evaluation, sales or weather forecasting, house price prediction / recommendation systems, anomaly detection / games, self-driving cars
Training, Testing and Validation dataset
1. Training dataset
• This is the actual dataset on which the model trains, i.e., the model sees and learns from this data to predict outcomes or make the right decisions.
• Most training data is collected from several resources and then preprocessed and organized to ensure proper performance of the model.
• The type of training data largely determines the model's ability to generalize, i.e., the better the quality and diversity of the training data, the better the performance of the model.
• This data is usually more than 60% of the total data available for the project. (There is no fixed rule, but for large datasets 75% to 80% is commonly used for training and the remainder for testing, the general 80-20 rule.)
2. Testing dataset
• This dataset is independent of the training set but has a somewhat
similar type of probability distribution of classes and is used as a
benchmark to evaluate the model, used only after the training of the
model is complete.
• Testing set is usually a properly organized dataset having all kinds of
data for scenarios that the model would probably be facing when
used in the real world.
• Testing data is approximately 20-25% of the total data available for the project. Often the validation and testing sets are combined and used as a single testing set, which is not considered good practice.
• If the accuracy of the model on training data is greater than that on testing data, the model is said to be overfitting.
3. Validation dataset
• The validation set is used to fine-tune the hyperparameters of the model and is
considered a part of the training of the model.
• The model only sees this data for evaluation but does not learn from this data, providing
an objective unbiased evaluation of the model. The validation set affects a model, but
only indirectly.
• The validation dataset can also be utilized for regularization via early stopping, i.e., interrupting training of the model when the loss on the validation dataset becomes greater than the loss on the training dataset, thereby reducing bias and variance.
• This data is approximately 10-15% of the total data available for the project, but this can change depending on the number of hyperparameters, i.e., if the model has many hyperparameters, a larger validation set will give better results.
• Whenever the accuracy of the model on validation data is greater than that on training data, the model is said to have generalized well.
• Some experts consider that ML models with no hyperparameters, or those without tuning options, do not need a validation set.
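A minimal sketch of a train/validation/test split, using an assumed 70/15/15 ratio and two calls to scikit-learn's train_test_split.

    # Minimal sketch: splitting data into train / validation / test sets.
    # The 70/15/15 ratio is an illustrative assumption.
    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.arange(200).reshape(100, 2)
    y = np.random.randint(0, 2, 100)

    # First carve off 30% as a temporary holdout...
    X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=0)
    # ...then split the holdout in half: validation and test.
    X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=0)

    print(len(X_train), len(X_val), len(X_test))   # 70 15 15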
Cross validation
• In machine learning, we could fit the model on the training data but can’t say that
the model will work accurately for the real data.
• For this, we must ensure that our model captures the correct patterns from the data and does not pick up too much noise.
• For this purpose, we use the cross-validation technique. It is a technique for
evaluating ML models by training several ML models on subsets of the available
input data and evaluating them on the complementary subset of the data.
• Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
• Types of cross validation:
i. Leave One Out Cross Validation
ii. K-fold cross validation
iii. Validation Set Approach
iv. Leave-P-out cross-validation
v. Stratified k-fold cross-validation
1. LOOCV (Leave One Out Cross Validation)
• In this method, we train on the whole dataset but leave out a single data point for testing, and we iterate this over every data point.
• It has advantages as well as disadvantages:
✓ An advantage of this method is that we make use of all data points, hence it has low bias.
✓ The major drawback is higher variation in the testing result, as we test against a single data point; if that data point is an outlier, it can lead to higher variation.
✓ Another drawback is execution time, as the method iterates as many times as there are data points.
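A minimal LOOCV sketch using scikit-learn; the k-nearest-neighbours classifier and the Iris dataset are illustrative assumptions.

    # Minimal sketch: Leave-One-Out cross-validation.
    # Classifier and dataset are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
    print(len(scores), scores.mean())   # one score per data point (150 here)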
2. K-fold cross validation
• In k-fold cross-validation, we split the input data into k subsets of data
(also known as folds).
• We train an ML model on all but one (k-1) of the subsets, and then
evaluate the model on the subset that was not used for training.
• This process is repeated k times, with a different subset reserved for
evaluation (and excluded from training) each time.
• The following diagram shows an example of the training subsets and
complementary evaluation subsets generated for each of the four
models that are created and trained during a 4-fold cross-validation.
• Model one uses the first 25 percent of data for evaluation, and the remaining 75
percent for training. Model two uses the second subset of 25 percent (25 percent to
50 percent) for evaluation, and the remaining three subsets of the data for training,
and so on.
Note: A lower k means
that the model is trained
on a comparatively
smaller training set and
tested on a larger test
fold.
• Each model is trained and evaluated using complementary data sources - the data in
the evaluation data source includes and is limited to all of the data that is not in the
training data source.
• Performing a 4-fold cross-validation generates four models, four data sources to train
the models, four data sources to evaluate the models, and four evaluations, one for
each model.
Note: it is generally suggested that k = 10, since lower values of k give poor estimates, while higher values of k approach the LOOCV method. (A k-fold sketch follows below.)
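A minimal k-fold sketch (k = 4, matching the example above) using scikit-learn; the model and dataset are illustrative assumptions.

    # Minimal sketch: 4-fold cross-validation, as in the example above.
    # Model and dataset are illustrative assumptions.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_iris(return_X_y=True)
    kf = KFold(n_splits=4, shuffle=True, random_state=42)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
    print(scores)          # four evaluations, one per fold
    print(scores.mean())   # overall estimate of generalization accuracy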
Underfitting, Overfitting, Bias and Variance
• A model is said to be a good machine learning model if it generalizes any new input data from
the problem domain in a proper way.
• If we want to check how well our machine learning model learns and generalizes to the new
data; we have overfitting and underfitting, which are majorly responsible for the poor
performances of the machine learning algorithms.
• Bias: Assumptions made by a model to make a function easier to learn. It is actually the error
rate of the training data. When the error rate has a high value, we call it High Bias and when
the error rate has a low value, we call it low Bias.
• Variance: the difference between the error rate on training data and on testing data. If the difference is high, it is called high variance; when the difference is low, it is called low variance. Usually, we want low variance so the model generalizes well.
• Example: scores of a poor learner, a memorizing learner, and a good learner on class tests vs. the semester-end test:

Learner             | Class test score | Semester-end test score
Poor learner        | 50               | 47
Memorizing learner  | 92               | 67
Good learner        | 93               | 89
1. Underfitting
• A statistical model or machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly on the training data itself and therefore also on testing data. (It is just like trying to fit into an undersized shirt!)
• Underfitting destroys the accuracy of a machine learning model.
• It usually happens when we have too little data to build an accurate model, or when we try to build a linear model with non-linear data.
• In such cases, the rules of the machine learning model are too simple to be applied to such data, and the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more data and by improving the features through feature engineering.
• In a nutshell, underfitting refers to a model that can neither perform well on the training data nor generalize to new data.
• Reasons for underfitting:
  1. High bias and low variance
  2. The size of the training dataset used is not enough
  3. The model is too simple
  4. Training data is not cleaned and contains noise
• Techniques to reduce underfitting:
  1. Increase model complexity
  2. Increase the number of features, performing feature engineering
  3. Remove noise from the data
  4. Increase the number of epochs or the duration of training to get better results
2. Overfitting
• A statistical model is said to be overfitted when the model does not make
accurate predictions on testing data.
• When a model is trained on so much data, it starts learning from the noise and inaccurate entries in the data set; testing on test data then results in high variance.
• The model then does not categorize the data correctly, because of too many details and noise.
• The causes of overfitting are the non-parametric and non-linear methods because
these types of machine learning algorithms have more freedom in building the
model based on the dataset and therefore they can really build unrealistic
models.
• A solution to avoid overfitting is using a linear algorithm if we have linear data or
using the parameters like the maximal depth if we are using decision trees.
• In a nutshell, Overfitting is a problem where the evaluation of machine learning
algorithms on training data is different from unseen data.
• Reasons for overfitting:
  1. Low bias and high variance
  2. The model is too complex
  3. The size of the training data
• Techniques to reduce overfitting:
  1. Increase training data
  2. Reduce model complexity
  3. Early stopping during the training phase (keep an eye on the loss over the training period; as soon as the loss begins to increase, stop training)
  4. Ridge regularization and Lasso regularization
  5. Use dropout for neural networks
Overfitting and Underfitting of the Model
Regression model (figure): underfit (high bias, low variance), best fit (low bias, low variance), overfit (low bias, high variance).
Source: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
Classification model (figure):

Metric         | Underfit | Best fit | Overfit
Training error | 23%      | 2%       | 1%
Test error     | 25%      | 3%       | 20%

High training error indicates bias; a large gap between training and test error indicates variance.
Performance Measures: Measuring Quality of model
• After cleaning and preprocessing the data and training our model, how do we
know if our classification model performs well?
• Most error measures will calculate the total error in our model, but we cannot
find individual instances of errors in our model. The model might misclassify
some categories more than others, but we cannot see this using a standard
accuracy measure.
• If there is a significant class imbalance in the given data, i.e., one class has many more instances than the others, a model might predict the majority class for all cases and still have a high accuracy score, even while it never predicts the minority classes.
• This is where confusion matrices are useful. A confusion matrix is used to
measure the performance of a classifier in depth.
Confusion Matrix
• A confusion matrix presents a table layout of
the different outcomes of the prediction and
results of a classification problem and helps
visualize its outcomes.
• It plots a table of all the predicted and actual
values of a classifier.
• It tells us not only the errors made by the classifier but also the types of errors, i.e., whether they are type-I or type-II errors.
• With the help of the confusion matrix, we can
calculate the different parameters for the
model, such as accuracy, precision, etc.
SOURCE: https://www.simplilearn.com/tutorials/machine-learning-tutorial/confusion-matrix-machine-learning
Confusion Matrix continued…
• True Positive (TP): The number of times our actual positive values are
equal to the predicted positive. You predicted a positive value, and it is
correct.
• False Positive (FP): The number of times our model wrongly predicts
negative values as positives. You predicted a positive value, and it is
actually negative.
• True Negative (TN): The number of times our actual negative values are
equal to predicted negative values. You predicted a negative value, and it is
actually negative.
• False Negative (FN): The number of times our model wrongly predicts
positive values as negatives. You predicted a negative value, and it is
actually positive.
FPR (Type I Error), FNR (Type II Error)
Example:
Actual values = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Predicted values = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Calculate TP, TN, FP, and FN for the resulting confusion matrix (a false positive is a Type I error; a false negative is a Type II error).
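A minimal sketch that works the dog/cat example above with scikit-learn; taking 'dog' as the positive class is an assumption made for the calculation.

    # Minimal sketch: confusion matrix for the dog/cat example above,
    # taking 'dog' as the positive class (an assumption for this calculation).
    from sklearn.metrics import confusion_matrix

    actual = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog',
              'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
    predicted = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat',
                 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

    tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=['cat', 'dog']).ravel()
    print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")   # TP=11, TN=6, FP=1, FN=2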
Scaling a Confusion Matrix
• To scale a confusion matrix, increase the number of rows and columns. All the True
Positives will be along the diagonal. The other values will be False Positives or
False Negatives.
Figure : Scaling down our dataset
• Just from looking at the matrix, the performance of the
model is not very clear.
• To find how accurate the model is, the following metrics are used:
➢ Accuracy, Precision, Recall, F1-Score, Sensitivity, Specificity
1. Accuracy:
• The accuracy is used to find the portion of correctly classified values.
• It tells us how often our classifier is right.
• It’s the ratio between the number of correct predictions and the total
number of predictions (sum of all true values divided by total values).
In this case:
Accuracy = (86 +79) / (86 + 79 + 10 + 12)
= 0.8823 = 88.23%
• It is a measure of correctness achieved in true prediction. In simple words, it tells us how many predictions are actually correct out of all the predictions made.
• The accuracy metric is not suited for imbalanced classes.
• When the model predicts that every point belongs to the majority class label, the accuracy will be high, but the model is not actually accurate.
• Accuracy is a valid choice of evaluation for classification problems when the data is well balanced and not skewed, i.e., there is no class imbalance.
2. Precision
How many among the predicted positives are actual positives?
• Precision is used to calculate the model's ability to classify positive values
correctly. It is the true positives divided by the total number of predicted positive
values.
• Precision is also called the positive predictive value.
• In simple words, it tells us how many predictions are actually positive out of all the positive predictions.
• Precision should be high (ideally 1).
In this case,
Precision = 86 / (86 + 10) = 0.8958 = 89.58%
• "Precision is a useful metric in cases where a false positive is a bigger concern than a false negative."
• Example 1: In spam detection, we need to focus on precision. Suppose a mail is not spam, but the model predicts it as spam: a false positive (FP).
• Example 2: Precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and be harmful to the business.
• We always try to reduce FP.
Trick to remember : Precision has Predictive Positive Results in the denominator.
3. Recall
(Trick: recall has actual positives in the denominator.)
How many among the actual positives are correctly predicted as positives?
• It is used to calculate the model's ability to predict positive values. "How often does
the model predict the correct positive values?". It is the true positives divided by
the total number of actual positive values.
• In this case, Recall = 86 / (86 + 12) = 0.8775 = 87.75%
• It is also known as Sensitivity or True positive rate.
• Recall is a valid choice of evaluation metric when we want
to capture as many positives as possible.
• Recall should be high(ideally 1).
• “Recall is a useful metric in cases
where False Negative trumps False Positive”
• Ex 1: Suppose we must determine whether a person has cancer. The person is suffering from cancer, but the model predicts that they are not: a false negative (FN).
• Ex 2: Recall is important in medical cases, where it does not matter if we raise a false alarm, but actual positive cases must not go undetected!
• Recall would be the better metric here, because we do not want to accidentally discharge an infected person and let them mix with the healthy population, thereby spreading a contagious virus.
4. F1-Score
• The F1 score is a number between 0 and 1 and it is the harmonic mean of Precision
and Recall.
• It is useful when we need to take both precision and recall into account.
• In this case,
  F1-Score = (2 * 0.8958 * 0.8775) / (0.8958 + 0.8775) = 0.8866 = 88.66%
• The F1-score should be high (ideally 1).
• The harmonic mean is used because it is not sensitive to extremely large values; it punishes extreme values more, unlike a simple average (arithmetic mean).
• F1 score sort of maintains a balance between
the precision and recall for the classifier.
• If the precision is low, the F1 is low and if the recall is
low again the F1 score is low.
• There will be some cases where there is no clear
distinction between whether Precision is more
important or Recall. We combine them!
• In practice, when we try to increase the precision of
our model, the recall goes down and vice-versa. The
F1-score captures both the trends in a single value.
5. Sensitivity and Specificity
(True positive rate and true negative rate)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
When to use Accuracy / Precision / Recall / F1-Score?
❑ Accuracy is used when the True Positives and True Negatives are more
important. Accuracy is a better metric for Balanced Data.
❑ Whenever False Positive is much more important use Precision.
❑ Whenever False Negative is much more important use Recall.
❑ F1-Score is used when the False Negatives and False Positives both are
important. F1-Score is a better metric for Imbalanced Data.
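To tie the metrics together, here is a minimal sketch computing accuracy, precision, recall, and F1 from the running example in this section (TP = 86, FP = 10, FN = 12, TN = 79).

    # Minimal sketch: the four metrics from the running example
    # (TP = 86, FP = 10, FN = 12, TN = 79).
    tp, fp, fn, tn = 86, 10, 12, 79

    accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.8823
    precision = tp / (tp + fp)                          # 0.8958
    recall = tp / (tp + fn)                             # 0.8776
    f1 = 2 * precision * recall / (precision + recall)  # 0.8866

    print(f"accuracy={accuracy:.4f}, precision={precision:.4f}, "
          f"recall={recall:.4f}, f1={f1:.4f}")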