Module 1: Introduction to Machine Learning

What We Talk About, When We Talk About "Learning"
• Learning is "a process that leads to change, which occurs as a result of experience and increases the potential for improved performance and future learning" (Ambrose et al., 2010, p. 3).
• The change in the learner may happen at the level of knowledge, attitude, or behavior, enabling the learner to see concepts, ideas, and/or the world differently.
• Knowledge is expensive and scarce; data is cheap and abundant (data warehouses, data marts).
• Example in retail, from customer transactions to consumer behavior: people who bought "Da Vinci Code" also bought "The Five People You Meet in Heaven" (www.amazon.com).
• Build a model that is a good and useful approximation to the data.
• Writing software is the bottleneck, so let the data do the work instead!

Data Economy
1 zettabyte (ZB) = 1,024 exabytes, roughly 1 billion terabytes
1 exabyte (EB) = 1,024 petabytes, roughly 1 billion gigabytes
1 petabyte (PB) = 1,024 terabytes
1 terabyte (TB) = 1,024 gigabytes
1 gigabyte (GB) = 1,024 MB (approximately 1,000 MB)
1 MB = 1,024 KB
1 KB = 1,024 bytes (2^10 bytes)
1 byte = 8 bits

Why "Learn"?
Learning is used when:
❑ Human expertise does not exist
  ➢ navigating on Mars
  ➢ industrial/manufacturing control
  ➢ mass spectrometer analysis, drug design, astronomic discovery
❑ Humans are unable to explain their expertise (black-box human expertise)
  ➢ face/handwriting/speech recognition
  ➢ driving a car, flying a plane
❑ Rapidly changing phenomena (the solution changes over time)
  ➢ routing on a computer network
  ➢ credit scoring, financial modelling
  ➢ diagnosis, fraud detection
❑ Need for customization/personalization (the solution must be adapted to particular cases)
  ➢ user biometrics
  ➢ personalized news readers
  ➢ movie/book recommendation

Machine Learning Definitions
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• Giving computers the capability to learn from data without being explicitly programmed.
• A branch of artificial intelligence concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data.
• Machine learning is an application of artificial intelligence that uses algorithms and data to analyse data and make decisions automatically, without human intervention.

What is Necessary to Understand Machine Learning (ML)?
➢ Role of statistics: inference from a sample
➢ Role of computer science: efficient algorithms for
  ✓ solving the optimization problem
  ✓ representing and evaluating the model for inference

Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Model/Program
(A minimal code version of this contrast appears at the end of this section.)

Difference Between Machine Learning and Artificial Intelligence
Artificial intelligence is the broader concept of creating intelligent machines that simulate human behaviour, whereas machine learning is a subset of artificial intelligence that allows a machine to learn from data (mostly structured data) without being explicitly programmed. (AI ⊃ ML ⊃ DL)
Additional knowledge: deep learning is a subset of machine learning based on highly complex neural networks that mimic the way a human brain works to detect patterns in large unstructured data sets.
Machine learning and data science go hand in hand.

Advantages of Machine Learning
• Fast, accurate, efficient
• Automation of most applications
• Wide range of real-life applications
• Enhanced cybersecurity and spam detection
• No human intervention is needed
• Handling multi-dimensional data
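The Traditional Programming vs. Machine Learning contrast above can be made concrete in a few lines. A minimal sketch, assuming Python with scikit-learn; the temperature-conversion task and all values are purely illustrative:

```python
# Traditional programming vs. machine learning on the same task (illustrative).
from sklearn.linear_model import LinearRegression

# Traditional programming: a human writes the rule (Celsius -> Fahrenheit).
def c_to_f(c):
    return c * 9 / 5 + 32

# Machine learning: we supply data and desired outputs,
# and the algorithm produces the rule (a model) itself.
celsius = [[0], [10], [20], [30], [40]]
fahrenheit = [32, 50, 68, 86, 104]
model = LinearRegression().fit(celsius, fahrenheit)

print(c_to_f(25))                # 77.0, from the hand-written program
print(model.predict([[25]])[0])  # ~77.0, from the learned model
```

Same input-output behavior, but in the second case the program was induced from examples rather than written by hand.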
Disadvantages of Machine Learning
• It is very difficult to identify and rectify errors.
• Data acquisition can be costly and slow.
• Results can be hard to interpret.
• It requires more time and space (compute and storage).

What algorithms should be used?
Selecting the right algorithm for a machine learning task is a critical decision and depends on various factors. Here are some common issues related to algorithm selection in machine learning:
1. Type of problem (classification vs. regression vs. clustering): The choice of algorithm depends on the type of problem you are trying to solve. Classification algorithms are used for categorical outcomes, regression for continuous outcomes, and clustering for grouping similar data points.
2. Size and nature of data:
  • Data size: The size of the dataset can influence the choice of algorithm. Some algorithms work well with small datasets, while others require large amounts of data for effective training.
  • Data distribution: The distribution of the data, including its linearity and the presence of outliers, can impact the performance of certain algorithms. If the training data is biased, the model may learn and perpetuate those biases, which can lead to unfair or discriminatory outcomes.
3. Computational resources: Training complex models, especially deep neural networks, can be computationally expensive and may require specialized hardware, particularly for large datasets or complex models.
4. Interpretability: Many advanced machine learning models, such as deep neural networks, are often viewed as black boxes, making it challenging to interpret their decision-making processes. Depending on the application, you may need an interpretable model. Some algorithms, such as decision trees, are more interpretable than others, like complex neural networks.
5. Feature engineering: Naive Bayes assumes feature independence, which may not hold in all cases. Understanding the relationships between features can help you choose an algorithm that captures the underlying patterns in the data. Poor feature selection can lead to suboptimal performance.
6. Model complexity: Different algorithms have varying levels of bias and variance. Finding the right balance is crucial to avoid underfitting or overfitting the data.
  • Overfitting: occurs when a model learns the training data too well, including its noise and outliers, and performs poorly on new, unseen data.
  • Underfitting: happens when a model is too simple to capture the underlying patterns in the data, resulting in poor performance.
7. Handling non-linearity: Linear models may not capture non-linear relationships in the data. For complex, non-linear patterns, algorithms like decision trees or neural networks may be more suitable.
8. Robustness to outliers: Some algorithms are sensitive to outliers and may be heavily influenced by them. Robust algorithms, like support vector machines, may be more appropriate in such cases.
9. Ensemble methods: In many cases, combining the predictions of multiple models (ensemble methods) can lead to better performance than using a single model. Random forests and gradient boosting are popular ensemble methods.
10. Domain knowledge: Knowledge about the specific domain and problem at hand can guide the selection of algorithms. Certain algorithms may perform better in specific industries or applications. Prior knowledge is useful even when it is only approximately correct.
11. Scalability: Some algorithms are more scalable than others. Deploying and maintaining machine learning models at scale can pose challenges, especially when dealing with large datasets or high inference loads.
12. Availability of libraries and tools: The availability of libraries and tools for a specific algorithm in your chosen programming language can influence the decision. Popular libraries like scikit-learn, TensorFlow, and PyTorch provide implementations of various algorithms.
13. Ethical and legal concerns:
  • Bias and fairness: Machine learning models may inadvertently reflect and perpetuate biases present in the training data. Addressing fairness and ethical concerns is an ongoing challenge.
  • Privacy: Models trained on sensitive data may inadvertently reveal private information. Striking a balance between utility and privacy is crucial.
14. Security:
  • Adversarial attacks: Some models are vulnerable to adversarial attacks, where small, intentional modifications to the input data can lead to incorrect predictions.
15. Continuous learning:
  • Adapting to change: Many machine learning models are trained on static datasets, making it challenging for them to adapt to changing patterns in the data.
Addressing these issues often requires a multidisciplinary approach, involving expertise in machine learning, statistics, computer science, and domain-specific knowledge. A short code sketch after this list illustrates comparing several candidate algorithms in practice.
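Several of the points above (problem type, interpretability, library support) show up in a common baseline workflow: try a few candidate algorithms and compare them on held-out data. A minimal sketch, assuming Python with scikit-learn; the dataset and candidate models are illustrative choices, not a prescription:

```python
# Comparing a few candidate classifiers on held-out data (illustrative sketch).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

candidates = {
    "logistic regression (linear, interpretable)": LogisticRegression(max_iter=1000),
    "decision tree (non-linear, interpretable)":   DecisionTreeClassifier(max_depth=3),
    "SVM (non-linear, margin-based)":              SVC(),
}
for name, model in candidates.items():
    model.fit(X_train, y_train)                      # train on 80% of the data
    score = model.score(X_test, y_test)              # evaluate on the held-out 20%
    print(f"{name}: test accuracy = {score:.3f}")
```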
ML in a Nutshell
• Tens of thousands of machine learning algorithms exist, with hundreds of new ones every year.
• Every machine learning algorithm has three components:
  i. Representation
  ii. Evaluation
  iii. Optimization

1. Representation
Representation refers to how data is transformed or encoded into a format that is suitable for a learning algorithm. It involves converting raw input data into a format the algorithm can effectively work with, allowing it to learn patterns, relationships, and features from the data. The choice of representation is crucial because it can significantly impact the performance of the machine learning model. Common representations include:
• Decision trees (a decision tree is a graphical representation of a decision-making process)
• Sets of rules / logic programs
• Instances
• Graphical models (Bayesian/Markov networks)
• Neural networks
• Support vector machines
• Model ensembles
• Etc.
[Figure: representation of image data as input to a neural network. Source: https://ml4a.github.io/ml4a/looking_inside_neural_nets/]

2. Evaluation
• Accuracy
• Precision and recall
• Squared error
• Likelihood
• Posterior probability
• Cost / utility
• Margin
• Entropy
• K-L divergence
• Etc.

3. Optimization
• Combinatorial optimization (e.g., greedy search)
• Convex optimization (e.g., gradient descent; see the sketch after this list)
• Constrained optimization (e.g., linear programming)
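Gradient descent, named above as the canonical convex-optimization routine, is short enough to write out in full. A minimal sketch in plain Python; the data, learning rate, and epoch count are all invented for illustration:

```python
# Minimal gradient descent on squared error for the model y ~ w * x (illustrative).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # noisy samples of roughly y = 2x

w = 0.0    # initial weight
lr = 0.01  # learning rate (step size)
for epoch in range(200):
    # Gradient of mean squared error (1/n) * sum (w*x - y)^2 with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step in the direction that decreases the error

print(f"learned w = {w:.3f}")  # should print a value close to 2
```

Each iteration evaluates how the squared error changes with w and moves w slightly downhill, which is exactly the "optimization" component of the three-part recipe above.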
Applications of Machine Learning
• Image tagging, image recognition, self-driving cars, OCR
• Human simulation, humanoid robots, industrial robots
• Anomaly detection, grouping and prediction, association rules
• Pokémon Go, AlphaGo, chess, Super Mario
• Spam and malware filtering, information extraction, sentiment analysis
• Drug discovery, nutrition, lifestyle management and monitoring, medical imaging, risk analysis, bioinformatics

Additional Slide (Discussion): Need for AI in Healthcare
• Accuracy & efficiency: AI improves accuracy and efficiency in diagnosing diseases.
• Large volume: AI can handle large volumes of healthcare data for processing and analysis.
• Automation: AI automates routine administrative tasks in healthcare.
• Predictions: AI has predictive capabilities for early disease detection and prevention.
• Treatment decisions: AI assists doctors in treatment decisions by considering the available information.
• Personalized medicine: AI enables personalized medicine by tailoring treatment to individual patient characteristics.
• Drug discovery: AI speeds up drug discovery and development processes.
• Telemedicine: AI supports telemedicine services and remote patient monitoring.

AI in healthcare spans: disease diagnosis, disease treatment, predicting diseases, medical imaging, drug discovery, personalized treatment plans, health monitoring, wearable technology, hospital administration, and ethics.

Data Mining Example: Learning Associations
• Basket analysis: P(Y | X) is the probability that somebody who buys X also buys Y, where X and Y are products or services.
• Example: P(chips | beer) = 0.7; P(pav | vada) > P(pav | tea)

Many other applications of Machine Learning:
• Traffic prediction
• Virtual personal assistants
• Speech recognition
• Natural language processing
• Astronomy
• Anatomy
• Agriculture
• Banking
• Online advertising
• Computer vision

Steps of Developing a Machine Learning Application
The workflow runs from Step 1, defining the objective (what we want and how we want it), to Step 8, deploying the finally built model in an application. Steps 2 to 7 are described below.

2. Collect the data
• Given the problem we want to solve, we have to investigate and obtain the data that we will use to feed our machine.
• The quality and quantity of information we get are very important, since they will directly impact how well or badly our model will work.
• We may have the information in an existing database, or we may have to create it from scratch.
• If it is a small project, we can create a spreadsheet that can later be easily exported as a CSV file.
• It is also common to collect information automatically from various sources, for example via web scraping or APIs.

3. Prepare the data
• This is a good time to visualize our data and check if there are correlations between the different features that we obtained.
• It is necessary to make a selection of features, since the ones we choose will directly impact the execution times and the results.
• We can also reduce dimensions by applying PCA if necessary.
• Additionally, we must balance the amount of data we have for each class so that it is significant; otherwise the learning may be biased towards one type of response, and when our model tries to generalize its knowledge, it will fail.
• At this stage, we can also pre-process our data by normalizing it, eliminating duplicates, and making error corrections.
• We must also separate the data into two groups: one for training and the other for model evaluation. The split is approximately 80/20, but it can vary depending on the case and the volume of data we have. (A code sketch of this step follows.)
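The "prepare the data" step translates naturally into code. A minimal sketch, assuming Python with pandas and scikit-learn; the toy housing table, its column names, and the roughly 80/20 split are all illustrative:

```python
# "Prepare the data": deduplicate, split ~80/20, and normalize (illustrative sketch).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "sqft":     [750, 900, 900, 1200, 1500, 1500, 2000, 2400],
    "bedrooms": [1,   2,   2,   2,    3,    3,    4,    4],
    "price":    [95,  120, 120, 150,  210,  210,  300,  360],  # duplicates on purpose
})
df = df.drop_duplicates()  # error correction / duplicate elimination

X, y = df[["sqft", "bedrooms"]], df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # roughly 80/20

scaler = StandardScaler().fit(X_train)   # fit normalization on training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuse the training statistics
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Fitting the scaler on the training split only, and reusing it on the test split, keeps the evaluation data truly unseen.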
4. Choose the model
• There are several models we can choose from according to our objective: algorithms for classification, prediction, linear regression, clustering (e.g., k-means or k-nearest neighbors), deep learning (e.g., neural networks), Bayesian methods, etc.
• There are various models to be used depending on the data we are going to process, such as images, sound, text, or numerical values.
[Table in the original slides lists example models and their sample applications.]

5. Train the model
• We will need to train the model on the dataset so that it runs smoothly and we see an incremental improvement in the prediction rate.
• It is necessary to initialize the weights of our model randomly (the weights are the values that multiply or weight the relationships between the inputs and outputs); they will be adjusted automatically by the selected algorithm as we train.

6. Evaluate the model
• We have to check the model against our evaluation data set, which contains inputs the model has never seen, and verify the precision of the already trained model.
• If the accuracy is less than or equal to 50%, the model will not be useful, since it would be like tossing a coin to make decisions.
• If we reach 90% or more, we can have good confidence in the results the model gives us.

Parameter Tuning During Evaluation
• If during evaluation we did not obtain good predictions and our precision is not the minimum desired, we may need to return to the training step and try a new configuration of parameters for our model.
• We can increase the number of times we iterate over our training data, termed epochs.
• We can also set the maximum error allowed for our model.
• Training can take anywhere from a few minutes to hours, or even days.
• These parameters are often called hyperparameters. This "tuning" is still more of an art than a science and improves as we experiment.
• There are usually many parameters to adjust, and their combinations multiply the number of configurations to explore, so tuning takes great effort and patience to give good results.

7. Prediction or Inference
• We are now ready to use our machine learning model to infer results in real-life scenarios.

Types of Machine Learning
➢ Supervised (inductive) learning
  ✓ Training data includes the desired outputs
➢ Unsupervised learning
  ✓ Training data does not include the desired outputs
➢ Semi-supervised learning
  ✓ Training data includes a few desired outputs
➢ Reinforcement learning
  ✓ Rewards from a sequence of actions

Supervised Learning
• Supervised learning is a type of machine learning used to train models from labeled training data.
• It allows you to predict the output for future or unseen data.
[Figure: supervised learning flow.]

Unsupervised Learning
• The data has no labels; the machine just looks for whatever patterns it can find.
• Unsupervised learning can be used for anomaly detection as well as clustering.
[Figures: examples of clustering and anomaly detection on unlabeled data.]

Classification, Regression and Clustering

1. Concept of Classification
• A machine learning task that identifies the class to which an instance belongs.
• If the target variable is categorical (classes), use a classification algorithm.
• In other words, classification is applied when the output takes finite, discrete values.
• Popular algorithms: Naïve Bayes, SVM, random forest, decision tree.
• Example: predict the class of a car given features like horsepower, mileage, weight, colour, etc.
• The classifier builds its decision rules from these features.
• The analysis has three potential outcomes: Sedan, SUV, or Hatchback.

Classification Example
• Example: credit scoring. Differentiate between low-risk and high-risk customers using their income and savings.
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk (written out in code after the applications list below).
• Misclassification may also occur if the model does not learn well.

Classification: Applications
• Pattern recognition
• Face recognition: pose, lighting, occlusion (glasses, beard), make-up, hair style
• Character recognition: different handwriting styles
• Speech recognition: temporal dependency; use of a dictionary or the syntax of the language
• Sensor fusion: combine multiple modalities, e.g., visual (lip image) and acoustic signals for speech
• Medical diagnosis: from symptoms to illnesses
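The credit-scoring discriminant above is simple enough to write out directly. A minimal sketch in plain Python; the threshold values θ1 and θ2 are hypothetical, since the slides leave them unspecified:

```python
# The credit-scoring discriminant from the classification example (illustrative).
THETA1 = 40_000  # income threshold θ1 (assumed value, not from the slides)
THETA2 = 5_000   # savings threshold θ2 (assumed value, not from the slides)

def credit_risk(income, savings):
    # IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(55_000, 8_000))  # low-risk
print(credit_risk(25_000, 9_000))  # high-risk (income below θ1)
```

A trained classifier would learn such thresholds from labeled examples instead of having them hand-set.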
2. Concept of Regression (for prediction)
• If the target variable is a continuous numeric value (e.g., anywhere in a range such as 100–2000), use a regression algorithm.
• Example: predict the price of a house given its area in sq. ft., location, number of bedrooms, etc.
• A simple (linear) regression model is y = w·x + b. This expresses the relationship between price (y) and area in sq. ft. (x), where the price can be any number within a defined range.

Regression Example
• Example: price of a used car
• x: car attributes; y: price
• y = g(x | θ), where g(·) is the model and θ its parameters
• For a linear model: y = w·x + w0

Regression: Applications
• Navigating a car: angle of the steering wheel (CMU NavLab)
• Kinematics of a robot arm: given a target position (x, y), predict the joint angles α1 = g1(x, y) and α2 = g2(x, y)
• Response surface design

3. Concept of Clustering
• Grouping objects based on the information found in the data that describes the objects or their relationships.
• Need for clustering:
  ✓ To determine the intrinsic grouping in a set of unlabeled data
  ✓ To organize data into clusters showing the internal structure of the data
  ✓ To partition the data points
  ✓ To understand and extract value from large sets of structured and unstructured data
• Popular algorithms: k-means, c-means.
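K-means, listed above as a popular clustering algorithm, can be demonstrated in a few lines. A minimal sketch assuming Python with scikit-learn; the 2-D points are invented and deliberately form two obvious groups:

```python
# K-means clustering on unlabeled 2-D points (illustrative data).
from sklearn.cluster import KMeans

points = [[1.0, 2.0], [1.5, 1.8], [1.0, 1.0], [0.5, 1.5],
          [8.0, 8.0], [8.5, 9.0], [9.0, 8.0], [9.0, 9.5]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two discovered group centers
```

Note that no labels were supplied: the algorithm discovers the grouping from the data alone, which is the defining property of unsupervised learning.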
Comparison of Learning Types

Details                     | Supervised Learning                                                    | Unsupervised Learning                                               | Reinforcement Learning
Type of data                | Labelled                                                               | Unlabelled                                                          | Not predefined
Learning approach           | Maps labelled input to known output and forecasts outcomes             | Discovers underlying patterns in unlabelled input to group samples  | Follows trial and error and learns a series of actions
Type of problem             | Regression, classification                                             | Association, clustering                                             | Maze solving
Training                    | External supervision                                                   | No supervision                                                      | No supervision
Methodology during training | Direct feedback needed                                                 | No feedback                                                         | Rewards are used
Popular algorithms          | Linear regression, logistic regression, KNN, SVM, random forest        | K-means, Apriori, C-means                                           | Q-learning, SARSA
Applications                | Risk evaluation, sales or weather forecasting, house price prediction  | Recommendation systems, anomaly detection                           | Games, self-driving cars

Training, Testing and Validation Datasets

1. Training dataset
• This is the actual dataset from which the model learns, i.e., the model sees and learns from this data to predict outcomes or make the right decisions.
• Most training data is collected from several sources and then preprocessed and organized to give proper model performance.
• The type of training data largely determines the model's ability to generalize, i.e., the better the quality and diversity of the training data, the better the performance of the model.
• This data is usually more than 60% of the total data available for the project. There is no specific rule, but generally, for large datasets, 75% to 80% is used for training and the remainder for testing (the general 80/20 rule).

2. Testing dataset
• This dataset is independent of the training set but has a somewhat similar probability distribution over classes. It is used as a benchmark to evaluate the model, and only after the training of the model is complete.
• The testing set is usually a properly organized dataset covering all kinds of scenarios that the model would probably face in the real world.
• Testing data is approximately 20-25% of the total data available for the project.
• Often the validation and testing sets are combined and used as a single testing set, which is not considered good practice.
• If the accuracy of the model on the training data is greater than that on the testing data, the model is said to be overfitting.

3. Validation dataset
• The validation set is used to fine-tune the hyperparameters of the model and is considered part of the training of the model.
• The model only sees this data for evaluation but does not learn from it, providing an objective, unbiased evaluation of the model. The validation set affects the model, but only indirectly.
• The validation dataset can also be used to regularize training (early stopping), by interrupting training when the loss on the validation set becomes greater than the loss on the training set, i.e., balancing bias and variance.
• This data is approximately 10-15% of the total data available for the project, but this can change with the number of hyperparameters, i.e., if the model has many hyperparameters, a larger validation set will give better results.
• Whenever the accuracy of the model on validation data is greater than that on training data, the model is said to have generalized well.
• Some experts consider that ML models with no hyperparameters, or with no tuning options, do not need a validation set.

Cross Validation
• In machine learning, we can fit the model on the training data, but we cannot say the model will work accurately on real data.
• For that, we must ensure that our model has captured the correct patterns from the data and is not picking up too much noise.
• For this purpose, we use the cross-validation technique: evaluating ML models by training several models on subsets of the available input data and evaluating them on the complementary subsets.
• Use cross-validation to detect overfitting, i.e., failure to generalize a pattern.
• Types of cross-validation:
  i. Leave-one-out cross-validation (LOOCV)
  ii. K-fold cross-validation
  iii. Validation set approach
  iv. Leave-p-out cross-validation
  v. Stratified k-fold cross-validation

1. LOOCV (Leave-One-Out Cross-Validation)
• In this method, we train on the whole dataset except a single data point, test on that point, and iterate so that each data point is left out once.
• It has advantages as well as disadvantages:
  ✓ An advantage of this method is that we make use of all data points, hence it has low bias.
  ✓ A major drawback is higher variation in the testing measure, since we test against a single data point each time; if that data point is an outlier, it can lead to higher variation.
  ✓ Another drawback is execution time: it iterates as many times as there are data points.
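LOOCV is available directly in scikit-learn. A minimal sketch, assuming Python with scikit-learn; the iris dataset and the k-nearest-neighbors model are illustrative stand-ins. Note the n separate model fits, which is exactly the execution-time drawback mentioned above:

```python
# Leave-one-out cross-validation: n models, each trained on all but one point
# and tested on the point left out (illustrative sketch).
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)  # 150 samples, so 150 fits
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())

print(len(scores), "fits; mean accuracy =", round(float(scores.mean()), 3))
```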
2. K-fold Cross-Validation
• In k-fold cross-validation, we split the input data into k subsets of data (also known as folds).
• We train an ML model on all but one (k-1) of the subsets, and then evaluate the model on the subset that was not used for training.
• This process is repeated k times, with a different subset reserved for evaluation (and excluded from training) each time.
• As an example, consider the training and evaluation subsets generated for the four models created during a 4-fold cross-validation: model one uses the first 25 percent of the data for evaluation and the remaining 75 percent for training; model two uses the second 25 percent (25 to 50 percent) for evaluation and the remaining three subsets for training; and so on.
• Note: a lower k means that each model is trained on a comparatively smaller training set and tested on a larger test fold.
• Each model is trained and evaluated on complementary data sources: the evaluation data includes, and is limited to, all of the data that is not in the training data.
• Performing a 4-fold cross-validation generates four models, four training data sources, four evaluation data sources, and four evaluations, one for each model.
• Note: a value of k = 10 is commonly suggested, as lower values of k do not give good estimates and higher values of k approach the LOOCV method.
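The 4-fold procedure just described maps onto a few library calls. A minimal sketch assuming Python with scikit-learn; the dataset and model are illustrative:

```python
# 4-fold cross-validation: four models, four evaluations (illustrative sketch).
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=4, shuffle=True, random_state=0)  # four complementary folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

print("per-fold accuracy:", scores)   # one evaluation per fold
print("mean accuracy:", scores.mean())
```

A large gap between the per-fold scores, or between training and cross-validated accuracy, is the overfitting signal that cross-validation is designed to expose.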
Underfitting, Overfitting, Bias and Variance
• A model is said to be a good machine learning model if it generalizes properly to any new input data from the problem domain.
• To check how well our machine learning model learns and generalizes to new data, we examine overfitting and underfitting, which are major causes of poor performance in machine learning algorithms.
• Bias: the assumptions made by a model to make a function easier to learn; in practice, the error rate on the training data. When the error rate is high, we speak of high bias; when it is low, low bias.
• Variance: the difference between the error rate on training data and on testing data. If the difference is high, it is called high variance; when it is low, low variance. Usually, we want low variance so the model generalizes.
• Example: scores of a poor learner, a memorizing learner, and a good learner in class tests and semester-end tests:

Learner                 | Class test score | Semester-end test score
Poor learner            | 50               | 47
Memorizing learner      | 92               | 67
Good learner            | 93               | 89

1. Underfitting
• A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data, and therefore also on testing data. (It is just like trying to fit an undersized shirt!)
• Underfitting destroys the accuracy of the machine learning model.
• It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data.
• In such cases, the rules of the machine learning model are too simple to apply to such data, and the model will probably make a lot of wrong predictions.
• Underfitting can be avoided by using more training data and more expressive features (see the techniques below).
• In a nutshell, underfitting refers to a model that can neither perform well on the training data nor generalize to new data.
• Reasons for underfitting:
  1. High bias and low variance.
  2. The size of the training dataset is not enough.
  3. The model is too simple.
  4. The training data is not cleaned and contains noise.
• Techniques to reduce underfitting:
  1. Increase model complexity.
  2. Increase the number of features, performing feature engineering.
  3. Remove noise from the data.
  4. Increase the number of epochs, or the duration of training, to get better results.

2. Overfitting
• A statistical model is said to be overfitted when the model does not make accurate predictions on testing data.
• When a model trains on so much data, it starts learning from the noise and inaccurate entries in the data set, and testing with test data results in high variance.
• The model then does not categorize the data correctly, because of too many details and noise.
• Common causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset, so they can build unrealistic models.
• A solution to avoid overfitting is using a linear algorithm if we have linear data, or using parameters like the maximal depth if we are using decision trees.
• In a nutshell, overfitting is a problem where the evaluation of a machine learning algorithm on training data differs from its evaluation on unseen data.
• Reasons for overfitting:
  1. Low bias and high variance.
  2. The model is too complex.
  3. The training dataset is too small.
• Techniques to reduce overfitting:
  1. Increase the training data.
  2. Reduce model complexity.
  3. Early stopping during the training phase (watch the loss over the training period; as soon as the validation loss begins to increase, stop training).
  4. Ridge regularization and lasso regularization.
  5. Use dropout in neural networks.

Overfitting and Underfitting of the Model
For a regression model:
• Underfit: high bias, low variance
• Best fit: low bias, low variance
• Overfit: low bias, high variance

For a classification model (illustrative error rates):

               | Underfit         | Best fit | Overfit
Training error | 23%              | 2%       | 1%
Test error     | 25% (high bias)  | 3%       | 20% (high variance)

Source: https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
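Underfitting and overfitting can be reproduced by varying model complexity on the same data. A minimal sketch assuming Python with scikit-learn, fitting polynomials of increasing degree to noisy non-linear data; the data and degrees are invented, and the exact error numbers will vary from run to run:

```python
# Underfit vs. good fit vs. overfit, visible as training/test error (illustrative).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(60, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.2, size=60)  # noisy non-linear target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):  # too simple, about right, too flexible
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    tr = mean_squared_error(y_tr, model.predict(X_tr))
    te = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Expect the pattern from the tables above: degree 1 has high error on both splits (underfit, high bias), degree 4 has low error on both, and degree 15 typically drives training error near zero while test error grows (overfit, high variance).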
Performance Measures: Measuring the Quality of a Model
• After cleaning and preprocessing the data and training our model, how do we know whether our classification model performs well?
• Most error measures calculate the total error in our model, but we cannot find individual instances of errors with them. The model might misclassify some categories more than others, and we cannot see this using a standard accuracy measure.
• If there is a significant class imbalance in the given data, i.e., one class has more instances than the others, a model might predict the majority class for all cases and still have a high accuracy score while never predicting the minority classes.
• This is where confusion matrices are useful. A confusion matrix is used to measure the performance of a classifier in depth.

Confusion Matrix
• A confusion matrix presents a table layout of the different outcomes of the predictions and results of a classification problem and helps visualize those outcomes.
• It tabulates all the predicted and actual values of a classifier.
• It shows not only the errors made by the classifier but also the type of each error, i.e., whether it is a type-I or a type-II error.
• With the help of the confusion matrix, we can calculate different parameters for the model, such as accuracy, precision, etc.
(Source: https://www.simplilearn.com/tutorials/machine-learning-tutorial/confusion-matrix-machine-learning)

• True Positive (TP): the number of times the actual positive values equal the predicted positives. You predicted a positive value, and it is correct.
• False Positive (FP): the number of times the model wrongly predicts a negative value as positive. You predicted a positive value, and it is actually negative. The false positive rate (FPR) corresponds to the type-I error.
• True Negative (TN): the number of times the actual negative values equal the predicted negatives. You predicted a negative value, and it is actually negative.
• False Negative (FN): the number of times the model wrongly predicts a positive value as negative. You predicted a negative value, and it is actually positive. The false negative rate (FNR) corresponds to the type-II error.

Example: calculate TP, TN, FP, and FN for the confusion matrix of the following data (a code check of this example appears after the accuracy subsection below):
Actual values = ['dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'cat', 'dog', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']
Predicted values = ['dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'cat', 'cat', 'cat', 'dog', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat', 'dog', 'dog', 'cat']

Scaling a Confusion Matrix
• To scale a confusion matrix to more classes, increase the number of rows and columns. All the true positives lie along the diagonal; the other cells are false positives or false negatives.
• Just from looking at the matrix, the performance of the model is not very clear. To find how accurate the model is, the following metrics are used: accuracy, precision, recall, F1 score, sensitivity, specificity.

(For the worked calculations below, the confusion-matrix counts implied by the formulas are TP = 86, FN = 12, FP = 10, TN = 79; the matrix figure itself is not reproduced here.)

1. Accuracy
• Accuracy finds the portion of correctly classified values; it tells us how often our classifier is right.
• It is the ratio between the number of correct predictions and the total number of predictions (the sum of all true values divided by the total).
• In this case: Accuracy = (86 + 79) / (86 + 79 + 10 + 12) = 0.8823 = 88.23%
• It is a measure of the correctness achieved by true predictions. In simple words, it tells us how many predictions are actually correct out of all the predictions made.
• The accuracy metric is not suited for imbalanced classes: when the model predicts the majority class label for every point, the accuracy will be high, but the model is not accurate.
• Accuracy is a valid choice of evaluation for classification problems when the data is well balanced, not skewed, and has no class imbalance.
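The dog/cat example above can be checked in code. A minimal sketch assuming Python with scikit-learn, and treating 'dog' as the positive class (an assumption; the slides do not fix one):

```python
# Confusion-matrix counts for the dog/cat example (illustrative check).
from sklearn.metrics import confusion_matrix, accuracy_score

actual    = ['dog','cat','dog','cat','dog','dog','cat','dog','cat','dog',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']
predicted = ['dog','dog','dog','cat','dog','dog','cat','cat','cat','cat',
             'dog','dog','dog','cat','dog','dog','cat','dog','dog','cat']

# With labels=['cat','dog'], ravel() yields tn, fp, fn, tp for positive='dog'.
tn, fp, fn, tp = confusion_matrix(actual, predicted, labels=['cat', 'dog']).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")            # TP=11 TN=6 FP=1 FN=2
print("accuracy =", accuracy_score(actual, predicted))  # 17/20 = 0.85
```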
2. Precision: how many of the predicted positives are actual positives?
• Precision measures the model's ability to classify positive values correctly. It is the true positives divided by the total number of predicted positive values.
• Precision is also called the positive predictive value.
• In simple words, it tells us how many predictions are actually positive out of all the positive predictions made.
• Precision should be high (ideally 1).
• In this case: Precision = 86 / (86 + 10) = 0.8958 = 89.58%
• Precision is a useful metric in cases where a false positive is a higher concern than a false negative.
• Example 1: spam detection. We need to focus on precision: suppose a mail is not spam but the model predicts it as spam, a false positive (FP).
• Example 2: precision is important in music or video recommendation systems, e-commerce websites, etc., where wrong results could lead to customer churn and harm the business.
• We always try to reduce FP.
• Trick to remember: Precision has the Predicted Positives in the denominator.

3. Recall: how many of the actual positives are correctly predicted as positive?
• Recall measures the model's ability to predict positive values: "How often does the model predict the correct positive values?" It is the true positives divided by the total number of actual positive values.
• In this case: Recall = 86 / (86 + 12) = 0.8776 = 87.76%
• It is also known as sensitivity or the true positive rate.
• Recall is a valid choice of evaluation metric when we want to capture as many positives as possible.
• Recall should be high (ideally 1).
• Recall is a useful metric in cases where a false negative trumps a false positive.
• Example 1: Is a person suffering from cancer or not? The person has cancer, but the model predicts them as not suffering from cancer, a false negative (FN).
• Example 2: recall is important in medical cases, where it does not matter whether we raise a false alarm, but the actual positive cases must not go undetected! Recall is the better metric here because we do not want to accidentally discharge an infected person and let them mix with the healthy population, thereby spreading a contagious virus.
• Trick to remember: Recall has the Actual Positives in the denominator.

4. F1-Score
• The F1 score is a number between 0 and 1; it is the harmonic mean of precision and recall.
• It is useful when we need to take both precision and recall into account.
• In this case: F1-Score = (2 × 0.8958 × 0.8776) / (0.8958 + 0.8776) = 0.8866 = 88.66%
• The F1-score should be high (ideally 1).
• The harmonic mean is used because, unlike a simple arithmetic mean, it is not dominated by extremely large values; it punishes the extreme values more.
• The F1 score maintains a balance between precision and recall for the classifier: if precision is low, F1 is low, and if recall is low, F1 is again low.
• There are cases with no clear distinction between whether precision or recall is more important, so we combine them.
• In practice, when we try to increase the precision of our model, the recall goes down, and vice versa. The F1-score captures both trends in a single value.

5. Sensitivity and Specificity (true positive rate and true negative rate)
• Sensitivity = TP / (TP + FN), the true positive rate (identical to recall).
• Specificity = TN / (TN + FP), the true negative rate.

When to use Accuracy / Precision / Recall / F1-Score?
❑ Accuracy is used when the true positives and true negatives are most important. Accuracy is a better metric for balanced data.
❑ Whenever a false positive is much more important, use precision.
❑ Whenever a false negative is much more important, use recall.
❑ F1-Score is used when both false negatives and false positives are important. F1-Score is a better metric for imbalanced data.
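All of the metrics in this section follow from the four confusion-matrix counts. A minimal sketch in plain Python using the worked example's counts (TP = 86, FP = 10, FN = 12, TN = 79, as reconstructed above):

```python
# Accuracy, precision, recall, F1, and specificity from confusion-matrix counts.
tp, fp, fn, tn = 86, 10, 12, 79  # counts from the worked example above

accuracy    = (tp + tn) / (tp + tn + fp + fn)   # all correct / all predictions
precision   = tp / (tp + fp)                    # predicted positives in denominator
recall      = tp / (tp + fn)                    # actual positives in denominator
specificity = tn / (tn + fp)                    # true negative rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} f1={f1:.4f} specificity={specificity:.4f}")
# accuracy~0.8824, precision~0.8958, recall~0.8776, f1~0.8866, specificity~0.8876
```

Running this reproduces the numbers computed by hand in the subsections above, including the corrected F1 value of about 88.66%.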