Unit-1:
1. Write the Four Main Challenges in Machine Learning?
2. Write Short Notes on AI, ML, and DL?
3. What are the Different Types of Machine Learning Systems?
4. Write a Note on Training Loss Vs Testing Loss?
5. What are different Tradeoffs in Statistical Learning? Explain
6. Write the procedure for estimating sampling distribution of an estimator?
7. Write the statistics in supervised learning and unsupervised learning?

1. Machine learning involves analyzing data to build or train models. It is everywhere, from Amazon product recommendations to self-driving cars, and it holds great value across industries. As per recent research, the global machine learning market is expected to grow by 43% by 2024. This revolution has greatly increased the demand for machine learning professionals: AI and machine learning jobs have grown by about 75% over the past four years, and the industry continues to grow. A career in machine learning offers job satisfaction, excellent growth, and a high salary, but it is a complex and challenging field, and professionals face many challenges when building ML skills and creating applications from scratch. The major challenges faced by machine learning professionals are discussed below.

1. Poor Quality of Data
Data plays a significant role in the machine learning process. One of the significant issues machine learning professionals face is the absence of good-quality data. Unclean and noisy data can make the whole process extremely exhausting, and we do not want our algorithm to make inaccurate or faulty predictions, so the quality of the data is essential to the quality of the output. We therefore need to ensure that data preprocessing, which includes removing outliers, filtering missing values, and removing unwanted features, is done with the utmost care.

2. Underfitting of Training Data
Underfitting occurs when the model is unable to establish an accurate relationship between the input and output variables. It is like trying to fit into undersized jeans: the model is too simple to capture the precise relationship. To overcome this issue:
● Increase the training time of the model
● Increase the complexity of the model
● Add more features to the data
● Reduce the regularization parameters

3. Overfitting of Training Data
Overfitting occurs when a model fits its training data too closely, including its noise and bias, which negatively affects its performance on new data. It is like trying to fit into oversized jeans. Unfortunately, this is one of the significant issues faced by machine learning professionals: an algorithm trained on noisy or biased data will have its overall performance affected. Consider a model trained to differentiate between a cat, a rabbit, a dog, and a tiger, where the training data contains 1000 cats, 1000 dogs, 1000 tigers, and 4000 rabbits. There is then a considerable probability that the model will identify a cat as a rabbit: the data was plentiful, but it was biased, so the predictions were negatively affected. We can tackle this issue by:
● Analyzing the data with the utmost care
● Using data augmentation techniques
● Removing outliers from the training set
● Selecting a model with fewer features
4. Machine Learning Is a Complex Process
The machine learning industry is young and continuously changing. Rapid trial-and-error experimentation is constantly being carried out, and because the process keeps transforming, there are high chances of error, which makes the learning complex. The work includes analyzing the data, removing data bias, training models, applying complex mathematical calculations, and much more, so it is a genuinely complicated process and another big challenge for machine learning professionals.

5. Lack of Training Data
The most important task in the machine learning process is to train the model on enough data to achieve accurate output. Too little training data produces inaccurate or overly biased predictions. Consider the analogy of teaching a child: to explain how to distinguish between an apple and a watermelon, you show the child both fruits and point out the differences in colour, shape, and taste, and the child soon learns to tell them apart. A machine learning algorithm, on the other hand, needs a lot of data to make the same distinction; for complex problems it may even require millions of examples. Therefore we need to ensure that machine learning algorithms are trained with sufficient amounts of data.

6. Slow Implementation
This is one of the common issues faced by machine learning professionals. Machine learning models can provide accurate results, but producing them takes a tremendous amount of time: slow programs, data overload, and excessive requirements all add to the time needed. Further, the models require constant monitoring and maintenance to deliver the best output.

7. Imperfections in the Algorithm When Data Grows
Even if you have found quality data, trained the model well, and obtained concise and accurate predictions, the model may become useless in the future as the data grows. The best model of the present may become inaccurate in the future and require further adjustment, so regular monitoring and maintenance are needed to keep the algorithm working. This is one of the most exhausting issues faced by machine learning professionals.

Conclusion: Machine learning is set to bring a major transformation in technology. It is one of the most rapidly growing technologies, used in medical diagnosis, speech recognition, robotic training, product recommendations, video surveillance, and much more. This continuously evolving domain offers immense job satisfaction, excellent opportunities, global exposure, and high salaries, but it is a high-risk, high-return technology. Before starting a machine learning journey, carefully examine the challenges mentioned above, plan carefully, stay patient, and maximize your efforts.

2. Artificial Intelligence (AI) is the mechanism of incorporating human intelligence into machines through a set of rules (algorithms). AI is a combination of two words: "Artificial", meaning something made by humans or non-natural, and "Intelligence", meaning the ability to understand or think.
Another definition is that "AI is the study of training a machine (computer) to mimic a human brain and its thinking capabilities". AI focuses on three major aspects (skills): learning, reasoning, and self-correction, in order to obtain the maximum possible efficiency.

Machine Learning: Machine Learning is the study/process that enables a system (computer) to learn automatically from its own experience and improve accordingly, without being explicitly programmed. ML is an application or subset of AI. ML focuses on developing programs that can access data and use it to learn for themselves. The entire process makes observations on data to identify possible patterns and make better future decisions based on the examples provided. The major aim of ML is to allow systems to learn by themselves through experience, without any kind of human intervention or assistance.

Deep Learning: Deep Learning is a sub-part of the broader family of Machine Learning which makes use of neural networks (similar to the neurons working in our brain) to mimic human brain-like behaviour. DL algorithms focus on information-processing patterns to identify patterns just as the human brain does and classify the information accordingly. DL works on larger sets of data than ML, and the prediction mechanism is self-administered by the machine. In short, AI is the broadest concept, ML is a subset of AI that learns from data, and DL is a subset of ML that uses deep neural networks.

3. Types of Machine Learning
Machine learning is a subset of AI which enables a machine to automatically learn from data, improve performance from past experience, and make predictions. Machine learning uses a set of algorithms that work on huge amounts of data. Data is fed to these algorithms to train them, and on the basis of the training, they build a model and perform a specific task. These ML algorithms help solve different business problems such as regression, classification, forecasting, clustering, and association. Based on the methods and way of learning, machine learning is divided into mainly four types:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Learning
Below is a detailed description of each type of machine learning along with their respective algorithms.

1. Supervised Machine Learning
As its name suggests, supervised machine learning is based on supervision. In the supervised learning technique, we train the machines using a "labelled" dataset, and based on the training, the machine predicts the output. Here, the labelled data specifies that some of the inputs are already mapped to the output. More precisely, we first train the machine with the input and corresponding output, and then we ask the machine to predict the output on a test dataset.
Let's understand supervised learning with an example. Suppose we have an input dataset of cat and dog images. First, we provide training so the machine can understand the images: the shape and size of the tail of a cat and a dog, the shape of the eyes, colour, height (dogs are taller, cats are smaller), and so on. After training, we input the picture of a cat and ask the machine to identify the object and predict the output.
Now the machine is well trained, so it will check all the features of the object, such as height, shape, colour, eyes, ears, and tail, and find that it is a cat. So it will put it in the cat category. This is how the machine identifies objects in supervised learning.
The main goal of the supervised learning technique is to map the input variable (x) to the output variable (y). Some real-world applications of supervised learning are risk assessment, fraud detection, spam filtering, etc.

Categories of Supervised Machine Learning
Supervised machine learning can be classified into two types of problems (a short code sketch contrasting the two appears at the end of this subsection):
○ Classification
○ Regression
a) Classification
Classification algorithms are used to solve classification problems, in which the output variable is categorical, such as "Yes" or "No", Male or Female, Red or Blue, etc. The classification algorithms predict the categories present in the dataset. Some real-world examples of classification algorithms are spam detection, email filtering, etc.
Some popular classification algorithms are given below:
○ Random Forest Algorithm
○ Decision Tree Algorithm
○ Logistic Regression Algorithm
○ Support Vector Machine Algorithm
b) Regression
Regression algorithms are used to solve regression problems, in which the output variable is continuous and the model learns the relationship between the input and output variables. They are used to predict continuous output variables, such as market trends, weather values, etc.
Some popular regression algorithms are given below:
○ Simple Linear Regression Algorithm
○ Multivariate Regression Algorithm
○ Decision Tree Algorithm
○ Lasso Regression

Advantages and Disadvantages of Supervised Learning
Advantages:
○ Since supervised learning works with a labelled dataset, we have an exact idea about the classes of objects.
○ These algorithms are helpful in predicting the output on the basis of prior experience.
Disadvantages:
○ These algorithms are not able to solve complex tasks.
○ The model may predict the wrong output if the test data is different from the training data.
○ It requires a lot of computational time to train the algorithm.

Applications of Supervised Learning
Some common applications of supervised learning are given below:
○ Image Segmentation: Supervised learning algorithms are used in image segmentation, where image classification is performed on different image data with pre-defined labels.
○ Medical Diagnosis: Supervised algorithms are also used in the medical field for diagnosis. This is done using medical images and past data labelled with disease conditions, so that the machine can identify a disease for new patients.
○ Fraud Detection: Supervised classification algorithms are used for identifying fraudulent transactions, fraudulent customers, etc. This is done using historic data to identify patterns that can indicate possible fraud.
○ Spam Detection: In spam detection and filtering, classification algorithms are used to classify an email as spam or not spam, and spam emails are sent to the spam folder.
○ Speech Recognition: Supervised learning algorithms are also used in speech recognition. The algorithm is trained with voice data, and various identifications can be done using it, such as voice-activated passwords and voice commands.
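As referenced above, here is a minimal sketch of the two categories of supervised learning, classification and regression, using scikit-learn. The tiny datasets, feature meanings, and model choices are illustrative assumptions made for this sketch, not part of the original notes.

```python
# Illustrative sketch: supervised learning as classification vs. regression.
# The tiny synthetic datasets below are assumptions made for demonstration only.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: categorical output (0 = "cat", 1 = "dog") from two numeric features.
X_cls = [[30, 4], [35, 5], [70, 9], [80, 10]]   # assumed features: [height_cm, tail_length]
y_cls = [0, 0, 1, 1]
clf = DecisionTreeClassifier().fit(X_cls, y_cls)
print(clf.predict([[32, 4]]))        # -> [0], i.e. "cat"

# Regression: continuous output (salary predicted from years of experience).
X_reg = [[1], [2], [3], [4], [5]]
y_reg = [30000, 35000, 41000, 46000, 52000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[6]]))            # -> a continuous salary estimate
```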
2. Unsupervised Machine Learning
Unsupervised learning is different from the supervised learning technique; as its name suggests, there is no need for supervision. In unsupervised machine learning, the machine is trained using an unlabeled dataset, and it predicts the output without any supervision. The models are trained with data that is neither classified nor labelled, and the model acts on that data without any supervision.
The main aim of the unsupervised learning algorithm is to group or categorize the unsorted dataset according to similarities, patterns, and differences. Machines are instructed to find the hidden patterns in the input dataset.
Let's take an example to understand it more precisely. Suppose there is a basket of fruit images, and we input it into the machine learning model. The images are totally unknown to the model, and its task is to find the patterns and categories of the objects. The machine will discover its own patterns and differences, such as colour and shape differences, and predict the output when tested with a test dataset.

Categories of Unsupervised Machine Learning
Unsupervised learning can be further classified into two types:
○ Clustering
○ Association
1) Clustering
The clustering technique is used when we want to find the inherent groups in the data. It groups objects into clusters such that objects with the most similarities remain in one group and have few or no similarities with objects of other groups. An example of clustering is grouping customers by their purchasing behaviour (a short k-means sketch appears at the end of this answer).
Some of the popular clustering algorithms are given below:
○ K-Means Clustering algorithm
○ Mean-shift algorithm
○ DBSCAN Algorithm
○ Principal Component Analysis
○ Independent Component Analysis
2) Association
Association rule learning is an unsupervised learning technique that finds interesting relations among variables within a large dataset. The main aim of this learning algorithm is to find the dependency of one data item on another and map the variables accordingly so that maximum profit can be generated. This algorithm is mainly applied in market basket analysis, web usage mining, continuous production, etc. Some popular association rule learning algorithms are the Apriori algorithm, Eclat, and the FP-growth algorithm.

Advantages and Disadvantages of Unsupervised Learning
Advantages:
○ These algorithms can be used for more complicated tasks than supervised ones, because they work on unlabeled data.
○ Unsupervised algorithms are preferable for many tasks because obtaining an unlabeled dataset is easier than obtaining a labelled one.
Disadvantages:
○ The output of an unsupervised algorithm can be less accurate, as the dataset is not labelled and the algorithm is not trained with the exact output beforehand.
○ Working with unsupervised learning is more difficult, as it works with unlabelled data that does not map to a known output.

Applications of Unsupervised Learning
○ Network Analysis: Unsupervised learning is used in document network analysis of text data, for example to identify plagiarism and copyright issues in scholarly articles.
○ Recommendation Systems: Recommendation systems widely use unsupervised learning techniques to build recommendation applications for web applications and e-commerce websites.
○ Anomaly Detection: Anomaly detection is a popular application of unsupervised learning that identifies unusual data points within a dataset. It is used to discover fraudulent transactions.
○ Singular Value Decomposition: Singular Value Decomposition (SVD) is used to extract particular information from a database, for example extracting the information of users located in a particular region.
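As referenced in the clustering discussion above, here is a minimal sketch of unsupervised learning with k-means. The 2-D points and the choice of two clusters are illustrative assumptions, not part of the original notes.

```python
# Illustrative sketch: clustering unlabeled data with k-means (unsupervised learning).
# The points and the choice of n_clusters=2 are assumptions for demonstration only.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: two loose groups of 2-D points (no class labels are given).
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point, e.g. [0 0 0 1 1 1]
print(kmeans.cluster_centers_)  # estimated centre of each cluster
```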
3. Semi-Supervised Learning
Semi-supervised learning is a type of machine learning that lies between supervised and unsupervised machine learning. It represents the intermediate ground between supervised learning (with labelled training data) and unsupervised learning (with no labelled training data), and it uses a combination of labelled and unlabeled datasets during the training period.
Although semi-supervised learning operates on data that contains a few labels, the data mostly consists of unlabeled examples: labels are costly to obtain, so for practical purposes a dataset may contain only a few of them. Still, it is distinct from supervised and unsupervised learning, which are defined by the presence or absence of labels. The concept of semi-supervised learning was introduced to overcome the drawbacks of the supervised and unsupervised learning algorithms. Its main aim is to make effective use of all the available data, rather than only the labelled data used in supervised learning. Typically, similar data points are first clustered with an unsupervised algorithm, and the clusters are then used to assign labels to the unlabeled data, because labelled data is considerably more expensive to acquire than unlabeled data.
We can picture these algorithms with an example. Supervised learning is like a student under the supervision of an instructor at home and college. If the student analyses the same concept without any help from the instructor, that is unsupervised learning. Under semi-supervised learning, the student revises the concept on their own after first analysing it under the guidance of an instructor at college.
Advantages and Disadvantages of Semi-supervised Learning
Advantages:
○ The algorithm is simple and easy to understand.
○ It is highly efficient.
○ It is used to overcome the drawbacks of supervised and unsupervised learning algorithms.
Disadvantages:
○ The results of its iterations may not be stable.
○ We cannot apply these algorithms to network-level data.
○ Its accuracy is low.

4. Reinforcement Learning
Reinforcement learning works on a feedback-based process in which an AI agent (a software component) automatically explores its surroundings by trial and error: taking actions, learning from experience, and improving its performance. The agent is rewarded for each good action and punished for each bad one, so the goal of a reinforcement learning agent is to maximize the rewards. In reinforcement learning there is no labelled data as in supervised learning; agents learn from their experience only.
The reinforcement learning process is similar to the way a human being learns; for example, a child learns various things through experience in day-to-day life. An example of reinforcement learning is playing a game, where the game is the environment, the moves of the agent at each step define states, and the goal of the agent is to get a high score. The agent receives feedback in the form of punishments and rewards.
Due to its way of working, reinforcement learning is employed in fields such as game theory, operations research, information theory, and multi-agent systems. A reinforcement learning problem can be formalized using a Markov Decision Process (MDP). In an MDP, the agent constantly interacts with the environment and performs actions; at each action, the environment responds and generates a new state.

Categories of Reinforcement Learning
Reinforcement learning is categorized mainly into two types of methods/algorithms:
○ Positive Reinforcement Learning: Positive reinforcement learning increases the tendency that the required behaviour will occur again by adding something. It strengthens the behaviour of the agent and affects it positively.
○ Negative Reinforcement Learning: Negative reinforcement learning works in exactly the opposite way to positive RL. It increases the tendency that a specific behaviour will occur again by avoiding a negative condition.

Real-world Use Cases of Reinforcement Learning
○ Video Games: RL algorithms are popular in gaming applications, where they are used to reach super-human performance. Well-known examples are AlphaGo and AlphaGo Zero.
○ Resource Management: The paper "Resource Management with Deep Reinforcement Learning" showed how RL can be used to automatically learn to schedule computing resources across waiting jobs so as to minimize the average job slowdown.
○ Robotics: RL is widely used in robotics applications. Robots are used in industrial and manufacturing areas, and these robots are made more capable with reinforcement learning. Different industries have a vision of building intelligent robots using AI and machine learning technology.
○ Text Mining: Text mining, one of the important applications of NLP, is being implemented with the help of reinforcement learning (for example, by Salesforce).

Advantages and Disadvantages of Reinforcement Learning
Advantages:
○ It helps solve complex real-world problems that are difficult to solve with general techniques.
○ The learning model of RL is similar to human learning, so it can produce highly accurate results.
○ It helps achieve long-term results.
Disadvantages:
○ RL algorithms are not preferred for simple problems.
○ RL algorithms require huge amounts of data and computation.
○ Too much reinforcement learning can lead to an overload of states, which can weaken the results. The curse of dimensionality limits reinforcement learning for real physical systems.
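The state-action-reward loop described above can be made concrete with a small tabular Q-learning sketch. The 5-state corridor environment, the reward value, and the hyperparameters below are illustrative assumptions chosen for demonstration; they are not part of the original notes and not a specific published algorithm configuration.

```python
# Illustrative sketch: tabular Q-learning on a tiny 5-state corridor.
# Environment, reward, and hyperparameters are assumptions for demonstration only.
import numpy as np

n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
goal = n_states - 1                    # reaching the last state gives reward +1
Q = np.zeros((n_states, n_actions))    # action-value table
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate
rng = np.random.default_rng(0)

for episode in range(500):
    state = 0
    while state != goal:
        # epsilon-greedy action selection (explore vs. exploit)
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = max(state - 1, 0) if action == 0 else min(state + 1, goal)
        reward = 1.0 if next_state == goal else 0.0
        # Q-learning update: current estimate moves toward reward + discounted best future value
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# Greedy action per state after learning: 1 (right) for the non-terminal states.
print(np.argmax(Q, axis=1))
```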
4. Introduction: When training a machine learning model, it is crucial to monitor its performance using various evaluation metrics. Among these metrics, training loss and testing loss play a fundamental role in assessing the model's learning progress and generalization capabilities. This note sheds light on the concepts of training loss and testing loss, their differences, and their significance in the model development process.

Training Loss: During the training phase, a machine learning model is exposed to a labeled dataset to learn patterns and relationships between the input data and the desired output. Training loss, also known as the empirical or objective loss, measures the error or discrepancy between the predicted output of the model and the actual ground-truth labels within the training data. The training loss quantifies how well the model is fitting the training data and how effectively it is minimizing the error during the optimization process.

Testing Loss: Once the model is trained, it needs to be evaluated on unseen data to assess its performance on examples that were not part of the training set. Testing loss, also called validation loss or generalization loss, measures the model's performance on this unseen data. It calculates the error or discrepancy between the predicted outputs and the true labels from the testing dataset. The testing loss serves as an estimate of how well the model will perform on new, unseen data and reflects its ability to generalize and make accurate predictions beyond the training set.

Key Differences:
1. Data Usage: Training loss is computed using the training dataset, which the model has already seen and learned from. Testing loss, on the other hand, is computed using a separate dataset that was not used during the training phase and represents real-world scenarios.
2. Purpose: Training loss is primarily used to guide the model's learning process and optimize its parameters by minimizing the error. Testing loss provides an evaluation of the model's performance and serves as an estimate of how it will perform in real-world scenarios.
3. Overfitting Detection: Monitoring the training loss together with the testing loss is essential for detecting overfitting. Overfitting occurs when a model becomes excessively complex and starts to memorize the training data, resulting in a low training loss but a high testing loss. An increasing gap between training loss and testing loss indicates overfitting, suggesting that the model is not generalizing well to unseen data.

Significance:
1. Model Optimization: Training loss serves as the primary feedback signal during the model's optimization process, guiding the learning algorithm to adjust the model's parameters to minimize the error on the training data.
2. Generalization Assessment: Testing loss provides insights into how well the model is performing on unseen data, indicating its ability to generalize beyond the training set. It helps evaluate and compare different models or model configurations.
3. Hyperparameter Tuning: Testing loss is often used to tune the hyperparameters of a model, such as the learning rate or regularization strength, to improve its generalization performance. By monitoring the testing loss, one can find the hyperparameter settings that result in the lowest testing error.

Conclusion: Training loss and testing loss are crucial metrics for assessing the performance of machine learning models. While training loss indicates how well a model fits the training data, testing loss provides insights into its generalization capabilities. Monitoring both loss values allows for detecting overfitting, optimizing models, and making informed decisions during the development process. By striking a balance between minimizing training loss and achieving low testing loss, one can create models that learn effectively and perform well on new, unseen data.
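A minimal sketch of the training-loss versus testing-loss comparison discussed above, using scikit-learn. The synthetic data and the deliberately over-flexible polynomial model are assumptions chosen to make the gap between the two losses visible.

```python
# Illustrative sketch: comparing training loss and testing loss.
# Synthetic data and the high-degree polynomial are assumptions for demonstration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, 40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(0, 0.2, 40)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# An overly flexible model tends to drive training loss down while testing loss stays high.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X_train, y_train)

train_loss = mean_squared_error(y_train, model.predict(X_train))
test_loss = mean_squared_error(y_test, model.predict(X_test))
print(f"training loss: {train_loss:.4f}  testing loss: {test_loss:.4f}")
# A large gap (testing loss >> training loss) signals overfitting.
```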
5. Tradeoffs in Statistical Learning
In statistical learning, various tradeoffs arise when developing and deploying machine learning models. These tradeoffs involve factors such as model performance, complexity, interpretability, and computational resources. Understanding these tradeoffs is essential for making informed decisions and achieving the desired balance. Here are some common tradeoffs in statistical learning:
1. Bias-variance tradeoff: The bias-variance tradeoff is about finding the right balance between capturing underlying patterns in the data and avoiding overfitting or underfitting. Models with high bias oversimplify the data, resulting in underfitting, while models with high variance are overly complex and fit noise, leading to overfitting. Striking the right balance between bias and variance is crucial for optimal predictive performance.
2. Model complexity vs. interpretability: Increasing model complexity often improves predictive accuracy by capturing intricate patterns. However, complex models, such as deep neural networks, may be difficult to interpret. Simpler models, like linear regression, are more interpretable but may have limited predictive power. The tradeoff between model complexity and interpretability depends on the application and the importance of interpretability in decision-making.
3. Training time vs. model performance: Some models, like deep neural networks, require substantial computational resources and longer training times to achieve high performance. Simpler models, such as decision trees or linear models, can be trained quickly but may have limited predictive capabilities. The tradeoff between training time and model performance depends on available computational resources, time constraints, and specific application requirements.
4. Underfitting vs. overfitting: Underfitting occurs when a model is too simple to capture underlying patterns, resulting in poor performance on both training and testing data. Overfitting happens when a model becomes overly complex and fits noise or random variations, leading to poor performance on unseen data. The tradeoff involves finding the right level of model complexity that minimizes both underfitting and overfitting.
5. Feature selection vs. feature dimensionality: Feature selection involves choosing relevant features for the model's predictive performance. Including irrelevant or redundant features increases model complexity, training time, and the risk of overfitting. However, removing informative features may result in a loss of valuable information. The tradeoff lies in selecting a subset of features that balances model performance and complexity.
6. Model robustness vs. computational efficiency: Complex models may be more robust to noise and data variations, but they require increased computational resources, memory, and time. Simpler models may be computationally efficient but more sensitive to noise. The tradeoff involves finding a balance between model robustness and computational efficiency based on available resources and the desired performance level.
Understanding and managing these tradeoffs is essential for developing effective and efficient statistical learning models. Consider the problem requirements, available data, computational resources, interpretability needs, and desired performance level to make informed decisions and strike an appropriate balance.

6. Estimating the Sampling Distribution of an Estimator
The sampling distribution of an estimator provides insights into the behavior and variability of the estimator's values when repeatedly sampling from a population. It allows us to make inferences about the estimator's accuracy and precision. Here is a procedure for estimating the sampling distribution of an estimator:
1. Define the Estimator: Begin by clearly defining the estimator you want to study. This could be the sample mean, sample proportion, regression coefficient, or any other statistic used to estimate a population parameter.
2. Define the Population: Specify the characteristics of the population you are interested in. This includes identifying the population distribution, any assumed parameters, and the sampling method.
3. Simulate Sampling: Simulate the process of sampling from the population to generate multiple samples.
The number of samples and the sample size may vary depending on the desired level of precision.
4. Calculate the Estimator: For each simulated sample, calculate the value of the estimator. This involves applying the estimator formula to the sample data. Record the estimator's value for each sample.
5. Repeat Steps 3 and 4: Repeat the process of simulating sampling and calculating the estimator multiple times. The more repetitions you perform, the more accurate the estimation of the sampling distribution will be.
6. Analyze the Results: Once you have generated multiple samples and obtained the corresponding estimator values, analyze the results to estimate the sampling distribution. Commonly used methods include calculating descriptive statistics such as the mean, standard deviation, and confidence intervals of the estimator values.
7. Visualize the Sampling Distribution: To gain a better understanding of the sampling distribution, plot a histogram or a density plot of the estimator values. This visual representation can help identify the shape, center, and spread of the sampling distribution.
8. Interpret the Results: Analyze the estimated sampling distribution to draw conclusions about the estimator's behavior. Look for characteristics such as bias, efficiency, consistency, or other relevant properties. Compare the estimated sampling distribution to theoretical expectations if applicable.
9. Validate and Refine: Assess the validity of the estimated sampling distribution by comparing it with theoretical properties, if available. If necessary, refine the simulation process by adjusting the sample size, the number of repetitions, or other parameters to improve the accuracy and reliability of the estimation.
By following this procedure, you can estimate the sampling distribution of an estimator. This provides valuable insights into the behavior and variability of the estimator's values when sampling from a population. These estimates can guide decision-making, hypothesis testing, and other statistical inference tasks.
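The procedure above maps directly onto a short simulation. The sketch below estimates the sampling distribution of the sample mean by repeatedly drawing samples from an assumed exponential population; the population choice, sample size, and repetition count are illustrative assumptions.

```python
# Illustrative sketch: estimating the sampling distribution of the sample mean.
# Population, sample size, and number of repetitions are assumptions for demonstration.
import numpy as np

rng = np.random.default_rng(42)
sample_size = 30
n_repetitions = 5000

# Steps 3-5: repeatedly simulate sampling and record the estimator's value.
estimates = np.array([
    rng.exponential(scale=2.0, size=sample_size).mean()   # estimator = sample mean
    for _ in range(n_repetitions)
])

# Step 6: summarize the estimated sampling distribution.
print("mean of estimates:", estimates.mean())         # close to the population mean (2.0)
print("standard error (spread):", estimates.std(ddof=1))
print("95% interval:", np.percentile(estimates, [2.5, 97.5]))
# Step 7 (visualization) could be done with matplotlib, e.g. plt.hist(estimates, bins=50).
```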
7. Statistics in Supervised Learning and Unsupervised Learning
Supervised learning and unsupervised learning are two main categories of machine learning techniques. Both approaches utilize various statistical methods to analyze and make inferences from data. Here's an overview of the role of statistics in supervised and unsupervised learning.

Supervised Learning: In supervised learning, the goal is to learn a mapping between input data and corresponding output labels based on labeled training examples. Statistics plays a crucial role in different aspects of supervised learning, including:
1. Descriptive Statistics: Descriptive statistics are used to summarize and understand the characteristics of the input features and output labels. This involves calculating measures such as the mean, median, variance, and correlation coefficients to gain insights into the data distribution and relationships.
2. Inferential Statistics: Inferential statistics are employed to make inferences about the population based on the sample data. Techniques such as hypothesis testing and confidence intervals can help assess the significance of relationships between input features and output labels and determine whether they are statistically meaningful.
3. Regression Analysis: Regression analysis is commonly used in supervised learning to model the relationship between input features and continuous output variables. Statistical methods like linear regression and polynomial regression, or more advanced techniques like ridge regression or lasso regression, are employed to estimate the parameters of the regression models.
4. Classification Analysis: Classification analysis is used when the output variable is categorical. Statistical techniques such as logistic regression, decision trees, random forests, or support vector machines are employed to build classification models that can predict the class labels of new, unseen data based on the input features.

Unsupervised Learning: In unsupervised learning, the goal is to explore and discover patterns or structures within the data without labeled examples. Statistics plays a vital role in several aspects of unsupervised learning, including:
1. Clustering Analysis: Clustering algorithms are employed to group similar data points together based on their feature similarities. Statistical methods like k-means clustering, hierarchical clustering, or Gaussian mixture models are utilized to identify clusters and estimate cluster centers and boundaries.
2. Dimensionality Reduction: Dimensionality reduction techniques are used to reduce the complexity of high-dimensional data by transforming it into a lower-dimensional representation. Statistical methods such as principal component analysis (PCA), factor analysis, or t-distributed stochastic neighbor embedding (t-SNE) are employed to capture the most informative features or project the data onto a lower-dimensional space.
3. Association Rule Mining: Association rule mining is used to discover interesting relationships or patterns in transactional or categorical data. Statistical techniques like the Apriori algorithm or frequent itemset mining are applied to identify associations or dependencies among different items or variables.
4. Outlier Detection: Outlier detection aims to identify unusual or anomalous data points that deviate significantly from the norm. Statistical methods such as the z-score, the Mahalanobis distance, or robust measures like the median absolute deviation (MAD) are utilized to detect outliers based on their deviations from the expected statistical patterns.
In both supervised and unsupervised learning, statistical techniques and principles provide the foundation for data analysis, model building, and interpretation. They enable researchers and practitioners to extract meaningful insights, assess the significance of relationships, and make informed decisions based on the data at hand.

Unit-2:
1. Explain KNN with an example?
2. Explain Naïve Bayes classification with example?
3. Explain Linear and logistic regression with examples?
4. Explain Binary classification with example in machine Learning?
5. What is decision tree? Explain the procedure to construct decision tree?
6. Write a short note on various Distance based methods of Classification/regression?

1. K-Nearest Neighbour (KNN) Machine Learning Algorithm
○ K-Nearest Neighbour is one of the simplest machine learning algorithms, based on the supervised learning technique.
○ The K-NN algorithm assumes similarity between the new case/data and the available cases and puts the new case into the category that is most similar to the available categories.
○ The K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can easily be classified into a well-suited category using the K-NN algorithm.
○ The K-NN algorithm can be used for regression as well as for classification, but it is mostly used for classification problems.
○ K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
○ It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset and, at the time of classification, performs an action on the dataset.
○ At the training phase, the KNN algorithm just stores the dataset, and when it gets new data, it classifies that data into the category that is most similar to the new data.
○ Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new data that are most similar to the cat and dog images and, based on the most similar features, will put it in either the cat or the dog category.

Why do we need a K-NN Algorithm?
Suppose there are two categories, Category A and Category B, and we have a new data point x1. In which of these categories will this data point lie? To solve this type of problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a particular data point.

How does K-NN work?
The K-NN working can be explained on the basis of the algorithm below:
○ Step-1: Select the number K of neighbours.
○ Step-2: Calculate the Euclidean distance from the new point to the data points.
○ Step-3: Take the K nearest neighbours as per the calculated Euclidean distance.
○ Step-4: Among these K neighbours, count the number of data points in each category.
○ Step-5: Assign the new data point to the category for which the number of neighbours is maximum.
○ Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category:
○ Firstly, we choose the number of neighbours, say k = 5.
○ Next, we calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, as studied in geometry; for points (x1, y1) and (x2, y2) it is sqrt((x2 - x1)^2 + (y2 - y1)^2).
○ By calculating the Euclidean distance we obtain the nearest neighbours: say three nearest neighbours in Category A and two nearest neighbours in Category B.
○ Since the 3 nearest neighbours are from Category A, the new data point must belong to Category A.
A code version of this k = 5 example appears after the advantages and disadvantages below.

Advantages of the KNN Algorithm:
○ It is simple to implement.
○ It is robust to noisy training data.
○ It can be more effective if the training data is large.
Disadvantages of the KNN Algorithm:
○ We always need to determine the value of K, which may be complex at times.
○ The computation cost is high because of calculating the distance between the new point and all the training samples.
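As referenced above, the worked k = 5 example can be reproduced with scikit-learn. The 2-D points for categories A and B below are illustrative assumptions.

```python
# Illustrative sketch: K-NN classification with k=5 (the points are assumed for demonstration).
from sklearn.neighbors import KNeighborsClassifier

# Training data: 2-D points labelled "A" or "B".
X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6], [2, 1], [7, 8]]
y = ["A", "A", "A", "B", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=5)   # Step-1: choose K
knn.fit(X, y)                               # "training" just stores the dataset (lazy learner)

new_point = [[3, 4]]
print(knn.predict(new_point))               # majority vote among the 5 nearest neighbours
print(knn.kneighbors(new_point))            # Euclidean distances and indices of those neighbours
```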
2. Naïve Bayes classification is a popular and simple probabilistic machine learning algorithm used for classification tasks. It is based on Bayes' theorem and assumes that the features are conditionally independent given the class. Despite its simplicity, Naïve Bayes can be highly effective, especially in text classification and spam filtering. Here's an explanation of Naïve Bayes classification with an example.

Bayes' Theorem:
○ Bayes' theorem, also known as Bayes' Rule or Bayes' law, is used to determine the probability of a hypothesis given prior knowledge. It depends on conditional probability.
○ The formula for Bayes' theorem is: P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood probability: the probability of the evidence given that the hypothesis is true.
P(A) is the Prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.

Example: Let's consider a simple example of classifying emails as either "spam" or "not spam" based on certain words that appear in the email.
1. Training Phase: In the training phase, we collect a labeled dataset consisting of emails and their corresponding class labels (spam or not spam). We also preprocess the data by tokenizing the emails into words and removing any irrelevant information. Suppose we have the following training data:
Email 1: "Get a free gift!" Class: Spam
Email 2: "Meeting at 3 pm." Class: Not spam
Email 3: "Claim your prize now!" Class: Spam
Email 4: "Reminder: Project deadline tomorrow." Class: Not spam
2. Calculating Class Priors: First, we calculate the prior probabilities of each class (spam and not spam). The prior probability of a class is the probability of an email being in that class without considering any features. In our training data we have 2 spam emails and 2 not-spam emails, so the class priors are:
P(Spam) = 2/4 = 0.5
P(Not Spam) = 2/4 = 0.5
3. Building Feature Models: Next, we build feature models by estimating the likelihood probabilities of each feature (word) given the class. In Naïve Bayes, we assume that the features (words) are conditionally independent given the class; this is known as the "naïve" assumption. For simplicity, let's assume that our feature set consists of only three words: "free," "meeting," and "claim." To calculate the likelihood probabilities, we count the number of occurrences of each word in each class and divide it by the total number of words in that class:
P(free|Spam) = 2/6 = 1/3, P(free|Not Spam) = 0/6 = 0
P(meeting|Spam) = 0/6 = 0, P(meeting|Not Spam) = 1/7 ≈ 0.143
P(claim|Spam) = 1/6 ≈ 0.167, P(claim|Not Spam) = 0/7 = 0
4. Classifying New Emails: Now suppose we have a new email, "Claim your free gift now!", and we want to classify it as spam or not spam using the Naïve Bayes classifier. To classify the new email, we calculate the posterior probability for each class given the features (words) in the email. The posterior probability is obtained by multiplying the class prior by the likelihood probabilities of each feature:
P(Spam|new email) ∝ P(Spam) * P(claim|Spam) * P(free|Spam) = 0.5 * 0.167 * 1/3 ≈ 0.0278
P(Not Spam|new email) ∝ P(Not Spam) * P(claim|Not Spam) * P(free|Not Spam) = 0.5 * 0 * 0 ≈ 0
Since the posterior probability for spam is higher than that for not spam, the new email is classified as spam.
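A compact way to reproduce this kind of word-count model is scikit-learn's MultinomialNB; the sketch below trains on the four example emails. Note that MultinomialNB applies Laplace smoothing (avoiding zero likelihoods) and uses every word rather than only three, so its probabilities will differ somewhat from the hand calculation above.

```python
# Illustrative sketch: Naive Bayes spam classification on the four example emails.
# MultinomialNB smooths counts, so the numbers differ slightly from the hand calculation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["Get a free gift!",
          "Meeting at 3 pm.",
          "Claim your prize now!",
          "Reminder: Project deadline tomorrow."]
labels = ["spam", "not spam", "spam", "not spam"]

vectorizer = CountVectorizer()        # tokenize emails into word counts
X = vectorizer.fit_transform(emails)

model = MultinomialNB()               # multinomial Naive Bayes with Laplace smoothing
model.fit(X, labels)

new_email = ["Claim your free gift now!"]
X_new = vectorizer.transform(new_email)
print(model.predict(X_new))           # expected: ['spam']
print(model.predict_proba(X_new))     # posterior probabilities per class
```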
3. Linear Regression vs Logistic Regression
Linear regression and logistic regression are two famous machine learning algorithms that come under the supervised learning technique. Since both algorithms are supervised in nature, they use labelled datasets to make predictions. The main difference between them is how they are used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. A description of both algorithms is given below, along with a list of differences.

Linear Regression:
○ Linear regression is one of the simplest machine learning algorithms. It comes under the supervised learning technique and is used for solving regression problems.
○ It is used for predicting a continuous dependent variable with the help of independent variables.
○ The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable.
○ If a single independent variable is used for prediction, it is called simple linear regression; if there is more than one independent variable, it is called multiple linear regression.
○ By finding the best-fit line, the algorithm establishes the relationship between the dependent variable and the independent variables, and this relationship should be linear in nature.
○ The output of linear regression should only be continuous values such as price, age, salary, etc.
○ For example, with salary (the dependent variable, on the Y-axis) plotted against experience (the independent variable, on the X-axis), the regression line can be written as y = a0 + a1x + ε, where a0 and a1 are the coefficients and ε is the error term.

Logistic Regression:
○ Logistic regression is one of the most popular machine learning algorithms that comes under the supervised learning technique.
○ It can be used for classification as well as for regression problems, but it is mainly used for classification problems.
○ Logistic regression is used to predict a categorical dependent variable with the help of independent variables.
○ The output of a logistic regression problem can only be between 0 and 1.
○ Logistic regression can be used where the probability of belonging to one of two classes is required, such as whether it will rain today or not: either 0 or 1, true or false, etc.
○ Logistic regression is based on the concept of maximum likelihood estimation; according to this estimation, the observed data should be the most probable.
○ In logistic regression, we pass the weighted sum of inputs through an activation function that maps values to between 0 and 1. This activation function is known as the sigmoid function, and the curve obtained is called the sigmoid curve or S-curve. The equation for logistic regression can therefore be written as y = 1 / (1 + e^-(a0 + a1x)).

Difference between Linear Regression and Logistic Regression:
○ Linear regression is used to predict a continuous dependent variable using a given set of independent variables, whereas logistic regression is used to predict a categorical dependent variable using a given set of independent variables.
○ Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
○ In linear regression, we predict the value of continuous variables; in logistic regression, we predict the values of categorical variables.
○ In linear regression, we find the best-fit line, by which we can easily predict the output; in logistic regression, we find the S-curve, by which we can classify the samples.
○ The least squares estimation method is used for estimation in linear regression, whereas the maximum likelihood estimation method is used in logistic regression.
○ The output of linear regression must be a continuous value, such as price or age, whereas the output of logistic regression must be a categorical value such as 0 or 1, Yes or No.
○ In linear regression, the relationship between the dependent variable and the independent variables must be linear; in logistic regression, a linear relationship between the dependent and independent variables is not required.
○ In linear regression, there may be collinearity between the independent variables; in logistic regression, there should not be collinearity between the independent variables.
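To contrast the two models in code, the sketch below fits a linear regression on a continuous target and a logistic regression on a binary target; the tiny datasets and feature meanings are illustrative assumptions, not part of the original notes.

```python
# Illustrative sketch: linear regression (continuous output) vs. logistic regression (0/1 output).
# The small datasets below are assumptions for demonstration only.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Linear regression: predict salary (continuous value) from years of experience.
X_exp = [[1], [2], [3], [4], [5], [6]]
salary = [30, 35, 41, 46, 52, 57]                      # in thousands
lin = LinearRegression().fit(X_exp, salary)
print("predicted salary:", lin.predict([[7]]))         # a continuous number

# Logistic regression: predict pass/fail (1/0, categorical) from hours studied.
X_hours = [[1], [2], [3], [4], [5], [6]]
passed = [0, 0, 0, 1, 1, 1]
log = LogisticRegression().fit(X_hours, passed)
print("predicted class:", log.predict([[3.5]]))          # 0 or 1
print("class probability:", log.predict_proba([[3.5]]))  # sigmoid output between 0 and 1
```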
4. Binary classification is a machine learning task that involves classifying data instances into one of two possible classes. The goal is to build a model that can learn from labeled training data to accurately predict the class of unseen instances. Here's an explanation of binary classification with an example.
Example: Let's consider a binary classification problem of predicting whether a bank loan applicant is "approved" or "rejected" based on certain attributes such as income, credit score, and loan amount.
1. Dataset: We have a labeled dataset that contains information about previous loan applicants along with their loan approval status. Each instance in the dataset consists of input features (income, credit score, loan amount) and the corresponding class label (approved or rejected).
2. Training Phase: In the training phase, we use the labeled dataset to train a binary classification model. The model learns to identify patterns and relationships between the input features and the corresponding class labels. For example, the model may learn that applicants with a higher income and a good credit score are more likely to be approved for a loan, while those with a low income and a poor credit score are more likely to be rejected.
3. Model Building: Based on the training data, we select an appropriate algorithm for binary classification, such as logistic regression, support vector machines (SVM), or decision trees. For instance, if we choose logistic regression, the algorithm will estimate the coefficients for the input features to create a decision boundary that separates the approved and rejected classes. This boundary will be based on the probabilities of class membership.
4. Feature Engineering: Before training the model, we may perform feature engineering to preprocess and transform the input features. This can include steps like normalization, handling missing values, or encoding categorical variables. For example, we may normalize the income and loan amount values to a standardized scale and encode categorical variables like employment type or education level using one-hot encoding.
5. Model Training: During the training phase, the binary classification model adjusts its parameters using an optimization algorithm to minimize the prediction error. The model learns to distinguish between the two classes based on the provided training data. The training process iteratively updates the model's parameters until it reaches a point where it minimizes the difference between predicted and actual class labels for the training instances.
6. Evaluation and Prediction: After training, we evaluate the performance of the binary classification model using a separate set of labeled test data. The model predicts the class labels for the test instances based on their input features, and we compare the predicted labels with the true labels to assess accuracy, precision, recall, and other performance metrics.
For example, we can evaluate how accurately the model predicts loan approval status for the test instances by calculating metrics like accuracy (the proportion of correctly predicted instances), precision (the proportion of true positives among predicted positives), recall (the proportion of true positives identified), and F1 score (a combined metric of precision and recall).
7. Model Deployment: Once the binary classification model has been trained and evaluated, it can be deployed to make predictions on new, unseen instances. It can classify new loan applicants as either approved or rejected based on their input features. This allows banks or financial institutions to automate the loan approval process, enabling faster and more efficient decision-making.
Binary classification is a fundamental task in machine learning with numerous applications across various domains, including finance, healthcare, marketing, and more.
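The loan-approval workflow described above can be sketched end to end with scikit-learn. The synthetic "applicant" data generated by make_classification and the choice of logistic regression are illustrative assumptions standing in for real features such as income, credit score, and loan amount.

```python
# Illustrative sketch: a binary classification pipeline (approved = 1, rejected = 0).
# The synthetic data and model choice are assumptions for demonstration only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Stand-in for applicant features (income, credit score, loan amount, ...).
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Feature scaling (feature engineering) + logistic regression (model building/training).
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Evaluation on held-out test data.
y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
```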
5. A decision tree is a popular supervised machine learning algorithm that is used for both classification and regression tasks. It creates a tree-like model of decisions and their possible consequences. Each internal node in the tree represents a feature or attribute, each branch represents a decision rule, and each leaf node represents an outcome or class label. Decision trees are easy to interpret and visualize, making them widely used in various domains. Here's an explanation of the procedure to construct a decision tree:
1. Dataset: Start with a labeled dataset that contains instances with input features and their corresponding class labels. Each instance should have a set of features and a known class label. For example, consider a dataset of patients with attributes like age, gender, symptoms, and a binary class label indicating whether they have a certain disease or not.
2. Select a Root Node: Choose an attribute from the dataset that will act as the root node of the decision tree. The attribute is selected based on various criteria, such as information gain, Gini impurity, or gain ratio. These criteria evaluate the effectiveness of an attribute in splitting the data and creating distinct classes.
3. Splitting the Data: Divide the dataset into subsets based on the values of the selected attribute (root node). Each subset represents a different branch from the root node. For example, if the root node is the "age" attribute, the data might be split into subsets for different age ranges.
4. Attribute Selection: Repeat the attribute selection process for each subset or branch. Choose the next attribute that best splits the data and creates pure or homogeneous subsets. This process is typically based on the same criteria used for selecting the root node.
5. Recursive Construction: Continue recursively splitting the data based on attribute selection until a stopping criterion is met. Stopping criteria can include reaching a maximum depth for the tree, having a minimum number of instances per leaf, or achieving a pure subset (where all instances belong to the same class).
6. Handling Missing Values: Decide how to handle instances with missing attribute values. This can be done by either ignoring those instances, replacing the missing values with the most common value of the attribute, or using more advanced imputation techniques.
7. Pruning (Optional): Pruning is an optional step to prevent overfitting and improve the generalization ability of the decision tree. Pruning involves removing branches or nodes from the tree that do not significantly contribute to its accuracy. This helps simplify the tree and reduces the risk of overfitting the training data.
8. Assigning Class Labels: Assign class labels to the leaf nodes of the decision tree based on the majority class in each leaf or by using probability-based rules.
9. Visualization and Interpretation: Visualize the constructed decision tree to gain insights into the decision-making process. The tree structure can be displayed graphically, showing the attributes, decision rules, and class labels at each node. This visualization aids in understanding the decision logic and provides a clear interpretation of the model's behavior.
10. Prediction and Evaluation: Finally, use the constructed decision tree to make predictions on new, unseen instances. Traverse the tree by following the decision rules based on the input features of the instance until reaching a leaf node, which provides the predicted class label. Evaluate the performance of the decision tree model using appropriate evaluation metrics such as accuracy, precision, recall, or F1 score.
The procedure above outlines the basic steps involved in constructing a decision tree. Various algorithms, such as ID3, C4.5, and CART, implement these steps with slight variations. Each algorithm may have different attribute selection measures, splitting rules, or pruning techniques, but the general idea remains the same: recursively split the data based on attribute selection until a stopping criterion is met, and assign class labels to the leaf nodes.
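As a concrete counterpart to the construction procedure above, the sketch below trains a small decision tree with scikit-learn (which implements a CART-style algorithm) and prints its rules. The toy "patient" dataset and the hyperparameter values are illustrative assumptions.

```python
# Illustrative sketch: constructing and inspecting a small decision tree (CART-style).
# The toy patient dataset is an assumption for demonstration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Features: [age, has_fever (0/1)]; label: 1 = disease, 0 = no disease.
X = [[25, 0], [40, 1], [35, 1], [50, 0], [60, 1], [30, 0], [55, 1], [45, 0]]
y = [0, 1, 1, 0, 1, 0, 1, 0]

tree = DecisionTreeClassifier(
    criterion="gini",      # attribute-selection measure (Gini impurity)
    max_depth=3,           # stopping criterion: maximum tree depth
    min_samples_leaf=1,    # stopping criterion: minimum instances per leaf
    random_state=0,
).fit(X, y)

print(export_text(tree, feature_names=["age", "has_fever"]))  # decision rules at each node
print(tree.predict([[42, 1]]))                                # class label from a leaf node
```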
The Minkowski distance between two points (x1, y1) and (x2, y2) is calculated as the p-th root of the sum of the p-th powers of the absolute differences: (|x2 - x1|^p + |y2 - y1|^p)^(1/p). The value of p determines the shape of the distance measure: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. 5. Mahalanobis Distance: Mahalanobis distance takes into account the correlation and covariance structure of the data. It measures the distance between a point and a distribution or a group of points. The Mahalanobis distance between a point and a distribution is calculated as the square root of the sum of the squared differences between the point's coordinates and the mean of the distribution, weighted by the inverse covariance matrix. It is useful when dealing with correlated features. These are some important distance-based methods used in classification and regression. They provide a flexible framework for analyzing data based on the concept of proximity. Understanding these methods will help you tackle questions related to distance-based algorithms in your semester exam.
Unit -3:1. Explain about AdaBoost ensemble and Gradient Boosting ensemble? 2. Differences between bagging and Boosting? 3. What is the difference between hard and soft voting classifiers? 4. Differences between decision tree and random forest? 5. Explain the Stacking? 6. Write a note on SVM regression? 7. Write about Naive Bayes Classifiers Vs SVM in text classification.
1. AdaBoost (Adaptive Boosting)
AdaBoost is a boosting ensemble model that works especially well with decision trees. A boosting model's key idea is learning from previous mistakes, e.g. misclassified data points. AdaBoost learns from the mistakes by increasing the weights of misclassified data points. Let's illustrate how AdaBoost adapts.
Step 0: Initialize the weights of the data points. If the training set has 100 data points, then each point's initial weight should be 1/100 = 0.01.
Step 1: Train a decision tree.
Step 2: Calculate the weighted error rate (e) of the decision tree. The weighted error rate (e) is the fraction of wrong predictions out of the total, where each wrong prediction is counted according to its data point's weight. The higher the weight, the more the corresponding error contributes to the calculation of (e).
Step 3: Calculate this decision tree's weight in the ensemble: weight of this tree = learning rate * log((1 - e) / e).
● The higher the weighted error rate of a tree, the less decision power the tree will be given during the later voting.
● The lower the weighted error rate of a tree, the more decision power the tree will be given during the later voting.
Step 4: Update the weights of the wrongly classified points:
● if the model got this data point right, its weight stays the same;
● if the model got this data point wrong, the new weight of this point = old weight * exp(weight of this tree).
Note: the higher the weight of the tree (i.e., the more accurately the tree performs), the more boost (importance) the data points misclassified by this tree will get. The weights of the data points are normalized after all the misclassified points have been updated.
Step 5: Repeat from Step 1 until the number of trees we set to train is reached.
Step 6: Make the final prediction. AdaBoost makes a new prediction by summing, over all trees, each tree's weight multiplied by that tree's prediction, so a tree with a higher weight has more influence on the final decision.
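To make the steps above concrete, here is a minimal sketch using scikit-learn's AdaBoostClassifier with decision stumps as the weak learners; the dataset, hyperparameters, and variable names are illustrative choices, not part of the notes.

```python
# Minimal AdaBoost sketch with scikit-learn (illustrative data and settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy binary-classification data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each weak learner is a depth-1 tree ("stump"); AdaBoost reweights the
# misclassified points between rounds, as described in Steps 2-5 above.
# Note: on scikit-learn versions older than 1.2 this argument is named base_estimator.
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=0.5,
    random_state=42,
)
ada.fit(X_train, y_train)
print("AdaBoost test accuracy:", round(ada.score(X_test, y_test), 3))
```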
Gradient Boosting
Gradient Boosting is another boosting model. Remember, a boosting model's key idea is learning from the previous mistakes. Gradient Boosting learns from the mistakes, the residual errors, directly, rather than updating the weights of the data points. Let's illustrate how Gradient Boosting learns.
Step 1: Train a decision tree.
Step 2: Apply the decision tree just trained to make predictions.
Step 3: Calculate the residuals of this decision tree and save the residual errors as the new y.
Step 4: Repeat from Step 1 until the number of trees we set to train is reached.
Step 5: Make the final prediction. Gradient Boosting makes a new prediction by simply adding up the predictions of all the trees.
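The residual-fitting loop above can be written out directly. The following is a minimal hand-rolled sketch for regression with squared loss (toy data and hyperparameters are illustrative); in practice one would typically use sklearn.ensemble.GradientBoostingRegressor, which implements the same idea with extra refinements.

```python
# A minimal hand-rolled sketch of gradient boosting for regression (squared loss):
# each new tree is fitted to the current residuals, mirroring the steps above.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)

n_trees, learning_rate = 100, 0.1            # illustrative hyperparameters
prediction = np.zeros_like(y, dtype=float)   # start from an all-zero prediction
trees = []

for _ in range(n_trees):
    residuals = y - prediction                      # Step 3: residuals become the new target
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, residuals)                          # Steps 1-2: train a tree on the residuals
    prediction += learning_rate * tree.predict(X)   # Step 5: predictions add up across trees
    trees.append(tree)

print("training MSE:", round(float(np.mean((y - prediction) ** 2)), 2))
```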
2. Bagging vs Boosting in Machine Learning
As we know, ensemble learning helps improve machine learning results by combining several models. This approach allows better predictive performance compared to a single model. The basic idea is to learn a set of classifiers (experts) and to allow them to vote. Bagging and Boosting are two types of ensemble learning. Both decrease the variance of a single estimate, since they combine several estimates from different models, so the result may be a model with higher stability. Let's understand these two terms at a glance.
1. Bagging: A homogeneous weak learners' model in which the learners are trained independently of each other, in parallel, and combined to determine the model average.
2. Boosting: Also a homogeneous weak learners' model, but it works differently from bagging. In this model, learners are trained sequentially and adaptively to improve the predictions of the learning algorithm.
Let's look at both of them in detail and understand the difference between Bagging and Boosting.
Bagging
Bootstrap Aggregating, also known as bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It decreases the variance and helps to avoid overfitting. It is usually applied to decision tree methods. Bagging is a special case of the model averaging approach.
Description of the Technique: Suppose a set D of d tuples; at each iteration i, a training set Di of d tuples is selected from D via row sampling with replacement (i.e., a bootstrap sample, so there can be repeated tuples). Then a classifier model Mi is learned on each training set Di. Each classifier Mi returns its class prediction, and the bagged classifier M* counts the votes and assigns the class with the most votes to X (the unknown sample).
Implementation Steps of Bagging:
● Step 1: Multiple subsets are created from the original data set with an equal number of tuples, selecting observations with replacement.
● Step 2: A base model is created on each of these subsets.
● Step 3: Each model is learned in parallel on its own training set, independently of the others.
● Step 4: The final predictions are determined by combining the predictions from all the models.
[Illustration: the concept of bootstrap aggregating (bagging).]
Example of Bagging: The Random Forest model uses bagging, where decision tree models with higher variance are present. It makes random feature selections to grow the trees, and several such random trees make a Random Forest.
Boosting
Boosting is an ensemble modeling technique that attempts to build a strong classifier from a number of weak classifiers. It is done by building a model using weak models in series. Firstly, a model is built from the training data. Then a second model is built, which tries to correct the errors present in the first model. This procedure is continued and models are added until either the complete training data set is predicted correctly or the maximum number of models is added.
Boosting Algorithms: There are several boosting algorithms. The original ones, proposed by Robert Schapire and Yoav Freund, were not adaptive and could not take full advantage of the weak learners. Schapire and Freund then developed AdaBoost, an adaptive boosting algorithm that won the prestigious Gödel Prize. AdaBoost was the first really successful boosting algorithm developed for binary classification. AdaBoost is short for Adaptive Boosting and is a very popular boosting technique that combines multiple "weak classifiers" into a single "strong classifier".
Algorithm:
1. Initialise the dataset and assign equal weight to each data point.
2. Provide this as input to the model and identify the wrongly classified data points.
3. Increase the weights of the wrongly classified data points, decrease the weights of the correctly classified data points, and then normalize the weights of all data points.
4. If the required results are obtained, go to step 5; else go to step 2.
5. End.
[Illustration: the intuition behind the boosting algorithm, with weak learners trained on a weighted dataset.]
Similarities Between Bagging and Boosting
Bagging and Boosting, both being commonly used methods, share the universal similarity of being classified as ensemble methods. Here are the similarities between them:
1. Both are ensemble methods to get N learners from 1 learner.
2. Both generate several training data sets by random sampling.
3. Both make the final decision by averaging the N learners (or taking the majority of them, i.e., majority voting).
4. Both are good at reducing variance and provide higher stability.
Differences Between Bagging and Boosting
1. Bagging: The simplest way of combining predictions that belong to the same type. Boosting: A way of combining predictions that belong to different types.
2. Bagging: Aims to decrease variance, not bias. Boosting: Aims to decrease bias, not variance.
3. Bagging: Each model receives equal weight. Boosting: Models are weighted according to their performance.
4. Bagging: Each model is built independently. Boosting: New models are influenced by the performance of previously built models.
5. Bagging: Different training data subsets are selected using row sampling with replacement from the entire training dataset. Boosting: Every new subset contains the elements that were misclassified by previous models.
6. Bagging: Tries to solve the over-fitting problem. Boosting: Tries to reduce bias.
7. Bagging: If the classifier is unstable (high variance), then apply bagging. Boosting: If the classifier is stable and simple (high bias), then apply boosting.
8. Bagging: Base classifiers are trained in parallel. Boosting: Base classifiers are trained sequentially.
9. Bagging example: The Random Forest model uses bagging. Boosting example: AdaBoost uses boosting techniques.
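To make the contrast in the table concrete, the short sketch below cross-validates the two examples named in the last row, Random Forest (bagging) and AdaBoost (boosting), on the same toy data; the dataset and settings are illustrative assumptions.

```python
# Comparing the table's own examples: Random Forest (bagging) vs AdaBoost (boosting).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)

models = {
    "Random Forest (bagging of deep trees)": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost (boosting of decision stumps)": AdaBoostClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```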
3. Hard and soft voting classifiers are two approaches used in ensemble learning, where multiple individual classifiers are combined to make predictions. The main difference between hard and soft voting classifiers lies in how they aggregate the predictions of the individual classifiers. Here's a breakdown of each approach:
1. Hard Voting Classifier: In a hard voting classifier, the final prediction is made by taking a majority vote among the predictions of the individual classifiers. Each classifier in the ensemble contributes one vote, and the class label that receives the most votes is selected as the final prediction. For example, consider an ensemble of three classifiers: Classifier A predicts class 1, Classifier B predicts class 2, and Classifier C predicts class 1. In hard voting, the majority class is determined by counting the votes, and the final prediction would be class 1 since it received two out of three votes. Hard voting classifiers are suitable when the individual classifiers are equally weighted and the focus is on the majority opinion of the ensemble. Hard voting can be effective in situations where the individual classifiers have complementary strengths and weaknesses, leading to a more robust and accurate prediction.
2. Soft Voting Classifier: In a soft voting classifier, the final prediction is made based on the average (or weighted average) of the predicted probabilities for each class from the individual classifiers. Rather than considering just the class labels, soft voting takes into account the confidence or probability estimates associated with each class. For example, suppose we have an ensemble of three classifiers that provide predicted probabilities for class 1 and class 2. Classifier A predicts (0.8, 0.2), Classifier B predicts (0.6, 0.4), and Classifier C predicts (0.7, 0.3). In soft voting, the probabilities are averaged for each class, resulting in (0.7, 0.3). The class with the highest average probability, in this case class 1, would be selected as the final prediction. Soft voting classifiers consider more nuanced information from the individual classifiers, allowing them to capture confidence levels and subtle differences in probabilities. This approach can be beneficial when the individual classifiers provide probability estimates and the goal is to make more informed and calibrated predictions.
In summary, the main difference between hard and soft voting classifiers is that hard voting relies on majority voting based on class labels, while soft voting considers the predicted probabilities of the individual classifiers to make a more nuanced decision. The choice between these approaches depends on the specific problem and the characteristics of the individual classifiers in the ensemble.
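Both behaviours are available in scikit-learn's VotingClassifier via the voting parameter; the sketch below is a minimal illustration, and the base estimators and data are arbitrary choices rather than anything prescribed by the notes.

```python
# A minimal sketch of hard vs soft voting with scikit-learn's VotingClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

estimators = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("dt", DecisionTreeClassifier(max_depth=5)),
    ("nb", GaussianNB()),
]

# Hard voting: majority vote over the predicted class labels.
hard = VotingClassifier(estimators=estimators, voting="hard").fit(X_train, y_train)
# Soft voting: average the predicted class probabilities, then take the argmax.
soft = VotingClassifier(estimators=estimators, voting="soft").fit(X_train, y_train)

print("hard voting accuracy:", round(hard.score(X_test, y_test), 3))
print("soft voting accuracy:", round(soft.score(X_test, y_test), 3))
```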
4. Differences between a decision tree and a random forest:
1. Random Forest: While building a random forest, the rows of the dataset are selected randomly, and several decision trees are built and combined to find the output. Decision Tree: Whereas a single decision tree is built on the given data.
2. Random Forest: It combines two or more decision trees together. Decision Tree: Whereas a decision tree is a collection of variables or attributes of the data set.
3. Random Forest: It gives accurate results. Decision Tree: Whereas it gives less accurate results.
4. Random Forest: By using multiple trees, it reduces the chances of overfitting. Decision Tree: On the other hand, a decision tree has the possibility of overfitting, which is an error that occurs due to variance or due to bias.
5. Random Forest: A random forest is more complicated to interpret. Decision Tree: Whereas a decision tree is simple, so it is easy to read and understand.
6. Random Forest: In a random forest, we need to generate, process, and analyze many trees, so the process is slow; it may take an hour or even days. Decision Tree: A decision tree is less accurate but processes fast, which means it is fast to implement.
7. Random Forest: It has more computation because it has n decision trees, and more decision trees mean more computation. Decision Tree: Whereas it has less computation.
8. Random Forest: It has complex visualization, but this plays an important role in showing the hidden patterns behind the data. Decision Tree: On the other hand, it is simple to visualize because we just need to fit the decision tree model.
9. Random Forest: Classification and regression problems can be solved by using random forest. Decision Tree: Whereas a decision tree is also used to solve classification and regression problems.
10. Random Forest: It uses the random subspace method and bagging during tree construction and has built-in feature importance. Decision Tree: Whereas a decision is made based on the selected sample's features, and decision tree learning is the process of finding the optimal split value for each internal tree node.
5. Stacking in Machine Learning
There are many ways to ensemble models in machine learning, such as Bagging, Boosting, and Stacking. Stacking is one of the most popular ensemble machine learning techniques; it combines the predictions of multiple models to build a new model and improve model performance. Stacking enables us to train multiple models to solve similar problems and, based on their combined output, it builds a new model with improved performance. In this topic, "Stacking in Machine Learning", we will discuss a few important concepts related to stacking, the general architecture of stacking, important key points to implement stacking, and how stacking differs from bagging and boosting in machine learning. Before starting this topic, first understand the concept of an ensemble in machine learning. So, let's start with the definition of ensemble learning in machine learning.
What is Ensemble Learning in Machine Learning?
Ensemble learning is one of the most powerful machine learning techniques; it uses the combined output of two or more models/weak learners to solve a particular computational intelligence problem. E.g., a Random Forest algorithm is an ensemble of various decision trees combined. Ensemble learning is primarily used to improve model performance in tasks such as classification, prediction, function approximation, etc. In simple words, we can summarise ensemble learning as follows: "An ensembled model is a machine learning model that combines the predictions from two or more models."
There are 3 most common ensemble learning methods in machine learning. These are as follows:
○ Bagging
○ Boosting
○ Stacking
However, we will mainly discuss Stacking in this topic.
1. Bagging: Bagging is a method of ensemble modeling, which is primarily used to solve supervised machine learning problems. It is generally completed in two steps as follows:
○ Bootstrapping: It is a random sampling method that is used to derive samples from the data using the replacement procedure. In this method, random data samples are first fed to the primary model, and then a base learning algorithm is run on the samples to complete the learning process.
○ Aggregation: This step involves combining the output of all base models and, based on their output, predicting an aggregate result with greater accuracy and reduced variance.
Example: In the Random Forest method, predictions from multiple decision trees are ensembled in parallel. Further, in regression problems, we use an average of these predictions to get the final output, whereas, in classification problems, the majority class among the predictions is selected as the predicted class.
2. Boosting: Boosting is an ensemble method that enables each member to learn from the preceding member's mistakes and make better predictions for the future.
Unlike the bagging method, in boosting, all base learners (weak) are arranged in a sequential format so that they can learn from the mistakes of their preceding learner. Hence, in this way, all weak learners get turned into strong learners and make a better predictive model with significantly improved performance. We have a basic understanding of ensemble techniques in machine learning and their two common methods, i.e., bagging and boosting. Now, let's discuss a different paradigm of ensemble learning, i.e., Stacking. 3. Stacking Stacking is one of the popular ensemble modeling techniques in machine learning. Various weak learners are ensembled in a parallel manner in such a way that by combining them with Meta learners, we can predict better predictions for the future. This ensemble technique works by applying input of combined multiple weak learners' predictions and Meta learners so that a better output prediction model can be achieved. In stacking, an algorithm takes the outputs of sub-models as input and attempts to learn how to best combine the input predictions to make a better output prediction. Stacking is also known as a stacked generalization and is an extended form of the Model Averaging Ensemble technique in which all sub-models equally participate as per their performance weights and build a new model with better predictions. This new model is stacked up on top of the others; this is the reason why it is named stacking. Architecture of Stacking The architecture of the stacking model is designed in such as way that it consists of two or more base/learner's models and a meta-model that combines the predictions of the base models. These base models are called level 0 models, and the meta-model is known as the level 1 model. So, the Stacking ensemble method includes original (training) data, primary level models, primary level prediction, secondary level model, and final prediction. The basic architecture of stacking can be represented as shown below the image. ○ Original data: This data is divided into n-folds and is also considered test data or training data. ○ Base models: These models are also referred to as level-0 models. These models use training data and provide compiled predictions (level-0) as an output. ○ Level-0 Predictions: Each base model is triggered on some training data and provides different predictions, which are known as level-0 predictions. ○ Meta Model: The architecture of the stacking model consists of one meta-model, which helps to best combine the predictions of the base models. The meta-model is also known as the level-1 model. ○ Level-1 Prediction: The meta-model learns how to best combine the predictions of the base models and is trained on different predictions made by individual base models, i.e., data not used to train the base models are fed to the meta-model, predictions are made, and these predictions, along with the expected outputs, provide the input and output pairs of the training dataset used to fit the meta-model. Steps to implement Stacking models: There are some important steps to implementing stacking models in machine learning. These are as follows: ○ Split training data sets into n-folds using the RepeatedStratifiedKFold as this is the most common approach to preparing training datasets for meta-models. ○ Now the base model is fitted with the first fold, which is n-1, and it will make predictions for the nth folds. ○ The prediction made in the above step is added to the x1_train list. 
○ Repeat steps 2 & 3 for remaining n-1folds, so it will give x1_train array of size n, ○ Now, the model is trained on all the n parts, which will make predictions for the sample data. ○ Add this prediction to the y1_test list. ○ In the same way, we can find x2_train, y2_test, x3_train, and y3_test by using Model 2 and 3 for training, respectively, to get Level 2 predictions. ○ Now train the Meta model on level 1 prediction, where these predictions will be used as features for the model. ○ Finally, Meta learners can now be used to make a prediction on test data in the stacking model. Summary of Stacking Ensemble Stacking is an ensemble method that enables the model to learn how to use combine predictions given by learner models with meta-models and prepare a final model with accurate prediction. The main benefit of stacking ensemble is that it can shield the capabilities of a range of well-performing models to solve classification and regression problems. Further, it helps to prepare a better model having better predictions than all individual models. In this topic, we have learned various ensemble techniques and their definitions, the stacking ensemble method, the architecture of stacking models, and steps to implement stacking models in machine learning. 6. 7. Naive Bayes classifiers and Support Vector Machines (SVM) are two popular machine learning algorithms commonly used for text classification tasks. While both approaches can be effective, they have distinct characteristics and operate based on different principles. Here's a comparison of Naive Bayes classifiers and SVM in the context of text classification: Naive Bayes Classifiers: - Naive Bayes classifiers are based on the probabilistic principle of Bayes' theorem. - They assume that the features (words or tokens) in the input text are conditionally independent, given the class label. - Naive Bayes classifiers are computationally efficient and require a relatively small amount of training data. - They perform well even in situations with high-dimensional feature spaces, such as text classification tasks. - Naive Bayes classifiers are known for their simplicity and interpretability, as they provide clear probabilistic predictions. - However, they may struggle when the independence assumption is violated or when dealing with rare or unseen words. Support Vector Machines (SVM): - SVM is a discriminative algorithm that aims to find an optimal hyperplane to separate data points of different classes. - SVMs map the input text into a high-dimensional feature space, where the data can be linearly separable. - They work well in situations where the number of features is large and the data is not linearly separable in the original space. - SVMs can handle both linear and non-linear decision boundaries through the use of different kernel functions. - They are effective in dealing with high-dimensional text data and can handle large-scale text classification tasks. - SVMs are less interpretable than Naive Bayes classifiers, as they do not provide direct probability estimates. - However, they can be computationally expensive, especially when dealing with large datasets. Choosing between Naive Bayes classifiers and SVMs for text classification depends on various factors: - Dataset Size: Naive Bayes classifiers can work well with small training datasets, while SVMs can handle larger datasets. - Data Characteristics: If the independence assumption of Naive Bayes is reasonable and the classes are well-separated, it can be a good choice. 
If the data is complex or nonlinear, SVMs may be more suitable. - Interpretability: If interpretability is important, Naive Bayes classifiers provide clear probabilistic predictions, whereas SVMs focus on maximizing classification performance. - Computational Efficiency: Naive Bayes classifiers are generally faster to train and require less computational resources compared to SVMs. In summary, Naive Bayes classifiers are simple, interpretable, and computationally efficient, making them suitable for text classification tasks with smaller datasets. SVMs, on the other hand, are powerful models that can handle large-scale text classification tasks and nonlinear data, but they are less interpretable and can be computationally expensive. The choice between the two depends on the specific requirements and characteristics of the text classification problem at hand. Unit -4:1. Write about DBSCAN and Gaussian Mixtures? 2. Write the Main Approaches for Dimensionality Reduction? 3. Explain K-Means clustering with an Example? 4. How to implement PCA using Sci-Kit learn? 5. Write a note on Randomized PCA and Kernel PCA? 1. Certainly! Here's a more detailed explanation of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and Gaussian Mixtures: DBSCAN: DBSCAN is a density-based clustering algorithm that groups data points based on their density and proximity. It is particularly effective for discovering clusters of arbitrary shape within a dataset. Here's how DBSCAN works: 1. Density-Based Clustering: DBSCAN defines clusters as dense regions of data points separated by regions of lower density. It considers two important parameters: epsilon (ε), which represents the radius within which neighboring points are considered, and minPts, which is the minimum number of points required to form a dense region. 2. Core Points, Border Points, and Noise: DBSCAN identifies three types of points: - Core Points: A data point is considered a core point if it has at least minPts points within a distance of ε. - Border Points: A data point is considered a border point if it has fewer than minPts points within ε but is within ε of a core point. - Noise Points: A data point is considered noise if it is neither a core point nor a border point. 3. Clustering Process: The clustering process of DBSCAN involves the following steps: - Initially, all points are marked as unvisited. - A core point is randomly chosen, and its ε-neighborhood is explored. - If the ε-neighborhood contains at least minPts points, a new cluster is created. - The ε-neighborhood points are added to the cluster, and their ε-neighborhoods are further explored recursively until no more points can be added. - The process repeats with unvisited core points until all points are visited. 4. Resulting Clusters: DBSCAN outputs clusters as connected components of core points and their reachable neighbors. Border points may belong to more than one cluster, while noise points do not belong to any cluster. Gaussian Mixtures: Gaussian Mixtures, also known as Gaussian Mixture Models (GMM), is a probabilistic model that assumes the data is generated from a mixture of Gaussian distributions. It is a parametric approach to clustering that models the underlying probability distribution of the data. Here's how Gaussian Mixtures work: 1. Probability Distribution Modeling: Gaussian Mixtures model the data as a weighted sum of multiple Gaussian distributions, where each distribution represents a component or cluster. 
The model assumes that each data point is generated by one of these Gaussian components. 2. Parameter Estimation: The parameters of a Gaussian Mixture model include the means, covariances, and weights of the Gaussian components. These parameters are estimated using the Expectation-Maximization (EM) algorithm, which iteratively maximizes the likelihood of the data. 3. Soft Assignments: Unlike DBSCAN, which assigns data points to discrete clusters, Gaussian Mixtures provide soft assignments. Each data point is assigned probabilities indicating the likelihood of belonging to each cluster. 4. Cluster Assignment: To assign data points to clusters, a threshold can be set on the probabilities. Points with high probabilities for a particular cluster are assigned to that cluster. Alternatively, the most likely cluster can be determined based on the maximum probability. 5. Flexibility and Complexity: Gaussian Mixtures can model complex and overlapping clusters due to the flexibility of the underlying Gaussian distributions. However, the number of components or clusters needs to be specified in advance, which can be a limitation. In summary, DBSCAN is a density-based clustering algorithm that identifies dense regions in the data, while Gaussian Mixtures model the data as a mixture of Gaussian distributions. DBSCAN discovers clusters based on density and proximity, whereas Gaussian Mixtures assume the data is generated from Gaussian components. DBSCAN is particularly useful for finding clusters of arbitrary shape, while Gaussian Mixtures can handle complex and overlapping clusters. The choice between the two depends on the characteristics of the dataset and the specific requirements of the clustering task at hand.
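A minimal scikit-learn sketch contrasting the two algorithms on a toy "two moons" dataset follows; the data, the eps/min_samples values, and the number of components are illustrative assumptions rather than values given in the notes.

```python
# A minimal sketch contrasting DBSCAN and a Gaussian Mixture Model with scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Two-moon data: non-convex clusters of arbitrary shape (illustrative toy dataset).
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# DBSCAN: eps is the neighborhood radius, min_samples is minPts; label -1 marks noise.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("DBSCAN cluster labels found:", set(db_labels))

# Gaussian Mixture: the number of components must be chosen in advance;
# predict_proba gives the soft (probabilistic) assignments described above.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("GMM hard assignments (first 10 points):", gmm.predict(X)[:10])
print("GMM soft assignment for the first point:", gmm.predict_proba(X)[0].round(3))
```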
2. Introduction to Dimensionality Reduction Techniques
What is Dimensionality Reduction? The number of input features, variables, or columns present in a given dataset is known as its dimensionality, and the process of reducing these features is called dimensionality reduction. In many cases a dataset contains a huge number of input features, which makes the predictive modeling task more complicated: it is very difficult to visualize or make predictions for a training dataset with a high number of features, and for such cases dimensionality reduction techniques are required. A dimensionality reduction technique can be defined as "a way of converting a higher-dimensional dataset into a lower-dimensional dataset while ensuring that it provides similar information." These techniques are widely used in machine learning for obtaining a better-fitting predictive model while solving classification and regression problems. They are commonly used in fields that deal with high-dimensional data, such as speech recognition, signal processing, bioinformatics, etc. They can also be used for data visualization, noise reduction, cluster analysis, etc.
The Curse of Dimensionality: Handling high-dimensional data is very difficult in practice; this is commonly known as the curse of dimensionality. If the dimensionality of the input dataset increases, any machine learning algorithm and model becomes more complex. As the number of features increases, the number of samples needed to cover the feature space also increases, and the chance of overfitting increases. If a machine learning model is trained on high-dimensional data, it can become overfitted and give poor performance. Hence, it is often required to reduce the number of features, which can be done with dimensionality reduction.
Benefits of applying Dimensionality Reduction: Some benefits of applying a dimensionality reduction technique to a given dataset are:
○ By reducing the dimensions of the features, the space required to store the dataset is also reduced.
○ Less computation/training time is required for a reduced number of feature dimensions.
○ Reduced feature dimensions help in visualizing the data quickly.
○ It removes redundant features (if present) by taking care of multicollinearity.
Disadvantages of Dimensionality Reduction: There are also some disadvantages of applying dimensionality reduction:
○ Some information may be lost due to dimensionality reduction.
○ In the PCA dimensionality reduction technique, the number of principal components to consider is sometimes unknown.
Approaches to Dimensionality Reduction: There are two ways to apply dimensionality reduction, which are given below:
Feature Selection: Feature selection is the process of selecting a subset of the relevant features and leaving out the irrelevant features present in a dataset, in order to build a model of high accuracy. In other words, it is a way of selecting the optimal features from the input dataset. Three kinds of methods are used for feature selection:
1. Filter Methods: In this method, the dataset is filtered and a subset that contains only the relevant features is taken. Some common techniques of filter methods are:
○ Correlation
○ Chi-Square Test
○ ANOVA
○ Information Gain, etc.
2. Wrapper Methods: The wrapper method has the same goal as the filter method, but it uses a machine learning model for its evaluation. In this method, some features are fed to the ML model and its performance is evaluated; the performance decides whether to add or remove those features to increase the accuracy of the model. This method is more accurate than the filter method but more complex to apply. Some common techniques of wrapper methods are:
○ Forward Selection
○ Backward Selection
○ Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the machine learning model and evaluate the importance of each feature. Some common techniques of embedded methods are:
○ LASSO
○ Elastic Net
○ Ridge Regression, etc.
Feature Extraction: Feature extraction is the process of transforming a space with many dimensions into a space with fewer dimensions. This approach is useful when we want to keep as much of the information as possible while using fewer resources to process it. Some common feature extraction techniques are:
a. Principal Component Analysis
b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis
Common techniques of Dimensionality Reduction:
a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder
Principal Component Analysis (PCA): Principal Component Analysis is a statistical process that converts the observations of correlated features into a set of linearly uncorrelated features with the help of an orthogonal transformation. These new transformed features are called the Principal Components. It is one of the popular tools used for exploratory data analysis and predictive modeling. PCA works by considering the variance of each attribute, because attributes with high variance show a good separation between the classes, and hence it reduces the dimensionality.
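Since question 4 of this unit asks how to implement PCA with scikit-learn, here is a minimal sketch on the Iris dataset; the dataset and the choice of two components are illustrative assumptions.

```python
# A minimal sketch of PCA with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)

# Keep 2 principal components (n_components can also be a variance ratio, e.g. 0.95).
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print("reduced shape:", X_reduced.shape)   # (150, 2)
print("explained variance ratio:", pca.explained_variance_ratio_.round(3))
```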
Some real-world applications of PCA are image processing, movie recommendation system, optimizing the power allocation in various communication channels. Backward Feature Elimination The backward feature elimination technique is mainly used while developing Linear Regression or Logistic Regression model. Below steps are performed in this technique to reduce the dimensionality or in feature selection: ○ In this technique, firstly, all the n variables of the given dataset are taken to train the model. ○ The performance of the model is checked. ○ Now we will remove one feature each time and train the model on n-1 features for n times, and will compute the performance of the model. ○ We will check the variable that has made the smallest or no change in the performance of the model, and then we will drop that variable or features; after that, we will be left with n-1 features. ○ Repeat the complete process until no feature can be dropped. In this technique, by selecting the optimum performance of the model and maximum tolerable error rate, we can define the optimal number of features require for the machine learning algorithms. Forward Feature Selection Forward feature selection follows the inverse process of the backward elimination process. It means, in this technique, we don't eliminate the feature; instead, we will find the best features that can produce the highest increase in the performance of the model. Below steps are performed in this technique: ○ We start with a single feature only, and progressively we will add each feature at a time. ○ Here we will train the model on each feature separately. ○ The feature with the best performance is selected. ○ The process will be repeated until we get a significant increase in the performance of the model. Missing Value Ratio If a dataset has too many missing values, then we drop those variables as they do not carry much useful information. To perform this, we can set a threshold level, and if a variable has missing values more than that threshold, we will drop that variable. The higher the threshold value, the more efficient the reduction. Low Variance Filter As same as missing value ratio technique, data columns with some changes in the data have less information. Therefore, we need to calculate the variance of each variable, and all data columns with variance lower than a given threshold are dropped because low variance features will not affect the target variable. High Correlation Filter High Correlation refers to the case when two variables carry approximately similar information. Due to this factor, the performance of the model can be degraded. This correlation between the independent numerical variable gives the calculated value of the correlation coefficient. If this value is higher than the threshold value, we can remove one of the variables from the dataset. We can consider those variables or features that show a high correlation with the target variable. Random Forest Random Forest is a popular and very useful feature selection algorithm in machine learning. This algorithm contains an in-built feature importance package, so we do not need to program it separately. In this technique, we need to generate a large set of trees against the target variable, and with the help of usage statistics of each attribute, we need to find the subset of features. Random forest algorithm takes only numerical variables, so we need to convert the input data into numeric data using hot encoding. 
Factor Analysis Factor analysis is a technique in which each variable is kept within a group according to the correlation with other variables, it means variables within a group can have a high correlation between themselves, but they have a low correlation with variables of other groups. We can understand it by an example, such as if we have two variables Income and spend. These two variables have a high correlation, which means people with high income spends more, and vice versa. So, such variables are put into a group, and that group is known as the factor. The number of these factors will be reduced as compared to the original dimension of the dataset. Auto-encoders One of the popular methods of dimensionality reduction is auto-encoder, which is a type of ANN or artificial neural network, and its main aim is to copy the inputs to their outputs. In this, the input is compressed into latent-space representation, and output is occurred using this representation. It has mainly two parts: ○ Encoder: The function of the encoder is to compress the input to form the latent-space representation. ○ Decoder: The function of the decoder is to recreate the output from the latent-space representation. 3. K-Means Clustering Algorithm K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering problems in machine learning or data science. In this topic, we will learn what is K-means clustering algorithm, how the algorithm works, along with the Python implementation of k-means clustering. What is K-Means Algorithm? K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be three clusters, and so on. It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each dataset belongs only one group that has similar properties. It allows us to cluster the data into different groups and a convenient way to discover the categories of groups in the unlabeled dataset on its own without the need for any training. It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters. The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of clusters, and repeats the process until it does not find the best clusters. The value of k should be predetermined in this algorithm. The k-means clustering algorithm mainly performs two tasks: ○ Determines the best value for K center points or centroids by an iterative process. ○ Assigns each data point to its closest k-center. Those data points which are near to the particular k-center, create a cluster. Hence each cluster has datapoints with some commonalities, and it is away from other clusters. The below diagram explains the working of the K-means Clustering Algorithm: How does the K-Means Algorithm Work? The working of the K-Means algorithm is explained in the below steps: Step-1: Select the number K to decide the number of clusters. Step-2: Select random K points or centroids. (It can be other from the input dataset). Step-3: Assign each data point to their closest centroid, which will form the predefined K clusters. Step-4: Calculate the variance and place a new centroid of each cluster. 
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of its cluster. Step-6: If any reassignment occurs, then go to Step-4; else go to FINISH. Step-7: The model is ready.
Let's understand the above steps with a visual walkthrough. Suppose we have two variables, M1 and M2, plotted on an x-y scatter plot.
○ Let's take the number of clusters k = 2, so we will try to group the dataset into two different clusters.
○ We need to choose some random k points or centroids to form the clusters. These points can either be points from the dataset or any other points; here we select two points that are not part of our dataset.
○ Now we assign each data point of the scatter plot to its closest centroid. We compute this by calculating the distance between points and drawing a median line between the two centroids: points on the left side of the line are nearer to the first (blue) centroid, and points to the right of the line are closer to the second (yellow) centroid, so we color them blue and yellow for clear visualization.
○ As we need to find the closest clusters, we repeat the process by choosing new centroids: the new centroid of each cluster is the center of gravity of its points.
○ Next, we reassign each data point to the new closest centroid. For this, we repeat the same process of finding a median line between the new centroids. After this step some points change sides (for example, one yellow point falls on the left side of the line and two blue points fall on the right side), so these points are assigned to the other centroid. Since reassignment has taken place, we go back to Step-4 and find new centroids.
○ We repeat the process of computing the center of gravity of each cluster, obtaining new centroids, drawing the median line, and reassigning the data points.
○ When no points remain on the wrong side of the line, no further reassignment occurs and the model has converged.
As our model is ready, we can now remove the assumed centroids, and we are left with the two final clusters.
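A minimal scikit-learn version of the walkthrough above, using k = 2 on synthetic blob data; the dataset and parameters are illustrative assumptions.

```python
# A minimal sketch of k-means clustering with scikit-learn (k = 2, toy blobs).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two well-separated blobs stand in for the M1/M2 scatter plot described above.
X, _ = make_blobs(n_samples=300, centers=2, cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # assign each point to its nearest centroid

print("final centroids:\n", kmeans.cluster_centers_)
print("inertia (sum of squared distances to centroids):", round(kmeans.inertia_, 2))
```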
4. PCA can be implemented with scikit-learn's sklearn.decomposition.PCA class: create a PCA object with the desired number of components (or a target explained-variance ratio), call fit_transform() on the (preferably standardized) feature matrix to obtain the projected data, and inspect explained_variance_ratio_ to see how much variance each principal component captures, as illustrated in the scikit-learn sketch shown after the PCA description in the previous answer.
5. Randomized PCA: Randomized PCA is a variant of Principal Component Analysis (PCA) that provides an efficient approximation to the traditional PCA algorithm. It is particularly useful when dealing with large datasets where the computational cost of traditional PCA becomes prohibitive. Randomized PCA speeds up the computation by using randomization techniques while still preserving the most important components. The main steps involved in Randomized PCA are as follows: 1. Randomized Sampling: Instead of processing the entire dataset, Randomized PCA starts by randomly selecting a subset of the data. This step significantly reduces the computational complexity. 2. Matrix Approximation: Next, the algorithm approximates the covariance matrix of the selected data subset using matrix approximation techniques, such as randomized SVD (Singular Value Decomposition) or random projection. These approximation methods allow for faster computations without sacrificing much accuracy. 3. Traditional PCA on the Approximated Matrix: The reduced and approximated covariance matrix is then used as input for the traditional PCA algorithm. The remaining steps of PCA, including eigendecomposition or singular value decomposition, are performed on this matrix to compute the principal components. Randomized PCA provides a good approximation of the principal components with a lower computational cost compared to traditional PCA. However, it may introduce a small amount of error in the results, which is generally acceptable for many practical applications.
Kernel PCA: Kernel PCA is a nonlinear extension of traditional PCA that can capture complex patterns and structures in the data by using kernel functions. Unlike linear PCA, which operates in the original feature space, Kernel PCA maps the data into a higher-dimensional feature space, where linear PCA is then applied. The key steps involved in Kernel PCA are as follows: 1. Kernel Function: Kernel PCA begins by selecting an appropriate kernel function, such as the Gaussian kernel or polynomial kernel. The kernel function measures the similarity between pairs of data points in the original feature space. 2. Kernel Matrix: A kernel matrix is constructed using the kernel function, which quantifies the pairwise similarities between all data points in the original feature space. This matrix captures the nonlinear relationships among the data points. 3. Eigendecomposition: The kernel matrix is then eigendecomposed to obtain the eigenvectors and eigenvalues. These eigenvectors represent the principal components in the higher-dimensional feature space. 4. Projection: Finally, the data is projected onto the principal components obtained from the eigendecomposition. The projected data can then be used for further analysis or visualization. Kernel PCA is particularly useful when dealing with nonlinear relationships and complex data structures. It can capture nonlinear patterns that may be missed by linear PCA. However, it is important to note that Kernel PCA involves additional computational costs compared to linear PCA, as it requires the computation of the kernel matrix and its eigendecomposition. Both Randomized PCA and Kernel PCA are powerful techniques that extend the capabilities of traditional PCA. Randomized PCA offers computational efficiency for large datasets, while Kernel PCA allows for nonlinear dimensionality reduction and capturing complex patterns in the data. The choice between these techniques depends on the specific requirements of the analysis task and the nature of the dataset.
Unit -5:1. Write about different ways to Installing TensorFlow2? 2. Explain the procedure of Loading and preprocessing Data with TensorFlow? 3. What are the various ways to implement MLP's with Keras? Explain.
1. TensorFlow 2 can be installed in several ways: with pip inside a virtual environment (pip install tensorflow, or pip install tensorflow-cpu for a CPU-only build), inside a conda environment, via the official TensorFlow Docker images, or by using Google Colab, which comes with TensorFlow pre-installed and requires no local setup. A GPU-enabled installation additionally requires compatible NVIDIA drivers and CUDA/cuDNN libraries.
2. Data is usually loaded and preprocessed with the tf.data API: build a tf.data.Dataset (for example with tf.data.Dataset.from_tensor_slices() for in-memory arrays, or from CSV/TFRecord files), then chain transformations such as shuffle() to randomize the order, map() for per-example preprocessing, batch() to group examples, and prefetch() to overlap preprocessing with training, before passing the dataset to model.fit(). Keras preprocessing layers (such as Normalization or TextVectorization) can also be included in the model so that preprocessing is applied consistently at training and inference time.
3. These are the main ways to implement MLP models using Keras: the Sequential API, the Functional API, and the Subclassing API. The Sequential API stacks layers one after another and is the simplest option for a plain MLP; the Functional API builds a graph of layers and supports multiple inputs, multiple outputs, and non-linear topologies; the Subclassing API subclasses keras.Model and defines the forward pass imperatively, giving full flexibility at the cost of some of Keras's conveniences. Each approach has its own benefits and suitability for different use cases, and the choice depends on the complexity of the model architecture and the level of flexibility required. A minimal Sequential example is sketched below.
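As a concrete illustration of the simplest of the three approaches, here is a minimal Sequential-API MLP for binary classification; the layer sizes, random toy data, and training settings are illustrative assumptions, not prescribed values.

```python
# Minimal MLP with the Keras Sequential API (toy data, illustrative sizes).
import numpy as np
from tensorflow import keras

# Random stand-in data: 1000 samples, 20 features, binary labels.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,)).astype("float32")

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),   # binary output
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
model.summary()

# The same network could be built with the Functional API
# (inputs = keras.Input(...); outputs = ...; model = keras.Model(inputs, outputs))
# or by subclassing keras.Model and writing call() yourself.
```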