1. What is accuracy, precision, recall?

Precision
Precision is the ratio of correctly predicted positive observations (True Positives) to the total predicted positive observations, both correct (True Positives) and incorrect (False Positives):
Precision = TP / (TP + FP)

Accuracy
Accuracy is simply the ratio of the correctly predicted classifications (True Positives + True Negatives) to the total test data:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a great measure when you have symmetric datasets, but it is potentially misleading for asymmetric (imbalanced) datasets, where always predicting the majority class already yields high accuracy.

Recall
Recall is the ratio of correctly predicted positive observations (True Positives) to all observations in the actual positive class (Actual Positives = True Positives + False Negatives):
Recall = TP / (TP + FN)

2. What is a decision tree?

A decision tree is one of the easiest and most popular classification algorithms. It can be used for both classification and regression problems. A decision tree is a flowchart-like tree structure where each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. This flowchart-like structure helps you in decision making. It is a supervised (presence of a labelled target variable) and non-parametric algorithm.

3. What is the difference between regression and classification?

Classification is basically a two-step process: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data.
Classification - for predicting a categorical variable (unordered data), e.g., decision tree, logistic regression.
Regression - for predicting a continuous variable (ordered data), e.g., regression tree, linear regression.

4. What is information gain, gain ratio and gini index?

Information Gain, Gain Ratio and Gini index are all attribute selection measures (ASM) for decision trees. Attribute selection is the first step in any decision tree algorithm and is used to decide how to split the records.

Information Gain
Entropy measures the impurity of a data set, i.e., the impurity in a group of examples. Information gain is the decrease in entropy: it computes the difference between the entropy before the split and the weighted average entropy after splitting the data set.

Gain Ratio
Gain ratio is an extension of information gain. Information gain is biased towards attributes with many outcomes, because a many-valued attribute can split the data into many small, nearly pure subsets. Gain ratio handles this bias by normalizing the information gain using Split Info:
Gain Ratio = Information Gain / Split Info

Gini Index
The Gini index is a measure of the total variance across the k classes. It considers a binary split for each attribute. For a discrete-valued attribute, the subset that gives the minimum Gini index is chosen as the splitting subset. For a continuous-valued attribute, each midpoint between adjacent values is considered as a possible split point, and the attribute (and split point) with the minimum Gini index is chosen as the splitting attribute.
For binary classification, entropy ranges from 0 to 1 while Gini impurity ranges from 0 to 0.5. In practice the two behave very similarly, and Gini impurity is often preferred for selecting the best features because it is cheaper to compute (no logarithm).
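To make these attribute selection measures concrete, here is a minimal sketch in plain Python/NumPy (not part of the original notes; the toy labels and the candidate split are made up for illustration) that computes entropy, Gini impurity, and the information gain of a candidate binary split:

```python
import numpy as np

def entropy(labels):
    # Entropy = -sum(p_k * log2(p_k)) over the classes present at the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini impurity = 1 - sum(p_k^2); 0 for a pure node, 0.5 at worst for binary
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent, children):
    # Entropy before the split minus the weighted average entropy after it
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
left, right = labels[:4], labels[4:]            # a candidate binary split
print(entropy(labels), gini(labels))            # impurity before the split
print(information_gain(labels, [left, right]))  # decrease in entropy
```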
5. What is meant by leave one out cross validation?

LOOCV is the special case of cross-validation where just a single observation is held out for validation. The Leave-One-Out Cross-Validation (LOOCV) procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

6. What is meant by pmk?

pmk is the probability (proportion) of the kth class in the mth region; in other words, the proportion of training observations in the mth region that are from the kth class. The region is assigned to the class k that gives the maximum pmk: the larger this proportion, the higher the homogeneity, and thus the higher the purity, of the region.

7. What is cross entropy?

Cross-entropy is a measure of the difference between two probability distributions for a given random variable or set of events.

8. What is cross validation? Why do we perform it?

Cross-validation is a technique for evaluating ML models by dividing the data into k folds and training and testing several models on them. In k-fold cross-validation, we split the input data into k subsets (folds); each fold in turn is held out for testing while the remaining k - 1 folds are used for training. The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model. The k-fold CV estimate is the average test error over the folds:
CV(k) = (1/k) * sum(i = 1 to k) MSE_i
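As a hedged illustration of questions 5 and 8, here is a minimal scikit-learn sketch that runs both 5-fold cross-validation and LOOCV; the iris dataset and the decision tree classifier are stand-ins chosen for illustration, not part of the original notes:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# k-fold CV: k = 5 folds, each fold used once as the held-out test set
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)
print("5-fold CV accuracy:", kfold_scores.mean())

# LOOCV: a single observation held out per iteration (n iterations in total)
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV accuracy:", loo_scores.mean())
```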
9. What is the bagging and boosting algorithm?

Variance in decision trees due to small variations in the data can be reduced by bagging and boosting algorithms.
In bagging, the combined predictions come from models of the same type, each built independently and given equal weight. Bagging aims to decrease variance, not bias, so it basically tries to solve the overfitting problem: for high variance, use bagging.
In boosting, the combined predictions come from models built sequentially: every new model is influenced by the performance of the previous models, and the models are weighted according to their performance. Boosting aims to decrease bias, not variance: for high bias, use boosting.

10. What is a random forest?

Random forest is a supervised learning algorithm. It can be used both for classification and regression. Random forests create decision trees on randomly selected data samples, get a prediction from each tree, and select the best solution by means of voting. It also provides a pretty good indicator of feature importance. It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.

11. What is bootstrapping?

The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. It can be used to estimate summary statistics such as the mean or standard deviation (see the bootstrap sketch at the end of this section).

12. What does gini index signify?

The Gini index measures the impurity of a node. A node is pure when all of its records belong to the same class; such nodes are known as leaf nodes. Gini index, or Gini impurity, measures the degree (probability) of a particular element being wrongly classified when it is chosen at random. The decision to split at each node is made according to a purity metric: a node is 100% impure when it is split evenly 50/50 and 100% pure when all of its data belongs to a single class.

13. Explain confusion matrix.

A confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. For a binary classification problem, we have a 2 x 2 matrix with 4 values:

                   Predicted Positive   Predicted Negative
Actual Positive           TP                   FN
Actual Negative           FP                   TN

Understanding True Positive, True Negative, False Positive and False Negative in a confusion matrix:
True Positive (TP)
● The predicted value matches the actual value
● The actual value was positive and the model predicted a positive value
True Negative (TN)
● The predicted value matches the actual value
● The actual value was negative and the model predicted a negative value
False Positive (FP) - Type 1 error
● The actual value was negative but the model predicted a positive value
False Negative (FN) - Type 2 error
● The actual value was positive but the model predicted a negative value
Why is it needed? To compute accuracy, precision and recall (explained above in question 1; see also the confusion-matrix sketch at the end of this section).

14. KNN

KNN is a non-parametric and lazy learning algorithm. Non-parametric means there is no assumption about the underlying data distribution; in other words, the model structure is determined from the dataset. This is very helpful in practice, where most real-world datasets do not follow mathematical theoretical assumptions. Lazy learning means there is no explicit model-building phase: all training data is used in the testing phase. This makes training faster and the testing phase slower and costlier, in both time and memory. In the worst case, KNN needs to scan all data points for each prediction, and storing the training data requires more memory.

15. What is a decision tree stump?

A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A decision stump makes a prediction based on the value of just a single input feature.

16. How to measure node purity?

The decision to split at each node is made according to a purity metric. A node is 100% impure when it is split evenly 50/50 and 100% pure when all of its data belongs to a single class. Node impurity is a measure of the homogeneity of the labels at the node. Typical library implementations provide two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).
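To tie questions 1 and 13 together, here is a minimal sketch (scikit-learn; the ground-truth and predicted labels are made up for illustration) that builds a binary confusion matrix and recomputes accuracy, precision, and recall from its four cells:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Toy ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# For binary labels, ravel() yields the four cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, TN, FP, FN:", tp, tn, fp, fn)

# The same metrics written out from their definitions in question 1
print("accuracy :", (tp + tn) / (tp + tn + fp + fn), accuracy_score(y_true, y_pred))
print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))
print("recall   :", tp / (tp + fn), recall_score(y_true, y_pred))
```

And for question 11, a short bootstrap sketch, assuming a toy NumPy sample; the statistic being estimated is the mean:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50, scale=5, size=100)   # toy sample

# Bootstrap: resample with replacement many times and recompute the statistic;
# the spread of the resampled means estimates the standard error of the mean
boot_means = [rng.choice(data, size=len(data), replace=True).mean()
              for _ in range(1000)]
print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error of the mean:", np.std(boot_means))
```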