
QnA - Business Analytics

1. What is accuracy, precision, recall?
Precision
Precision is the ratio of correctly predicted positive observations (True Positives) to the system's total predicted positive observations, both correct (True Positives) and incorrect (False Positives).
Precision = TP / (TP + FP)
Accuracy
Accuracy is simply the ratio of the correctly predicted classifications (True Positives + True Negatives) to the total test data.
Also given by,
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy is a good measure when the dataset is symmetric (the classes are balanced), but it is potentially misleading for asymmetric (imbalanced) datasets, where always predicting the majority class already yields high accuracy.
Recall
Recall is the ratio of correctly predicted positive observations (True Positives) to all observations in the actual positive class (Actual Positives), e.g. all actually malignant cases in a cancer-screening dataset.
Recall = TP / (TP + FN)
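A minimal sketch of all three metrics computed from the four confusion-matrix counts (the counts below are made up purely for illustration):

```python
# Accuracy, precision and recall from raw confusion-matrix counts.
# These counts are hypothetical, chosen only to make the arithmetic visible.
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                    # 40 / 50 = 0.80
recall = tp / (tp + fn)                       # 40 / 45 ~ 0.89
accuracy = (tp + tn) / (tp + fp + fn + tn)    # 85 / 100 = 0.85

print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```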
2. What is a decision tree?
The decision tree is one of the easiest and most popular classification algorithms. It can be used for both classification and regression problems. A decision tree is a flowchart-like tree structure where an internal node represents a feature (or attribute), a branch represents a decision rule, and each leaf node represents the outcome. This flowchart-like structure helps in decision making.
Supervised (requires a labelled target/y variable), non-parametric algorithm
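A minimal sketch of fitting a decision tree with scikit-learn (the Iris dataset here is just a convenient stand-in):

```python
# Fit a shallow decision tree classifier; illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # shallow, easy to read
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```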
3. What is the difference between regression and classification?
Classification is basically a two-step process: a learning step and a prediction step. In the learning step, the model is developed from the given training data. In the prediction step, the model is used to predict the response for new data.
Classification - for predicting a categorical variable
Regression - for predicting a continuous variable
Unordered (categorical) output - classification - decision tree, logistic regression
Ordered (continuous) output - regression - regression tree, linear regression
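To make the distinction concrete, here is a minimal sketch contrasting a classifier and a regressor from the lists above (the toy data is invented for illustration):

```python
# Classification predicts a category; regression predicts a continuous value.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])   # toy feature
y_class = np.array([0, 0, 0, 1, 1, 1])                     # categorical target
y_cont = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])          # continuous target

clf = LogisticRegression().fit(X, y_class)   # classification
reg = LinearRegression().fit(X, y_cont)      # regression

print(clf.predict([[3.5]]))   # a class label (0 or 1)
print(reg.predict([[3.5]]))   # a real number
```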
4. What is information gain, gain ratio and gini index?
Information gain, gain ratio and the Gini index are all attribute selection measures (ASMs) for decision trees. Attribute selection is the first step in any decision tree algorithm and is used to decide how to split the records.
Information Gain
Entropy basically measures the impurity of a data set; it refers to the impurity in a group of examples. Information gain is the decrease in entropy: it computes the difference between the entropy before the split and the weighted average entropy after splitting the data set.
Information Gain = Entropy(parent) - weighted average Entropy(children)
Gain Ratio
Gain ratio is an extension of information gain. Information gain is biased toward attributes with many outcomes: an attribute such as a record ID uniquely identifies the data and so maximizes information gain while being useless for prediction. Gain ratio handles this bias by normalizing the information gain using the split information (Split Info).
Gain Ratio = Information Gain / Split Info
Gini Index - a measure of the total variance across the K classes.
The Gini index considers a binary split for each attribute.
For a discrete-valued attribute, the subset that gives the minimum Gini index is chosen as the splitting attribute.
For a continuous-valued attribute, each pair of adjacent values is considered as a possible split point, and the split point with the minimum Gini index is chosen for that attribute.
The range of entropy lies between 0 and 1 (for a binary split), while the range of Gini impurity lies between 0 and 0.5. In practice the two measures usually produce very similar trees; Gini impurity is often preferred simply because it avoids computing logarithms and is therefore slightly faster.
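A minimal sketch of computing entropy, Gini impurity and the information gain of one candidate split (the class proportions are invented for illustration):

```python
# Entropy, Gini impurity and information gain for a toy binary split.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # avoid log2(0)
    return -np.sum(p * np.log2(p))

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

parent = [0.5, 0.5]                   # 50/50 parent node: maximally impure
left, right = [0.9, 0.1], [0.2, 0.8]  # hypothetical child nodes after the split
w_left, w_right = 0.5, 0.5            # fraction of records sent to each child

gain = entropy(parent) - (w_left * entropy(left) + w_right * entropy(right))
print("entropy(parent) =", entropy(parent))   # 1.0
print("gini(parent) =", gini(parent))         # 0.5
print("information gain =", gain)             # ~0.40
```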
5. What is meant by leave one out cross validation?
LOOCV is the special case of cross-validation where just a single observation is held out for validation in each iteration.
The leave-one-out cross-validation (LOOCV) procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.
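A minimal sketch using scikit-learn's LeaveOneOut splitter (Iris again as a stand-in dataset):

```python
# LOOCV: n folds, each testing the model on exactly one held-out observation.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=LeaveOneOut())
print(len(scores), "folds; mean accuracy:", scores.mean())  # 150 folds for Iris
```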
6. What is meant by Pmk?
pmk - the probability (proportion) of the kth class in the mth region.
OR
The proportion of training observations in the mth region that are from the kth class.
Choose the class k that gives the maximum pmk for region m: maximum proportion implies maximum homogeneity, and thus maximum purity.
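A tiny sketch of computing pmk for one region of a fitted tree (the labels are invented for illustration):

```python
# pmk: proportion of training observations in region m that belong to class k.
import numpy as np

region_labels = np.array([0, 0, 1, 0, 2, 0, 0, 1])   # hypothetical labels in region m
classes, counts = np.unique(region_labels, return_counts=True)
pmk = counts / counts.sum()

print(dict(zip(classes, pmk)))                     # {0: 0.625, 1: 0.25, 2: 0.125}
print("predicted class:", classes[pmk.argmax()])   # class 0: the purest choice
```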
7. What is cross entropy?
Cross-entropy is a measure of the difference between two probability distributions for a given random variable
or set of events.
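A minimal sketch of the cross-entropy H(p, q) = -sum over x of p(x) * log q(x) between a true distribution p and a predicted distribution q (both invented for illustration):

```python
# Cross-entropy H(p, q) = -sum_x p(x) * log(q(x)).
import numpy as np

p = np.array([1.0, 0.0, 0.0])   # true distribution (a one-hot class label)
q = np.array([0.7, 0.2, 0.1])   # hypothetical predicted probabilities

cross_entropy = -np.sum(p * np.log(q + 1e-12))   # epsilon guards against log(0)
print(cross_entropy)            # ~0.357 nats; 0 would mean a perfect match
```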
8. What is cross validation? Why do we perform it?
Cross-validation means dividing the data into k folds and repeatedly training and testing on them.
Cross-validation is a technique for evaluating ML models by training several models on different subsets of the data. In k-fold cross-validation, we split the input data into k subsets (folds); each fold is held out for testing once while the model is trained on the remaining k - 1 folds.
The goal of cross-validation is to estimate the expected level of fit of a model to a data set that is independent of the data that were used to train the model. It can be used to estimate any quantitative measure of fit that is appropriate for the data and model.
K-fold CV estimate: CV(k) = (1/k) * sum(i = 1 to k) MSE_i
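A minimal sketch of the k-fold estimate above with scikit-learn, using mean squared error on synthetic regression data:

```python
# k-fold CV: average the held-out MSE over k train/test splits.
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports negative MSE so that "higher is better"; negate it back.
mse_per_fold = -cross_val_score(LinearRegression(), X, y, cv=kf,
                                scoring="neg_mean_squared_error")
print("CV(k) =", mse_per_fold.mean())   # (1/k) * sum of the k fold MSEs
```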
9. What is the bagging and boosting algorithm?
The high variance of decision trees caused by small variations in the data can be reduced by bagging and boosting algorithms.
In bagging algorithms, the combined predictions come from models of the same type, trained independently on bootstrap samples. Bagging aims to decrease variance, not bias: each model is built independently and receives equal weight. Basically it tries to solve the overfitting problem; use bagging when the base model has high variance.
In boosting algorithms, the models are built sequentially: every new model is influenced by the performance of the previous models, and the models are weighted according to their performance. Boosting aims to decrease bias, not variance. Basically it tries to reduce the bias; use boosting when the base model has high bias.
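A minimal sketch contrasting the two in scikit-learn, with decision trees as the base models (the breast-cancer dataset is just a stand-in):

```python
# Bagging (parallel, equal-weight models) vs. boosting (sequential, weighted).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

bagging = BaggingClassifier(DecisionTreeClassifier(),               # high-variance base
                            n_estimators=50, random_state=0)
boosting = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # high-bias stumps
                              n_estimators=50, random_state=0)

print("bagging :", cross_val_score(bagging, X, y, cv=5).mean())
print("boosting:", cross_val_score(boosting, X, y, cv=5).mean())
```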
10. What is a random forest?
Random forest is a supervised learning algorithm. It can be used both for classification and regression. Random forests create decision trees on randomly selected data samples, get a prediction from each tree and select the best solution by means of voting. They also provide a pretty good indicator of feature importance.
It works in four steps:
1. Select random samples from a given dataset.
2. Construct a decision tree for each sample and get a prediction result from each decision
tree.
3. Perform a vote for each predicted result.
4. Select the prediction result with the most votes as the final prediction.
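A minimal sketch of these four steps as carried out internally by scikit-learn's RandomForestClassifier (Iris as a stand-in):

```python
# Random forest: bootstrap samples -> many trees -> majority vote.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))
print("feature importances:", forest.feature_importances_)  # the indicator noted above
```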
11. What is bootstrapping?
The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement. It can be used to estimate summary statistics such as the mean or standard deviation.
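A minimal sketch of bootstrapping the mean with NumPy (the sample values are invented):

```python
# Bootstrap: resample with replacement, recompute the statistic many times.
import numpy as np

rng = np.random.default_rng(0)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4])  # hypothetical data

boot_means = [rng.choice(sample, size=sample.size, replace=True).mean()
              for _ in range(10_000)]

print("bootstrap estimate of the mean:", np.mean(boot_means))
print("bootstrap standard error:", np.std(boot_means))
```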
12. What does gini index signify?
The Gini index measures the impurity of a node. You can say a node is pure when all of its records belong to the same class; such nodes are known as leaf nodes.
The Gini index, or Gini impurity, measures the degree or probability of a particular record being wrongly classified when it is labelled at random according to the class distribution at the node. But what is actually meant by 'impurity'? If all the elements belong to a single class, the node can be called pure.
13. Explain confusion matrix.
A Confusion matrix is an N x N matrix used for evaluating the performance of a classification
model, where N is the number of target classes. The matrix compares the actual target values
with those predicted by the machine learning model. This gives us a holistic view of how well
our classification model is performing and what kinds of errors it is making.
For a binary classification problem, we would have a 2 x 2 matrix with 4 values:

                   Predicted: Positive    Predicted: Negative
Actual: Positive   True Positive (TP)     False Negative (FN)
Actual: Negative   False Positive (FP)    True Negative (TN)

Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
True Positive (TP)
● The predicted value matches the actual value
● The actual value was positive and the model predicted a positive value
True Negative (TN)
● The predicted value matches the actual value
● The actual value was negative and the model predicted a negative value
False Positive (FP) – Type 1 error
● The predicted value does not match the actual value
● The actual value was negative but the model predicted a positive value
● Also known as the ​Type 1 error
False Negative (FN) – Type 2 error
● The predicted value does not match the actual value
● The actual value was positive but the model predicted a negative value
● Also known as the ​Type 2 error
Why is it needed - to compute accuracy, precision and recall (explained above).
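A minimal sketch of building a confusion matrix with scikit-learn (both label vectors are invented for illustration):

```python
# 2 x 2 confusion matrix for a binary problem; rows = actual, columns = predicted.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=4 TN=4 FP=1 FN=1
```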
14. KNN
KNN is a non-parametric and lazy learning algorithm.
Non-parametric means there is no assumption about the underlying data distribution. In other words, the model structure is determined from the dataset. This is very helpful in practice, where most real-world datasets do not follow theoretical mathematical assumptions.
Lazy algorithm means it does not build a model from the training data points: all training data is used in the testing phase. This makes training faster and the testing phase slower and costlier, in both time and memory. In the worst case, KNN needs more time to scan all data points, and storing all the training data requires more memory.
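A minimal sketch with scikit-learn's KNeighborsClassifier (Iris as a stand-in; note that fit() is nearly free for a lazy learner, since the work happens at prediction time):

```python
# KNN: no real training step; each prediction scans the stored training data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)            # just stores the training data
print("test accuracy:", knn.score(X_test, y_test))
```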
15. What is a decision tree stump?
A decision stump is a machine learning model consisting of a one-level decision tree. That is, it is a decision tree
with one internal node (the root) which is immediately connected to the terminal nodes (its leaves). A
decision stump makes a prediction based on the value of just a single input feature.
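A decision stump can be sketched in scikit-learn as a depth-1 tree (Iris as a stand-in dataset):

```python
# A decision stump: a decision tree with a single split (one internal node).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1).fit(X, y)
print(export_text(stump))   # one root test, two leaves
```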
16. How to measure node purity?
The decision to split at each node is made according to a metric called purity. A node is 100% impure when it is split evenly 50/50 between classes, and 100% pure when all of its data belongs to a single class.
Node impurity is a measure of the homogeneity of the labels at the node. Typical implementations provide two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).