Data Warehouse & Data Management Assignment - 2
BAYESIAN CLASSIFICATION
BY: Ishika Hooda (2K16/CO/133), Jatin Artwani (2K16/CO/137), Jessjit Singh (2K16/CO/142)

Introduction
❖ In many applications, the relationship between the attribute set and the class variable is non-deterministic. In other words, the class label of a test record cannot be predicted with certainty even when its attribute set is identical to that of some training examples.
❖ Such situations may arise because of noisy data or because of confounding factors that influence classification but are not included in the analysis.
❖ Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers: they can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayes' Theorem
❖ The theorem is named after Thomas Bayes, who first used conditional probability to provide an algorithm that uses evidence to calculate limits on an unknown parameter.
❖ There are two types of probabilities:
1. Posterior probability, P(H|X)
2. Prior probability, P(H)
where X is a data tuple and H is some hypothesis.
❖ P(H|X) is the conditional probability that H holds given that X is observed.
❖ P(X|H) is the conditional probability that X is observed given that H holds.
❖ P(H) and P(X) are the probabilities of observing H and X independently of each other; each is known as a marginal probability.
❖ Bayes' theorem relates these quantities as P(H|X) = P(X|H) * P(H) / P(X).

Bayesian Interpretation
❖ In the Bayesian interpretation, probability measures a "degree of belief," and Bayes' theorem connects the degree of belief in a hypothesis before and after accounting for evidence.
❖ For example, consider tossing a coin: we get either heads or tails, and the chance of each outcome is 50%. If the coin is flipped a number of times and the outcomes are observed, the degree of belief may rise, fall, or remain the same depending on the outcomes.
❖ For a proposition X and evidence Y:
❖ P(X), the prior, is the initial degree of belief in X.
❖ P(X|Y), the posterior, is the degree of belief after accounting for Y.
❖ The quotient P(Y|X) / P(Y) represents the support that Y provides for X.
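To make the update rule concrete, the short Python sketch below applies Bayes' theorem to the coin-toss idea above: it updates a prior belief that a coin is biased after observing a run of heads. The prior, the likelihoods, and the number of observed heads are illustrative values chosen only for this sketch; they do not come from the assignment.

```python
# Minimal sketch of a Bayesian update, assuming an illustrative scenario:
# hypothesis H = "the coin is biased towards heads (P(heads) = 0.8)",
# alternative  = "the coin is fair (P(heads) = 0.5)".

def posterior(prior_h, p_x_given_h, p_x_given_not_h):
    """Return P(H|X) via Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)."""
    prior_not_h = 1.0 - prior_h
    evidence = p_x_given_h * prior_h + p_x_given_not_h * prior_not_h  # P(X)
    return p_x_given_h * prior_h / evidence

# Evidence X: three heads in a row.
p_x_given_biased = 0.8 ** 3   # P(X | biased coin)
p_x_given_fair = 0.5 ** 3     # P(X | fair coin)

prior_biased = 0.1            # initial degree of belief that the coin is biased
post_biased = posterior(prior_biased, p_x_given_biased, p_x_given_fair)

print(f"P(biased) before evidence: {prior_biased:.3f}")
print(f"P(biased | 3 heads):       {post_biased:.3f}")  # the degree of belief rises
```

Running the sketch shows the posterior rising well above the prior, which is exactly the "degree of belief updated by evidence" reading of Bayes' theorem described above.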
Bayesian Belief Networks
❖ Bayesian belief networks specify joint conditional probability distributions. They are also known as belief networks, Bayesian networks, or probabilistic networks.
❖ A belief network allows class conditional independencies to be defined between subsets of variables.
❖ It provides a graphical model of causal relationships on which learning can be performed.
❖ A trained Bayesian network can be used for classification.
❖ Two components define a Bayesian belief network:
❖ A directed acyclic graph
❖ A set of conditional probability tables

An example of a Bayesian Belief Network

Directed Acyclic Graph (DAG)
❖ In computer science and mathematics, a directed acyclic graph (DAG) is a graph that is directed and contains no cycles. This means it is impossible to start at any node, follow the directed edges, and return to the same node; the edges of the graph only go one way.
❖ Each node in a directed acyclic graph represents a random variable.
❖ These variables may be discrete or continuous-valued.
❖ These variables may correspond to actual attributes given in the data.

An Example of a Directed Acyclic Graph

Conditional Probability Table (CPT)
❖ In statistics, a conditional probability table (CPT) is defined for a set of discrete and mutually dependent random variables to display the conditional probabilities of a single variable with respect to the others (i.e., the probability of each possible value of one variable given the values taken on by the other variables).

Assumptions
❖ All attributes are assumed to be independent within each class.
❖ Discrete attributes can take on arbitrary multinomial distributions, and real-valued attributes are assumed to be normally distributed.
❖ It is important to point out that we do not assume that the classification parameters or the number of classes are "random variables." Rather, we merely treat them as unknown quantities about which we wish to perform inference.
❖ Bayesian methods have often been criticised for their use of prior distributions and for the belief that this makes their results personalistic and therefore somewhat arbitrary.

An Example of a Conditional Probability Table

Naive Bayes Classifier Fundamentals
❖ In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naive) independence assumptions between the features. They are among the simplest Bayesian network models.
❖ Abstractly, naive Bayes is a conditional probability model: given a problem instance to be classified, represented by a vector x = (x1, ..., xn) of n features (independent variables), it assigns to this instance the probabilities P(Ck | x1, ..., xn) for each of K possible outcomes or classes Ck.
❖ The problem with the above formulation is that if the number of features n is large, or if a feature can take on a large number of values, then basing such a model on probability tables is infeasible. We therefore reformulate the model to make it more tractable. Using Bayes' theorem, the conditional probability can be decomposed as
P(Ck | x) = P(Ck) * P(x | Ck) / P(x).

Naive Bayes Classifier Fundamentals (contd.)
❖ In practice only the numerator matters, since the denominator P(x) does not depend on the class. The numerator is the joint probability P(Ck, x1, ..., xn), which can be rewritten with the chain rule as
P(Ck, x1, ..., xn) = P(x1 | x2, ..., xn, Ck) * P(x2 | x3, ..., xn, Ck) * ... * P(xn | Ck) * P(Ck).
❖ Now the "naive" conditional independence assumptions come into play: assuming that every feature of the vector x is independent of every other feature given the class, we get
P(xi | xi+1, ..., xn, Ck) = P(xi | Ck).
❖ Thus, the joint model can be expressed as
P(Ck | x1, ..., xn) ∝ P(Ck) * P(x1 | Ck) * P(x2 | Ck) * ... * P(xn | Ck).

Naive Bayes Classifier Pipeline
❖ Prediction of the class uses the following decision rule: choose the class Ck that maximises P(Ck) * P(x1 | Ck) * ... * P(xn | Ck).

Naive Bayes Classifier Example
❖ We want to classify a Red Domestic SUV. Note that there is no example of a Red Domestic SUV in our data set, so we first need to calculate the probabilities P(Red|Yes), P(SUV|Yes), P(Domestic|Yes), P(Red|No), P(SUV|No), and P(Domestic|No), and multiply them by P(Yes) and P(No) respectively.
❖ Each conditional probability is computed with the m-estimate: P(ai | vj) = (nc + m*p) / (n + m), where n is the number of training examples with class vj, nc is the number of those examples that have attribute value ai, p is a prior estimate, and m is the equivalent sample size.
❖ Looking at P(Red | Yes), we have 5 cases where vj = Yes, and in 3 of those cases ai = Red. So for P(Red | Yes), n = 5 and nc = 3.
❖ Note that all attributes are binary (two possible values). Assuming no other information, p = 1 / (number of attribute values) = 0.5 for all of our attributes.
❖ Our m value is arbitrary (we will use m = 3), but it is kept consistent for all attributes.

Naive Bayes Classifier Example (contd.)
❖ Next we calculate the remaining probabilities. We have P(Yes) = .5 and P(No) = .5.
❖ For v = Yes, we have P(Yes) * P(Red | Yes) * P(SUV | Yes) * P(Domestic | Yes) = .5 * .56 * .31 * .43 = .037
❖ For v = No, we have P(No) * P(Red | No) * P(SUV | No) * P(Domestic | No) = .5 * .43 * .56 * .56 = .069
❖ Since .069 > .037, our example gets classified as 'NO'.
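To make the arithmetic easy to check, here is a minimal Python sketch of the same calculation. The full training table from the slides is not reproduced in the text, so the count for Red given Yes (3 of 5) is taken from the example above and the remaining counts are inferred from the rounded probabilities quoted there; the function and variable names are our own.

```python
# Sketch of the "Red Domestic SUV" example, assuming the counts below.
# nc for Red|Yes (3 of 5) is stated in the example; the other nc values are
# inferred from the rounded probabilities (.56, .31, .43, ...) quoted above.

def m_estimate(nc, n, p=0.5, m=3):
    """m-estimate of a conditional probability: (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# Number of matching training rows (nc) per attribute value, for each class.
counts = {
    "Yes": {"Red": 3, "SUV": 1, "Domestic": 2},
    "No":  {"Red": 2, "SUV": 3, "Domestic": 3},
}
n_per_class = {"Yes": 5, "No": 5}   # 5 training rows per class
prior = {"Yes": 0.5, "No": 0.5}     # P(Yes) = P(No) = .5

# Naive Bayes score for each class: P(v) * product over attributes of P(ai | v).
scores = {}
for v in prior:
    score = prior[v]
    for value, nc in counts[v].items():
        score *= m_estimate(nc, n_per_class[v])
    scores[v] = score

print({v: round(s, 3) for v, s in scores.items()})    # {'Yes': 0.038, 'No': 0.069}
print("Classified as:", max(scores, key=scores.get))  # 'No'
```

Computing with the unrounded m-estimates gives roughly .038 and .069; the slide's .037 comes from multiplying the already-rounded conditional probabilities. Either way the 'No' class wins, matching the classification above.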
Naive Bayes Classifier Applications
1. Text Classification - Bayesian classification is used as a probabilistic learning method (naive Bayes text classification). Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents.
2. Spam Filtering - Spam filtering is the best-known use of naive Bayesian text classification. It uses a naive Bayes classifier to identify spam e-mail. Bayesian spam filtering has become a popular mechanism for distinguishing spam from legitimate email.
3. Recommendation Systems - Recommender systems apply machine learning and data mining techniques to filter unseen information and can predict whether a user would like a given resource. Naive Bayes plays an important role in the algorithms that help detect the preferences of a user.

Filtering Spam using Naive Bayes
❖ This algorithm classifies each object by looking at all of its features individually. Bayes' rule shows how to calculate the posterior probability for a single feature; the posterior probability of the object is calculated for each feature, and these probabilities are multiplied together to get a final probability. Whichever class has the greater probability ultimately determines the class of the object.
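As an illustration of that per-feature multiplication, the sketch below implements a tiny word-level naive Bayes spam filter from scratch. The miniature training messages, the add-one smoothing, and the test message are all invented for this sketch; a real filter would be trained on a large labelled e-mail corpus.

```python
from collections import Counter

# Tiny, invented training corpus of (message, label) pairs.
training = [
    ("win money now", "spam"),
    ("limited offer win prize", "spam"),
    ("meeting schedule attached", "ham"),
    ("lunch plans for today", "ham"),
]

# Count words per class and messages per class.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}

def p_word_given_class(word, label, smoothing=1.0):
    """Smoothed estimate of P(word | class), using add-one (Laplace) smoothing."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + smoothing) / (total + smoothing * len(vocabulary))

def classify(text):
    """Score each class as P(class) * product of P(word | class), as described above."""
    scores = {}
    for label in class_counts:
        score = class_counts[label] / sum(class_counts.values())  # prior P(class)
        for word in text.split():
            score *= p_word_given_class(word, label)
        scores[label] = score
    return max(scores, key=scores.get), scores

label, scores = classify("win a prize now")
print(label, scores)   # comes out as 'spam' on this toy data
```

The design mirrors the description above: each word contributes one conditional probability, the contributions are multiplied together with the class prior, and the class with the larger final score is chosen.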