Classification: Naïve Bayes
Haramaya University, College of Computing and Informatics
Department of Software Engineering
Mr. Dita Abdujebar (M.Sc.)

Outline
- What is Naïve Bayes
- Pros and cons of Naïve Bayes
- Probability theory
- Conditional probability
- Naïve Bayes classification

Probability Theory: Naïve Bayes
- With both kNN and decision trees we asked the classifier to make a hard decision: a single, definite answer to the question.
- It is often better to ask for the classifier's best guess about the class, together with a probability for that guess.
- Probability theory forms the basis for many machine learning algorithms, and it can help us classify things.

Probability Theory: Naïve Bayes
- Classifying with Bayesian decision theory:
  - Pros: works with a small amount of data; handles multiple classes.
  - Cons: sensitive to how the input data is prepared.
  - Works with: nominal values.

Probability Theory: Naïve Bayes
- Naïve Bayes is a subset of Bayesian decision theory.
- For this kind of problem a decision tree would not be very successful, and kNN would require a lot of calculation compared with a simple probability calculation.
- Conditional probability: P(gray | bucket B) = P(gray and bucket B) / P(bucket B)

Figure 1: Seven stones in two buckets

Probability Theory: Naïve Bayes
- Conditional probability: the probability of drawing a gray stone, given that the unknown stone comes from bucket B.
- P(gray | bucket B) = 1/3, P(gray | bucket A) = 2/4
- To formalize how to calculate the conditional probability, we can say:
  P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
- P(gray and bucket B) = 1/7 (one gray stone in bucket B out of seven stones in total)
- P(bucket B) = 3/7 (three of the seven stones are in bucket B)
- P(gray | bucket B) = (1/7) / (3/7) = 1/3
- Another useful way to manipulate conditional probabilities is Bayes' rule. If we have P(x | c) but want P(c | x):
  P(c | x) = P(x | c) P(c) / P(x)

Classifying with Conditional Probabilities
- Bayesian decision theory tells us to compare two probabilities:
  - If p1(x, y) > p2(x, y), then the class is 1.
  - If p1(x, y) < p2(x, y), then the class is 2.
- What we really need to compare are p(c1 | x, y) and p(c2 | x, y): given a point (x, y), what is the probability it came from class c1? What is the probability it came from class c2?
- Posterior = (likelihood × prior) / evidence
- With these definitions, we can state the Bayesian classification rule:
  - If P(c1 | x, y) > P(c2 | x, y), the class is c1.
  - If P(c1 | x, y) < P(c2 | x, y), the class is c2.
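A minimal Python sketch of this classification rule, using the stone counts implied above (four stones in bucket A, two of them gray; three stones in bucket B, one of them gray): it applies Bayes' rule to ask which bucket a gray stone most likely came from. The variable names are illustrative only.

    # Bayes' rule on the two-bucket example: which bucket did a gray stone come from?
    # Counts follow from the slide: bucket A holds 4 stones (2 gray), bucket B holds 3 stones (1 gray).
    stones = {"A": {"total": 4, "gray": 2}, "B": {"total": 3, "gray": 1}}
    n_total = sum(b["total"] for b in stones.values())   # 7 stones overall

    posteriors = {}
    for bucket, counts in stones.items():
        prior = counts["total"] / n_total                # P(bucket)
        likelihood = counts["gray"] / counts["total"]    # P(gray | bucket)
        # Proportional to P(bucket | gray); the evidence P(gray) is the same for both buckets.
        posteriors[bucket] = likelihood * prior

    # Bayesian classification rule: pick the bucket with the larger posterior.
    best = max(posteriors, key=posteriors.get)
    print(posteriors)                                    # {'A': 0.2857..., 'B': 0.1428...}
    print("A gray stone most likely came from bucket", best)   # -> A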
Uses of Naïve Bayes Classification
- Applications of Naïve Bayes:
  - Text classification
  - Spam filtering
  - Hybrid recommender systems (collaborative and content-based filtering)
  - Online applications
- Bayesian reasoning is applied to decision making and to inferential statistics that deal with probability inference. It uses knowledge of prior events to predict future events.

Example One
Figure 2: Example training data (14 training tuples; class attribute buys_computer)
- X = (age = youth, income = medium, student = yes, credit_rating = fair)
- Will a person described by tuple X buy a computer?

Example One
- Maximum a posteriori (MAP) hypothesis: P(Ci | X) = P(X | Ci) P(Ci) / P(X)
- Since P(X) is constant for all classes, it is enough to maximize P(X | Ci) P(Ci).

Example One
- P(C1 = yes) = P(buys_computer = yes) = 9/14 = 0.643
- P(C2 = no) = P(buys_computer = no) = 5/14 = 0.357
- P(age = youth | buys_computer = yes) = 2/9 = 0.222
- P(age = youth | buys_computer = no) = 3/5 = 0.600
- P(income = medium | buys_computer = yes) = 4/9 = 0.444
- P(income = medium | buys_computer = no) = 2/5 = 0.400
- P(student = yes | buys_computer = yes) = 6/9 = 0.667
- P(student = yes | buys_computer = no) = 1/5 = 0.200
- P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
- P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Example One
- P(X | buys_computer = yes) = P(age = youth | yes) × P(income = medium | yes) × P(student = yes | yes) × P(credit_rating = fair | yes)
  = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
- P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019

Example One
- Find the class Ci that maximizes P(X | Ci) × P(Ci):
  - P(X | buys_computer = yes) × P(buys_computer = yes) = 0.044 × 0.643 = 0.028
  - P(X | buys_computer = no) × P(buys_computer = no) = 0.019 × 0.357 = 0.007
- Prediction for tuple X: buys_computer = yes (X will buy a computer).

Example Two
- Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I). Given a training set of 11 documents, we would like to estimate a Naïve Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as S or I.
- We define a vocabulary of eight words, w1 … w8.

Example Two
- Types of Naïve Bayes: Gaussian, multinomial, and Bernoulli. This example uses the Bernoulli document model.

Figure 3: Vocabulary of eight words (w1 … w8)

Example Two
- Each document is therefore represented as an 8-dimensional binary vector.
- The training data form a matrix for each class, in which each row is an 8-dimensional document vector (6 rows for S, 5 rows for I).

Example Two
- Classify the following into Sports or Informatics using a Naïve Bayes classifier:
  - b1 = (1, 0, 0, 1, 1, 1, 0, 1) → S or I?
  - b2 = (0, 1, 1, 0, 1, 0, 1, 0) → S or I?

Example Two
- The total number of documents in the training set is N = 11; NS = 6, NI = 5.
- We can estimate the prior probabilities from the training data as: P(S) = 6/11, P(I) = 5/11.

Example Two
- The word counts in the training data (the number of documents of each class that contain word wt) are:
  - nS(wt) = (3, 1, 2, 3, 3, 4, 4, 4)
  - nI(wt) = (1, 3, 3, 1, 1, 1, 3, 1)

Example Two
- We can estimate the word likelihoods using P(wt | C) = nC(wt) / NC, the fraction of class-C documents that contain word wt:
  - For class S: P(wt | S) = (1/2, 1/6, 1/3, 1/2, 1/2, 2/3, 2/3, 2/3)
  - For class I: P(wt | I) = (1/5, 3/5, 3/5, 1/5, 1/5, 1/5, 3/5, 1/5)

Example Two
- We now compute the posterior probabilities of the two test vectors and hence classify them. In the Bernoulli model, a word that is absent (bt = 0) contributes the factor 1 − P(wt | C).
- b1 = (1, 0, 0, 1, 1, 1, 0, 1)
- P(S | b1) ∝ P(b1 | S) × P(S) = (1/2 × 5/6 × 2/3 × 1/2 × 1/2 × 2/3 × 1/3 × 2/3) × (6/11) = 5/891 ≈ 5.6 × 10⁻³
- P(I | b1) ∝ P(b1 | I) × P(I) = (1/5 × 2/5 × 2/5 × 1/5 × 1/5 × 1/5 × 2/5 × 1/5) × (5/11) = 8/859375 ≈ 9.3 × 10⁻⁶
- Classify this document as S.

Example Two
- b2 = (0, 1, 1, 0, 1, 0, 1, 0)
- P(S | b2) ∝ P(b2 | S) × P(S) = (1/2 × 1/6 × 1/3 × 1/2 × 1/2 × 1/3 × 2/3 × 1/3) × (6/11) = 12/42768 ≈ 2.8 × 10⁻⁴
- P(I | b2) ∝ P(b2 | I) × P(I) = (4/5 × 3/5 × 3/5 × 4/5 × 1/5 × 4/5 × 3/5 × 4/5) × (5/11) = 34560/4296875 ≈ 8.0 × 10⁻³
- Classify this document as I.

Naïve Bayes: Syntax
- Import the class containing the classification method, then fit it to the training data and predict, as in the sketch below.
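A minimal scikit-learn sketch of this pattern, assuming scikit-learn is the intended library; BernoulliNB corresponds to the Bernoulli document model used in Example Two. The small binary matrix X_train and labels y_train below are placeholder data for illustration only, not the actual 11-document training set.

    # Minimal scikit-learn usage sketch (assumes scikit-learn is installed).
    import numpy as np
    from sklearn.naive_bayes import BernoulliNB

    # Placeholder binary document vectors and class labels, for illustration only.
    X_train = np.array([[1, 0, 0, 1, 1, 1, 0, 1],
                        [0, 1, 1, 0, 1, 0, 1, 0],
                        [1, 0, 1, 1, 0, 1, 0, 1],
                        [0, 1, 1, 0, 0, 0, 1, 0]])
    y_train = np.array(["S", "I", "S", "I"])

    # alpha is the Laplace smoothing parameter (the hand calculation above uses unsmoothed counts).
    clf = BernoulliNB(alpha=1.0)
    clf.fit(X_train, y_train)

    b1 = np.array([[1, 0, 0, 1, 1, 1, 0, 1]])
    print(clf.predict(b1))          # predicted class label for the test document
    print(clf.predict_proba(b1))    # posterior probabilities P(class | b1)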
Summary
- Using probabilities can sometimes be more effective than using hard rules for classification.
- Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.
- You can reduce the need for a lot of data by assuming conditional independence among the features in your data. For text, the assumption is that the probability of one word does not depend on any other word in the document.

Summary
- Despite its independence assumption being incorrect in practice, Naïve Bayes is effective at classification.
- Underflow is one problem that can be addressed by using the logarithm of probabilities in your calculations.

Question & Answer

Thank You!!!

Assignment Three
- Predict the outcome for the following tuple:
  x′ = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)
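On the underflow point from the summary: multiplying many small probabilities (as in the worked examples above, or in the assignment) can underflow to zero, so in practice the products are replaced by sums of log-probabilities. A minimal sketch using the Example One numbers; the variable names are illustrative only.

    import math

    # Class-conditional probabilities and priors taken from Example One.
    likelihoods_yes = [0.222, 0.444, 0.667, 0.667]   # P(feature value | buys_computer = yes)
    likelihoods_no  = [0.600, 0.400, 0.200, 0.400]   # P(feature value | buys_computer = no)
    prior_yes, prior_no = 0.643, 0.357

    # Work in log space: log(a*b) = log(a) + log(b), which avoids underflow
    # when many small probabilities are multiplied together.
    log_score_yes = math.log(prior_yes) + sum(math.log(p) for p in likelihoods_yes)
    log_score_no  = math.log(prior_no)  + sum(math.log(p) for p in likelihoods_no)

    print(log_score_yes, log_score_no)               # approx. -3.57 vs. -4.98
    prediction = "yes" if log_score_yes > log_score_no else "no"
    print("buys_computer =", prediction)             # -> yes, matching the hand calculation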