
Lecture 4: Classification (Naïve Bayes)

Classification: Naïve Bayes
Haramaya University
College of Computing and Informatics
Department of Software Engineering
Mr. Dita Abdujebar (M.Sc.)
Outline
• What is Naïve Bayes
• Pros and Cons of Naïve Bayes
• Probability theory
• Conditional probability
• Naïve Bayes classification
Probability Theory: Naïve Bayes
• In both kNN and decision trees, we asked the classifier to make hard decisions.
• We asked for a definite answer to the question.
• It is often better to ask for the classifier's best guess about the class, together with a probability estimate for that guess.
• Probability theory forms the basis for many machine learning algorithms.
• Probability theory can help us classify things.
Probability Theory: Naïve Bayes
Classifying with Bayesian Decision Theory:
• Pros:
  • Works with a small amount of data; handles multiple classes.
• Cons:
  • Sensitive to how the input data is prepared.
• Works with:
  • Nominal values
Probability Theory: Naïve Bayes
• Naïve Bayes is a subset of Bayesian decision theory.
• For a problem like this, a decision tree would not be very successful, and kNN would require a lot of calculation compared to the simple probability calculation.
• Conditional probability:
  • P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
Probability Theory: Naïve Bayes
Figure 1: Seven stones in two buckets
Probability Theory: Naïve Bayes
• Conditional probability:
  • Calculating the probability of a gray stone, given that the unknown stone comes from bucket B:
  • P(gray | bucket B) = 1/3
  • P(gray | bucket A) = 2/4
• To formalize how to calculate the conditional probability, we can say:
  • P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
  • P(gray and bucket B) = 1/7 (one gray stone in B out of seven stones in total)
  • P(bucket B) = 3/7 (three of the seven stones are in bucket B)
Probability Theory: Naïve Bayes
• Conditional probability:
  • P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
  • P(gray | bucket B) = (1/7) / (3/7)
  • P(gray | bucket B) = 1/3
• Another useful way to manipulate conditional probabilities is known as Bayes' rule.
• If we have P(x | c) but want P(c | x), Bayes' rule gives:
  • P(c | x) = P(x | c) P(c) / P(x)
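
A quick check of these numbers in Python. This is a minimal sketch of the bucket example, assuming the split shown in Figure 1 (four stones in bucket A, two of them gray; three stones in bucket B, one of them gray); the variable names are mine:

# Figure 1 setup (assumed): bucket A = 2 gray + 2 black stones,
# bucket B = 1 gray + 2 black stones; 7 stones in total.
p_gray_and_B = 1 / 7   # one of the seven stones is gray AND in bucket B
p_B = 3 / 7            # three of the seven stones are in bucket B
p_gray = 3 / 7         # three of the seven stones are gray

# Conditional probability: P(gray | bucket B) = P(gray and bucket B) / P(bucket B)
p_gray_given_B = p_gray_and_B / p_B
print(p_gray_given_B)  # 0.333... = 1/3, matching the slide

# Bayes' rule turns P(gray | bucket B) into P(bucket B | gray):
p_B_given_gray = p_gray_given_B * p_B / p_gray
print(p_B_given_gray)  # 0.333... = 1/3 (one of the three gray stones is in B)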
Classifying with Conditional Probabilities
• Bayesian decision theory tells us to compare two probabilities:
  • If p1(x, y) > p2(x, y), then the class is 1.
  • If p1(x, y) < p2(x, y), then the class is 2.
• What we really need to compare are p(c1 | x, y) and p(c2 | x, y):
  • Given a point identified as (x, y), what is the probability it came from class c1?
  • What is the probability it came from class c2?
Classifying with Conditional Probabilities
• Bayes' rule gives p(ci | x, y) = p(x, y | ci) p(ci) / p(x, y); in words:
• Posterior = (likelihood × prior) / evidence
• With these definitions, we can state the Bayesian classification rule:
  • If P(c1 | x, y) > P(c2 | x, y), the class is c1.
  • If P(c1 | x, y) < P(c2 | x, y), the class is c2.
Uses of Naïve Bayes Classification
• Applications of Naïve Bayes:
  • Naïve Bayes text classification
  • Spam filtering
  • Hybrid recommender systems (collaborative and content-based filtering)
  • Online applications
• Bayesian reasoning is applied to decision making and to inferential statistics that deal with probabilistic inference.
• It uses knowledge of prior events to predict future events.
Example One
Figure 2: Example training data
Example One
• X = (age = youth, income = medium, student = yes, credit_rating = fair)
• Will a person described by tuple X buy a computer?
• Maximum a posteriori (MAP) hypothesis:
  • P(Ci | X) = P(X | Ci) P(Ci) / P(X)
  • It suffices to maximize P(X | Ci) P(Ci), since P(X) is constant across classes.
Example One
• P(C1 = yes) = P(buys_computer = yes) = 9/14 = 0.643
• P(C2 = no) = P(buys_computer = no) = 5/14 = 0.357
• P(age = youth | buys_computer = yes) = 2/9 = 0.222
• P(age = youth | buys_computer = no) = 3/5 = 0.600
• P(income = medium | buys_computer = yes) = 4/9 = 0.444
• P(income = medium | buys_computer = no) = 2/5 = 0.400
• P(student = yes | buys_computer = yes) = 6/9 = 0.667
• P(student = yes | buys_computer = no) = 1/5 = 0.200
• P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
• P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400
Example One
• P(X | buys_computer = yes) = P(age = youth | yes) × P(income = medium | yes) × P(student = yes | yes) × P(credit_rating = fair | yes)
• P(X | buys_computer = yes) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
• P(X | buys_computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
Example One
• Find the class Ci that maximizes P(X | Ci) × P(Ci):
• P(X | buys_computer = yes) × P(buys_computer = yes)
  • = 0.044 × 0.643 = 0.028
• P(X | buys_computer = no) × P(buys_computer = no)
  • = 0.019 × 0.357 = 0.007
• Prediction: the person described by tuple X buys a computer.
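
The same arithmetic can be checked with a short Python sketch; the probability values come from the slides above, while the dictionary layout and variable names are mine:

# Priors and per-attribute conditionals read off the training data (Figure 2).
priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {
    "yes": {"age=youth": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no":  {"age=youth": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}
x = ["age=youth", "income=medium", "student=yes", "credit_rating=fair"]

# Naive Bayes score P(X | Ci) * P(Ci): conditional independence lets us
# write P(X | Ci) as a product of per-attribute conditionals.
scores = {}
for c in priors:
    score = priors[c]
    for attribute in x:
        score *= likelihoods[c][attribute]
    scores[c] = score

print(scores)                        # {'yes': 0.028..., 'no': 0.006...}
print(max(scores, key=scores.get))   # 'yes': tuple X buys a computer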
Example Two
• Consider a set of documents, each of which is related either to Sports (S) or to Informatics (I).
• Given a training set of 11 documents, we would like to estimate a Naïve Bayes classifier, using the Bernoulli document model, to classify unlabelled documents as S or I.
• We define a vocabulary of eight words:
Example Two
• Types of Naïve Bayes include the Bernoulli, multinomial, and Gaussian models; this example uses the Bernoulli document model.
Example Two
Figure 3: Vocabulary of eight words
Example Two
• Thus each document is represented as an 8-dimensional binary vector.
• The training data is given as a matrix for each class, in which each row represents an 8-dimensional document vector.
Example Two
• Classify the following document vectors as Sports or Informatics using a Naïve Bayes classifier:
  • b1 = (1, 0, 0, 1, 1, 1, 0, 1) → S or I?
  • b2 = (0, 1, 1, 0, 1, 0, 1, 0) → S or I?
Example Two
• The total number of documents in the training set is N = 11, with N_S = 6 Sports documents and N_I = 5 Informatics documents.
• We can estimate the prior probabilities from the training data as:
  • P(S) = N_S / N = 6/11
  • P(I) = N_I / N = 5/11
Example Two
• The word counts in the training data, i.e. the number of documents of each class that contain each vocabulary word w_t, are:
  • n_S(w_t) = (3, 1, 2, 3, 3, 4, 2, 4)
  • n_I(w_t) = (1, 3, 3, 1, 1, 1, 3, 1)
Example Two
• We can estimate the word likelihoods as P(w_t | C) = n_C(w_t) / N_C, the fraction of class-C training documents that contain word w_t.
• For class S this gives P(w_t | S) = (1/2, 1/6, 1/3, 1/2, 1/2, 2/3, 1/3, 2/3).
Example Two
• The word likelihood for class I: P(w_t | I) = (1/5, 3/5, 3/5, 1/5, 1/5, 1/5, 3/5, 1/5).
Example Two
• We now compute the posterior probabilities of the two test vectors and hence classify them. Under the Bernoulli model, each word that is present contributes P(w_t | C) and each word that is absent contributes 1 − P(w_t | C).
• b1 = (1, 0, 0, 1, 1, 1, 0, 1)
• P(S | b1) ∝ P(b1 | S) × P(S)
  • = (1/2 × 5/6 × 2/3 × 1/2 × 1/2 × 2/3 × 1/3 × 2/3) × (6/11)
  • = 5/891 ≈ 5.6 × 10⁻³
• P(I | b1) ∝ P(b1 | I) × P(I)
  • = (1/5 × 2/5 × 2/5 × 1/5 × 1/5 × 1/5 × 2/5 × 1/5) × (5/11)
  • = 8/859375 ≈ 9.3 × 10⁻⁶
• Since 5.6 × 10⁻³ > 9.3 × 10⁻⁶, we classify this document as S.
Example Two
• Similarly for the second test vector:
• b2 = (0, 1, 1, 0, 1, 0, 1, 0)
• P(S | b2) ∝ P(b2 | S) × P(S)
  • = (1/2 × 1/6 × 1/3 × 1/2 × 1/2 × 1/3 × 2/3 × 1/3) × (6/11)
  • = 1/3564 ≈ 2.8 × 10⁻⁴
• P(I | b2) ∝ P(b2 | I) × P(I)
  • = (4/5 × 3/5 × 3/5 × 4/5 × 1/5 × 4/5 × 3/5 × 4/5) × (5/11)
  • = 34560/4296875 ≈ 8.0 × 10⁻³
• Since 8.0 × 10⁻³ > 2.8 × 10⁻⁴, we classify this document as I.
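
The whole Bernoulli computation fits in a few lines of NumPy. This sketch reuses the priors and word likelihoods estimated above; the function and array names are mine:

import numpy as np

# Priors and per-word likelihoods estimated from the training data.
p_S, p_I = 6 / 11, 5 / 11
pw_S = np.array([1/2, 1/6, 1/3, 1/2, 1/2, 2/3, 1/3, 2/3])
pw_I = np.array([1/5, 3/5, 3/5, 1/5, 1/5, 1/5, 3/5, 1/5])

def bernoulli_score(b, pw, prior):
    # Unnormalised posterior under the Bernoulli model: a word that is
    # present contributes P(w | C), an absent one contributes 1 - P(w | C).
    return np.prod(np.where(b == 1, pw, 1 - pw)) * prior

b1 = np.array([1, 0, 0, 1, 1, 1, 0, 1])
b2 = np.array([0, 1, 1, 0, 1, 0, 1, 0])
for b in (b1, b2):
    s = bernoulli_score(b, pw_S, p_S)
    i = bernoulli_score(b, pw_I, p_I)
    print("S" if s > i else "I", s, i)   # b1 -> S, b2 -> I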
Naïve Bayes: Syntax
• Import the class containing the classification method:
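
The code from the original slide was not preserved; what follows is a typical scikit-learn sketch of the workflow (BernoulliNB is scikit-learn's Bernoulli Naïve Bayes class; the toy data is made up for illustration):

from sklearn.naive_bayes import BernoulliNB  # MultinomialNB and GaussianNB also exist

# Toy training data: binary word-occurrence vectors with class labels.
X_train = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
y_train = ["S", "S", "I", "I"]

model = BernoulliNB()              # uses Laplace smoothing (alpha=1.0) by default
model.fit(X_train, y_train)
print(model.predict([[1, 0, 0]]))  # predicted class for a new document

Note that, unlike the hand computation above, BernoulliNB smooths its likelihood estimates by default, so its probabilities will differ slightly from the unsmoothed counts.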
Summary
• Using probabilities can sometimes be more effective than using hard rules for classification.
• Bayesian probability and Bayes' rule give us a way to estimate unknown probabilities from known values.
• You can reduce the need for a lot of data by assuming conditional independence among the features in your data.
• For text, the assumption we make is that the probability of one word does not depend on any other word in the document.
Summary
• Despite its incorrect independence assumption, naïve Bayes is effective at classification.
• Underflow is one problem; it can be addressed by using the logarithm of probabilities in your calculations, as sketched below.
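
A minimal sketch of the logarithm trick (the numbers are illustrative only): instead of multiplying many small probabilities, sum their logarithms; the sums preserve the ranking of the classes.

import math

# Multiplying 200 factors of 0.01 gives 1e-400, which is below the
# smallest representable float64 value and underflows to exactly 0.0.
probs = [0.01] * 200
product = 1.0
for p in probs:
    product *= p
print(product)     # 0.0 -- the true value has been lost to underflow

# In log space the same computation is a harmless sum:
# log(a * b) = log(a) + log(b)
log_score = sum(math.log(p) for p in probs)
print(log_score)   # about -921.03; comparing log-scores still ranks classes correctly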
Question & Answer
Thank You !!!
Assignment Three
• Predict the outcome for the following instance: x' = (Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong)