Naive Bayes Classifiers

Naive Bayes Classifiers,
an Overview
By Roozmehr Safi
What is Naive Bayes Classifier (NBC)?
• NBC is a probabilistic classification method.
• Classification (A.K.A. discrimination, or
supervised learning) is assigning new cases to
one of a the pre-defined classes given a
sample of cases for which the true classes are
• NBC is one of the oldest and simplest
classification methods.
Some NBC Applications
Credit scoring
Marketing applications
Employee selection
Image processing
Speech recognition
Search engines…
How does NBC Work?
• NBC applies Bayes’ theorem with (naive)
independence assumptions.
• A more descriptive term for it would be
"independent feature model".
How does NBC work, Cntd.
Let X1,…, Xm denote our features (Height, weight, foot size…), Y is the class number
(1 for men,2 for women), and C is the number of classes (2). The problem consists
of classifying the case (x1,…, xm) to the class c maximizing P(Y=c| X1=x1,…,
Xm=xm) over c=1,…, C. Applying Bayes’ rule gives:
P(Y=c| X1=x1,…, Xm=xm) = P(X1=x1,…, Xm=xm | Y=c)P(Y=c) / P(X1=x1,…, Xm=xm)
• Under the NB’s assumption of conditional independence, P(X1=x1,…, Xm=xm |
Y=c) is replaced by
And NB reduces the original problem to:
An example:
• P(Obserevd Height|Male) = a
• P(Observed Weight|Male) = b
• P(Observed Foot size|Male) = c
P(Male|observed case)≈ P(male) × a × b × C
• P(Observed Height|Female) = d
• P(Observed Weight|Female) = e
• P(Observed Foot size|Female) = f
P(Female|observed case)≈ P(Female) × d × e × f
Pick the one that is larger
NBC advantage
• Despite unrealistic assumption of independence,
NBC is remarkably successful even when
independence is violated.
• Due to its simple structure the NBC it is
appealing when the set of variables is large.
• NBC requires a small amount of training data:
– It only needs to estimate means and variances of the
– No need to form the covariance matrix.
– Computationally inexpensive.
A Demonstration
Data: From an online B2B exchange (1220 cases).
Purpose: To distinguish cheaters of good sellers.
• Member type: Enterprise, personal, other
• Years since joined: 1 to 10 years.
• No. of months since last membership renewal
• Membership Renewal duration.
• Type of service bought: standard, limited edition…
• If the member has a registered company.
• If the company page is decorated.
• Number of days in which member logged in during past 60 days.
• Industry: production, distribution, investment…
Target: to predict if a seller is likely to cheat buyers based on data from old sellers.
Issues involved: Prob. distribution
• With discrete (categorical) features, estimating
the probabilities can be done using frequency
• With continuous features one can assume a
certain form of quantitative probability
• There is evidence that discretization of data
before applying NB is effective.
• Equal Frequency Discretization (EFD) divides the
sorted values of a continuous variable into k
equally populated bins.
Issues involved: Zero probabilities
• The case when a class and a feature value
never occur together in the training set
creates a problem, because assigning a
probability of zero to one of the terms causes
the whole expression to evaluate to zero.
• The zero probability can be replaced by a
small constant, such as 0.5/n where n is the
number of observations in the training set.
Issues involved: Missing values
• In some applications, values are missing not at
random and can be meaningful. Therefore,
missing values are treated as a separate
• If one does not want to treat missing values as
a separate category, they should be handled
prior to applying this macro with either a
missing value imputation or excluding cases
where they are present.
Thank you