Naive Bayes Classifiers, an Overview
By Roozmehr Safi

What is a Naive Bayes Classifier (NBC)?
• NBC is a probabilistic classification method.
• Classification (a.k.a. discrimination, or supervised learning) means assigning new cases to one of the pre-defined classes, given a sample of cases for which the true classes are known.
• NBC is one of the oldest and simplest classification methods.

Some NBC Applications
• Credit scoring
• Marketing applications
• Employee selection
• Image processing
• Speech recognition
• Search engines…

How does NBC Work?
• NBC applies Bayes' theorem with (naive) independence assumptions.
• A more descriptive term for it would be "independent feature model".

How does NBC Work, Cont'd
• Let X1, …, Xm denote our features (height, weight, foot size, …), let Y be the class label (1 for men, 2 for women), and let C be the number of classes (here, 2). The problem consists of assigning the case (x1, …, xm) to the class c that maximizes P(Y=c | X1=x1, …, Xm=xm) over c = 1, …, C. Applying Bayes' rule gives:

  P(Y=c | X1=x1, …, Xm=xm) = P(X1=x1, …, Xm=xm | Y=c) P(Y=c) / P(X1=x1, …, Xm=xm).

• Under NB's assumption of conditional independence, P(X1=x1, …, Xm=xm | Y=c) is replaced by the product P(X1=x1 | Y=c) × … × P(Xm=xm | Y=c). Since the denominator is the same for every class, NB reduces the original problem to finding the c that maximizes:

  P(Y=c) × P(X1=x1 | Y=c) × … × P(Xm=xm | Y=c).

An Example
• P(Observed Height | Male) = a
• P(Observed Weight | Male) = b
• P(Observed Foot size | Male) = c
P(Male | observed case) ∝ P(Male) × a × b × c
• P(Observed Height | Female) = d
• P(Observed Weight | Female) = e
• P(Observed Foot size | Female) = f
P(Female | observed case) ∝ P(Female) × d × e × f
* Pick the class with the larger score.

NBC Advantages
• Despite its unrealistic independence assumption, NBC is remarkably successful even when independence is violated.
• Due to its simple structure, NBC is appealing when the set of variables is large.
• NBC requires only a small amount of training data:
– It only needs to estimate the mean and variance of each variable.
– There is no need to form the covariance matrix.
– It is computationally inexpensive.

A Demonstration
Data: from an online B2B exchange (1,220 cases).
Purpose: to distinguish cheaters from good sellers.
Predictors:
• Member type: enterprise, personal, other
• Years since joining: 1 to 10 years
• Number of months since the last membership renewal
• Membership renewal duration
• Type of service bought: standard, limited edition…
• Whether the member has a registered company
• Whether the company page is decorated
• Number of days on which the member logged in during the past 60 days
• Industry: production, distribution, investment…
Target: to predict whether a seller is likely to cheat buyers, based on data from existing sellers.

Issues involved: Probability distribution
• With discrete (categorical) features, the probabilities can be estimated using frequency counts.
• With continuous features, one can assume a particular parametric form for the distribution (e.g., normal, so that only means and variances need to be estimated).
• There is evidence that discretizing the data before applying NB is effective.
• Equal Frequency Discretization (EFD) divides the sorted values of a continuous variable into k equally populated bins.

Issues involved: Zero probabilities
• When a class and a feature value never occur together in the training set, a problem arises: assigning a probability of zero to one of the terms causes the whole product to evaluate to zero.
• The zero probability can be replaced by a small constant, such as 0.5/n, where n is the number of observations in the training set. Both the frequency-count estimation and this fix are illustrated in the sketch below.

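To make the pieces above concrete, here is a minimal, self-contained Python sketch; it is not part of the original slides, and the toy training data are invented for illustration. It estimates the priors and conditional probabilities by frequency counts on categorical features, substitutes 0.5/n for any zero probability, and picks the class maximizing P(Y=c) × P(X1=x1 | Y=c) × … × P(Xm=xm | Y=c).

```python
from collections import Counter, defaultdict

# Toy training set in the spirit of the slides' height/weight/foot-size
# example; the values below are invented for illustration.
train = [
    ({"height": "tall",   "weight": "heavy", "foot": "large"},  "Male"),
    ({"height": "tall",   "weight": "heavy", "foot": "large"},  "Male"),
    ({"height": "medium", "weight": "heavy", "foot": "medium"}, "Male"),
    ({"height": "short",  "weight": "light", "foot": "small"},  "Female"),
    ({"height": "medium", "weight": "light", "foot": "small"},  "Female"),
    ({"height": "short",  "weight": "light", "foot": "medium"}, "Female"),
]

n = len(train)
class_counts = Counter(label for _, label in train)

# counts[c][feature][value] = number of training cases of class c in which
# `feature` takes `value` (the frequency counts from the slides).
counts = defaultdict(lambda: defaultdict(Counter))
for features, label in train:
    for feat, val in features.items():
        counts[label][feat][val] += 1

def classify(case):
    """Return the class c maximizing P(Y=c) * prod_i P(Xi=xi | Y=c)."""
    best_class, best_score = None, 0.0
    for c, n_c in class_counts.items():
        score = n_c / n  # prior P(Y=c), estimated by frequency
        for feat, val in case.items():
            k = counts[c][feat][val]
            # Zero-probability fix from the slides: replace a zero
            # probability with the small constant 0.5/n.
            score *= (k / n_c) if k > 0 else 0.5 / n
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify({"height": "tall", "weight": "light", "foot": "large"}))
```

On this toy data the call prints "Male": the combination (weight = light, Male) never occurs in training, but the 0.5/n substitute keeps the product from collapsing to zero, and the remaining factors dominate.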
Issues involved: Missing values
• In some applications, values are missing not at random and can therefore be meaningful. Accordingly, missing values are treated as a separate category.
• If one does not want to treat missing values as a separate category, they should be handled prior to applying this macro, either by imputing them or by excluding the cases in which they are present. (A short sketch of both options follows the closing slide.)

Thank you
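Appendix: a minimal sketch of the two missing-value options above, assuming the data sit in a pandas DataFrame (an assumption; the macro mentioned in the slides presumably runs in a different environment). The column names echo two of the demonstration's predictors, but the values are invented.

```python
import pandas as pd

# Toy data echoing two of the demonstration's predictors; values invented.
df = pd.DataFrame({
    "member_type": ["enterprise", None, "personal", "other"],
    "industry": ["production", "distribution", None, "investment"],
})

# Option 1 (the default described in the slides): treat missing values as
# a separate, potentially meaningful category before forming the counts.
as_category = df.fillna("missing")

# Option 2: exclude the cases in which missing values are present.
# (Imputation, e.g. with the column mode, would be the other alternative.)
complete_cases = df.dropna()

print(as_category)
print(complete_cases)
```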