Classification: Naïve Bayes
Business Intelligence

Naïve Bayes: The concept
• Bayes' Theorem is used to calculate a conditional probability in the presence of some information.
• The conditional probability is typically of the following form:
  Pr(C|X1, X2, X3, …) = Pr(X1, X2, X3, …|C)·Pr(C) / [Pr(X1, X2, X3, …|C)·Pr(C) + Pr(X1, X2, X3, …|C̄)·Pr(C̄)]
  where Pr(C|X1, X2, X3, …) is the probability of event C in the presence of the conditions/information X1, X2, X3, …, and C̄ is the complement event of C.
• Example: Let C denote the event that the 405 will be moving really slowly, without any prior information. From prior experience we might estimate Pr(C) = 30%. Now let X1, X2, and X3 denote the given information that it is raining, there is an accident, and one of the left lanes is closed. Clearly, the probability of C given this new information will change dramatically, and Bayes' Theorem provides a precise way to calculate it.

Naïve Bayes contd.
• However, with many pieces of conditional information present, calculating the posterior probability with Bayes' Theorem can be very involved.
• We then use a simplified version of Bayes' Theorem based on the assumption that the conditions are independent given the class.
• In this case we use the formula:
  P(C|X1, X2, X3, …) = P(X1|C)·P(X2|C)·P(X3|C)⋯P(C) / [P(X1|C)·P(X2|C)·P(X3|C)⋯P(C) + P(X1|C̄)·P(X2|C̄)·P(X3|C̄)⋯P(C̄)]

Naïve Bayes: Example
• We have a list with the following information about companies: their size, their audit status, and whether charges were filed against them.
Charge Filed   Company Size   Status
y              small          truthful
n              small          truthful
n              large          truthful
n              large          truthful
n              small          truthful
n              small          truthful
y              small          fraudulent
y              large          fraudulent
n              large          fraudulent
y              large          fraudulent

Count of Status, Company Size = large:
                Charge Filed
Status          n      y      Total
fraudulent      1      2      3
truthful        2      0      2
Grand Total     3      2      5

Count of Status, Company Size = small:
                Charge Filed
Status          n      y      Total
fraudulent      0      1      1
truthful        3      1      4
Grand Total     3      2      5

Naïve Bayes
• Suppose we want the probability that a company will be fraudulent given that it is small in size and there is a charge filed against it, i.e., P(fraudulent|size = small, charges = y).
• From the crosstab/pivot tables we can see that this probability = 1/2 (there are 2 companies that are small and have charges filed against them, and 1 of them is fraudulent).
• Similarly, P(fraudulent|small, n) = 0/3 = 0
• P(fraudulent|large, y) = 2/2 = 1
• P(fraudulent|large, n) = 1/3 ≈ 0.33
• Using Naïve Bayes we get the following:
  P(fraudulent|small, y) = P(small|fraudulent)·P(y|fraudulent)·P(fraudulent) / [P(small|fraudulent)·P(y|fraudulent)·P(fraudulent) + P(small|truthful)·P(y|truthful)·P(truthful)]
  = (1/4)·(3/4)·(4/10) / [(1/4)·(3/4)·(4/10) + (4/6)·(1/6)·(6/10)]
  = 0.53, which is very close to the 0.5 value we had from the exact calculation!

Naïve Bayes: Flight Delay Example
• Let us use XLMiner to create the conditional probabilities (each of the individual probability items) for flight delay.
• We will use only Carrier, Day of the week, Dep Time in one-hour blocks, Destination, Origin, and Weather.
• Run XLMiner and look at the conditional probabilities for the training set.
• For any record in the validation set, the probability for classification is computed by multiplying the corresponding conditional probabilities and the prior probability of that particular class.
• Let us do two examples.

Examples
• Example 1: Record Details (row 633 in …NNBforlecture.xlsx)
Row Id: 626   Cum ontime: 2   Predicted Class: ontime   Actual Class: ontime   Prob. for ontime (success): 0.804686121
CARRIER: DH   DEP_TIME: 1640   DEST: JFK   DISTANCE: 213   ORIGIN: DCA   Weather: 0   DAY_WEEK: 4

• Multiply all the relevant conditional probabilities for ontime to get p1.
• Multiply all the relevant conditional probabilities for delayed to get p2.
• Weigh each of them with the corresponding prior class probability and add the two numbers (w1·p1 + w2·p2).
• Probability for class i = wi·pi / (w1·p1 + w2·p2).
• Classify based on whether this probability > the cut-off.

• Example 2: Record Details (row 610 in …NNBforlecture.xlsx)
Row Id: 194   Predicted Class: ontime   Actual Class: delayed   Prob. for ontime (success): 0.846384259
CARRIER: MQ   DEP_TIME: 1936   DEST: LGA   DISTANCE: 214   ORIGIN: DCA   Weather: 0   DAY_WEEK: 7

Details
• Record 1. Let us list the conditions and the corresponding conditional probabilities for ontime (I used a VLOOKUP on the conditional probability tables given by XLMiner, extracted from the ontime side):

Input Variable   Condition   Ontime Conditional Prob.
CARRIER          DH          0.243192
DEP_TIME         1640        0.004695
DEST             JFK         0.176526
DISTANCE         213         0.187793
ORIGIN           DCA         0.635681
Weather          0           1.000000
DAY_WEEK         4           0.159624

• Calculate p1 by multiplying the numbers above.
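The multiplication step for record 1 can be sketched in a few lines of Python (a minimal illustration; the dictionary keys are labels of my choosing, and the probabilities are the ones in the table above):

```python
# Multiply the ontime conditional probabilities for record 1 (Row Id 626)
# to get p1, as described on the slide.
cond_probs_ontime = {
    "CARRIER=DH":    0.243192,
    "DEP_TIME=1640": 0.004695,
    "DEST=JFK":      0.176526,
    "DISTANCE=213":  0.187793,
    "ORIGIN=DCA":    0.635681,
    "Weather=0":     1.000000,
    "DAY_WEEK=4":    0.159624,
}

p1 = 1.0
for prob in cond_probs_ontime.values():
    p1 *= prob

print(f"p1 = {p1:.5E}")  # agrees with the 3.84059E-06 reported below, up to rounding
```

(In Python 3.8+ the loop can be replaced with `math.prod(cond_probs_ontime.values())`.)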
• For p1·w1, multiply p1 = 3.84059E-06 by w1 = 0.806207.

Conditional probabilities from the XLMiner training output (DAY_WEEK is not listed, to save space):

Variable                     Value   P(value|ontime)   P(value|delayed)
CARRIER                      CO      0.0384977         0.06640625
                             DH      0.2431925         0.33984375
                             DL      0.2               0.109375
                             MQ      0.1126761         0.1796875
                             OH      0.0178404         0.01171875
                             RU      0.170892          0.21484375
                             UA      0.0169014         0.0078125
                             US      0.2               0.0703125
DEP_TIME (partially shown)   548     0.000939          0
                             550     0.000939          0
                             552     0.0018779         0.00390625
                             553     0.0056338         0
DEST                         EWR     0.2835681         0.38671875
                             JFK     0.1765258         0.1875
                             LGA     0.5399061         0.42578125
DISTANCE                     169     0.0507042         0.08203125
                             184     0.0178404         0.01171875
                             199     0.1079812         0.11328125
                             213     0.1877934         0.26953125
                             214     0.4647887         0.29296875
                             228     0.0957746         0.09765625
                             229     0.0751174         0.1328125
ORIGIN                       BWI     0.0685446         0.09375
                             DCA     0.6356808         0.484375
                             IAD     0.2957746         0.421875
Weather                      0       1                 0.92578125
                             1       0                 0.07421875

Prior class probabilities (according to relative occurrences in the training data):
Class     Prob.
ontime    0.806207419   <-- Success class
delayed   0.193792581

Details
• By following the exact same method for p2 and the subsequent calculations of p1·w1 and p2·w2, we can easily get the following results:

          pi            pi·wi          Conditional probability
Ontime    3.84059E-06   3.09631E-06    0.804686121
Delayed   3.87805E-06   7.51538E-07    0.195313879
Sum                     3.84785E-06

Cut-off prob. value for success (updatable): 0.5
Record 1 (Row Id 626: DH, 1640, JFK, 213, DCA, 0, 4) is therefore classified as ontime (0.8047 > 0.5), matching its actual class, ontime.

• Verify the results for record 2.
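The weighting step above can be sketched as follows (a minimal illustration using the p1, p2, and prior values from the tables; variable names are mine):

```python
# Combine the class likelihoods with the prior class probabilities
# (w1*p1 + w2*p2) and classify against the cut-off, as on the slide.
p1, p2 = 3.84059e-06, 3.87805e-06   # products of conditional probabilities
w1, w2 = 0.806207419, 0.193792581   # prior class probabilities (ontime, delayed)

total = w1 * p1 + w2 * p2
prob_ontime = w1 * p1 / total       # probability for the success class

cutoff = 0.5
predicted_class = "ontime" if prob_ontime > cutoff else "delayed"
print(prob_ontime, predicted_class)  # approx. 0.8047, above the cut-off
```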
(The conditional probability and prior tables shown for record 1 are repeated on this slide for reference.)

Notes
• Quite simple and useful.
• Better than the exact Bayes approach, because all combinations may not be present in the data (exact Bayes will fail, as there will be no conditional probability for that particular combination).
• However, it is dependent on the data and can thus give erroneous results for small data sets.
• If an association makes sense but is not present in the data, the classification scheme will not work.
  – Example: Yacht owners may be targets for high-value life insurance. However, the collected data has no incidence of high-value life insurance!
• Next: Other classification schemes!
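To make the notes concrete, here is a minimal pure-Python sketch of the full Naïve Bayes calculation on the toy fraud data from the earlier slides. The `alpha` parameter is an addition of mine, not something the slides use: it applies Laplace smoothing, a standard guard against zero conditional probabilities (like the zeros in the delayed DEP_TIME column) that the notes warn about.

```python
from collections import Counter

# Toy fraud data from the earlier slides: (company size, charge filed, status).
records = [
    ("small", "y", "truthful"),   ("small", "n", "truthful"),
    ("large", "n", "truthful"),   ("large", "n", "truthful"),
    ("small", "n", "truthful"),   ("small", "n", "truthful"),
    ("small", "y", "fraudulent"), ("large", "y", "fraudulent"),
    ("large", "n", "fraudulent"), ("large", "y", "fraudulent"),
]

def p_fraudulent(size, charge, alpha=0.0):
    """Naive Bayes P(fraudulent | size, charge); alpha > 0 adds Laplace
    smoothing, while alpha = 0 reproduces the slide's calculation."""
    n_class = Counter(status for _, _, status in records)
    score = {}
    for c in ("fraudulent", "truthful"):
        prior = n_class[c] / len(records)
        # Each attribute has 2 possible values, hence the 2*alpha denominator.
        p_size = (sum(1 for s, ch, st in records if st == c and s == size)
                  + alpha) / (n_class[c] + 2 * alpha)
        p_charge = (sum(1 for s, ch, st in records if st == c and ch == charge)
                    + alpha) / (n_class[c] + 2 * alpha)
        score[c] = prior * p_size * p_charge
    return score["fraudulent"] / (score["fraudulent"] + score["truthful"])

print(round(p_fraudulent("small", "y"), 2))  # 0.53, matching the slide
# Exact Bayes gave P(fraudulent|small, n) = 0/3 = 0; Naive Bayes still
# produces a small nonzero estimate, illustrating the second note above.
print(round(p_fraudulent("small", "n"), 2))
```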