Classification Methods: k-Nearest Neighbor, Naïve Bayes
Ram Akella
Lecture 4, February 9, 2011
UC Berkeley Silicon Valley Center/SC

Overview
- Example
- The Naïve rule
- Two data-driven methods (no model):
  - K-nearest neighbors
  - Naïve Bayes

Example: Personal Loan Offer
- As part of its customer acquisition efforts, Universal Bank wants to run a campaign for current customers to purchase a loan.
- To improve target marketing, the bank wants to find the customers who are most likely to accept the personal loan offer.
- It uses data from a previous campaign on 5000 customers, 480 of whom accepted.

Personal Loan Data Description
Variable             Description
ID                   Customer ID
Age                  Customer's age in completed years
Experience           Number of years of professional experience
Income               Annual income of the customer ($000)
ZIPCode              Home address ZIP code
Family               Family size of the customer
CCAvg                Average spending on credit cards per month ($000)
Education            Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage             Value of house mortgage, if any ($000)
Personal Loan        Did this customer accept the personal loan offered in the last campaign?
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank?
Online               Does the customer use internet banking facilities?
CreditCard           Does the customer use a credit card issued by UniversalBank?
File: "UniversalBank KNN NBayes.xls"

The Naïve Rule
- Classify a new observation as a member of the majority class.
- In the personal loan example, the majority of customers did not accept the loan.

K-Nearest Neighbor: Idea
- Find the k closest records to the one to be classified, and let them "vote".
[Figure: scatter plot of Age (0 to 100) against Income ($0 to $80,000), with points labeled "Regular beer" and "Light beer".]

What does the algorithm do?
- Computes the distance between the record to be classified and each of the records in the training set.
- Finds the k shortest distances.
- Computes the vote of these k neighbors.
- Repeats this for every record in the validation set.

Experiment
- We have 100 training points: 60 pink and 40 blue.
- We then have 50 test points. For each point, the 5 nearest neighbors vote.
- How do we measure how well the classifier did? We compare the predicted value with the actual value for each of the 50 points in the validation/test set.

Distance between 2 observations
- Single-variable case: each item has 1 value. Customer 1 has income = 49K.
- Multivariate case: each observation is a vector of values.
  Customer1 = (Age=25, Exp=1, Income=49, …, CC=0)
  Customer2 = (Age=49, Exp=19, Income=34, …, CC=0)
- The distance between observations i and j is denoted $d_{ij}$.
- Distance requirements:
  - Non-negativity: $d_{ij} \ge 0$
  - $d_{ii} = 0$
  - Symmetry: $d_{ij} = d_{ji}$
  - Triangle inequality: $d_{ij} + d_{jk} \ge d_{ik}$

Types of Distances
Notation: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and $x_j = (x_{j1}, x_{j2}, \ldots, x_{jp})$
Example:
  Customer1 = (Age=25, Exp=1, Inc=49, Fam=4, CCAvg=1.6)
  Customer2 = (Age=49, Exp=19, Inc=34, Fam=3, CCAvg=1.5)

Euclidean Distance
$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{ip} - x_{jp})^2}$
- The Euclidean distance between the age of customer 1 (25) and customer 2 (49): $\sqrt{(25-49)^2} = 24$
- The Euclidean distance between the two on the 5 dimensions (Age, Exper, Income, Family, CCAvg): $\sqrt{(25-49)^2 + (1-19)^2 + (49-34)^2 + (4-3)^2 + (1.6-1.5)^2} = \sqrt{1126.01} \approx 33.56$

Which pair is closest?
          Carry     Sam       Miranda
Income    $31,779   $32,739   $33,880
Age       36        40        38
1. Carry & Sam
2. Sam & Miranda
3. Carry & Miranda
Poll responses (options 1 to 3): 55%, 27%, 18%
Carry & Sam: $\sqrt{(31779 - 32739)^2 + (36 - 40)^2} \approx 960.01$

Now, income is in $000. Which pair is closest?
          Carry     Sam       Miranda
Income    $31.779   $32.739   $33.880
Age       36        40        38
1. Carry & Sam
2. Sam & Miranda
3. Carry & Miranda
Poll responses (options 1 to 3): 12%, 84%, 4%
Sam & Miranda: $\sqrt{(32.739 - 33.880)^2 + (40 - 38)^2} = \sqrt{5.30} \approx 2.30$

Why do we need to standardize the variables?
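One way to see the issue is to recompute the quiz distances under both income units. The sketch below is a minimal illustration and not part of the original slides: the helper names (`euclidean`, `closest_pair`, `standardize`) are hypothetical, and z-scoring with the population standard deviation is an assumed choice of standardization.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# The two customers from the slides: (Age, Experience, Income, Family, CCAvg).
customer1 = (25, 1, 49, 4, 1.6)
customer2 = (49, 19, 34, 3, 1.5)
d_customers = euclidean(customer1, customer2)

# Quiz data as (income, age); the same people with income in $ and in $000.
in_dollars = {"Carry": (31779, 36), "Sam": (32739, 40), "Miranda": (33880, 38)}
in_thousands = {name: (income / 1000, age)
                for name, (income, age) in in_dollars.items()}


def closest_pair(points):
    """Return (name_a, name_b, distance) for the pair with the smallest distance."""
    candidates = ((a, b, euclidean(points[a], points[b]))
                  for a, b in combinations(points, 2))
    return min(candidates, key=lambda item: item[2])


def standardize(points):
    """z-score every coordinate (assumed choice: population standard deviation)."""
    columns = list(zip(*points.values()))
    means = [sum(col) / len(col) for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col))
            for col, m in zip(columns, means)]
    return {name: tuple((v - m) / s for v, m, s in zip(vec, means, stds))
            for name, vec in points.items()}


print("customer distance:", round(d_customers, 2))
print("closest, income in $:", closest_pair(in_dollars)[:2])
print("closest, income in $000:", closest_pair(in_thousands)[:2])
print("closest, standardized:", closest_pair(standardize(in_dollars))[:2])
```

With raw values the closest pair changes when the income unit changes; after standardizing, the answer no longer depends on whether income is recorded in dollars or in thousands.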
- The distance measure is influenced by the units of the different variables, especially if there is a wide variation in units.
- Variables with "larger" units will influence the distances more than others.
- The solution: standardize each variable before measuring distances!

Other distances
- Squared Euclidean distance
- Correlation-based distance: the correlation between two vectors of (standardized) items/observations, $r_{ij}$, measures their similarity. We can define a distance measure as $d_{ij} = 1 - r_{ij}^2$.
- Statistical distance (no need to standardize): $d_{ij} = \sqrt{(x_i - x_j)^T S^{-1} (x_i - x_j)}$, where $S$ is the covariance matrix. The only measure that accounts for covariance!
- Manhattan distance ("city-block"): $d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
- Note: some software uses "similarities" instead of "distances".

Distances for Binary Data
- Obtained from the 2x2 table of counts for a pair of records, for example Carrie and Miranda:
               Carrie = 0   Carrie = 1
Miranda = 0    a            b
Miranda = 1    c            d
[Table: binary attributes Married?, Smoker?, Manager? for Carrie, Sam, and Miranda, with the resulting 2x2 counts; the cell layout is not recoverable from this text.]

Choosing the number of neighbors (K)
- Too small: under-smoothing
- Too large: over-smoothing
- Typically k < 20
- K should be odd (to avoid ties)
- Solution: use the validation set to find the "best" k

We're using the validation data here to choose the best k.

Validation error log for different k
Value of k   % Error Training   % Error Validation
1            0.00               4.15
2            1.30               4.45
3            2.47               4.10
4            2.10               3.80   <--- Best k
5            3.40               4.50

Training Data scoring - Summary Report (for k=4)
Cut off Prob.Val. for Success (Updatable): 0.5
Classification Confusion Matrix
            Predicted 1   Predicted 0
Actual 1    243           43
Actual 0    20            2694
Error Report
Class     # Cases   # Errors   % Error
1         286       43         15.03
0         2714      20         0.74
Overall   3000      63         2.10

Validation Data scoring - Summary Report (for k=4)
Cut off Prob.Val. for Success (Updatable): 0.5
Classification Confusion Matrix
            Predicted 1   Predicted 0
Actual 1    134           60
Actual 0    16            1790
Error Report
Class     # Cases   # Errors   % Error
1         194       60         30.93
0         1806      16         0.89
Overall   2000      76         3.80

[Figure: Lift chart (validation dataset). Cumulative Personal Loan when sorted using predicted values, compared with cumulative Personal Loan using average, plotted against # cases.]

Advantages and Disadvantages of K nearest neighbors
The good
- Very flexible, data-driven
- Simple
- With a large amount of data, where predictor levels are well represented, has good performance
- Can also be used for continuous y: instead of voting, take the average of the neighbors (XLMiner: Prediction > KNN)
The bad
- No insight about the importance/role of each predictor
- Beware of over-fitting! Need a test set
- Can be computationally intensive for large k
- Need LOTS of data (exponential in #predictors)

Conditional Probability - reminder
- A = the event "customer accepts loan"
- B = the event "customer has credit card"
- $P(A \mid B)$ denotes the probability of A given B (the conditional probability that A occurs given that B occurred):
  $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, if $P(B) > 0$

Naïve Bayes
- Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining.
- It calculates the probability that a point E belongs to a certain class $C_i$ based on its attributes $(x_1, x_2, \ldots, x_n)$.
- It assumes that the attributes are conditionally independent given the class $C_i$.
[Diagram: class node C with arcs to the attribute nodes x1, x2, …, xn.]

Illustrative Example
The example E is represented by a set of attribute values $(x_1, x_2, \ldots, x_n)$, where $x_i$ is the value of attribute $X_i$.
Let C represent the classification variable, and let c be the value of C. In this example we assume that there are only two classes: + (the positive class) or − (the negative class). A classifier is a function that assigns a class label to an example. From the probability perspective, according to Bayes Rule, the probability of an example $E = (x_1, x_2, \ldots, x_n)$ being class c is
$p(c \mid E) = \frac{p(E \mid c)\, p(c)}{p(E)}$

Naïve Bayes Classifier
E is classified as the class C = + if and only if
$f_b(E) = \frac{p(C = + \mid E)}{p(C = - \mid E)} \ge 1$
where $f_b(E)$ is called a Bayesian classifier.
Assume that all attributes are independent given the value of the class variable, that is:
$p(E \mid c) = p(x_1, x_2, \ldots, x_n \mid c) = \prod_{i=1}^{n} p(x_i \mid c)$
The resulting classifier is
$f_{nb}(E) = \frac{p(C = +)}{p(C = -)} \prod_{i=1}^{n} \frac{p(x_i \mid C = +)}{p(x_i \mid C = -)}$
The function $f_{nb}(E)$ is called a naive Bayesian classifier, or simply naive Bayes (NB).

Augmented Naïve Bayes
- Naive Bayes is the simplest form of Bayesian network, in which all attributes are independent given the value of the class variable.
- This conditional independence assumption is rarely true in real-world applications.
- A straightforward approach to overcome this limitation of naive Bayes is to extend its structure to represent explicitly the dependencies among attributes.
- An augmented naive Bayes (ANB) is an extended classifier in which the class node directly points to all attribute nodes, and there exist links among attribute nodes.
- An ANB represents a joint probability distribution given by
  $p_G(x_1, \ldots, x_n, c) = p(c) \prod_{i=1}^{n} p(x_i \mid pa(x_i), c)$
  where $pa(x_i)$ denotes an assignment to values of the parents of $X_i$.
[Diagram: class node C with arcs to the attribute nodes x1, x2, …, xn-1, xn, plus arcs among attribute nodes.]

Why does this classifier work?
The basic idea comes from the following observation. In a given dataset, two attributes may depend on each other, but the dependence may distribute evenly in each class. Clearly, in this case, the conditional independence assumption is violated, but naive Bayes is still the optimal classifier. What eventually affects the classification is the combination of dependencies among all attributes. If we just look at two attributes, there may exist strong dependence between them that affects the classification.
When the dependencies among all attributes work together, however, they may cancel each other out and no longer affect the classification.

Definition 1: Given an example E, two classifiers $f_1$ and $f_2$ are said to be equal under zero-one loss on E if $f_1(E) \ge 1$ if and only if $f_2(E) \ge 1$, denoted by $f_1(E) \doteq f_2(E)$. If this holds for every example E in the example space, $f_1$ and $f_2$ are said to be equal under zero-one loss.

Local Dependence Distribution
Definition 2: For a node X on an ANB G, the local dependence derivatives of X in classes + and − are defined as:
$dd^+_G(x \mid pa(x)) = \frac{p(x \mid pa(x), +)}{p(x \mid +)}$
$dd^-_G(x \mid pa(x)) = \frac{p(x \mid pa(x), -)}{p(x \mid -)}$
where $dd^+_G(x \mid pa(x))$ reflects the strength of the local dependence of node X in class +. It measures the influence of X's local dependence on the classification in class +. $dd^-_G(x \mid pa(x))$ is similar for the negative class.
1. When X has no parent, $dd^+_G(x \mid pa(x)) = dd^-_G(x \mid pa(x)) = 1$.
2. When $dd^+_G(x \mid pa(x)) \ge 1$, X's local dependence in class + supports the classification of C = +. Otherwise, it supports the classification of C = −.
3. When $dd^-_G(x \mid pa(x)) \ge 1$, X's local dependence in class − supports the classification of C = −. Otherwise, it supports the classification of C = +.
When the local dependence derivatives in the two classes support different classifications, the local dependencies in the two classes partially cancel each other out. The final classification that the local dependence supports is the class with the greater local dependence derivative. In the other case, the local dependence derivatives in the two classes support the same classification. Then the local dependencies in the two classes work together to support that classification.

Local Dependence Derivative Ratio
Definition 3: For a node X on an ANB G, the local dependence derivative ratio at node X, denoted by $ddr_G(x)$, is defined by:
$ddr_G(x) = \frac{dd^+_G(x \mid pa(x))}{dd^-_G(x \mid pa(x))}$
$ddr_G(x)$ quantifies the influence of X's local dependence on the classification. We have:
1. If X has no parents, $ddr_G(x) = 1$.
2. If $dd^+_G(x \mid pa(x)) = dd^-_G(x \mid pa(x))$, then $ddr_G(x) = 1$. This means that X's local dependence distributes evenly in class + and class −. Thus, the dependence does not affect the classification, no matter how strong it is.
3. If $ddr_G(x) > 1$, X's local dependence in class + is stronger than that in class −. $ddr_G(x) < 1$ means the opposite.

Global Dependence Distribution
Let us explore under what condition an ANB works exactly the same as its correspondent naive Bayes.
Theorem 1: Given an ANB G and its correspondent naive Bayes $G_{nb}$ (i.e., remove all the arcs among attribute nodes from G) on attributes $X_1, X_2, \ldots, X_n$, assume that $f_b$ and $f_{nb}$ are the classifiers corresponding to G and $G_{nb}$, respectively. For a given example $E = (x_1, x_2, \ldots, x_n)$, the equation below is true:
$f_b(x_1, x_2, \ldots, x_n) = f_{nb}(x_1, x_2, \ldots, x_n) \prod_{i=1}^{n} ddr_G(x_i)$
where the product of $ddr_G(x_i)$ for i = 1, …, n is called the dependence distribution factor at example E, denoted by $DF_G(E)$.
Proof: By the definitions of $f_b$, $f_{nb}$, $dd^+_G$, and $dd^-_G$,
$f_b(x_1, \ldots, x_n) = \frac{p(+)}{p(-)} \prod_{i=1}^{n} \frac{p(x_i \mid pa(x_i), +)}{p(x_i \mid pa(x_i), -)} = \frac{p(+)}{p(-)} \prod_{i=1}^{n} \frac{p(x_i \mid +)}{p(x_i \mid -)} \cdot \frac{dd^+_G(x_i \mid pa(x_i))}{dd^-_G(x_i \mid pa(x_i))} = f_{nb}(E) \prod_{i=1}^{n} ddr_G(x_i) = DF_G(E)\, f_{nb}(E)$
Theorem 2: Given an example $E = (x_1, x_2, \ldots, x_n)$, an ANB G is equal to its correspondent naive Bayes $G_{nb}$ under zero-one loss if and only if when $f_b(E) \ge 1$, $DF_G(E) \le f_b(E)$; or when $f_b(E) < 1$, $DF_G(E) > f_b(E)$.
Applying Theorem 2, we have the following results:
1. When $DF_G(E) = 1$, the dependencies in ANB G have no influence on the classification. The classification of G is exactly the same as that of its correspondent naive Bayes $G_{nb}$. There exist three cases for $DF_G(E) = 1$:
   - no dependence exists among attributes;
   - for each attribute X on G, $ddr_G(x) = 1$; that is, the local dependence of each node distributes evenly in both classes;
   - the influence of the local dependencies that support classifying E into C = + is canceled out by the influence of the local dependencies that support classifying E into C = −.
2. $f_b(E) \doteq f_{nb}(E)$ does not require that $DF_G(E) = 1$. The precise condition is given by Theorem 2.
   That explains why naive Bayes still produces accurate classification even in datasets with strong dependencies among attributes (Domingos & Pazzani 1997).
3. The dependencies in an ANB flip (change) the classification of its correspondent naive Bayes only if the condition given by Theorem 2 is no longer true.

Conditions of the optimality of the Naïve Bayes
- The naive Bayes classifier is optimal if the dependencies among attributes cancel each other out.
- The classifier is still optimal even though the dependencies do exist.

Optimality of the Naïve Bayes
Example: We have two attributes $X_1$ and $X_2$, and assume that the class density is a multivariate Gaussian in both the positive and negative classes. That is:
$p(x_1, x_2 \mid +) = \frac{1}{2\pi |\Sigma^+|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu^+)^T (\Sigma^+)^{-1} (x - \mu^+)\right)$
$p(x_1, x_2 \mid -) = \frac{1}{2\pi |\Sigma^-|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu^-)^T (\Sigma^-)^{-1} (x - \mu^-)\right)$
where
- $x = (x_1, x_2)$;
- $\Sigma^+$ and $\Sigma^-$ are the covariance matrices in the positive and negative classes respectively;
- $|\Sigma^+|$ and $|\Sigma^-|$ are the determinants of $\Sigma^+$ and $\Sigma^-$;
- $(\Sigma^+)^{-1}$ and $(\Sigma^-)^{-1}$ are the inverses of $\Sigma^+$ and $\Sigma^-$;
- $\mu^+ = (\mu^+_1, \mu^+_2)$ and $\mu^- = (\mu^-_1, \mu^-_2)$, where $\mu^+_i$ and $\mu^-_i$ are the means of attribute $X_i$ in the positive and negative classes respectively;
- $(x - \mu^+)^T$ and $(x - \mu^-)^T$ are the transposes of $(x - \mu^+)$ and $(x - \mu^-)$.

We assume:
- The two classes have a common covariance matrix, $\Sigma^+ = \Sigma^- = \Sigma$.
- $X_1$ and $X_2$ have the same variance σ in both classes.
Then, when applying a logarithm to the Bayesian classifier defined previously, we obtain the following $f_b$ classifier:
[equation not preserved in this text]
Then, because of the conditional independence assumption, we have the correspondent naive Bayesian classifier $f_{nb}$:
[equation not preserved in this text]
Note that $X_1$ and $X_2$ are independent if $\sigma_{12} = 0$. If $\sigma \ne \sigma_{12}$, we have:
[equation not preserved in this text]
An example E is classified into the positive class by $f_b$ if and only if $f_b \ge 0$. $f_{nb}$ is similar. When $f_b$ or $f_{nb}$ is divided by a non-zero positive constant, the resulting classifier is the same as $f_b$ or $f_{nb}$. Then:
[equation not preserved in this text]
where $a = -\frac{1}{\sigma^2} (\mu^+ + \mu^-)\, \Sigma^{-1} (\mu^+ - \mu^-)$ is a constant independent of x.
For any $x_1$ and $x_2$, naive Bayes has the same classification as that of the underlying classifier if:
[equation not preserved in this text]
That is:
[equation not preserved in this text]
Under a further assumption (not preserved in this text), the condition can be simplified:
[equation not preserved in this text]
[Figure: the shaded area shows the region in which the Naïve Bayes classifier is optimal.]

Example with 2 predictors: CC, Online
$P(\text{accept}=1 \mid CC=1, Online=1) = \frac{P(CC=1, Online=1 \mid \text{accept}=1)\, P(\text{accept}=1)}{P(CC=1, Online=1 \mid \text{accept}=1)\, P(\text{accept}=1) + P(CC=1, Online=1 \mid \text{accept}=0)\, P(\text{accept}=0)}$
with $P(CC=1, Online=1 \mid \text{accept}=1) = 50/286$ and $P(\text{accept}=1) = 286/3000$.

Count of Personal Loan
CreditCard    Personal Loan   Online=0   Online=1   Grand Total
0             0               769        1163       1932
0             1               71         129        200
0 Total                       840        1292       2132
1             0               321        461        782
1             1               36         50         86
1 Total                       357        511        868
Grand Total                   1197       1803       3000

P(CC=1, Online=1 | accept=0) is approximately:
1. 50/286
2. 1 - 50/286
3. 461/3000
4. 461/(3000-286)
5. 129/(3000-286)

Example with 2 predictors: CC, Online
$P(\text{accept}=1 \mid CC=1, Online=1) = \frac{\frac{50}{286} \cdot \frac{286}{3000}}{\frac{50}{286} \cdot \frac{286}{3000} + \frac{461}{2714} \cdot \frac{2714}{3000}} = 0.0978$

The practical difficulty
- We need to have ALL the combinations of predictor categories:
  - CC=1, Online=1
  - CC=1, Online=0
  - CC=0, Online=1
  - CC=0, Online=0
- With many predictors, this is pretty unlikely.

Example with (only) 3 predictors: CC, Online, CD account
Count of Personal Loan
                              CD Account=0                  CD Account=1
CreditCard    Personal Loan   Online=0  Online=1  Total     Online=0  Online=1  Total     Grand Total
0             0               769       1152      1921      0         11        11        1932
0             1               69        100       169       2         29        31        200
0 Total                       838       1252      2090      2         40        42        2132
1             0               318       363       681       3         98        101       782
1             1               30        0         30        6         50        56        86
1 Total                       348       363       711       9         148       157       868
Grand Total                   1186      1615      2801      11        188       199       3000
Note the empty cell: no customer with CD account=0, Online=1, CreditCard=1 accepted the loan.

A practical solution: From Bayes to Naïve Bayes
- Substitute P(CC=1, Online=1 | accept) with P(CC=1 | accept) x P(Online=1 | accept).
- This means that we are assuming independence between CC and Online!
- If the dependence is not extreme, it will work reasonably well.

Example with 2 predictors: CC, Online
In the exact Bayes formula, each joint term P(CC=1, Online=1 | accept) is replaced by the product P(CC=1 | accept) x P(Online=1 | accept):
$P(\text{accept}=1 \mid CC=1, Online=1) \approx \frac{P(CC=1 \mid \text{accept}=1)\, P(Online=1 \mid \text{accept}=1)\, P(\text{accept}=1)}{P(CC=1 \mid \text{accept}=1)\, P(Online=1 \mid \text{accept}=1)\, P(\text{accept}=1) + P(CC=1 \mid \text{accept}=0)\, P(Online=1 \mid \text{accept}=0)\, P(\text{accept}=0)}$
The counts come from the same CreditCard by Online pivot table used for the exact calculation.

Naïve Bayes for CC, Online
Count of Personal Loan by CreditCard
Personal Loan   CreditCard=0   CreditCard=1   Grand Total
0               1932           782            2714
1               200            86             286
Grand Total     2132           868            3000

Count of Personal Loan by Online
Personal Loan   Online=0   Online=1   Grand Total
0               1090       1624       2714
1               107        179        286
Grand Total     1197       1803       3000

$P(\text{accept}=1 \mid CC=1, Online=1) \approx \frac{\frac{86}{286} \cdot \frac{179}{286} \cdot \frac{286}{3000}}{\frac{86}{286} \cdot \frac{179}{286} \cdot \frac{286}{3000} + \frac{782}{2714} \cdot \frac{1624}{2714} \cdot \frac{2714}{3000}} \approx 0.103$

Naïve Bayes in XLMiner
Classification > Naïve Bayes
Prior class probabilities (according to relative occurrences in training data)
Class   Prob.
1       0.095333333   <-- Success Class
0       0.904666667

Conditional probabilities
                          Class 1               Class 0
Input Variable   Value    Prob                  Prob
Online           0        0.374125874           0.401621223
Online           1        0.625874126           0.598378777
CreditCard       0        0.699300699           0.711864407
CreditCard       1        0.300699301           0.288135593
For example, P(CC=1 | accept=1) = 86/286.

Naïve Bayes in XLMiner: Scoring the validation data
XLMiner : Naive Bayes - Classification of Validation Data
Data range: ['UniversalBank KNN NBayes.xls']'Data_Partition1'!$C$3019:$O$5018
Cut off Prob.Val. for Success (Updatable): 0.5
Row Id.   Predicted Class   Actual Class   Prob. for 1 (success)   Online   CreditCard
2         0                 0              0.08795125              0        0
3         0                 0              0.08795125              0        0
7         0                 0              0.097697987             1        0
8         0                 0              0.092925663             0        1
11        0                 0              0.08795125              0        0
13        0                 0              0.08795125              0        0
14        0                 0              0.097697987             1        0
15        0                 0              0.08795125              0        0
16        0                 0              0.10316131              1        1

Advantages and Disadvantages
The good
- Simple
- Can handle a large number of predictors
- High performance accuracy when the goal is ranking
- Pretty robust to the independence assumption!
The bad
- Requires large amounts of data
- Need to categorize continuous predictors
- Predictors with "rare" categories -> zero probability (if this category is important, this is a problem)
- Gives biased probability of class membership
- No insight about the importance/role of each predictor
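
The exact Bayes and naive Bayes calculations for the CreditCard and Online predictors can be reproduced from the training counts. The sketch below is a minimal illustration rather than the XLMiner implementation: the function names and the dictionary layout are hypothetical, and the counts are taken from the CreditCard by Online pivot table in the slides.

```python
# Training counts from the slides (3000 records).
# Outer key: Personal Loan class; inner key: (CreditCard, Online).
counts = {
    1: {(0, 0): 71, (0, 1): 129, (1, 0): 36, (1, 1): 50},       # accepted
    0: {(0, 0): 769, (0, 1): 1163, (1, 0): 321, (1, 1): 461},   # did not accept
}


def class_total(c):
    """Number of training records in class c."""
    return sum(counts[c].values())


n_total = sum(class_total(c) for c in counts)


def prior(c):
    """P(accept = c), estimated by relative frequency."""
    return class_total(c) / n_total


def joint_likelihood(cc, online, c):
    """P(CC = cc, Online = online | accept = c), from the full cross-table."""
    return counts[c][(cc, online)] / class_total(c)


def marginal_likelihood(position, value, c):
    """P(one predictor = value | accept = c); position 0 is CC, 1 is Online."""
    matching = sum(n for key, n in counts[c].items() if key[position] == value)
    return matching / class_total(c)


def exact_posterior(cc, online, target=1):
    """Exact Bayes: uses the joint distribution of both predictors."""
    scores = {c: joint_likelihood(cc, online, c) * prior(c) for c in counts}
    return scores[target] / sum(scores.values())


def naive_posterior(cc, online, target=1):
    """Naive Bayes: treats CC and Online as independent within each class."""
    scores = {
        c: marginal_likelihood(0, cc, c) * marginal_likelihood(1, online, c) * prior(c)
        for c in counts
    }
    return scores[target] / sum(scores.values())


exact = exact_posterior(1, 1)
naive = naive_posterior(1, 1)
print(f"exact Bayes P(accept=1 | CC=1, Online=1) = {exact:.4f}")
print(f"naive Bayes P(accept=1 | CC=1, Online=1) = {naive:.4f}")
```

The two estimates are close for this pair of predictors. The naive version needs only the per-predictor counts within each class, so it still produces an estimate when a particular combination of predictor values never occurs in the training data.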