Classification I
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

Overview
- K-Nearest Neighbor Algorithm
- Naïve Bayes Classifier
[Portrait: Thomas Bayes]

Classification

Definition
- Classification is one of the fundamental skills for survival (e.g., food vs. predator).
- It is a kind of supervised learning: techniques for deducing a function from <Input, Output> data.
  - Input: a vector of features.
  - Output: a Boolean value (binary classification) or an integer (multiclass classification).
- "Supervised" means that a teacher or oracle is needed to label each data sample.
- We will talk about unsupervised learning later.

Classifiers
[Scatter plot: students (Sam, Peter, Jack, Jane, Tom, Lisa, Helen, Mary) plotted by Height and Weight; a classifier Z = f(Height, Weight) assigns each point a label in {boy, girl}.]

Training a Classifier
[Figure]

Lazy Learners
[Images: truck vs. car]

Neighborhood

K-Nearest Neighbor Algorithm
The algorithm procedure:
- Given a set of n training data in the form of <x, y>.
- Given an unknown sample x'.
- Calculate the distances d(x', xi) for i = 1 … n.
- Select the K samples with the shortest distances.
- Assign x' the label that dominates these K samples.
It is the simplest classifier you will ever meet (I mean it!):
- No training (literally): a memory of the training data is maintained, and all computation is deferred until classification.
- Produces satisfactory results in many cases, so give it a go whenever possible.

Properties of KNN
- Instance-based learning: no explicit description of the target function.
- Can handle complicated situations.
- The same query point may receive different labels with K = 1 and K = 7: the result depends on the data distribution.
- Can make mistakes at boundaries.

Challenges of KNN
- The value of K: non-monotonic impact on accuracy; too big vs. too small; rules of thumb.
- Weights: different features may have different impacts.
- Distance: there are many different ways to measure the distance (Euclidean, Manhattan, …).
- Complexity: the distance between x' and all training data must be calculated, which is proportional to the size of the training set.

Distance Metrics
$$L_k(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^k \right)^{1/k}$$
$$L_2(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^2 \right)^{1/2} \quad \text{(Euclidean)}$$
$$L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i| \quad \text{(Manhattan)}$$
The shortest path between two points depends on the metric …

Mahalanobis Distance
- Distance from a point to a point set (a distribution with mean $\mu$ and covariance matrix $S$):
$$D_M(x) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}$$
- For an identity matrix S: $D_M(x) = \sqrt{(x - \mu)^T (x - \mu)}$, i.e., the Euclidean distance from the mean.
- For a diagonal matrix S: $D_M(x) = \sqrt{\sum_{i=1}^{n} \frac{(x_i - \mu_i)^2}{\sigma_i^2}}$.

Voronoi Diagram
- Each cell boundary lies on the perpendicular bisector between two neighboring training points.

Structured Data
[Figure]

KD-Tree
Point set: {(2,3), (5,4), (9,6), (4,7), (8,1), (7,2)}

```
function kdtree(list of points pointList, int depth)
{
    if pointList is empty
        return nil;
    else
    {
        // Select the axis based on depth so that the axis cycles through all valid values
        // (k is the number of dimensions of the points, e.g., k = 2 for the point set above)
        var int axis := depth mod k;

        // Sort the point list by that axis and choose the median as the pivot element
        select median by axis from pointList;

        // Create the node and construct the subtrees
        var tree_node node;
        node.location := median;
        node.leftChild := kdtree(points in pointList before median, depth + 1);
        node.rightChild := kdtree(points in pointList after median, depth + 1);
        return node;
    }
}
```

Evaluation
- Accuracy: recall what we learned in the first lecture (confusion matrix, ROC curve).
- Training set vs. test set.
- N-fold cross validation: the data is partitioned into N folds; each fold serves as the test set once.

LOOCV
- Leave-One-Out Cross Validation: an extreme case of N-fold cross validation with N = number of available samples.
- Usually very time consuming, but acceptable for KNN.
- Now, let's try KNN + LOOCV: all students in this class are given one of two kinds of labels.
  - Gender: Male vs. Female
  - Major: CS vs. EE vs. Automation
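As a concrete starting point for this exercise, here is a minimal sketch of KNN evaluated with LOOCV. It assumes NumPy; the helper names (knn_predict, loocv_accuracy) and the toy height/weight data are illustrative placeholders, not part of the lecture material.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote among its k nearest
    training samples (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every training sample
    nearest = np.argsort(dists)[:k]                      # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                     # the label that dominates the neighborhood

def loocv_accuracy(X, y, k=3):
    """Leave-one-out cross validation: classify each sample using all the others."""
    correct = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i                    # leave sample i out of the training set
        if knn_predict(X[mask], y[mask], X[i], k) == y[i]:
            correct += 1
    return correct / len(X)

# Hypothetical toy data: two features (e.g., height in cm, weight in kg), binary labels
X = np.array([[170, 60], [180, 75], [160, 50], [175, 70], [155, 45], [185, 80]], dtype=float)
y = np.array([1, 1, 0, 1, 0, 1])
print(loocv_accuracy(X, y, k=3))
```

Note that the loop retrains nothing: KNN simply recomputes distances against the remaining samples, which is why LOOCV is affordable here despite being expensive in general.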
Bayes Theorem
$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$$
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
In words:
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

Fish Example
- Salmon vs. Tuna.
- With no measurement, we can only compare the priors: P(ω1) = P(ω2) or P(ω1) > P(ω2).
- With additional information (a measurement x), use the posterior:
$$P(\omega_i \mid x) = \frac{P(x \mid \omega_i)\,P(\omega_i)}{P(x)}$$

Shooting Example
- Probability of kill: P(A) = 0.6, P(B) = 0.5.
- The target is killed after one shot from A and one shot from B.
- What is the probability that it was shot down by A?
- Let C denote the event that the target is killed (the denominator sums the three ways the target can be hit):
$$P(A \mid C) = \frac{P(C \mid A)\,P(A)}{P(C)} = \frac{1 \times 0.6}{0.6 \times 0.5 + 0.6 \times 0.5 + 0.4 \times 0.5} = \frac{0.6}{0.8} = \frac{3}{4}$$

Cancer Example
- ω1: Cancer; ω2: Normal.
- Priors: P(ω1) = 0.008, P(ω2) = 0.992.
- Lab test outcomes: + vs. −.
  - P(+|ω1) = 0.98, P(−|ω1) = 0.02
  - P(+|ω2) = 0.03, P(−|ω2) = 0.97
- Now someone has a positive test result… Is he/she doomed?
- Compare the unnormalized posteriors:
$$P(\omega_1 \mid +) \propto P(+ \mid \omega_1)\,P(\omega_1) = 0.98 \times 0.008 = 0.0078$$
$$P(\omega_2 \mid +) \propto P(+ \mid \omega_2)\,P(\omega_2) = 0.03 \times 0.992 = 0.0298$$
- So P(ω1|+) < P(ω2|+), and after normalization:
$$P(\omega_1 \mid +) = \frac{0.0078}{0.0078 + 0.0298} \approx 0.21$$

Headache & Flu Example
- H = "having a headache"; F = "coming down with flu".
- P(H) = 1/10; P(F) = 1/40; P(H|F) = 1/2.
- What does this mean? One day you wake up with a headache… Since 50% of flu cases are associated with headaches, you might conclude you have a 50-50 chance of coming down with flu!
- The truth is:
$$P(F \mid H) = \frac{P(H \mid F)\,P(F)}{P(H)} = \frac{1/2 \times 1/40}{1/10} = \frac{1}{8}$$
[Venn diagram: Flu vs. Headache]

Naïve Bayes Classifier
MAP: Maximum A Posteriori
$$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i \mid a_1, a_2, \ldots, a_n)
= \arg\max_{\omega_i} \frac{P(a_1, a_2, \ldots, a_n \mid \omega_i)\,P(\omega_i)}{P(a_1, a_2, \ldots, a_n)}
= \arg\max_{\omega_i} P(a_1, a_2, \ldots, a_n \mid \omega_i)\,P(\omega_i)$$
Assuming the attributes are conditionally independent given the class, this becomes the Naïve Bayes classifier:
$$\omega_{MAP} = \arg\max_{\omega_i} P(\omega_i) \prod_j P(a_j \mid \omega_i)$$

Independence
$$P(A \cap B) = P(A)\,P(B \mid A)$$
- If $P(B \mid A) = P(B)$, then A and B are independent and $P(A \cap B) = P(A)\,P(B)$.
- Conditional independence given G: $P(A, B \mid G) = P(A \mid G)\,P(B \mid G)$, or equivalently $P(A \mid G, B) = P(A \mid G)$.
- The equivalence follows from:
$$P(A, B \mid G) = \frac{P(A, B, G)}{P(G)} = \frac{P(A \mid B, G)\,P(B, G)}{P(G)} = P(A \mid B, G)\,P(B \mid G)$$

Conditional Independence
$$P(R \cap B \mid Y) = P(R \mid Y)\,P(B \mid Y)$$

Independent ≠ Uncorrelated
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E\left[(X - \mu_X)(Y - \mu_Y)\right]}{\sigma_X \sigma_Y}$$
- Let X take values in [−1, 1] and let Y = X². Then cov(X, Y) = 0, so X and Y are uncorrelated.
- However, Y is completely determined by X.

  X:   1     0.5    0.2    0     −0.2   −0.5   −1
  Y:   1     0.25   0.04   0      0.04   0.25   1

Estimating P(aj|ωi)
[Example table: five training samples with attributes a1, a2 (values + / −), a3 and class labels ω1 / ω2.]
- From the table: P(ω1) = 3/5, P(ω2) = 2/5; P(a2 = '+' | ω1) = 2/3, P(a2 = '−' | ω1) = 1/3.
- Laplace smoothing:
$$P(a_{jk} \mid \omega_i) = \frac{\left|a_j = a_{jk} \wedge \omega_i\right| + 1}{\left|\omega_i\right| + \left|a_j\right|}$$
where $|a_j|$ is the number of possible values of attribute $a_j$.
- How about continuous variables?

Tennis Example

Day    Outlook   Temperature  Humidity  Wind    Play Tennis
Day1   Sunny     Hot          High      Weak    No
Day2   Sunny     Hot          High      Strong  No
Day3   Overcast  Hot          High      Weak    Yes
Day4   Rain      Mild         High      Weak    Yes
Day5   Rain      Cool         Normal    Weak    Yes
Day6   Rain      Cool         Normal    Strong  No
Day7   Overcast  Cool         Normal    Strong  Yes
Day8   Sunny     Mild         High      Weak    No
Day9   Sunny     Cool         Normal    Weak    Yes
Day10  Rain      Mild         Normal    Weak    Yes
Day11  Sunny     Mild         Normal    Strong  Yes
Day12  Overcast  Mild         High      Strong  Yes
Day13  Overcast  Hot          Normal    Weak    Yes
Day14  Rain      Mild         High      Strong  No

Given: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong.
Predict: PlayTennis (yes or no).

Bayes solution:
$$P(\text{PlayTennis} = \text{yes}) = 9/14 \qquad P(\text{PlayTennis} = \text{no}) = 5/14$$
$$P(\text{Wind} = \text{strong} \mid \text{yes}) = 3/9 \qquad P(\text{Wind} = \text{strong} \mid \text{no}) = 3/5 \qquad \ldots$$
$$P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes}) = 0.0053$$
$$P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no}) = 0.0206$$
The conclusion is not to play tennis, with probability:
$$\frac{0.0206}{0.0206 + 0.0053} = 0.795$$
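A minimal sketch of the computation above, assuming the 14-day table is stored as plain Python lists; the function name naive_bayes_score is illustrative, and no Laplace smoothing is applied so that the scores match the 0.0053 and 0.0206 values on the slide.

```python
# The 14 training days: (Outlook, Temperature, Humidity, Wind) -> PlayTennis
data = [
    (("Sunny", "Hot", "High", "Weak"), "No"),     (("Sunny", "Hot", "High", "Strong"), "No"),
    (("Overcast", "Hot", "High", "Weak"), "Yes"), (("Rain", "Mild", "High", "Weak"), "Yes"),
    (("Rain", "Cool", "Normal", "Weak"), "Yes"),  (("Rain", "Cool", "Normal", "Strong"), "No"),
    (("Overcast", "Cool", "Normal", "Strong"), "Yes"), (("Sunny", "Mild", "High", "Weak"), "No"),
    (("Sunny", "Cool", "Normal", "Weak"), "Yes"), (("Rain", "Mild", "Normal", "Weak"), "Yes"),
    (("Sunny", "Mild", "Normal", "Strong"), "Yes"), (("Overcast", "Mild", "High", "Strong"), "Yes"),
    (("Overcast", "Hot", "Normal", "Weak"), "Yes"), (("Rain", "Mild", "High", "Strong"), "No"),
]

def naive_bayes_score(sample, label):
    """P(label) * prod_j P(a_j = sample_j | label), using raw frequency estimates."""
    rows = [x for x, y in data if y == label]
    score = len(rows) / len(data)                      # prior P(label)
    for j, value in enumerate(sample):
        match = sum(1 for x in rows if x[j] == value)  # count of this attribute value within the class
        score *= match / len(rows)                     # conditional P(a_j = value | label)
    return score

query = ("Sunny", "Cool", "High", "Strong")
scores = {label: naive_bayes_score(query, label) for label in ("Yes", "No")}
print(scores)                               # roughly {'Yes': 0.0053, 'No': 0.0206}
print(max(scores, key=scores.get))          # -> 'No'
print(scores["No"] / sum(scores.values()))  # roughly 0.795
```

With unseen attribute values the raw frequency estimate would collapse to zero, which is exactly where the Laplace-smoothed estimate from the previous slide would be substituted.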
Text Classification Example
- Interesting? Boring?
- Politics? Entertainment? Sports?

Text Representation
Each document is represented by the word occupying each position, plus a class label ω:

  a1      a2         a3      a4     …   an           ω
  Long    long       ago     there  …   king         1
  New     sanctions  will    be     …   Iran         0
  Hidden  Markov     models  are    …   method       0
  The     Federal    Court   today  …   investigate  0

We need to estimate probabilities such as $P(a_2 = \text{king} \mid \omega = 1)$. However, there are 2 × n × |Vocabulary| terms in total: for n = 100 word positions and a vocabulary of 50,000 distinct words, that adds up to 10 million terms!

Text Representation
By only considering the probability of encountering a specific word, instead of the specific word position, we can reduce the number of probabilities to be estimated: we only count the frequency of each word. Now only 2 × 50,000 = 100,000 terms need to be estimated.
$$P(V_k \mid \omega_i) = \frac{n_k + 1}{n + |\text{Vocabulary}|}$$
- n: the total number of word positions in all training samples whose target value is ωi.
- nk: the number of times the word Vk is found among these n positions.

Case Study
Newsgroups classification (Joachims, 1996):
- 20 newsgroups, 20,000 documents.
- Random guess: 5% accuracy; Naïve Bayes: 89% accuracy.
Recommendation (Lang, 1995):
- NewsWeeder: user-rated articles, interesting vs. uninteresting.
- Among the top 10% of articles selected by the system: 59% interesting, vs. 16% overall.

Reading Materials
- C. C. Aggarwal, A. Hinneburg and D. A. Keim, "On the Surprising Behavior of Distance Metrics in High Dimensional Space," Proc. of the 8th International Conference on Database Theory, LNCS 1973, pp. 420-434, London, UK, 2001.
- J. H. Friedman, J. L. Bentley, and R. A. Finkel, "An Algorithm for Finding Best Matches in Logarithmic Expected Time," ACM Transactions on Mathematical Software, 3(3):209-226, 1977.
- S. M. Omohundro, "Bumptrees for Efficient Function, Constraint, and Classification Learning," Advances in Neural Information Processing Systems 3, pp. 693-699, Morgan Kaufmann, 1991.
- Tom Mitchell, Machine Learning (Chapter 6), McGraw-Hill.
- Additional reading about the Naïve Bayes Classifier: http://www-2.cs.cmu.edu/~tom/NewChapters.html
- Software for text classification using the Naïve Bayes Classifier: http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

Review
- What is classification? What is supervised learning?
- What does KNN stand for?
- What are the major challenges of KNN? How can KNN be accelerated?
- What is N-fold cross validation? What does LOOCV stand for?
- What is Bayes Theorem?
- What is the key assumption in Naïve Bayes Classifiers?

Next Week's Class Talk
Volunteers are required for next week's class talk.
- Topic 1: Efficient KNN Implementations
  - Hints: Ball Trees, Metric Trees, R Trees
- Topic 2: Bayesian Belief Networks
- Length: 20 minutes plus question time