Lecture 8: Maximum Likelihood Estimation (MLE) (cont'd.) - Maximum a Posteriori (MAP) Estimation - Naïve Bayes Classifier
Aykut Erdem, Hacettepe University, March 2016
(Slides adapted from Barnabás Póczos, Alex Smola, Aarti Singh, and Dan Jurafsky.)

Last time: Flipping a Coin
I have a coin; if I flip it, what is the probability that it lands heads up?
Let us flip it a few times to estimate the probability.
With 3 heads in 5 flips, the estimated probability is 3/5, the "frequency of heads".

Questions:
(1) Why the frequency of heads?
(2) How good is this estimate?
(3) Why is this a machine learning problem?
We are going to answer these questions.

Question (1): Why the frequency of heads?
• The frequency of heads is exactly the maximum likelihood estimator for this problem.
• MLE has nice properties: a clear interpretation, statistical guarantees, and simplicity.

MLE for the Bernoulli distribution
Data: D = {x_1, ..., x_n}, a sequence of coin flip outcomes, with P(Heads) = θ and P(Tails) = 1 - θ.
The flips are i.i.d.:
– independent events,
– identically distributed according to a Bernoulli distribution.
MLE: choose the θ that maximizes the probability of the observed data,
    θ_MLE = argmax_θ P(D | θ).

Maximum Likelihood Estimation
Because the draws are independent and identically distributed,
    P(D | θ) = ∏_{i=1..n} P(x_i | θ) = θ^{α_H} (1 - θ)^{α_T},
where α_H is the number of heads and α_T the number of tails. Maximizing the log-likelihood
    log P(D | θ) = α_H log θ + α_T log(1 - θ)
by setting its derivative to zero,
    α_H / θ - α_T / (1 - θ) = 0,
gives
    θ_MLE = α_H / (α_H + α_T).
That is exactly the "frequency of heads".
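To make the derivation concrete, here is a minimal Python sketch (not part of the original slides; the flip sequence is a hypothetical example) that computes the Bernoulli MLE as the frequency of heads:

```python
# Minimal sketch: the MLE of a Bernoulli parameter is the frequency of heads.
# The flips list below is a hypothetical example, not data from the lecture.
flips = [1, 1, 0, 1, 0]  # 1 = heads, 0 = tails (3 heads, 2 tails)

alpha_H = sum(flips)            # number of heads
alpha_T = len(flips) - alpha_H  # number of tails
theta_mle = alpha_H / (alpha_H + alpha_T)

print(theta_mle)  # 0.6, i.e. 3/5, the frequency of heads
```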
Question (2): How good is this MLE estimate?

How many flips do I need?
I flipped the coin 5 times: 3 heads, 2 tails.
What if I had flipped 30 heads and 20 tails?
• Which estimator should we trust more?
• The more the merrier?

A simple bound
Let θ* be the true parameter and let n = α_H + α_T be the number of flips. For any ε > 0, Hoeffding's inequality gives
    P(|θ_MLE - θ*| ≥ ε) ≤ 2 exp(-2 n ε²).

Probably Approximately Correct (PAC) learning
I want to know the coin parameter θ to within error ε = 0.1, with probability at least 1 - δ = 0.95.
How many flips do I need?
Requiring 2 exp(-2 n ε²) ≤ δ and solving for n gives the sample complexity
    n ≥ ln(2/δ) / (2 ε²).
For ε = 0.1 and δ = 0.05 this means n ≥ ln(40) / 0.02 ≈ 185 flips. A quick numerical check of this bound follows.
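A minimal simulation (not from the slides; the true parameter 0.5 and the number of simulated experiments are arbitrary choices) that checks the Hoeffding-based sample complexity empirically:

```python
# Minimal sketch: with n flips chosen from the Hoeffding bound, check that
# |theta_MLE - theta*| <= eps holds with probability at least 1 - delta.
import math
import random

eps, delta = 0.1, 0.05
n = math.ceil(math.log(2 / delta) / (2 * eps ** 2))  # ~185 flips
theta_true = 0.5          # assumed true coin bias (hypothetical)
num_trials = 2000         # number of simulated experiments

random.seed(0)
hits = 0
for _ in range(num_trials):
    heads = sum(random.random() < theta_true for _ in range(n))
    theta_mle = heads / n
    hits += abs(theta_mle - theta_true) <= eps

print(n, hits / num_trials)  # empirical success rate should exceed 0.95
```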
Question (3): Why is this a machine learning problem?
• We improve our performance (the accuracy of the predicted probability)
• at some task (predicting the probability of heads)
• with experience (the more coins we flip, the better we get).

What about continuous features?
Let us try Gaussians: N(x | µ, σ²), a bell-shaped density with mean µ and variance σ².

MLE for the Gaussian mean and variance
Choose θ = (µ, σ²) that maximizes the probability of the observed data. With independent, identically distributed draws x_1, ..., x_n, maximizing the likelihood gives
    µ_MLE = (1/n) ∑_i x_i,
    σ²_MLE = (1/n) ∑_i (x_i - µ_MLE)².
Note: the MLE for the variance of a Gaussian is biased; the expected result of the estimation is not the true parameter.
Unbiased variance estimator:
    σ̂² = (1/(n-1)) ∑_i (x_i - µ_MLE)².

What about prior knowledge? (MAP estimation)
We know the coin is "close" to 50-50. What can we do now?
The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ, concentrated around 50-50 before seeing any data (the prior) and updated after seeing the data (the posterior).

Prior distribution
What distribution do we want as a prior?
• One that represents expert knowledge (the philosophical approach).
• One that gives a simple posterior form (the engineer's approach).
Uninformative priors:
• Uniform distribution.
Conjugate priors:
• Closed-form representation of the posterior.
• P(θ) and P(θ|D) have the same form.

In order to proceed we will need: Bayes Rule.

Chain Rule & Bayes Rule
Chain rule:
    P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y).
Bayes rule:
    P(X | Y) = P(Y | X) P(X) / P(Y).
Bayes rule is important for reverse conditioning.

Bayesian Learning
• Use Bayes rule:
    P(θ | D) = P(D | θ) P(θ) / P(D).
• Or equivalently:
    posterior ∝ likelihood × prior,  i.e.  P(θ | D) ∝ P(D | θ) P(θ).

MAP estimation for the Binomial distribution
Coin flip problem: the likelihood is Binomial,
    P(D | θ) ∝ θ^{α_H} (1 - θ)^{α_T}.
If the prior is a Beta distribution,
    P(θ) = θ^{β_H - 1} (1 - θ)^{β_T - 1} / B(β_H, β_T),
then the posterior is also a Beta distribution,
    P(θ | D) ∝ θ^{α_H + β_H - 1} (1 - θ)^{α_T + β_T - 1}.
P(θ) and P(θ | D) have the same form: the Beta is the conjugate prior of the Binomial.

Beta distribution
The Beta density becomes more concentrated as its parameter values increase.

Beta conjugate prior
As n = α_H + α_T increases, the data counts dominate the prior counts: as we get more samples, the effect of the prior is "washed out".

Han Solo and Bayesian priors
C-3PO: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!"
Han: "Never tell me the odds!"
https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors

MLE vs. MAP
• Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data,
    θ_MLE = argmax_θ P(D | θ).
• Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and the prior belief,
    θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ).
When is MAP the same as MLE? When the prior is uniform, so it does not affect the argmax.
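A minimal numerical sketch (not from the slides; the prior hyperparameters below are illustrative) of the Beta-Binomial update and the resulting MAP estimate compared with the MLE:

```python
# Minimal sketch: Beta prior + Binomial likelihood -> Beta posterior,
# then the MAP estimate vs. the MLE. Prior hyperparameters are illustrative.
alpha_H, alpha_T = 3, 2      # observed heads / tails (the 5-flip example)
beta_H, beta_T = 5, 5        # Beta prior counts encoding "close to 50-50"

# Posterior is Beta(alpha_H + beta_H, alpha_T + beta_T).
post_H = alpha_H + beta_H
post_T = alpha_T + beta_T

theta_mle = alpha_H / (alpha_H + alpha_T)          # 0.60
theta_map = (post_H - 1) / (post_H + post_T - 2)   # mode of the Beta posterior, ~0.54

print(theta_mle, theta_map)  # the prior pulls the estimate toward 0.5
```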
From Binomial to Multinomial
Example: the dice roll problem (6 outcomes instead of 2).
The likelihood is Multinomial(θ = {θ_1, θ_2, ..., θ_k}):
    P(D | θ) ∝ θ_1^{α_1} θ_2^{α_2} ... θ_k^{α_k},
where α_j is the number of times outcome j was observed.
If the prior is a Dirichlet distribution,
    P(θ) ∝ θ_1^{β_1 - 1} ... θ_k^{β_k - 1},
then the posterior is also a Dirichlet distribution,
    P(θ | D) ∝ θ_1^{α_1 + β_1 - 1} ... θ_k^{α_k + β_k - 1}.
For the Multinomial, the conjugate prior is the Dirichlet distribution.
http://en.wikipedia.org/wiki/Dirichlet_distribution
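A minimal sketch (not from the slides; the dice counts and the prior pseudo-counts are hypothetical) of the Dirichlet-Multinomial update for the dice roll problem:

```python
# Minimal sketch: Multinomial likelihood with a Dirichlet prior.
# Dice counts and prior pseudo-counts below are hypothetical.
counts = [3, 1, 0, 2, 0, 4]          # observed rolls of each face, alpha_j
prior = [2, 2, 2, 2, 2, 2]           # Dirichlet pseudo-counts, beta_j

# Posterior is Dirichlet(alpha_j + beta_j); its mode gives the MAP estimate.
post = [a + b for a, b in zip(counts, prior)]
k = len(post)
total = sum(post)

theta_mle = [c / sum(counts) for c in counts]        # can assign probability 0
theta_map = [(p - 1) / (total - k) for p in post]    # smoothed by the prior

print(theta_mle)
print(theta_map)
```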
Bayesians vs. Frequentists
The Bayesian to the frequentist: "You are no good when the sample is small."
The frequentist to the Bayesian: "You give a different answer for different priors."

Recap
• What about prior knowledge? (MAP estimation): we know the coin is "close" to 50-50. Rather than estimating a single θ, the Bayesian way maintains a distribution over possible values of θ, concentrated around 50-50 before seeing data and updated after seeing data.
• Chain rule: P(X, Y) = P(X) P(Y | X). Bayes rule: P(X | Y) = P(Y | X) P(X) / P(Y).
• Bayesian learning: D is the measured data and our goal is to estimate the parameter θ; by Bayes rule, posterior ∝ likelihood × prior, i.e. P(θ | D) ∝ P(D | θ) P(θ).
• MAP estimation for the Binomial: the likelihood is Binomial; if the prior is a Beta distribution, the posterior is also a Beta distribution (normalized by the Beta function).
• Beta conjugate prior: as n = α_H + α_T increases, the effect of the prior is "washed out".

Application of Bayes Rule

AIDS test (Bayes rule)
Data:
• Approximately 0.1% of the population is infected.
• The test detects all infections.
• The test reports positive for 1% of healthy people.
Probability of having AIDS if the test is positive:
    P(A | T) = P(T | A) P(A) / [P(T | A) P(A) + P(T | ¬A) P(¬A)]
             = 1 · 0.001 / (1 · 0.001 + 0.01 · 0.999) ≈ 0.09.
Only 9%!

Improving the diagnosis
Use a weaker follow-up test!
• Approximately 0.1% are infected.
• Test 2 reports positive for 90% of infections.
• Test 2 reports positive for 5% of healthy people.
After positive results on both tests:
    P(A | T1, T2) = 0.9 · 1 · 0.001 / (0.9 · 1 · 0.001 + 0.05 · 0.01 · 0.999) ≈ 0.64.
About 64%!

Why can't we use Test 1 twice?
Its outcomes are not independent: repeating the same test adds no new information. Tests 1 and 2, however, are conditionally independent given the infection status (by assumption):
    P(T1, T2 | A) = P(T1 | A) P(T2 | A).
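A short numerical check of the two posteriors above (a sketch; the variable names are my own):

```python
# Minimal sketch: Bayes rule for the AIDS test example.
p_a = 0.001                    # P(infected)
p_t1_a, p_t1_h = 1.0, 0.01     # Test 1: P(+ | infected), P(+ | healthy)
p_t2_a, p_t2_h = 0.9, 0.05     # Test 2: P(+ | infected), P(+ | healthy)

# Posterior after a positive Test 1.
post1 = p_t1_a * p_a / (p_t1_a * p_a + p_t1_h * (1 - p_a))

# Posterior after positive Test 1 and Test 2, assuming conditional independence.
num = p_t2_a * p_t1_a * p_a
den = num + p_t2_h * p_t1_h * (1 - p_a)
post12 = num / den

print(round(post1, 3), round(post12, 3))   # ~0.09 and ~0.64
```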
The Naïve Bayes Classifier

Data for spam filtering
Features available in an email:
• date
• time
• recipient path
• IP number
• sender
• encoding
• many more features
Example raw email headers (abridged):
    Delivered-To: alex.smola@gmail.com
    Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)
    Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org>
    ...
    From: Tim Althoff <althoff@eecs.berkeley.edu>
    To: alex@smola.org
    Subject: CS 281B. Advanced Topics in Learning and Decision Making
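A minimal sketch (not part of the slides) of turning such headers into features; it uses Python's standard email module on a small message string built from the excerpt above plus a hypothetical body:

```python
# Minimal sketch: extract simple spam-filtering features from raw email headers.
import email

raw = """Delivered-To: alex.smola@gmail.com
Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org>
From: Tim Althoff <althoff@eecs.berkeley.edu>
To: alex@smola.org
Subject: CS 281B. Advanced Topics in Learning and Decision Making

Body of the message goes here.
"""

msg = email.message_from_string(raw)
features = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "return_path_differs": msg["Return-Path"] != msg["From"],
}
print(features)
```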
Naïve Bayes Assumption
Naïve Bayes assumption: features X1 and X2 are conditionally independent given the class label Y:
    P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y).
More generally, for d features:
    P(X1, ..., Xd | Y) = ∏_{i=1..d} P(Xi | Y).

Naïve Bayes Assumption: Example
Task: predict whether or not a picnic spot is enjoyable.
Training data: n rows of X = (X1, X2, X3, ..., Xd) with label Y.
How many parameters do we need to estimate, if X is composed of d binary features and Y has K possible class labels?
• Without the Naïve Bayes assumption: (2^d - 1) K parameters.
• With the Naïve Bayes assumption: (2 - 1) d K = d K parameters.
A tiny numerical comparison is sketched below.
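A very small sketch (the values of d and K are arbitrary) comparing the two parameter counts:

```python
# Minimal sketch: parameter counts with and without the NB assumption,
# for d binary features and K class labels (d and K are arbitrary here).
d, K = 20, 2
full_joint = (2 ** d - 1) * K   # one full conditional table per class
naive_bayes = (2 - 1) * d * K   # one Bernoulli per feature per class
print(full_joint, naive_bayes)  # 2097150 vs 40
```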
Naïve Bayes Classifier
Given:
– the class prior P(Y),
– d conditionally independent features X1, ..., Xd given the class label Y,
– for each feature Xi, the conditional likelihood P(Xi | Y).
Naïve Bayes decision rule:
    y* = argmax_y P(y) ∏_{i=1..d} P(x_i | y).

Naïve Bayes Algorithm for discrete features
Training data: n d-dimensional discrete feature vectors with K class labels.
We need to estimate the probabilities P(Y = y) and P(Xi = a | Y = y).
Estimate them with MLE (relative frequencies)!
Estimators:
• Class prior: P̂(Y = b) = #{examples with Y = b} / n.
• Likelihood: P̂(Xi = a | Y = b) = #{examples with Xi = a and Y = b} / #{examples with Y = b}.
NB prediction for test data x = (x_1, ..., x_d):
    y* = argmax_y P̂(y) ∏_{i=1..d} P̂(x_i | y).

Subtlety: insufficient training data
For example, if some value a of feature Xi never occurs together with class b in the training set, then P̂(Xi = a | Y = b) = 0, and the whole product for class b becomes zero no matter what the other features say. What now?

Naïve Bayes Algorithm for discrete features (continued)
Use your expert knowledge and apply prior distributions: add m "virtual" examples to the counts.
This is the same as assuming conjugate priors, and the relative frequencies become MAP estimates. The common choice of one virtual example per feature value is called Laplace smoothing:
    P̂(Xi = a | Y = b) = (#{Xi = a and Y = b} + 1) / (#{Y = b} + |values of Xi|).
A compact implementation of this discrete Naïve Bayes with Laplace smoothing is sketched below.
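A minimal, self-contained sketch (not the lecture's code; the toy picnic-style dataset is hypothetical) of training and predicting with a discrete Naïve Bayes classifier using Laplace smoothing:

```python
# Minimal sketch: discrete Naive Bayes with Laplace (add-one) smoothing.
# The toy dataset below is hypothetical.
from collections import Counter, defaultdict
import math

# Each row: (features, label); features are discrete values.
data = [
    (("sunny", "warm"), "enjoy"),
    (("sunny", "cold"), "enjoy"),
    (("rainy", "cold"), "avoid"),
    (("rainy", "warm"), "avoid"),
    (("sunny", "warm"), "enjoy"),
]

labels = [y for _, y in data]
class_counts = Counter(labels)
n = len(data)
d = len(data[0][0])

# Per-feature conditional counts and the set of values each feature takes.
feat_counts = defaultdict(Counter)   # (feature index, class) -> value counts
values = [set() for _ in range(d)]
for x, y in data:
    for i, v in enumerate(x):
        feat_counts[(i, y)][v] += 1
        values[i].add(v)

def predict(x):
    best, best_score = None, -math.inf
    for y, cy in class_counts.items():
        score = math.log(cy / n)                 # log prior
        for i, v in enumerate(x):
            num = feat_counts[(i, y)][v] + 1     # Laplace smoothing
            den = cy + len(values[i])
            score += math.log(num / den)         # log likelihood
        if score > best_score:
            best, best_score = y, score
    return best

print(predict(("sunny", "cold")))   # -> "enjoy"
```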
Case Study: Text Classification

Is this spam?

Positive or negative movie review?
• "unbelievably disappointing"
• "Full of zany characters and richly applied satire, and some great plot twists"
• "this is the greatest screwball comedy ever filmed"
• "It was pathetic. The worst part about it was the boxing scenes."

What is the subject of this article?
Given a MEDLINE article, assign it a MeSH subject category from a hierarchy such as: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, ...

Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• ...

Text Classification: definition
• Input: a document d and a fixed set of classes C = {c1, c2, ..., cJ}.
• Output: a predicted class c ∈ C.

Hand-coded rules
• Rules based on combinations of words or other features,
  e.g. spam: black-list-address OR ("dollars" AND "have been selected").
• Accuracy can be high if the rules are carefully refined by an expert.
• But building and maintaining these rules is expensive.

Text Classification and Naive Bayes
• Classify emails: Y = {Spam, NotSpam}.
• Classify news articles: Y = {topic of the article}.
What are the features X? The text itself! Let Xi represent the ith word in the document.

NB for Text Classification
A problem: the support of P(X | Y) is huge!
– An article has at least 1000 words, X = {X1, ..., X1000}.
– Xi represents the ith word in the document, i.e. the domain of Xi is the entire vocabulary, e.g. Webster's Dictionary (or more): Xi ∈ {1, ..., 50,000}.
⇒ K(50,000^1000 - 1) parameters to estimate without the NB assumption.
The NB assumption helps a lot! If P(Xi = xi | Y = y) is the probability of observing word xi at the ith position in a document on topic y,
⇒ 1000 K (50,000 - 1) parameters to estimate with the NB assumption.
The NB assumption helps, but that is still a lot of parameters to estimate.

Bag of words model
Typical additional assumption: the position in the document doesn't matter,
    P(Xi = xi | Y = y) = P(Xk = xi | Y = y).
– This is the "bag of words" model: the order of words on the page is ignored.
– The document is just a bag of i.i.d. words.
– It sounds really silly, but it often works very well!
⇒ K(50,000 - 1) parameters to estimate.
The probability of a document with words x1, x2, ... is then
    P(x1, x2, ... | y) = ∏_i P(xi | y).

The bag of words representation
A movie review such as "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again..." is reduced to the counts of (a subset of) its words:
    great 2, love 2, recommend 1, laugh 1, happy 1, ...
and then classified: γ(document) = c.

Worked example (Laplace smoothing, |V| = 6)
Estimates:
    P̂(c) = N_c / N,    P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|).
Training set:
    Doc 1: "Chinese Beijing Chinese"          class c
    Doc 2: "Chinese Chinese Shanghai"         class c
    Doc 3: "Chinese Macao"                    class c
    Doc 4: "Tokyo Japan Chinese"              class j
Test set:
    Doc 5: "Chinese Chinese Chinese Tokyo Japan"   class ?
Priors: P(c) = 3/4, P(j) = 1/4.
Conditional probabilities:
    P(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
    P(Tokyo | c)   = (0 + 1) / (8 + 6) = 1/14
    P(Japan | c)   = (0 + 1) / (8 + 6) = 1/14
    P(Chinese | j) = (1 + 1) / (3 + 6) = 2/9
    P(Tokyo | j)   = (1 + 1) / (3 + 6) = 2/9
    P(Japan | j)   = (1 + 1) / (3 + 6) = 2/9
Choosing a class:
    P(c | d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
    P(j | d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
So the test document is assigned to class c. This computation is reproduced in the short sketch below.
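A short sketch (my own code, not from the slides) that reproduces the worked example's numbers:

```python
# Minimal sketch: reproduce the Chinese/Tokyo/Japan worked example.
from collections import Counter

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test = "Chinese Chinese Chinese Tokyo Japan"

docs_per_class = Counter(y for _, y in train)
word_counts = {y: Counter() for y in docs_per_class}
vocab = set()
for text, y in train:
    words = text.split()
    word_counts[y].update(words)
    vocab.update(words)

scores = {}
for y in docs_per_class:
    score = docs_per_class[y] / len(train)      # prior P(y)
    total = sum(word_counts[y].values())
    for w in test.split():
        score *= (word_counts[y][w] + 1) / (total + len(vocab))  # Laplace smoothing
    scores[y] = score

print(scores)                        # {'c': ~0.0003, 'j': ~0.0001}
print(max(scores, key=scores.get))   # 'c'
```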
What if features are continuous?
E.g., character recognition: Xi is the intensity at the ith pixel.

Gaussian Naïve Bayes (GNB):
    P(Xi = x | Y = k) = N(x | µ_{ik}, σ_{ik}²),
with a different mean and variance for each class k and each pixel i.
Sometimes we assume the variance
• is independent of Y (i.e., σ_i),
• or independent of Xi (i.e., σ_k),
• or both (i.e., σ).

Estimating parameters: Y discrete, Xi continuous
Maximum likelihood estimates, where x_i^j is the ith pixel in the jth training image and δ(Y^j = k) selects the images of class k:
    µ̂_{ik} = (1 / #{j : Y^j = k}) ∑_j x_i^j δ(Y^j = k),
    σ̂_{ik}² = (1 / #{j : Y^j = k}) ∑_j (x_i^j - µ̂_{ik})² δ(Y^j = k).

Twenty newsgroups results
Naïve Bayes achieves about 89% accuracy on the 20 Newsgroups text classification task.

Case Study: Classifying Mental States
fMRI data [Mitchell et al.]:
• ~1 mm resolution
• ~2 images per second
• 15,000 voxels per image
• non-invasive and safe
• measures the Blood Oxygen Level Dependent (BOLD) response
Brain scans can track activation with precision and sensitivity.

Learned Naïve Bayes models
– Means for P(BrainActivity | WordCategory), e.g. for tool words vs. building words [Mitchell et al.].
Pairwise classification accuracy: 78-99% across 12 participants.

What you should know...
Naïve Bayes classifier:
• What the assumption is
• Why we use it
• How we learn it
• Why Bayesian (MAP) estimation is important
Text classification:
• Bag of words model
Gaussian NB:
• Features are still conditionally independent
• Each feature has a Gaussian distribution given the class
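As a closing illustration of the Gaussian NB summarized above, a minimal sketch (my own code; the synthetic data is hypothetical and stands in for pixel intensities):

```python
# Minimal sketch: Gaussian Naive Bayes with per-class, per-feature mean/variance.
# Synthetic data stands in for pixel intensities; nothing here is from the lecture.
import math
import random

random.seed(0)
K, d, n_per_class = 2, 4, 50

# Synthetic training data: class k has mean k + 1 on every feature.
train = [([random.gauss(k + 1, 1.0) for _ in range(d)], k)
         for k in range(K) for _ in range(n_per_class)]

# MLE: mean and variance for each class k and each feature i.
mu = [[0.0] * d for _ in range(K)]
var = [[0.0] * d for _ in range(K)]
prior = [0.0] * K
for k in range(K):
    xs = [x for x, y in train if y == k]
    prior[k] = len(xs) / len(train)
    for i in range(d):
        col = [x[i] for x in xs]
        mu[k][i] = sum(col) / len(col)
        var[k][i] = sum((v - mu[k][i]) ** 2 for v in col) / len(col)  # biased MLE

def log_gauss(x, m, v):
    return -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)

def predict(x):
    scores = [math.log(prior[k]) + sum(log_gauss(x[i], mu[k][i], var[k][i])
                                       for i in range(d)) for k in range(K)]
    return scores.index(max(scores))

print(predict([1.0] * d), predict([2.0] * d))  # expected: 0 1
```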