Lecture 8:
− Maximum Likelihood Estimation (MLE) (cont’d.)
− Maximum a posteriori (MAP) estimation
− Naïve Bayes Classifier
Aykut Erdem
March 2016
Hacettepe University

(Slides by Barnabás Póczos, Alex Smola, Aarti Singh, and Dan Jurafsky)
Last time… Flipping a Coin

I have a coin; if I flip it, what is the probability that it will land heads up?
Let us flip it a few times to estimate the probability:
The estimated probability is 3/5, the "frequency of heads".

Questions:
(1) Why frequency of heads?
(2) How good is this estimation?
(3) Why is this a machine learning problem?
We are going to answer these questions.
Question (1)
Why frequency of heads?
• The frequency of heads is exactly the maximum likelihood estimator for this problem.
• MLE has nice properties (interpretation, statistical guarantees, simplicity).
MLE for Bernoulli distribution

Data: D = a sequence of coin flips with αH heads and αT tails.
P(Heads) = θ, P(Tails) = 1 − θ
Flips are i.i.d.:
– Independent events
– Identically distributed according to a Bernoulli distribution
so P(D | θ) = θ^αH (1 − θ)^αT.

MLE: Choose θ that maximizes the probability of observed data.
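A minimal sketch (not from the slides) of evaluating this Bernoulli likelihood on the 3-heads, 2-tails data; the grid of candidate θ values is just for illustration.

```python
import numpy as np

# Observed coin flips: 1 = heads, 0 = tails (3 heads, 2 tails, as in the slides)
D = np.array([1, 1, 0, 1, 0])
alpha_H, alpha_T = D.sum(), len(D) - D.sum()

def bernoulli_likelihood(theta):
    """P(D | theta) = theta^alpha_H * (1 - theta)^alpha_T for i.i.d. flips."""
    return theta**alpha_H * (1.0 - theta)**alpha_T

for theta in [0.3, 0.5, 0.6, 0.7]:
    print(f"theta = {theta:.1f}  ->  P(D | theta) = {bernoulli_likelihood(theta):.4f}")
# The likelihood is largest near theta = 0.6 = 3/5, the frequency of heads.
```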
Maximum Likelihood Estimation

MLE: Choose θ that maximizes the probability of observed data:

θ̂ = argmax_θ P(D | θ) = argmax_θ θ^αH (1 − θ)^αT    (independent, identically distributed draws)

Maximizing the log-likelihood instead (the log is monotone):

θ̂ = argmax_θ [ αH ln θ + αT ln(1 − θ) ]

Setting the derivative to zero, αH/θ − αT/(1 − θ) = 0, gives

θ̂ = αH / (αH + αT)

That's exactly the "frequency of heads".
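As a quick sanity check (my own sketch, not part of the original slides), maximizing the Bernoulli log-likelihood numerically recovers αH / (αH + αT):

```python
import numpy as np

alpha_H, alpha_T = 3, 2  # 3 heads, 2 tails

# Log-likelihood: alpha_H * log(theta) + alpha_T * log(1 - theta)
thetas = np.linspace(0.001, 0.999, 9999)
log_lik = alpha_H * np.log(thetas) + alpha_T * np.log(1.0 - thetas)

theta_grid = thetas[np.argmax(log_lik)]
theta_closed_form = alpha_H / (alpha_H + alpha_T)
print(theta_grid, theta_closed_form)  # both ~0.6, the frequency of heads
```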
Question (2)
• How good is this MLE estimation?

How many flips do I need?
I flipped the coin 5 times: 3 heads, 2 tails.
What if I flipped 30 heads and 20 tails?
• Which estimator should we trust more?
• The more the merrier?
Simple bound

Let θ* be the true parameter. For n = αH + αT and for any ε > 0,
Hoeffding's inequality gives

P(|θ̂ − θ*| ≥ ε) ≤ 2 exp(−2nε²)
Probably Approximately Correct (PAC) Learning

I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 − δ = 0.95.
How many flips do I need?

From Hoeffding's inequality: 2 exp(−2nε²) ≤ δ

Sample complexity: n ≥ ln(2/δ) / (2ε²)
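A small sketch plugging the slide's numbers (ε = 0.1, δ = 0.05) into the sample-complexity bound above; other values can be substituted freely.

```python
import math

def sample_complexity(eps, delta):
    """Smallest n with 2*exp(-2*n*eps^2) <= delta, i.e. n >= ln(2/delta) / (2*eps^2)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps**2))

print(sample_complexity(eps=0.1, delta=0.05))   # -> 185 flips
print(sample_complexity(eps=0.01, delta=0.05))  # a 10x tighter eps needs ~100x more flips
```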
Question (3)
Why is this a machine learning problem?
• improve their performance (accuracy of the predicted probability)
• at some task (predicting the probability of heads)
• with experience (the more coins we flip, the better we are)
What about continuous features?

Let us try Gaussians…
[Figure: Gaussian densities with mean µ = 0 and variance σ²]
MLE for Gaussian mean and variance

Choose θ = (µ, σ²) that maximizes the probability of observed data
(independent, identically distributed draws):

µ̂ = (1/n) Σi xi        σ̂² = (1/n) Σi (xi − µ̂)²
MLE for Gaussian mean and variance

Note: the MLE for the variance of a Gaussian is biased
[the expected result of the estimation is not the true parameter!]

Unbiased variance estimator:  σ̂²_unbiased = (1/(n − 1)) Σi (xi − µ̂)²
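A small simulation (my own illustration, not from the slides) of this bias: with the true variance equal to 1, the divide-by-n MLE underestimates it on average, while the divide-by-(n − 1) estimator does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 5, 200_000
samples = rng.normal(loc=0.0, scale=1.0, size=(trials, n))  # true variance = 1

var_mle = samples.var(axis=1, ddof=0)       # divide by n (the MLE)
var_unbiased = samples.var(axis=1, ddof=1)  # divide by n - 1

print(var_mle.mean())       # ~0.8, i.e. (n-1)/n times the true variance
print(var_unbiased.mean())  # ~1.0
```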
What about prior knowledge?
(MAP Estimation)
What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?
The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[Figure: distribution over θ centered near 50-50 before data, and a sharper distribution after data]
Prior distribution

What prior? What distribution do we want for a prior?
• Represents expert knowledge (philosophical approach)
• Simple posterior form (engineer's approach)

Uninformative priors:
• Uniform distribution

Conjugate priors:
• Closed-form representation of posterior
• P(θ) and P(θ|D) have the same form
In order to proceed we will need: Bayes Rule
Chain Rule & Bayes Rule

Chain rule:  P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y)

Bayes rule:  P(X | Y) = P(Y | X) P(X) / P(Y)

Bayes rule is important for reverse conditioning.
Bayesian Learning

• Use Bayes rule:  P(θ | D) = P(D | θ) P(θ) / P(D)
• Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)
  (posterior ∝ likelihood × prior)
MAP estimation for Binomial distribution

Coin flip problem. The likelihood is Binomial:
P(D | θ) ∝ θ^αH (1 − θ)^αT

If the prior is a Beta distribution,
P(θ) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)

⇒ the posterior is a Beta distribution:
P(θ | D) ∝ θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)

P(θ) and P(θ | D) have the same form! [Conjugate prior]
Beta distribution

[Figure: Beta(α, β) densities for several values of α and β]
More concentrated as values of α, β increase
Beta conjugate prior

[Figure: posterior Beta densities as n = αH + αT increases]
As we get more samples, the effect of the prior is "washed out".
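A sketch of the Beta-Binomial update described above, with an illustrative Beta(5, 5) prior of my choosing; the MAP estimate is the mode of the Beta posterior and moves toward the MLE as the number of flips grows.

```python
def map_estimate(alpha_H, alpha_T, beta_H, beta_T):
    """Mode of the Beta(beta_H + alpha_H, beta_T + alpha_T) posterior."""
    return (alpha_H + beta_H - 1.0) / (alpha_H + alpha_T + beta_H + beta_T - 2.0)

beta_H = beta_T = 5  # illustrative prior, encoding a belief that the coin is close to 50-50

for n in [5, 50, 500]:
    alpha_H = round(0.6 * n)           # keep 60% heads at every sample size
    alpha_T = n - alpha_H
    mle = alpha_H / n
    print(n, mle, map_estimate(alpha_H, alpha_T, beta_H, beta_T))
# n=5:   MLE 0.6, MAP ~0.54  (the prior pulls the estimate toward 0.5)
# n=500: MLE 0.6, MAP ~0.60  (the prior is "washed out")
```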
Han Solo and Bayesian Priors

C3PO: Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!
Han: Never tell me the odds!
https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors
MLE vs. MAP

• Maximum Likelihood estimation (MLE):
  Choose the value that maximizes the probability of observed data
  θ̂ = argmax_θ P(D | θ)
• Maximum a posteriori (MAP) estimation:
  Choose the value that is most probable given observed data and prior belief
  θ̂ = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ)

When is MAP the same as MLE?
From Binomial to Multinomial

Example: dice roll problem (6 outcomes instead of 2)

The likelihood is ~ Multinomial(θ = {θ1, θ2, …, θk}):
P(D | θ) ∝ θ1^α1 θ2^α2 ⋯ θk^αk

If the prior is a Dirichlet distribution,
P(θ) ∝ θ1^(β1 − 1) ⋯ θk^(βk − 1)

then the posterior is a Dirichlet distribution:
P(θ | D) ∝ θ1^(α1 + β1 − 1) ⋯ θk^(αk + βk − 1)

For the Multinomial, the conjugate prior is the Dirichlet distribution.
http://en.wikipedia.org/wiki/Dirichlet_distribution
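A sketch of the Dirichlet-Multinomial update for the dice example, using made-up counts and an illustrative symmetric Dirichlet(2, …, 2) prior; the MAP estimate is the mode of the Dirichlet posterior.

```python
import numpy as np

counts = np.array([3, 5, 4, 6, 2, 10])  # illustrative rolls of each face (alpha_1..alpha_6)
prior = np.full(6, 2.0)                 # symmetric Dirichlet(2, ..., 2) prior (illustrative)

theta_mle = counts / counts.sum()
posterior = prior + counts                           # Dirichlet posterior parameters
theta_map = (posterior - 1) / (posterior - 1).sum()  # mode of the Dirichlet posterior

print(theta_mle.round(3))  # face 6 looks very loaded
print(theta_map.round(3))  # pulled toward the uniform 1/6 by the prior
```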
Bayesians vs. Frequentists

The critique of the frequentist: "You are no good when the sample is small."
The critique of the Bayesian: "You give a different answer for different priors."
Recap: What about prior knowledge?
(MAP Estimation)
Recap: What about prior knowledge?

We know the coin is "close" to 50-50. What can we do now?
The Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
[Figure: distribution over θ centered near 50-50 before data, and a sharper distribution after data]
Recap: Chain Rule & Bayes Rule

Chain rule:  P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y)
Bayes rule:  P(X | Y) = P(Y | X) P(X) / P(Y)
Recap: Bayesian Learning

D is the measured data; our goal is to estimate parameter θ.
• Use Bayes rule:  P(θ | D) = P(D | θ) P(θ) / P(D)
• Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)
  (posterior ∝ likelihood × prior)
Recap: MAP estimation for Binomial distribution

In the coin flip problem the likelihood is Binomial:
P(D | θ) ∝ θ^αH (1 − θ)^αT

If the prior is Beta:
P(θ) = θ^(βH − 1) (1 − θ)^(βT − 1) / B(βH, βT)

⇒ then the posterior is a Beta distribution:
P(θ | D) ∝ θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)

Beta function:  B(x, y) = ∫₀¹ t^(x−1) (1 − t)^(y−1) dt
Recap: Beta conjugate prior

[Figure: posterior Beta densities as n = αH + αT increases]
As we get more samples, the effect of the prior is "washed out".
Application of Bayes Rule
AIDS test (Bayes rule)

Data:
• Approximately 0.1% are infected
• The test detects all infections
• The test reports positive for 1% of healthy people

Probability of having AIDS if the test is positive:
P(A | +) = P(+ | A) P(A) / [ P(+ | A) P(A) + P(+ | ¬A) P(¬A) ]
         = (1 · 0.001) / (1 · 0.001 + 0.01 · 0.999) ≈ 0.09

Only 9%!…
Improving the diagnosis

Use a weaker follow-up test!
• Approximately 0.1% are infected
• Test 2 reports positive for 90% of infections
• Test 2 reports positive for 5% of healthy people

Probability of having AIDS if both tests are positive:
P(A | +1, +2) ≈ 0.64

64%!…
Improving the diagnosis

Why can't we use Test 1 twice?
• The outcomes are not independent,
• but Tests 1 and 2 are conditionally independent (by assumption):
  P(+1, +2 | A) = P(+1 | A) P(+2 | A)  and  P(+1, +2 | ¬A) = P(+1 | ¬A) P(+2 | ¬A)
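A short sketch reproducing the two posteriors above with Bayes rule; the conditional independence of the two tests is exactly the assumption stated on this slide.

```python
p_infected = 0.001           # approximately 0.1% are infected
p_pos1_given_inf = 1.0       # test 1 detects all infections
p_pos1_given_healthy = 0.01  # test 1: 1% false positives
p_pos2_given_inf = 0.9       # test 2 detects 90% of infections
p_pos2_given_healthy = 0.05  # test 2: 5% false positives

def posterior(prior, p_pos_given_inf, p_pos_given_healthy):
    """Bayes rule: P(infected | positive)."""
    num = p_pos_given_inf * prior
    return num / (num + p_pos_given_healthy * (1.0 - prior))

after_test1 = posterior(p_infected, p_pos1_given_inf, p_pos1_given_healthy)
print(after_test1)  # ~0.09: only 9% despite the positive test

# Tests 1 and 2 are conditionally independent given infection status,
# so test 1's posterior can serve as test 2's prior.
after_both = posterior(after_test1, p_pos2_given_inf, p_pos2_given_healthy)
print(after_both)   # ~0.64
```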
The Naïve Bayes Classifier
Data for spam filtering

• date
• time
• recipient path
• IP number
• sender
• encoding
• many more features

Example raw email headers (abridged):

Delivered-To: alex.smola@gmail.com
Received: by 10.216.47.73 with SMTP id s51cs361171web; Tue, 3 Jan 2012 14:17:53 -0800 (PST)
Return-Path: <alex+caf_=alex.smola=gmail.com@smola.org>
Received-SPF: neutral (google.com: 209.85.215.175 is neither permitted nor denied by best guess record for domain of alex+caf_=alex.smola=gmail.com@smola.org) client-ip=209.85.215.175;
X-Forwarded-To: alex.smola@gmail.com
X-Forwarded-For: alex@smola.org alex.smola@gmail.com
Delivered-To: alex@smola.org
Return-Path: <althoff.tim@googlemail.com>
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=googlemail.com; s=gamma;
MIME-Version: 1.0
Sender: althoff.tim@googlemail.com
Date: Tue, 3 Jan 2012 14:17:47 -0800
Message-ID: <CAFJJHDGPBW+SdZg0MdAABiAKydDk9tpeMoDijYGjoGO-WC7osg@mail.gmail.com>
Subject: CS 281B. Advanced Topics in Learning and Decision Making
From: Tim Althoff <althoff@eecs.berkeley.edu>
To: alex@smola.org
Content-Type: multipart/alternative; boundary=f46d043c7af4b07e8d04b5a7113a
Naïve Bayes Assumption

Naïve Bayes assumption: features X1 and X2 are conditionally independent given the class label Y:
P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

More generally:
P(X1, …, Xd | Y) = ∏i P(Xi | Y)
Naïve Bayes Assumption, Example

Task: Predict whether or not a picnic spot is enjoyable.
Training data: X = (X1, X2, X3, …, Xd) with label Y, n rows.
Naïve Bayes assumption: P(X1, …, Xd | Y) = ∏i P(Xi | Y)

How many parameters to estimate?
(X is composed of d binary features, Y has K possible class labels)
(2^d − 1)K without the assumption vs. (2 − 1)dK with it
Naïve Bayes Classifier

Given:
– Class prior P(Y)
– d conditionally independent features X1, …, Xd given the class label Y
– For each feature Xi, the conditional likelihood P(Xi | Y)

Naïve Bayes decision rule:
ŷ = argmax_y P(y) ∏i P(xi | y)
Naïve Bayes Algorithm for discrete features

Training data: n d-dimensional discrete feature vectors + K class labels
We need to estimate these probabilities!
Estimate them with MLE (relative frequencies)!
Naïve Bayes Algorithm for discrete features

We need to estimate these probabilities!

Estimators:
For the class prior:  P̂(Y = b) = #{j : yj = b} / n
For the likelihood:  P̂(Xi = a | Y = b) = #{j : xji = a, yj = b} / #{j : yj = b}

NB prediction for test data x:  ŷ = argmax_b P̂(Y = b) ∏i P̂(Xi = xi | Y = b)
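A minimal discrete Naïve Bayes sketch using the relative-frequency (MLE) estimators above; the tiny dataset is made up for illustration. Note that it happily produces zero probabilities for unseen feature values, which is the subtlety addressed next.

```python
from collections import Counter, defaultdict

# Toy training data: each row is (features x = (x1, ..., xd), class label y). Made-up values.
data = [((1, 0), "a"), ((1, 1), "a"), ((0, 0), "b"), ((0, 1), "b"), ((1, 0), "a")]

class_counts = Counter(y for _, y in data)
feat_counts = defaultdict(Counter)  # feat_counts[(i, y)][value] = #{j: x_ji = value, y_j = y}
for x, y in data:
    for i, value in enumerate(x):
        feat_counts[(i, y)][value] += 1

def predict(x):
    best_class, best_score = None, -1.0
    for y, n_y in class_counts.items():
        score = n_y / len(data)                        # class prior P(Y = y)
        for i, value in enumerate(x):
            score *= feat_counts[(i, y)][value] / n_y  # MLE of P(X_i = value | Y = y)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

print(predict((1, 1)))  # -> "a"
```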
Subtlety: Insufficient training data

For example, if some feature value a never occurs together with class b in the training data,
then P̂(Xi = a | Y = b) = 0, and the whole product P̂(Y = b) ∏i P̂(Xi = xi | Y = b) is zero
no matter what the other features say.
What now???
Naïve Bayes Alg – Discrete features

Training data: use your expert knowledge & apply prior distributions:
add m "virtual" examples. Same as assuming conjugate priors.

Assume priors; the MAP estimate (with m virtual examples per value of Xi) is
P̂(Xi = a | Y = b) = ( #{j : xji = a, yj = b} + m ) / ( #{j : yj = b} + m · |values of Xi| )
(the denominator adds the number of "virtual" examples with Y = b)

This is called Laplace smoothing (m = 1 is the common special case).
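A tiny sketch of the smoothed estimator above; m = 1 gives the add-one rule used in the text-classification example later.

```python
def smoothed_likelihood(count_value_and_class, count_class, num_values, m=1):
    """MAP estimate of P(X_i = a | Y = b) with m virtual examples per feature value."""
    return (count_value_and_class + m) / (count_class + m * num_values)

# A value never seen with this class still gets a small nonzero probability:
print(smoothed_likelihood(0, count_class=10, num_values=2))      # 1/12 instead of 0
# The MLE is recovered as m -> 0:
print(smoothed_likelihood(4, count_class=10, num_values=2, m=0))  # 0.4
```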
Case Study: Text Classification

Is this spam?
[Figure: example spam email]
Positive or negative movie review?

• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?

MEDLINE Article → MeSH Subject Category Hierarchy:
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
Text Classification

• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition

• Input:
  - a document d
  - a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Hand-coded rules

• Rules based on combinations of words or other features
  - spam: black-list-address OR ("dollars" AND "have been selected")
• Accuracy can be high
  - if the rules are carefully refined by an expert
• But building and maintaining these rules is expensive
Text Classification and Naive Bayes

• Classify emails: Y = {Spam, NotSpam}
• Classify news articles: Y = {what is the topic of the article?}

What are the features X? The text!
Let Xi represent the ith word in the document.
NB for Text Classification

A problem: the support of P(X|Y) is huge!
– An article has at least 1000 words, X = {X1, …, X1000}
– Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary,
  e.g., Webster's Dictionary (or more): Xi ∈ {1, …, 50,000}

⇒ K(50000^1000 − 1) parameters to estimate without the NB assumption…
NB for Text Classification

Xi ∈ {1, …, 50,000}  ⇒  K(50000^1000 − 1) parameters to estimate…
The NB assumption helps a lot!!!
If P(Xi = xi | Y = y) is the probability of observing word xi at the ith position in a document on topic y,
⇒ 1000·K·(50000 − 1) parameters to estimate with the NB assumption.
The NB assumption helps, but that is still a lot of parameters to estimate.
Bag of words model

Typical additional assumption: position in the document doesn't matter:
P(Xi = xi | Y = y) = P(Xk = xi | Y = y)
– "Bag of words" model: the order of words on the page is ignored
– The document is just a bag of words: i.i.d. words
Sounds really silly, but often works very well!

⇒ K(50000 − 1) parameters to estimate

The probability of a document with words x1, x2, …:
P(x1, …, xd | y) = ∏i P(xi | y)
The bag of words representation

γ( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the
adventure scenes are fun… It manages to be whimsical and romantic while laughing at the
conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it
several times, and I'm always happy to see it again whenever I have a friend who hasn't seen
it yet." ) = c
The bag of words representation: using a subset of words

γ( "x love xxxxxxxxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxxx great xxxxxxx
xxxxxxxxxxxxxxxxxxx fun xxxx xxxxxxxxxxxxx whimsical xxxx romantic xxxx laughing
xxxxxxxxxxxxx recommend xxxxx x several xxxxxxxxxxxxxxxxx xxxxx happy xxxxxxxxx again
xxxxxxxxxxxxxxxx" ) = c
The bag of words representation

γ( great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, … ) = c
A worked example (add-1 smoothing)

P̂(c) = Nc / N
P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)

Training:
Doc  Words                                  Class
1    Chinese Beijing Chinese                c
2    Chinese Chinese Shanghai               c
3    Chinese Macao                          c
4    Tokyo Japan Chinese                    j
Test:
5    Chinese Chinese Chinese Tokyo Japan    ?

Priors:
P(c) = 3/4,  P(j) = 1/4

Conditional probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
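A short sketch reproducing the worked example above (add-one smoothing over the 6-word vocabulary):

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "j")]
test_doc = "Chinese Chinese Chinese Tokyo Japan"

docs_per_class = Counter(c for _, c in train)
word_counts = {c: Counter() for c in docs_per_class}
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}  # |V| = 6

def score(c):
    s = docs_per_class[c] / len(train)                       # P̂(c) = N_c / N
    total = sum(word_counts[c].values())
    for w in test_doc.split():
        s *= (word_counts[c][w] + 1) / (total + len(vocab))  # add-1 smoothed P̂(w | c)
    return s

for c in ("c", "j"):
    print(c, score(c))   # c: ~0.0003, j: ~0.0001  -> predict class c
```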
What if features are continuous?

E.g., character recognition: Xi is the intensity at the ith pixel.

Gaussian Naïve Bayes (GNB):
P(Xi = x | Y = yk) = 1 / (σik √(2π)) · exp( −(x − µik)² / (2σik²) )

Different mean and variance for each class k and each pixel i.

Sometimes we assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ).
Estimating parameters: Y discrete, Xi continuous

Maximum likelihood estimates:

µ̂ik = ( 1 / #{j : yj = k} ) Σ_{j : yj = k} Xij
σ̂²ik = ( 1 / #{j : yj = k} ) Σ_{j : yj = k} (Xij − µ̂ik)²

where Xij is the ith pixel in the jth training image and k indexes the class.
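A minimal Gaussian Naïve Bayes sketch along these lines, with a per-class, per-feature mean and variance and a made-up toy dataset; prediction uses the Gaussian log-density plus the log class prior.

```python
import math
from collections import defaultdict

# Made-up training data: (continuous features, class label)
data = [((1.0, 2.1), "k1"), ((1.2, 1.9), "k1"), ((0.9, 2.0), "k1"),
        ((3.0, 0.2), "k2"), ((2.8, 0.1), "k2"), ((3.1, 0.3), "k2")]

by_class = defaultdict(list)
for x, y in data:
    by_class[y].append(x)

stats = {}  # stats[(class, i)] = (mean, variance) for feature i
for y, rows in by_class.items():
    for i in range(len(rows[0])):
        vals = [r[i] for r in rows]
        mu = sum(vals) / len(vals)
        var = sum((v - mu) ** 2 for v in vals) / len(vals)  # MLE (divide by n_k)
        stats[(y, i)] = (mu, var)

def log_gaussian(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(x):
    def log_post(y):
        lp = math.log(len(by_class[y]) / len(data))  # log class prior
        return lp + sum(log_gaussian(x[i], *stats[(y, i)]) for i in range(len(x)))
    return max(by_class, key=log_post)

print(predict((1.1, 2.0)))  # -> "k1"
```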
Twenty newsgroups results

[Figure: classification results on the 20 Newsgroups dataset]
Naïve Bayes: 89% accuracy
Case Study:
Classifying Mental States
Example: GNB for classifying mental states

• ~1 mm resolution
• ~2 images per sec.
• 15,000 voxels/image
• non-invasive, safe
• measures Blood Oxygen Level Dependent (BOLD) response
[Mitchell et al.]
Brain scans can track activation with precision and sensitivity
[Mitchell et al.]
Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)

Pairwise classification accuracy: 78-99%, 12 participants
[Figure: learned mean activations for "Tool words" and "Building words"]
[Mitchell et al.]
What you should know…

Naïve Bayes classifier
• What's the assumption
• Why we use it
• How do we learn it
• Why is Bayesian (MAP) estimation important

Text classification
• Bag of words model

Gaussian NB
• Features are still conditionally independent
• Each feature has a Gaussian distribution given the class