Artificial Intelligence:
Reasoning Under Uncertainty/Bayes
Nets
Bayesian Learning
Conditional Probability
• Probability of an event given the occurrence of
some other event.
P(X | Y) = P(X ∩ Y) / P(Y) = P(X, Y) / P(Y)
Example
You've been keeping track of the last 1000 emails you received. You find that
100 of them are spam. You also find that 200 of them were put in your junk
folder, of which 90 were spam.

Let X = "email is spam" and Y = "email is put in the junk folder".

What is the probability an email you receive is spam?
P(X) = 100 / 1000 = .1

What is the probability an email you receive is put in your junk folder?
P(Y) = 200 / 1000 = .2

Given that an email is in your junk folder, what is the probability it is spam?
P(X | Y) = P(X ∩ Y) / P(Y) = .09 / .2 = .45

Given that an email is spam, what is the probability it is in your junk folder?
P(Y | X) = P(X ∩ Y) / P(X) = .09 / .1 = .9
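A quick sanity check of these numbers in Python, computing each probability directly from the counts given above:

total = 1000            # emails tracked
spam = 100              # spam emails
junk = 200              # emails put in the junk folder
spam_and_junk = 90      # spam emails that are also in the junk folder

p_spam = spam / total                     # P(X) = 0.1
p_junk = junk / total                     # P(Y) = 0.2
p_spam_and_junk = spam_and_junk / total   # P(X ∩ Y) = 0.09

print(p_spam_and_junk / p_junk)   # P(X | Y) = 0.45
print(p_spam_and_junk / p_spam)   # P(Y | X) = 0.9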
Deriving Bayes Rule

P(X | Y) = P(X ∩ Y) / P(Y)

P(Y | X) = P(X ∩ Y) / P(X)

So P(X ∩ Y) = P(Y | X) P(X), and substituting gives Bayes rule:

P(X | Y) = P(Y | X) P(X) / P(Y)
General Application to Data
Models
• In machine learning we have a space H of
hypotheses:
h1 , h2 , ... , hn (possibly infinite)
• We also have a set D of data
• We want to calculate P(h | D)
• Bayes rule gives us:
P(h | D) = P(D | h) P(h) / P(D)
Terminology
– Prior probability of h:
• P(h): Probability that hypothesis h is true given our prior
knowledge
• If no prior knowledge, all h ∈ H are equally probable
– Posterior probability of h:
• P(h | D): Probability that hypothesis h is true, given the data D.
– Likelihood of D:
• P(D | h): Probability that we will see data D, given hypothesis h is
true.
– Marginal likelihood of D:
• P(D) = Σh P(D | h) P(h)
A Bayesian Approach to the
Monty Hall Problem
You are a contestant on a game show.
There are 3 doors, A, B, and C. There is a new car behind one of them
and goats behind the other two.
Monty Hall, the host, knows what is behind the doors. He asks you to
pick a door, any door. You pick door A.
Monty tells you he will open a door, different from A, that has a goat
behind it. He opens door B: behind it there is a goat.
Monty now gives you a choice: Stick with your original choice A or
switch to C.
Bayesian probability
formulation
Hypothesis space H:
h1 = Car is behind door A
h2 = Car is behind door B
h3 = Car is behind door C
Prior probability:
P(h1) = 1/3 P(h2) =1/3 P(h3) =1/3
Data D: After you picked door A,
Monty opened B to show a goat
Likelihood:
P(D | h1) = 1/2
P(D | h2) = 0
P(D | h3) = 1
What is P(h1 | D)?
What is P(h2 | D)?
What is P(h3 | D)?
Marginal likelihood:
P(D) = p(D|h1)p(h1) + p(D|h2)p(h2) +
p(D|h3)p(h3) = 1/6 + 0 + 1/3 = 1/2
By Bayes rule:

P(h1 | D) = P(D | h1) P(h1) / P(D) = (1/2)(1/3)(2) = 1/3

P(h2 | D) = P(D | h2) P(h2) / P(D) = (0)(1/3)(2) = 0

P(h3 | D) = P(D | h3) P(h3) / P(D) = (1)(1/3)(2) = 2/3
So you should switch!
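The posterior calculation can also be checked empirically. Below is a minimal simulation sketch in Python (the trial count and door labels are illustrative choices, not from the slides); the stick and switch strategies should win about 1/3 and 2/3 of the time, matching the posteriors computed above.

import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        doors = ["A", "B", "C"]
        car = random.choice(doors)
        pick = "A"   # you always pick door A, as in the slides
        # Monty opens a door that is neither your pick nor the car
        monty = random.choice([d for d in doors if d != pick and d != car])
        if switch:
            pick = next(d for d in doors if d != pick and d != monty)
        wins += (pick == car)
    return wins / trials

print("stick: ", play(switch=False))    # ≈ 1/3
print("switch:", play(switch=True))     # ≈ 2/3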
MAP (“maximum a posteriori”) Learning

Bayes rule:
P(h | D) = P(D | h) P(h) / P(D)

Goal of learning: Find the maximum a posteriori hypothesis hMAP:

hMAP = argmax_{h ∈ H} P(h | D)
     = argmax_{h ∈ H} P(D | h) P(h) / P(D)
     = argmax_{h ∈ H} P(D | h) P(h)

because P(D) is a constant independent of h.

Note: If every h ∈ H is equally probable, then

hMAP = argmax_{h ∈ H} P(D | h)

and hMAP is then called the “maximum likelihood hypothesis”.
A Medical Example
Toby takes a test for leukemia. The test has two
outcomes: positive and negative. It is known that if
the patient has leukemia, the test is positive 98% of
the time. If the patient does not have leukemia, the
test is positive 3% of the time. It is also known that
0.008 of the population has leukemia.
Toby’s test is positive.
Which is more likely: Toby has leukemia or Toby
does not have leukemia?
• Hypothesis space:
h1 = T. has leukemia
h2 = T. does not have leukemia
• Prior: 0.008 of the population has leukemia. Thus
P(h1) = 0.008
P(h2) = 0.992
• Likelihood:
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
• Posterior knowledge:
Blood test is + for this patient.
• In summary
P(h1) = 0.008, P(h2) = 0.992
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
• Thus:
hMAP = argmax_{h ∈ H} P(D | h) P(h)

P(+ | leukemia) P(leukemia) = (0.98)(0.008) = 0.0078
P(+ | ¬leukemia) P(¬leukemia) = (0.03)(0.992) = 0.0298

hMAP = ¬leukemia
• What is P(leukemia | +)?

P(h | D) = P(D | h) P(h) / P(D)

So,
P(leukemia | +) = 0.0078 / (0.0078 + 0.0298) = 0.21
P(¬leukemia | +) = 0.0298 / (0.0078 + 0.0298) = 0.79

These are called the “posterior” probabilities.
Bayesianism vs. Frequentism
• Classical probability: Frequentists
– Probability of a particular event is defined relative to its frequency
in a sample space of events.
– E.g., probability of “the coin will come up heads on the next trial”
is defined relative to the frequency of heads in a sample space of
coin tosses.
• Bayesian probability:
– Combine measure of “prior” belief you have in a proposition with
your subsequent observations of events.
• Example: A Bayesian can assign a probability to the statement “There was
life on Mars a billion years ago”, but a frequentist cannot.
Independence and Conditional
Independence
• Recall that two random variables, X and Y,
are independent if
P(X, Y) = P(X) P(Y)
• Two random variables, X and Y, are
conditionally independent given C if
P(X, Y | C) = P(X | C) P(Y | C)
Naive Bayes Classifier

Let f(x) be a target function for classification: f(x) ∈ {+1, −1}.
Let x = (x1, x2, ..., xn).

We want to find the most probable class value, given the data x:

classMAP = argmax_{class ∈ {+1,−1}} P(class | D)
         = argmax_{class ∈ {+1,−1}} P(class | x1, x2, ..., xn)

By Bayes theorem:

classMAP = argmax_{class ∈ {+1,−1}} P(x1, x2, ..., xn | class) P(class) / P(x1, x2, ..., xn)
         = argmax_{class ∈ {+1,−1}} P(x1, x2, ..., xn | class) P(class)

P(class) can be estimated from the training data. How?

However, in general it is not practical to use the training data to
estimate P(x1, x2, ..., xn | class). Why not?

• Naive Bayes classifier: Assume

P(x1, x2, ..., xn | class) = P(x1 | class) P(x2 | class) ··· P(xn | class)

Is this a good assumption?

Given this assumption, here's how to classify an instance x = (x1, x2, ..., xn):

Naive Bayes classifier:

classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏i P(xi | class)

To train: Estimate the values of these various probabilities over the training set.
Training data:

Day   Outlook    Temp   Humidity   Wind     PlayTennis
D1    Sunny      Hot    High       Weak     No
D2    Sunny      Hot    High       Strong   No
D3    Overcast   Hot    High       Weak     Yes
D4    Rain       Mild   High       Weak     Yes
D5    Rain       Cool   Normal     Weak     Yes
D6    Rain       Cool   Normal     Strong   No
D7    Overcast   Cool   Normal     Strong   Yes
D8    Sunny      Mild   High       Weak     No
D9    Sunny      Cool   Normal     Weak     Yes
D10   Rain       Mild   Normal     Weak     Yes
D11   Sunny      Mild   Normal     Strong   Yes
D12   Overcast   Mild   High       Strong   Yes
D13   Overcast   Hot    Normal     Weak     Yes
D14   Rain       Mild   High       Strong   No

Test data:

Day   Outlook    Temp   Humidity   Wind     PlayTennis
D15   Sunny      Cool   High       Strong   ?
Use training data to compute a probabilistic model:

P(Outlook = Sunny | Yes) = 2/9        P(Outlook = Sunny | No) = 3/5
P(Outlook = Overcast | Yes) = 4/9     P(Outlook = Overcast | No) = 0
P(Outlook = Rain | Yes) = 3/9         P(Outlook = Rain | No) = 2/5
P(Temperature = Hot | Yes) = 2/9      P(Temperature = Hot | No) = 2/5
P(Temperature = Mild | Yes) = 4/9     P(Temperature = Mild | No) = 2/5
P(Temperature = Cool | Yes) = 3/9     P(Temperature = Cool | No) = 1/5
P(Humidity = High | Yes) = 3/9        P(Humidity = High | No) = 4/5
P(Humidity = Normal | Yes) = 6/9      P(Humidity = Normal | No) = 1/5
P(Wind = Strong | Yes) = 3/9          P(Wind = Strong | No) = 3/5
P(Wind = Weak | Yes) = 6/9            P(Wind = Weak | No) = 2/5

Now classify the test instance:

Day   Outlook    Temp   Humidity   Wind     PlayTennis
D15   Sunny      Cool   High       Strong   ?

classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏i P(xi | class)
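To make the classification explicit, here is a short Python sketch of the computation for D15. The class priors P(Yes) = 9/14 and P(No) = 5/14 are implied by the training table; the variable names are just for illustration.

# Class priors from the training table (9 Yes, 5 No out of 14)
p_yes, p_no = 9/14, 5/14

# Conditional probabilities for D15 = (Sunny, Cool, High, Strong)
score_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)    # ≈ 0.0053
score_no  = p_no  * (3/5) * (1/5) * (4/5) * (3/5)    # ≈ 0.0206

print("Yes:", score_yes, "No:", score_no)
print("classNB(D15) =", "Yes" if score_yes > score_no else "No")   # -> No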
Estimating probabilities /
Smoothing
• Recap: In previous example, we had a training set and a new example,
(Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong)
• We asked: What classification is given by a naive Bayes classifier?
• Let nc be the number of training instances with class c.
• Let nc,xi=ak be the number of training instances with attribute
value xi = ak and class c.
• Then: P(xi = ak | c) = nc,xi=ak / nc
• Problem with this method: If nc is very
small, gives a poor estimate.
• E.g., P(Outlook = Overcast | no) = 0.
• Now suppose we want to classify a new
instance:
(Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong)
Then:
P(no)Õ P(xi | no) = 0
i
This incorrectly gives us zero probability due to small
sample.
One solution: Laplace smoothing (also called “add-one” smoothing).

For each class c and attribute xi with value ak, add one “virtual” instance.

That is, for each class c, recalculate:

P(xi = ak | c) = (nc,xi=ak + 1) / (nc + K)

where K is the number of possible values of attribute xi.
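A minimal sketch of this estimate in Python (the function and argument names are illustrative, not from the slides):

def laplace_estimate(instances, attr, value, cls, num_values):
    # P(attr = value | class = cls) with add-one (Laplace) smoothing.
    # instances: list of (feature_dict, class_label) pairs
    # num_values: K, the number of possible values of this attribute
    n_c = sum(1 for feats, c in instances if c == cls)
    n_c_val = sum(1 for feats, c in instances
                  if c == cls and feats[attr] == value)
    return (n_c_val + 1) / (n_c + num_values)

# With the PlayTennis data, P(Outlook = Overcast | No) becomes
# (0 + 1) / (5 + 3) = 1/8 instead of 0.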
Training data: (same PlayTennis table as above)
Laplace smoothing: Add the following virtual instances for Outlook:

Outlook=Sunny: Yes      Outlook=Overcast: Yes      Outlook=Rain: Yes
Outlook=Sunny: No       Outlook=Overcast: No       Outlook=Rain: No

P(Outlook = Overcast | No) = 0/5 → (0 + 1) / (5 + 3) = 1/8

P(Outlook = Overcast | Yes) = 4/9 → (4 + 1) / (9 + 3) = 5/12

P(Outlook = Sunny | Yes) = 2/9 → 3/12        P(Outlook = Sunny | No) = 3/5 → 4/8
P(Outlook = Overcast | Yes) = 4/9 → 5/12     P(Outlook = Overcast | No) = 0/5 → 1/8
P(Outlook = Rain | Yes) = 3/9 → 4/12         P(Outlook = Rain | No) = 2/5 → 3/8

Etc.
In-class exercise

1. Recall the Naïve Bayes Classifier:

classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏i P(xi | class)

Consider this training set, in which each instance has four binary features and a binary
class:

Instance   x1   x2   x3   x4   Class
x1         1    1    1    1    POS
x2         1    1    0    1    POS
x3         0    1    1    1    POS
x4         1    0    0    1    POS
x5         1    0    0    0    NEG
x6         1    0    1    0    NEG
x7         0    1    0    1    NEG

(a) Create a probabilistic model that you could use to classify new instances. That is,
calculate P(class) and P(xi | class) for each class. No smoothing is needed (yet).

(b) Use your probabilistic model to determine classNB for the following new instance:

Instance   x1   x2   x3   x4   Class
x8         0    0    0    1    ?

2. Recall the formula for Laplace smoothing:

P(xi = ak | c) = (nc,xi=ak + 1) / (nc + K)

where K is the number of possible values of attribute xi.

(a) Apply Laplace smoothing to all the probabilities from the training set in
question 1.

(b) Use the smoothed probabilities to determine classNB for the following
new instances:

Instance   x1   x2   x3   x4   Class
x10        0    1    0    0    ?
x11        0    0    0    0    ?
Naive Bayes on continuous-valued attributes
• How to deal with continuous-valued
attributes?
Two possible solutions:
– Discretize
– Assume particular probability distribution of
classes over values (estimate parameters from
training data)
Discretization: Equal-Width
Binning
For each attribute xi , create k equal-width bins in interval
from min(xi ) to max(xi).
The discrete “attribute values” are now the bins.
Questions: What should k be? What if some bins have very
few instances?
There is a trade-off between discretization bias and variance:
the more bins, the lower the bias, but the higher the variance,
due to the small sample size in each bin.
Discretization: Equal-Frequency
Binning
For each attribute xi , create k bins so that each bin contains an equal number of
values.
Also has problems: What should k be? Hides outliers. Can group together
instances that are far apart.
Gaussian Naïve Bayes
Assume that within each class, the values of each
numeric feature are normally distributed:

P(xi | c) = N(xi ; μi,c , σi,c) = (1 / (σi,c √(2π))) exp( −(xi − μi,c)² / (2 σi,c²) )

where μi,c is the mean of feature i given the class c,
and σi,c is the standard deviation of feature i given the
class c.

We estimate μi,c and σi,c from the training data.
Example

x1    x2    Class
3.0   5.1   POS
4.1   6.3   POS
7.2   9.8   POS
2.0   1.1   NEG
4.1   2.0   NEG
8.1   9.4   NEG

P(POS) = 0.5
P(NEG) = 0.5

Estimated class-conditional densities (see
http://homepage.stat.uiowa.edu/~mbognar/applets/normal.html):

N1,POS = N(x; 4.8, 1.8)
N2,POS = N(x; 7.1, 2.0)
N1,NEG = N(x; 4.7, 2.5)
N2,NEG = N(x; 4.2, 3.7)
Now, suppose you have a new example x, with x1 = 5.2, x2 = 6.3.
What is classNB(x)?

classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏i P(xi | class)

Note: N is a probability density function, but it can be used analogously to a probability
in Naïve Bayes calculations.

Positive:
P(POS) P(x1 | POS) P(x2 | POS) = (.5)(.22)(.18) = .02

Negative:
P(NEG) P(x1 | NEG) P(x2 | NEG) = (.5)(.16)(.09) = .0072

classNB(x) = POS
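The densities used above can be reproduced with a short Python sketch. Assuming the slide's parameters were estimated with the population standard deviation, the values come out close to the .22, .18, .16, and .09 plugged in above.

import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

x1, x2 = 5.2, 6.3
print(gaussian_pdf(x1, 4.8, 1.8))   # ≈ 0.22  (feature 1, POS)
print(gaussian_pdf(x2, 7.1, 2.0))   # ≈ 0.18  (feature 2, POS)
print(gaussian_pdf(x1, 4.7, 2.5))   # ≈ 0.16  (feature 1, NEG)
print(gaussian_pdf(x2, 4.2, 3.7))   # ≈ 0.09  (feature 2, NEG)

pos = 0.5 * gaussian_pdf(x1, 4.8, 1.8) * gaussian_pdf(x2, 7.1, 2.0)   # ≈ 0.02
neg = 0.5 * gaussian_pdf(x1, 4.7, 2.5) * gaussian_pdf(x2, 4.2, 3.7)   # ≈ 0.007
print("classNB(x) =", "POS" if pos > neg else "NEG")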
Use logarithms to avoid underflow

classNB(x) = argmax_{class ∈ {+1,−1}} P(class) ∏i P(xi | class)

           = argmax_{class ∈ {+1,−1}} log( P(class) ∏i P(xi | class) )

           = argmax_{class ∈ {+1,−1}} ( log P(class) + Σi log P(xi | class) )
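In code, the product of many small likelihoods is therefore replaced by a sum of logs. A tiny sketch with made-up probability values (not from the slides):

import math

prior = 0.5
likelihoods = [0.01, 0.003, 0.0005, 0.02]   # hypothetical per-feature values

product_score = prior
for p in likelihoods:
    product_score *= p    # can underflow when there are many features

log_score = math.log(prior) + sum(math.log(p) for p in likelihoods)

print(product_score)      # ~1.5e-10: already tiny for only four features
print(log_score)          # same ranking information, numerically stable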
Bayes Nets
Another example

A patient comes into a doctor's office with a bad cough and a high fever.

Hypothesis space H:
h1: patient has flu
h2: patient does not have flu

Data D: coughing = true, fever = true

Prior probabilities:          Likelihoods:
P(h1) = .1                    P(D | h1) = .8
P(h2) = .9                    P(D | h2) = .4

Posterior probabilities:
P(h1 | D) = ?
P(h2 | D) = ?

Probability of data:
P(D) = ?
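Filling in the blanks follows directly from Bayes rule (worked answer, not shown on the slide):

P(D) = P(D | h1) P(h1) + P(D | h2) P(h2) = (.8)(.1) + (.4)(.9) = .44
P(h1 | D) = (.8)(.1) / .44 ≈ .18
P(h2 | D) = (.4)(.9) / .44 ≈ .82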
• Let’s say we have the following random
variables:
cough
fever
flu
smokes
Full joint probability distribution

smokes:
                  cough                 ¬cough
             fever    ¬fever       fever    ¬fever
flu           p1        p2           p3       p4
¬flu          p5        p6           p7       p8

¬smokes:
                  cough                 ¬cough
             fever    ¬fever       fever    ¬fever
flu           p9        p10          p11      p12
¬flu          p13       p14          p15      p16

The sum of all sixteen entries is 1.

In principle, the full joint distribution can be used to answer any question
about probabilities of these combined variables.

However, the size of the full joint distribution scales exponentially with the
number of variables, so it is expensive to store and to compute with.
Bayesian networks
• Idea is to represent dependencies (or
causal relations) for all the variables so
that space and computation-time
requirements are minimized.

“Graphical Models”

[Network: flu and smoke are root nodes; cough has parents flu and smoke; fever has parent flu]

Conditional probability tables for each node:

flu:                     smoke:
true    false            true    false
0.01    0.99             0.2     0.8

cough:
flu      smoke    true    false
True     True     0.95    0.05
True     False    0.8     0.2
False    True     0.6     0.4
False    False    0.05    0.95

fever:
flu      true    false
true     0.9     0.1
false    0.2     0.8
Semantics of Bayesian networks
• If the network is correct, we can calculate the full joint
probability distribution from the network:

P((X1 = x1) ∧ (X2 = x2) ∧ ... ∧ (Xn = xn)) = ∏ i=1..n P(Xi = xi | parents(Xi))

where parents(Xi) denotes specific values of the parents of Xi.
Example
• Calculate
P[(cough = t) ∧ (fever = f) ∧ (flu = f) ∧ (smoke = f)]
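Using the network factorization and the conditional probability tables above, this works out as follows (worked answer, not shown on the slide):

P = P(flu = f) P(smoke = f) P(cough = t | flu = f, smoke = f) P(fever = f | flu = f)
  = (0.99)(0.8)(0.05)(0.8) ≈ 0.032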
Different types of inference in Bayesian Networks

Causal inference
Evidence is a cause; the inference is the probability of an effect.
Example: Instantiate evidence flu = true. What is P(fever | flu)?
P(fever | flu) = .9 (up from .207)

Diagnostic inference
Evidence is an effect; the inference is the probability of a cause.
Example: Instantiate evidence fever = true. What is P(flu | fever)?
P(flu | fever) = P(fever | flu) P(flu) / P(fever) = (.9)(.01) / .207 ≈ .043 (up from .01)

Example: What is P(flu | cough)?
P(flu | cough) = P(cough | flu) P(flu) / P(cough)
              = [P(cough | flu, smoke) P(smoke) + P(cough | flu, ¬smoke) P(¬smoke)] P(flu) / P(cough)
              = [(.95)(.2) + (.8)(.8)](.01) / .167
              ≈ .0497
Inter-causal inference
“Explaining away” different possible causes of an effect.
Example: What is P(flu | cough, smoke)?

P(flu | cough, smoke) = P(flu ∧ cough ∧ smoke) / P(cough ∧ smoke)
  = P(cough | flu, smoke) P(flu) P(smoke)
    / [P(cough | flu, smoke) P(flu) P(smoke) + P(cough | ¬flu, smoke) P(¬flu) P(smoke)]
  = (.95)(.01)(.2) / [(.95)(.01)(.2) + (.6)(.2)(.99)]
  ≈ 0.016

Why is P(flu | cough, smoke) < P(flu | cough)?
Complexity of Bayesian Networks

For n random Boolean variables:
• Full joint probability distribution: 2^n entries
• Bayesian network with at most k parents per node:
– Each conditional probability table: at most 2^k entries
– Entire network: about n · 2^k entries
What are the advantages
of Bayesian networks?
• Intuitive, concise representation of joint
probability distribution (i.e., conditional
dependencies) of a set of random variables.
• Represents “beliefs and knowledge” about a
particular class of situations.
• Efficient (?) (approximate) inference algorithms
• Efficient, effective learning algorithms
Issues in Bayesian Networks
• Building / learning network topology
• Assigning / learning conditional probability
tables
• Approximate inference via sampling
Real-World Example:
The Lumière Project at Microsoft Research
• Bayesian network approach to answering user
queries about Microsoft Office.
• “At the time we initiated our project in Bayesian
information retrieval, managers in the Office division
were finding that users were having difficulty finding
assistance efficiently.”
• “As an example, users working with the Excel
spreadsheet might have required assistance with
formatting ‘a graph’. Unfortunately, Excel has no
knowledge about the common term, ‘graph,’ and
only considered in its keyword indexing the term
‘chart’.”
• Networks were developed by experts
from user modeling studies.
Offspring of project was Office Assistant in
Office 97.
[Bayes net: flu is the parent of smoke, fever, cough, nausea, and headache]

P(flu ∧ smoke ∧ fever ∧ cough ∧ nausea ∧ headache)
  = P(flu) P(smoke | flu) P(fever | flu) P(cough | flu) P(nausea | flu) P(headache | flu)

More generally, for classification:

P(C = cj | X1 = x1, ..., Xn = xn) ∝ P(C = cj) ∏i P(Xi = xi | C = cj)

“Naive Bayes”
Learning network topology
• Many different approaches, including:
– Heuristic search, with evaluation based on
information theory measures
– Genetic algorithms
– Using “meta” Bayesian networks!
Learning conditional
probabilities
• In general, random variables are not
binary, but real-valued
• Conditional probability tables become
conditional probability distributions
• Estimate parameters of these distributions
from data
Approximate inference via sampling
• Recall: We can calculate full joint probability distribution
from network.
P(X1, ..., Xd) = ∏ i=1..d P(Xi | parents(Xi))

where parents(Xi) denotes specific values of the parents of Xi.
• We can do diagnostic, causal, and inter-causal inference
• But if there are a lot of nodes in the network, this can be
very slow!
Need efficient algorithms to do approximate calculations!
A Précis of Sampling Algorithms

Gibbs Sampling
• Suppose that we want to sample from p(x | θ), x ∈ ℝ^D.

Basic Idea: Sample sequentially from the “full conditionals.”
• Initialize: x_i, i = 1, ..., D
• For t = 1, ..., T:
    sample x_1^(t+1) ~ p(x_1 | x_(−1)^(t))
    sample x_2^(t+1) ~ p(x_2 | x_(−2)^(t))
    ...
    sample x_D^(t+1) ~ p(x_D | x_(−D)^(t))

where x_(−i) denotes all components except x_i.

• Complications: (i) how to “order” the samples (can be random, but
be careful); (ii) we need the full conditionals (can use an approximation).
• Under “nice” assumptions (ergodicity), we eventually get samples x ~ p(x | θ).
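As a concrete illustration (not from the slides), here is a minimal Gibbs sampler in Python for a standard bivariate Gaussian with correlation rho, a case where both full conditionals are known in closed form:

import random

def gibbs_bivariate_normal(rho, T=10_000):
    # Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]])
    x1, x2 = 0.0, 0.0             # initialize
    sd = (1 - rho ** 2) ** 0.5    # std dev of each full conditional
    samples = []
    for _ in range(T):
        x1 = random.gauss(rho * x2, sd)   # sample x1 | x2
        x2 = random.gauss(rho * x1, sd)   # sample x2 | x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_normal(rho=0.8)
# After discarding burn-in, the empirical correlation should be close to 0.8.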
A Précis of Sampling Algorithms

Sampling Algorithms More Generally:
• How do we perform posterior inference more generally?
• Oftentimes, we rely on strong parametric assumptions (e.g.,
conjugacy, exponential family structures).
• Monte Carlo approximation/inference can get around this.
• Basic Idea: (1) Draw samples x^s ~ p(x | θ).
(2) Compute the quantity of interest, e.g. (marginals):

E[x1 | D] ≈ (1/S) Σs x1^s , etc.

In general: E[f(x)] = ∫ f(x) p(x | θ) dx ≈ (1/S) Σs f(x^s)
A Précis of Sampling Algorithms

Sampling Algorithms More Generally:
CDF Method (MC technique)
Steps: (1) sample u ~ U(0, 1)
       (2) set x = F⁻¹(u); then x ~ F
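For instance (an illustration, not from the slides), an Exponential(λ) variable has F⁻¹(u) = −ln(1 − u)/λ, so inverse-CDF sampling looks like this in Python:

import math
import random

def sample_exponential(lam):
    u = random.random()                 # u ~ U(0, 1)
    return -math.log(1.0 - u) / lam     # x = F^{-1}(u), so x ~ Exponential(lam)

draws = [sample_exponential(2.0) for _ in range(100_000)]
print(sum(draws) / len(draws))          # should be close to 1/lam = 0.5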
A Précis of Sampling Algorithms

Sampling Algorithms More Generally:
Rejection Sampling (MC technique)
Sample x ~ q(x) from a proposal distribution and accept it with probability
p(x) / (c q(x)), where c q(x) ≥ p(x) everywhere; one can show that the accepted
samples satisfy x ~ p(x).
Issues: we need a “good” q(x), and c and the rejection rate can grow astronomically!
Pros of MC sampling: samples are independent. Cons: very inefficient in
high dimensions. Alternatively, one can use MCMC methods.
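A minimal rejection-sampling sketch in Python (the target and proposal are illustrative choices, not from the slides): sample from the Beta(2, 2) density p(x) = 6x(1 − x) on [0, 1] using a uniform proposal q(x) = 1 and envelope constant c = 1.5.

import random

def target(x):
    return 6.0 * x * (1.0 - x)    # Beta(2, 2) density on [0, 1], max value 1.5

def rejection_sample(c=1.5):
    while True:
        x = random.random()                  # x ~ q(x) = Uniform(0, 1)
        u = random.random()
        if u <= target(x) / (c * 1.0):       # accept with probability p(x) / (c q(x))
            return x

draws = [rejection_sample() for _ in range(50_000)]
print(sum(draws) / len(draws))               # mean of Beta(2, 2) is 0.5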
Markov Chain Monte Carlo
Sampling
• One of most common methods used in real applications.
• Recall that: By construction of Bayesian network, a node
is conditionally independent of its non-descendants, given
its parents.
• Also recall that: a node can be conditionally dependent
on its children and on the other parents of its children.
(Why?)
• Definition: The Markov blanket of a variable Xi is Xi’s
parents, children, and children’s other parents.
Example
• What is the Markov blanket of cough? Of flu?

[Network from before: flu and smokes are parents of cough; flu is the parent of fever]
• Theorem: A node Xi is conditionally
independent of all other nodes in the
network, given its Markov blanket.
Markov Chain Monte Carlo
(MCMC) Sampling
• Start with random sample from variables:
(x1, ..., xn). This is the current “state” of
the algorithm.
• Next state: Randomly sample value for
one non-evidence variable Xi , conditioned
on current values in “Markov Blanket” of
Xi.
Example

• Query: What is P(cough | smoke)?
• MCMC:
– Random sample, with evidence variables fixed:

flu      smoke    fever    cough
true     true     false    true

– Repeat:
1. Sample flu probabilistically, given current values of its
Markov blanket: smoke = true, fever = false, cough = true.
Suppose the result is false. New state:

flu      smoke    fever    cough
false    true     false    true

2. Sample cough, given current values of its Markov blanket:
smoke = true, flu = false.
Suppose the result is true. New state:

flu      smoke    fever    cough
false    true     false    true

3. Sample fever, given current values of its Markov blanket:
flu = false.
Suppose the result is true. New state:

flu      smoke    fever    cough
false    true     true     true
• Each sample contributes to estimate for query
P(cough | smoke)
• Suppose we perform 100 such samples, 20 with cough = true
and 80 with cough= false.
• Then the estimated answer to the query is
P(cough | smoke) = .20
• Theorem: In the limit, MCMC settles into behavior in which each state is
sampled according to its posterior probability, given
the evidence.
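To make the procedure concrete, here is a minimal Gibbs-style MCMC sketch in Python for the flu network above, estimating P(cough | smoke = true). The code structure and names are illustrative; note that with the CPTs given earlier the exact answer is about 0.60, not the hypothetical 0.20 used on the slide.

import random

# CPTs from the earlier slides (True means the variable is true)
P_FLU_TRUE = 0.01
P_COUGH_TRUE = {(True, True): 0.95, (True, False): 0.8,
                (False, True): 0.6, (False, False): 0.05}   # keyed by (flu, smoke)
P_FEVER_TRUE = {True: 0.9, False: 0.2}                      # keyed by flu

def p_cough(cough, flu, smoke):
    p = P_COUGH_TRUE[(flu, smoke)]
    return p if cough else 1 - p

def p_fever(fever, flu):
    p = P_FEVER_TRUE[flu]
    return p if fever else 1 - p

def gibbs_p_cough_given_smoke(T=200_000, burn_in=1_000):
    smoke = True                               # evidence, held fixed
    flu, fever, cough = False, False, True     # arbitrary initial state
    hits, kept = 0, 0
    for t in range(T):
        # flu: P(flu | blanket) is proportional to P(flu) P(fever | flu) P(cough | flu, smoke)
        w_t = P_FLU_TRUE * p_fever(fever, True) * p_cough(cough, True, smoke)
        w_f = (1 - P_FLU_TRUE) * p_fever(fever, False) * p_cough(cough, False, smoke)
        flu = random.random() < w_t / (w_t + w_f)
        # fever: its Markov blanket is just its parent flu
        fever = random.random() < P_FEVER_TRUE[flu]
        # cough: its Markov blanket is its parents flu and smoke
        cough = random.random() < P_COUGH_TRUE[(flu, smoke)]
        if t >= burn_in:
            kept += 1
            hits += cough
    return hits / kept

# Exact answer for comparison: 0.95 * 0.01 + 0.6 * 0.99 ≈ 0.60
print(gibbs_p_cough_given_smoke())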
Applying Bayesian Reasoning
to
Speech Recognition
• Task: Identify sequence of words uttered by speaker, given
acoustic signal.
• Uncertainty introduced by noise, speaker error, variation in
pronunciation, homonyms, etc.
• Thus speech recognition is viewed as problem of probabilistic
inference.
• So far, we’ve looked at probabilistic
reasoning in static environments.
• Speech: Time sequence of “static
environments”.
– Let X be the “state variables” (i.e., set of
non-evidence variables) describing the
environment (e.g., Words said during time
step t)
– Let E be the set of evidence variables
(e.g., S = features of acoustic signal).
– The E values and the joint probability
distribution over X change over time:
t1: X1, e1
t2: X2 , e2
etc.
• At each t, we want to compute P(Words | S).
• We know from Bayes rule:
P(Words | S) = α P(S | Words) P(Words), where α is a normalizing constant.
• P(S | Words), for all words, is a previously learned
“acoustic model”.
– E.g. For each word, probability distribution over
phones, and for each phone, probability distribution
over acoustic signals (which can vary in pitch, speed,
volume).
• P(Words), for all words, is the “language model”, which
specifies prior probability of each utterance.
– E.g. “bigram model”: probability of each word
following each other word.
• Speech recognition typically makes three assumptions:
1. Process underlying change is itself “stationary”
i.e., state transition probabilities don’t change
2. Current state X depends on only a finite history of
previous states (“Markov assumption”).
– Markov process of order n: Current state
depends only on n previous states.
3. Values et of evidence variables depend only on
current state Xt. (“Sensor model”)
Hidden Markov Models
• Markov model: Given state Xt, what is probability of
transitioning to next state Xt+1 ?
• E.g., word bigram probabilities give
P (wordt+1 | wordt )
• Hidden Markov model: There are observable states (e.g.,
signal S) and “hidden” states (e.g., Words). HMM
represents probabilities of hidden states given observable
states.
Example: “I’m firsty, um, can I have something to dwink?”
Graphical Models and
Computer Vision