Learning with Bayesian Networks

Author:
David Heckerman
Presented By:
Yan Zhang - 2006
Jeremy Gould - 2013
Chip Galusha - 2014
1
Outline
• Bayesian Approach
• Bayesian vs. classical probability methods
• Bayes Theorem
• Examples
• Bayesian Network
• Structure
• Inference
• Learning Probabilities
• Dealing with Unknowns
• Learning the Network Structure
• Two coin toss – an example
• Conclusions
• Exam Questions
2
Bayesian vs. the Classical Approach
• Bayesian statistical methods start with existing 'prior' beliefs and update these using data to give 'posterior' beliefs, which may be used as the basis for inferential decisions and probability assessments.
• Classical probability refers to the true or actual probability of the event and is not concerned with observed behavior.
3
Example – Is this Man a Martian Spy?
4
Example
We start with two concepts:
1. Hypothesis (H) – He either is or is not a Martian
spy.
2. Data (D) – Some set of information about the
subject. Perhaps financial data, phone records,
maybe we bugged his office…
5
Example
Frequentist Says
Given a hypothesis (He IS a
Martian) there is a
probability P of seeing this
data:
P( D | H )
(Considers absolute ground
truth, the uncertainty/noise
is in the data.)
Bayesian Says
Given this data there is a
probability P of this
hypothesis being true:
P( H | D )
(This probability indicates
our level of belief in the
hypothesis.)
6
Bayesian vs. the Classical Approach
• The Bayesian approach restricts its prediction to the next (N+1) occurrence of an event given the observed previous (N) events.
• The classical approach is to predict the likelihood of any given event regardless of the number of occurrences.
NOTE: The Bayesian approach can be
updated as new data is observed.
7
Bayes Theorem

p(θ | y) = p(θ) p(y | θ) / p(y)

where

p(y) = ∫ p(θ) p(y | θ) dθ   for continuous data
p(y) = Σ_i p(θ_i) p(y | θ_i)   for discrete data

In both cases, p(y) is a marginal distribution and can be thought of as a normalizing constant, which allows us to rewrite the above as:

p(θ | y) ∝ p(y | θ) p(θ)
8
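To make the theorem concrete, here is a tiny numeric sketch tied to the spy example. All numbers (the prior P(H) and the likelihoods) are made up for illustration; they are not from the slides.

```python
# Hypothetical numbers for the Martian-spy example (not from the slides).
p_h = 0.01              # prior belief: P(H), "he is a spy"
p_d_given_h = 0.90      # likelihood of the observed data if he is a spy: P(D | H)
p_d_given_not_h = 0.05  # likelihood of the data if he is not: P(D | not H)

# Marginal / normalizing constant: P(D) = sum over both hypotheses
p_d = p_d_given_h * p_h + p_d_given_not_h * (1 - p_h)

# Bayes theorem: P(H | D) = P(D | H) P(H) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d
print(f"P(H | D) = {p_h_given_d:.3f}")  # ~0.154: the data raises, but does not prove, the hypothesis
```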
Example – Coin Toss
• I want to toss a coin n = 100 times. Let’s denote the random
variable X as the outcome of one flip:
• p(X = head) = θ
• p(X = tail) = 1 − θ
• Before doing this experiment we have some belief in our mind: the prior probability. Let's assume the prior on θ follows a Beta distribution (a common choice):
Example – Coin Toss
Beta Distribution
p( | , ) 



E[ ] 

(  )  1
 (1   )  1
( )( )

Var[ ] 
(   )2  (   1)
If we assume a fair coin we can fix
α = β = 5 which gives:
 5
E[ ] 
  .5
   10
(Hopefully, what you
were expecting!)
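As a quick check of these formulas, a few lines of Python reproduce the prior mean and variance for α = β = 5:

```python
# Moments of the Beta(alpha, beta) prior, computed from the formulas above.
alpha, beta = 5, 5
mean = alpha / (alpha + beta)                                    # E[theta]
var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))  # Var[theta]
print(mean, round(var, 4))  # 0.5 0.0227
```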
Example – Coin Toss
Now I can run my experiment. As I go I can update my beliefs based on the observed heads (h) and tails (t). Applying Bayes' law gives the posterior:

p(θ | y, α, β) = [ θ^h (1 − θ)^t ] · [ Γ(α + β) / (Γ(α) Γ(β)) · θ^(α−1) (1 − θ)^(β−1) ] / p(y)

(likelihood × prior, divided by the normalizing constant p(y))
11
Example – Coin Toss
Since we’re assuming a Beta distribution this becomes:
p( | y,  , )   y (1   ) y  n   1 (1   )  1
p( | y,  , )   y  1 (1   ) y  n   1
p( | y,  , )  Beta( | y   ,   n  y)
…our posterior probability. Supposing that we observed
h = 45, t = 55, we would get:

 h
5  45
E[ | y, , ] 

 .46
  h    t 5  45  5  55
12
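A minimal sketch of the same conjugate update in Python (observed counts added to the prior pseudo-counts; the posterior mean matches the hand calculation above):

```python
# Beta-Binomial update for the coin-toss example.
alpha, beta = 5, 5   # prior Beta(5, 5): pseudo-counts for heads and tails
h, t = 45, 55        # observed heads and tails (n = 100 flips)

# Posterior is Beta(alpha + h, beta + t); its mean is the predictive P(next flip = heads).
post_alpha, post_beta = alpha + h, beta + t
posterior_mean = post_alpha / (post_alpha + post_beta)
print(f"E[theta | data] = {posterior_mean:.3f}")  # 50/110 ~ 0.455
```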
Example – Coin Toss
13
Integration
To find the probability that Xn+1= heads, we
could also integrate over all possible values of
θ to find the average value of θ which yields:
P(X_{n+1} = heads | D, ξ) = ∫ P(X_{n+1} = heads | θ, ξ) P(θ | D, ξ) dθ
                          = ∫ θ P(θ | D, ξ) dθ = E[θ | D, ξ]
This might be necessary if we were working with a
distribution with a less obvious Expected Value.
14
More than Two Outcomes
• In the previous example, we used a Beta distribution to encode the states of the random variable. This was possible because there were only 2 states/outcomes of the variable X.
• In general, if the observed variable X is discrete, having r possible states {1, …, r}, the likelihood function is given by:

P(X = x^k | θ, ξ) = θ_k,   where k = 1, 2, …, r,   θ = (θ_1, …, θ_r),   and Σ_k θ_k = 1

• In this general case we can use a Dirichlet distribution instead:

Prior:      P(θ | ξ) = Dir(θ | α_1, α_2, …, α_r) = Γ(α) / (∏_{k=1}^r Γ(α_k)) · ∏_{k=1}^r θ_k^(α_k − 1),   where α = Σ_k α_k

Posterior:  P(θ | D, ξ) = Dir(θ | α_1 + n_1, α_2 + n_2, …, α_r + n_r)
15
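A small sketch of the same update for the multi-state case. The three-state variable, its uniform prior, and the data are assumptions for illustration only.

```python
from collections import Counter

# Dirichlet update for an r-state variable: posterior alphas = prior alphas + counts.
states = ["a", "b", "c"]
alpha = {s: 1.0 for s in states}            # uniform Dirichlet prior (assumed)
data = ["a", "c", "a", "b", "a", "c", "a"]  # observed cases (made up)

counts = Counter(data)
post = {s: alpha[s] + counts.get(s, 0) for s in states}

# The predictive probability of each state is its normalized posterior parameter.
total = sum(post.values())
pred = {s: post[s] / total for s in states}
print(post)  # {'a': 5.0, 'b': 2.0, 'c': 3.0}
print(pred)  # P(a) = 0.5, P(b) = 0.2, P(c) = 0.3
```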
Vocabulary Review
• Prior Probability, P(θ): probability of a particular value of θ given no observed data (our previous "belief").
• Posterior Probability, P(θ | D): probability of a particular value of θ given that D has been observed (our final "belief" about θ).
• Observed Probability or "Likelihood", P(D | θ): likelihood of the sequence of coin tosses D being observed given that θ is a particular value.
• P(D): raw probability of D.
16
Bayesian Advantages
It turns out that the Bayesian technique permits us to do
some very useful things from a mining perspective!
1. We can use the Chain Rule with Bayesian Probabilities:
P(A_1, A_2, …, A_n) = ∏_{k=1}^n P(A_k | A_1, …, A_{k−1})

Ex.  P(A,B,C) = P(A|B,C) · P(B|C) · P(C)

This isn't something we can easily do with classical probability!
2. As we’ve already seen using the Bayesian model
permits us to update our beliefs based on new data.
17
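A two-line numeric sketch of point 1, the chain rule as a product of conditionals (the probabilities are made up):

```python
# P(A, B, C) = P(A | B, C) * P(B | C) * P(C) -- a product of conditionals, not a sum.
p_c, p_b_given_c, p_a_given_bc = 0.3, 0.6, 0.2  # assumed values
p_abc = p_a_given_bc * p_b_given_c * p_c
print(p_abc)  # 0.036
```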
Outline
• Bayesian Approach
• Bayes Theorem
• Bayesian vs. classical probability methods
• Coin toss – an example
• Bayesian Network
• Structure
• Inference
• Learning Probabilities
• Dealing with Unknowns
• Learning the Network Structure
• Two coin toss – an example
• Conclusions
• Exam Questions
18
Bayesian Network
• To create a Bayesian network we will ultimately need
3 things:
• A set of Variables X={X1,…, Xn}
• A Network Structure S
• Conditional Probability Table (CPT)
• Note that when we start we may not have any of
these things or a given element may be incomplete!
• Probabilities encoded by a Bayesian network may be Bayesian, physical, or both.
19
A Little Notation
• S: the network structure
• Sh: the hypothesis corresponding to network structure S
• Sc: a complete network structure
• Xi: a variable and its corresponding node
• Pai: the variables or nodes corresponding to the parents of node Xi
• D: a data set
20
Bayesian Network: Detecting Credit Card
Fraud
Let’s start with a simple case where we are given all three things: a credit
fraud network designed to determine the probability of credit fraud.
21
Bayesian Network: Setup
• Correctly identify goals
• Identify many possible relevant observations
• Determine what subset of those observations is worth modeling
• Organize observations into variables and choose an ordering
22
Set of Variables
Each node represents a random variable. (Let's assume discrete for now.)
23
Network Structure
Each edge/arc represents a conditional dependence between variables.
24
Conditional Probability Table
Each table entry quantifies a conditional dependency.
25
Conditional Dependencies
Since we’ve been given the network structure we can easily see
the conditional dependencies:
P(A|F) = P(A)
P(S|F,A) = P(S)
P(G|F,A,S) = P(G|F)
P(J|F,A,S,G) = P(J|F,A,S)
Need to be careful with the order!
26
Note that the absence of an edge indicates
conditional independence:
P(A|G) = P(A)
27
Important Note:
The presence of a cycle will render one or more of the relationships intractable!
28
Inference
Now suppose we want to calculate (infer) our confidence level
in a hypothesis on the fraud variable f given some knowledge
about the other variables. This can be directly calculated via:
P(f | a, s, g, j) = P(f, a, s, g, j) / P(a, s, g, j)
                  = P(f, a, s, g, j) / Σ_{f′} P(f′, a, s, g, j)
(Kind of messy…)
29
Inference
Fortunately, we can use the Chain Rule to simplify!
P(f | a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|f,a,s) / Σ_{f′} P(f′) P(a) P(s) P(g|f′) P(j|f′,a,s)
                  = P(f) P(g|f) P(j|f,a,s) / Σ_{f′} P(f′) P(g|f′) P(j|f′,a,s)
This simplification is especially powerful when the network is sparse, which is frequently the case in real-world problems.
This shows how we can use a Bayesian Network to
infer a probability not stored directly in the model.
30
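Here is a sketch of that inference by enumeration in Python. The factorization is the one above; every CPT number is invented for illustration (the real values live in the network figure, which is not reproduced here).

```python
# Inference by enumeration on the fraud network:
# P(f, a, s, g, j) = P(f) P(a) P(s) P(g|f) P(j|f,a,s).
# All CPT values below are hypothetical placeholders.

p_f = {True: 0.0001, False: 0.9999}              # P(Fraud)
p_a = {"<30": 0.25, "30-50": 0.40, ">50": 0.35}  # P(Age)
p_s = {"male": 0.5, "female": 0.5}               # P(Sex)
p_g_given_f = {True: 0.2, False: 0.01}           # P(Gas purchase | Fraud)

def p_j_given_fas(f, a, s):
    """P(Jewelry purchase | Fraud, Age, Sex) -- hypothetical values."""
    if f:
        return 0.05
    base = {"male": 0.0001, "female": 0.0005}[s]
    return base * {"<30": 1.0, "30-50": 2.0, ">50": 3.0}[a]

def joint(f, a, s, g, j):
    """Full joint probability from the network's factorization."""
    pg = p_g_given_f[f] if g else 1 - p_g_given_f[f]
    pj = p_j_given_fas(f, a, s) if j else 1 - p_j_given_fas(f, a, s)
    return p_f[f] * p_a[a] * p_s[s] * pg * pj

# P(fraud | age < 30, male, gas = yes, jewelry = yes); the f' sum is just the two values of f.
evidence = dict(a="<30", s="male", g=True, j=True)
num = joint(True, **evidence)
den = num + joint(False, **evidence)
print(f"P(fraud | evidence) = {num / den:.3f}")
```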
Now for the Data Mining!
• So far we haven’t added much value to the data. So let’s take
advantage of the Bayesian model’s ability to update our beliefs and
learn from new data.
• First we’ll rewrite our joint probability distribution in a more compact
form:
P(x | θs, Sh) = ∏_{i=1}^n P(x_i | pa_i, θ_i, Sh)

where:
Sh is the current hypothesis (basically like ξ from before)
θ_i is the vector of parameters for P(x_i | pa_i, θ_i, Sh)
x = (x_1, x_2, …, x_n)
θs = (θ_1, θ_2, …, θ_n)
pa_i is the configuration of the parents of x_i, and
P(x_i = x_i^k | pa_i = pa_i^j, θ_i, Sh) = θ_ijk > 0

From here we can find P(θs | Sh).
31
Learning Probabilities in a Bayesian Network
First we need to make two assumptions:
1. There is no missing data (i.e., the data accurately describes the distribution).
2. The parameter vectors are independent (generally a good assumption, at least locally).
32
Learning Probabilities in a Bayesian Network
If these assumptions hold we can express the probabilities as:

Prior:      P(θs | Sh) = ∏_{i=1}^n ∏_{j=1}^{q_i} P(θ_ij | Sh)
Posterior:  P(θs | D, Sh) = ∏_{i=1}^n ∏_{j=1}^{q_i} P(θ_ij | D, Sh)

This means we can update each vector of parameters θ_ij independently, just as in the one-variable case!

• If each vector θ_ij has the prior distribution Dir(θ_ij | α_ij1, …, α_ijr_i)…
• …then the posterior distribution is:

P(θ_ij | D, Sh) = Dir(θ_ij | α_ij1 + n_ij1, …, α_ijr_i + n_ijr_i)

where n_ijk is the number of cases in D in which X_i = x_i^k and Pa_i = pa_i^j.
33
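A sketch of this counting-plus-prior update on a toy data set. The two-node network (X1 with no parents, X2 with parent X1), the data, and the uniform priors α_ijk = 1 are all assumptions made for the example.

```python
from collections import defaultdict

# Learning CPT parameters: posterior Dirichlet parameters are alpha_ijk + n_ijk,
# where n_ijk counts cases with X_i in state k and parents in configuration j.
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("T", "H")]  # cases (X1, X2), made up
states = ("H", "T")
alpha = 1.0  # assumed uniform prior parameter for every (i, j, k)

# counts[node][parent_config][state] = n_ijk; X2's parent is X1, X1 has no parents.
counts = defaultdict(lambda: defaultdict(lambda: {s: 0 for s in states}))
for x1, x2 in data:
    counts["X1"][()][x1] += 1
    counts["X2"][(x1,)][x2] += 1

# Posterior means (alpha + n) / sum(alpha + n) give the learned CPT estimates.
for node, configs in counts.items():
    for pa, ks in configs.items():
        total = sum(alpha + n for n in ks.values())
        print(node, pa, {k: (alpha + n) / total for k, n in ks.items()})
```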
What if we are missing data?
• Distinguish whether the missing data depends on the variable states or is independent of state.
• If independent of state:
1. Monte Carlo Methods -> Gibbs Sampling
2. Gaussian Approximations
34
Dealing with Unknowns
Now we know how to use our network to infer
conditional relationships and how to update our
network with new data. But what if we aren't given a well-defined network? We could start with missing or incomplete:
1. Set of Variables
2. Conditional Relationship Data
3. Network Structure
35
Unknown Variable Set
Our goal when choosing variables is to:
“Organize…into variables having mutually exclusive and
collectively exhaustive states.”
This is a problem shared by all data mining algorithms: What
should we measure and why? There is not and probably
cannot be an algorithmic solution to this problem as arriving
at any solution requires intelligent and creative thought.
36
Unknown Conditional Relationships
This can be easy.
So long as we can generate a plausible initial belief about
a conditional relationship we can simply start with our
assumption and let our data refine our model via the
mechanism shown in the Learning Probabilities in a
Bayesian Network slide.
37
Unknown Conditional Relationships
However, when our ignorance becomes serious
enough that we no longer even know what is
dependent on what we segue into the Unknown
Structure scenario.
38
Learning the Network Structure
Sometimes the conditional relationships are
not obvious. In this case we are uncertain about the network structure: we don't know where the edges should be.
39
Learning the Network Structure
Theoretically, we can use a Bayesian approach to get the
posterior distribution of the network structure:
P(Sh | D) = P(D | Sh) P(Sh) / Σ_{i=1}^m P(D | Shi) P(Shi)

Unfortunately, the number of possible network structures increases exponentially with n – the number of nodes. We're basically asking ourselves to consider every possible graph with n nodes!
40
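To see how fast the search space blows up, here is a short sketch using Robinson's recurrence for the number of labeled DAGs on n nodes (the recurrence is standard, not taken from these slides):

```python
from math import comb

def num_dags(n, _cache={0: 1}):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n not in _cache:
        _cache[n] = sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
                        for k in range(1, n + 1))
    return _cache[n]

for n in range(1, 8):
    print(n, num_dags(n))
# 1, 3, 25, 543, 29281, 3781503, 1138779265 -- already over a billion structures at n = 7
```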
Learning the Network Structure
Two main methods for shortening the search for a network model:

• Model Selection: select a "good" model (i.e., the network structure) from all possible models, and use it as if it were the correct model.
• Selective Model Averaging: select a manageable number of good models from among all possible models and pretend that these models are exhaustive.

The math behind both techniques is quite involved, so I'm afraid we'll have to content ourselves with a toy example today.
41
Two Coin Toss Example
Sh1:  X1   X2   (no edge; the coins are independent)
      p(H) = p(T) = 0.5 for each coin

Sh2:  X1 → X2
      p(X1 = H) = p(X1 = T) = 0.5
      P(X2 | X1):  p(H|H) = 0.1,  p(T|H) = 0.9,  p(H|T) = 0.9,  p(T|T) = 0.1

• Experiment: flip two coins and observe the outcome
• Propose two network structures: Sh1 or Sh2
• Assume P(Sh1) = P(Sh2) = 0.5
• After observing some data, which model is more accurate for this collection of data?
42
Two Coin Toss Example
Observed data D (10 flips of the pair):

  Case:  1   2   3   4   5   6   7   8   9   10
  X1:    T   T   H   H   T   H   T   T   H   H
  X2:    T   H   T   T   H   T   H   H   T   T

P(Sh | D) = P(D | Sh) P(Sh) / Σ_{i=1}^2 P(D | Shi) P(Shi)

P(D | Sh) = ∏_{d=1}^{10} ∏_{i=1}^{2} P(X_di | Pa_i, Sh)
43
Two Coin Toss Example
(Same data D as on the previous slide.)

For Sh1:

P(D | Sh1) = ∏_{d=1}^{10} ∏_{i=1}^{2} P(X_di | Pa_i, Sh1)
           = [P(X1 = T) P(X2 = T)] · [P(X1 = T) P(X2 = H)] · …
           = (0.5 · 0.5) · (0.5 · 0.5) · …
           = (0.5^2)^10 = 0.5^20
44
Two Coin Toss Example
(Same data D as before.)

For Sh2:

P(D | Sh2) = ∏_{d=1}^{10} ∏_{i=1}^{2} P(X_di | Pa_i, Sh2)

  P(X1 = T) P(X2 = T | X1 = T) = (0.5)(0.1)
  P(X1 = T) P(X2 = H | X1 = T) = (0.5)(0.9)
  P(X1 = H) P(X2 = T | X1 = H) = (0.5)(0.9)
  ⋮

P(D | Sh2) = 0.5^10 · 0.1^1 · 0.9^9
45
Two Coin Toss Example
P(Sh1 | D) = P(D | Sh1) P(Sh1) / [ P(D | Sh1) P(Sh1) + P(D | Sh2) P(Sh2) ]

P(Sh1 | D) = (0.5^2)^10 (0.5) / [ (0.5^2)^10 (0.5) + 0.5^10 · 0.1^1 · 0.9^9 · (0.5) ]

P(Sh1 | D) = 0.5^10 / (0.5^10 + 0.1^1 · 0.9^9)

P(Sh1 | D) ≈ 2.5%
46
Two Coin Toss Example
P(Sh2 | D) = P(D | Sh2) P(Sh2) / [ P(D | Sh1) P(Sh1) + P(D | Sh2) P(Sh2) ]

P(Sh2 | D) = 0.5^10 · 0.1^1 · 0.9^9 · (0.5) / [ (0.5^2)^10 (0.5) + 0.5^10 · 0.1^1 · 0.9^9 · (0.5) ]

P(Sh2 | D) = 0.1^1 · 0.9^9 / (0.5^10 + 0.1^1 · 0.9^9)

P(Sh2 | D) ≈ 97.5%
47
Two Coin Toss Example
P(Sh2 | D) ≈ 97.5%  >  2.5% ≈ P(Sh1 | D)

Therefore we should prefer Sh2 by score.
48
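A short Python sketch reproduces these numbers directly from the data table:

```python
# Structure comparison for the two-coin example, using the data D from the slides.
data = [("T", "T"), ("T", "H"), ("H", "T"), ("H", "T"), ("T", "H"),
        ("H", "T"), ("T", "H"), ("T", "H"), ("H", "T"), ("H", "T")]

def case_likelihood_s1(x1, x2):
    # Sh1: X1 and X2 independent, each a fair coin.
    return 0.5 * 0.5

def case_likelihood_s2(x1, x2):
    # Sh2: X1 fair, X2 disagrees with X1 with probability 0.9.
    return 0.5 * (0.9 if x1 != x2 else 0.1)

p_d_s1 = p_d_s2 = 1.0
for x1, x2 in data:
    p_d_s1 *= case_likelihood_s1(x1, x2)
    p_d_s2 *= case_likelihood_s2(x1, x2)

prior = 0.5  # P(Sh1) = P(Sh2) = 0.5
evidence = p_d_s1 * prior + p_d_s2 * prior
print(f"P(Sh1 | D) = {p_d_s1 * prior / evidence:.3f}")  # ~0.025
print(f"P(Sh2 | D) = {p_d_s2 * prior / evidence:.3f}")  # ~0.975
```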
Outline
• Bayesian Approach
• Bayes Theorem
• Bayesian vs. classical probability methods
• Coin toss – an example
• Bayesian Network
• Structure
• Inference
• Learning Probabilities
• Learning the Network Structure
• Two coin toss – an example
• Conclusions
• Exam Questions
49
Conclusions
• Bayesian method
• Bayesian network
• Structure
• Inference
• Learning with Bayesian Networks
• Dealing with Unknowns
50
Question 1: What are Bayesian Networks?
• A graphical model that encodes probabilistic relationships among variables of interest
51
Question 2: Compare the Bayesian and classical
approaches to probability (any one point).
Bayesian Approach:
• + Reflects an expert's knowledge
• + The belief keeps being updated as new data items arrive
• − Arbitrary (more subjective)
• Wants P( H | D )

Classical Probability:
• + Objective and unbiased
• − Needs repeated trials
• Wants P( D | H )
52
Question 3: Mention at least 1 Advantage of
Bayesian Networks
• Handle incomplete data sets by encoding dependencies
• Learn about causal relationships
• Combine domain knowledge and data
• Avoid overfitting
53
The End
• Any Questions?
54