
Representation for Undirected Graphical Models
Le Song
Machine Learning II: Advanced Topics
CSE 8803ML, Spring 2012
Summary of Directed Graphical Models
Local Markov assumptions: Xα΅’ ⊥ NonDescendants_Xα΅’ | Pa_Xα΅’
I-map: I_l(G) ⊆ I(P) ⇔ P(X₁, …, Xβ‚™) factorizes as ∏ᡒ P(Xα΅’ | Pa_Xα΅’)
Topological ordering, chain rule, local Markov assumptions
Fewer edges = stronger assumptions
If deleting any edge of G makes it no longer an I-map, then G is a minimal I-map for P
P-map: I(G) = I(P)
D-separation: active trails capture longer-range dependence
No active trail between Xα΅’ and Xβ±Ό given Z ⇒ Xα΅’ ⊥ Xβ±Ό | Z
[Figures: chain A → S → H; common cause N ← S → H; v-structure A → S ← F]
¬(A ⊥ H), but A ⊥ H | S
¬(N ⊥ H), but N ⊥ H | S
A ⊥ F, but ¬(A ⊥ F | S)
How about other derived relations?
[Figure: a Bayesian network over nodes A, B, C, D, E, F, G, H, I, J, K]
F ⊥ {B, E, G, J}?            Yes: F ⊥ {B, E, G, J}
{A, C, F, I} ⊥ {B, E, G, J}?  Yes: {A, C, F, I} ⊥ {B, E, G, J}
B ⊥ J | E?                   Yes: B ⊥ J | E
E ⊥ F | K?                   No:  ¬(E ⊥ F | K)
E ⊥ F | {K, I}?              Yes: E ⊥ F | {K, I}
F ⊥ G | D?                   No:  ¬(F ⊥ G | D)
F ⊥ G | H?                   No:  ¬(F ⊥ G | H)
F ⊥ G | {H, K}?              No:  ¬(F ⊥ G | {H, K})
F ⊥ G | {H, A}?              Yes: F ⊥ G | {H, A}
Generate Samples from Bayesian Networks
A BN describes a generative process for observations
First, sort the nodes in topological order
[Figure: the Flu network with nodes numbered in topological order: Flu, Allergy, Sinus, Nose, Headache]
Then, generate samples in this order according to the CPTs
Generate a set of samples for (A, F, S, N, H):
Sample aα΅’ ∼ P(A)
Sample fα΅’ ∼ P(F)
Sample sα΅’ ∼ P(S | aα΅’, fα΅’)
Sample nα΅’ ∼ P(N | sα΅’)
Sample hα΅’ ∼ P(H | sα΅’)
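A minimal sketch of this ancestral-sampling procedure in Python/NumPy; the CPT values below are made-up illustrations, not numbers from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical CPTs for the Flu network (binary variables); the numbers
# are illustrative only -- the slides do not specify them.
P_A = 0.2                                  # P(Allergy = 1)
P_F = 0.1                                  # P(Flu = 1)
P_S = {(0, 0): 0.05, (0, 1): 0.6,          # P(Sinus = 1 | A, F)
       (1, 0): 0.5,  (1, 1): 0.9}
P_N = {0: 0.1, 1: 0.8}                     # P(Nose = 1 | S)
P_H = {0: 0.2, 1: 0.7}                     # P(Headache = 1 | S)

def sample_one():
    """Draw one joint sample (a, f, s, n, h) in topological order."""
    a = rng.random() < P_A
    f = rng.random() < P_F
    s = rng.random() < P_S[(int(a), int(f))]
    n = rng.random() < P_N[int(s)]
    h = rng.random() < P_H[int(s)]
    return tuple(int(v) for v in (a, f, s, n, h))

samples = [sample_one() for _ in range(5)]
print(samples)
```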
Reduction in representation
Easy: one parent each
[Figure: D → C → X₁, Xβ‚‚, …, Xβ‚™]
P(X₁, Xβ‚‚, …, Xβ‚™, C, D) = P(D) P(C | D) ∏_{i=1…n} P(Xα΅’ | C)
Only need small two-way tables
Difficult: multiple parents for C
[Figure: X₁, Xβ‚‚, …, Xβ‚™ → C → D]
P(X₁, Xβ‚‚, …, Xβ‚™, C, D) = P(X₁) ⋯ P(Xβ‚™) P(C | X₁, …, Xβ‚™) P(D | C)
Still need an (n+1)-way table with 2ⁿ parameters
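A quick back-of-the-envelope count (assuming all variables are binary, which the slides do not state explicitly): the easy model needs 1 parameter for P(D), 2 for P(C | D), and 2 for each P(Xα΅’ | C), about 3 + 2n in total; in the difficult model the single table P(C | X₁, …, Xβ‚™) needs one entry per parent configuration, i.e. 2ⁿ of them. For n = 10 that is roughly 23 parameters versus 2¹⁰ = 1024 for that one table alone.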
Additional notation
Observed variable: filled circle
Hidden variable: open circle
[Figure: hidden variables H₁, Hβ‚‚, H₃ (open circles) and observed variables X₁, Xβ‚‚, X₃ (filled circles)]
A plate: repeat the variable n times with the same CPTs
[Figure: the unrolled network C → X₁, Xβ‚‚, …, Xβ‚™ and its plate version, C → Xα΅’ inside a plate of size n]
When all P(Xα΅’ | C) are the same, the n children can be drawn as a single node Xα΅’ inside a plate labeled n
Nested plate notation
E.g., Latent Dirichlet Allocation (LDA)
[Figure: the unrolled LDA network (Ξ±, ΞΈα΅’, Z_ij, W_ij, Ξ²) and its nested-plate version, with the word plate (size n) inside the document plate (size m)]
ΞΈα΅’: topic mixing proportion for document i
Z_ij: topic indicator variable for word j in document i
W_ij: word j in document i
Ξ±: prior for the mixing proportions
Ξ²: topic parameters
P(Ξ±, {ΞΈα΅’}, {Z_ij}, {W_ij}, Ξ²) = P(Ξ±) P(Ξ²) ∏ᡒ P(ΞΈα΅’ | Ξ±) βˆβ±Ό P(Z_ij | ΞΈα΅’) P(W_ij | Z_ij, Ξ²)
Edges with the same color share the same parameters
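A hedged NumPy sketch of the generative process this nested-plate model describes; the vocabulary size, topic count, and Dirichlet hyperparameters below are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and hyperparameters (not from the slides).
num_topics, vocab_size = 3, 20
num_docs, words_per_doc = 4, 10          # m documents, n words each
alpha = np.full(num_topics, 0.5)         # prior for mixing proportions
beta = rng.dirichlet(np.ones(vocab_size), size=num_topics)  # topic parameters

docs = []
for i in range(num_docs):
    theta_i = rng.dirichlet(alpha)                 # theta_i ~ P(theta_i | alpha)
    words = []
    for j in range(words_per_doc):
        z_ij = rng.choice(num_topics, p=theta_i)   # Z_ij ~ P(Z_ij | theta_i)
        w_ij = rng.choice(vocab_size, p=beta[z_ij])  # W_ij ~ P(W_ij | Z_ij, beta)
        words.append(int(w_ij))
    docs.append(words)

print(docs[0])
```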
Non-existence of P-maps for Bayesian Networks
XOR: A = B XOR C
A ⊥ B, ¬(A ⊥ B | C)
B ⊥ C, ¬(A ⊥ C | B)
C ⊥ A, ¬(B ⊥ C | A)
[Figure: a minimal I-map BN over A, B, C]
Minimal I-map, but not a P-map: cannot read B ⊥ C, A ⊥ C, A ⊥ B from it
Swinging couples of variables (X₁, Y₁) and (Xβ‚‚, Yβ‚‚):
¬(X₁ ⊥ Y₁)
¬(Xβ‚‚ ⊥ Yβ‚‚)
X₁ ⊥ Xβ‚‚ | Y₁, Yβ‚‚
Y₁ ⊥ Yβ‚‚ | X₁, Xβ‚‚
No BN P-map, we need a new representation!
[Figure: X₁, Y₁, Xβ‚‚, Yβ‚‚]
Undirected Graphical Models (UGM)
E.g., grid models (image processing, physics)
Each node: a pixel, or an atom
The values of adjacent variables are dependent due to pattern continuity, electromagnetic forces, etc.
The most likely joint configuration corresponds to a "low-energy" state
P(X₁, …, Xβ‚™) = (1/Z) exp( Ξ£_{(i,j)∈E} ΞΈα΅’β±Ό Xα΅’ Xβ±Ό + Ξ£_{i∈V} ΞΈα΅’ Xα΅’ )
1. Read conditional independence
2. Factorize distribution
3. Representation power
[Figure: the families of distributions representable by BNs and MNs within the space of all distributions P]
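A minimal sketch of this energy-based factorization, assuming binary spins Xα΅’ ∈ {−1, +1} and made-up parameters ΞΈ; it computes Z and a joint probability by brute-force enumeration over a tiny 2 x 2 grid, which is only feasible for very small graphs.

```python
import itertools
import numpy as np

# Tiny 2 x 2 grid: nodes 0..3, edges of the grid (illustrative parameters).
nodes = [0, 1, 2, 3]
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
theta_node = {i: 0.1 for i in nodes}            # theta_i
theta_edge = {e: 0.8 for e in edges}            # theta_ij (favors agreement)

def score(x):
    """Unnormalized log-probability: sum_ij theta_ij x_i x_j + sum_i theta_i x_i."""
    s = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    s += sum(theta_node[i] * x[i] for i in nodes)
    return s

# Partition function Z by summing exp(score) over all 2^4 configurations.
configs = list(itertools.product([-1, +1], repeat=len(nodes)))
Z = sum(np.exp(score(x)) for x in configs)

x = (+1, +1, +1, +1)
print("P(x) =", np.exp(score(x)) / Z)   # the all-agreeing state gets high probability
```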
Read conditional independence from UGM
Global Markov independence: A ⊥ B | C
Independence based on separation
[Figure: set C separating A from B]
Local Markov independence: X ⊥ TheRest | {A, B, C, D}
{A, B, C, D} is the Markov blanket of X
[Figure: X with neighbors A, B, C, D]
Global Markov Independence
sep_G(A; B | C): C separates A and B if every path from a node in A to a node in B passes through a node in C
A distribution satisfies the global Markov property if for any disjoint A, B, C such that sep_G(A; B | C), A is independent of B given C:
I(G) = {A ⊥ B | C : sep_G(A; B | C)}
[Figure: set C separating A from B]
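A small sketch of this separation test using networkx (an assumed dependency): delete the conditioning set C and check that no path remains between A and B; the three-node chain used to exercise it is illustrative, not a graph from the slides.

```python
import networkx as nx

def separated(G, A, B, C):
    """sep_G(A; B | C): True if C blocks every path between A and B."""
    H = G.copy()
    H.remove_nodes_from(C)                      # condition on C by deleting it
    return not any(nx.has_path(H, a, b)
                   for a in A for b in B if a in H and b in H)

# Illustrative chain A - C - B.
G = nx.Graph([("A", "C"), ("C", "B")])
print(separated(G, {"A"}, {"B"}, {"C"}))   # True: C separates A and B
print(separated(G, {"A"}, {"B"}, set()))   # False: the path A - C - B is open
```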
Soundness of separation criterion
The independences in I(G) are precisely those that are guaranteed to hold for every MN distribution P over G
In other words, the separation criterion is sound for detecting independence properties in MN distributions over G
In a sense, reading conditional independence from an MN is simpler than from a BN
[Figure: a BN over A, B, C and the corresponding MNs]
BN: C separates A and B in the undirected skeleton, but ¬(A ⊥ B | C)
MN: C separates A and B, and A ⊥ B | C
Local Markov Independence
For each node Xα΅’ ∈ V, there is a unique Markov blanket of Xα΅’, denoted MB_Xα΅’, which is the set of immediate neighbors of Xα΅’ in the graph
The local Markov independences associated with G are:
I_l(G) = {Xα΅’ ⊥ V βˆ– {Xα΅’} βˆ– MB_Xα΅’ | MB_Xα΅’ : ∀i}
In other words, Xα΅’ is independent of the rest given its immediate neighbors
[Figure: X with Markov blanket {A, B, C, D}]
Pairwise Markov independence
Given a graph G = (V, E), the pairwise Markov independences associated with G are:
I_p(G) = {X ⊥ Y | V βˆ– {X, Y} : {X, Y} ∉ E}
E.g., X₁ ⊥ Xβ‚… | {Xβ‚‚, X₃, Xβ‚„}
[Figure: a Markov network over X₁, …, Xβ‚… with no edge between X₁ and Xβ‚…]
In a BN, we would need to check active trails to judge this
E.g., ¬(X₁ ⊥ X₃ | Xβ‚‚)
[Figure: a BN over X₁, Xβ‚‚, X₃]
Markov Blanket example
Note: the local Markov independences in an MN and a BN can be quite different!
MN: [Figure: X with neighbors A, B, C, D]
X ⊥ V βˆ– {X} βˆ– {A, B, C, D} | {A, B, C, D}
BN: [Figure: a BN over X, A, B, C, D, E]
¬(X ⊥ V βˆ– {X} βˆ– {A, B, C, D} | {A, B, C, D})
E.g., ¬(X ⊥ E | {A, B, C, D})
Read conditional independence from UGM
Global Markov independence: A ⊥ B | C
Independence based on separation
[Figure: set C separating A from B]
Edges encode dependence between variables, but no causal relations, and generating samples is more complicated
Local Markov independence: X ⊥ TheRest | {A, B, C, D}
{A, B, C, D} is the Markov blanket of X
[Figure: X with neighbors A, B, C, D]
How to factorize the distribution?
Maximal Cliques
For G = (V, E), a complete subgraph (clique) is a subgraph G′ = (V′ ⊆ V, E′ ⊆ E) such that the nodes in V′ are fully connected
A maximal clique is a complete subgraph such that any strict superset V″ ⊃ V′ is not fully connected
[Figure: graph over A, B, C, D with edges A−B, B−C, A−C, A−D, B−D]
Example:
Maximal cliques = {A, B, C}, {A, B, D}
Sub-cliques = {A}, {B}, {A, B}, … (all edges and singletons)
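A quick sketch using networkx (an assumed dependency) to enumerate the maximal cliques of the example graph above.

```python
import networkx as nx

# The example graph: maximal cliques {A, B, C} and {A, B, D}.
G = nx.Graph([("A", "B"), ("B", "C"), ("A", "C"), ("A", "D"), ("B", "D")])

# find_cliques enumerates the maximal cliques (Bron-Kerbosch).
print([sorted(c) for c in nx.find_cliques(G)])   # e.g. [['A', 'B', 'C'], ['A', 'B', 'D']]
```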
Distribution Factorization in Markov Networks
Given an undirected graph G over variables 𝒳 = {X₁, …, Xβ‚™}
A distribution P factorizes over G if there exist
subsets of variables D₁ ⊆ 𝒳, …, Dβ‚˜ ⊆ 𝒳 (the Dα΅’ are maximal cliques in G)
non-negative potentials (factors/functions) Ψ₁(D₁), …, Ξ¨β‚˜(Dβ‚˜)
such that
P(X₁, Xβ‚‚, …, Xβ‚™) = (1/Z) ∏_{i=1}^{m} Ψα΅’(Dα΅’)
where
Z = Ξ£_{x₁, xβ‚‚, …, xβ‚™} ∏_{i=1}^{m} Ψα΅’(Dα΅’)
Also known as Gibbs distributions, Markov random fields, and undirected graphical models
Interpretation of Potential Functions
[Figure: chain X − Y − Z]
P(X, Y, Z) = (1/Z) Ψ(X, Y) Ψ(Y, Z)
The undirected graph implies X ⊥ Z | Y. This independence statement implies that the joint must factorize as
P(X, Y, Z) = P(Y) P(X, Z | Y) = P(Y) P(X | Y) P(Z | Y)
We can write P(X, Y, Z) = P(X, Y) P(Z | Y) or P(X | Y) P(Z, Y), but we
cannot have all potentials be marginals
cannot have all potentials be conditionals
The clique potentials can be thought of as general
“compatibility”, “goodness” or “happiness” functions over
their variables, but not distributions/conditionals
Another example
P(A, B, C, D) = (1/Z) Ψ₁(A, B, C) Ξ¨β‚‚(A, B, D)
Z = Ξ£_{a,b,c,d} Ψ₁(a, b, c) Ξ¨β‚‚(a, b, d)
[MN figure: graph over A, B, C, D with maximal cliques {A, B, C} and {A, B, D}]
For discrete nodes, we can represent P(A, B, C, D) as two 3-way tables instead of one 4-way table
[BN figure: a BN over A, B, C, D]
BN: P(A, B, C, D) = P(C) P(A | C) P(B | A, C) P(D | A, B)
Two 3-way tables + one 2-way table + one vector
Each table has meaning as a conditional distribution
Conditional Independence in the Problem
World, Data, Reality: the true distribution P contains conditional independence assertions I(P)
BN: local Markov assumptions, I_l(G) ⊆ I(P)
MN: global Markov assumptions, I(G) ⊆ I(P)
I-map: Bayesian Networks
A BN encodes the local Markov assumptions I_l(G)
This BN is an I-map of P ⇒ P factorizes according to the BN:
If the local conditional independences of the BN are a subset of the conditional independences of P, i.e. I_l(G) ⊆ I(P),
then the joint probability P can be written as P(X₁, …, Xβ‚™) = ∏ᡒ P(Xα΅’ | Pa_Xα΅’)
Every P has at least one BN structure G
Read independences of P from the BN structure G:
If the joint probability P can be written as P(X₁, …, Xβ‚™) = ∏ᡒ P(Xα΅’ | Pa_Xα΅’),
then the local conditional independences of the BN are a subset of the conditional independences of P, i.e. I_l(G) ⊆ I(P)
I-map: Markov Networks
An MN encodes the global Markov assumptions I(G)
This MN is an I-map of P ⇒ P factorizes according to the MN:
If the global conditional independences of the MN are a subset of the conditional independences of P, i.e. I(G) ⊆ I(P),
then the joint probability P can be written as P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Dα΅’)
Every P has at least one MN structure G
Read independences of P from the MN structure G:
If the joint probability P can be written as P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Dα΅’),
then the global conditional independences of the MN are a subset of the conditional independences of P, i.e. I(G) ⊆ I(P)
Counter Example
X₁, …, Xβ‚„ are binary, and only eight assignments have positive probability (each with probability 1/8):
(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)
[Figure: the 4-cycle MN X₁ − Xβ‚‚ − X₃ − Xβ‚„ − X₁]
E.g., X₁ ⊥ X₃ | {Xβ‚‚, Xβ‚„}:
π‘‹πŸ π‘‹πŸ’
00
01
10
11
π‘‹πŸ π‘‹πŸ‘
π‘‹πŸ π‘‹πŸ’
00
01
10
11
π‘‹πŸ
π‘‹πŸ π‘‹πŸ’
00
01
10
11
π‘‹πŸ‘
00
½
½
0
0
0
½
1
0
½
0
1
½
½
0
01
0
½
0
½
1
½
0
1
½
1
0
½
½
1
10
½
0
½
0
11
0
0
½
½
P(X₁, X₃ | Xβ‚‚, Xβ‚„) = P(X₁ | Xβ‚‚, Xβ‚„) P(X₃ | Xβ‚‚, Xβ‚„)
But the distribution does not factorize over the graph!
E.g., P(0,0,1,0) = 0, yet a factorization would require 0 = (1/Z) Ψ₁₂(0,0) Ψ₂₃(0,1) Ψ₃₄(1,0) Ψ₁₄(0,0), and each of these four potential values must be nonzero because some positive-probability assignment uses it, a contradiction
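A brute-force sanity check of this counterexample (a sketch in Python; the support set is taken from the slide): it verifies X₁ ⊥ X₃ | {Xβ‚‚, Xβ‚„} numerically and confirms that every pairwise value used by the excluded assignment (0,0,1,0) appears in some positive-probability assignment, which is what rules out a factorization with positive pairwise potentials.

```python
from itertools import product

# The eight positive-probability assignments (x1, x2, x3, x4), each with mass 1/8.
support = {(0,0,0,0), (1,0,0,0), (1,1,0,0), (1,1,1,0),
           (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)}
P = {x: (1/8 if x in support else 0.0) for x in product([0, 1], repeat=4)}

# Check X1 _|_ X3 | {X2, X4}: P(x1, x3 | x2, x4) == P(x1 | x2, x4) * P(x3 | x2, x4).
ok = True
for x2, x4 in product([0, 1], repeat=2):
    pc = sum(p for (a, b, c, d), p in P.items() if (b, d) == (x2, x4))
    for x1, x3 in product([0, 1], repeat=2):
        joint = P[(x1, x2, x3, x4)] / pc
        m1 = sum(p for (a, b, c, d), p in P.items() if (a, b, d) == (x1, x2, x4)) / pc
        m3 = sum(p for (a, b, c, d), p in P.items() if (c, b, d) == (x3, x2, x4)) / pc
        ok &= abs(joint - m1 * m3) < 1e-12
print("X1 _|_ X3 | X2, X4 holds:", ok)          # True

# P(0,0,1,0) = 0, yet every pairwise value it uses occurs in some positive assignment,
# so no strictly positive pairwise potentials Psi_12, Psi_23, Psi_34, Psi_14 can match P.
x = (0, 0, 1, 0)
pairs = {(0, 1): (x[0], x[1]), (1, 2): (x[1], x[2]),
         (2, 3): (x[2], x[3]), (0, 3): (x[0], x[3])}
for (i, j), v in pairs.items():
    seen = any((y[i], y[j]) == v for y in support)
    print(f"pair X{i+1}X{j+1} = {v} appears in the support:", seen)   # all True
```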
Markov Network Representation
If P is strictly positive (for all x, P(X = x) > 0) and the global conditional independences of the MN are a subset of the conditional independences of P, i.e. I(G) ⊆ I(P) (this MN is an I-map of P),
then the joint probability P can be written as P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Dα΅’), i.e., P factorizes according to the MN
Every P has at least one MN structure G
This is known as the Hammersley–Clifford theorem
Minimal I-maps and Markov networks
A fully connected graph is an I-map for any distribution
Remember minimal I-maps: deleting any edge makes it no longer an I-map
For a Bayesian network, there is no unique minimal I-map
For strictly positive distributions and Markov networks, the minimal I-map is unique!
Many ways to find the minimal I-map, e.g.:
Take each pairwise Markov assumption; if P does not entail it, add the edge
How about a perfect map?
Perfect maps: the independences in the graph are exactly the same as those in P
For Bayesian networks, a perfect map does not always exist
Counterexample: swinging couples of variables
How about for Markov networks?
Counterexample: the v-structure
[BN figure: A → S ← F]
A ⊥ F, but ¬(A ⊥ F | S)
[MN figure: the minimal I-map MN over A, F, S]
Minimal I-map MN, but not a P-map: it cannot encode A ⊥ F
Pairwise Markov Networks
All factors are over single variables or pairs of variables:
Node potentials Ψα΅’(Xα΅’)
Edge potentials Ψα΅’β±Ό(Xα΅’, Xβ±Ό)
Factorization: P(X) = (1/Z) ∏_{i∈V} Ψα΅’(Xα΅’) ∏_{(i,j)∈E} Ψα΅’β±Ό(Xα΅’, Xβ±Ό)
Note that there may be bigger cliques in the graph, but we only consider pairwise potentials
[Figure: graph over A, B, C, D]
An example
Maximal clique specification:
P(A, B, C, D) = (1/Z) Ψ₁(A, B, C) Ξ¨β‚‚(A, B, D)
Z = Ξ£_{a,b,c,d} Ψ₁(a, b, c) Ξ¨β‚‚(a, b, d)
Uses two 3-way tables
Pairwise Markov network specification:
P′(A, B, C, D) = (1/Z) Ψ(A, C) Ψ(B, C) Ψ(A, B) Ψ(A, D) Ψ(B, D)
Uses five 2-way tables
[Figure: graph over A, B, C, D with edges A−B, A−C, B−C, A−D, B−D]
What is the relation between I(P) and I(P′)?
I(fully connected graph) ⊆ I(P) ⊆ I(P′) ⊆ I(fully disconnected graph)
Applications of Pairwise Markov Networks
Image segmentation: separate foreground (fg) from background (bg)
Graph structure: grid with one node per pixel; pairwise Markov network
Node potential: background color vs. foreground color
Ψ(Xα΅’ = fg, mα΅’) = exp(−||mα΅’ − ΞΌ_fg||Β²)
Ψ(Xα΅’ = bg, mα΅’) = exp(−||mα΅’ − ΞΌ_bg||Β²)
Edge potential: neighbors likely have the same label
Ψ(Xα΅’, Xβ±Ό):     Xβ±Ό = fg   Xβ±Ό = bg
Xα΅’ = fg          10         1
Xα΅’ = bg           1        10
P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Xα΅’) ∏_{ij} Ψ(Xα΅’, Xβ±Ό)
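A hedged sketch of these potentials on a toy one-dimensional strip of pixels (grayscale intensities instead of color vectors, and made-up means ΞΌ_fg, ΞΌ_bg); it scores the unnormalized probability of two candidate labelings.

```python
import numpy as np

# Toy observed intensities m_i for a 1-D strip of 4 pixels (illustrative).
m = np.array([0.9, 0.8, 0.2, 0.1])
mu = {"fg": 1.0, "bg": 0.0}                 # assumed class means

# Node potentials: Psi(X_i = c, m_i) = exp(-||m_i - mu_c||^2)
def node_pot(c, mi):
    return np.exp(-(mi - mu[c]) ** 2)

# Edge potentials from the slide's table: 10 if neighbors agree, 1 otherwise.
def edge_pot(ci, cj):
    return 10.0 if ci == cj else 1.0

def unnormalized_prob(labels):
    """Product of node potentials and edge potentials along the strip."""
    p = np.prod([node_pot(c, mi) for c, mi in zip(labels, m)])
    p *= np.prod([edge_pot(labels[i], labels[i + 1]) for i in range(len(labels) - 1)])
    return p

print(unnormalized_prob(["fg", "fg", "bg", "bg"]))   # smooth labeling: high score
print(unnormalized_prob(["fg", "bg", "fg", "bg"]))   # alternating labeling: low score
```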
Exponential Form
Standard model:
P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Dα΅’)
Assuming strictly positive potentials:
P(X₁, …, Xβ‚™) = (1/Z) ∏ᡒ Ψ(Dα΅’)
            = (1/Z) ∏ᡒ exp(log Ψ(Dα΅’))
            = (1/Z) exp(Σᡒ log Ψ(Dα΅’))
            = (1/Z) exp(Σᡒ Ξ¦(Dα΅’))
We can maintain tables Ξ¦(Dα΅’) (which can have negative entries) rather than tables Ψ(Dα΅’) (strictly positive entries)
Exponential Form—Log-linear Models
Features are functions f(D) over a subset of variables D
Log-linear model over a Markov network G:
A set of features f₁(D₁), …, f_k(D_k)
Each Dα΅’ is a subset of a clique in G
E.g., pairwise model: Dα΅’ = {Xα΅’, Xβ±Ό}
Two features can be over the same variables; it is OK to have Dα΅’ = Dβ±Ό
A set of weights w₁, …, w_k, usually learned from data
P(X₁, …, Xβ‚™) = (1/Z) exp( Ξ£_{i=1}^{k} wα΅’ fα΅’(Dα΅’) )
Factor Graph
[Figure: graph over A, B, C, D]
Maximal clique specification: P(A, B, C, D) = (1/Z) Ψ₁(A, B, C) Ξ¨β‚‚(A, B, D)
Pairwise Markov network: P′(A, B, C, D) = (1/Z) Ψ(A, C) Ψ(B, C) Ψ(A, B) Ψ(A, D) Ψ(B, D)
We cannot look at the graph and tell which potentials are being used
A factor graph makes this explicit in graphical form
Factor Graph
Makes the factor dependencies explicit; useful for later inference
[Figure: graph over A, B, C, D]
Bipartite graph:
Variable nodes (circles) for X₁, …, Xβ‚™
Factor nodes (squares) for Ψ₁, …, Ξ¨β‚˜
Edge Xα΅’ − Ξ¨β±Ό if Xα΅’ ∈ Dβ±Ό (the scope of Ξ¨β±Ό(Dβ±Ό))
[Factor graph: A, B, C, D with factor Ψ₁ connected to A, B, C and factor Ξ¨β‚‚ connected to A, B, D]
P = (1/Z) Ψ₁(A, B, C) Ξ¨β‚‚(A, B, D)
[Factor graph: A, B, C, D with pairwise factors Ψ₁(A, B), Ξ¨β‚‚(A, C), Ψ₃(B, C), Ξ¨β‚„(A, D), Ξ¨β‚…(B, D)]
P′ = (1/Z) Ψ₁(A, B) Ξ¨β‚‚(A, C) Ψ₃(B, C) Ξ¨β‚„(A, D) Ξ¨β‚…(B, D)
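A small sketch (assuming networkx) that builds the pairwise factor graph above as an explicit bipartite graph, with variable nodes and factor nodes distinguished by a node attribute.

```python
import networkx as nx

# Factors of the pairwise specification: name -> scope D_j.
factors = {"Psi1": ("A", "B"), "Psi2": ("A", "C"), "Psi3": ("B", "C"),
           "Psi4": ("A", "D"), "Psi5": ("B", "D")}

FG = nx.Graph()
FG.add_nodes_from("ABCD", kind="variable")           # circles
FG.add_nodes_from(factors, kind="factor")            # squares
for name, scope in factors.items():
    FG.add_edges_from((name, v) for v in scope)      # edge X_i - Psi_j if X_i in D_j

# Neighbors of a factor node recover its scope; neighbors of a variable
# node list the factors it participates in.
print(sorted(FG.neighbors("Psi4")))   # ['A', 'D']
print(sorted(FG.neighbors("A")))      # ['Psi1', 'Psi2', 'Psi4']
```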
Conditional Random Fields
Focus on the conditional distribution P(Y₁, …, Yβ‚™ | X₁, …, Xβ‚™, X)
Do not explicitly model dependence among X₁, …, Xβ‚™, X
Only model the relations between X and Y and between Y and Y
[Figure: chain Y₁ − Yβ‚‚ − Y₃ with inputs X₁, Xβ‚‚, X₃ and a global input X]
P(Y₁, Yβ‚‚, Y₃ | X₁, Xβ‚‚, X₃, X) = (1 / Z(X₁, Xβ‚‚, X₃, X)) Ψ(Y₁, Yβ‚‚, X₁, Xβ‚‚, X) Ψ(Yβ‚‚, Y₃, Xβ‚‚, X₃, X)
Z(X₁, Xβ‚‚, X₃, X) = Ξ£_{y₁ yβ‚‚ y₃} Ψ(y₁, yβ‚‚, X₁, Xβ‚‚, X) Ψ(yβ‚‚, y₃, Xβ‚‚, X₃, X)
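A brief sketch of the key point above, namely that the normalizer Z(X) sums over the label variables Y only while the observed inputs stay fixed; the potential function and input values are made-up illustrations.

```python
from itertools import product
import numpy as np

# Observed inputs (illustrative values; they are held fixed, never summed over).
x1, x2, x3, x_global = 0.3, 0.7, 0.9, 1.0

def psi(y_a, y_b, x_a, x_b, xg):
    """A made-up pairwise CRF potential: rewards y_a == y_b and labels matching large x."""
    agree = 2.0 if y_a == y_b else 1.0
    fit = np.exp(y_a * x_a + y_b * x_b + 0.1 * xg)
    return agree * fit

def unnormalized(y1, y2, y3):
    return psi(y1, y2, x1, x2, x_global) * psi(y2, y3, x2, x3, x_global)

# Z(X): sum over label configurations only, with the inputs fixed.
Z = sum(unnormalized(y1, y2, y3) for y1, y2, y3 in product([0, 1], repeat=3))

cond = {y: unnormalized(*y) / Z for y in product([0, 1], repeat=3)}
print(cond[(1, 1, 1)])            # P(Y = (1,1,1) | X)
print(sum(cond.values()))         # sums to 1.0 over Y, as a conditional should
```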