On Expressiveness of the AMP Chain Graph Interpretation
Dag Sonntag
ADIT, IDA, Linköping University, Sweden
dag.sonntag@liu.se
Abstract. In this paper we study the expressiveness of the Andersson-Madigan-Perlman interpretation of chain graphs. It is well known that all independence models that can be represented by Bayesian networks can also be perfectly represented by chain graphs of the Andersson-Madigan-Perlman interpretation, but it has so far not been studied how much more expressive this second class of models is. In this paper we calculate the exact number of representable independence models for the two classes, and the ratio between them, for up to five nodes. For more than five nodes the explosive growth of chain graph models does, however, make such enumeration infeasible. Hence we instead present, and prove the correctness of, a Markov chain Monte Carlo approach for sampling chain graph models uniformly for the Andersson-Madigan-Perlman interpretation. This allows us to approximate the ratio between the numbers of independence models representable by the two classes, as well as the average number of chain graphs per chain graph model, for up to 20 nodes. The results show that the ratio between the numbers of representable independence models for the two classes grows exponentially as the number of nodes increases. This indicates that only a very small fraction of all independence models representable by chain graphs of the Andersson-Madigan-Perlman interpretation can also be represented by Bayesian networks.
Keywords: Chain graphs, Andersson-Madigan-Perlman interpretation,
MCMC sampling.
1 Introduction
Chain graphs (CGs) are a class of probabilistic graphical models (PGMs) that can contain two types of edges, representing symmetric resp. asymmetric relationships between the variables in question. This allows CGs to represent more independence models than, for example, Bayesian networks (BNs), and thereby represent systems more accurately than this less expressive class of models. Today there do, however, exist several different ways of interpreting CGs and the conditional independences they encode, giving rise to different so-called CG interpretations. The most researched CG interpretations are the Lauritzen-Wermuth-Frydenberg (LWF) interpretation [6, 9], the Andersson-Madigan-Perlman (AMP) interpretation [1] and the multivariate regression (MVR) interpretation [3, 4]. Each interpretation has its own way of determining conditional independences in a CG and it can be noted that no interpretation subsumes another in terms of representable independence models [5, 15].
Although it is well known that any independence model that can be represented by BNs can also be represented by CGs, and that the opposite does not hold, it is unclear how much more expressive the latter class of models is. This question has previously been studied for CGs of the LWF interpretation (LWF CGs) and the MVR interpretation (MVR CGs) [11, 16]. The results from these studies show that the ratio of the number of BN models (independence models representable by BNs) to the number of CG models (independence models representable by CGs) decreases exponentially as the number of nodes in the graphs increases. In this article we carry out a similar study for the AMP CG interpretation and investigate whether the previous results also hold for this interpretation of CGs.
Measuring the ratio of representable independence models should in principle be easy to do. We only have to enumerate every independence model representable by CGs of the AMP interpretation (AMP CGs) and check whether it can also be perfectly represented by BNs. The problem here is that the number of AMP CG models grows superexponentially as the number of nodes in the graphs increases, and it is infeasible to enumerate all of them for more than five nodes. Hence we instead use a Markov chain Monte Carlo (MCMC) approach that has previously been shown to be successful when studying BN models, LWF CG models and MVR CG models [11, 16]. The approach consists of creating a Markov chain whose states are all possible AMP CG models with a certain number of nodes and whose stationary distribution is the uniform distribution over these models. This then allows us to sample AMP CG models from approximately the uniform distribution over all AMP CG models by transitioning through the Markov chain. With such a set of models we can thereafter approximate the ratio of BN models to AMP CG models for up to 20 nodes. Moreover, since there exists an equation for calculating the exact number of AMP CGs for a set number of nodes [17], we can also estimate the average number of AMP CGs per AMP CG model. Finally we also make the AMP CG models publicly available online, and it is our intent that they can be used for further studies of AMP CG models and to evaluate, for example, learning algorithms to get more accurate evaluations than what is achieved with the randomly generated CGs used today [10, 12].
The rest of the article is organised as follows. In the next section we will
cover the definitions and notations used in this article. This is then followed by
Section 3 where we discuss the theory of the MCMC sampling algorithm used
and also prove that its unique stationary distribution is the uniform distribution
of all AMP CG models. In Section 4 we then discuss the results found and in
Section 5 we give a short conclusion.
2 Notation
All graphs are defined over a finite set of variables V . If a graph G contains an
edge between two nodes V1 and V2 , we denote with V1 →V2 a directed edge and
with V1 −V2 an undirected edge. A set of nodes is said to be complete if there
exist edges between all pairs of nodes in the set.
The parents of a set of nodes X of G is the set paG (X) = {V1 ∣V1 →V2 is in G,
V1 ∉ X and V2 ∈ X}. The children of X is the set chG (X) = {V1 ∣V2 →V1 is in G,
V1 ∉ X and V2 ∈ X}. The neighbours of X is the set nbG (X) = {V1 ∣V1 −V2 is in G,
V1 ∉ X and V2 ∈ X}. The boundary of X is the set bdG (X) = paG (X) ∪ nbG (X).
The adjacents of X is the set adG (X) = {V1 ∣V1 →V2 ,V1 ←V2 or V1 −V2 is in G,
V1 ∉ X and V2 ∈ X}.
A route from a node V1 to a node Vn in G is a sequence of nodes V1 , . . . , Vn
such that Vi ∈ adG (Vi+1 ) for all 1 ≤ i < n. A path is a route containing only
distinct nodes. The length of a path is the number of edges in the path. A path
is called a cycle if Vn = V1 . A path is descending if Vi ∈ paG (Vi+1 ) ∪ nbG (Vi+1 ) for
all 1 ≤ i < n. The descendants of a set of nodes X of G is the set deG (X) = {Vn ∣
there is a descending path from V1 to Vn in G, V1 ∈ X and Vn ∉ X}. A path
is strictly descending if Vi ∈ paG (Vi+1 ) for all 1 ≤ i < n. The strict descendants
of a set of nodes X of G is the set sdeG (X) = {Vn ∣ there is a strict descending
path from V1 to Vn in G, V1 ∈ X and Vn ∉ X}. The ancestors (resp. strict
ancestors) of X is the set anG (X) = {V1 ∣Vn ∈ deG (V1 ), V1 ∉ X, Vn ∈ X} (resp.
sanG (X) = {V1 ∣Vn ∈ sdeG (V1 ), V1 ∉ X, Vn ∈ X}). A cycle is called a semi-directed
cycle if it is descending and Vi →Vi+1 is in G for some 1 ≤ i < n.
A Bayesian network (BN) is a graph containing only directed edges and no semi-directed cycles, while a Markov network (MN) is a graph containing only undirected edges. A CG under the Andersson-Madigan-Perlman (AMP) interpretation, denoted AMP CG, contains only directed and undirected edges but no semi-directed cycles. Note that in this article we only study the AMP CG interpretation, but that we use the term CG when the results or notions generalize to all CG interpretations. A connectivity component C of an AMP
CG is a maximal (wrt set inclusion) set of nodes such that there exists a path
between every pair of nodes in C containing only undirected edges. A subgraph
of G is a subset of nodes and edges in G. A subgraph of G induced by a set of
its nodes X is the graph over X that has all and only the edges in G whose both
ends are in X.
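To make these set-valued definitions concrete, the following minimal Python sketch (our own encoding, not from the paper) stores a graph as a set of directed edges and a set of undirected edges and transcribes the pa, ch, nb, bd, ad, de and sde queries literally:

```python
# A minimal sketch of the definitions above (our own encoding, not from
# the paper). A graph is a set of directed edges (a, b), meaning a -> b,
# plus a set of undirected edges {a, b}, meaning a - b.

class CG:
    def __init__(self, directed=(), undirected=()):
        self.d = set(directed)
        self.u = {frozenset(e) for e in undirected}

    def pa(self, X):  # parents: {V1 | V1 -> V2 in G, V1 not in X, V2 in X}
        return {a for (a, b) in self.d if b in X and a not in X}

    def ch(self, X):  # children: {V1 | V2 -> V1 in G, V1 not in X, V2 in X}
        return {b for (a, b) in self.d if a in X and b not in X}

    def nb(self, X):  # neighbours: {V1 | V1 - V2 in G, V1 not in X, V2 in X}
        return {v for e in self.u if e & X for v in e if v not in X}

    def bd(self, X):  # boundary: parents together with neighbours
        return self.pa(X) | self.nb(X)

    def ad(self, X):  # adjacents: linked to X by an edge of either kind
        return self.pa(X) | self.ch(X) | self.nb(X)

    def de(self, X):  # descendants: reachable by a descending path (->, -)
        X = set(X); reach = set(X); frontier = set(X)
        while frontier:
            frontier = (self.ch(frontier) | self.nb(frontier)) - reach
            reach |= frontier
        return reach - X

    def sde(self, X):  # strict descendants: reachable by -> edges only
        X = set(X); reach = set(X); frontier = set(X)
        while frontier:
            frontier = self.ch(frontier) - reach
            reach |= frontier
        return reach - X
```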
To illustrate these concepts we can study the AMP CG G with five nodes
shown in Fig. 1a. In the graph we can for example see that the only child of
A is B and that D is a neighbour of E. D is also a strict descendant of A due
to the strictly descending path A→B→D, while E is not. E is however in the
descendants of A together with B and D. A is therefore an ancestor of all nodes
except itself and C. We can also see that G contains no semi-directed cycles since
it contains no cycle at all. Moreover we can see that G contains four connectivity
components: {A}, {B}, {C} and {D, E}. In Fig. 1b we can see a subgraph of G
with the nodes B, D and E. This is also the induced subgraph of G with these
nodes since it contains all edges between the nodes B, D and E in G.
Fig. 1: Three different AMP CGs. (a) An AMP CG G. (b) The subgraph of G induced by {B, D, E}. (c) A LDG H.
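Using the encoding sketched above, the graph G of Fig. 1a can be written down and the claims of the previous paragraph verified mechanically (a usage sketch):

```python
# Fig. 1a: the AMP CG G with edges A -> B, B -> D, C -> E and D - E.
G = CG(directed={('A', 'B'), ('B', 'D'), ('C', 'E')},
       undirected={('D', 'E')})

assert G.ch({'A'}) == {'B'}             # the only child of A is B
assert G.nb({'E'}) == {'D'}             # D is a neighbour of E
assert 'D' in G.sde({'A'})              # strict descendant via A -> B -> D
assert 'E' not in G.sde({'A'})          # E is not a strict descendant of A
assert G.de({'A'}) == {'B', 'D', 'E'}   # but E is a descendant, via D - E
```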
Let X, Y and Z denote three disjoint subsets of V . We say that X is
conditionally independent from Y given Z if the value of X does not influence the value of Y when the values of the variables in Z are known, i.e.
pr(X, Y ∣Z) = pr(X∣Z)pr(Y ∣Z) holds and pr(Z) > 0. We say that X is separated from Y given Z, denoted X⊥G Y ∣Z, in an AMP CG, BN or MN G iff there exists no Z-open path between X and Y . A path is said to be Z-open iff every non-head-no-tail node on the path is not in Z and every head-no-tail node on the path is in Z or sanG (Z). A node B is said to be a head-no-tail in an AMP CG, BN or MN G between two nodes A and C on a path if one of the following configurations exists in G: A→B←C, A→B−C or A−B←C. Moreover G is also said to contain a triplex ({A, C}, B) iff one such configuration exists in G and A and C are not adjacent in G. A triplex ({A, C}, B) is said to be a flag (resp. a collider ) in an AMP CG or BN G iff G contains one of the following subgraphs induced by A, B and C: A→B−C or A−B←C (resp. A→B←C). If an AMP CG G is said
to contain a biflag we mean that G contains the induced subgraph A→B−C←D
where A and D might be adjacent, for four nodes A, B, C and D. Given a graph
G we mean with G ∪ {A→B} the graph H with the same structure as G but
where H also contains the directed edge A→B in addition to any other edges in
G. Similarly we mean with G ∖ {A→B} the graph H with the same structure as
G but where H does not contain the directed edge A→B. Note that if G did not
contain the directed edge A→B then H is not a valid graph.
To illustrate these concepts we can once again look at the graph G shown
in Fig. 1a. G contains no colliders but two flags, B→D−E and C→E−D, that
together form the biflag B→D−E←C. This means that B⊥G E holds but that B⊥G E∣D does not hold since D is a head-no-tail node on the path B→D−E.
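The head-no-tail and triplex definitions are likewise mechanical to check. The helper below is our own sketch on top of the CG class above; it enumerates all triplexes of a graph and recovers exactly the two flags that make up the biflag of G:

```python
def triplexes(g, nodes):
    """All triplexes ({A, C}, B) of g: A and C both meet B via -> or -,
    at least one of the two edges points at B (the head-no-tail
    configurations), and A and C are non-adjacent."""
    out = set()
    for b in nodes:
        ends = sorted(g.pa({b}) | g.nb({b}))   # nodes meeting b via -> or -
        for i, a in enumerate(ends):
            for c in ends[i + 1:]:             # each unordered pair once
                arrow_at_b = (a, b) in g.d or (c, b) in g.d
                if arrow_at_b and a not in g.ad({c}):
                    out.add((frozenset({a, c}), b))
    return out

# G of Fig. 1a: no colliders, two flags B -> D - E and C -> E - D.
assert triplexes(G, 'ABCDE') == {(frozenset({'B', 'E'}), 'D'),
                                 (frozenset({'C', 'D'}), 'E')}
```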
The independence model M induced by a graph G, denoted as I(G), is the
set of separation statements X⊥G Y ∣Z that hold in G. We say that two graphs G
and H are Markov equivalent or that they are in the same Markov equivalence
class iff I(G) = I(H). Moreover we say that G and H belong to the same strong
Markov equivalent class iff I(G) = I(H) and G and H contain the same flags.
By saying that an independence model is perfectly represented in a graph G we
mean that I(G) contains all and only the independences in the independence
model.
An AMP CG model (resp. BN model or MN model) is an independence model
representable by an AMP CG (resp. BN or MN). AMP CG models do today have
two possible unique graphical representations, largest deflagged graphs (LDGs)
[14] and AMP essential graphs [2]. In this article we will only use LDGs. A
LDG is the AMP CG of a Markov equivalence class that has the minimum number of flags and at the same time contains the maximum number of undirected edges for that strong Markov equivalence class.
For AMP CGs there exists a set of operations that allows for changing the
structure of edges within an AMP CG without altering the Markov equivalence
class it belongs to. In this article we use the feasible split operation [15] and
the legal merging [14] operation. A split is said to be feasible for a connectivity
component C of an AMP CG G iff it can be divided into two disjoint sets U
and L such that U ∪ L = C and if replacing all undirected edges between U and
L with directed edges oriented towards L results in an AMP CG H such that I(G) = I(H). It has been shown [15] that if the following conditions hold in G then such a split is feasible: (1) ∀A ∈ nbG (L) ∩ U, L ⊆ nbG (A), (2) nbG (L) ∩ U is complete and (3) ∀B ∈ L, paG (nbG (L) ∩ U ) ⊆ paG (B).
A merging is, on the other hand, said to be legal for two connectivity components U and L in an AMP CG G, such that U ∩ paG (L) ≠ ∅, iff replacing all directed edges between U and L with undirected edges results in an AMP CG H such that G and H belong to the same strong Markov equivalence class. It has been shown [14] that if the following conditions hold in G then such a merging is possible: (1) paG (L) ∩ U is complete in G, (2) ∀B ∈ paG (L) ∩ U, paG (L) ∖ U = paG (B) and (3) ∀A ∈ L, paG (L) = paG (A). Note that a legal merging is not the reverse operator of a feasible split since a feasible split handles Markov equivalence classes, while a legal merging handles strong Markov equivalence classes.
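Both sets of conditions can be transcribed line by line. The sketch below is our own transcription on top of the CG class above, checked against the graphs G (Fig. 1a) and H (Fig. 1c) discussed next:

```python
def complete(g, S):
    """Every pair of nodes in S is connected by some edge."""
    return all(b in g.ad({a}) for a in S for b in S if a != b)

def split_is_feasible(g, U, L):
    """Conditions (1)-(3) for a feasible split of a component C = U + L [15]."""
    W = g.nb(L) & U
    return (all(L <= g.nb({a}) for a in W)             # (1)
            and complete(g, W)                         # (2)
            and all(g.pa(W) <= g.pa({b}) for b in L))  # (3)

def merging_is_legal(g, U, L):
    """Conditions (1)-(3) for a legal merging of components U and L [14]."""
    W = g.pa(L) & U
    return (complete(g, W)                                 # (1)
            and all(g.pa(L) - U == g.pa({b}) for b in W)   # (2)
            and all(g.pa(L) == g.pa({a}) for a in L))      # (3)

# Fig. 1c: H with A - B, B -> D, C -> E and D - E.
H = CG(directed={('B', 'D'), ('C', 'E')},
       undirected={('A', 'B'), ('D', 'E')})
assert split_is_feasible(H, U={'A'}, L={'B'})   # {A, B} splits into A -> B
assert split_is_feasible(H, U={'B'}, L={'A'})   # ... or into B -> A
assert merging_is_legal(G, U={'A'}, L={'B'})    # merging {A}, {B} in Fig. 1a
```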
If we once again look at the CGs in Fig. 1 we can see that G, shown in
Fig. 1a, and H, shown in Fig. 1c, are Markov equivalent since I(G) = I(H).
Moreover we can note that G and H must belong to the same strong Markov
equivalence class since they contain the same flags. This means that G cannot
be a LDG since H is larger than G. In G there exists no feasible split, but one legal merging: the merging of the connectivity components {A} and {B}, which results in a CG with the same structure as H. In H there exists, on the other hand, no legal merging but one feasible split: the split of the connectivity component {A, B} into either A→B or B→A. We can finally note that this feasible split would result in no additional flags and hence, since no legal mergings are possible in H, that H must be a LDG.
3 Markov Chain Monte Carlo Approach
In this section we cover the theory behind the MCMC approach used in this
article for sampling LDGs from the uniform distribution. Note that we only
cover the theory of the MCMC sampling method very briefly and for a more
complete introduction of the sampling method we instead refer the reader to the
work by Häggström [7].
The MCMC sampling approach consists of creating a Markov chain whose unique stationary distribution is the desired distribution and then sampling this Markov chain after a number of transitions. The Markov chain is defined by a set of operators that allow us to transition from one state to another. It can then be shown that if these operators have certain properties and the number of transitions goes to infinity then all states have the same probability of being sampled [7]. Moreover, in practice it has also been shown that the number of transitions can be relatively small while still achieving a good approximation of the uniform distribution.
In our case the possible states of the Markov chain are all possible independence models representable by AMP CGs, i.e. AMP CG models, represented by LDGs. The operators then add and remove certain edges in these LDGs, allowing the MCMC method to transition between all possible LDGs. For the stationary distribution to be the unique uniform distribution over all possible LDGs we only have to prove that the operators fulfill the following properties [7]: aperiodicity, irreducibility and reversibility. Aperiodicity, i.e. that the Markov chain does not return to the same state periodically, and irreducibility, i.e. that any LDG can be reached from any other LDG using only the defined operators, prove that the Markov chain has a unique distribution. Reversibility then proves that this distribution also is the stationary distribution.
The operators used are defined in Definition 1 and it follows from Lemmas 1, 2 and 3 that they fulfill the properties described above for LDGs with at least
two nodes.
Definition 1. Markov Chain Operators
Choose uniformly and perform one of the following six operators to transition
from a LDG G to the next LDG H in the Markov chain.
– Add directed edge. Choose two nodes X, Y in G uniformly and with replacement. If X ∉ adG (Y ) and G ∪ {X→Y } is a LDG let H = G ∪ {X→Y },
otherwise let H = G.
– Remove directed edge. Choose two nodes X, Y in G uniformly and with
replacement. If X→Y is in G and G∖{X→Y } is a LDG let H = G∖{X→Y },
otherwise let H = G.
– Add undirected edge. Choose two nodes X, Y in G uniformly and with
replacement. If X ∉ adG (Y ) and G ∪ {X−Y } is a LDG let H = G ∪ {X−Y },
otherwise let H = G.
– Remove undirected edge. Choose two nodes X, Y in G uniformly and with
replacement. If X−Y is in G and G ∖ {X−Y } is a LDG let H = G ∖ {X−Y },
otherwise let H = G.
– Add two directed edges. Choose four nodes X, Y, Z, W in G uniformly and
with replacement. If X ∉ adG (Y ), Z ∉ adG (W ) and G ∪ {X→Y, Z→W } is a
LDG let H = G ∪ {X→Y, Z→W }, otherwise let H = G. Note that Y might be
equal to W in this operation.
– Remove two directed edges. Choose four nodes X, Y, Z, W in G uniformly
and with replacement. If X→Y and Z→W are in G and G ∖ {X→Y, Z→W }
is a LDG let H = G∖{X→Y, Z→W }, otherwise let H = G. Note that Y might
be equal to W in this operation.
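A transition step of this Markov chain is then only a few lines of code. In the sketch below, is_ldg is an assumed predicate that we leave abstract: it should test that a graph is an AMP CG that equals the LDG of its Markov equivalence class (Section 4 describes how this test was implemented via the construction of Roverato and Studený [14]):

```python
import random

def transition(g, nodes, is_ldg):
    """One step of the Markov chain of Definition 1: pick one of the six
    operators uniformly, apply it, and keep the result only if it is a LDG.
    `nodes` is a list of node labels; `is_ldg` is the assumed predicate."""
    op = random.randrange(6)
    d, u = set(g.d), set(g.u)
    x, y = random.choice(nodes), random.choice(nodes)   # with replacement
    if op == 0 and x != y and x not in g.ad({y}):       # add directed edge
        d.add((x, y))
    elif op == 1 and (x, y) in d:                       # remove directed edge
        d.remove((x, y))
    elif op == 2 and x != y and x not in g.ad({y}):     # add undirected edge
        u.add(frozenset({x, y}))
    elif op == 3 and frozenset({x, y}) in u:            # remove undirected edge
        u.remove(frozenset({x, y}))
    elif op in (4, 5):                                  # two directed edges
        z, w = random.choice(nodes), random.choice(nodes)
        pair = {(x, y), (z, w)}
        if (op == 4 and x != y and z != w
                and x not in g.ad({y}) and z not in g.ad({w})):
            d |= pair
        elif op == 5 and pair <= d:
            d -= pair
        else:
            return g                                    # operator not applicable
    else:
        return g                                        # operator not applicable
    h = CG(d, u)
    return h if is_ldg(h) else g
```

Note that whenever a chosen operator is not applicable the chain stays in the same state, which is precisely the self-loop behaviour used for aperiodicity in Lemma 1 below.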
Lemma 1. The operators in Definition 1 fulfill the aperiodicity property when
G contains at least two nodes.
Proof. Here it is enough to show that there exists at least one operator such that
H is equal to G, and at least one operator such that it is not, for any possible
G with at least two nodes. The latter follows directly from Lemma 3 since there exists more than one possible state when G contains at least two nodes. To see
that the former must hold note that if the add directed edge operation results in
a LDG for some nodes X and Y in a LDG G, then clearly the remove directed
edge X→Y operation must result in a LDG H equal to G for that G. ◻
Lemma 2. The operators in Definition 1 fulfill the reversibility property for the
uniform distribution.
Proof. Since the desired distribution for the Markov chain is the uniform distribution proving that reversibility holds for the operators simplifies to proving
that symmetry holds for them. This means that we need to show that for any
LDG G the probability to transition to any LDG H is equal to the probability to
transition from H to G. Here it is simple to see that each operator has a reverse
operator such as remove directed edge for add directed edge etc. and that the
“forward” operator and reverse operator are chosen with equal probability for a
certain set of nodes. Moreover we can also see that if H is G with one operator
performed upon it, such that H ≠ G, then clearly there can exist no other operator that transitions G to H. Hence the operators fulfill the symmetry property
and thereby the reversibility property. ◻
Lemma 3. Given a LDG G any other LDG H can be reached using the operators described in Definition 1 such that all intermediate graphs are LDGs.
Proof. Here it is enough to prove that any LDG H can be reached from the
empty graph G∅ since we, due to the reversibility property, then also know that
G∅ can be reached from H (or G). A procedure to reach H from G∅ is detailed
in Algorithm 1 and its correctness follows from Lemma 4. ◻
Lemma 4. Algorithm 1 is correct, i.e. it defines a valid sequence of operations by which any LDG H can be reached from the empty graph such that all intermediate graphs are LDGs.
Proof. To prove this lemma we need to show that all intermediate graphs are LDGs and that G = H when the algorithm terminates. The proof is divided into three separate parts. In the first part we will show that we can add one connectivity component C at a time as long as the subgraphs of G and H induced by anH (C) have the same structure (lines 1 to 3).
Given a LDG H the following algorithm constructs G from the empty graph using the operators in Definition 1 such that G = H when the algorithm terminates and all intermediate graphs are LDGs.

1   Let G be the empty graph.
2   Repeat until G = H:
3     Let C be a connectivity component in H such that the subgraphs of G and H induced by anH (C) have the same structure.
4     If all nodes in C have the same set of parents:
5       Repeat for all nodes X ∈ C:
6         Add the parents of X in H to X in G.¹
7       For any two nodes X and Y in C st X−Y is in H but not in G, add X−Y to G.
8     Else (when all nodes in C do not have the same set of parents):
9       Let H ′ be the maximally oriented AMP CG of H and C 1 the subcomponent with lowest order.
10      Let A and B be two empty sets of nodes. Let the order of a node be the size of A when it was added to A if it is in A. If it is in B, but not in A, let the order be ∣C∣ plus the size of A when it was added to B. If it is in neither set, let its order be larger than that of any node in A or B.
11      If H ′ contains an induced subgraph of the form P →X−Y ←Q then:
12        Add first X−Y and then both P →X and Q→Y at once to G. Add both X and Y to A.
13      Else we know that H ′ must contain an induced subgraph of the form Y −X−Z st ∃P ∈ paH (X) and P ∉ paH (Z):
14        Add Y −X, X−Z and P →X to G sequentially. Add X to A and Y and Z to B.
15      Repeat until A = C 1 :
16        If there exist two nodes X ∈ A and Y ∈ nbH ′ (X) st either ∃P ∈ paH (X) ∖ paH (Y ) or ∃Z ∈ nbH ′ (Y ) ∖ nbH ′ (X) holds, then choose X and Y in this way, but also choose Y (and the corresponding X) so that Y has the highest possible order:
17          If ∃P ∈ paH (X) ∖ paH (Y ): Add P →X to G if it is not in G, then add X−Y to G if it is not in G. Add Y to A.
18          Else (when ∃Z ∈ nbH ′ (Y ) ∖ nbH ′ (X)): Add X−Y and Y −Z to G (sequentially) if they are not already in G. Add Y to A and Z to B if they do not already belong to these sets.
19      For each node ai ∈ A and parent pj ∈ paH (ai ) st pj ∉ paG (ai ) and pj is not a parent of all nodes in A ∪ B in H, add pj →ai to G.
20      For any two nodes X and Y in A st X−Y is in H but not in G, add X−Y to G.
21      Let ai be the node with highest order in A st ∃cj ∈ nbH (ai ) ∖ A and add ai −cj to G and cj to A.
22      Let ai be the node with highest order in A st paH (ai ) ≠ paG (ai ). Add the parents of ai in H to ai in G.¹
23      For any two nodes X and Y in A st X−Y is in H but not in G, add X−Y to G.
24    Restart the loop in line 2.

¹ With Add the parents of X in H to X in G we mean that the following lines should be executed, repeating the first applicable line until no case is applicable:
(1) If there exist two nodes Y, Z ∈ paH (X) ∖ paG (X) such that Y ∉ adH (Z), then add Y →X and Z→X to G with the add two directed edges operation.
(2) If there exist two nodes Y, Z ∈ paH (X) such that Z ∈ paG (X), Y ∉ paG (X) and Y ∉ adH (Z), then add Y →X to G with the add directed edge operation.
(3) If there exists a node Y ∈ paH (X) ∖ paG (X) such that G ∪ {Y →X} is a LDG, then add Y →X to G with the add directed edge operation.

Algorithm 1: Construction Algorithm
In the second part we will then cover the special case when all nodes in the connectivity component C have the same set of parents (lines 4 to 7), and in the third part we will cover the more general case when the nodes do not all have the same set of parents (lines 8 to 23).
In the proof we use the term strong edge in a LDG G to indicate an edge
that must exist in every AMP CG in the Markov equivalence class of G. For
example, if an undirected edge X−Y is strong an AMP CG G, then there exists
no feasible splits in G that orients this edge into X→Y or Y ←X. With strong
undirected path we mean a path of strong undirected edges. Moreover additional
terms will be defined in the parts they are used.
Part 1: One connectivity component C can be chosen at a time (lines 2 and 3).
As can be seen in Algorithm 1, each iteration of the outer loop in line 2 consists of taking one connectivity component C and adding it to G until G and H have the same structure. C is chosen in such a way, in line 3, that the subgraphs of G and H induced by anH (C) have the same structure. After each iteration we then know, as shown in parts 2 and 3, that the subgraphs of G and H induced by anH (C) ∪ C have the same structure.
To see that we can add the chain components this way it is enough to note that neither the conditions for when a split is feasible nor the conditions for when a merging is legal take the children or descendants of C resp. L into consideration. This means that when we check whether a chain component fulfills the criteria for a LDG we only look at the chain component itself and its ancestors. In turn this also means that a graph G can only be a LDG if ∀C ∈ cc(G) the subgraphs induced by C ∪ anG (C) are LDGs. Moreover, this also means that when adding new directed edges to, or undirected edges within, a component C chosen in line 3 of a graph G, we only have to check whether a split that removes some flag is feasible in C, not in the whole graph G. This is because we already know that the subgraph of G induced by the ancestors of C must be a LDG. Similarly, we only have to consider C and its parent components when we check whether some mergings are legal in G.
Finally, assuming that for each C chosen in line 3 the subgraphs of G and H induced by C ∪ anH (C) have the same structure after the iteration, it is easy to see that the loop must terminate with G = H due to the top-down structure of the components in H.
Part 2: When all nodes in C have the same set of parents.
For this part of the proof we can note that C does not contain any flags. This
means that no feasible splits can exist in G that remove flags. Moreover, if we
study the conditions for a legal merging we can also note that if all nodes in a
chain component C share the same set of parents, then the conditions can only
fail due to other parents, not the other nodes in C.
From this it follows that we can add the parents to each node X ∈ C independently of the other nodes in C and that any merging must be illegal due
to the parents of X. To see that all intermediate graphs are LDGs and that all
parents are added, if parents are added as described in footnote 1, we can note
the following. First note that no mergings can be legal. For the first two lines in
footnote 1 this follows directly from the fact that parents added by these lines
are part of unshielded colliders, and hence no merging can be legal between C
and some parent-component of C. For line 3 in footnote 1 it does on the other
hand follow from the condition that the resulting graph must be a LDG. To see
that paG (X) = paH (X) must hold when line 3 is no longer applicable assume
the contrary, i.e. that there exists a node Y ∈ paH (X) ∖ paG (X). We then know
that paH (X) ⊆ adH (Y ) must hold, or line 1 or 2 would have been applicable,
and that paH (Y ) ⊆ paH (X), or the graph G ∪ {Y →X} must be a LDG. This
does however also mean that a merge is legal in H, since no node in C ∖ X can
be part of preventing a split in H and ∀ai ∈ anH (X) bdG (ai ) = bdH (ai ), which
is a contradiction. Hence we must have that lines 5 and 6 can be executed st all
intermediate graphs are LDGs and ∀X ∈ C paG (X) = paH (X) holds after line
6.
For line 7 we can note that no flag-removing splits can exist in C since all
nodes in C have the same parents. Hence each undirected edge can be added
independently without taking the structure of C into consideration while no
mergings are legal as described above.
Part 3: When all nodes in C do not share the same parents.
In this part we will show that the subgraphs of G and H induced by C ∪ anH (C) must have the same structure after lines 8 to 23 have been executed in Algorithm 1 and that all intermediate graphs are LDGs.
To do this we first have to define the terms subcomponent and subcomponent order. Let the subcomponents of C be the components that are subsets of C in the maximally oriented AMP CG H ′ of H, i.e. H where all feasible splits have been performed iteratively. This means that H ′ is not the largest AMP CG in the strong Markov equivalence class of H, but it is still in the same strong Markov equivalence class as H. Hence we know that any component Ci in H can consist of several subcomponents in H ′ , and these subsets of nodes are then denoted Ci1 , Ci2 , ..., Cin , or C 1 , C 2 , ..., C n for short when Ci is implied. We also define an order on these subcomponents such that C k must have a lower order than C l if C k is an ancestor of C l in H ′ . In short this means that the subcomponent with lowest order has no parents in C in H ′ . Moreover, for the remainder of the proof let H ′ denote the maximally oriented AMP CG of H and C 1 the subcomponent with lowest order.
In addition to this we can also make some conclusions about the structure
of C and its subcomponents before we start the main part of the proof:
(1) For every non-empty subset of nodes Vi ⊂ C it must be the case that the
split with U = Vi and L = C ∖ Vi can either not be feasible in H or the split does
not remove any flag in H. If this did not hold then H could not be a LDG. Also
note that it must hold that the split with U = C ∖ Vi and L = Vi can either not
be feasible or not remove any flag according to the same argument.
(2) No splits are feasible within the subcomponents. This follows directly
from the way they are defined. This in turn means that the subgraph of H
induced by anH (C) ∪ C 1 must be a LDG. From this it follows that we can add
C 1 to G first, and then consider the rest of C separately.
(3) For every non-empty subset of nodes Vi ⊂ C 1 at least one of the conditions
for a feasible split with U = Vi and L = C 1 ∖Vi must fail. If we study the conditions
for a feasible split we can note that this can be for two reasons. Either there exist
two nodes cl ∈ Vi and ck ∉ Vi st cl −ck is in H ′ and paH ′ (cl ) ⊈ paH ′ (ck ), or there exist two nodes cl ∈ Vi and ck ∉ Vi st cl −ck is in H ′ and nbH ′ (ck ) ⊈ nbH ′ (cl ).
(4) All nodes in C ∖ C 1 must be strict descendants of C 1 in H ′ . To see this
note that for a split to be feasible the adjacent nodes of L in U must be adjacent
of all nodes in L. This means that if there exists a subcomponent C i st ∃ck ∈ C i ,
ck ∈ nbH (C 1 ), ck ∈ chH ′ (C 1 ) then any node in C 1 ∩ nbH (C i ) must be a parent
of all nodes in C i . This in turn means that if there exists a subcomponent C j st
∃cl ∈ C j , cl ∈ nbH (C i ), cl ∈ chH ′ (C i ) then any node in C i ∩ nbH (C j ) must be a
parent of all nodes in C j . From this it then iteratively follows that all nodes in
C ∖ C 1 must be strict descendants of C 1 in H ′ .
(5) For every node ck ∈ C ∖ C 1 that is a strict descendant of some node
cl ∈ C 1 it must hold that paH (ck ) ⊆ paH (cl ). To see this assume this is not the
case and that ck is a node for which the statement does not hold. Then a feasible
split must have been performed when H was transformed into H ′ such that ck
belonged to L and some node cm , st paH (cm ) ⊆ paH (cl ), belonged to U . This
feasible split would then remove a flag from H, which contradicts that H is a
LDG. Hence we have a contradiction.
(6) For every node ck ∈ C ∖C 1 that is a strict descendant of some node cl ∈ C 1
it must hold that paH (cl ) ⊆ paH (ck ). To see this assume this is not the case and
that ck is a node for which the statement does not hold. Then a split must have
been performed when H was transformed into H ′ such that ck belonged to L
and some node cm , st paH (cl ) ⊆ paH (cm ), belonged to U . We can then see that
condition 3 for a feasible split is not fulfilled which contradicts that H and H ′
are in the same Markov equivalence class.
(7) For every node ck ∈ C ∖C 1 that is a strict descendant of some node cl ∈ C 1
it must hold that paH (cl ) = paH (ck ). This follows directly from (5) and (6). We
can also recall that all nodes not in C 1 must be a strict descendant of some node
in C 1 according to (4).
(8) Every set of parents that exists to some node in C must also exist to some
node in C 1 . This follows directly from (7). Additionally, since we know that all
nodes in C do not have the same parents, this also means that all nodes in C 1
do not have the same parents and hence that there exist two nodes ck , cl ∈ C 1 st
paH (ck ) ⊈ paH (cl ).
(9) C 1 must be unique, i.e. consists of the same set of nodes no matter how
H is split. This follows directly from (7) and (8).
(10) ∀ck ∈ C i , i ≠ 1, paH (ck ) = paH (nbH (ck )). To see this note that all nodes ck for which there exists a path to cl ∈ C 1 in the subgraph of H induced by (C ∖ C 1 ) ∪ {cl } must be strict descendants of cl . This follows from (4) and from the fact that for a split to be feasible all nodes in U ∩ nbH (L) must be adjacent of all nodes in L. Together with (7) it then follows that the statement must hold.
If we study the algorithm we can now see that it consists of two phases. First
lines 10 to 20 when C 1 is handled, and then lines 21 to 23 when the rest of C
is handled. In both cases we can note that an edge is only added to G if it also exists in H. We will now go through the algorithm line by line and prove that all
intermediate graphs are LDGs.
Lines 11 and 13: To see that one of the induced subgraphs must exist in C 1 in H ′ , assume the contrary. We know from (8) that all nodes in C 1 do not have the same parents. This means that there must exist two nodes X, Y ∈ C 1 st ∃P ∈ paH (X) ∖ paH (Y ). Let L consist of all nodes in C 1 that have the parent P and U of all nodes that do not. We then know that ∀ck ∈ U ∩ nbH ′ (L), L ⊆ nbH ′ (ck ), or the induced subgraph described in line 13 must exist in H ′ . Similarly we know that U ∩ nbH ′ (L) must be complete. Finally we also know that ∀cl ∈ L, paH ′ (nbH ′ (cl ) ∩ U ) ⊆ paH ′ (cl ), or the induced subgraph described in line 11 must exist in H ′ . This does however mean that a split with the described U and L is feasible in C 1 in H ′ , which is a contradiction. We can also note that the intermediate graphs for lines 12 and 14 are LDGs since no splits are feasible (and no mergings possible) when C has not yet received any parents in G. Once it has received parents it is also easy to see that all parents are part of flags that cannot be removed by any feasible split.
For the loop in line 15 we can make the following observations:
(a) The loop must terminate when A = C 1 . This follows from (3) which gives
that the conditions for line 16 must be fulfilled for any A ⊂ C 1 .
(b) The only nodes in C that can have parents are those in A.
(c) The only nodes in C that can have any neighbours are those in A and B.
(d) All parents of C must be part of a flag. Hence no mergings are legal. This follows from the fact that a parent is only added to a node X in G if it is not a parent of all nodes for which there exists an undirected path to X.
(e) For any edge X−Y added in lines 17 or 18, with the X and Y described
for those lines, that edge must be strong for the current G and all future G
if either paG (X) ≠ ∅ or paG (Y ) ≠ ∅. To see this assume the contrary and
first assume that there exists a feasible split with X ∈ U and Y ∈ L. We can
then see that the condition for line 17 clearly contradicts that such a split is
feasible. For line 18 we can however imagine that if a split is feasible st the edge Y −Z is oriented into Y →Z and paG (X) ⊆ paG (Y ), then the edge X−Y might be feasibly split into X→Y . This would however mean that Y must belong to A and hence that paG (Y ) ≠ ∅, which would prevent the split between Y and Z unless paG (Y ) ⊆ paG (Z), which in turn would mean that Z also must belong to A. We can now restart this part of the proof with the edge Y −Z instead, and we can then see that for every edge there has to exist some other edge that would have to be split before the current edge, for which the condition that either paG (X) ≠ ∅ or paG (Y ) ≠ ∅ must hold. To see that no split is feasible with X ∈ L and Y ∈ U
note that this would require Y to be adjacent of all nodes in A∪B in G. Moreover
it would also have to hold that ∀ai ∈ A ∪ B paG (Y ) ⊆ paG (ai ). Now if A ∪ B = C 1
then this gives a contradiction since a split then would be feasible in H ′ . Hence
A ∪ B ⊂ C 1 must hold. However, for Y to be adjacent of all nodes in A ∪ B it
must hold that Y has been chosen as either X (with some node in A ∪ B as Y )
or Y (with some node in A∪B as X) in multiple iterations, even when it already belonged to A. This then gives a contradiction since there must exist some node in B or C 1 ∖ (A ∪ B) with higher order than Y that should have been possible to choose as Y according to conclusion (3). Hence any edge added as X−Y must be strong in all future G as long as X or Y has at least one parent not shared by some neighbour of X resp. Y . Moreover, if neither node has any parents then the split cannot remove a flag in G.
(f) When the loop terminates all nodes in C 1 must have the same parents in
G and H with the exception of the parents that are parents of all nodes in C.
This follows from the way that parents are added in line 19 and conclusion (8).
(g) A strong undirected path must exist between any two nodes in C 1 after
the loop terminates. This follows directly from (e), (f), that all nodes are in A
when the loop terminates and that a strong undirected path exists between any
two nodes in C 1 in H.
With these observations we can now see that line 17 can be performed and
that the intermediate graphs must be LDGs. P →X can be added since all edges
W −X in G for some node W ∈ nbG (X) must be strong after X has received P
as a parent, as described in (e). That X−Y is strong also follows from (e). For line 18 we can note that a split of X−Y into X→Y is feasible when only X−Y has been added to G. Such a split does however not remove any flag. Later, when Y −Z has been added, we also know that the resulting graph must be a LDG as described in (e).
For line 19 we can note that any added parent must be part of a flag in G
and hence no legal mergings can be possible. Moreover it follows from (e) that
no flag-removing splits can become feasible.
That line 20 must result in LDGs follows from the fact that a strong undirected path must exist between any two nodes in C 1 as described in (g) and
that this path can not be made weak by adding additional undirected edges that
exist in H. Hence, after line 20 the subgraphs of H and G induced by C 1 must
have the same structure. Note however that C 1 might have some parents in H
that do not exist in G, but only if those parents are parents of all nodes in C
as described in (f). These last parents are handled in line 22.
Line 21 then creates several tree structures, each with a root node in C 1 , that extend A to consist of all nodes in C. Note that any two nodes ci and cj must
belong to the same tree structure if there exists an undirected path between ci
and cj in the subgraph of H induced by C ∖ C 1 . That no split, that removes a
flag, can be feasible in G follows from the fact that only nodes in C 1 can have
parents and no node not in C 1 can be adjacent of all nodes in C 1 in G before
line 23.
In line 22 the parents of the nodes in C ∖ C 1 are added to G. We can now
note that in each subtree all nodes must have the same parents according to
(10). Moreover the parents are added from the root and outwards, adding the parents to the leaves last. This gives that no edge X−Y , st X has a lower order than Y , can be oriented into X→Y or Y →X in a feasible flag-removing split. If both X, Y ∈ C 1 this follows from (e). If X ∉ C 1 or Y ∉ C 1 the edge cannot be oriented to X→Y since Y only has a parent if X has the same parent. Nor can it be oriented to Y →X since ∃Z ∈ nbG (X) ∖ nbG (Y ) st Z has a lower order than X. That no mergings are legal before the last node receives its parents follows from the fact that all parents are part of flags. When the last node X, i.e. the node with the highest order in C, receives its parents, we can also note that no mergings can be legal.
legal. For the first two lines in footnote 1 this follows directly from the fact that
every parent added by these lines is part of an unshielded collider, and hence
no merging can be legal between C and some parent-component of C. For line
3 in footnote 1 it does on the other hand follow from the condition that the
resulting graph must be a LDG. To see that paG (X) = paH (X) must hold when
line 3 is no longer applicable assume the contrary, i.e. that there exists a node
Y ∈ paH (X)∖paG (X). We then know that paH (X) ⊆ adH (Y ) must hold, or line
1 or 2 would have been applicable, and that paH (Y ) ⊆ paH (X), or the graph
G ∪ {Y →X} must be a LDG. This does however also mean that a merge is legal
in H, since ∀ai ∈ anH (X) ∪ C ∖ X bdG (ai ) = bdH (ai ), which is a contradiction.
Hence we must have that ∀vi ∈ C paH (vi ) = paG (vi ) after line 22.
Finally, in line 23 the remaining undirected edges can be added to G since all nodes adjacent in C in H but not in G have the same set of parents. Hence no feasible splits can be flag-removing. ◻
Finally, we also have to establish under what conditions the independence model of a LDG can be represented by a BN resp. a MN.
Lemma 5. Given a LDG G there exists a BN H such that I(G) = I(H) iff G
contains no flags and G contains no chordless undirected cycles.
Proof. Since H is a BN and BNs are a subclass of AMP CGs, we know that for I(G) = I(H) to hold H must be in the same Markov equivalence class as G if it is interpreted as an AMP CG. However, since G is a LDG, any flag in G must exist in every AMP CG in the Markov equivalence class of G. Hence I(G) = I(H) cannot hold if G contains a flag.
Moreover it is also well known that the independence model of a MN containing
chordless cycles cannot be represented as a BN. On the other hand, if G contains
no flag and no chordless undirected cycles, then clearly all triplexes in G must
be unshielded colliders. Hence, together with the fact that G can contain no
semi-directed cycles, we can orient all undirected edges in G to directed edges
without creating any semi-directed cycles as shown by Koller and Friedman [8,
Theorem 4.13]. This means that the resulting graph must be a BN H such that
I(G) = I(H). ◻
Lemma 6. Given a LDG G there exists a MN H such that I(G) = I(H) iff G
contains no directed edges.
Proof. This follows directly from the fact that G contains a triplex iff it also contains a directed edge, and MNs cannot represent any triplexes. It is also clear that if G contains no directed edges then it can only contain undirected edges and hence can be interpreted as a MN. ◻
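Lemmas 5 and 6 thus give directly checkable criteria for a LDG. The sketch below contains our own helpers, reading "no chordless undirected cycles" as chordality of the undirected part and testing it by simplicial-vertex elimination:

```python
def has_flag(g, nodes):
    """A flag is an induced subgraph A -> B - C with A, C non-adjacent."""
    for (a, b) in g.d:
        for e in g.u:
            if b in e:
                c = next(v for v in e if v != b)
                if a != c and a not in g.ad({c}):
                    return True
    return False

def undirected_part_is_chordal(g, nodes):
    """A graph is chordal (no chordless cycles of length >= 4) iff
    repeatedly deleting a simplicial vertex, i.e. one whose remaining
    neighbourhood is complete, eliminates every vertex."""
    adj = {v: set() for v in nodes}
    for e in g.u:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    left = set(nodes)
    while left:
        simplicial = next(
            (v for v in left
             if all(q in adj[p]
                    for p in adj[v] & left for q in adj[v] & left if p != q)),
            None)
        if simplicial is None:
            return False        # some chordless undirected cycle remains
        left.remove(simplicial)
    return True

def representable_as_bn(g, nodes):   # Lemma 5
    return not has_flag(g, nodes) and undirected_part_is_chordal(g, nodes)

def representable_as_mn(g):          # Lemma 6
    return not g.d

# G of Fig. 1a contains two flags, so it has no BN representation:
assert not representable_as_bn(G, 'ABCDE')
```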
Table 1: Exact and approximate ratios of AMP CG models whose independence model can be represented as BNs, MNs, or neither (in that order). Exact values are only available for up to five nodes.

NODES   EXACT (BN, MN, neither)   APPROXIMATE (BN, MN, neither)
2       1.0000  1.0000  0.0000    1.0000  1.0000  0.0000
3       1.0000  0.7273  0.0000    1.0000  0.7275  0.0000
4       0.8259  0.2857  0.1607    0.8253  0.2839  0.1609
5       0.5987  0.0689  0.3958    0.6019  0.0632  0.3931
6       -                         0.4292  0.0112  0.5694
7       -                         0.2987  0.0019  0.7010
8       -                         0.2021  0.0002  0.7979
9       -                         0.1373  0.0000  0.8627
10      -                         0.0924  0.0000  0.9076
11      -                         0.0604  0.0000  0.9396
12      -                         0.0394  0.0000  0.9606
13      -                         0.0251  0.0000  0.9749
14      -                         0.0152  0.0000  0.9849
15      -                         0.0106  0.0000  0.9894
16      -                         0.0065  0.0000  0.9935
17      -                         0.0044  0.0000  0.9956
18      -                         0.0028  0.0000  0.9972
19      -                         0.0017  0.0000  0.9983
20      -                         0.0010  0.0000  0.9990

4 Results
Using the Markov chain presented in the previous section we were able to sample AMP CG models uniformly for up to 20 nodes. For each number of nodes, 10⁵ samples were drawn with 10⁵ transitions between consecutive samples. The success rate for the transitions, i.e. the percentage of transitions such that H ≠ G, was approximately 10% for 20 nodes, which corresponds well with earlier studies for LWF and MVR CGs [16]. To check whether a graph G was
well with earlier studies for LWF and MVR CGs [16]. To check if a graph G was
a LDG it was first checked whether it was an AMP CG. If so then the LDG H in
the Markov equivalence class of G was created whereafter it was checked whether
H and G had the same structure. To construct the LDG of G the algorithm defined by Roverato and Studený was used [14]. The implementation was carried
out in C++, run on an Intel Core i5 processor and it took approximately one
week to complete the sampling of all sample sets. Moreover we also enumerated
all LDGs for up to five nodes to allow a comparison between the approximated
and exact values. The sampled graphs and code are available for public access at
http ∶ //www.ida.liu.se/divisions/adit/data/graphs/CGSamplingResources.
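Put together, the experiment described above corresponds to a driver loop of roughly the following shape (a hypothetical harness: the parameter values come from the text, while is_ldg is still the assumed LDG test from Section 3):

```python
def sample_ratios(nodes, is_ldg, n_samples=10**5, gap=10**5):
    """Estimate the fractions of AMP CG models representable as BNs/MNs,
    drawing n_samples LDGs with gap transitions between consecutive draws."""
    g = CG()                      # the empty graph over `nodes` is a LDG
    bn = mn = 0
    for _ in range(n_samples):
        for _ in range(gap):      # walk the chain between draws
            g = transition(g, nodes, is_ldg)
        bn += representable_as_bn(g, nodes)
        mn += representable_as_mn(g)
    return bn / n_samples, mn / n_samples
```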
The results are shown in Tables 1 and 2. If we start with Table 1 we can
here see the ratio of the number of BN models resp. MN models compared to
number of AMP CG models. We can note that the approximation seems to be
very accurate for up to five nodes but for more than five nodes we do not have
any exact values to compare the approximations to. We can however plot the
ratios of the number of BN models compared to the number of AMP CG models
in a graph with logarithmic scales, as seen in Fig. 2. We can then see that the
ratio follows the equation R = 5 ⋅ 0.664^n very closely, where n is the number of
nodes and R the ratio. This means that the ratio decreases exponentially as the
number of nodes increases and hence that the previous results seen for LWF and
MVR CGs also hold for AMP CGs [16].
Fig. 2: The ratios of the number of BN models compared to the number of AMP
CG models for different number of nodes. (Displayed with a logarithmic scale)
If we go back to the table we can also note that the ratio of the number
of MN models compared to the number of AMP CG models is almost zero for
more than eight nodes similarly as seen in previous studies for LWF and MVR
CGs. Moreover, if we compare the ratio of the number of BN models to the number
of CG models for the different CG interpretations we can see that the ratio is
approximately 0.0017 for LWF CGs, 0.0011 for MVR CGs and 0.0010 for AMP
CGs for 20 nodes [16]. This means that LWF CGs can represent the smallest number of independence models while MVR and AMP CGs can represent about the same number. Another interesting observation that can be made here is that
the difference between the ratios of the number of BN models compared to the
number of AMP CG models resp. MVR CG models is very small for any number
of nodes.
If we continue to Table 2 we can here see the exact and approximate ratios of AMP CG models to AMP CGs. This ratio can be calculated using the equation

#CGmodels/#CGs = (#BNs/#CGs) ∗ (#BNmodels/#BNs) ∗ (#CGmodels/#BNmodels)   (1)

where #CGmodels represents the number of AMP CG models etc. The ratio #BNs/#CGs can then be found using the iterative equations by Robinson [13] and Steinsky [17], while #BNmodels/#BNs can be found in previous studies of BN models [11]. Finally, we can also get the ratio #CGmodels/#BNmodels by inverting the ratio found in Table 1.

Table 2: Exact and approximate ratios of AMP CG models to AMP CGs.

NODES   EXACT    APPROXIMATE
2       0.5000   0.5074
3       0.2200   0.2235
4       0.1327   0.1312
5       0.1043   0.1008
6       -        0.0860
7       -        0.0775
8       -        0.0701
9       -        0.0668
10      -        0.0616
11      -        0.0594
12      -        0.0595
13      -        0.0608
14      -        0.0635
15      -        0.0559
16      -        0.0598
17      -        0.0557
18      -        0.0567
19      -        0.0596
20      -        0.0632

If we study the values in Table 2 we can see that the ratio seems to converge to somewhere around 0.06, similarly as seen for the average number of BNs per BN model [16]. In the previous study estimating the average number of LWF CGs per LWF CG model no such convergence was seen [11]. We can however note that that study only goes up to 13 nodes, and also that the approximated values for the LWF CG interpretation are considerably lower than 0.06. The ratio also shows that traversing the space of AMP CG models when learning CG structures is considerably more efficient than traversing the space of all AMP CGs.
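The factor #BNs in equation (1) can be computed exactly with Robinson's recurrence for labeled DAGs [13], sketched below; Steinsky's analogous recurrence for #CGs [17] is omitted here. Python's arbitrary-precision integers keep the computation exact despite the superexponential growth.

```python
from math import comb

def n_dags(n):
    """Robinson's recurrence [13] for the number a_n of labeled DAGs:
    a_n = sum_{k=1..n} (-1)^(k+1) * C(n,k) * 2^(k(n-k)) * a_(n-k), a_0 = 1."""
    a = [1]
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * comb(m, k)
                     * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

assert [n_dags(n) for n in range(1, 5)] == [1, 3, 25, 543]
```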
5 Conclusion
In this article we have approximated the ratio of the number of AMP CG models compared to the number of BN models for up to 20 nodes. This has been
achieved by creating a Markov chain whose unique stationary distribution is the
uniform distribution of all AMP CG models. The results show that the ratio of
the number of independence models representable by AMP CGs compared to
the corresponding number for BNs grows exponentially as the number of nodes
increases. This confirms previous results for LWF and MVR CGs but a new, and
unexpected, result is also that AMP and MVR CGs seem to be able to represent almost the same number of independence models. It has previously been shown
that there exist some independence models perfectly representable by both MVR
and AMP CGs, but not BNs [15], and it would now be interesting to see how
large this set is compared to the set of all AMP resp. MVR CG models.
In addition to the comparison of the number of AMP CG models to the
number of BN models we have also approximated the average number of AMP
CGs per CG model. The results indicate that the average ratio converges to
somewhere around 0.06. Such a convergence has not previously been seen for
any other CG interpretation but it has been found to be ≈ 0.25 for BNs [11].
With these new results, sampling methods for CG models of all CG interpretations are now available [11, 16]. This opens up for MCMC-based structure learning algorithms for the different CG interpretations that traverse the space of CG models instead of the space of CGs. More importantly, it also opens up for further studies on what independence models and systems the different CG interpretations can represent, which is an important question where the answer is still unclear.
Acknowledgments
This work is funded by the Swedish Research Council (ref. 2010-4808).
Bibliography
[1] Andersson, S.A., Madigan, D., Perlman, M.D.: An Alternative Markov Property for Chain Graphs. Scandinavian Journal of Statistics 28, 33–85 (2001)
[2] Andersson, S.A., Perlman, M.D.: Characterizing Markov Equivalence
Classes For AMP Chain Graph Models. The Annals of Statistics 34, 939–972
(2006)
[3] Cox, D.R., Wermuth, N.: Linear Dependencies Represented by Chain
Graphs. Statistical Science 8, 204–283 (1993)
[4] Cox, D.R., Wermuth, N.: Multivariate Dependencies: Models, Analysis and
Interpretation. Chapman and Hall (1996)
[5] Drton, M.: Discrete Chain Graph Models. Bernoulli 15, 736–753 (2009)
[6] Frydenberg, M.: The Chain Graph Markov Property. Scandinavian Journal
of Statistics 17, 333–353 (1990)
[7] Häggström, O.: Finite Markov Chains and Algorithmic Applications. Cambridge University Press (2002)
[8] Koller, D., Friedman, N.: Probabilistic Graphical Models. MIT Press (2009)
[9] Lauritzen, S.L., Wermuth, N.: Graphical Models for Association Between
Variables, Some of Which are Qualitative and Some Quantitative. The Annals of Statistics 17, 31–57 (1989)
[10] Ma, Z., Xie, X., Geng, Z.: Structural Learning of Chain Graphs via Decomposition. Journal of Machine Learning Research 9, 2847–2880 (2008)
[11] Peña, J.M.: Approximate Counting of Graphical Models Via MCMC. In:
Proceedings of the 11th International Conference on Artificial Intelligence
and Statistics. pp. 352–359 (2007)
[12] Peña, J.M., Sonntag, D., Nielsen, J.: An Inclusion Optimal Algorithm for
Chain Graph Structure Learning. In: Proceedings of the 17th International
Conference on Artificial Intelligence and Statistics. pp. 778–786 (2014)
[13] Robinson, R.W.: Counting Labeled Acyclic Digraphs. New Directions in the
Theory of Graphs pp. 239–273 (1973)
[14] Roverato, A., Studený, M.: A Graphical Representation of Equivalence
Classes of AMP Chain Graphs. Journal of Machine Learning Research 7,
1045–1078 (2006)
[15] Sonntag, D., Peña, J.M.: Chain Graph Interpretations and Their Relations
(2014), under review.
[16] Sonntag, D., Peña, J.M., Gómez-Olmedo, M.: Approximate Counting of
Graphical Models Via MCMC Revisited. Under Review (2014)
[17] Steinsky, B.: Enumeration of Labelled Chain Graphs and Labelled Essential
Directed Acyclic Graphs. Discrete Mathematics 270, 266–277 (2003)