On Expressiveness of the AMP Chain Graph Interpretation Dag Sonntag ADIT, IDA, Linköping University, Sweden dag.sonntag@liu.se Abstract. In this paper we study the expressiveness of the AnderssonMadigan-Perlman interpretation of chain graphs. It is well known that all independence models that can be represented by Bayesian networks also can be perfectly represented by chain graphs of the Andersson-MadiganPerlman interpretation but it has so far not been studied how much more expressive this second class of models is. In this paper we calculate the exact number of representable independence models for the two classes, and the ratio between them, for up to five nodes. For more than five nodes the explosive growth of chain graph models does however make such enumeration infeasible. Hence we instead present, and prove the correctness of, a Markov chain Monte Carlo approach for sampling chain graph models uniformly for the Andersson-Madigan-Perlman interpretation. This allows us to approximate the ratio between the numbers of independence models representable by the two classes as well as the average number of chain graphs per chain graph model for up to 20 nodes. The results show that the ratio between the numbers of representable independence models for the two classes grows exponentially as the number of nodes increases. This indicates that only a very small fraction of all independence models representable by chain graphs of the Andersson-Madigan-Perlman interpretation also can be represented by Bayesian networks. Keywords: Chain graphs, Andersson-Madigan-Perlman interpretation, MCMC sampling. 1 Introduction Chain graphs (CGs) is a class of probabilistic graphical models (PGMs) that can contain two types of edges, representing symmetric resp. asymmetric relationships between the variables in question. This allows CGs to represent more independence models than for example Bayesian networks (BNs) and thereby represent systems more accurately than this less expressive class of models. Today there do however exist several different ways of interpreting CGs and what conditional independences they encode, giving rise to different so called CG interpretations. The most researched CG interpretations are the Lauritzen-WermuthFrydenberg (LWF) interpretation [6, 9], the Andersson-Madigan-Perlman (AMP) interpretation [1] and the multivariate regression (MVR) interpretation [3, 4]. Each interpretation has its own way of determining conditional independences in a CG and it can be noted that no interpretation subsumes another in terms of representable independence models [5, 15]. Although it is well known that any independence model that can be represented by BNs also can be represented by CGs, and that the opposite does not hold, it is unclear how much more expressive the latter class of models is. This question has previously been studied for CGs of the LWF interpretation (LWF CGs) and MVR interpretation (MVR CGs) [11, 16]. The results from these studies show that the ratio of the number of BN models (independence models representable by BNs) compared to the number of CG models (independence models representable by CGs) decreases exponentially as the number of nodes in the graphs increases. In this article we carry out a similar study for the AMP CG interpretation and investigate if the previous results also hold for this interpretation of CGs. Measuring the ratio of representable independence models should in principle be easy to do. We only have to enumerate every independence model representable by CGs of the AMP interpretation (AMP CGs) and check if they also can be perfectly represented by BNs. The problem here is that the number of AMP CG models grows superexponentially as the number of nodes in the graphs increases and it is infeasible to enumerate all of them for more than five nodes. Hence we instead use a Markov chain Monte Carlo (MCMC) approach that previously has been shown to be successful when studying BN models, LWF CG models and MVR CG models [11, 16]. The approach consists of creating a Markov chain whose states are all possible AMP CG models with a certain number of nodes and whose stationary distribution is the uniform distribution over these models. This does then allow us to sample the AMP CG models from approximately the uniform distribution over all AMP CG models by transitioning through the Markov chain. With such a set of models we can thereafter approximate the ratio of BN models to AMP CG models for up to 20 nodes. Moreover, since there exists an equation for calculating the exact number of AMP CGs for a set number of nodes [17], we can also estimate the average number of AMP CGs per AMP CG model. Finally we also make the AMP CG models publicly available online and it is our intent that they can be used for further studies of AMP CG models and to evaluate for example learning algorithms to get more accurate evaluations than what is achieved with the randomly generated CGs used today [10, 12]. The rest of the article is organised as follows. In the next section we will cover the definitions and notations used in this article. This is then followed by Section 3 where we discuss the theory of the MCMC sampling algorithm used and also prove that its unique stationary distribution is the uniform distribution of all AMP CG models. In Section 4 we then discuss the results found and in Section 5 we give a short conclusion. 2 Notation All graphs are defined over a finite set of variables V . If a graph G contains an edge between two nodes V1 and V2 , we denote with V1 →V2 a directed edge and with V1 −V2 an undirected edge. A set of nodes is said to be complete if there exist edges between all pairs of nodes in the set. The parents of a set of nodes X of G is the set paG (X) = {V1 ∣V1 →V2 is in G, V1 ∉ X and V2 ∈ X}. The children of X is the set chG (X) = {V1 ∣V2 →V1 is in G, V1 ∉ X and V2 ∈ X}. The neighbours of X is the set nbG (X) = {V1 ∣V1 −V2 is in G, V1 ∉ X and V2 ∈ X}. The boundary of X is the set bdG (X) = paG (X) ∪ nbG (X). The adjacents of X is the set adG (X) = {V1 ∣V1 →V2 ,V1 ←V2 or V1 −V2 is in G, V1 ∉ X and V2 ∈ X}. A route from a node V1 to a node Vn in G is a sequence of nodes V1 , . . . , Vn such that Vi ∈ adG (Vi+1 ) for all 1 ≤ i < n. A path is a route containing only distinct nodes. The length of a path is the number of edges in the path. A path is called a cycle if Vn = V1 . A path is descending if Vi ∈ paG (Vi+1 ) ∪ nbG (Vi+1 ) for all 1 ≤ i < n. The descendants of a set of nodes X of G is the set deG (X) = {Vn ∣ there is a descending path from V1 to Vn in G, V1 ∈ X and Vn ∉ X}. A path is strictly descending if Vi ∈ paG (Vi+1 ) for all 1 ≤ i < n. The strict descendants of a set of nodes X of G is the set sdeG (X) = {Vn ∣ there is a strict descending path from V1 to Vn in G, V1 ∈ X and Vn ∉ X}. The ancestors (resp. strict ancestors) of X is the set anG (X) = {V1 ∣Vn ∈ deG (V1 ), V1 ∉ X, Vn ∈ X} (resp. sanG (X) = {V1 ∣Vn ∈ sdeG (V1 ), V1 ∉ X, Vn ∈ X}). A cycle is called a semi-directed cycle if it is descending and Vi →Vi+1 is in G for some 1 ≤ i < n. A Bayesian network (BN) is a graph containing only directed edges and no semi-directed cycles while a Markov network (MN) is graph containing only undirected edges. A CG under the Andersson-Madigan-Perlman (AMP) interpretation, denoted AMP CG, contains only directed and undirected edges but no semi-directed cycles. Note that we in this article only will study the AMP CG interpretation, but that we will use the term CG when the results or notion generalizes to all CG interpretations. A connectivity component C of an AMP CG is a maximal (wrt set inclusion) set of nodes such that there exists a path between every pair of nodes in C containing only undirected edges. A subgraph of G is a subset of nodes and edges in G. A subgraph of G induced by a set of its nodes X is the graph over X that has all and only the edges in G whose both ends are in X. To illustrate these concepts we can study the AMP CG G with five nodes shown in Fig. 1a. In the graph we can for example see that the only child of A is B and that D is a neighbour of E. D is also a strict descendant of A due to the strictly descending path A→B→D, while E is not. E is however in the descendants of A together with B and D. A is therefore an ancestor of all nodes except itself and C. We can also see that G contains no semi-directed cycles since it contains no cycle at all. Moreover we can see that G contains four connectivity components: {A}, {B}, {C} and {D, E}. In Fig. 1b we can see a subgraph of G with the nodes B, D and E. This is also the induced subgraph of G with these nodes since it contains all edges between the nodes B, D and E in G. A A B C B D E D (a) An AMP CG G E (b) The subgraph of G induced by {B, D, E} B C D E (c) A LDG H Fig. 1: Three different AMP CGs Let X, Y and Z denote three disjoint subsets of V . We say that X is conditionally independent from Y given Z if the value of X does not influence the value of Y when the values of the variables in Z are known, i.e. pr(X, Y ∣Z) = pr(X∣Z)pr(Y ∣Z) holds and pr(Z) > 0. We say that X is separated from Y given Z, denoted X⊥G Y ∣Z, in an AMP CG, BN or MN G iff there exists no S-open path between X and Y . A path is said to be S-open iff every non-head-no-tail node on the path is not in Z and every head-no-tail node on the path is in Z or sanG (Z). A node B is said to be a head-no-tail in an AMP CG, BN or MN G between two nodes A and C on a path if one of the following configurations exists in G: A→B←C, A→B−C or A−B←C. Moreover G is also said to contain a triplex ({A, C}, B) iff one such configuration exists in G and A and C are not adjacent in G. A triplex ({A, C}, B) is said to be a flag (resp. a collider ) in an AMP CG or BN G iff G contains one following subgraphs induced by A, B and C: A→B−C or A−B←C (resp. A→B←C). If an AMP CG G is said to contain a biflag we mean that G contains the induced subgraph A→B−C←D where A and D might be adjacent, for four nodes A, B, C and D. Given a graph G we mean with G ∪ {A→B} the graph H with the same structure as G but where H also contains the directed edge A→B in addition to any other edges in G. Similarly we mean with G ∖ {A→B} the graph H with the same structure as G but where H does not contain the directed edge A→B. Note that if G did not contain the directed edge A→B then H is not a valid graph. To illustrate these concepts we can once again look at the graph G shown in Fig. 1a. G contains no colliders but two flags, B→D−E and C→E−D, that together form the biflag B→D−E←C. This means that B⊥ G E holds but that B⊥G E∣D does not hold since D is a head-no-tail node on the path B→D−E. The independence model M induced by a graph G, denoted as I(G), is the set of separation statements X⊥G Y ∣Z that hold in G. We say that two graphs G and H are Markov equivalent or that they are in the same Markov equivalence class iff I(G) = I(H). Moreover we say that G and H belong to the same strong Markov equivalent class iff I(G) = I(H) and G and H contain the same flags. By saying that an independence model is perfectly represented in a graph G we mean that I(G) contains all and only the independences in the independence model. An AMP CG model (resp. BN model or MN model) is an independence model representable by an AMP CG (resp. BN or MN). AMP CG models do today have two possible unique graphical representations, largest deflagged graphs (LDGs) [14] and AMP essential graphs [2]. In this article we will only use LDGs. A LDG is the AMP CG of a Markov equivalence class that has the minimum number number of flags while at the same time contains the maximum number of undirected edges for that strong Markov equivalence class. For AMP CGs there exists a set of operations that allows for changing the structure of edges within an AMP CG without altering the Markov equivalence class it belongs to. In this article we use the feasible split operation [15] and the legal merging [14] operation. A split is said to be feasible for a connectivity component C of an AMP CG G iff it can be divided into two disjoint sets U and L such that U ∪ L = C and if replacing all undirected edges between U and L with directed edges orientated towards L results in an AMP CG H such that I(G) = I(H). It has been shown that if the following conditions hold in G then such a split is possible: [15] (1) ∀A ∈ neG (L) ∩ U, L ⊆ neG (A), (2) neG (L) ∩ U is complete and (3) ∀B ∈ L, paG (neG (L) ∩ U ) ⊆ paG (B). A merging is on the other hand said to be legal for two connectivity components U and L in an AMP CG G, such that U ∈ paG (L), iff replacing all directed edges between U and L with undirected edges results in an AMP CG H such that G and H belongs to the same strong Markov equivalence class. It has been shown that if the following conditions hold in G then such a merging is possible [14] (1) paG (L) ∩ U is complete in G, (2) ∀B ∈ paG (L) ∩ U, paG (L) ∖ U = paG (B) and (3) ∀A ∈ L, paG (L) = paG (A). Note that a legal merging is not the reverse operator of a feasible split since a feasible split handles Markov equivalence classes, while a legal merging handles strong Markov equivalence classes. If we once again look at the CGs in Fig. 1 we can see that G, shown in Fig. 1a, and H, shown in Fig. 1c, are Markov equivalent since I(G) = I(H). Moreover we can note that G and H must belong to the same strong Markov equivalence class since they contain the same flags. This means that G cannot be a LDG since H is larger than G. In G it exists no feasible split, but one legal merging which is the merging of the connectivity components {A} and {B} that results in a CG with the same structure as H. In H it does on the other hand exist no legal merging but one feasible split which is the split of the connectivity component {A, B} into either A→B or B→A. We can finally note that this feasible split would result in no additional flags and hence, since no legal mergings are possible in H, that H must be a LDG. 3 Markov Chain Monte Carlo Approach In this section we cover the theory behind the MCMC approach used in this article for sampling LDGs from the uniform distribution. Note that we only cover the theory of the MCMC sampling method very briefly and for a more complete introduction of the sampling method we instead refer the reader to the work by Häggström [7]. The MCMC sampling approach consists of creating a Markov chain whose unique stationary distribution is the desired distribution and then sample this Markov chain after a number of transitions. The Markov chain is defined by a set of operators that allows us to transition from one state to another. It can then be shown that if these operators have certain properties and the number of transitions goes to infinity then all states have the same probability of being sampled [7]. Moreover, in practice it has also been shown that the number of transitions can be relatively small and a good approximation of the uniform distribution can still be achieved. In our case the possible states of the Markov chain are all possible independence models representable by AMP CGs, i.e. AMP CG models, represented by LDGs. The operators then add and remove certain edges in in these LDGs, allowing the MCMC method to transition between all possible LDGs. For the stationary distribution to be the unique uniform distribution of all possible LDGs we only have to prove that the operators fulfill the following properties [7]: aperiodicity, irreducibility and reversibility. Aperiodicity, i.e. that the Markov chain does not end up in the same state periodicly, and irreducibility, i.e. that any LDG can be reached from any other LDG using only the defined operators, proves that the Markov chain has a unique distribution. Reversibility then proves that this distribution also is the stationary distribution. The operators used are defined in Definition 1 and it follows from Lemma 1, 2 and 3 that they fulfill the properties described above for LDGs with at least two nodes. Definition 1. Markov Chain Operators Choose uniformly and perform one of the following six operators to transition from a LDG G to the next LDG H in the Markov chain. – Add directed edge. Choose two nodes X, Y in G uniformly and with replacement. If X ∉ adG (Y ) and G ∪ {X→Y } is a LDG let H = G ∪ {X→Y }, otherwise let H = G. – Remove directed edge. Choose two nodes X, Y in G uniformly and with replacement. If X→Y is in G and G∖{X→Y } is a LDG let H = G∖{X→Y }, otherwise let H = G. – Add undirected edge. Choose two nodes X, Y in G uniformly and with replacement. If X ∉ adG (Y ) and G ∪ {X−Y } is a LDG let H = G ∪ {X−Y }, otherwise let H = G. – Remove undirected edge. Choose two nodes X, Y in G uniformly and with replacement. If X−Y is in G and G ∖ {X−Y } is a LDG let H = G ∖ {X−Y }, otherwise let H = G. – Add two directed edges. Choose four nodes X, Y, Z, W in G uniformly and with replacement. If X ∉ adG (Y ), Z ∉ adG (W ) and G ∪ {X→Y, Z→W } is a LDG let H = G ∪ {X→Y, Z→W }, otherwise let H = G. Note that Y might be equal to W in this operation. – Remove two directed edges. Choose four nodes X, Y, Z, W in G uniformly and with replacement. If X→Y and Z→W are in G and G ∖ {X→Y, Z→W } is a LDG let H = G∖{X→Y, Z→W }, otherwise let H = G. Note that Y might be equal to W in this operation. Lemma 1. The operators in Definition 1 fulfill the aperiodicity property when G contains at least two nodes. Proof. Here it is enough to show that there exists at least one operator such that H is equal to G, and at least one operator such that it is not, for any possible G with at least two nodes. The latter follows directly from Lemma 3 since there exist more than one possible state when G contains at least two nodes. To see that the former must hold note that if the add directed edge operation results in a LDG for some nodes X and Y in a LDG G, then clearly the remove directed edge X→Y operation must result in a LDG H equal to G for that G. ◻ Lemma 2. The operators in Definition 1 fulfill the reversibility property for the uniform distribution. Proof. Since the desired distribution for the Markov chain is the uniform distribution proving that reversibility holds for the operators simplifies to proving that symmetry holds for them. This means that we need to show that for any LDG G the probability to transition to any LDG H is equal to the probability to transition from H to G. Here it is simple to see that each operator has a reverse operator such as remove directed edge for add directed edge etc. and that the “forward” operator and reverse operator are chosen with equal probability for a certain set of nodes. Moreover we can also see that if H is G with one operator performed upon it, such that H ≠ G, then clearly there can exist no other operator that transition G to H. Hence the operators fulfill the symmetry property and thereby the reversibility property. ◻ Lemma 3. Given a LDG G any other LDG H can be reached using the operators described in Definition 1 such that all intermediate graphs are LDGs. Proof. Here it is enough to prove that any LDG H can be reached from the empty graph G∅ since we, due to the reversibility property, then also know that G∅ can be reached from H (or G). A procedure to reach H from G∅ is detailed in Algorithm 1 and its correctness follows from Lemma 4. ◻ Lemma 4. Algorithm 1 is correct, i.e. defines a valid sequence of operations such that any LDG H can be reached from the empty graph such that all intermediate graphs are LDGs. Proof. To prove this lemma we need to show that all intermediate graphs are LDGs and that G = H when the algorithm terminates. The proof is divided into three separate parts. In the first part we will show that we can add one connectivity component C at a time as long as the subgraph of G and H induced 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Given a LDG H the following algorithm constructs G from the the empty graph using the operators in Definition 1 such that G = H when the algorithm terminates and all intermediate graphs are LDGs. Let G be the empty graph. Repeat until G = H: Let C be a connectivity component in G such that the subgraphs of G and H induced by anG (C) have the same structure. If all nodes in C have the same set of parents: Repeat for all nodes X ∈ C: Add the parents of X in H to X in G.1 For any two nodes X and Y in A st X−Y is in H but not in G, add X−Y to G. Else: (when all nodes in C do not have the same set of parents): Let H ′ be the maximally oriented AMP CG of H and C 1 the subcomponent with lowest order. Let A be and B be two empty sets of nodes. Let the order of a node be the size of A when it was added to A if it is in A. If it is in B, but not in A, let the order be ∣C∣ plus the size of A when it was added to B. If it is in neither set let it be larger than any node in A or B. If H ′ contains an induced subgraph of the form P →X−Y ←Q then: Add first X−Y and then both P →X and Q→Y at once to G. Add both X and Y to A. Else we know that H ′ must contain an induced subgraph of the form Y −X−Z st ∃P ∈ paH (X) and P ∉ paH (Z). Add Y −X, X−Z and P →X to G sequentially. Add X to A and Y and Z to B. Repeat until A = C 1 . If there exist two nodes X ∈ A and Y ∈ nbH ′ (X) st either ∃P ∈ paH (X) ∖ paH (Y ) or ∃Z ∈ nbH ′ (Y ) ∖ nbH ′ (X) hold then choose X and Y in this way but also choose Y (and the corresponding X) so that Y has the highest possible order: If ∃P ∈ paH (X) ∖ paH (Y ): Add P →X to G if it is not in G, then add X−Y to G if it is not in G. Add Y to A. Else (when ∃Z ∈ nbH ′ (Y ) ∖ nbH ′ (X)): Add X−Y and Y −Z to G (sequentially) if they are not already in G. Add Y to A and Z to B if they not already belong to these sets. For each node ai ∈ A and parent pj ∈ paH (ai ) st pj ∉ paG (ai ) and pj is not a parent of all nodes in A ∪ B in H, add pj →ai to G. For any two nodes X and Y in A st X−Y is in H but not in G, add X−Y to G. Let ai be the node with highest order in A st ∃cj ∈ nbH (ai ) ∖ A and add ai −cj to G and cj to A. Let ai be the node with highest order in A st paH (ai ) ≠ paG (ai ). Add the parents of ai in H to ai in G. 1 For any two nodes X and Y in A st X−Y is in H but not in G, add X−Y to G. Restart the loop in line 2. 1 With Add the parents of X in H to X in G we mean that the following lines should be executed. Repeat the first applicable line until no case is applicable: (1) If there exist two nodes Y, Z ∈ paH (X) ∖ paG (X) such that Y ∉ adH (Z) then add Y →X, Z→X to G with the add two directed edges operation. (2) If there exist two nodes Y, Z ∈ paH (X) such that Z ∈ paG (X), Y ∉ paG (X) and Y ∉ adH (Z) then add Y →X to G with the add directed edge operation. (3) If there exist a node Y ∈ paH (X) ∖ paG (X) such that G ∪ {Y →X} is a LDG then add Y →X to G with the add directed edge operation. Algorithm 1: Construction Algorithm by anG (C) have the same structure (lines 1 to 3). In the second part we will then cover the special cases when all nodes in the connectivity component C have the same set of parents (lines 4 to 7) and in the third part we will cover the more general case when all nodes do not have the same set of parents (lines 8 to 23). In the proof we use the term strong edge in a LDG G to indicate an edge that must exist in every AMP CG in the Markov equivalence class of G. For example, if an undirected edge X−Y is strong an AMP CG G, then there exists no feasible splits in G that orients this edge into X→Y or Y ←X. With strong undirected path we mean a path of strong undirected edges. Moreover additional terms will be defined in the parts they are used. Part 1: One connectivity component C can be chosen at a time (lines 2 and 3). As can be seen in Algorithm 1 each iteration of the outer loop in line 2 consists of taking one connectivity component C and add it to G until G and H have the same structure. C is chosen in such a way, in line 3, that the subgraphs of G and H induced by the anG (C) have the same structure. After each iteration we then know, as shown in part 2 and 3, that the subgraphs of G and H induced by anG (C) ∪ C have the same structure. To see that we can add the chain components this way it is enough to note that neither the conditions for when a split is feasible nor the conditions for when a merging is legal takes the children or descendants of C resp. L into consideration. This means that when we check if a chain component fulfills the criteria for a LDG we only look at the chain component itself and its ancestors. In turn this also means that a LDG G can only be a LDG if ∀C ∈ cc(G) the subgraphs induced by C ∪ anG (C) are LDGs. Moreover, this also means that when adding new directed edges to or undirected edges within a component C chosen in line 3 in a graph G, we only have to check if a split, that removes some flag, is feasible in C, not the whole graph G. This is because we already know that the subgraph of G induced by the ancestors of C must be a LDG. Similarly we only have to consider C and its parent components when we check if some mergings are legal in G. Finally, assuming that for each C chosen in line 3 the subgraphs of G and H induced by C ∪ anH (C) have the same structure after the iteration, it is easy to see that the loop must terminate when G = H due to the top down structure of components in H. Part 2: When all nodes in C have the same set of parents. For this part of the proof we can note that C does not contain any flags. This means that no feasible splits can exist in G that removes flags. Moreover, if we study the conditions for a legal merging we can also note that if all nodes in a chain component C share the same set of parents, then the conditions can only fail due to other parents, not the other nodes in C. From this it follows that we can add the parents to each node X ∈ C independently of the other nodes in C and that any merging must be illegal due to the parents of X. To see that all intermediate graphs are LDGs and that all parents are added, if parents are added as described in footnote 1, we can note the following. First note that no mergings can be legal. For the first two lines in footnote 1 this follows directly from the fact that parents added by these lines are part of unshielded colliders, and hence no merging can be legal between C and some parent-component of C. For line 3 in footnote 1 it does on the other hand follow from the condition that the resulting graph must be a LDG. To see that paG (X) = paH (X) must hold when line 3 is no longer applicable assume the contrary, i.e. that there exists a node Y ∈ paH (X) ∖ paG (X). We then know that paH (X) ⊆ adH (Y ) must hold, or line 1 or 2 would have been applicable, and that paH (Y ) ⊆ paH (X), or the graph G ∪ {Y →X} must be a LDG. This does however also mean that a merge is legal in H, since no node in C ∖ X can be part of preventing a split in H and ∀ai ∈ anH (X) bdG (ai ) = bdH (ai ), which is a contradiction. Hence we must have that line 5 and 6 can be executed st all intermediate graphs are LDGs and ∀X ∈ C paG (X) = paH (X) holds after line 6. For line 7 we can note that no flag-removing splits can exist in C since all nodes in C have the same parents. Hence each undirected edge can be added independently without taking the structure of C into consideration while no mergings are legal as described above. Part 3: When all nodes in C do not share the same parents. In this part we will show that the subgraphs of C ∪ anH (C) must have the same structure in H and G after lines 8 to 23 have been exectuted in algorithm 1 and that all intermediate graphs are LDGs. To do this we first have to define the terms subcomponents and subcomponentorder. Let a subcomponent of C be the different components that are subsets of C in the maximally oriented AMP CG H ′ of H, i.e. H where all feasible splits have been performed, iteratively. This means that H ′ is not the largest AMP CG in the strong Markov equivalence class of H, but still in the same strong Markov equivalence class as H. Hence we know that any component Ci in H can consist of several subcomponents in H ′ and these subset of nodes are then denoted Ci1 , Ci2 , ..., Cin , or C 1 , C 2 , ..., C n for short when Ci is implied. We also define an order of these subcomponents such that C k must have a lower order than C l if C k is an ancestor of C l in H ′ . In short this means that the subcomponent with lowest order has no parents in C in H ′ . Moreover, for the remainder of the proof let H ′ denote the maximally oriented AMP CG of H and C 1 the subcomponent with lowest order. In addition to this we can also make some conclusions about the structure of C and its subcomponents before we start the main part of the proof: (1) For every non-empty subset of nodes Vi ⊂ C it must be the case that the split with U = Vi and L = C ∖ Vi can either not be feasible in H or the split does not remove any flag in H. If this would not hold then H can not be a LDG. Also note that it must hold that the split with U = C ∖ Vi and L = Vi can either not be feasible or not remove any flag according to the same argument. (2) No splits are feasible within the subcomponents. This follows directly from the way they are defined. This in turn means that the subgraph of H induced by anH (C) ∪ C 1 must be a LDG. From this it follows that we can add C 1 to G first, and then consider the rest of C separately. (3) For every non-empty subset of nodes Vi ⊂ C 1 at least one of the conditions for a feasible split with U = Vi and L = C 1 ∖Vi must fail. If we study the conditions for a feasible split we can note that this can be for two reasons. Either there exist two nodes cl ∈ Vi and ck ∉ Vi st cl −ck is in H ′ and paH ′ (cl ) ⊆/ paH ′ (ck ) or there exist two nodes cl ∈ Vi and ck ∉ Vi st cl −ck is in H ′ and nbH ′ (ck ) ⊆/ nbH ′ (cl ). (4) All nodes in C ∖ C 1 must be strict descendants of C 1 in H ′ . To see this note that for a split to be feasible the adjacent nodes of L in U must be adjacent of all nodes in L. This means that if there exists a subcomponent C i st ∃ck ∈ C i , ck ∈ nbH (C 1 ), ck ∈ chH ′ (C 1 ) then any node in C 1 ∩ nbH (C i ) must be a parent of all nodes in C i . This in turn means that if there exists a subcomponent C j st ∃cl ∈ C j , cl ∈ nbH (C i ), cl ∈ chH ′ (C i ) then any node in C i ∩ nbH (C j ) must be a parent of all nodes in C j . From this it then iteratively follows that all nodes in C ∖ C 1 must be strict descendants of C 1 in H ′ . (5) For every node ck ∈ C ∖ C 1 that is a strict descendant of some node cl ∈ C 1 it must hold that paH (ck ) ⊆ paH (cl ). To see this assume this is not the case and that ck is a node for which the statement does not hold. Then a feasible split must have been performed when H was transformed into H ′ such that ck belonged to L and some node cm , st paH (cm ) ⊆ paH (cl ), belonged to U . This feasible split would then remove a flag from H, which contradicts that H is a LDG. Hence we have a contradiction. (6) For every node ck ∈ C ∖C 1 that is a strict descendant of some node cl ∈ C 1 it must hold that paH (cl ) ⊆ paH (ck ). To see this assume this is not the case and that ck is a node for which the statement does not hold. Then a split must have been performed when H was transformed into H ′ such that ck belonged to L and some node cm , st paH (cl ) ⊆ paH (cm ), belonged to U . We can then see that condition 3 for a feasible split is not fulfilled which contradicts that H and H ′ are in the same Markov equivalence class. (7) For every node ck ∈ C ∖C 1 that is a strict descendant of some node cl ∈ C 1 it must hold that paH (cl ) = paH (ck ). This follows directly from (5) and (6). We can also recall that all nodes not in C 1 must be a strict descendant of some node in C 1 according to (4). (8) Every set of parents that exists to some node in C must also exist to some node in C 1 . This follows directly from (7). Additionally, since we know that all nodes in C do not have the same parents, this also means that all nodes in C 1 do not have the same parents and hence that there exist two nodes ck , cl ∈ C 1 st paH (ck ) ⊆/ paH (cl ). (9) C 1 must be unique, i.e. consists of the same set of nodes no matter how H is split. This follows directly from (7) and (8). (10) ∀ck ∈ C i , i ≠ 1, paH (ck ) = paH (nbH (ck )). To see this note that all nodes ck for which there exists a path to cl ∈ c1 in the subgraph of H induced by (C ∖ C 1 ) ∪ cl must be a strict descendant of cl . This follows from the (4) and that for a split to be feasible all nodes in U ∩ nbH (L) must be adjacent of all nodes in L. Together with (7) it then follows that the statement must hold. If we study the algorithm we can now see that it consists of two phases. First lines 10 to 20 when C 1 is handled, and then lines 21 to 23 when the rest of C is handled. In both cases we can note that an edge only is added to G if it also exists in H. We will now cover line by line in the algorithm and prove that all intermediate graphs are LDGs. Line 11 and 13: To see that one of the induced subgraphs must exist in C 1 in H ′ assume the contrary. We know from (8) that all nodes in C 1 do not have the same parents. This means that there must exist two nodes X, Y ∈ C 1 st ∃P ∈ paH (X) ∖ paH (Y ). Let L consist of all nodes in C 1 that have the parent P and U of all nodes that does not. We then know that ∀ck ∈ U ∩ nbH ′ (L) L ∈ nbH ′ (ck ) or the induced subgraph described in line 13 must exist in H ′ . Similarly we know that U ∩ nbH ′ (L) must be complete. Finally we also know that ∀cl ∈ L paH ′ (nbH ′ (cl ) ∩ U ) ⊆ paH ′ (cl ) or the induced subgraph described in line 11 must exist in H ′ . This does however mean that a split, with the described U and L is feasible in C 1 in H ′ which is a contradiction. We can also note that the intermediate graphs for line 12 and 14 are LDGs since no splits are feasible (and no mergings possible) when C has not yet received any parents in G. Once it has received parents it is also easy to see that all parents are part of flags that can not be removed with some feasible split. For the loop in line 15 we can make the following observations: (a) The loop must terminate when A = C 1 . This follows from (3) which gives that the conditions for line 16 must be fulfilled for any A ⊂ C 1 . (b) The only nodes in C that can have parents are those in A. (c) The only nodes in C that can have any neighbours are those in A and B. (d) All parents of C must be part of a flag. Hence no mergings are legal. This follows from that a parent only is added to a node X in G if it is not a parent of all nodes for which there exists an undirected path to X. (e) For any edge X−Y added in lines 17 or 18, with the X and Y described for those lines, that edge must be strong for the current G and all future G if either paG (X) ≠ ∅ or paG (Y ) ≠ ∅. To see this assume the contrary and first assume that there exists a feasible split with X ∈ U and Y ∈ L. We can then see that the condition for line 17 clearly contradicts that such a split is feasible. For line 18 we can however imagine that if a split is feasible st the edge Y −Z is oriented into Y →Z and paG (X) ⊆ paG (Y ) then the edge X−Y might be feasible split into X→Y . This would however mean that Y must belong to A and hence that paG (Y ) ≠ ∅ which would prevent the split between Y and Z unless paG (Y ) ⊆ paG (Z) which in turn would mean that Z also must belong to A. We can now restart this part of the proof with the edge Y −Z instead and we can then see that for every edge it has to exist some other edge that would have to be split before the current edge for which the condition that either paG (X) ≠ ∅ or paG (Y ) ≠ ∅ must hold. To see that no split is feasible with X in L and Y ∈ U note that this would require Y to be adjacent of all nodes in A∪B in G. Moreover it would also have to hold that ∀ai ∈ A ∪ B paG (Y ) ⊆ paG (ai ). Now if A ∪ B = C 1 then this gives a contradiction since a split then would be feasible in H ′ . Hence A ∪ B ⊂ C 1 must hold. However, for Y to be adjacent of all nodes in A ∪ B it must hold that Y has been chosen as either X (with some node in A ∪ B as Y ) or Y (with some node in A∪B as X) in multiple iterations, even when it already belonged to A. This then gives a contradiction since it must exist some node in B or C 1 ∖ (A ∪ B) with higher order than Y that should have been possible to choose as Y according to conclusion (3). Hence any added edge as X−Y must be strong in all future G as long as X or Y have at least one parent not shared by some neighbour of X resp. Y . Moreover, if neither node have any parent then the split can not remove a flag in G. (f) When the loop terminates all nodes in C 1 must have the same parents in G and H with the exception of the parents that are parents of all nodes in C. This follows from the way that parents are added in line 19 and conclusion (8). (g) A strong undirected path must exist between any two nodes in C 1 after the loop terminates. This follows directly from (e), (f), that all nodes are in A when the loop terminates and that a strong undirected path exists between any two nodes in C 1 in H. With these observations we can now see that line 17 can be performed and that the intermediate graphs must be LDGs. P →X can be added since all edges W −X in G for some node W ∈ nbG (X) must be strong after X has received P as a parent as described in (e). That X−Y also is strong also follows from (e). For line 18 we can note that a split is feasible of X−Y to X→Y when only X−Y have been added to G. Such a split does however not remove any flag. Later, when Y −Z have been added we also know that the resulting graph must be a LDG as described in (e). For line 19 we can note that any added parent must be part of a flag in G and hence no legal mergings can be possible. Moreover it follows from (e) that no flag-removing splits can become feasible. That line 20 must result in LDGs follows from the fact that a strong undirected path must exist between any two nodes in C 1 as described in (g) and that this path can not be made weak by adding additional undirected edges that exist in H. Hence, after line 20 the subgraphs of H and G induced by C 1 must have the same structure. Note however that C 1 might have some parents in H that does not exist in G, but only if those parents are parents of all nodes in C as described in (f). These last parents are handled in line 22. Line 21 then creates several tree structures, each with a root node in C 1 , that extends A to consist of all nodes in C. Note that any two nodes ci and cj must belong to the same tree structure if there exists an undirected path between ci and cj in the subgraph of H induced by C ∖ C 1 . That no split, that removes a flag, can be feasible in G follows from the fact that only nodes in C 1 can have parents and no node not in C 1 can be adjacent of all nodes in C 1 in G before line 23. In line 22 the parents of the nodes in C ∖ C 1 are added to G. We can now note that in each subtree all nodes must have the same parents according to (10). Moreover the parents are added from the root and outwards, adding the parents to the leafs last. This gives that no edge X−Y , st X has a lower order than Y , can be oriented into X→Y or Y →X in a feasible flag-removing split. If both X, Y ∈ C 1 this follows from (e). If X ∉ C 1 or Y ∉ C 1 the edge can not be oriented to X→Y since Y only has a parent if X has the same parent. Nor can it be oriented to Y →X since ∃Z ∈ nbG (X) ∖ nbG (Y ) st Z has a lower order than X. That no mergings are legal before the last node receives its parents follows from that all parents are part of flags. When the last node X to receive parents, with the highest order in C, we can also note that no mergings can be legal. For the first two lines in footnote 1 this follows directly from the fact that every parent added by these lines are part of unshielded colliders, and hence no merging can be legal between C and some parent-component of C. For line 3 in footnote 1 it does on the other hand follow from the condition that the resulting graph must be a LDG. To see that paG (X) = paH (X) must hold when line 3 is no longer applicable assume the contrary, i.e. that there exists a node Y ∈ paH (X)∖paG (X). We then know that paH (X) ⊆ adH (Y ) must hold, or line 1 or 2 would have been applicable, and that paH (Y ) ⊆ paH (X), or the graph G ∪ {Y →X} must be a LDG. This does however also mean that a merge is legal in H, since ∀ai ∈ anH (X) ∪ C ∖ X bdG (ai ) = bdH (ai ), which is a contradiction. Hence we must have that ∀vi ∈ C paH (vi ) = paG (vi ) after line 22. Finally in line 23 the remaining undirected edges can be added to G since all nodes adjacent in C in H but not in G have the same set of parents. Hence no feasible splits can be flag-removing. Finally we do also have to prove what the conditions are for when the independence model of a LDG can be represented as a BN resp. a MN. Lemma 5. Given a LDG G there exists a BN H such that I(G) = I(H) iff G contains no flags and G contains no chordless undirected cycles. Proof. Since H is a BN and BNs is a subclass of AMP CGs we know that for I(G) = I(H) to hold H must be in the same Markov equivalence class as G if it is interpreted as an AMP CG. However, since G is a LDG and contains a flag we know that that flag must exist in every AMP CG in the Markov equivalence class of G. Hence I(G) = I(H) cannot hold if G contains a flag. Moreover it is also well known that the independence model of a MN containing chordless cycles cannot be represented as a BN. On the other hand, if G contains no flag and no chordless undirected cycles, then clearly all triplexes in G must be unshielded colliders. Hence, together with the fact that G can contain no semi-directed cycles, we can orient all undirected edges in G to directed edges without creating any semi-directed cycles as shown by Koller and Friedman [8, Theorem 4.13]. This means that the resulting graph must be a BN H such that I(G) = I(H). ◻ Lemma 6. Given a LDG G there exists a MN H such that I(G) = I(H) iff G contains no directed edges. Proof. This follows directly from the fact that G contains a triplex iff it also contains a directed edge and MNs cannot represent any triplexes. It is also clear that if G contains no directed edges than it can only contain undirected edges and hence be interpreted as a MN. ◻ Table 1: Exact and approximate ratios of AMP CG models whose independence mode can be represented as BNs, MNs, neither (in that order). NODES 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 4 EXACT 1.0000 1.0000 0.0000 1.0000 0.7273 0.0000 0.8259 0.2857 0.1607 0.5987 0.0689 0.3958 APPROXIMATE 1.0000 1.0000 0.0000 1.0000 0.7275 0.0000 0.8253 0.2839 0.1609 0.6019 0.0632 0.3931 0.4292 0.0112 0.5694 0.2987 0.0019 0.7010 0.2021 0.0002 0.7979 0.1373 0.0000 0.8627 0.0924 0.0000 0.9076 0.0604 0.0000 0.9396 0.0394 0.0000 0.9606 0.0251 0.0000 0.9749 0.0152 0.0000 0.9849 0.0106 0.0000 0.9894 0.0065 0.0000 0.9935 0.0044 0.0000 0.9956 0.0028 0.0000 0.9972 0.0017 0.0000 0.9983 0.0010 0.0000 0.9990 Results Using the Markov chain presented in the previous section we were able to sample AMP CG models uniformly for a set number of nodes for up to 20 nodes. For each number of nodes 105 samples were sampled with 105 transitions between each sampled sample. The success rate for the transitions, i.e. the percentage of transitions such that H ≠ G, was for 20 nodes approximately 10% and this corresponds well with earlier studies for LWF and MVR CGs [16]. To check if a graph G was a LDG it was first checked whether it was an AMP CG. If so then the LDG H in the Markov equivalence class of G was created whereafter it was checked whether H and G had the same structure. To construct the LDG of G the algorithm defined by Roverato and Studený was used [14]. The implementation was carried out in C++, run on an Intel Core i5 processor and it took approximately one week to complete the sampling of all sample sets. Moreover we also enumerated all LDGs for up to five nodes to allow a comparison between the approximated and exact values. The sampled graphs and code are available for public access at http ∶ //www.ida.liu.se/divisions/adit/data/graphs/CGSamplingResources. The results are shown in Tables 1 and 2. If we start with Table 1 we can here see the ratio of the number of BN models resp. MN models compared to number of AMP CG models. We can note that the approximation seems to be very accurate for up to five nodes but for more than five nodes we do not have any exact values to compare the approximations to. We can however plot the ratios of the number of BN models compared to the number of AMP CG models in a graph with logarithmic scales, as seen in Fig. 2. We can then see that the ratio follows the equation R = 5 ∗ 0.664n very closely, where n is the number of nodes and R the ratio. This means that the ratio decreases exponentially as the number of nodes increases and hence that the previous results seen for LWF and MVR CGs also hold for AMP CGs [16]. Fig. 2: The ratios of the number of BN models compared to the number of AMP CG models for different number of nodes. (Displayed with a logarithmic scale) If we go back to the table we can also note that the ratio of the number of MN models compared to the number of AMP CG models is almost zero for more than eight nodes similarly as seen in previous studies for LWF and MVR CGs. Moreover, if we compare the ratio of the number BN models to the number of CG models for the different CG interpretations we can see that the ratio is approximately 0.0017 for LWF CGs, 0.0011 for MVR CGs and 0.0010 for AMP CGs for 20 nodes [16]. This means that LWF CGs can represent the smallest amount of independence models while MVR and AMP CGs can represent about the same amount. Another interesting observation that can be made here is that the difference between the ratios of the number of BN models compared to the number of AMP CG models resp. MVR CG models is very small for any number of nodes. If we continue to Table 2 we can here see the exact and approximate ratios of AMP CG models to AMP CGs. This ratio can be calculated using the equation #CGmodels #BN s #BN models #CGmodels = ∗ ∗ #CGs #CGs #BN s #BN models (1) where #CGmodels represents the number of AMP CG models etc. The ratio #BN s can then be found using the iterative equations by Robinsson [13] and #CGs Steinsky [17] while #BN models #BN s can be found in previous studies for BN models #CGmodels [11]. Finally we can also get the ratio #BN by inverting the ratio found models in Table 1. If we study the values in Table 2 we can see that the ratio seems to converge to somewhere around 0.06 similarly as seen for the average number of BNs per BN model [16]. In the previous study estimating the average number Table 2: Exact and approximate ratios of AMP CG models to AMP CGs. NODES EXACT APPROXIMATE 2 0.5000 0.5074 0.2200 0.2235 3 4 0.1327 0.1312 0.1043 0.1008 5 6 0.0860 7 0.0775 0.0701 8 0.0668 9 10 0.0616 11 0.0594 12 0.0595 13 0.0608 0.0635 14 15 0.0559 16 0.0598 17 0.0557 18 0.0567 0.0596 19 20 0.0632 of LWF CGs per LWF CG model no such convergence has been seen [11]. We can however note that this study only goes up to 13 nodes but also that the approximated values for the LWF CG interpretation is considerably lower than 0.06. The ratio also shows that traversing the space of AMP CG models when learning CG structures is considerably more efficient than traversing the space of all AMP CGs. 5 Conclusion In this article we have approximated the ratio of the number of AMP CG models compared to the number of BN models for up to 20 nodes. This has been achieved by creating a Markov chain whose unique stationary distribution is the uniform distribution of all AMP CG models. The results show that the ratio of the number of independence models representable by AMP CGs compared to the corresponding number for BNs grows exponentially as the number of nodes increases. This confirms previous results for LWF and MVR CGs but a new, and unexpected, result is also that AMP and MVR CGs seem to be able to represent almost the same amount of independence models. It has previously been shown that there exist some independence models perfectly representable by both MVR and AMP CGs, but not BNs [15], and it would now be interesting to see how large this set is compared to the set of all AMP resp. MVR CG models. In addition to the comparison of the number of AMP CG models to the number of BN models we have also approximated the average number of AMP CGs per CG model. The results indicate that the average ratio converges to somewhere around 0.06. Such a convergence has not previously been seen for any other CG interpretation but it has been found to be ≈ 0.25 for BNs [11]. With these new results sampling methods for CG models of all CG interpretations are now available [11, 16]. This opens up for MCMC based structure learning algorithms for the different CG interpretations that traverses the space of CG models instead of the space of CGs. More importantly it does also open up for further studies on what independence models and systems the different CG interpretations can represent, which is an important question where the answer is still unclear. Acknowledgments This work is funded by the Swedish Research Council (ref. 2010-4808). Bibliography [1] Andersson, S.A., Madigan, D., Perlman, M.D.: An alternative markov property for chain graphs. Scandinavian Journal of Statistics 28, 33–85 (2001) [2] Andersson, S.A., Perlman, M.D.: Characterizing Markov Equivalence Classes For AMP Chain Graph Models. The Annals of Statistics 34, 939–972 (2006) [3] Cox, D.R., Wermuth, N.: Linear Dependencies Represented by Chain Graphs. Statistical Science 8, 204–283 (1993) [4] Cox, D.R., Wermuth, N.: Multivariate Dependencies: Models, Analysis and Interpretation. Chapman and Hall (1996) [5] Drton, M.: Discrete Chain Graph Models. Bernoulli 15, 736–753 (2009) [6] Frydenberg, M.: The Chain Graph Markov Property. Scandinavian Journal of Statistics 17, 333–353 (1990) [7] Häggström, O.: Finite Markov Chains and Algorithmic Applications. Campbridge University Press (2002) [8] Koller, D., Friedman, N.: Probabilistic Graphcal Models. MIT Press (2009) [9] Lauritzen, S.L., Wermuth, N.: Graphical Models for Association Between Variables, Some of Which are Qualitative and Some Quantitative. The Annals of Statistics 17, 31–57 (1989) [10] Ma, Z., Xie, X., Geng, Z.: Structural Learning of Chain Graphs via Decomposition. Journal of Machine Learning Research 9, 2847–2880 (2008) [11] Peña, J.M.: Approximate Counting of Graphical Models Via MCMC. In: Proceedings of the 11th International Conference on Artificial Intelligence and Statistics. pp. 352–359 (2007) [12] Peña, J.M., Sonntag, D., Nielsen, J.: An Inclusion Optimal Algorithm for Chain Graph Structure Learning. In: Proceedings of the 17th International Conference on Artificial Intelligence and Statistics. pp. 778–786 (2014) [13] Robinson, R.W.: Counting Labeled Acyclic Digraphs. New Directions in the Theory of Graphs pp. 239–273 (1973) [14] Roverato, A., Studený, M.: A Graphical Representation of Equivalence Classes of AMP Chain Graphs. Journal of Machine Learning Research 7, 1045–1078 (2006) [15] Sonntag, D., Peña, J.M.: Chain Graph Interpretations and Their Relations (2014), under review. [16] Sonntag, D., Peña, J.M., Gómez-Olmedo, M.: Approximate Counting of Graphical Models Via MCMC Revisited. Under Review (2014) [17] Steinsky, B.: Enumeration of Labelled Chain Graphs and Labelled Essential Directed Acyclic Graphs. Discrete Mathematics 270, 266–277 (2003)