
Model Complexity of Pseudo-independent Models
Jae-Hyuck Lee and Yang Xiang
Department of Computing and Information Science
University of Guelph, Guelph, Canada
{jaehyuck, yxiang}@cis.uoguelph.ca
Abstract
A type of problem domain known as a pseudo-independent
(PI) domain poses difficulty for common probabilistic learning methods based on the single-link lookahead search. To
learn this type of domain model, a learning method based
on the multiple-link lookahead search is needed. An improved result can be obtained by incorporating model complexity into a scoring metric to explicitly trade off model accuracy for complexity and vice versa during selection of the
best model among candidates at each learning step. To implement this scoring metric for the PI-learning method, the complexity formula for PI models is required. Previous studies
found the complexity formula for full PI models, one of the
three major types of PI models (the other two are partial and
mixed PI models). This study presents the complexity formula for atomic partial PI models, partial PI models that contain no embedded PI submodels. The complexity is obtained
by arithmetic operations on the spaces of domain variables.
The new formula provides the basis for further characterizing the complexity of non-atomic PI models, which contain
embedded PI submodels in their domains.
Keywords: probabilistic reasoning, knowledge discovery,
data mining, machine learning, belief networks, model complexity.
Introduction
Learning probabilistic networks (Lam & Bacchus 1994;
Cooper & Herskovits 1992; Heckerman, Geiger, & Chickering 1995; Friedman, Murphy, & Russell 1998) has been
an active area of research recently. The task of learning
networks is NP-hard (Chickering, Geiger, & Heckerman
1995). Therefore, learning algorithms use a heuristic search,
and the common search method is the single-link lookahead
which generates network structures that differ by a single
link at each level of the search.
Pseudo-independent (PI) models (Xiang, Wong, & Cercone 1996) are a class of probabilistic domain models where
a group of marginally independent variables displays collective dependency. PI models can be classified into three
types: full, partial, and mixed PI models based on the pattern of marginal independency and the extent of collective
dependency. The most restrictive type is full PI models
where every proper subset of variables is marginally independent. This is relaxed in partial PI models, where not
every proper subset of variables is marginally independent.
While in full or partial PI models, all variables in the domain are collectively dependent, this is not the case in mixed
PI models where only proper subsets, called embedded PI
subdomains, of domain variables are collectively dependent.
However, the marginal independency pattern of each embedded PI subdomain in mixed PI models is that of either a full
or a partial PI model.
PI models cannot be learned by the single-link lookahead
search because the underlying collective dependency cannot
be recovered. Incorrectly learned models introduce silent
errors when used for decision making. To learn PI models,
a more sophisticated search method called multi-link lookahead (Xiang, Wong, & Cercone 1997) should be used. It was
implemented in the learning algorithm called RML (Xiang
et al. 2000). The algorithm is equipped with the Kullback-Leibler cross entropy as the scoring metric for the goodness-of-fit to data.
The scoring metric of the learning algorithm can be improved by combining cross entropy as a measure of model
accuracy with a measure of model complexity. A weighted
sum of the two measures is a simple way of combining them, and
other alternatives are also possible. By using such a scoring
metric, between two models of the same accuracy (as measured by cross entropy), the one with less complexity will
end up with a higher score and be preferred. The focus of
this paper is on the assessment of the model complexity and
we defer the details of combination to future work.
Model complexity is defined by the number of parameters required to fully specify a model. In previous work (Xiang 1997), a formula was presented for estimating the number of parameters in full PI models, one of the three types
of PI models (the other two are partial and mixed PI models). However, the formula was very complex, and did not
show the structural dependence relationships among parameters. A new, concise formula for full PI models was recently presented (Xiang, Lee, & Cercone 2004) using a hypercube (Xiang, Lee, & Cercone 2003).
In this study, we present the model complexity formula
for atomic partial PI models, partial PI models that contain
no embedded PI submodels. Atomic partial PI models are
the building blocks of mixed PI models. This new formula
is simple in form, and provides good insight into the structural dependency relationships among parameters of partial
PI models. Furthermore, the previous complexity formula
for full PI models is integrated into this new formula. In
other words, when the conditions of full PI models are substituted in, the new formula reduces to the formula for full PI
models; this also confirms that full PI models are a special
case of partial PI models. In addition, the new formula provides
the basis for further characterizing the complexity of mixed
PI models. We apply the hypercube method to show how
the complexity of partial PI models can be acquired from
the spaces of variables.
Background
Let V be a set of n discrete variables X1 , . . . , Xn (in
what follows we will focus on domains of finite and discrete variables). Each variable Xi has a finite space Si =
{xi,1 , xi,2 , . . . , xi,Di } of cardinality Di . The space of a set
V of variables is defined by the Cartesian product of the
spaces of all variables in V , that is, SV = S1 × · · · × Sn
(or ∏i Si ). Thus, SV contains the tuples made of all possible
combinations of values of the variables in V . Each tuple is
called a configuration of V , denoted by (x1 , . . . , xn ).
Let P (Xi ) denote the probability function over Xi and
P (xi ) denote the probability value P (Xi = xi ). The following axiom of probability is called the total probability
law:
P (Si ) = 1, or P (xi,1 )+P (xi,2 )+· · ·+P (xi,Di ) = 1. (1)
A probabilistic domain model (PDM) M over V defines the
probability values of every configuration for every subset
A ⊂ V . P (V ) or P (X1 , . . . , Xn ) refers to the joint probability distribution (JPD) function over X1 , . . . , Xn , and
P (x1 , . . . , xn ) refers to the joint probability value of the
configuration (x1 , ..., xn ). The probability function P (A)
over any proper subset A ⊂ V refers to the marginal
probability distribution (MPD) function over A. If A =
{X1 , . . . , Xm } (A ⊂ V ), then P (x1 , . . . , xm ) refers to the
marginal probability value.
A set of probability values that directly specifies a PDM
is called parameters of the PDM. A joint probability value
P (x1 , . . . , xn ) is referred to as a joint parameter or joint and
a marginal probability value P (x1 , . . . , xm ) as a marginal
parameter or marginal. Among parameters associated with
a PDM, some parameters can be derived from others by using constraints such as the total probability law. Such derivable parameters are called constrained or dependent parameters while underivable parameters are called unconstrained,
free, or independent parameters. The number of independent
parameters of a PDM is called the model complexity of the
PDM, denoted as ω.
When no information about the constraints on a general PDM
is given, the PDM must be specified by joint parameters alone. The following ωg gives the number of joint parameters required: Let M be a general PDM over V =
{X1 , . . . , Xn }. Then the number of independent parameters of M is upper-bounded by

ωg = ∏_{i=1}^{n} Di − 1.    (2)
One joint is dependent since it can be derived from others by
the total probability law (Eq. (1)).
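To make the counting concrete, the following small Python sketch (ours, not part of the original paper; the function name omega_general is our own) computes ωg from a list of variable space cardinalities according to Eq. (2):

from math import prod

def omega_general(cardinalities):
    # Number of independent joint parameters of a general PDM (Eq. (2)):
    # the product of the space cardinalities minus the one joint parameter
    # that is fixed by the total probability law.
    return prod(cardinalities) - 1

print(omega_general([3, 3, 2]))  # 17, the figure used in the worked example later

For instance, three variables with cardinalities 3, 3, and 2 give ωg = 3 · 3 · 2 − 1 = 17.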
For any three disjoint subsets of variables A, B and C in
V , A and B are called conditionally independent given C,
denoted by I(A, B | C), iff P (A|B, C) = P (A|C) for all
values in A, B and C such that P (B, C) > 0. Given subsets
of variables A, B, C, D, E ⊆ V , the following property of
conditional independence is called Symmetry:
I(A, B | C) ⇔ I(B, A | C);    (3)
and the following is Composition:
I(A, B | C) ∧ I(A, D | C) ⇒ I(A, B ∪ D | C).    (4)
Two disjoint subsets A and B are said to be marginally
independent, denoted by I(A, B | ∅), iff P (A|B) = P (A)
for all values A and B such that P (B) > 0. If two subsets of
variables are marginally independent, no dependency exists
between them. Hence, each subset can be modeled independently without losing information. If each variable Xi in a
set A is marginally independent of the rest, the variables in
A are said to be marginally independent. The probability
distribution over a set of marginally independent variables
can be written as the product of the marginal of each variable, that is, P (A) = ∏_{Xi ∈A} P (Xi ).
Variables in a set A are called generally dependent if
P (B|A \ B) ≠ P (B) for every proper subset B ⊂ A. If
a subset of variables is generally dependent, its proper subsets cannot be modeled independently without losing information. Variables in A are collectively dependent if, for
each proper subset B ⊂ A, there exists no proper subset
C ⊂ A \ B that satisfies P (B|A \ B) = P (B|C). Collective dependence prevents conditional independence and
modeling through proper subsets of variables.
A pseudo-independent (PI) model is a PDM where proper
subsets of a set of collectively dependent variables display
marginal independence (Xiang, Wong, & Cercone 1997).
Definition 1 (Full PI model). A PDM over a set V (|V | >
3) of variables is a full PI model if the following properties
(called axioms of full PI models) hold:
(SI ) Variables in any proper subset of V are marginally independent.
(SII ) Variables in V are collectively dependent.
The complexity of full PI models is given as follows:
Theorem 2 (Complexity of full PI models by Xiang, Lee,
& Cercone, 2004). Let a PDM M be a full PI model over
V = {X1 , . . . , Xn }. Then the number of independent parameters of M is upper-bounded by
ωf = ∏_{i=1}^{n} (Di − 1) + Σ_{i=1}^{n} (Di − 1).    (5)
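As a quick illustration (not from the paper; omega_full and the example cardinalities are ours), Eq. (5) can be evaluated directly from the variable space cardinalities:

from math import prod

def omega_full(cardinalities):
    # Upper bound on the number of independent parameters of a full PI model
    # (Eq. (5)): prod(Di - 1) joint parameters plus sum(Di - 1) marginal parameters.
    return prod(d - 1 for d in cardinalities) + sum(d - 1 for d in cardinalities)

print(omega_full([3, 3, 3]))  # 8 + 6 = 14, versus 3*3*3 - 1 = 26 for a general PDM

The gap between ωf and ωg widens quickly as the number of variables or their cardinalities grow, which is the point of exploiting PI structure.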
The axiom (SI ) of marginal independence is relaxed in
partial PI models, which are defined through a marginally independent partition.
Definition 3 (Marginally independent partition). Let V
(|V | > 3) be a set of variables and B = {B1 , . . . , Bm }
(m > 2) be a partition of V . B is a marginally independent
partition if for every subset A = {Xi,k | Xi,k ∈ Bk for k =
1, . . . , m}, variables in A are marginally independent. Each
partition block Bi in B is called a marginally independent
block.
A marginally independent partition of V groups variables
in V into m marginally independent blocks. The property of
marginally independent blocks is that if a subset A is formed
by taking one element from different blocks, then variables
in A are always marginally independent.
In a partial PI model, it is not necessary that every proper
subset is marginally independent.
Definition 4 (Partial PI model). A PDM over a set V
(|V | > 3) of variables is a partial PI model on V if the following properties (called axioms of partial PI models) hold:
(SI0 ) V can be partitioned into two or more marginally independent blocks.
(SII ) Variables in V are collectively dependent.
The following definition of the maximum marginally independent partition is needed later for obtaining the complexity of partial PI models:
Definition 5 (Maximum partition and minimum block).
Let B = {B1 , . . . , Bm } be a marginally independent
partition of a partial PI model over V . B is a maximum marginally independent partition if there exists no
marginally independent partition B′ of V such that |B| <
|B′|. The blocks of a maximum marginally independent partition are called the minimum marginally-independent blocks
or minimum blocks.
Complexity of atomic partial PI models
The following defines atomic partial PI models:
Definition 6 (Atomic PI model). A PDM M over a set V
(|V | ≥ 3) of variables is an atomic PI model if M is either a
full or partial PI model, and no collective dependency exists
in any proper subsets of V .
Because of the dependency within each block, the complexity of partial PI models is higher than that of full PI models, but
lower than that of general PDMs with variables of the same space
cardinalities, as we will see later.
The following lemma states that in a PDM that satisfies
Composition (Eq. (4)), if every pair of variables between two
subsets are marginally independent, then the two subsets are
marginally independent.
Lemma 7 (Marginal independence of subsets). Let M be
a PDM over V where Composition holds in every subset.
Let Bα = {Y1 , . . . , Ys } and Bβ = {Z1 , . . . , Zt } denote any
two disjoint nonempty subsets of variables in V . If I(Yi , Zj |
∅) holds for every pair (Yi , Zj ), then
I(Bα , Bβ | ∅).    (6)
Proof. We prove that I(Yi , Zj | ∅) for every (Yi , Zj )
implies I(Yi , Bβ | ∅) and that I(Yi , Bβ | ∅) for every Yi
implies I(Bα , Bβ | ∅).
Applying Composition recursively from I(Yi , Z1 | ∅) to
I(Yi , Zt | ∅) gives I(Yi , Z1 ∪· · ·∪Zt | ∅) or I(Yi , Bβ | ∅).
By Symmetry (Eq. (3)), I(Yi , Bβ | ∅) is equivalent to
I(Bβ , Yi | ∅). In the same manner, by applying Composition recursively from I(Bβ , Y1 | ∅) to I(Bβ , Ys | ∅) gives
I(Bβ , Y1 ∪ · · · ∪ Ys | ∅) or I(Bβ , Bα | ∅). By Symmetry this
is equivalent to I(Bα , Bβ | ∅).
□
Lemma 7 assumes that the PDM satisfies Composition.
Composition implies no collective dependency in the PDM
as follows:
Lemma 8 (Composition implies no collective dependency). Let M be a PDM over V that satisfies Composition. Then no collective dependency exists in V .
Proof. This Lemma directly follows from the definition
of Composition (Eq. (4)).
□
The following Lemmas 9 and 10 are required for proving
Lemma 11 on marginal independence of blocks. The two
lemmas state that the marginal independency of partition is
preserved after removing a proper subset of variables from
the partition blocks and after merging the blocks. These
lemmas hold because the marginal independency of a partition is defined by the independency of any set formed by one variable
taken from each of the partition blocks.
Lemma 9 (Marginal independency of any subpartitions). Let a PDM M be a partial PI model over V =
{X1 , . . . , Xn } (n > 3) with a marginally-independent partition B = {B1 , . . . , Bm } (m > 2). A new partition
B′ = {B′1 , . . . , B′m } over V ′ (V ′ ⊆ V ) can be defined by
removing a proper subset of variables from one or more partition blocks such that B′i ⊆ Bi for every B′i (i = 1, . . . , m).
Then, B′ is also a marginally-independent partition called
a subpartition.
Lemma 10 (Marginal independency after merging). Let
a PDM M be a partial PI model over V = {X1 , . . . , Xn }
(n > 3) with a marginally-independent partition B =
{B1 , . . . , Bm } (m > 2). A new partition B̂ =
{B̂1 , . . . , B̂r } (r < m) over V can be defined by merging
one or more blocks in B. Then, B̂ is also a marginally-independent partition.
The following lemma states that in an atomic partial PI
model, any two distinct blocks are marginally independent when
there are more than two blocks; when there are exactly two blocks,
removing one variable from either block makes the two resulting blocks marginally independent.
Lemma 11 (Marginal independence between two independent blocks). Let a PDM M be an atomic partial PI
model over V = {X1 , . . . , Xn } (n > 3), where Composition holds in every proper subset. Let a marginally-independent partition be denoted by B = {B1 , . . . , Bm }.
When m > 2, for any two distinct blocks Br and Bq in B,
I(Br , Bq | ∅).    (7a)
When m = 2, that is B = {B1 , B2 }, for any Xi ∈ B1
or any Xj ∈ B2 ,
I(B1 \ {Xi }, B2 | ∅) and I(B1 , B2 \ {Xj } | ∅).    (7b)
Proof.
• Case 1, where m > 2:
Let Br = {Y1 , . . . , Ys } and Bq = {Z1 , . . . , Zt }. Let
V ′ = Br ∪ Bq . By the definition of partial PI models,
I(Y, Z | ∅) holds for every Y ∈ Br and Z ∈ Bq . Since V ′ ⊂ V , Composition holds in V ′ . By Lemma 7, I(Br , Bq | ∅) must
hold.
• Case 2, where m = 2: Let V ′i = (B1 \ {Xi }) ∪ B2 and V ′j = B1 ∪ (B2 \ {Xj }). Then V ′i ⊂ V and V ′j ⊂ V . Therefore, by the
same argument as Case 1, both I(B1 \ {Xi }, B2 | ∅) and
I(B1 , B2 \ {Xj } | ∅) hold.
□
As shown in the parameterization of full PI models (Xiang, Lee, & Cercone 2004), a sum of joint probabilities can
be represented by a product of marginal probabilities. While
the size of joint spaces grows exponentially in the number
of variables, the size of marginal spaces grows only linearly.
This results in the lesser complexity of a full PI model compared with a general PDM. A similar joint-marginal relationship holds for partial PI models. However, the marginal
probability is from each partition block, not from each variable as is the case with full PI models. This is shown in the
following lemma:
Lemma 12 (Joint-marginal equality of partial PI models). Let a PDM M be an atomic partial PI model over
V = {X1 , . . . , Xn } (n > 3), where Composition holds in
every proper subset. Let a marginally-independent partition
be denoted by B = {B1 , . . . , Bm } (m > 2). For an arbi0
} denote a subpartition of
trary Xi , let B 0 = {B10 , . . . , Bm
B made by removing an Xi from B so that every Bi0 is the
same as Bi except the block from which Xi was removed.
Then,
Di
X
0
P (X1 , . . . , xi,k , . . . , Xn ) = P (B10 ) · · · P (Bm
). (8)
k=1
Proof. The summation Σ_{k=1}^{Di} P (X1 , . . . , xi,k , . . . , Xn )
at the left represents marginalization on a variable Xi . This
is equal to P (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) or in partitional
notation P (B′1 , . . . , B′m ).
The proof by induction is as follows:
• Base case where m = 2: We need to show P (B′1 , B′2 ) =
P (B′1 )P (B′2 ). This is equivalent to I(B′1 , B′2 | ∅), which is
obvious by Lemma 11.
• Induction hypothesis: Assume the following holds for
m = k:

P (B′1 , . . . , B′k ) = P (B′1 ) · · · P (B′k ).    (9)

• Induction step: We need to show the following holds for
m = k + 1:

P (B′1 , . . . , B′k+1 ) = P (B′1 ) · · · P (B′k+1 ).

Merging {B′1 , . . . , B′k } into one block and applying Lemmas 9, 10, 11 gives

P (B′1 , . . . , B′k , B′k+1 ) = P (B′1 , . . . , B′k )P (B′k+1 ).

By the induction hypothesis (Eq. (9)),

P (B′1 , . . . , B′k )P (B′k+1 ) = P (B′1 ) · · · P (B′k )P (B′k+1 ).

Therefore, by mathematical induction from m = 2, k and
k + 1, Eq. (8) must hold for any m.
□
Figure 1: A partial PI model over {X1 , X2 , X3 } with partition blocks B1 and B2 . (The dotted circle depicts each partition block.)
Corollary 13, which directly follows from Lemma 12,
shows the relation between joint parameters and marginal
parameters, by which joint parameters can be derived from
other marginal and joint parameters.
Corollary 13 (Partial PI joint constraint). Let a PDM M
be an atomic partial PI model over V = {X1 , . . . , Xn }
(n > 3), where Composition holds in every proper subset. Let a marginally-independent partition be denoted by
B = {B1 , . . . , Bm } (m > 2). For an arbitrary Xi , let
B′ = {B′1 , . . . , B′m } denote a subpartition of B made by
removing an Xi from B so that every B′i is the same as Bi
except the block from which Xi was removed. Then,

P (X1 , . . . , xi,r , . . . , Xn ) = P (B′1 ) · · · P (B′m ) − Σ_{k=1, k≠r}^{Di} P (X1 , . . . , xi,k , . . . , Xn ).    (10)
We are to derive the number of independent parameters
for specifying a PDM by using the constraint (Corollary 13)
on atomic partial PI domains. First, we determine the number of independent marginal parameters, denoted as ωm ,
that are required to specify all P (B′1 ), . . . , P (B′m ) terms in
Corollary 13. Next, we determine the number of joint parameters that cannot be derived from ωm plus other joint
parameters. In a hypercube, this is equivalent to counting
the number of cells that cannot be derived since a cell in a
hypercube corresponds to a joint parameter in the JPD. The
procedure is as follows:
(i) Check cells one by one to see whether each can be derived
from ωm and any other cells by applying Corollary 13.
(ii) As soon as a cell is determined to be derivable, it is
eliminated from further consideration.
Repeat this procedure until no more cells can be eliminated.
The remaining cells and the ωm marginal parameters constitute the total number of independent parameters of the partial PI model.
For example, consider the partial PI model in Figure 1,
which corresponds to the hypercube in Figure 2(a). The
PDM consists of three variables {X1 , X2 , X3 }; X1 , X2 are
ternary, and X3 is binary. The marginally-independent partition B = {B1 , B2 } or {{X1 }, {X2 , X3 }}. For this PDM,
ωm = (3 − 1) + (3 × 2 − 1) = 7. For example, to specify
P (X1 ) for B1 , 2 marginals are required such as:
P (x1,1 ), P (x1,2 );

and to specify P (X2 , X3 ) for B2 , 5 parameters are required such as:

P (x2,1 , x3,1 ), P (x2,2 , x3,1 ), P (x2,3 , x3,1 ), P (x2,1 , x3,2 ), P (x2,2 , x3,2 ).    (11)

We assume that the 7 marginal parameters have been specified, and thus the other 2 marginal parameters P (x1,3 ) and P (x2,3 , x3,2 ) can be derived by the total probability law (Eq. (1)). Then, by marginalization, the following single-variable marginal parameters can also be derived from the 5 parameters in (11) plus P (x2,3 , x3,2 ):

P (x2,1 ), P (x2,2 ), P (x2,3 ), P (x3,1 ), P (x3,2 ).

We refer to the set of cells with the identical value Xi = xi,j as the hyperplane at Xi = xi,j in the hypercube. For example, the hyperplane at X1 = x1,3 in Figure 2(a) refers to the following 6 cells. (We use p(i, j, k) as an abbreviation for P (x1,i , x2,j , x3,k ).):

p(3, 1, 1), p(3, 2, 1), p(3, 3, 1), p(3, 1, 2), p(3, 2, 2), p(3, 3, 2).

By Corollary 13, we have

p(3, 1, 1) = P (x2,1 , x3,1 ) − (p(1, 1, 1) + p(2, 1, 1)).

That is, the cell at the front-lower-left corner can be derived from the two cells behind it and the marginal parameters. All other cells on the hyperplane at X1 = x1,3 can be similarly derived. Therefore, these 6 cells can be eliminated from further consideration. The remaining 12 cells are shown in Figure 2(b).

Figure 2: Eliminating derivable cells from a JPD hypercube. (a) The original hypercube (3 × 3 × 2). (b) The hypercube after eliminating X1 = x1,3 .

Using the same idea, four of the remaining cells at X2 = x2,3 can be derived, and therefore eliminated from further consideration. The remaining 8 cells are shown in Figure 3(a). Again, four of the remaining cells at X3 = x3,2 can be derived. After eliminating them, only 4 cells are left, as shown in Figure 3(b): p(1, 1, 1), p(2, 1, 1), p(1, 2, 1), p(2, 2, 1). Since no more cells can be further eliminated, the number of independent parameters needed to specify the partial PI model is 11, with 7 marginal parameters and 4 joint parameters. Note it would take 17 parameters to specify the JPD of a general PDM over three variables of the same space cardinalities.

Figure 3: Eliminating derivable cells from a JPD hypercube. (a) The hypercube after eliminating X2 = x2,3 . (b) The hypercube after eliminating X3 = x3,2 .
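The elimination procedure above can be checked mechanically. The following Python sketch (ours, not the authors' implementation) enumerates the cells of the 3 × 3 × 2 hypercube, removes the hyperplane at the last value of each variable in the same order as the example, and then adds the ωm block marginals; the variable names and the index-based block encoding are our own:

from itertools import product
from math import prod

cards = [3, 3, 2]                      # |S1|, |S2|, |S3| from the example
blocks = [[0], [1, 2]]                 # {{X1}, {X2, X3}} as variable indices

# All cells (joint parameters) of the 3 x 3 x 2 hypercube.
cells = set(product(*[range(d) for d in cards]))

# For each variable, drop the hyperplane at its last value: by Corollary 13
# those cells are derivable from the remaining cells plus the block marginals.
for axis, d in enumerate(cards):
    cells = {c for c in cells if c[axis] != d - 1}

# Marginal parameters needed to specify each block distribution.
omega_m = sum(prod(cards[i] for i in block) - 1 for block in blocks)

print(sorted(cells))         # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
print(omega_m)               # 7 marginal parameters
print(omega_m + len(cells))  # 11 independent parameters in total
print(prod(cards) - 1)       # 17 for a general PDM over the same variables

The four remaining cells correspond (with 0-based indices) to p(1, 1, 1), p(2, 1, 1), p(1, 2, 1), p(2, 2, 1) in the example.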
Now we present the general result on the number of independent parameters of atomic partial PI models.
Theorem 14 (Complexity of atomic partial PI models).
Let a PDM M be an atomic partial PI model over V =
{X1 , . . . , Xn } (n > 3), where Composition holds in every
proper subset. Let D1 , . . . , Dn denote the cardinality of the
space of each variable. Let a marginally-independent partition of V be denoted by B = {B1 , . . . , Bm } (m > 2), and
the cardinality of the space of each block B1 , . . . , Bm be denoted by D(B1 ) , . . . , D(Bm ) , respectively. Then, the number
ωap of parameters required for specifying the JPD of M is
upper-bounded by
ωap = ∏_{i=1}^{n} (Di − 1) + Σ_{j=1}^{m} (D(Bj ) − 1).    (12)
Proof. Before proving the theorem, we explain the result
briefly. The first term on the right is the cardinality of the
joint space of a general PDM over the set of variables, except
that the space of each variable is reduced by one. This term is
the same as the first term in the model complexity formula for the
full PI model (Eq. (5)). The second term is the number of
marginal parameters for specifying the joint space of each
partition block.
What we need to show is how to derive all joint parameters with Σ_{j=1}^{m} (D(Bj ) − 1) marginals plus ∏_{i=1}^{n} (Di − 1) joints. First, Σ_{j=1}^{m} (D(Bj ) − 1) marginal parameters are required for specifying all P (B′1 ), . . . , P (B′m ) terms in Corollary 13. We construct a hypercube for M to apply Eq. (10) among groups of cells. Applying Corollary 13 and using a similar argument to that for the example in Figure 1, we can eliminate hyperplanes at X1 = x1,D1 , X2 = x2,D2 , . . . , Xn = xn,Dn in that order, such that for each Xi , all cells on the hyperplane at Xi = xi,Di can be derived from cells outside the hyperplane and the marginal parameters. The remaining cells form a hypercube whose length along the Xi axis is Di − 1 (i = 1, 2, . . . , n). Therefore, the total number of cells in this hypercube is ∏_{i=1}^{n} (Di − 1).
□
The following shows how to use Theorem 14 to compute
complexity of an atomic partial PI model.
Example 15 (Computing model complexity). Consider
the partial PI model in Figure 4. The domain consists of 6 variables from X1 to X6 , and X1 , X2 , X3 are
ternary; X4 , X5 are binary; and X6 is 5-ary. The domain has a marginally-independent partition {B1 , B2 , B3 }
or {{X1 , X3 }, {X2 , X4 , X5 }, {X6 }}.
Figure 4: A partial PI model with 3 blocks and a total of 6 variables. (The dotted circles depict each partition block.)
The number of marginal parameters for all partition
blocks is 23, given by (3 · 3 − 1) + (3 · 2 · 2 − 1) + (5 − 1).
The number of independent joint parameters is 32, given
by (2 − 1)² (3 − 1)³ (5 − 1). Therefore, the total number of parameters for specifying the domain in this example is 23 + 32 = 55. Compare this number with the number of parameters for specifying a general PDM over the
same set of variables by using the total probability law, giving 2² · 3³ · 5 − 1 = 539. This shows the complexity of a
partial PI model is significantly less than that of a general
PDM.
From the perspective of using a learned model, this means
the model allows faster inference and more accurate results,
requires less space to represent and store parameters during inference,
and provides more expressive power in a compact form.
□
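To reproduce the arithmetic in Example 15, here is a short Python sketch (ours, not from the paper; omega_atomic_partial and the index-based block encoding are our own) that evaluates Eq. (12) and compares it with a general PDM:

from math import prod

def omega_atomic_partial(cards, blocks):
    # Upper bound from Eq. (12): prod(Di - 1) joint parameters plus,
    # for each marginally independent block Bj, (D(Bj) - 1) marginal parameters.
    joints = prod(d - 1 for d in cards)
    marginals = sum(prod(cards[i] for i in block) - 1 for block in blocks)
    return joints + marginals

cards = [3, 3, 3, 2, 2, 5]                  # X1..X6 in Example 15
blocks = [[0, 2], [1, 3, 4], [5]]           # {{X1,X3}, {X2,X4,X5}, {X6}}
print(omega_atomic_partial(cards, blocks))  # 32 + 23 = 55
print(prod(cards) - 1)                      # 539 for a general PDM

# With singleton blocks the formula reduces to Eq. (5) for full PI models.
print(omega_atomic_partial(cards, [[i] for i in range(len(cards))]))  # 32 + 12 = 44

The last line illustrates the observation below: when every block is a singleton, the second term of Eq. (12) becomes Σ (Di − 1), which is exactly Eq. (5).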
Note that Theorem 14 also holds for full PI models since a
full PI model is a special case of partial PI models. The
proof of this is done by substituting D(Bj ) in Eq. (12) with
Di (every partition block of full PI models is a singleton),
yielding Eq. (5).
Conclusion
In this work, we present the complexity formula for atomic
partial PI models, the building blocks of non-atomic PI models. We employ the hypercube method for analyzing the
complexity of PI models.
Further work to be done in this line of research is to find
the complexity formula for non-atomic PI models. Since a
non-atomic PI model contains (recursively embedded) full
or partial PI submodel(s), the complexity can be acquired
by applying either the full or the atomic partial PI formula appropriately to each subdomain (recursively along the depth of embedding from the top to the bottom). The study of PI model
complexity will lead to a new generation of PI-learning algorithms equipped with a better scoring metric.
Acknowledgments
The authors are grateful to the anonymous reviewers for
their comments on this paper. Moreover, we especially
thank Mr. Chang-Yup Lee for his financial support. In addition,
this research is supported in part by NSERC of Canada.
References
Chickering, D.; Geiger, D.; and Heckerman, D. 1995.
Learning Bayesian networks: search methods and experimental results. In Proceedings of 5th Conference on Artificial Intelligence and Statistics, 112–128.
Cooper, G., and Herskovits, E. 1992. A Bayesian method
for the induction of probabilistic networks from data. Machine Learning 9:309–347.
Friedman, N.; Murphy, K.; and Russell, S. 1998. Learning
the structure of dynamic probabilistic networks. In Cooper,
G., and Moral, S., eds., Proceedings of 14th Conference on
Uncertainty in Artificial Intelligence, 139–147.
Heckerman, D.; Geiger, D.; and Chickering, D. 1995.
Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning 20:197–243.
Lam, W., and Bacchus, F. 1994. Learning Bayesian networks: an approach based on the MDL principle. Computational Intelligence 10(3):269–293.
Xiang, Y.; Hu, J.; Cercone, N.; and Hamilton, H. 2000.
Learning pseudo-independent models: analytical and experimental results. In Hamilton, H., ed., Advances in Artificial Intelligence. Springer. 227–239.
Xiang, Y.; Lee, J.; and Cercone, N. 2003. Parameterization of pseudo-independent models. In Proceedings of
16th Florida Artificial Intelligence Research Society Conference, 521–525.
Xiang, Y.; Lee, J.; and Cercone, N. 2004. Towards better
scoring metrics for pseudo-independent models. International Journal of Intelligent Systems 20.
Xiang, Y.; Wong, S.; and Cercone, N. 1996. Critical remarks on single link search in learning belief networks. In
Proceedings of 12th Conference on Uncertainty in Artificial Intelligence, 564–571.
Xiang, Y.; Wong, S.; and Cercone, N. 1997. A ‘microscopic’ study of minimum entropy search in learning decomposable Markov networks. Machine Learning
26(1):65–92.
Xiang, Y. 1997. Towards understanding of pseudo-independent domains. In Poster Proceedings of 10th International Symposium on Methodologies for Intelligent Systems.