
Model Complexity of Pseudo-independent Models
Jae-Hyuck Lee and Yang Xiang
Department of Computing and Information Science
University of Guelph, Guelph, Canada
{jaehyuck, yxiang}@cis.uoguelph.ca
Abstract
A type of problem domain known as a pseudo-independent
(PI) domain poses difficulty for common probabilistic learning methods based on the single-link lookahead search. To
learn this type of domain model, a learning method based
on the multiple-link lookahead search is needed. An improved result can be obtained by incorporating model complexity into a scoring metric to explicitly trade off model accuracy for complexity and vice versa during selection of the
best model among candidates at each learning step. To implement this scoring metric for the PI-learning method, the complexity formula for PI models is required. Previous studies
found the complexity formula for full PI models, one of the
three major types of PI models (the other two are partial and
mixed PI models). This study presents the complexity formula for atomic partial PI models, partial PI models that contain no embedded PI submodels. The complexity is obtained
by arithmetic operations on the spaces of domain variables.
The new formula provides the basis for further characterizing the complexity of non-atomic PI models, which contain
embedded PI submodels in their domains.
Keywords: probabilistic reasoning, knowledge discovery,
data mining, machine learning, belief networks, model complexity.
Introduction
Learning probabilistic networks (Lam & Bacchus 1994;
Cooper & Herskovits 1992; Heckerman, Geiger, & Chickering 1995; Friedman, Murphy, & Russell 1998) has been
an active area of research recently. The task of learning
networks is NP-hard (Chickering, Geiger, & Heckerman
1995). Therefore, learning algorithms use a heuristic search,
and the common search method is the single-link lookahead
which generates network structures that differ by a single
link at each level of the search.
Pseudo-independent (PI) models (Xiang, Wong, & Cercone 1996) are a class of probabilistic domain models where
a group of marginally independent variables displays collective dependency. PI models can be classified into three
types: full, partial, and mixed PI models based on the pattern of marginal independency and the extent of collective
dependency. The most restrictive type is full PI models
where every proper subset of variables is marginally independent. This is relaxed in partial PI models, where not
every proper subset of variables is marginally independent.
While in full or partial PI models, all variables in the domain are collectively dependent, this is not the case in mixed
PI models where only proper subsets, called embedded PI
subdomains, of domain variables are collectively dependent.
However, the marginal independency pattern of each embedded PI subdomain in mixed PI models is that of either a full
or a partial PI model.
PI models cannot be learned by the single-link lookahead
search because the underlying collective dependency cannot
be recovered. Incorrectly learned models introduce silent
errors when used for decision making. To learn PI models,
a more sophisticated search method called multi-link lookahead (Xiang, Wong, & Cercone 1997) should be used. It was
implemented in the learning algorithm called RML (Xiang
et al. 2000). The algorithm is equipped with the Kullback-Leibler cross entropy as the scoring metric for the goodness-of-fit to data.
The scoring metric of the learning algorithm can be improved by combining cross entropy as a measure of model
accuracy with a measure of model complexity. A weighted
sum of the two measures is a simple way of combining them, and
other alternatives are also possible. By using such a scoring
metric, between two models of the same accuracy (as measured by cross entropy), the one with less complexity will
end up with a higher score and be preferred. The focus of
this paper is on the assessment of the model complexity and
we defer the details of combination to future work.
Model complexity is defined by the number of parameters required to fully specify a model. In previous work (Xiang 1997), a formula was presented for estimating the number of parameters in full PI models, one of the three types
of PI models (the other two are partial and mixed PI models). However, the formula was very complex, and did not
show the structural dependence relationships among parameters. A new, concise formula for full PI models was recently presented (Xiang, Lee, & Cercone 2004) using a hypercube (Xiang, Lee, & Cercone 2003).
In this study, we present the model complexity formula
for atomic partial PI models, partial PI models that contain
no embedded PI submodels. Atomic partial PI models are
the building blocks of mixed PI models. This new formula
is simple in form, and provides good insight into the structural dependency relationships among parameters of partial
PI models. Furthermore, the previous complexity formula
for full PI models is integrated into this new formula. In
other words, when the conditions of full PI models are substituted in, the new formula reduces to the formula for full PI
models; this also confirms that full PI models are a special
case of partial PI models. In addition, the new formula provides
the basis for further characterizing the complexity of mixed
PI models. We apply the hypercube method to show how
the complexity of partial PI models can be acquired from
the spaces of variables.
Background
Let V be a set of n discrete variables X1 , . . . , Xn (in
what follows we will focus on domains of finite and discrete variables). Each variable Xi has a finite space Si =
{xi,1 , xi,2 , . . . , xi,Di } of cardinality Di . The space of a set
V of variables is defined by the Cartesian product of the
spaces of all variables in V , that is, SV = S1 × · · · × Sn
(or ∏i Si ). Thus, SV contains the tuples made of all possible
combinations of values of the variables in V . Each tuple is
called a configuration of V , denoted by (x1 , . . . , xn ).
Let P (Xi ) denote the probability function over Xi and
P (xi ) denote the probability value P (Xi = xi ). The following axiom of probability is called the total probability
law:
P (Si ) = 1, or P (xi,1 )+P (xi,2 )+· · ·+P (xi,Di ) = 1. (1)
A probabilistic domain model (PDM) M over V defines the
probability values of every configuration for every subset
A ⊂ V . P (V ) or P (X1 , . . . , Xn ) refers to the joint probability distribution (JPD) function over X1 , . . . , Xn , and
P (x1 , . . . , xn ) refers to the joint probability value of the
configuration (x1 , ..., xn ). The probability function P (A)
over any proper subset A ⊂ V refers to the marginal
probability distribution (MPD) function over A. If A =
{X1 , . . . , Xm } (A ⊂ V ), then P (x1 , . . . , xm ) refers to the
marginal probability value.
A set of probability values that directly specifies a PDM
is called parameters of the PDM. A joint probability value
P (x1 , . . . , xn ) is referred to as a joint parameter or joint and
a marginal probability value P (x1 , . . . , xm ) as a marginal
parameter or marginal. Among parameters associated with
a PDM, some parameters can be derived from others by using constraints such as the total probability law. Such derivable parameters are called constrained or dependent parameters while underivable parameters are called unconstrained,
free, or independent parameters. The number of independent
parameters of a PDM is called the model complexity of the
PDM, denoted as ω.
When no information about the constraints on a general PDM
is given, the PDM must be specified by joint parameters alone. The following ωg gives the number of joint parameters required: Let M be a general PDM over V =
{X1 , . . . , Xn }. Then the number of independent parameters of M is upper-bounded by

ωg = ∏_{i=1}^{n} Di − 1.    (2)
One joint is dependent since it can be derived from others by
the total probability law (Eq. (1)).
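To make the counting concrete, the following small Python sketch (ours, not part of the original paper; the function name omega_general is our own) computes ωg from a list of variable space cardinalities according to Eq. (2):

from math import prod

def omega_general(cardinalities):
    # Number of independent joint parameters of a general PDM (Eq. (2)):
    # the product of the space cardinalities minus the one joint parameter
    # that is fixed by the total probability law.
    return prod(cardinalities) - 1

print(omega_general([3, 3, 2]))  # 17, the figure used in the worked example later

For instance, three variables with cardinalities 3, 3, and 2 give ωg = 3 · 3 · 2 − 1 = 17.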
For any three disjoint subsets of variables A, B and C in
V , A and B are called conditionally independent given C,
denoted by I(A, B | C), iff P (A|B, C) = P (A|C) for all
values in A, B and C such that P (B, C) > 0. Given subsets
of variables A, B, C, D, E ⊆ V , the following property of
conditional independence is called Symmetry:
I(A, B | C) ⇔ I(B, A | C);    (3)
and the following is Composition:
I(A, B | C) ∧ I(A, D | C) ⇒ I(A, B ∪ D | C).    (4)
Two disjoint subsets A and B are said to be marginally
independent, denoted by I(A, B | ∅), iff P (A|B) = P (A)
for all values A and B such that P (B) > 0. If two subsets of
variables are marginally independent, no dependency exists
between them. Hence, each subset can be modeled independently without losing information. If each variable Xi in a
set A is marginally independent of the rest, the variables in
A are said to be marginally independent. The probability
distribution over a set of marginally independent variables
can be written as the product of the marginal of each variable, that is, P (A) = ∏_{Xi ∈A} P (Xi ).
Variables in a set A are called generally dependent if
P (B|A \ B) ≠ P (B) for every proper subset B ⊂ A. If
a subset of variables is generally dependent, its proper subsets cannot be modeled independently without losing information. Variables in A are collectively dependent if, for
each proper subset B ⊂ A, there exists no proper subset
C ⊂ A \ B that satisfies P (B|A \ B) = P (B|C). Collective dependence prevents conditional independence and
modeling through proper subsets of variables.
A pseudo-independent (PI) model is a PDM where proper
subsets of a set of collectively dependent variables display
marginal independence (Xiang, Wong, & Cercone 1997).
Definition 1 (Full PI model). A PDM over a set V (|V | >
3) of variables is a full PI model if the following properties
(called axioms of full PI models) hold:
(SI ) Variables in any proper subset of V are marginally independent.
(SII ) Variables in V are collectively dependent.
The complexity of full PI models is given as follows:
Theorem 2 (Complexity of full PI models by Xiang, Lee,
& Cercone, 2004). Let a PDM M be a full PI model over
V = {X1 , . . . , Xn }. Then the number of independent parameters of M is upper-bounded by
ωf = ∏_{i=1}^{n} (Di − 1) + Σ_{i=1}^{n} (Di − 1).    (5)
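As a quick illustration (not from the paper; omega_full and the example cardinalities are ours), Eq. (5) can be evaluated directly from the variable space cardinalities:

from math import prod

def omega_full(cardinalities):
    # Upper bound on the number of independent parameters of a full PI model
    # (Eq. (5)): prod(Di - 1) joint parameters plus sum(Di - 1) marginal parameters.
    return prod(d - 1 for d in cardinalities) + sum(d - 1 for d in cardinalities)

print(omega_full([3, 3, 3]))  # 8 + 6 = 14, versus 3*3*3 - 1 = 26 for a general PDM

The gap between ωf and ωg widens quickly as the number of variables or their cardinalities grow, which is the point of exploiting PI structure.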
The axiom (SI ) of marginal independence is relaxed in
partial PI models, which are defined through a marginally independent partition.
Definition 3 (Marginally independent partition). Let V
(|V | > 3) be a set of variables and B = {B1 , . . . , Bm }
(m > 2) be a partition of V . B is a marginally independent
partition if for every subset A = {Xi,k | Xi,k ∈ Bk for k =
1, . . . , m}, variables in A are marginally independent. Each
partition block Bi in B is called a marginally independent
block.
A marginally independent partition of V groups variables
in V into m marginally independent blocks. The property of
marginally independent blocks is that if a subset A is formed
by taking one element from different blocks, then variables
in A are always marginally independent.
In a partial PI model, it is not necessary that every proper
subset is marginally independent.
Definition 4 (Partial PI model). A PDM over a set V
(|V | > 3) of variables is a partial PI model on V if the following properties (called axioms of partial PI models) hold:
(SI0 ) V can be partitioned into two or more marginally independent blocks.
(SII ) Variables in V are collectively dependent.
The following definition of the maximum marginally independent partition is needed later for obtaining the complexity of partial PI models:
Definition 5 (Maximum partition and minimum block).
Let B = {B1 , . . . , Bm } be a marginally independent
partition of a partial PI model over V . B is a maximum marginally independent partition if there exists no
marginally independent partition B′ of V such that |B| <
|B′|. The blocks of a maximum marginally independent partition are called the minimum marginally-independent blocks
or minimum blocks.
Complexity of atomic partial PI models
The following defines atomic partial PI models:
Definition 6 (Atomic PI model). A PDM M over a set V
(|V | ≥ 3) of variables is an atomic PI model if M is either a
full or partial PI model, and no collective dependency exists
in any proper subsets of V .
Because of the dependency within each block, the complexity of partial PI models is higher than that of full PI models, but
lower than that of general PDMs with variables of the same space
cardinalities, as we will see later.
The following lemma states that in a PDM that satisfies
Composition (Eq. (4)), if every pair of variables between two
subsets are marginally independent, then the two subsets are
marginally independent.
Lemma 7 (Marginal independence of subsets). Let M be
a PDM over V where Composition holds in every subset.
Let Bα = {Y1 , . . . , Ys } and Bβ = {Z1 , . . . , Zt } denote any
two disjoint nonempty subsets of variables in V . If I(Yi , Zj |
∅) holds for every pair (Yi , Zj ), then
I(Bα , Bβ | ∅).    (6)
Proof. We prove that I(Yi , Zj | ∅) for every (Yi , Zj )
implies I(Yi , Bβ | ∅) and that I(Yi , Bβ | ∅) for every Yi
implies I(Bα , Bβ | ∅).
Applying Composition recursively from I(Yi , Z1 | ∅) to
I(Yi , Zt | ∅) gives I(Yi , Z1 ∪· · ·∪Zt | ∅) or I(Yi , Bβ | ∅).
By Symmetry (Eq. (3)), I(Yi , Bβ | ∅) is equivalent to
I(Bβ , Yi | ∅). In the same manner, by applying Composition recursively from I(Bβ , Y1 | ∅) to I(Bβ , Ys | ∅) gives
I(Bβ , Y1 ∪ · · · ∪ Ys | ∅) or I(Bβ , Bα | ∅). By Symmetry this
is equivalent to I(Bα , Bβ | ∅).
□
Lemma 7 assumes that the PDM satisfies Composition.
Composition implies no collective dependency in the PDM
as follows:
Lemma 8 (Composition implies no collective dependency). Let M be a PDM over V that satisfies Composition. Then no collective dependency exists in V .
Proof. This Lemma directly follows from the definition
of Composition (Eq. (4)).
□
The following Lemmas 9 and 10 are required for proving
Lemma 11 on marginal independence of blocks. The two
lemmas state that the marginal independency of partition is
preserved after removing a proper subset of variables from
the partition blocks and after merging the blocks. These
lemmas hold because the marginal independency of a partition is defined by the independency of any set formed by one variable
taken from each of the partition blocks.
Lemma 9 (Marginal independency of any subpartitions). Let a PDM M be a partial PI model over V =
{X1 , . . . , Xn } (n > 3) with a marginally-independent partition B = {B1 , . . . , Bm } (m > 2). A new partition
B′ = {B′1 , . . . , B′m } over V ′ (V ′ ⊆ V ) can be defined by
removing a proper subset of variables from one or more partition blocks such that B′i ⊆ Bi for every B′i (i = 1, . . . , m).
Then, B′ is also a marginally-independent partition called
a subpartition.
Lemma 10 (Marginal independency after merging). Let
a PDM M be a partial PI model over V = {X1 , . . . , Xn }
(n > 3) with a marginally-independent partition B =
{B1 , . . . , Bm } (m > 2). A new partition B̂ =
{B̂1 , . . . , B̂r } (r < m) over V can be defined by merging
one or more blocks in B. Then, B̂ is also a marginally-independent partition.
The following lemma states that in an atomic partial PI
model, any two distinct blocks are marginally independent when
there are more than two blocks; when there are exactly two blocks,
removing one variable from either block makes the two resulting blocks marginally independent.
Lemma 11 (Marginal independence between two independent blocks). Let a PDM M be an atomic partial PI
model over V = {X1 , . . . , Xn } (n > 3), where Composition holds in every proper subset. Let a marginally-independent partition be denoted by B = {B1 , . . . , Bm }.
When m > 2, for any two distinct blocks Br and Bq in B,
I(Br , Bq | ∅).    (7a)
When m = 2, that is B = {B1 , B2 }, for any Xi ∈ B1
or any Xj ∈ B2 ,
I(B1 \ {Xi }, B2 | ∅) and I(B1 , B2 \ {Xj } | ∅).    (7b)
Proof.
• Case 1, where m > 2:
Let Br = {Y1 , . . . , Ys } and Bq = {Z1 , . . . , Zt }. Let
V ′ = Br ∪ Bq . By the definition of partial PI models,
I(Y, Z | ∅) holds for every Y ∈ Br and Z ∈ Bq . Since V ′ ⊂ V , Composition holds in V ′ . By Lemma 7, I(Br , Bq | ∅) must
hold.
• Case 2, where m = 2: Let V ′i = (B1 \ {Xi }) ∪ B2 and V ′j = B1 ∪ (B2 \ {Xj }). Then V ′i ⊂ V and V ′j ⊂ V . Therefore, by the
same argument as Case 1, both I(B1 \ {Xi }, B2 | ∅) and
I(B1 , B2 \ {Xj } | ∅) hold.
□
As shown in the parameterization of full PI models (Xiang, Lee, & Cercone 2004), a sum of joint probabilities can
be represented by a product of marginal probabilities. While
the size of joint spaces grows exponentially in the number
of variables, the size of marginal spaces grows only linearly.
This results in the lesser complexity of a full PI model compared with a general PDM. A similar joint-marginal relationship holds for partial PI models. However, the marginal
probability is from each partition block, not from each variable as is the case with full PI models. This is shown in the
following lemma:
Lemma 12 (Joint-marginal equality of partial PI models). Let a PDM M be an atomic partial PI model over
V = {X1 , . . . , Xn } (n > 3), where Composition holds in
every proper subset. Let a marginally-independent partition
be denoted by B = {B1 , . . . , Bm } (m > 2). For an arbi0
} denote a subpartition of
trary Xi , let B 0 = {B10 , . . . , Bm
B made by removing an Xi from B so that every Bi0 is the
same as Bi except the block from which Xi was removed.
Then,
Di
X
0
P (X1 , . . . , xi,k , . . . , Xn ) = P (B10 ) · · · P (Bm
). (8)
k=1
Proof. The summation Σ_{k=1}^{Di} P (X1 , . . . , xi,k , . . . , Xn )
at the left represents marginalization on a variable Xi . This
is equal to P (X1 , . . . , Xi−1 , Xi+1 , . . . , Xn ) or in partitional
notation P (B′1 , . . . , B′m ).
The proof by induction is as follows:
• Base case where m = 2: We need to show P (B′1 , B′2 ) =
P (B′1 )P (B′2 ). This is equivalent to I(B′1 , B′2 | ∅), which is
obvious by Lemma 11.
• Induction hypothesis: Assume the following holds for
m = k:

P (B′1 , . . . , B′k ) = P (B′1 ) · · · P (B′k ).    (9)

• Induction step: We need to show the following holds for
m = k + 1:

P (B′1 , . . . , B′k+1 ) = P (B′1 ) · · · P (B′k+1 ).

Merging {B′1 , . . . , B′k } into one block and applying Lemmas 9, 10, 11 gives

P (B′1 , . . . , B′k , B′k+1 ) = P (B′1 , . . . , B′k )P (B′k+1 ).

By the induction hypothesis (Eq. (9)),

P (B′1 , . . . , B′k )P (B′k+1 ) = P (B′1 ) · · · P (B′k )P (B′k+1 ).

Therefore, by mathematical induction from m = 2, k and
k + 1, Eq. (8) must hold for any m.
□
Figure 1: A partial PI model over {X1 , X2 , X3 } with partition blocks B1 and B2 . (The dotted circle depicts each partition block.)
Corollary 13, which directly follows from Lemma 12,
shows the relation between joint parameters and marginal
parameters, by which joint parameters can be derived from
other marginal and joint parameters.
Corollary 13 (Partial PI joint constraint). Let a PDM M
be an atomic partial PI model over V = {X1 , . . . , Xn }
(n > 3), where Composition holds in every proper subset. Let a marginally-independent partition be denoted by
B = {B1 , . . . , Bm } (m > 2). For an arbitrary Xi , let
B′ = {B′1 , . . . , B′m } denote a subpartition of B made by
removing an Xi from B so that every B′i is the same as Bi
except the block from which Xi was removed. Then,

P (X1 , . . . , xi,r , . . . , Xn ) = P (B′1 ) · · · P (B′m ) − Σ_{k=1, k≠r}^{Di} P (X1 , . . . , xi,k , . . . , Xn ).    (10)
We are to derive the number of independent parameters
for specifying a PDM by using the constraint (Corollary 13)
on atomic partial PI domains. First, we determine the number of independent marginal parameters, denoted as ωm ,
that are required to specify all P (B′1 ), . . . , P (B′m ) terms in
Corollary 13. Next, we determine the number of joint parameters that cannot be derived from ωm plus other joint
parameters. In a hypercube, this is equivalent to counting
the number of cells that cannot be derived since a cell in a
hypercube corresponds to a joint parameter in the JPD. The
procedure is as follows:
(i) Check cells one by one to see whether each can be derived
from ωm and any other cells by applying Corollary 13.
(ii) As soon as a cell is determined to be derivable, it is
eliminated from further consideration.
Repeat this procedure until no more cells can be eliminated.
The remaining cells and the ωm marginal parameters constitute the total number of independent parameters of the partial PI model.
For example, consider the partial PI model in Figure 1,
which corresponds to the hypercube in Figure 2(a). The
PDM consists of three variables {X1 , X2 , X3 }; X1 , X2 are
ternary, and X3 is binary. The marginally-independent partition B = {B1 , B2 } or {{X1 }, {X2 , X3 }}. For this PDM,
ωm = (3 − 1) + (3 × 2 − 1) = 7. For example, to specify
P (X1 ) for B1 , 2 marginals are required such as:
P (x1,1 ), P (x1,2 );

and to specify P (X2 , X3 ) for B2 , 5 parameters are required such as:

P (x2,1 , x3,1 ), P (x2,2 , x3,1 ), P (x2,3 , x3,1 ), P (x2,1 , x3,2 ), P (x2,2 , x3,2 ).    (11)

We assume that the 7 marginal parameters have been specified, and thus the other 2 marginal parameters P (x1,3 ) and P (x2,3 , x3,2 ) can be derived by the total probability law (Eq. (1)). Then, by marginalization, the following single-variable marginal parameters can also be derived from the 5 parameters in (11) plus P (x2,3 , x3,2 ):

P (x2,1 ), P (x2,2 ), P (x2,3 ), P (x3,1 ), P (x3,2 ).

We refer to the set of cells with the identical value Xi = xi,j as the hyperplane at Xi = xi,j in the hypercube. For example, the hyperplane at X1 = x1,3 in Figure 2(a) refers to the following 6 cells. (We use p(i, j, k) as an abbreviation for P (x1,i , x2,j , x3,k ).):

p(3, 1, 1), p(3, 2, 1), p(3, 3, 1), p(3, 1, 2), p(3, 2, 2), p(3, 3, 2).

By Corollary 13, we have

p(3, 1, 1) = P (x2,1 , x3,1 ) − (p(1, 1, 1) + p(2, 1, 1)).

That is, the cell at the front-lower-left corner can be derived from the two cells behind it and the marginal parameters. All other cells on the hyperplane at X1 = x1,3 can be similarly derived. Therefore, these 6 cells can be eliminated from further consideration. The remaining 12 cells are shown in Figure 2(b).

Figure 2: Eliminating derivable cells from a JPD hypercube. (a) The original hypercube (3 × 3 × 2). (b) The hypercube after eliminating X1 = x1,3 .

Using the same idea, four of the remaining cells at X2 = x2,3 can be derived, and therefore eliminated from further consideration. The remaining 8 cells are shown in Figure 3(a). Again, four of the remaining cells at X3 = x3,2 can be derived. After eliminating them, only 4 cells are left, as shown in Figure 3(b): p(1, 1, 1), p(2, 1, 1), p(1, 2, 1), p(2, 2, 1). Since no more cells can be further eliminated, the number of independent parameters needed to specify the partial PI model is 11, with 7 marginal parameters and 4 joint parameters. Note it would take 17 parameters to specify the JPD of a general PDM over three variables of the same space cardinalities.

Figure 3: Eliminating derivable cells from a JPD hypercube. (a) The hypercube after eliminating X2 = x2,3 . (b) The hypercube after eliminating X3 = x3,2 .
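The elimination procedure above can be checked mechanically. The following Python sketch (ours, not the authors' implementation) enumerates the cells of the 3 × 3 × 2 hypercube, removes the hyperplane at the last value of each variable in the same order as the example, and then adds the ωm block marginals; the variable names and the index-based block encoding are our own:

from itertools import product
from math import prod

cards = [3, 3, 2]                      # |S1|, |S2|, |S3| from the example
blocks = [[0], [1, 2]]                 # {{X1}, {X2, X3}} as variable indices

# All cells (joint parameters) of the 3 x 3 x 2 hypercube.
cells = set(product(*[range(d) for d in cards]))

# For each variable, drop the hyperplane at its last value: by Corollary 13
# those cells are derivable from the remaining cells plus the block marginals.
for axis, d in enumerate(cards):
    cells = {c for c in cells if c[axis] != d - 1}

# Marginal parameters needed to specify each block distribution.
omega_m = sum(prod(cards[i] for i in block) - 1 for block in blocks)

print(sorted(cells))         # [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 0)]
print(omega_m)               # 7 marginal parameters
print(omega_m + len(cells))  # 11 independent parameters in total
print(prod(cards) - 1)       # 17 for a general PDM over the same variables

The four remaining cells correspond (with 0-based indices) to p(1, 1, 1), p(2, 1, 1), p(1, 2, 1), p(2, 2, 1) in the example.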
Now we present the general result on the number of independent parameters of atomic partial PI models.
Theorem 14 (Complexity of atomic partial PI models).
Let a PDM M be an atomic partial PI model over V =
{X1 , . . . , Xn } (n > 3), where Composition holds in every
proper subset. Let D1 , . . . , Dn denote the cardinality of the
space of each variable. Let a marginally-independent partition of V be denoted by B = {B1 , . . . , Bm } (m > 2), and
the cardinality of the space of each block B1 , . . . , Bm be denoted by D(B1 ) , . . . , D(Bm ) , respectively. Then, the number
ωap of parameters required for specifying the JPD of M is
upper-bounded by
ωap = ∏_{i=1}^{n} (Di − 1) + Σ_{j=1}^{m} (D(Bj ) − 1).    (12)
Proof. Before proving the theorem, we explain the result
briefly. The first term on the right is the cardinality of the
joint space of a general PDM over the set of variables, except
that the space of each variable is reduced by one. This term is
the same as the first term in the model complexity formula for the
full PI model (Eq. (5)). The second term is the number of
marginal parameters for specifying the joint space of each
partition block.
What we need to show is how to derive all joint parameters with Σ_{j=1}^{m} (D(Bj ) − 1) marginals plus ∏_{i=1}^{n} (Di − 1) joints. First, Σ_{j=1}^{m} (D(Bj ) − 1) marginal parameters are required for specifying all P (B′1 ), . . . , P (B′m ) terms in Corollary 13. We construct a hypercube for M to apply Eq. (10) among groups of cells. Applying Corollary 13 and using a similar argument to that for the example in Figure 1, we can eliminate hyperplanes at X1 = x1,D1 , X2 = x2,D2 , . . . , Xn = xn,Dn in that order, such that for each Xi , all cells on the hyperplane at Xi = xi,Di can be derived from cells outside the hyperplane and the marginal parameters. The remaining cells form a hypercube whose length along the Xi axis is Di − 1 (i = 1, 2, . . . , n). Therefore, the total number of cells in this hypercube is ∏_{i=1}^{n} (Di − 1).
□
The following shows how to use Theorem 14 to compute
complexity of an atomic partial PI model.
Example 15 (Computing model complexity). Consider
the partial PI model in Figure 4. The domain consists of 6 variables from X1 to X6 , and X1 , X2 , X3 are
ternary; X4 , X5 are binary; and X6 is 5-ary. The domain has a marginally-independent partition {B1 , B2 , B3 }
or {{X1 , X3 }, {X2 , X4 , X5 }, {X6 }}.
Figure 4: A partial PI model with 3 blocks and a total of 6 variables. (The dotted circles depict each partition block.)
The number of marginal parameters for all partition
blocks is 23, given by (3 · 3 − 1) + (3 · 2 · 2 − 1) + (5 − 1).
The number of independent joint parameters is 32, given
by (2 − 1)² (3 − 1)³ (5 − 1). Therefore, the total number of parameters for specifying the domain in this example is 23 + 32 = 55. Compare this number with the number of parameters for specifying a general PDM over the
same set of variables by using the total probability law, giving 2² · 3³ · 5 − 1 = 539. This shows the complexity of a
partial PI model is significantly less than that of a general
PDM.
From the perspective of using a learned model, this means
the model allows faster inference and more accurate results,
requires less space to represent and store parameters during inference,
and provides more expressive power in a compact form.
□
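To reproduce the arithmetic in Example 15, here is a short Python sketch (ours, not from the paper; omega_atomic_partial and the index-based block encoding are our own) that evaluates Eq. (12) and compares it with a general PDM:

from math import prod

def omega_atomic_partial(cards, blocks):
    # Upper bound from Eq. (12): prod(Di - 1) joint parameters plus,
    # for each marginally independent block Bj, (D(Bj) - 1) marginal parameters.
    joints = prod(d - 1 for d in cards)
    marginals = sum(prod(cards[i] for i in block) - 1 for block in blocks)
    return joints + marginals

cards = [3, 3, 3, 2, 2, 5]                  # X1..X6 in Example 15
blocks = [[0, 2], [1, 3, 4], [5]]           # {{X1,X3}, {X2,X4,X5}, {X6}}
print(omega_atomic_partial(cards, blocks))  # 32 + 23 = 55
print(prod(cards) - 1)                      # 539 for a general PDM

# With singleton blocks the formula reduces to Eq. (5) for full PI models.
print(omega_atomic_partial(cards, [[i] for i in range(len(cards))]))  # 32 + 12 = 44

The last line illustrates the observation below: when every block is a singleton, the second term of Eq. (12) becomes Σ (Di − 1), which is exactly Eq. (5).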
Note that Theorem 14 also holds for full PI models since a
full PI model is a special case of partial PI models. The
proof of this is done by substituting D(Bj ) in Eq. (12) with
Di (every partition block of full PI models is a singleton),
yielding Eq. (5).
Conclusion
In this work, we present the complexity formula for atomic
partial PI models, the building blocks of non-atomic PI models. We employ the hypercube method for analyzing the
complexity of PI models.
Further work to be done in this line of research is to find
the complexity formula for non-atomic PI models. Since a
non-atomic PI model contains (recursively embedded) full
or partial PI submodel(s), the complexity can be acquired
by applying either the full or the atomic partial PI formula appropriately to each subdomain (recursively along the depth of embedding from the top to the bottom). The study of PI model
complexity will lead to a new generation of PI-learning algorithms equipped with a better scoring metric.
Acknowledgments
The authors are grateful to the anonymous reviewers for
their comments on this paper. Moreover, we especially
thank Mr. Chang-Yup Lee for his financial support. In addition,
this research is supported in part by NSERC of Canada.
References
Chickering, D.; Geiger, D.; and Heckerman, D. 1995.
Learning Bayesian networks: search methods and experimental results. In Proceedings of 5th Conference on Artificial Intelligence and Statistics, 112–128.
Cooper, G., and Herskovits, E. 1992. A Bayesian method
for the induction of probabilistic networks from data. Machine Learning 9:309–347.
Friedman, N.; Murphy, K.; and Russell, S. 1998. Learning
the structure of dynamic probabilistic networks. In Cooper,
G., and Moral, S., eds., Proceedings of 14th Conference on
Uncertainty in Artificial Intelligence, 139–147.
Heckerman, D.; Geiger, D.; and Chickering, D. 1995.
Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning 20:197–243.
Lam, W., and Bacchus, F. 1994. Learning Bayesian networks: an approach based on the MDL principle. Computational Intelligence 10(3):269–293.
Xiang, Y.; Hu, J.; Cercone, N.; and Hamilton, H. 2000.
Learning pseudo-independent models: analytical and experimental results. In Hamilton, H., ed., Advances in Artificial Intelligence. Springer. 227–239.
Xiang, Y.; Lee, J.; and Cercone, N. 2003. Parameterization of pseudo-independent models. In Proceedings of
16th Florida Artificial Intelligence Research Society Conference, 521–525.
Xiang, Y.; Lee, J.; and Cercone, N. 2004. Towards better
scoring metrics for pseudo-independent models. International Journal of Intelligent Systems 20.
Xiang, Y.; Wong, S.; and Cercone, N. 1996. Critical remarks on single link search in learning belief networks. In
Proceedings of 12th Conference on Uncertainty in Artificial Intelligence, 564–571.
Xiang, Y.; Wong, S.; and Cercone, N. 1997. A ‘microscopic’ study of minimum entropy search in learning decomposable Markov networks. Machine Learning
26(1):65–92.
Xiang, Y. 1997. Towards understanding of pseudo-independent domains. In Poster Proceedings of 10th International Symposium on Methodologies for Intelligent Systems.