Probabilistic reasoning with terms

Peter A. Flach, Elias Gyftodimos and Nicolas Lachiche
Abstract. Many problems in artificial intelligence can be naturally approached
by generating and manipulating probability distributions over structured objects.
First-order terms such as lists, trees and tuples and nestings thereof can represent
individuals with complex structure in the underlying domain, such as sequences
or molecules. Higher-order terms such as sets and multisets provide additional
representational flexibility. In this paper we present two Bayesian approaches
that employ such probability distributions over structured objects: the first is an
upgrade of the well-known naive Bayesian classifier to deal with first-order and
higher-order terms, and the second is an upgrade of propositional Bayesian networks to deal with nested tuples.
1 Introduction
Attempts to combine logic and probabilities have a long history. There are, roughly
speaking, two possible approaches: one is to merge logic and probabilities into a single
integrated system; the other is to combine them in a complementary fashion, employing
each to address issues which the other is unable to handle. Rudolf Carnap’s inductive
logic is an example of the first approach: he introduces a degree of confirmation c(H, E),
which is a number between 0 and 1 expressing the degree to which evidence E confirms
hypothesis H. A degree of confirmation of 1 is equivalent to logical entailment; a lower
degree of confirmation ‘may be regarded as a numerical measure for a partial entailment’.
We have argued previously [1, 2] against such a tight merger of logic and probabilities, because it leads to a degenerate form of logic. In particular, in inductive logic
a proof amounts to calculating a degree of confirmation, therefore any hypothesis has
a ‘proof’ from any evidence. Probabilities do establish a semantics of some sort, but
not one that is sufficient to build a fully-fledged logical system. Proof theory requires a
semantics which indicates what is preserved going from evidence to hypothesis; probability theory can only offer a truth-estimating semantics.
We therefore argue in favour of the complementary approach of combining logic
and probabilities. For instance, in [1] we have developed qualitative logics of induction
which are able to distinguish between hypotheses that explain certain evidence and hypotheses that do not, for various definitions of what it means to explain. These logics are
built on a preservation semantics in which the property to be preserved from evidence
to hypothesis is explanatory power: the hypothesis explains the evidence if it has at least
the same explanatory power as the evidence. What such qualitative logics of induction
cannot answer is: how plausible is this hypothesis, given this evidence? For answering
such questions we can employ an additional probabilistic truth-estimating semantics.
We then have both a proof theory (only hypotheses explaining the evidence can be derived from it) and a probabilistic semantics assigning degrees of belief. Notice that the
same probabilistic semantics may be used in combination with different non-deductive
logics.
Halpern [4] has identified the use of probability for assigning degrees of belief as
one of two ways of combining logic and probability, the other being defining probability distributions over domains of first-order variables. This paper follows the latter
approach. We investigate how we can define probability distributions over structured
objects represented by first- and higher-order terms. First-order terms such as lists, trees
and tuples and nestings thereof can represent individuals with complex structure in the
underlying domain, such as sequences or molecules. Higher-order terms such as sets
and multisets provide additional representational flexibility. In this paper we present two
Bayesian approaches that employ such probability distributions over structured objects:
the first is an upgrade of the well-known naive Bayesian classifier to deal with first-order and higher-order terms, and the second is an upgrade of propositional Bayesian
networks to deal with nested tuples.
2 Probability distributions over tuples, lists and sets
We start by considering the question: how to define probability distributions over complex objects such as lists and sets? We assume that we have one or more alphabets (or atomic types) A = {x1, ..., xn} of atomic objects (e.g. integers or characters – we are only dealing with categorical data analysis in this paper), as well as probability distributions PA over elements of the alphabets. Most of the distributions will make fairly strong independence assumptions, in the style of the naive Bayesian classifier.
2.1 Probability distributions over tuples
For completeness we start with the simplest case, where our complex objects are tuples
or vectors. This is the only aggregation mechanism that has routinely been considered
by statisticians and probability theorists.
Consider the set of k-tuples, i.e. elements of the Cartesian product of k alphabets: (x1, x2, ..., xk) ∈ A1 × A2 × ... × Ak.
Definition 1 (Distribution over tuples). The following defines a distribution over tuples:
$$P_{tu}((x_1, x_2, \ldots, x_k)) = \prod_{i=1}^{k} P_{A_i}(x_i)$$
Obviously, this distribution assumes that each component of the tuple is statistically independent of the others. Bayesian networks are one way of relaxing this assumption by explicitly modelling dependencies between variables. We will return to this in Section 3.
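For illustration, a minimal Python sketch of how Ptu could be computed for categorical alphabets follows; the dictionary representation of the component distributions and all names are our own illustrative assumptions, not part of the original formulation.

def p_tuple(components, dists):
    # P_tu((x1,...,xk)) = product of P_Ai(xi), assuming component independence
    p = 1.0
    for x, p_a in zip(components, dists):
        p *= p_a[x]
    return p

P_A1 = {"a": 0.2, "b": 0.3, "c": 0.5}   # illustrative alphabet distributions
P_A2 = {0: 0.6, 1: 0.4}
print(p_tuple(("a", 1), [P_A1, P_A2]))  # 0.2 * 0.4 = 0.08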
2.2 Probability distributions over lists
We can define a uniform probability distribution over lists if we consider only finitely many of them, say, up to and including length L. There are (n^{L+1} − 1)/(n − 1) of those for n > 1, so under a uniform distribution every list has probability (n − 1)/(n^{L+1} − 1) for n > 1, and probability 1/(L + 1) for n = 1. Clearly, such a distribution does not depend on the internal structure of the lists, treating each of them as equiprobable.
A slightly more interesting case includes a probability distribution over lengths of lists. This has the additional advantage that we can define distributions over all (infinitely many) lists over A. For instance, we can use the geometric distribution over list lengths: Pτ(l) = τ(1 − τ)^l, with parameter τ denoting the probability of the empty list. Of course, we can use other infinite distributions, or arbitrary finite distributions, as long as they sum up to 1. The geometric distribution corresponds to the head-tail representation of lists.
We then need, for each list length l, a probability distribution over lists of length l. We can again assume a uniform distribution: since there are n^l lists of length l, we would assign probability n^{−l} to each of them. Combining the two distributions over list lengths and over lists of fixed length, we assign probability τ((1 − τ)/n)^l to any list of length l. Such a distribution only depends on the length of the list, not on the elements it contains.
We can also use the probability distribution PA over the alphabet to define a non-uniform distribution over lists of length l. For instance, among the lists of length 3, list [a, b, c] would have probability PA(a)PA(b)PA(c), and so would its 5 other permutations. Combining PA and Pτ thus gives us a distribution over lists which depends on the length and the elements of the list, but ignores their positions or ordering.
Definition 2 (Distribution over lists). The following defines a probability distribution over lists:
$$P_{li}([x_{j_1}, \ldots, x_{j_l}]) = \tau (1-\tau)^l \prod_{i=1}^{l} P_A(x_{j_i})$$
where 0 < τ ≤ 1 is a parameter determining the probability of the empty list.
Example 1 (Order-independent distribution over lists). Consider an alphabet A = {a, b, c}, and suppose that the probability of each element occurring is estimated as PA(a) = .2, PA(b) = .3, and PA(c) = .5. Taking τ = (1 − .2)(1 − .3)(1 − .5) = .28, i.e. using the bitvector estimate of an empty list, we have PA'(a) = (1 − .28) × .2 = .14, PA'(b) = .22, and PA'(c) = .36, and Pli([a]) = .28 × .14 = .04, Pli([b]) = .06, Pli([c]) = .10, Pli([a, b]) = .28 × .14 × .22 = .009, Pli([a, c]) = .014, Pli([b, c]) = .022, and Pli([a, b, c]) = .28 × .14 × .22 × .36 = .003. We also have, e.g., Pli([a, b, b, c]) = .28 × .14 × .22 × .22 × .36 = .0007.
Notice that this and similar distributions can easily be represented by stochastic
logic programs. For instance, the distribution of Example 1 corresponds to the following
SLP:
0.2: element(a).
0.3: element(b).
0.5: element(c).
0.28: list([]).
0.72: list([H|T]):-element(H),list(T).
Not surprisingly, the predicate definitions involved are range-restricted type definitions;
the probability labels either define a distribution exhaustively, or else follow the recursive structure of the type.
An alternative way to see such distributions is as follows. Introducing an extended alphabet A' = {ε, x1, ..., xn} and a renormalised distribution PA'(ε) = τ and PA'(xi) = (1 − τ)PA(xi), we have Pli([xj1, ..., xjl]) = PA'(ε) ∏ PA'(xji), with the product ranging over i = 1, ..., l. That is, under Pli we can view each list as an infinite tuple of finitely many independently chosen elements of the alphabet, followed by the stop symbol ε representing an infinite empty tail.
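A small Python sketch of Pli, reproducing the figures of Example 1, may make the definition concrete; the representation (dictionaries for PA, lists of symbols) and all names are illustrative assumptions.

def p_list(xs, p_a, tau):
    # P_li([x1,...,xl]) = tau * prod_i (1 - tau) * P_A(x_i)   (Definition 2)
    p = tau
    for x in xs:
        p *= (1.0 - tau) * p_a[x]
    return p

P_A = {"a": 0.2, "b": 0.3, "c": 0.5}
tau = (1 - 0.2) * (1 - 0.3) * (1 - 0.5)             # bitvector estimate: 0.28
print(round(p_list([], P_A, tau), 2))               # 0.28
print(round(p_list(["a"], P_A, tau), 2))            # ~0.04, as in Example 1
print(round(p_list(["a", "b", "c"], P_A, tau), 4))  # ~0.0031, rounded to .003 in Example 1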
If we want to include the ordering of the list elements in the distribution, we can assume a distribution PA² over pairs of elements of the alphabet, so that [a, b, c] would have probability PA²(ab)PA²(bc) among the lists of length 3. Such a distribution would be calculated by the following SLP:
0.05: pair(a,a).
0.10: pair(a,b).
0.05: pair(a,c).
...
0.15: pair(c,c).
0.28: listp([]).
0.13: listp([X]):-element(X).
0.59: listp([X,Y|T]):-pair(X,Y),listp([Y|T]).
Such a distribution would take some aspects of the ordering into account, but note that [a, a, b, a] and [a, b, a, a] would still obtain the same probability, because they consist of the same pairs (aa, ab, and ba), and they start and end with a. Obviously we can
continue this process with triples, quadruples etc., but note that this is both increasingly
computationally expensive and unreliable if the probabilities must be estimated from
data.
A different approach is obtained by taking not ordering but position into account. For instance, we can have three distributions PA,1, PA,2 and PA,3+ over the alphabet, for positions 1, 2, and 3 and higher, respectively. Among the lists of length 4, the list [a, b, c, d] would get probability PA,1(a)PA,2(b)PA,3+(c)PA,3+(d); so would the list [a, b, d, c]. Again, this wouldn't be hard to model with a stochastic logic program.
In summary, all except the most trivial probability distributions over lists involve
(i) a distribution over lengths, and (ii) distributions over lists of fixed length. The latter take the list elements, ordering and/or position into account. We do not claim any
originality with respect to the above distributions, as these and similar distributions are
commonplace in e.g. bioinformatics. In the next section we use list distributions to define distributions over sets and multisets. Notice that, while lists are first-order terms,
sets represent predicates and therefore are higher-order terms. In this respect our work
extends related work on probability distributions over first-order terms.
2.3 Probability distributions over sets and multisets
A multiset (also called a bag) differs from a list in that its elements are unordered, but
multiple elements may occur. Assuming some arbitrary total ordering on the alphabet,
each multiset has a unique representation such as {[a, b, b, c]}. Each multiset can be
mapped to the equivalence class of lists consisting of all permutations of its elements.
For instance, the previous multiset corresponds to 12 lists. Now, given any probability distribution over lists that assigns equal probability to all permutations of a given
list, this provides us with a method to turn such a distribution into a distribution over
multisets. In particular, we can employ Pli above, which defines the probability of a list, among all lists of the same length, as the product of the probabilities of its elements.
Definition 3 (Distribution over multisets). For any multiset s, let l stand for its cardinality, and let ki stand for the number of occurrences of the i-th element of the alphabet. The following defines a probability distribution over multisets:
$$P_{ms}(s) = \frac{l!}{k_1! \cdots k_n!} P_{li}(s) = l!\,\tau \prod_i \frac{(P_{A'}(x_i))^{k_i}}{k_i!}$$
where τ is a parameter giving the probability of the empty multiset. Here, l!/(k1! ... kn!) stands for the number of permutations of a list with possible duplicates.
Example 2 (Distribution over multisets). Continuing Example 1, we have Pms({[a]}) = .04, Pms({[b]}) = .06, Pms({[c]}) = .10 as before. However, Pms({[a, b]}) = .02, Pms({[a, c]}) = .03, Pms({[b, c]}) = .04, Pms({[a, b, c]}) = .02, and Pms({[a, b, b, c]}) = .008.
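As a sketch, the multiset probability can be computed by multiplying the list probability of any one ordering by the number of distinct permutations; the following Python code, with illustrative names and representation, reproduces the last figure of Example 2.

from collections import Counter
from math import factorial

def p_multiset(ms, p_a, tau):
    # ms is a multiset given as a list of its elements (order irrelevant)
    counts = Counter(ms)
    l = sum(counts.values())
    permutations = factorial(l)                 # l! / (k_1! ... k_n!)
    for k in counts.values():
        permutations //= factorial(k)
    p_li = tau                                  # P_li of any one ordering (Definition 2)
    for x in ms:
        p_li *= (1.0 - tau) * p_a[x]
    return permutations * p_li

P_A = {"a": 0.2, "b": 0.3, "c": 0.5}
tau = 0.28
print(round(p_multiset(["a", "b", "b", "c"], P_A, tau), 3))  # ~0.008, cf. Example 2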
Notice that the calculation of the multiset probabilities involves explicit manipulation of the probabilities returned by the geometric list distribution. It is therefore hard
to see how this could be equivalently represented by a stochastic logic program.
The above method, of defining a probability distribution over a type by virtue of that
type being isomorphic to a partition of another type for which a probability distribution
is already defined, is more generally applicable. Although in the above case we assumed
that the distribution over each block in the partition is uniform, so that we only have
to count its number of elements, this is not a necessary condition. Indeed, blocks in
the partition can be infinite, as long as we can derive an expression for their cumulative
probability. We will now proceed to derive a probability distribution over sets from
distribution Pli over lists in this manner.
Consider the set {a, b}. It can be interpreted to stand for all lists of length at least 2 which contain (i) at least a and b, and (ii) no other element of the alphabet besides a and b. The cumulative probability of lists of the second type is easily calculated.
Lemma 1. Consider a subset S of l elements from the alphabet, with cumulative probability PA'(S) = ∑_{xi ∈ S} PA'(xi). The cumulative probability of all lists of length at least l containing only elements from S is
$$f(S) = \tau \frac{(P_{A'}(S))^l}{1 - P_{A'}(S)}$$
Proof. We can delete all elements in S from the alphabet and replace them by a single element xS with probability PA'(S). The lists we want consist of l or more occurrences of xS. Their cumulative probability is
$$\sum_{j \geq l} \tau (P_{A'}(S))^j = \tau \frac{(P_{A'}(S))^l}{1 - P_{A'}(S)}$$
f(S) as defined in Lemma 1 is not a probability, because the construction in the proof includes lists that do not contain all elements of S. For instance, if S = {a, b} they include lists containing only a's or only b's. More generally, for arbitrary S the
construction includes lists over every possible subset of S, which have to be excluded in
the calculation of the probability of S. In other words, the calculation of the probability
of a set iterates over its subsets.
Definition 4 (Subset-distribution over sets). Let S be a non-empty subset of l elements from the alphabet, and define
$$P_{ss}(S) = \sum_{S' \subseteq S} (-P_{A'}(S'))^{l - l'} f(S')$$
where l' is the cardinality of S', and f(S') is as defined in Lemma 1. Furthermore, define Pss(∅) = τ. Pss(S) is a probability distribution over sets.
Example 3 (Subset-distribution over sets). Continuing Example 2, we have Pss(∅) = τ = .28, Pss({a}) = f({a}) = τ PA'({a})/(1 − PA'({a})) = .05, Pss({b}) = .08, and Pss({c}) = .16. Furthermore, Pss({a, b}) = f({a, b}) − PA'({a}) f({a}) − PA'({b}) f({b}) = .03, Pss({a, c}) = .08, and Pss({b, c}) = .15. Finally, Pss({a, b, c}) = f({a, b, c}) − PA'({a, b}) f({a, b}) − PA'({a, c}) f({a, c}) − PA'({b, c}) f({b, c}) + (PA'({a}))² f({a}) + (PA'({b}))² f({b}) + (PA'({c}))² f({c}) = .18.
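A Python sketch of the subset-distribution, enumerating the non-empty subsets S' of S as in Definition 4, is given below; the representation is again an illustrative assumption, and the explicit enumeration mirrors the remark on computational cost made next.

from itertools import combinations

def p_set(S, p_a, tau):
    # P_ss(S) = sum over non-empty S' of S of (-P_A'(S'))^(l-l') * f(S')   (Definition 4)
    if not S:
        return tau
    S = list(S)
    l = len(S)
    total = 0.0
    for size in range(1, l + 1):
        for subset in combinations(S, size):
            p_sub = sum((1.0 - tau) * p_a[x] for x in subset)   # P_A'(S')
            f_sub = tau * p_sub ** size / (1.0 - p_sub)         # f(S') from Lemma 1
            total += (-p_sub) ** (l - size) * f_sub
    return total

P_A = {"a": 0.2, "b": 0.3, "c": 0.5}
tau = 0.28
print(round(p_set({"a", "b"}, P_A, tau), 2))       # ~0.03, cf. Example 3
print(round(p_set({"a", "b", "c"}, P_A, tau), 2))  # ~0.18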
Pss takes only the elements occurring in a set into account, and ignores the remaining elements of the alphabet. For instance, the set {a, b, c} will have the same probability regardless of whether there is one more element d in the alphabet with probability p, or 10 more elements with cumulative probability p. This situation is analogous to
lists. In contrast, the usual approach of propositionalising sets by representing them as
bitvectors over which a tuple distribution can be defined has the disadvantage that the
probability of a set depends on the elements in its extension as well as its complement.
The subset-distribution is exponential in the cardinality of the set and is therefore expensive to compute for large sets. On the other hand, it can deal with finite subsets of
infinite domains. The bitvector probability calculation is independent of the size of the
subset, and cannot handle infinite domains.
2.4 Naive Bayes classification of terms
It is easy to nest the above distributions: if our terms are sets of lists, the alphabet over which the sets range is the collection of all possible lists, and the list distribution can
be taken as the distribution over that alphabet. The same holds for arbitrary nestings of
types.
What can we do with probability distributions over terms? The usual things: inference, learning, prediction, and the like. We can do inference, either directly or through
sampling, if we have completely specified distributions over all types involved. This
involves queries of the type ‘what is the probability of this term?’. We would normally
assume that the type structure of the terms is known, so learning would just involve estimating the parameters of the distributions as well as the probabilities of atomic terms. This is usually fairly straightforward: for instance, the maximum likelihood estimate of the parameter τ giving the probability of the empty list is 1/(1 + l), where l is the average length of a list.
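As a sketch, under the additional assumption that the atomic distribution PA is estimated by relative frequencies of symbols in the training lists, parameter estimation could look as follows in Python (data and names are hypothetical):

from collections import Counter

def estimate_list_parameters(lists):
    # MLE for tau: 1 / (1 + average list length); P_A: relative symbol frequencies (assumption)
    mean_length = sum(len(xs) for xs in lists) / len(lists)
    tau = 1.0 / (1.0 + mean_length)
    counts = Counter(x for xs in lists for x in xs)
    total = sum(counts.values())
    p_a = {x: c / total for x, c in counts.items()}
    return tau, p_a

data = [["a", "b"], ["b", "c", "c"], [], ["a"]]   # hypothetical training lists
tau, p_a = estimate_list_parameters(data)
print(tau)   # mean length 1.5, so tau = 1 / 2.5 = 0.4
print(p_a)   # a, b and c each occur twice among 6 symbols, so 1/3 each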
In [6] we have used this approach to implement a naive Bayes classifier for first- and higher-order terms. We assume that the type structure is given (i.e., whether to treat
a sequence of elements as a list or a set), and during the training phase the algorithm
estimates the parameters and the atomic distributions. Predictions are made by calculating the posterior class distribution given the term to be classified. The approach is naive
because of the strong independence assumptions made: for instance, for lists we use the
distribution Pli defined above, which treats each element of the list as independent of
the others and of its position.
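A minimal Python sketch of such a classifier for lists follows; it is not the 1BC2 implementation of [6], and the class priors, per-class parameters and names are hypothetical.

def p_list(xs, p_a, tau):
    # per-class list likelihood, as in Definition 2
    p = tau
    for x in xs:
        p *= (1.0 - tau) * p_a[x]
    return p

def classify(xs, prior, class_params):
    # posterior(c) proportional to P(c) * P_li(xs | c); class_params[c] = (tau_c, P_A_c)
    scores = {c: prior[c] * p_list(xs, p_a=p_a_c, tau=tau_c)
              for c, (tau_c, p_a_c) in class_params.items()}
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

prior = {"pos": 0.5, "neg": 0.5}                       # hypothetical parameters
class_params = {"pos": (0.3, {"a": 0.7, "b": 0.3}),
                "neg": (0.3, {"a": 0.2, "b": 0.8})}
print(classify(["a", "a", "b"], prior, class_params))  # 'a'-heavy lists favour the pos class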
An interesting point to note is that in the context of the naive Bayes classifier, the
list and multiset distributions behave identically (the number of permutations of a given
list is independent of the class). Somewhat more surprisingly, we found that the subset
and list distributions behaved nearly identically as well.
In this section, we defined fairly naive probability distributions over tuples, lists, sets
and multisets, and indicated how these could be used for naive Bayes classification. A
natural next question is whether the approach carries over to Bayesian networks, of
which the naive Bayes classifier is a special case. In the next section we will address
this question for the special case of nested tuples.
3 Hierarchical Bayesian Networks
Bayesian Networks [7] are being used extensively for reasoning under uncertainty. Their applicability is limited, however, by the fact that they can only deal with propositional domains.
Hierarchical Bayesian Networks (HBNs) are an extension of Bayesian Networks,
such that nodes in the network may correspond to (possibly nested) tuples of atomic
types. Links in the network represent probabilistic dependencies the same way as in
standard Bayesian Networks, the difference being that those links may lie at any level
of nesting into the data structure.
3.1 Principles of Hierarchical Bayesian Networks
An HBN consists of two parts: the structural and the probabilistic part. The former describes the type hierarchy that builds the domains of the structured variables as tuples
of simpler types. It also contains the direct probabilistic dependencies, which are observed between elements of the same tuple. The latter contains the quantitative part of
the conditional probabilities for the variables that are defined in the structural part.
Fig. 1. A simple Hierarchical Bayesian Network. (a) Nested representation of the network structure. (b) Tree representation of the network structure. (c) Standard Bayesian Network expressing the same dependencies. (d) Probabilistic part: the conditional probability tables P(A), P(BI|A), P(BII|A,BI) and P(C|BI,BII).
Figure 1 presents a simple Hierarchical Bayesian Network. The structural part consists of three variables, A, B and C, where B is itself a pair (BI, BII). This may be represented either using nested nodes (a), or by a tree-like type hierarchy (b). We use the
symbol t to denote a top-level composite node that includes all the variables of our
world. In (c) it is shown how the probabilistic dependency links unfold if we flatten the
hierarchical structure to a standard Bayesian Network.
In an HBN we observe two types of relationships between nodes: Relationships
in the type structure (which we call t-relationships) and relationships that are formed
by the probabilistic dependency links (p-relationships). We will make use of everyday
terminology for both kinds of relationships, and refer to parents, ancestors, siblings,
spouses etc. in the obvious meaning.
In the previous example, B has two t-children, namely BI and BII, one p-parent (A)
and one p-child (C).
We also assume that the scope of a probabilistic dependency link “propagates” through the type structure, defining a set of higher-level probabilistic relationships.
Trivially, all p-parents of a node are also considered its higher-level parents. For example, we consider the higher-level parents of C to be B (as a trivial case), BI and BII (because they are t-descendants of B and there exists a p-link B → C), while the higher-level parents of BII are BI (trivial) and A (because A → B and BII is a t-descendant of B).
For a more formal and complete description of the above notions, see [3].
3.2 Probability distribution decomposition
A Hierarchical Bayesian Network is a compact representation of the probabilistic independencies between the atomic types it contains, in a similar manner as in standard
Bayesian Networks. Informally, we can express this as the following property:
The value of an atomic variable is independent of all atomic variables that
are not its higher-level descendants, given the value of its higher-level parents.
The independencies that an HBN describes can be exploited using the chain rule of
conditional probability, to decompose the full joint probability of all the atomic types
into a product of the conditional probabilities, in the following way:
$$P(x) = \begin{cases} P_{A_X}(x) & \text{if } x \in A_X \text{ is atomic} \\ \prod_{i=1}^{n} P(x_i \mid Par(x_i)) & \text{otherwise} \end{cases}$$
where x1, x2, ..., xn are the components of x and Par(xi) are the direct p-parents of xi in the structure.
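One way this decomposition could be realised is sketched below in Python, conditioning each atomic variable on the values of its higher-level parents, as in the conditional probability tables of Figure 1; the structure, value names and numbers are hypothetical, not those of the original figure.

def joint_probability(assignment, structure, cpts):
    # assignment: atomic variable name -> value
    # structure: composite node -> list of components (atomic names are leaves)
    # cpts: atomic name -> (parent_names, table) with table[parent_values][value]
    def prob(node):
        if node in cpts:                            # atomic: look up P(x_i | parents)
            parents, table = cpts[node]
            key = tuple(assignment[p] for p in parents)
            return table[key][assignment[node]]
        p = 1.0
        for component in structure[node]:           # composite: product over components
            p *= prob(component)
        return p
    return prob("top")

structure = {"top": ["A", "B", "C"], "B": ["BI", "BII"]}
cpts = {                                            # hypothetical numbers
    "A":   ((),            {(): {"a1": 0.3, "a2": 0.7}}),
    "BI":  (("A",),        {("a1",): {"b1": 0.4, "b2": 0.6},
                            ("a2",): {"b1": 0.8, "b2": 0.2}}),
    "BII": (("A", "BI"),   {("a1", "b1"): {"x1": 0.5, "x2": 0.5},
                            ("a1", "b2"): {"x1": 0.6, "x2": 0.4},
                            ("a2", "b1"): {"x1": 0.2, "x2": 0.8},
                            ("a2", "b2"): {"x1": 0.9, "x2": 0.1}}),
    "C":   (("BI", "BII"), {("b1", "x1"): {"c1": 0.7, "c2": 0.3},
                            ("b1", "x2"): {"c1": 0.1, "c2": 0.9},
                            ("b2", "x1"): {"c1": 0.5, "c2": 0.5},
                            ("b2", "x2"): {"c1": 0.4, "c2": 0.6}}),
}
print(round(joint_probability({"A": "a1", "BI": "b1", "BII": "x1", "C": "c1"},
                              structure, cpts), 3))  # 0.3 * 0.4 * 0.5 * 0.7 = 0.042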
3.3 Inference in Hierarchical Bayesian Networks
Inference algorithms used for standard Bayesian Networks can also be applied to the Hierarchical Bayesian Network model. Backward reasoning algorithms [8] and message passing algorithms [7] can be used if we restrict our focus to the corresponding Bayesian Networks. The additional information in the Hierarchical Bayesian Network may serve towards a better interpretation of the resulting probability distributions.
Standard inference algorithms are not directly applicable to networks that contain loops (a loop is a closed path in the underlying undirected graph structure). One technique for coping with loops is to merge groups of variables into compound nodes, eliminating the cycles in the graph. In the case of Hierarchical Bayesian Networks, knowledge of the hierarchical structure can serve as a guideline for which nodes to merge, in such a way that the resulting network allows a more meaningful interpretation.
Message Propagation Algorithm A very efficient method of calculating beliefs in polytree-structured standard Bayesian Networks (i.e., those that do not contain any undirected closed path) is the message-passing algorithm described in [7]. Every node in the network is associated with a belief state, that is, a vector whose elements sum up to one and correspond to the proportionate beliefs that each value in the domain may be the variable's value, given all available knowledge. The belief state of every node can be directly retrieved given the belief states of its parents and children. Whenever a change in a node's belief state occurs, either forced by some direct observation or indirectly, due to a change of the state of a neighbour, the node calculates its new belief state and propagates the relevant information to its parents and children (Fig. 2). The algorithm then repeats until the network reaches an equilibrium.
Fig. 2. Belief propagation in polytrees. (a) Message passing in polytree structures: π and λ messages are exchanged between a node and its parents and children. (b) Local belief update BEL(x).
Belief calculation is performed locally in three independent steps:
Belief updating: Calculating a node's belief vector, using the latest information available from its neighbours.
Bottom-up propagation: Computing the λ messages that will be sent to parent nodes.
Top-down propagation: Computing the π messages that will be sent to child nodes.
The locality of the above algorithm is based on probabilistic independencies that result from the assumption that the network does not contain any undirected cycles. If
a network does contain loops, either an equilibrium cannot be reached, or, if it can be,
it will not necessarily represent the actual joint probability distribution.
This algorithm may be directly applied to Hierarchical Bayesian Networks in the
trivial case where the corresponding Bayesian Network does not contain undirected
cycles. Even if individual composite nodes contain no loops, these will occur in the
corresponding Bayesian Network in any case where some composite node participates
in more than one p-relationship. The only case where loops will not occur is if we allow for polytree-like structures, where only leaf nodes may be composite, under the further restriction that these composite leaves contain no internal p-links.
Structures containing loops In the case where probabilistic dependencies form loops
in the network infrastructure (i.e., the undirected network) the above algorithm cannot be applied directly. We define a Hierarchical Bayesian Network to contain loops if
its corresponding Bayesian Network contains loops. There are several approaches that
can be used to perform inference on a network containing loops [7, 5]. As a trivial case,
these methods can be applied to any Hierarchical Bayesian Network after flattening it to
a standard Bayesian Network. Here we will restrict our discussion to an existing clustering method and show how it can be specifically adapted to the Hierarchical Bayesian
Network case.
Clustering methods eliminate loops by grouping together clusters of two or more vertices into composite nodes. Different cluster selections may yield different polytrees when applied to a given Bayesian Network. As an extreme case, all non-leaf nodes may be grouped in a single cluster (Fig. 3(b)).
Fig. 3. Coping with loops. (a) A Bayesian Network containing the loop ABDA. (b) Grouping all non-leaf variables in a single cluster. (c) Equivalent Markov network. (d) Join tree algorithm results in a single cluster.
One popular method, described in [7], is based on the construction of join trees.
Briefly, the technique consists in building a triangulated undirected graph G (i.e., a
Markov Network) that represents independency relations similar to the original Bayesian
Network, and then linking the maximal cliques of G to form a tree structure. The advantage of this method is that the resulting directed acyclic structure is a tree, making the
application of message propagation highly efficient. The trade-off is that the common
parents of a node in the original Bayesian Network, along with the node itself, will be
grouped into a single cluster, so information about independencies between them will
be lost (Fig. 3(c,d)).
The same algorithm can be applied directly on any fully flattened Hierarchical
Bayesian Network. However, we can make use of the additional information that a Hierarchical Bayesian Network structure contains to arrive at more informative structures.
The method we introduce is the following algorithm:
Algorithm 1 (HBN-decycling algorithm) To decycle a node v:
– If v is a non-leaf node:
  Decycle all components of v.
  If v participates in two or more p-edges, prune the network structure on v.
  If v has exactly one p-parent (or p-child), flatten the structure on v, replacing every p-connected subset of the t-children of v by a single cluster.
  If v has no p-parents or p-children, flatten the structure on the node v.
– If v is a leaf node, leave v unchanged.
By applying the HBN-decycling algorithm on a Hierarchical Bayesian Network and then retrieving its corresponding Bayesian Network, we arrive at a polytree Bayesian
Network. The application of the inference algorithm for Bayesian Networks is not as
efficient for polytrees as it is for standard trees, but this is balanced by the smaller size
of individual nodes.
In figure 4(a) we see an HBN-tree structure containing a loop. The structure is
similar to the Bayesian Network in figure 3(a), with nodes A and B additionally forming
a composite node. The result of the decycling algorithm (Fig. 4(b)) retains much more
structural information from the original structure.
Fig. 4. (a) A Hierarchical Bayesian Network containing the loop ABDA. (b) Result of the HBN-decycling algorithm.
4 Concluding remarks
In this paper we have considered the issue of probabilistic reasoning with terms. Aggregation mechanisms such as lists and sets are well-understood by logicians and computer
scientists and occur naturally in many domains; yet they have been mostly ignored in
statistics and probability theory, where tupling seems the only aggregation mechanism.
The individuals-as-terms perspective has been explored in machine learning in order
to investigate the links between approaches that can deal with structured data such as
inductive logic programming, and more traditional approaches which assume that all
data are in a fixed attribute-value vector format. This has been very fruitful because it
has paved the way for upgrading well-known propositional techniques to the first- and
higher-order case.
In our work we are exploring similar upgrades of propositional probabilistic models to the first- and higher-order case. We have done this upgrade for the naive Bayes
classifier but further work remains, in particular regarding an extended vocabulary of
parametrised probability distributions from which to choose. In the case of Bayesian
networks, we are currently concentrating on a restricted upgrade considering nested tuples only. Our approach takes advantage of existing Bayesian Network methods, and is
an elegant way of incorporating into them specific knowledge regarding the structure of
the domain. Our current work is focused on implementing inference and learning methods for Hierarchical Bayesian Networks, aiming to be tested against the performance
of standard Bayesian Networks. Further on, we plan to introduce more aggregation operators for types, such as lists and sets. This will demand somewhat more sophisticated
probability decomposition methods than the Cartesian product, and will allow the application of the model to structures of arbitrary form and length, such as web pages or DNA sequences.
References
1. P.A. Flach. Conjectures: An Inquiry Concerning the Logic of Induction. PhD thesis,
Katholieke Universiteit Brabant, 1995.
2. Peter A. Flach. Logical characterisations of inductive learning. In Dov M. Gabbay and Rudolf
Kruse, editors, Handbook of defeasible reasoning and uncertainty management systems, Vol.
4: Abductive reasoning and learning, pages 155–196. Kluwer Academic Publishers, October
2000.
3. Elias Gyftodimos and Peter A. Flach. Hierarchical Bayesian networks: A probabilistic reasoning model for structured domains. In Edwin de Jong and Tim Oates, editors, Proceedings
of the ICML-2002 Workshop on Development of Representations. University of New South
Wales, 2002.
4. J. Halpern. An analysis of first-order logics of probability. Artificial Intelligence, 46:311–350,
1990.
5. M. Henrion. Propagating uncertainty by logic sampling in Bayes' networks. Technical report,
Department of Engineering and Public Policy, Carnegie-Mellon University, 1986.
6. Nicolas Lachiche and Peter A. Flach. 1BC2: A true first-order Bayesian classifier. In Proceedings of the 12th International Conference on Inductive Logic Programming, 2002.
7. Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
8. Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall,
2nd edition, 1995.