Aspects of Bayesian Networks

Amber Tomas

October 31, 2003

Supervisor: Dr Nigel Bean

Thesis submitted as a requirement for the degree of Bachelor of Mathematical and Computer Sciences (Honours) in the School of Applied Mathematics
Contents

Signed Statement
Abstract

1 Introduction
  1.1 Introduction
  1.2 Preliminary Information
    1.2.1 Graph Theory
    1.2.2 Probability Theory
    1.2.3 Hypothesis Testing

2 Bayesian Networks
  2.1 Independence Relations
  2.2 Probabilistic Bayesian Networks
  2.3 Bayesian Belief Networks
  2.4 Causal Bayesian Networks

3 Information Propagation
  3.1 Belief Distributions
  3.2 Concepts and Notation
  3.3 Belief Propagation
    3.3.1 Propagation in Trees
    3.3.2 Propagation in Polytrees
    3.3.3 Networks containing Loops
  3.4 Answering Queries

4 Aspects of Structure
  4.1 Concepts from Information Theory
  4.2 Changes to the Network
    4.2.1 Changes to the Parents of a Node
    4.2.2 Changes in Networks Containing Loops
    4.2.3 Removal of Nodes
  4.3 Simplifying the Structure
  4.4 The Value of Evidence
    4.4.1 Selecting a Set of Evidence Nodes
    4.4.2 Updating the Set of Evidence Nodes

5 Learning Bayesian Networks from Data
  5.1 Introduction
  5.2 Considerations of learning
  5.3 Methods of Learning Bayesian Networks
    5.3.1 Scoring Functions and Search Methods
    5.3.2 Maximum Likelihood
    5.3.3 Hypothesis Testing
    5.3.4 Resampling
    5.3.5 Bayesian Methods
  5.4 The Information Theoretic Approach
    5.4.1 Chow and Liu Trees
    5.4.2 More general networks
  5.5 The Bayesian Approach
    5.5.1 Notation
    5.5.2 Known Structure
    5.5.3 Unknown Structure
    5.5.4 Prior Distributions
    5.5.5 Incomplete Data
  5.6 Confidence Measures on Structural Features

6 Influence Diagrams for Decision Making
  6.1 Decision Scenarios
  6.2 Influence Diagrams
  6.3 Solution of Influence Diagrams

7 Conclusions and Remarks
List of Figures

1.1 A Bayesian Network for the variables Party, Alcohol Consumption, Level of Co-ordination and Clarity of Speech.
1.2 The network in a) has a tree structure, b) is a polytree and c) contains a loop.
2.1 The four types of connections possible in a directed network: a) disconnected node, b) serial connection, c) diverging connection and d) converging connection.
2.2 E is an auxiliary variable used to represent the dependency relations of a) in a directed graph.
2.3 An alternate structure for the probability distribution over the variables given in Figure 1.1.
2.4 The network in a) is a Probabilistic Bayesian Network of a domain and the network in b) is a Bayesian Belief Network of the same domain.
3.1 A Bayesian Network over the two variables X and Y.
3.2 A section of a Bayesian Network showing the messages that node X receives from its neighbours and the messages that X sends out to its neighbours at each iteration.
3.3 A Bayesian Network model to predict a citizen's vote in an election.
3.4 A graphical representation of the Noisy OR-Gate. Only if a cause is present and its inhibitor is not acting will the event X occur.
3.5 The Markov Network of the directed network in a) is formed by removing the arrows on the links and adding a link between nodes which had a common child.
3.6 The formation of a join tree c) from the Bayesian Network in a). This allows the use of propagation methods for trees in what was originally a multiply connected network.
3.7 A simple Bayesian Network containing a loop.
3.8 Networks with the addition of query nodes.
4.1 A tree structure.
4.2 A Bayesian Network formed on the four variables Infection (I), Fever (F), Tired (T) and Alertness (A). The parent of Alertness is yet to be determined.
4.3 Equivalent network structures.
4.4 The networks which would be used for inference if the state of X3 in the networks in Figure 4.3 were fixed.
4.5 A Bayesian Network. If X4 is instantiated and X3 removed, then the path from X1 to X6 will be blocked.
4.6 The network formed based on the node ordering X1, ..., X5, with boundary strata as given.
4.7 A structural hierarchy. For example, C could represent a cluster of variables representing causes, D contain disease nodes and S contain nodes representing possible symptoms.
4.8 A Bayesian Network with hypothesis variable Glandular Fever and information variables Test Result, Thermometer Reading and Tiredness.
5.1 A Bayesian Network for which we wish to learn the parameters by the method of maximum likelihood.
5.2 Diagram illustrating the addition of dummy links to the network in a) to form the complete network b).
5.3 All structures on two variables.
6.1 A Bayesian Network with a decision node D and utility nodes U1, U2 and U3.
6.2 A network representing the two decisions which must be made in order to buy a movie ticket.
6.3 Here A represents the variable Attraction to boyfriend, M represents the variable Mood, and E represents the variable Enjoyment.
6.4 A chain of variables representing the decision scenario in Figure 6.2.
6.5 The decision scenario of Figure 6.2 represented as a network with added information and precedence links.
6.6 Influence Diagram representing the decision of which ticket to buy on a sequence of dates.
6.7 Compact representation of the Influence Diagram in Figure 6.6.
Signed Statement

This work contains no material which has been accepted for the award of any other degree or diploma in any university or other tertiary institution and, to the best of my knowledge and belief, due reference has been made in the text to material previously published or written by another person.

SIGNED: ....................... DATE: .......................
Abstract

Bayesian Networks are a representation of a probability distribution. They consist of two components, a graphical component and a probabilistic component. The graphical component encodes the dependence structure which exists between the variables in the domain, and the probabilistic component provides the remaining information about the joint probability distribution. Bayesian Networks are used primarily as a modelling tool to aid decision making, and in this thesis they are discussed in such a context.

In Chapter 2 we introduce Bayesian Networks in a more formal manner and look at three types in particular - Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. Chapter 3 then examines how inference in a Bayesian Network is performed and the process by which probabilities and information are propagated through the network. In Chapter 4 we take an information theoretic approach to examine the effect on a probability distribution of minor changes to the network structure and the best ways to incorporate new information about the variables. In these chapters we assume for the most part that the structure and probabilities of our network are known. In Chapter 5 we look at how to form a Bayesian Network for the domain of interest given that we have some data and/or prior knowledge. Both the Information Theory and Bayesian approaches are considered in some detail. Finally, in Chapter 6 we apply the theory of previous chapters to decision theory. In particular we discuss the formation of Influence Diagrams as an extension to Bayesian Networks and how these can be solved to give an optimal decision sequence.
Chapter 1
1.1 Introduction
Bayesian Networks provide a graphical representation and a probabilistic model of the relationships which exist among a set of variables. For example, the Bayesian Network in Figure 1.1 displays the association structure that exists between the four variables Party, Alcohol Consumption, Level of Co-ordination and Clarity of Speech. It shows the existence of an association between Party and Alcohol Consumption, but also represents the lack of direct association between attendance at a party and the level of one's co-ordination - this association is mediated by the variable Alcohol Consumption. The strength of association along each link is quantified by a set of conditional probabilities. Quantifying these relationships allows us to make inferences about the variables in the model. For example, knowing the state of Level of Co-ordination and Clarity of Speech allows us to reason about the most likely state of Alcohol Consumption and Party.
Figure 1.1: A Bayesian Network for the variables Party, Alcohol Consumption, Level of Co-ordination and Clarity of Speech.
The use of Bayesian Networks has increased significantly over the last decade as their suitability for aiding decision problems and for storing and updating information about the variables in some domain has been recognised and accepted in fields such as Defence, Management, Artificial Intelligence and Telecommunications. This boom in popularity has been aided by the large amount of research done in the early nineties on improving the efficiency of reasoning using Bayesian Networks, in conjunction with the increase in available computing power. During this period a broad scope of applications was realised, which contributed to the development of easy to use graphical interface software which is now widely available from many different companies.
In this thesis we consider several theoretical aspects of Bayesian Networks. Although Bayesian Networks are a tool for modelling and decision making, by taking a theoretical approach we hope to gain a solid understanding of the processes which affect the suitability of a Bayesian Network to model specific scenarios. For example, by gaining an understanding of how information is used by a network, and of the implications of its structure, we become more aware of the issues which are important when formulating a model, and of how the assumptions we make might affect the conclusions we draw. Addressing the theoretical aspects of how inference proceeds when using a Bayesian Network also gives us an understanding of the issues that are likely to arise when forming a model of a system, and how best to deal with them. It is hoped that the ideas in this thesis will provide the reader with a good understanding of Bayesian Networks and hence a solid basis from which one can start to model real-world scenarios.
Although the construction and implementation of a Bayesian Network as a tool for inference is unique, the theoretical basis for such networks brings together areas of mathematics such as graph theory, probability and statistics. In Section 1.2 we present some relevant ideas from these fields, knowledge of which will be assumed in later sections.
In Chapter 2 we introduce Bayesian Networks in a more formal manner and look at three types in particular - Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. Chapter 3 then examines how inference in a Bayesian Network is performed and the process by which probabilities and information are propagated through the network. In Chapter 4 we take an information theoretic approach to examine the effect on a probability distribution of minor changes to the network structure and the best ways to incorporate new information about the variables. In these chapters we assume, for the most part, that the structure and probabilities of our network are known. In Chapter 5 we look at how to form a Bayesian Network for the domain of interest given that we have some data and/or prior knowledge. Both the Information Theory and Bayesian approaches are considered in some detail. Finally, in Chapter 6 we apply the theory of previous chapters to decision theory. In particular we discuss the formation of Influence Diagrams as an extension to Bayesian Networks and how these can be solved to give an optimal decision sequence.
1.2 Preliminary Information

1.2.1 Graph Theory
1.2.1.1 Undirected Graphs

An undirected graph G consists of a set of nodes X and a set of edges E, where each element e of E is an unordered pair (X_i, X_j) and X_i and X_j are distinct elements of X. We say a graph is complete if there exists an edge between every distinct pair of nodes. A graph on n nodes is complete if and only if there are n(n-1)/2 distinct edges. The order of a node is the number of links with one end-point at that node. There is said to be a path from X_1 to X_k in G if there exist edges (X_1, X_2), (X_2, X_3), ..., (X_{k-1}, X_k) in E, where the X_i are all distinct. An undirected graph is said to contain a cycle if the end-points X_1 and X_k of a path coincide [11].
1.2.1.2 Directed Graphs

If the set of edges, E, of a graph consists of ordered pairs (X_i, X_j), then the graph D is called a directed graph. The direction of an edge is generally represented graphically by an arrow from node X_i to node X_j. A directed graph is termed acyclic if there are no directed cycles, that is, there does not exist a sequence of edges (X_1, X_2), (X_2, X_3), ..., (X_{k-1}, X_k) such that X_1 = X_k. Note that, although this definition appears identical to that for undirected graphs, here the direction of the links is important. As for undirected graphs, a directed graph is complete if there is an edge (in either direction) between every distinct pair of nodes. Although an undirected complete graph is unique, there exist many complete directed graphs over the set E, which can be obtained by reversing the direction of some or all of the links.

The parents of a node X_j are defined to be those nodes X_i such that (X_i, X_j) is in E. The children of a node X_j are defined to be those nodes X_k such that (X_j, X_k) is in E. Additionally, the ancestors of a node X_i in D are all nodes which are predecessors of X_i; that is, a node X_a is an ancestor of X_i if there exists a directed path from X_a to X_i.

A directed graph is a tree if every node has at most one parent. D is called a polytree, or is said to be singly connected, if there are no loops (undirected cycles).
For example, consider the directed graphs in Figure 1.2. The graph in Figure 1.2 a) is a tree, as each node has no more than one parent. The graph in Figure 1.2 b) is a polytree, and the graph in Figure 1.2 c) contains a loop, though not a (directed) cycle. In Figure 1.2 b), the node X_5 has parents {X_3, X_4} and ancestors {X_1, X_2, X_3, X_4}.
Figure 1.2: The network in a) has a tree structure, b) is a polytree and c) contains a loop.
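These graph-theoretic notions translate directly into a simple data structure. The following Python sketch is our own illustration (the node names match Figure 1.2 b), but the representation and the helper functions are not part of the thesis): a directed graph is stored as a mapping from each node to the set of its parents, from which children and ancestors can be recovered.

# A directed graph stored as a mapping from each node to the set of its parents.
# This encodes the polytree of Figure 1.2 b): X1 -> X3, X2 -> X3, X3 -> X5, X4 -> X5.
parents = {
    "X1": set(),
    "X2": set(),
    "X3": {"X1", "X2"},
    "X4": set(),
    "X5": {"X3", "X4"},
}

def children(graph, node):
    """Nodes that have `node` as a parent."""
    return {x for x, pa in graph.items() if node in pa}

def ancestors(graph, node):
    """All nodes from which there is a directed path to `node`."""
    result, frontier = set(), set(graph[node])
    while frontier:
        x = frontier.pop()
        if x not in result:
            result.add(x)
            frontier |= graph[x]
    return result

print(children(parents, "X3"))   # {'X5'}
print(ancestors(parents, "X5"))  # {'X1', 'X2', 'X3', 'X4'}

Here children() simply scans the parent sets, and ancestors() is a plain graph search; for the polytree of Figure 1.2 b) they return exactly the parent and ancestor sets quoted above.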
1.2.2 Probability Theory
In this thesis we use uppercase letters, X_i say, to represent random variables and the associated lowercase letter, x_i say, to denote that X_i is in the particular state x_i. All variables are assumed to be discrete and to have a finite number of mutually exclusive states. We assume that we are interested in the set of random variables U = {X_1, X_2, ..., X_n}, where U represents the set of all variables in the universe, or domain, and that the variables in U exist on the multi-dimensional state space 𝒰.

The joint probability distribution of the set of variables U is denoted P(X_1, X_2, ..., X_n) or P(U) and specifies the probability that {X_1 = x_1} ∩ {X_2 = x_2} ∩ ... ∩ {X_n = x_n} for all u = (x_1, x_2, ..., x_n) in 𝒰.

It will often be convenient to decompose the set of variables U into disjoint subsets. In general, we denote a set by an upper-case bold face letter, the state space of the variables in that set by the corresponding calligraphic letter, and a component of that state space by the lower case bold letter. For example, the variables in the set X can take configurations x ∈ 𝒳. When making a summation over all x ∈ 𝒳, say, we will often write Σ_x to mean Σ_{x ∈ 𝒳}, where the restriction of x to 𝒳 is taken to be implicit.

For any configuration u of the variables in U and subset X of U, say, we let X[u] be the components of u that correspond to the random variables in X. That is, X[u] is a vector in 𝒳 whose entries correspond to a particular subset of the entries in u.
Consider a subset of variables A ⊆ U which exists on the multi-dimensional state space 𝒜. We denote the joint probability distribution of the variables in A by P(A). The event {A = a} occurs when the variables in A are in configuration (state) a, and P(A = a) specifies the probability of this event. Hence

    P(A = a) = P( ∩_{X_i ∈ A} {X_i = X_i[a]} ),

and we will often abbreviate this to P(a).

The joint probability distribution of the variables in a subset A of U can be obtained from P(U) by the summation

    P(A = a) = Σ_{u ∈ 𝒰} P(u) I(A[u] = a),

for all a ∈ 𝒜, where I is the indicator function

    I = 1 if A[u] = a, and 0 otherwise.

P(A) is called the marginal probability distribution of A.
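As a concrete illustration of this summation, the short Python sketch below stores a joint distribution as a dictionary keyed by configurations u and sums out all variables not in A. The three-variable uniform joint is our own toy example, not one taken from the thesis.

import itertools

# A toy joint distribution P(U) over three binary variables, keyed by
# configurations u = (x1, x2, x3).  The values must sum to 1.
P = {u: 1.0 / 8 for u in itertools.product([0, 1], repeat=3)}
variables = ("X1", "X2", "X3")

def marginal(P, variables, A, a):
    """P(A = a) = sum over all u of P(u) * I(A[u] = a)."""
    idx = [variables.index(v) for v in A]
    return sum(p for u, p in P.items() if tuple(u[i] for i in idx) == a)

# Marginal probability of the event {X1 = 0, X3 = 1}.
print(marginal(P, variables, ("X1", "X3"), (0, 1)))  # 0.25 for the uniform joint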
The conditional probability of two sets of random variables A and B is given by

    P(A = a | B = b) = P({A = a} ∩ {B = b}) / P(B = b),    (1.1)

and the joint conditional distribution of A given {B = b} is denoted by P(A | B = b). This implies that

    P({A = a} ∩ {B = b}) = P(A = a | B = b) P(B = b).    (1.2)
As well as considering the marginal distributions of subsets of variables, we will also want to specify the joint distribution of the variables which belong to two different subsets, where variables in one subset may not be independent of the variables in the other. The joint probability distribution of all the variables which are contained in either A or in B, P(A ∪ B), specifies the probabilities P({A = a} ∩ {B = b}) for all a ∈ 𝒜 and b ∈ ℬ. Throughout this thesis the joint probability distribution of the random variables in the union of the sets A and B is denoted P(A, B).

The variables in A are said to be independent of the variables in B if and only if

    P({A = a} ∩ {B = b}) = P(A = a) P(B = b),

for all vectors a ∈ 𝒜 and b ∈ ℬ. Equivalently, as can be seen from (1.1), A and B are independent if and only if P(A = a | B = b) = P(A = a) for all a and b.
Several times in this thesis we shall make use of the following theorem.

Theorem: The chain rule for probability states that

    P(X_1, X_2, ..., X_n) = Π_{i=1}^{n} P(X_i | X_1, ..., X_{i-1}).

Proof. The result holds trivially for n = 1. Assume the result holds for n = k - 1. Then for n = k,

    P(X_1, X_2, ..., X_k) = P(X_k | X_1, X_2, ..., X_{k-1}) P(X_1, X_2, ..., X_{k-1}),   from (1.2)
                          = P(X_k | X_1, X_2, ..., X_{k-1}) Π_{i=1}^{k-1} P(X_i | X_1, ..., X_{i-1})
                          = Π_{i=1}^{k} P(X_i | X_1, X_2, ..., X_{i-1}),

where the second line follows from the assumption.
An important result that is needed in this thesis is Bayes' Rule. To derive this, we first rewrite (1.1) as

    P(B = b | A = a) = P({A = a} ∩ {B = b}) / P(A = a),

which gives

    P({A = a} ∩ {B = b}) = P(B = b | A = a) P(A = a).    (1.3)

Applying (1.3) in (1.2) gives

    P(B = b | A = a) P(A = a) = P(A = a | B = b) P(B = b),

which leads to Bayes' Rule,

    P(A = a | B = b) = P(B = b | A = a) P(A = a) / P(B = b),

for all a ∈ 𝒜 and b ∈ ℬ.
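Both the chain rule and Bayes' Rule are easy to verify numerically. The sketch below uses a toy two-variable joint distribution of our own choosing (not an example from the thesis) to factorise P(A, B) as P(A)P(B | A) and to recover P(A | B) via Bayes' Rule.

# Toy joint distribution over two binary variables A and B, keyed by (a, b).
P = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.45, (1, 1): 0.15}

P_A = {a: sum(p for (x, _), p in P.items() if x == a) for a in (0, 1)}
P_B = {b: sum(p for (_, y), p in P.items() if y == b) for b in (0, 1)}

# Chain rule: P(a, b) = P(a) * P(b | a).
P_B_given_A = {(b, a): P[(a, b)] / P_A[a] for a in (0, 1) for b in (0, 1)}
assert abs(P[(1, 0)] - P_A[1] * P_B_given_A[(0, 1)]) < 1e-12

# Bayes' Rule: P(a | b) = P(b | a) * P(a) / P(b) agrees with the direct ratio.
a, b = 1, 0
bayes = P_B_given_A[(b, a)] * P_A[a] / P_B[b]
direct = P[(a, b)] / P_B[b]
print(bayes, direct)  # both equal 0.45 / 0.55 = 0.818...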
Finally, the expected value of a random variable X, say, is defined as

    E[X] = Σ_{x ∈ 𝒳} x P(X = x).

It is generally considered to be the `average' value of that random variable.
1.2.3 Hypothesis Testing

`Statistical hypothesis testing is a formal means of distinguishing between probability distributions on the basis of random variables generated from one of the distributions' [33]. In general we assume that the data is generated from the distribution corresponding to our null hypothesis, and we seek to determine if there is enough evidence to reject the null hypothesis in favour of the alternative hypothesis. The hypotheses are generally denoted H_0 and H_a respectively. We refer to any formal method for determining whether to reject or accept the null hypothesis as a hypothesis test.
A hypothesis test is generally based on a test statistic, that is, some value which can be calculated from observed data. Assuming the null hypothesis to be true, it can be possible to calculate the sampling distribution of the test statistic, or, more likely, the asymptotic sampling distribution - the distribution of the test statistic as the size of the sample tends to infinity. If we then draw a sample and calculate the value of the test statistic for that particular sample, we can use the hypothesised distribution of the test statistic to calculate the probability of observing such a value given that the null hypothesis is true. The p-value is the probability of observing a value of the test statistic at least as extreme as the one obtained, assuming that the null hypothesis is true. If this is small then we have strong evidence against the null hypothesis. The level of significance of a test is the value chosen prior to having observed a sample such that the null hypothesis will be rejected if the p-value is below this level. For example, a p-value of 0.03 for a test conducted at the 5% level of significance will result in the null hypothesis being rejected, whereas a p-value of 0.06 will not. In the case where a p-value is greater than the level of significance, it is said that there is insufficient evidence to reject the null hypothesis.

A Type One error occurs when the test rejects the null hypothesis when it is in fact true, and a Type Two error occurs when the test fails to reject the null hypothesis when the alternative is true. The power of the test is the probability that the test will reject the null hypothesis when the alternative hypothesis is true.
If we are performing a test for the values of some parameters θ of our model, say, then intuitively this should somehow be based on how likely it is that the data we have observed were generated from the hypothesised model. Many hypothesis tests are based on the likelihood. Given that we have sampled N observations x_1, x_2, ..., x_N, if we assume the value of the parameters θ to be fixed, then the likelihood of the model is given by

    L(θ | x_1, x_2, ..., x_N) = P(x_1, x_2, ..., x_N | θ).

That is, the likelihood is the probability that the sample we observed was generated from the distribution defined by the parameters θ. Note that the likelihood is generally considered as a function of θ.
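For a sample of independent observations the likelihood is simply the product of the individual probabilities. The sketch below is our own Bernoulli example, chosen purely for illustration (it is not a model discussed in the thesis); it evaluates L(θ | x_1, ..., x_N) over a grid of values of θ and confirms that the sample proportion maximises it.

# Likelihood of a Bernoulli parameter theta for an observed binary sample.
data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # N = 10 observations, 7 successes

def likelihood(theta, xs):
    """L(theta | x1, ..., xN) = P(x1, ..., xN | theta) for independent draws."""
    L = 1.0
    for x in xs:
        L *= theta if x == 1 else (1.0 - theta)
    return L

grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda t: likelihood(t, data))
print(best)  # 0.7, the sample proportion, maximises the likelihood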
Chapter 2
Bayesian Networks
A Bayesian Network is a directed acyclic graph with the following properties:

- Each node represents a variable in the domain of interest. In this thesis we will assume all variables are discrete with a finite number of states (although a Bayesian Network can be formed on continuous variables, or a mixture of continuous and discrete variables, the theory is quite different to the discrete case). For example, in Figure 1.1 the variable Party would typically be an indicator variable with states yes and no. As the other three variables can be measured on a continuous scale, we typically break up this scale so that the variables are then ordinal. That is, they have a discrete number of states which can be ordered; for example, Level of Coordination and Alcohol Consumption may have states low, moderate and high.

- Each node has associated with it a table of conditional probabilities. The table of conditional probabilities at a node X, say, specifies the probability that X will be in state x given that its parents Pa(X) are in some configuration pa(X), for all states x of X and possible configurations of the variables in Pa(X). That is, the strength of the relationship between a node and its parent set is quantified by the conditional probability distributions P(X | Pa(X) = pa(X)), for all configurations pa(X). In the network in Figure 1.1, for example, the node Level of Coordination would have the conditional probability table P(Level of Coordination | Alcohol Consumption). For each of the states of Alcohol Consumption, this table will give the probability distribution over the states of Level of Coordination. If each variable has 3 states, as suggested above, there would be 9 entries in the table; a small sketch of such a table follows this list.
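A conditional probability table of this kind is conveniently stored as a mapping from parent configurations to distributions over the child's states. The following sketch is our own illustration; the state names follow the example above but the probability values are invented, since the thesis does not specify them.

# Conditional probability table P(Level of Coordination | Alcohol Consumption).
# Keys are states of the parent; each row is a distribution over the child's
# states (low, moderate, high) and must sum to 1.  The numbers are invented.
cpt_coordination = {
    "low":      {"low": 0.05, "moderate": 0.25, "high": 0.70},
    "moderate": {"low": 0.30, "moderate": 0.50, "high": 0.20},
    "high":     {"low": 0.80, "moderate": 0.15, "high": 0.05},
}

# 3 parent configurations x 3 child states = 9 entries, as in the text.
n_entries = sum(len(row) for row in cpt_coordination.values())
print(n_entries)  # 9

# Each row is a proper probability distribution.
assert all(abs(sum(row.values()) - 1.0) < 1e-12 for row in cpt_coordination.values())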
Bayesian Networks have been used and interpreted in a variety of contexts. They can be used as an efficient means of storing a probability distribution, or the arcs can be interpreted as having causal implication and the underlying probability distribution used simply as a tool for inference. In this chapter we first look at the relationship between the structure of a network and the probability distribution which it represents. We then discuss the formation and application of three types of Bayesian Network, namely Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks.
2.1 Independence Relations

The structure of a Bayesian Network represents the independence relations which exist between the variables. By understanding the independence relations encoded by the structure we are able to exploit them to make Bayesian Networks an efficient tool for the storage and retrieval of information. Here we look at the connection between structure and independence in more detail.

Let G be a graph, let X, Y and Z be three disjoint sets of variables from the set of all variables in the universe U, and let P represent a joint probability distribution over U. We use M to represent some dependency model on U, where a dependency model can be thought of as a rule which determines all triplets (X, Y, Z) for which the assertion `X is independent of Y given Z' is true. If so, we denote this assertion by I(X, Z, Y)_M, where, if the context is clear, the subscript M may be omitted.
The representation of independence relations in a directed graph is associated with the concept of d-separation. This concept is best illustrated by considering the types of connection possible at a node, and the associated implications of independence. The four types of connection possible at a node, B say, are shown in Figure 2.1.

Figure 2.1: The four types of connections possible in a directed network: a) disconnected node, b) serial connection, c) diverging connection and d) converging connection.

Figure 2.1 a) shows the trivial case; when B is not connected to any other node, B is considered to be independent of the remaining variables in the network. For the other cases, suppose we wish to know the conditions necessary to block the flow of information from node A to node C, that is, to render A and C independent. From Figure 2.1 b) it can be seen that if we know the state of B then no further information can be obtained at C by knowing the state of A; that is, knowing the state of B blocks the flow of information from A to C. A serial connection hence represents the notion that A is conditionally independent of C given B, and we say B d-separates A from C. Likewise in Figure 2.1 c), if we know the state of B then no information can be passed between A and C, and so B d-separates A from C at a diverging connection also.

In the case of a converging connection, shown in Figure 2.1 d), if we have no information about B then A cannot receive information about C and so A and C are independent. However, if the state of B is known and we have knowledge about the state of A, say, then this will have an effect on our belief in the state of C. For example, if A represents the variable Income, C the variable Number of Dependents and B the variable Spending on Leisure, knowing the state of Income does not give us any information about the possible number of dependents. However, if we know that a household has a moderately large expenditure on leisure, knowing there are many dependents would imply that it is likely Income is quite large. In the case of a converging connection then, B does not d-separate A from C.
More formally, if X, Y and Z are three disjoint subsets of nodes in a directed acyclic graph D, then Z is said to d-separate X from Y, denoted ⟨X | Z | Y⟩_D, if there is no path between a node in X and a node in Y for which the following conditions both hold [30]:

1. every node with a converging connection is in Z or has a descendant in Z, and
2. every other node is outside Z.

Further, we say D is a D-map of M if

    I(X, Z, Y)_M ⟹ ⟨X | Z | Y⟩_D,

that is, if every relation of conditional independence in the model is represented in the graph by an instance of d-separation. D is called an I-map of M if

    ⟨X | Z | Y⟩_D ⟹ I(X, Z, Y)_M,

that is, the d-separations of the graph correspond to conditional independencies in the model. D is termed a minimal I-map of M if it is an I-map of M and would no longer be an I-map of M if any link were removed. Note that a graph with fewer links will imply more d-separations and so more assertions of independence. If a link were removed from a minimal I-map then this would create a d-separation which does not correspond to an assertion of conditional independence in the model. If D is both a D-map and an I-map, we say D is a perfect map.

There does not necessarily exist a directed acyclic graph that is a perfect map of some distribution P. If a network is an I-map, then the d-separation properties of the network correspond to conditional independencies of the domain. Although not every conditional independency will necessarily be represented in the network, a network which is a minimal I-map will be as `close' to a D-map as possible, that is, the number of independencies not represented by the network will be minimised.
Given that we have a network known to be an I-map of a model, there exist algorithms for deducing the independence relations encoded in the structure. For example, Geiger, Verma and Pearl [15] present, among others, an algorithm which takes as input two sets of nodes X and Z and returns a set of nodes which contains sufficient information to compute P(X | Z). This is equivalent to identifying those nodes that are not d-separated from X, given that we know the states of the variables in Z. A sketch of one standard way of testing d-separation is given below.
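One common way to test whether Z d-separates X from Y (not the algorithm of [15], but an equivalent moral-graph criterion) is to restrict the graph to X ∪ Y ∪ Z and their ancestors, `moralise' it by adding a link between every pair of nodes with a common child and dropping the edge directions, and then check whether every path from X to Y in the resulting undirected graph passes through Z. The Python sketch below is our own illustration of this criterion, reusing the parent-set representation introduced in Section 1.2.

def d_separated(parents, X, Y, Z):
    """Test whether Z d-separates X from Y in the DAG given by `parents`
    (a dict mapping each node to the set of its parents), using the
    moral-graph criterion on the ancestral subgraph of X, Y and Z."""
    X, Y, Z = set(X), set(Y), set(Z)

    # 1. Keep only X, Y, Z and their ancestors.
    relevant, frontier = set(), set(X | Y | Z)
    while frontier:
        n = frontier.pop()
        if n not in relevant:
            relevant.add(n)
            frontier |= parents[n]

    # 2. Moralise: marry co-parents, then drop edge directions.
    neighbours = {n: set() for n in relevant}
    for n in relevant:
        pa = parents[n] & relevant
        for p in pa:
            neighbours[n].add(p)
            neighbours[p].add(n)
        for p in pa:
            for q in pa:
                if p != q:
                    neighbours[p].add(q)

    # 3. Search for a path from X to Y that avoids Z.
    seen, stack = set(), list(X - Z)
    while stack:
        n = stack.pop()
        if n in Y:
            return False          # found an unblocked path
        if n not in seen:
            seen.add(n)
            stack.extend(neighbours[n] - Z - seen)
    return True

# The serial connection A -> B -> C of Figure 2.1 b): B blocks A from C.
chain = {"A": set(), "B": {"A"}, "C": {"B"}}
print(d_separated(chain, {"A"}, {"C"}, {"B"}))   # True
print(d_separated(chain, {"A"}, {"C"}, set()))   # False

# The converging connection A -> B <- C of Figure 2.1 d).
collider = {"A": set(), "B": {"A", "C"}, "C": set()}
print(d_separated(collider, {"A"}, {"C"}, set()))  # True
print(d_separated(collider, {"A"}, {"C"}, {"B"}))  # False

The two small tests reproduce the behaviour described for Figure 2.1: knowing B blocks the serial connection, while conditioning on the converging node B opens the path between A and C.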
2.2 Probabilistic Bayesian Networks

Given a probability distribution P on U, a directed acyclic graph D is a Probabilistic Bayesian Network of P if and only if D is a minimal I-map of P [30, pp 119].

Given any probability distribution P it is possible to construct an undirected graph G which is an I-map of P [30]. Although the equivalent statement does not hold for directed graphs, with the use of auxiliary variables it is possible to represent any dependency model expressed by an undirected graph G in a directed acyclic graph [30, pp 130]. For example, the graph in Figure 2.2 a) asserts the two independence relationships I(A, {B, D}, C) and I(B, {A, C}, D). In Figure 2.2 b), D and B are both serial connections and so I(A, {B, D}, C) is true also. However, as C has a converging connection, it does not d-separate B and D, hence the relation I(B, {A, C}, D) does not hold in this directed graph. However, if we introduce a fifth (auxiliary) variable E, as in Figure 2.2 c), then the connection at C is serial and so now I(B, {A, C}, D) and the two independence relations of the original undirected network in Figure 2.2 a) hold.
Figure 2.2: E is an auxiliary variable used to represent the dependency relations of a) in a directed graph.
Given that we have some distribution P over a set of variables, we require a method of assigning the appropriate links between the nodes in the corresponding network. The following theorem allows us to develop such a method.
Theorem 1: A necessary and sufficient condition for D to be a Probabilistic Bayesian Network of P is that each variable X be conditionally independent of all its ancestors given its parents Pa(X), and that no proper subset of Pa(X) satisfies this condition [30, pp 120]. (A proof of this theorem can be found in [30].)
Consider some ordering (X_1, X_2, ..., X_n) for the variables in U. To form a Bayesian Network with respect to this ordering, we specify that the ancestors of a node X_i must be contained in the set U(i) = {X_1, X_2, ..., X_{i-1}}. If we let B_i ⊆ U(i) be a minimal set satisfying I(X_i, B_i, U(i) \ B_i), then by Theorem 1 any directed acyclic graph formed by designating B_i as the parents of X_i, that is, setting Pa(X_i) = B_i, is a Probabilistic Bayesian Network of P.

We refer to the ordered set of subsets {B_1, B_2, ..., B_n} as the boundary strata of M relative to the given ordering. Note that if a Bayesian Network is formed by the above method, then

    P(X_i | X_1, X_2, ..., X_{i-1}) = P(X_i | Pa(X_i)).    (2.1)
If we choose an alternate ordering for the variables, the boundary strata, and hence the resulting network, will differ. Hence there are many possible structures which can represent the same probability distribution once the conditional distributions have been specified. However, although every such network is a minimal I-map of P, some orderings may result in a structure which represents more of the conditional independencies than others. The ordering chosen can be the difference between a complete graph, which requires the largest number of entries in the conditional probability tables, and a more efficient representation. Ideally, when forming a Bayesian Network we would like to encode as many independencies as possible.

For example, consider the variables represented in Figure 1.1 and suppose the independence relations

    I(LC, AC, P),   I(CS, AC, P),   I(LC, AC, CS)

are judged to hold, where we have abbreviated the variable names to the first letter of each word. Under the ordering (P, AC, LC, CS) the network in Figure 1.1 would result. If we had instead used the ordering (LC, CS, AC, P), we would obtain the network in Figure 2.3.
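The boundary-strata construction can be written down almost verbatim, provided we have some oracle that answers conditional-independence queries (in practice this would come from the distribution P itself, or from domain judgements such as those above). The sketch below is our own illustration: `independent(x, given, other)` is a hypothetical callback standing in for such queries, and a minimal set B_i is found by greedily discarding redundant predecessors, which is one simple way of realising "a minimal set B_i".

def boundary_strata(ordering, independent):
    """Form Pa(X_i) = B_i for each X_i, where B_i is a minimal subset of the
    predecessors U(i) with X_i independent of U(i) \ B_i given B_i.
    `independent(x, given, other)` is a hypothetical oracle answering
    conditional-independence queries about the distribution P."""
    parents = {}
    for i, x in enumerate(ordering):
        predecessors = set(ordering[:i])
        B = set(predecessors)
        # Greedily discard predecessors that are not needed in the conditioning set.
        for b in sorted(B):
            candidate = B - {b}
            if independent(x, candidate, predecessors - candidate):
                B = candidate
        parents[x] = B
    return parents

# Oracle encoding the judgements I(LC, AC, P), I(CS, AC, P) and I(LC, AC, CS)
# for the Party example (a stand-in for real queries against P).
def independent(x, given, other):
    if x == "LC":
        return "AC" in given and not (other - {"P", "CS"})
    if x == "CS":
        return "AC" in given and not (other - {"P", "LC"})
    return not other   # P and AC: independent only of the empty set

print(boundary_strata(["P", "AC", "LC", "CS"], independent))
# {'P': set(), 'AC': {'P'}, 'LC': {'AC'}, 'CS': {'AC'}}  -- the network of Figure 1.1

Running the same routine with the ordering (LC, CS, AC, P) and a correspondingly extended oracle would instead give the larger parent sets of Figure 2.3.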
Figure 2.3: An alternate structure for the probability distribution over the variables given in Figure 1.1.

The joint probability distribution over the domain can be computed most efficiently by making use of the conditional independencies encoded in the network. Beginning with
the chain rule, we have

    P(X_1, X_2, ..., X_n) = P(X_1) P(X_2 | X_1) P(X_3 | X_2, X_1) ... P(X_n | X_{n-1}, ..., X_1)
                          = Π_i P(X_i | Pa(X_i)),    (2.2)

where the second line follows from the relation (2.1). Hence the joint distribution can be obtained from the distributions of each node conditioned on its parents. Ordinarily, to store the joint distribution over the variables X_1, X_2, ..., X_n we would need to store the probability P(x_1, x_2, ..., x_n) for one less than the Π_{i=1}^{n} |X_i| possible configurations of the variables, where we have used the fact that the probabilities must sum to 1, and used |X_i| to denote the number of states of X_i. In a Bayesian Network the size of the conditional probability table required at a node X_i will depend on the number of parents of X_i and the number of states of X_i and its parents. For some fixed configuration of the parents of X_i, we know that the sum over the probabilities that X_i is in some state k is equal to 1. Therefore the number of entries required to define the conditional probability tables, and hence, from (2.2), the joint probability distribution, is

    Σ_{i=1}^{n} (|X_i| - 1) Π_{X_k ∈ Pa(X_i)} |X_k|.
This can be considerably less than the previous expression. For example, consider again the networks in Figures 1.1 and 2.3. In Figure 1.1, assuming Party has 2 states and the other variables 3 states, the conditional probability table at Alcohol Consumption has 3 × 2 entries and those at Level of Coordination and Clarity of Speech have 3 × 3 entries. Hence this network requires the specification of 2 + 6 + 9 + 9 = 26 probabilities. In the network of Figure 2.3, as there are 3 × 3 possible configurations for the parents of Alcohol Consumption, the conditional probability table at this node has 3 × 9 entries. In total there are (3 × 1) + (3 × 3) + (3 × 9) + (2 × 3) = 45 probabilities to be specified, and so this network, formed on a different ordering, is clearly less efficient than the original network. However, as neither is a complete graph, they are both more efficient than storing the joint probability distribution. This would require storage of 2 × 3^3 − 1 = 53 values.
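The table sizes quoted above are simple to check by machine. The sketch below is our own helper: it counts raw conditional-probability-table entries for the two orderings (the displayed formula additionally exploits the fact that each table column sums to one) and compares them with the full joint distribution.

from math import prod

states = {"P": 2, "AC": 3, "LC": 3, "CS": 3}

def table_entries(parents):
    """Total number of CPT entries: for each node, its number of states times
    the product of its parents' state counts (raw table entries, as counted
    in the example above)."""
    return sum(states[x] * prod(states[p] for p in pa) for x, pa in parents.items())

# Ordering (P, AC, LC, CS): the network of Figure 1.1.
fig_1_1 = {"P": [], "AC": ["P"], "LC": ["AC"], "CS": ["AC"]}
# Ordering (LC, CS, AC, P): the network of Figure 2.3.
fig_2_3 = {"LC": [], "CS": ["LC"], "AC": ["LC", "CS"], "P": ["AC"]}

print(table_entries(fig_1_1))        # 2 + 6 + 9 + 9 = 26
print(table_entries(fig_2_3))        # 3 + 9 + 27 + 6 = 45
print(prod(states.values()) - 1)     # 53 values for the raw joint distribution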
A complete Bayesian Network will have no storage advantages. The more independence relations that are encoded in the structure of the network, the more efficient it becomes. Often relations of conditional independence can be forced to hold so that an efficient approximation to the underlying probability distribution is obtained. Such methods are presented in Section 4.3 and Chapter 5.

As well as facilitating efficient storage, Bayesian Networks also have computational advantages. For example, if a marginal distribution is desired, then depending on the ordering of the variables its retrieval may require far fewer operations than summing over all of the variables, as would be necessary to obtain the marginal distribution from the joint distribution. In Chapter 3 we see how this allows for efficient updating of the probability distribution when information about the state of a variable has been received.
2.3 Bayesian Belief Networks
Figure 2.4: The network in a) is a Probabilistic Bayesian Network of a domain and the network in b) is a Bayesian Belief Network of the same domain.
Consider the Probabilistic Bayesian Network in Figure 2.4 a) for the probability distribution P^a(X_1, X_2, ..., X_n), which indicates that X_{n-1} is independent of the remaining variables. Suppose in forming a Bayesian Network for P^a we make X_n the child of X_{n-1}, as depicted in Figure 2.4 b), to represent our (incorrect) belief in a direct relationship between these two variables. Let Pa^a(X_n) be the set {X_1, ..., X_{n-2}} and Pa^b(X_n) the set Pa^a(X_n) ∪ {X_{n-1}}. From Figure 2.4 a) we know

    I(X_n, {X_1, ..., X_{n-2}}, {X_1, ..., X_{n-1}} \ {X_1, ..., X_{n-2}}),

and so the boundary stratum for node X_n is B_n = {X_1, ..., X_{n-2}}. In Figure 2.4 b), since Pa(X_n) ≠ B_n, this is not a minimal I-map and so by definition is not a Probabilistic Bayesian Network. However, this network is still a Bayesian Network.

If we know that X_{n-1} is independent of the remaining variables, and define the probabilities P(X_1), ..., P(X_{n-1}) identically in a) and b), then

    P^b(X_1, ..., X_n) = P^b(X_n | Pa^b(X_n)) Π_{i=1}^{n-1} P(X_i | Pa^b(X_i))
                       = P^b(X_n | Pa^a(X_n), X_{n-1}) Π_{i=1}^{n-1} P(X_i),

as the nodes X_1, ..., X_{n-1} have no parents. Hence, as X_n is independent of X_{n-1} in the true model,

    P^b(X_1, ..., X_n) = P^b(X_n | Pa^a(X_n)) Π_{i=1}^{n-1} P(X_i)
                       = P^a(X_1, ..., X_n).

However, in b) the link X_{n-1} → X_n was created because of our belief in a direct relationship between the variables, hence the conditional probabilities we assign to X_n, P^b(X_n | Pa^b(X_n)), will not be equal to P^a(X_n | Pa^a(X_n)) as assumed above. That is, the joint probability distributions will not be equal.
The network formed by the introduction of X_{n-1} → X_n is an example of a Bayesian Belief Network for P. Bayesian Belief Networks are Bayesian Networks formed on a person's beliefs about causal relationships and conditional independencies. When we form a Bayesian Belief Network we are trying to use our knowledge to recreate the underlying but unknown probability distribution as accurately as possible.

The information we have about P is often organised to give an intuitive understanding of the major relationships between variables and the constraints in the domain. The parents of a node X_i are hence those variables we believe have a direct influence on X_i. Often the variables allocated as parents of X_i could be considered to be causes of X_i, though in a Bayesian Belief Network, unlike a Causal Bayesian Network, a causal interpretation is not necessary to allocate a variable as a parent of X_i.
2.4 Causal Bayesian Networks
Causal Bayesian Networks are similar to Bayesian Belief Networks except for the philosophy behind their construction. We form a Bayesian Belief Network based on our beliefs in an attempt to model the underlying probability distribution. When we form a Causal Bayesian Network, based on our notions of causation, we are trying to replicate the human system of reasoning. This is commonly believed to be based on a causal structure [31]. In a Causal Bayesian Network it is not important whether the network represents the underlying distribution, so long as it is an accurate model of how we reason about the system. This means we can use the network as a model and tool for decision making and inference, based on the reasoning of the expert who compiled it.
Networks formed in this manner have several useful features. Firstly, if a causal relation is no longer believed to hold, we can represent this by simply removing the relevant link. Additionally, Causal Bayesian Networks allow one to model the effect of interventions. An intervention occurs when a variable is set to a particular state; for example, a switch may be set to ON. This is fundamentally different to what we will refer to as an instantiated variable, which is a variable whose state is known and is in that particular state because of the influence of other variables in the network. To model an intervention at a node we simply remove the links to that node from its parents and then treat the node as instantiated. Note that it is the particular topology and causal assumptions that result in these features. The causal semantics of Bayesian Networks and interventions are covered extensively in [31].
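On the parent-set representation used earlier, the graph surgery just described is a one-line operation. The sketch below is our own illustration (the node names and the helper are invented); it removes the incoming links of the intervened node and records its forced state, after which inference would proceed with the node treated as instantiated.

def intervene(parents, evidence, node, state):
    """Model an intervention do(node = state): cut the links from the node's
    parents and treat the node as instantiated in the given state."""
    new_parents = {x: set(pa) for x, pa in parents.items()}
    new_parents[node] = set()        # the node no longer listens to its parents
    new_evidence = dict(evidence)
    new_evidence[node] = state       # ...and is fixed in the chosen state
    return new_parents, new_evidence

# A switch whose position is normally caused by a controller.
parents = {"Controller": set(), "Switch": {"Controller"}, "Light": {"Switch"}}
parents, evidence = intervene(parents, {}, "Switch", "ON")
print(parents["Switch"], evidence)   # set() {'Switch': 'ON'}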
Summary

In Section 2.1 we discussed the representation of independence relations in a directed graph. We then defined a Bayesian Network as a model with two components - the graphical component (a directed graph) specifies the independence relations which hold between the variables in the domain, and the conditional probabilities quantify the relationships which are present between the variables. Any model which satisfies these two components is a Bayesian Network. In Sections 2.2, 2.3 and 2.4 we introduced three specific types of Bayesian Networks. These are each Bayesian Networks which have additional constraints on either their construction or interpretation. A Probabilistic Bayesian Network has the added constraint that the network must be a minimal I-map of the domain to be represented, a Bayesian Belief Network has only the constraint that we believe the specified structure (and associated independence implications) and conditional probabilities to be correct, and a Causal Bayesian Network has the constraint that the parents of a node are its direct causes. Which type of Bayesian Network is used depends on the specific modelling task at hand.
Chapter 3
Information Propagation
Consider a Bayesian Network which represents a probability distribution over the variables in the domain U. If we observe the present state of a variable, this may give us information about the likely state of the other variables in the network. For example, consider the network on two binary variables X and Y, as shown in Figure 3.1. If we observe that
Figure 3.1: A Bayesian Network over the two variables X and Y, with prior probabilities P(X = 0) = P(X = 1) = 0.5 and conditional probabilities P(Y = 0 | X = 0) = 0.1, P(Y = 1 | X = 0) = 0.9, P(Y = 0 | X = 1) = 0.95 and P(Y = 1 | X = 1) = 0.05.
Y is in state 1, we can use this information to update our beliefs about X. In this case, intuitively, we would expect it to be more likely that X is in state 0, and we would update our beliefs accordingly.
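This intuition can be checked directly from the numbers in Figure 3.1. The short sketch below is our own calculation, using only Bayes' Rule from Section 1.2.2; it also recovers the prior belief in Y that is quoted in Section 3.1.

# Prior and conditional probabilities from Figure 3.1.
P_X = {0: 0.5, 1: 0.5}
P_Y_given_X = {0: {0: 0.1, 1: 0.9}, 1: {0: 0.95, 1: 0.05}}  # P_Y_given_X[x][y]

# Before any evidence, the belief in Y is its marginal distribution.
bel_Y = {y: sum(P_X[x] * P_Y_given_X[x][y] for x in P_X) for y in (0, 1)}
print(bel_Y)  # {0: 0.525, 1: 0.475}

# After observing Y = 1, update the belief in X by Bayes' Rule.
joint = {x: P_X[x] * P_Y_given_X[x][1] for x in P_X}
bel_X = {x: joint[x] / sum(joint.values()) for x in P_X}
print(bel_X)  # X = 0 becomes much more likely: about {0: 0.947, 1: 0.053}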
In this chapter we look at how to systematically update our beliefs, or the probability distributions at each node, given that we have received some information about the state of a variable or variables. In Section 3.1 we introduce and justify our probabilistic approach to updating beliefs. In Section 3.3 we introduce a belief updating mechanism for the particular case where the network is a tree, and then extend this to singly connected networks and networks containing loops respectively. Finally, in Section 3.4, we look at how the propagation scheme can be used to obtain answers to queries about the states of the variables in the network.
3.1 Belief Distributions
Given that we have a Bayesian Network and have observed the state of some variable, it would be possible to update our beliefs about the remaining variables based on our intuition. In this chapter we propose as an alternative a systematic method based on the rules of probability. Here we address why we should use a propagation mechanism to update our beliefs, instead of relying on the intuition which allowed us to initialise the network, and justify our assumption that beliefs should obey the rules of probability.

In the context of belief networks, after having initialised the network, that is, having specified the links and the necessary conditional probabilities, we have a belief distribution at each node. Hence our belief in X, BEL(X), is specified by the values BEL(X = x) for all states x of X, defined such that BEL(X = x) ∈ [0, 1] for all x and Σ_x BEL(x) = 1. The Probabilistic Bayesian Network analogue is the marginal probability distribution of X. In the example in Figure 3.1, BEL(X) = (0.5, 0.5) and BEL(Y) = (0.525, 0.475). Note that to initialise these beliefs we are required only to give the conditional probability table at each node and so make only local assessments of the strength of relation between variables. In a large network we may receive some information about the state of a variable, and are then required to update the beliefs of nodes which may be far from where the information was observed. To do this subjectively requires us to consider the relationships between variables over the entire network. The complexity of this task for a reasonably sized network is well beyond what any mind is capable of processing without making simplifications or resorting to stored generalisations. By using a systematic procedure to update our beliefs using the information in the network, we are able to remove this highly subjective task. Another advantage of propagating information in this manner is that a network can be compiled with expert knowledge and a non-expert can then enter information and obtain a conclusion based on the expert knowledge embedded in the system.
The question as to why we can treat beliefs in the same way as probabilities has been addressed extensively in many fields such as Management, Psychology and Decision Theory. Given that we assign numerical values to beliefs, on choosing a scale from 0 to 1 it seems reasonable to assume

    BEL(X = x) = 0 if we are certain X ≠ x, and
    BEL(X = x) = 1 if we are certain X = x,

though assigning values to intermediate degrees of belief is not straightforward. In general, suggested methods usually involve some form of comparison between the unknown belief and one's belief in an event which is known to occur with a certain probability. Several researchers have developed rules or axioms that they feel beliefs should follow, and in each case these rules imply that beliefs should follow the rules of probability [18].
3.2 Concepts and Notation

When using a Belief Network in a decision scenario we are generally interested in only a few variables. The variables of interest are referred to as hypothesis variables and their nodes as hypothesis nodes. We base our decisions on the belief distribution at these nodes. Our goal is to determine the belief distribution at the hypothesis nodes after having received some information or evidence e. Nodes at which we receive evidence are called information or evidence nodes. Evidence will typically be in the form of an observation which confirms the state of the variable. Once evidence has been received the distribution of that node is fixed and we say that the variable is instantiated. Evidence nodes are typically leaf nodes and hypothesis nodes are often root nodes.

For example, consider the network in Figure 1.1. If a parent observed their child's low level of coordination on returning home late one night, then this observation can be entered as evidence. This can be done by instantiating the variable Level of Coordination such that P(Level of Coordination | e) = (1, 0, 0)^T.

In this chapter we focus on updating the belief distribution at some node X which has r states. X has k children Y_1, Y_2, ..., Y_k and m parents V_1, V_2, ..., V_m, so that Pa(X) = {V_1, V_2, ..., V_m}. The universe U consists of all variables in the network.

The belief distribution at X is represented as an r-vector BEL(X). The computations will often be in the context of updating a single entry of this vector, BEL(X = x), which we will often abbreviate to BEL(x). Similarly, λ(x) and π(x) are scalars, whilst λ(X) and π(X) are vectors consisting of the r values λ(x) or π(x) respectively.

We now look at how we can update our belief distributions at the hypothesis nodes given the information we have received.
3.3 Belief Propagation
The object of belief propagation is to calculate, for each variable X,

    BEL(X) = P(X | e),

where e is the evidence available on which we can update our beliefs. This is sometimes called the posterior distribution of X, as it is the distribution at X after having taken the evidence into account.

To facilitate information propagation throughout the network we break the task down into a series of local propagations. That is, we use an iterative procedure where at each stage a node combines the information available from its neighbours with information from the previous step and then sends this updated information back to its neighbours. Under certain conditions, which are discussed later, this procedure will converge and the distribution at each node will be equal to the posterior distribution P(X | e).
The evidence X receives is split into two disjoint subsets:

    e^+_X - the evidence X receives from its parents, and
    e^-_X - the evidence X receives from its children.

In networks which don't contain loops we can take e^-_X and e^+_X to be independent. These can be split further into e^-_X(Y_j) and e^+_X(V_l) to represent evidence received at X from child Y_j and parent V_l respectively. Note that we are never required to quantify these terms explicitly, but they are used notationally to indicate the source of the information being used in a computation. This allows us to keep track of what information has already been taken account of in updating a belief, so that no information is counted more than once.

If X is a root node then, as X has no parents, we initialise P(X | e^+_X) as P(X). If X is a leaf node and an evidence node we initialise P(X | e^-_X) to reflect the evidence; for example, if X has the states x_1 and x_2 and we know that X = x_2, then P(X | e^-_X) = (0, 1)^T. If X is a leaf node and is not instantiated then we let P(X | e^-_X) assign equal probability to each state of X. This reflects the fact that we have no evidence to suggest that X is more likely to be in one state than any other.
inuene in general tends to follow the diretion of the arrows. Thus
P (X = xje+ )
(x) =
X
is sometimes alled the preditive support for X , as the information obtained from the
parents of X will have a large inuene on our belief in the state of X .
P (e jX = x)
(x) =
X
is a measure of the retrospetive support for X = x, that is the probability that X would
be reeiving suh information from its hildren given that X is in state x. If this is very
large then intuitively we would inrease our belief in X = x.
The loal messages that are sent between neighbouring variables are denoted X (vl )
and Yj (x).
X (vl ) = P (eVl (X )jVl = vl )
is the information sent from X to a parent Vl , and
Yj (x) = P (X = xje+Yj )
20
3.3.
BELIEF PROPAGATION
is the message sent from node X to hild Yj . The messages that are sent between neighbours at eah iteration are shown in Figure 3.2.
Figure 3.2: A section of a Bayesian Network showing the messages that node X receives from its neighbours and the messages that X sends out to its neighbours at each iteration.
In general, we have that

    BEL(x) = P(x | e^+_X, e^-_X)
           = P(x | e^+_X) P(e^-_X | e^+_X, x) / P(e^-_X | e^+_X),   by Bayes' rule
           = P(x | e^+_X) P(e^-_X | x) / P(e^-_X),   as e^-_X and e^+_X are independent given x
           = α λ(x) π(x),    (3.1)

where α is a normalisation constant. This gives us an expression for updating our belief given that we know λ(x) and π(x).
3.3.1 Propagation in Trees

In a tree, a node X can have at most one parent, V. To see how λ(x) can be obtained, first suppose X has children Y_1 and Y_2. Then the support for X = x given the information available from its children is

    λ(x) = P(e^-_X | X = x)
         = P(e^-_X(Y_1), e^-_X(Y_2) | x)
         = P(e^-_X(Y_1) | x) P(e^-_X(Y_2) | x),

as, in a tree, the evidence received from the subtree rooted at Y_1 is independent of the evidence received from the subtree rooted at Y_2. In general, given that

    λ_{Y_j}(x) = P(e^-_X(Y_j) | X = x),

independence of the subtrees rooted at the children of X implies that

    λ(x) = Π_{j=1}^{k} λ_{Y_j}(x),    (3.2)

where X has children Y_1, Y_2, ..., Y_k. Hence we can calculate λ(x) at X given that we have received the messages λ_{Y_j}(x) from each child Y_j of X.
To calculate BEL(x) we also need π(x). Consider the following expansion:

    π(x) = P(x | e^+_X)
         = Σ_v P(x | e^+_X, v) P(v | e^+_X)   (conditioning on V)
         = Σ_v P(x | v) P(v | e^+_X)    (3.3)
         = Σ_v P(x | v) π_X(v),    (3.4)

where (3.3) follows from the fact that, in a tree, all the information X receives from its ancestors is contained in V, as V d-separates X from all other ancestors of X. The probabilities P(x | v) can be obtained from the conditional probability table for node X. If we let the entries of this table correspond to the entries in the matrix M_{X|V}, say, so that [M_{X|V}]_{ij} = P(X = x_j | V = v_i), then we can write

    π(X) = M^T_{X|V} π_X(V).    (3.5)
We are now able to calculate BEL(X) by (3.1), given that we have received the relevant messages from the neighbours of X. In order to facilitate information propagation, X must combine these messages with its current information to send out updated messages to its neighbours. X will send the message λ_X(V) to its parent V and π_{Y_j}(X) to each child Y_j.

The information X sends to its parent is

    λ_X(v) = P(e^-_V(X) | V = v)
           = Σ_x P(x, e^-_V(X) | v)
           = Σ_x P(e^-_V(X) | v, x) P(x | v)
           = Σ_x P(e^-_X | v, x) P(x | v),

as, because we are considering propagation in a tree structure, the evidence that V receives from X must have come from the children of X, assuming that, since evidence nodes can only be leaf nodes, no evidence is observed at X itself. Then, as X d-separates its descendants from its parent V,

    λ_X(v) = Σ_x P(e^-_X | x) P(x | v)
           = Σ_x λ(x) P(x | v),    (3.6)

which we can write as

    λ_X(V) = M_{X|V} λ(X).
The message $X$ must send to child $Y_j$ should include the evidence $X$ has obtained from its parent as well as from its other children. If we let $e^-_X(Y^{(j)})$ denote the combined evidence received at $X$ from all children other than $Y_j$, we can then write
$$\begin{aligned} \pi_{Y_j}(x) &= P(x \mid e^+_{Y_j}) \\ &= P(x \mid e^+_X, e^-_X(Y^{(j)})) \\ &= \alpha\, P(e^-_X(Y^{(j)}) \mid x, e^+_X)\, P(x \mid e^+_X), \quad \text{by Bayes' Rule} \\ &= \alpha\, P(e^-_X(Y^{(j)}) \mid x)\, P(x \mid e^+_X), \quad \text{as } X \text{ d-separates } Y^{(j)} \text{ from the ancestors of } X \\ &= \alpha\, \lambda_{Y^{(j)}}(x)\, \pi(x), \end{aligned}$$
where $\alpha$ is a normalising constant. As the evidence $X$ receives from each child is independent,
$$\lambda_{Y^{(j)}}(x) = P(e^-_X(Y^{(j)}) \mid x) = \prod_{\substack{l=1 \\ l \neq j}}^{k} P(e^-_X(Y_l) \mid x) = \frac{\lambda(x)}{\lambda_{Y_j}(x)}, \quad \text{from (3.2)}.$$
Hence
$$\pi_{Y_j}(x) = \alpha\, \frac{\lambda(x)\pi(x)}{\lambda_{Y_j}(x)} = \alpha'\, \frac{\mathrm{BEL}(x)}{\lambda_{Y_j}(x)}. \qquad (3.7)$$
To summarise, for every iteration of the belief propagation mechanism, each node performs the following three tasks:
1. Node $X$ uses the messages received from its neighbours to calculate $\lambda(X)$ and $\pi(X)$, and then updates its belief distribution, $\mathrm{BEL}(X)$.
2. $X$ uses the information obtained from its children to send an updated message to its parent.
3. $X$ uses the information obtained from its parent and children other than $Y_j$ to send a message to child $Y_j$, for $j = 1, \ldots, k$.
To carry out this procedure at node $X$, the only information required is $\pi_X(V)$, $\lambda_{Y_j}(x)$ (for all $j$) and the fixed matrix of conditional probabilities, $M_{X|V}$.
For trees, these rules guarantee that equilibrium will be reached in a time proportional to the longest path in the network and that, at equilibrium, each node will have a belief distribution equal to its posterior distribution given all the available evidence [30].
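To make the update rules concrete, the following Python sketch (the function name and data layout are my own, not from the thesis) performs one iteration at a single node of a tree: it combines the incoming messages into $\lambda(X)$ and $\pi(X)$, normalises $\mathrm{BEL}(X)$, and produces the outgoing messages of (3.6) and (3.7). It assumes all incoming child messages are strictly positive.

```python
import numpy as np

def node_iteration(M_X_given_V, pi_from_parent, lambdas_from_children):
    """One belief-propagation iteration at a tree node X.

    M_X_given_V           -- matrix with [i, j] = P(X = x_j | V = v_i)
    pi_from_parent        -- pi_X(V), the message received from the parent
    lambdas_from_children -- list of lambda_{Y_j}(x) messages from the children
    """
    # lambda(x) = product of the children's messages, eq. (3.2)
    lam = np.ones(M_X_given_V.shape[1])
    for lam_child in lambdas_from_children:
        lam = lam * lam_child

    # pi(x) = sum_v P(x | v) pi_X(v), eqs. (3.4)/(3.5)
    pi = M_X_given_V.T @ pi_from_parent

    # BEL(x) = alpha * lambda(x) * pi(x), eq. (3.1)
    bel = lam * pi
    bel = bel / bel.sum()

    # message to the parent, eq. (3.6): lambda_X(V) = M_{X|V} lambda(X)
    msg_to_parent = M_X_given_V @ lam

    # message to each child, eq. (3.7): pi_{Y_j}(x) proportional to BEL(x) / lambda_{Y_j}(x)
    msgs_to_children = []
    for lam_child in lambdas_from_children:
        m = bel / lam_child
        msgs_to_children.append(m / m.sum())

    return bel, msg_to_parent, msgs_to_children
```

Running this routine at every node, in any order, implements one iteration of the scheme summarised above.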
We will now demonstrate how this is carried out through a simple example. Suppose there are two candidates standing for election, one from each of the two major parties. They each have very different defence policies, and whether a citizen is for or against the defence policy of a candidate is believed to be an indicator of for whom they will vote. This can be modelled by the Bayesian Network shown in Figure 3.3 a). Here the variable Defence Policy has states 1 and 2, indicating preference for candidate 1 or candidate 2's defence policy respectively. Vote also has states 1 and 2, indicating for whom a citizen will vote. Based on knowledge obtained from pre-election polls, we have $P(D) = (0.3, 0.7)^T$ and

P(V | D)    1      2
   1        0.95   0.05
   2        0.15   0.85

and so $M_{V|D} = \begin{pmatrix} 0.95 & 0.05 \\ 0.15 & 0.85 \end{pmatrix}$.

As $D$ is a root node and $V$ an uninstantiated leaf node, we initialise $\pi(D) = (0.3, 0.7)^T$ and $\lambda(V) = (1, 1)^T = \lambda_V(D)$. It is not important that the entries in $\lambda(\cdot)$ and $\pi(\cdot)$ sum to 1, as the appropriate scaling is done when normalising $\mathrm{BEL}(\cdot)$.
[Figure 3.3: A Bayesian Network model to predict a citizen's vote in an election. Panel a) contains Defence Policy (D) with child Vote (V); panel b) adds Rally (R) as a second child of D; panel c) adds Party Member (M) as a second parent of V.]
We first update the belief distributions at the nodes:
$$\mathrm{BEL}(d) = \lambda(d)\pi(d) = \pi(d), \quad d = 1, 2 \;\Longrightarrow\; \mathrm{BEL}(D) = (0.3, 0.7)^T.$$
To calculate $\mathrm{BEL}(V)$ we need
$$\pi(v) = \sum_d P(v \mid d)\, \pi_V(d) = \sum_d P(v \mid d)\, \pi(d), \quad \text{from (3.4)}.$$
That is,
$$\pi(V) = M_{V|D}^T\, \pi(D) = (0.39, 0.61)^T.$$
Hence
$$\mathrm{BEL}(v) = \lambda(v)\pi(v), \quad v = 1, 2 \;\Longrightarrow\; \mathrm{BEL}(V) = \pi(V) = (0.39, 0.61)^T.$$
This gives the belief distributions based on the available information. Suppose now that a rally was held to allow people to demonstrate their opposition to the defence policy of candidate 1. The variable Rally can be added to our model as in Figure 3.3 b), with states 0 and 1 to indicate absence or presence at the rally respectively. Because the state of D is not known, knowing whether a citizen attended the rally gives us some information about for whom they may vote.

Consider the case of citizen 1, who was suspected to have attended the rally. We hence instantiate Rally to $(0.2, 0.8)^T$, so that $\lambda(R) = (0.2, 0.8)^T$, as we are 80% certain that the citizen attended the rally. We have the conditional probability table

P(R | D)    0     1
   1        1     0
   2        0.8   0.2

which represents the belief that only 20% of the people supporting candidate 2's defence policy will rally, and that no-one who supports candidate 1's policy will rally against it. The message R sends to D is
$$\lambda_R(d) = \sum_r \lambda(r)\, P(r \mid d), \quad d = 1, 2 \;\Longrightarrow\; \lambda_R(D) = M_{R|D}\, \lambda(R) = (0.2, 0.32)^T.$$
The evidence D receives from its two children is then combined to give
$$\lambda(d) = \lambda_R(d)\, \lambda_V(d), \quad d = 1, 2 \;\Longrightarrow\; \lambda(D) = (0.2, 0.32)^T,$$
and hence
$$\mathrm{BEL}(d) = \alpha\, \lambda(d)\pi(d), \quad d = 1, 2 \;\Longrightarrow\; \mathrm{BEL}(D) = \alpha\,(0.06, 0.224)^T = (0.211, 0.789)^T.$$
D must now send out a message to V, which is, from (3.7),
$$\pi_V(d) = \alpha\, \frac{\mathrm{BEL}(d)}{\lambda_V(d)} = \alpha\, \pi(d)\, \lambda_R(d) \qquad (3.8)$$
$$\Longrightarrow\; \pi_V(D) = (0.211, 0.789)^T. \qquad (3.9)$$
Notice that the form of (3.8) ensures that the information D received from V is not included in the message sent back from D to V. Now
$$\pi(V) = M_{V|D}^T\, \pi_V(D) = (0.319, 0.681)^T, \quad \text{from (3.5)}.$$
Because V is a leaf node, $\lambda(V) = P(e^-_V \mid V)$ does not need to be updated and remains as initialised. Hence
$$\mathrm{BEL}(V) = \alpha\, \pi(V)\lambda(V) = \alpha\, \pi(V) = (0.319, 0.681)^T.$$
After having taken into account the information about the likelihood of citizen 1's presence at the rally, we have increased our belief that they will vote for candidate 2. We now believe it is more than twice as likely that they will vote for candidate 2 than for candidate 1.
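The numbers in this example can be reproduced with a few lines of NumPy. The snippet below is an illustrative check (variable names are my own), propagating the rally evidence through the two-child network exactly as in equations (3.4)-(3.9).

```python
import numpy as np

# conditional probability tables from the example
M_V_given_D = np.array([[0.95, 0.05],   # row d=1: P(V=1|d), P(V=2|d)
                        [0.15, 0.85]])  # row d=2
M_R_given_D = np.array([[1.0, 0.0],     # row d=1: P(R=0|d), P(R=1|d)
                        [0.8, 0.2]])    # row d=2

pi_D     = np.array([0.3, 0.7])   # prior P(D)
lambda_V = np.array([1.0, 1.0])   # V is an uninstantiated leaf
lambda_R = np.array([0.2, 0.8])   # soft evidence: 80% sure the citizen rallied

# messages from the children to D, eq. (3.6)
lam_R_to_D = M_R_given_D @ lambda_R     # (0.2, 0.32)
lam_V_to_D = M_V_given_D @ lambda_V     # (1.0, 1.0)

# belief at D, eq. (3.1)
bel_D = pi_D * lam_R_to_D * lam_V_to_D
bel_D /= bel_D.sum()                    # (0.211, 0.789)

# message from D to V, eq. (3.8), and the updated belief at V
pi_D_to_V = bel_D / lam_V_to_D
pi_D_to_V /= pi_D_to_V.sum()
bel_V = lambda_V * (M_V_given_D.T @ pi_D_to_V)
bel_V /= bel_V.sum()

print(bel_D.round(3), bel_V.round(3))   # [0.211 0.789] [0.319 0.681]
```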
3.3.2 Propagation in Polytrees

Recall that a polytree is a singly connected network. That is, a node may have several parents and children, but the network may not contain loops. Information propagation in polytrees is similar to that in trees except that, as a node X may have more than one parent, the information obtained from the parents of X must be combined in order to update the belief distribution. To send out a message to each parent, X must combine the information introduced from its children and its other parents. In a polytree, as no parents of any given node have a common ancestor, this implies that the information introduced from one parent is independent of the information obtained from any other parent.
To update $\mathrm{BEL}(X)$ we need $\pi(X)$ and $\lambda(X)$. We can write
$$\pi(x) = P(x \mid e^+_X) = P(x \mid e^+_X(V_1), \ldots, e^+_X(V_m)),$$
where $e^+_X(V_l)$ is the evidence X obtains from the link with parent $V_l$. Conditioning on $\mathbf{V} = \{V_1, \ldots, V_m\}$ and using the fact that $P(x \mid v_1, \ldots, v_m, e^+_X) = P(x \mid v_1, \ldots, v_m)$ then gives
$$\begin{aligned} \pi(x) &= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m)\, P(v_1, \ldots, v_m \mid e^+_X(V_1), \ldots, e^+_X(V_m)) \\ &= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m) \prod_{l=1}^m P(v_l \mid e^+_X(V_1), \ldots, e^+_X(V_m)) \\ &= \sum_{v_1, \ldots, v_m} P(x \mid v_1, \ldots, v_m) \prod_{l=1}^m P(v_l \mid e^+_X(V_l)). \qquad (3.10) \end{aligned}$$
The final line follows because we don't know the state of X and, as the connection at X from its parents is converging, the state of a parent is independent of the other parents. If we let
$$\pi_X(v_l) = P(v_l \mid e^+_X(V_l))$$
and $\mathbf{v} = (v_1, v_2, \ldots, v_m)$, then we can write (3.10) as
$$\pi(x) = \sum_{\mathbf{v}} P(x \mid \mathbf{v}) \prod_{l=1}^m \pi_X(v_l). \qquad (3.11)$$
Note that the weight given to the evidence $\prod_{l=1}^m \pi_X(v_l)$ for some $\mathbf{v}$ is proportional to the probability that $X = x$, given that configuration of its parents.

As $\lambda(x)$ is independent of information from the parents of X, this remains as for trees, that is,
$$\lambda(x) = \prod_{j=1}^k \lambda_{Y_j}(x).$$
Hence
$$\mathrm{BEL}(x) = \alpha\, \lambda(x)\pi(x) = \alpha \prod_{j=1}^k \lambda_{Y_j}(x) \sum_{\mathbf{v}} P(x \mid \mathbf{v}) \prod_{l=1}^m \pi_X(v_l). \qquad (3.12)$$
To calculate the messages X sends to its neighbours, the evidence received from all neighbours is first combined and then redistributed in proportion to the evidential weight on each link. To calculate $\lambda_X(v_l)$, the message X sends to its $l$th parent, consider all other parents as a single information source $\mathbf{V}^{(l)} = \mathbf{V} \setminus \{V_l\}$. Then the evidence $e^-_{V_l}(X)$ that $V_l$ receives from X is based on the evidence received at X from $\mathbf{V}^{(l)}$,
$$e^+_X(\mathbf{V}^{(l)}) = \bigcup_{k \neq l} e^+_X(V_k),$$
and the evidence X has received from its children, $e^-_X$. Note that $e^+_X(\mathbf{V}^{(l)})$ is independent of $e^-_X$. Hence the support for $V_l$, given the evidence at X, is given by
$$\lambda_X(v_l) = P(e^+_X(\mathbf{V}^{(l)}), e^-_X \mid v_l) = \sum_x \sum_{\mathbf{v}^{(l)}} P(e^+_X(\mathbf{V}^{(l)}), e^-_X \mid v_l, \mathbf{v}^{(l)}, x)\, P(\mathbf{v}^{(l)}, x \mid v_l),$$
where the double summation is over all states $x$ of X and configurations $\mathbf{v}^{(l)}$ of the variables in $\mathbf{V}^{(l)}$. As $e^+_X(\mathbf{V}^{(l)})$ and $e^-_X$ are independent, and X d-separates $\mathbf{V}^{(l)}$ and $V_l$ from its children,
$$\begin{aligned} \lambda_X(v_l) &= \sum_x \sum_{\mathbf{v}^{(l)}} P(e^+_X(\mathbf{V}^{(l)}) \mid v_l, \mathbf{v}^{(l)}, x)\, P(e^-_X \mid v_l, \mathbf{v}^{(l)}, x)\, P(\mathbf{v}^{(l)}, x \mid v_l) \\ &= \sum_x \sum_{\mathbf{v}^{(l)}} P(e^+_X(\mathbf{V}^{(l)}) \mid \mathbf{v}^{(l)})\, P(e^-_X \mid x)\, P(\mathbf{v}^{(l)}, x \mid v_l) \\ &= \sum_x \sum_{\mathbf{v}^{(l)}} \frac{P(\mathbf{v}^{(l)} \mid e^+_X(\mathbf{V}^{(l)}))}{P(\mathbf{v}^{(l)})}\, P(e^+_X(\mathbf{V}^{(l)}))\, P(e^-_X \mid x)\, P(x \mid \mathbf{v}^{(l)}, v_l)\, P(\mathbf{v}^{(l)} \mid v_l), \end{aligned}$$
by Bayes' Rule. Given that $\mathbf{V}^{(l)}$ is independent of $V_l$, we have that $P(\mathbf{v}^{(l)} \mid v_l) = P(\mathbf{v}^{(l)})$. As $\mathbf{v} = \mathbf{v}^{(l)} \cup v_l$,
$$\lambda_X(v_l) = \alpha \sum_x \sum_{\mathbf{v}} P(\mathbf{v}^{(l)} \mid e^+_X(\mathbf{V}^{(l)}))\, P(e^-_X \mid x)\, P(x \mid \mathbf{v})\, \mathbb{I}(V_l[\mathbf{v}] = v_l),$$
where $\alpha = P(e^+_X(\mathbf{V}^{(l)}))$ is a normalisation constant. Substituting $\lambda(x)$ for $P(e^-_X \mid x)$ and $\pi_X(v_k)$ for $P(v_k \mid e^+_X(V_k))$ we obtain
$$\lambda_X(v_l) = \alpha \sum_x \sum_{\mathbf{v}} \Big\{ \prod_{k \neq l} \pi_X(v_k) \Big\}\, \lambda(x)\, P(x \mid \mathbf{v})\, \mathbb{I}(V_l[\mathbf{v}] = v_l).$$

When X sends a message to child $Y_j$ this should be based on all information obtained at the previous iteration besides that received from $Y_j$. Thus the predictive support child $Y_j$ receives from its parent, $\pi_{Y_j}(x)$, is equal to $P(x \mid e_X \setminus e_X(Y_j))$, where $e_X \setminus e_X(Y_j)$ is the evidence X has received from each of its parents and children excluding $Y_j$. As $\mathrm{BEL}(X)$ is the belief at X after evidence from all neighbours has been considered, and the evidence received from each of the children of X is independent, we can write
$$\pi_{Y_j}(x) = \alpha\, \frac{\mathrm{BEL}(x)}{\lambda_{Y_j}(x)} = \alpha\, \mathrm{BEL}(x)\big|_{\lambda_{Y_j}(x) = 1}.$$
To demonstrate this procedure, consider again the example from Section 3.3.1. Another factor influencing for whom one will vote is a citizen's commitment to a particular party. One way to model this factor is to add the variable Party Member as a parent of Vote, as in Figure 3.3 c). This variable has states 0, 1 and 2, indicating no party membership and membership of the party of candidate 1 or 2 respectively. As V now has two parents this model is no longer a tree, though it is a polytree. We have the conditional probability table

P(V | D, M)   1     2
   1, 0       0.8   0.2
   1, 1       1     0
   1, 2       0.3   0.7
   2, 0       0.1   0.9
   2, 1       0.6   0.4
   2, 2       0     1

Suppose now that citizen 1 is a member of party 2. Hence we instantiate M to 2 and wish to update our beliefs about for whom they will vote. We have
$$\pi_V(M) = \pi(M) = (0, 0, 1)^T$$
and $\pi_V(D) = (0.211, 0.789)^T$, from (3.9). Hence
$$\pi(v) = \sum_{d, m} P(v \mid d, m)\, \pi_V(m)\, \pi_V(d), \quad \text{from (3.11)}$$
$$\Longrightarrow\; \pi(V) = M_{V|D,M}^T\, \pi_V^{M,D},$$
where $\pi_V^{M,D}$ is the $6 \times 1$ vector with entries $\pi_V(m)\pi_V(d)$, one entry for each configuration $(d, m)$ of the parents. That is,
$$\pi(V) = (0.063, 0.937)^T.$$
Hence
$$\mathrm{BEL}(V) = \alpha\, \pi(V)\lambda(V) = (0.063, 0.937)^T.$$
We are now very certain that citizen 1 will vote for candidate 2. Note that, because of the converging connection at Vote, the information introduced at Party Member will have no effect on the distribution of Defence Policy or Rally while the state of Vote is unknown.

In summary, to update $\mathrm{BEL}(X)$ we require $\pi_X(V_l),\ l = 1, \ldots, m$, $\lambda_{Y_j}(x),\ j = 1, \ldots, k$, and the conditional probabilities $P(X \mid V_1, \ldots, V_m)$. X then calculates and sends a message to each child and parent. Again, use of this procedure guarantees convergence in a time proportional to the longest path in the network. However, in polytrees the presence of multiple parents adds a further degree of complexity to the necessary calculations. The summation required in (3.12) is over all combinations of states of the parent variables. If the number of parents is large or they have many states, this summation can become intractable. In the next section we present a model which, under several simplifying assumptions, results in a closed form expression for $\mathrm{BEL}(X)$, which we derive.
3.3.2.1 The noisy OR-gate model for polytrees

In order to simplify the exposition of the noisy OR-gate model, assume that the parents of a variable are its direct causes. We will also assume that each variable or cause has two states: either the cause is present or the cause is absent.

This model is based on two assumptions. The first assumption of the model specifies that an event will not occur if none of its causes are present. If a cause is present then the event may or may not occur. For example, a cause of Broken Window may be Hit by Ball. If this were the only parent of Broken Window we would be assuming that, were the window to break, it would be because it was hit by a ball. If the window was hit by a ball it may or may not break. In general we refer to the factors that may prevent an event such as breakage occurring, given that a cause is present, as inhibiting factors.

The second assumption of the model is that the inhibiting factors of each variable are independent of the inhibiting factors of other variables. We can thus think of a variable X as having several parents or causes. Only if there is at least one cause present whose inhibitor is not acting will event X occur. This is shown schematically in Figure 3.4. If we let $q_l$ be the probability that the $l$th inhibitor is active, and denote by 1 and 0 the states present and absent respectively, then
$$P(X = 1 \mid V_l = 1,\ V_k = 0\ \forall\, k \neq l) = 1 - q_l.$$
Note that if $q_l = 0$ then the noise component is removed and an event will always occur if a cause is acting.

Let $c(\mathrm{pa}(X))$ index the set of all causes of X which are present, when the parents of X are in configuration $\mathrm{pa}(X)$. That is, $c(\mathrm{pa}(X)) = \{k : v_k = 1,\ v_k \in \mathrm{pa}(X)\}$. Then $X = 0$
[Figure 3.4: A graphical representation of the Noisy OR-Gate: each cause $V_l$ is combined with its inhibitor $I_l$ through an AND gate, and the gate outputs feed an OR gate producing X. Only if a cause is present and its inhibitor is not acting will the event X occur.]
only if the inhibitors of $V_k$ for all $k \in c(\mathrm{pa}(X))$ are acting. As, by the second assumption of the model, the inhibitors are independent,
$$P(X = 0 \mid \mathrm{pa}(X)) = \prod_{l \in c(\mathrm{pa}(X))} q_l \quad \text{and} \quad P(X = 1 \mid \mathrm{pa}(X)) = 1 - \prod_{l \in c(\mathrm{pa}(X))} q_l.$$
Note that these expressions define the conditional probability table for node X, which is subject to the constraints of the model.

We now use these assumptions to derive an expression for $\mathrm{BEL}(X)$. Recall, from (3.1) and (3.11), that
$$\mathrm{BEL}(X = x) = \alpha\, \lambda(x)\pi(x) = \alpha\, \lambda(x) \sum_{\mathrm{pa}(X)} P(X = x \mid \mathrm{pa}(X)) \prod_l \pi_X(v_l), \qquad (3.13)$$
where $\sum_{\mathrm{pa}(X)}$ represents summation over all possible configurations of the parents of X, and $\pi_X(v_l) = P(v_l \mid e^+_X)$. If we denote the final part of this expression by
$$K^x_{\mathrm{Pa}(X)} = \sum_{\mathrm{pa}(X)} P(X = x \mid \mathrm{pa}(X)) \prod_{l=1}^m \pi_X(v_l),$$
then
$$\begin{aligned} K^0_{\mathrm{Pa}(X)} &= \sum_{\mathrm{pa}(X)} \Big\{ \prod_{l \in c(\mathrm{pa}(X))} q_l \Big\} \prod_{l=1}^m \pi_X(v_l) \\ &= \sum_{\mathrm{pa}(X)} \Big\{ \prod_{l \in c(\mathrm{pa}(X))} q_l \Big\} \prod_{l \in c(\mathrm{pa}(X))} \pi_X(v_l) \prod_{l \notin c(\mathrm{pa}(X))} \pi_X(v_l) \\ &= \sum_{\mathrm{pa}(X)} \Big\{ \prod_{l \in c(\mathrm{pa}(X))} q_l\, \pi_X(v_l) \Big\} \prod_{l \notin c(\mathrm{pa}(X))} \pi_X(v_l). \end{aligned}$$
Summing over the states $V_i = 0$ and $V_i = 1$ of an element $V_i$ of $\mathrm{Pa}(X)$ gives
$$\begin{aligned} K^0_{\mathrm{Pa}(X)} &= \sum_{v_i} \sum_{\substack{\mathrm{pa}(X):\\ V_i = v_i}} \Big\{ \prod_{l \in c(\mathrm{pa}(X))} q_l\, \pi_X(v_l) \Big\} \prod_{l \notin c(\mathrm{pa}(X))} \pi_X(v_l) \\ &= \sum_{\substack{\mathrm{pa}(X):\\ V_i = 0}} \Big\{ \prod_{l \in c(\mathrm{pa}(X))} q_l\, \pi_X(v_l) \Big\}\, \pi_X(V_i = 0) \prod_{\substack{l \notin c(\mathrm{pa}(X)),\\ l \neq i}} \pi_X(v_l) \\ &\quad + \sum_{\substack{\mathrm{pa}(X):\\ V_i = 1}} q_i\, \pi_X(V_i = 1) \Big( \prod_{\substack{l \in c(\mathrm{pa}(X)),\\ l \neq i}} q_l\, \pi_X(v_l) \Big) \prod_{l \notin c(\mathrm{pa}(X))} \pi_X(v_l). \end{aligned}$$
Note that if $V_i = 0$, then $i \notin c(\mathrm{pa}(X))$, and if $V_i = 1$, then $i \in c(\mathrm{pa}(X))$. Denoting the set $\mathrm{Pa}(X) \setminus \{V_i\}$ by $\mathrm{Pa}_i(X)$, we can hence write
$$\begin{aligned} K^0_{\mathrm{Pa}(X)} &= \big(\pi_X(V_i = 0) + q_i\, \pi_X(V_i = 1)\big) \sum_{\mathrm{pa}_i(X)} \Big( \prod_{l \in c(\mathrm{pa}_i(X))} q_l\, \pi_X(v_l) \Big) \prod_{l \notin c(\mathrm{pa}_i(X))} \pi_X(v_l) \\ &= \big(\pi_X(V_i = 0) + q_i\, \pi_X(V_i = 1)\big)\, K^0_{\mathrm{Pa}_i(X)}. \end{aligned}$$
Applying this recursively, and substituting $1 - \pi_X(V_l = 1)$ for $\pi_X(V_l = 0) = P(V_l = 0 \mid e^+_X)$, leads to
$$\begin{aligned} K^0_{\mathrm{Pa}(X)} &= \prod_{l=1}^m \big[\pi_X(V_l = 0) + q_l\, \pi_X(V_l = 1)\big] \\ &= \prod_{l=1}^m \big[1 + (q_l - 1)\, \pi_X(V_l = 1)\big] \qquad (3.14) \\ &= \prod_{l=1}^m \big[1 - \pi_X(V_l = 1)(1 - q_l)\big]. \qquad (3.15) \end{aligned}$$
To find an expression for $K^1_{\mathrm{Pa}(X)}$, note that
$$\begin{aligned} K^1_{\mathrm{Pa}(X)} &= \sum_{\mathrm{pa}(X)} \Big(1 - \prod_{l \in c(\mathrm{pa}(X))} q_l \Big) \prod_{l=1}^m \pi_X(v_l) \\ &= \sum_{\mathrm{pa}(X)} \prod_{l=1}^m \pi_X(v_l) \; - \; K^0_{\mathrm{Pa}(X)} \\ &= \sum_{\mathrm{pa}(X)} P(\mathrm{pa}(X) \mid e^+) \; - \; K^0_{\mathrm{Pa}(X)} \\ &= 1 - K^0_{\mathrm{Pa}(X)}. \qquad (3.16) \end{aligned}$$
Hence, on substitution of (3.15) and (3.16) into (3.13), with $\lambda(X) = (\lambda(0), \lambda(1))^T$, we obtain
$$\begin{aligned} \mathrm{BEL}(X = 0) &= \alpha\, \lambda(0) \prod_{l=1}^m \big[1 - \pi_X(V_l = 1)(1 - q_l)\big] \\ \mathrm{BEL}(X = 1) &= \alpha\, \lambda(1) \Big( 1 - \prod_{l=1}^m \big[1 - \pi_X(V_l = 1)(1 - q_l)\big] \Big). \end{aligned}$$
The evaluation of this expression involves the calculation of a product which increases only linearly with the number of parents of a node, and hence avoids the problem of having to sum over all configurations of the parents as in (3.12). However, the assumptions of the model, most significantly that the inhibitors of the causes of an event are independent, are quite restrictive. The effect this may have on the belief distributions should be weighed up against the computational advantages.
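The closed form above translates directly into code. The sketch below (an illustrative helper with my own function name) computes $\mathrm{BEL}(X)$ for a binary noisy OR-gate node from the inhibitor probabilities $q_l$ and the parent messages $\pi_X(V_l = 1)$, using a product that is linear in the number of parents.

```python
import numpy as np

def noisy_or_belief(q, pi_parent_present, lam=(1.0, 1.0)):
    """Belief at a binary noisy OR-gate node X.

    q                 -- q[l] = probability that the l-th inhibitor is active
    pi_parent_present -- pi_X(V_l = 1) for each parent V_l
    lam               -- (lambda(0), lambda(1)), evidence from X's children
    """
    q = np.asarray(q, dtype=float)
    pi1 = np.asarray(pi_parent_present, dtype=float)

    # K^0 = prod_l [1 - pi_X(V_l = 1)(1 - q_l)], eq. (3.15)
    k0 = np.prod(1.0 - pi1 * (1.0 - q))

    # BEL(X=0) proportional to lambda(0) K^0, BEL(X=1) to lambda(1) (1 - K^0), eq. (3.16)
    bel = np.array([lam[0] * k0, lam[1] * (1.0 - k0)])
    return bel / bel.sum()

# e.g. three causes, each certainly present, with inhibitor probabilities 0.4, 0.5, 0.9
print(noisy_or_belief(q=[0.4, 0.5, 0.9], pi_parent_present=[1.0, 1.0, 1.0]))
# -> P(X=0) = 0.4 * 0.5 * 0.9 = 0.18 and P(X=1) = 0.82
```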
3.3.3 Networks containing Loops

In a tree, the subtrees rooted at a node are independent. Additionally, in a polytree the information received from a parent is independent of the information received from any other parent. These independence relations allow messages to be sent out and received messages combined in such a way that no evidence is counted twice. If the message propagation scheme used for trees and polytrees were used in a network containing loops, messages may circulate indefinitely around the loops and never converge to a stable equilibrium. The independence assumptions are no longer valid.

There are two methods commonly used for message propagation in multiply connected networks: clustering and conditioning. Each of these methods, discussed in Sections 3.3.3.1 and 3.3.3.2 respectively, makes use of the independence properties of the network to enable the use of local computations. However, in highly connected networks these methods can become intractable and so approximate methods are required. In this case simulation is often used, and in Section 3.3.3.3 we discuss how to apply simulation in the context of Bayesian Networks.
3.3.3.1 Clustering

Clustering, as its name suggests, involves grouping nodes into clusters so as to form a tree. The propagation method introduced for trees can be used on the clustered nodes, and the updated beliefs then distributed back to the original variables. The tree formed as a result of clustering is termed a join tree. In order to describe the formation of a join tree we first give some definitions.

A set of nodes is complete if all nodes are pairwise linked, that is, each node is linked to every other node. A complete set is called a clique if it is not a subset of another complete set. Note that a graph may have many cliques, which do not have to be disjoint. Additionally, we say an undirected graph is chordal if every cycle of length four or more has at least one chord, that is, an edge joining two nonconsecutive nodes.

We can form a join tree by the following method (a code sketch of the first step is given after the list):
1. Connect all parents that share a common child and remove the arrows from the links. The resulting graph G is called the Markov Network of the original Bayesian Network.
2. Form a chordal graph G' from G. This can be done by the graph triangulation algorithm, as given in [30].
3. Identify all cliques in G'.
4. Order the cliques $C_1, \ldots, C_t$ of G' by rank of the highest order node in each clique, then connect each $C_i$ to a predecessor $C_j$ ($j < i$) sharing the highest number of nodes with $C_i$.
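As an illustration of step 1, the following sketch (my own minimal representation of a DAG as a parent dictionary) builds the Markov Network by "moralising" the graph: linking co-parents and dropping edge directions. Triangulation and clique identification (steps 2 and 3) are omitted, and the example parent sets are one plausible reading of Figure 3.6 a), consistent with the cliques listed later in this section.

```python
from itertools import combinations

def moralise(parents):
    """Return the undirected edge set of the Markov Network of a DAG.

    parents -- dict mapping each node to the list of its parents,
               e.g. {"X1": [], "X2": ["X1"]}
    """
    edges = set()
    for child, pa in parents.items():
        # keep every directed link, now undirected
        for p in pa:
            edges.add(frozenset((p, child)))
        # "marry" all pairs of parents that share this child
        for p, q in combinations(pa, 2):
            edges.add(frozenset((p, q)))
    return edges

# a reading of Figure 3.6 a): X2 and X3 share child X5, X4 and X5 share child X6
parents = {"X1": [], "X2": ["X1"], "X3": ["X1"],
           "X4": ["X1"], "X5": ["X2", "X3"], "X6": ["X4", "X5"]}
print(sorted(tuple(sorted(e)) for e in moralise(parents)))
```

Running this prints the original links plus the marrying links X2-X3 and X4-X5, which is exactly what step 1 produces for that network.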
In order to understand why this algorithm works, consider first the formation of the Markov Network in Step 1. At a converging connection, as in Figure 3.5 a), if we know the state of X2 then X1 and X3 are not independent. If the arrows are removed from the links to form the undirected graph given in Figure 3.5 b), then knowing the state of X2 renders X1 and X3 independent. When forming the Markov Network of a directed graph D we would add a link between the parents X1 and X3, as in Figure 3.5 c), which ensures there are no additional independence relations in the undirected graph. Hence it can be seen that a Markov Network G is an I-map of the corresponding directed graph D.

Since an I-map implies that any independencies represented in the graph are present in P, by adding extra links, which may lessen the number of independencies represented by G, G remains an I-map (though not a minimal I-map). Hence the chordal graph formed in step 2 is still an I-map of P. If G is an I-map of P and is chordal, then P is said to be decomposable relative to G [30]. In addition, if P is decomposable relative to G, then any join tree T of the cliques of G is an I-map relative to P. That is, if $\langle C_1 \mid C_2 \mid C_3 \rangle_T$, then the variables in $C_1$ are independent of the variables in $C_3$ given the variables in $C_2$. A proof of this statement is given in [30]. Hence the join tree is an I-map of the original distribution and we are justified in using the propagation algorithm introduced for trees.

[Figure 3.5: The Markov Network of the directed network in a) is formed by removing the arrows on the links and adding a link between nodes which had a common child.]
As an example, consider the network in Figure 3.6 a). The corresponding chordal graph is shown in Figure 3.6 b), where the links between X2 and X3 and between X5 and X4 were added in step 1 of the algorithm, and the link between X1 and X5 in step 2. There are 3 cliques in this graph, namely $C_1 = \{X_1, X_2, X_3, X_5\}$, $C_2 = \{X_1, X_4, X_5\}$ and $C_3 = \{X_4, X_5, X_6\}$. The highest order node $X_5$ is in each clique. Hence we can join $C_1$ to $C_2$ (as $C_1$ and $C_2$ share 2 nodes), and $C_2$ to $C_3$, to obtain the join tree shown in Figure 3.6 c).

In general, the join tree is not unique. Additionally, if we wished to introduce evidence at a node, we would not instantiate a node of the join tree. Instead a dummy node is added as a parent to the clique which contains the evidence node, and evidence is entered at this node.

In the above example,
$$P(C_1 \mid C_2) = P(X_1, X_2, X_3, X_5 \mid X_1, X_4, X_5) = P(X_2, X_3 \mid X_1, X_4, X_5) = P(X_2, X_3 \mid X_1, X_5),$$
as $X_2$ and $X_3$ are d-separated from $X_4$ by $X_1$ and $X_5$. In general it can be seen that the dependence relationships between two cliques $C_i$ and $C_j$ can be computed from the conditional probability of the variables unique to $C_i$ conditioned on the variables $C_i$ shares with $C_j$.

The number of states for the clustered variables increases exponentially with the number of nodes in a cluster, and the computational complexity increases accordingly. In addition, the more highly connected the network, the more nodes each clique will have in common with others, and so the average number of nodes per clique increases.
[Figure 3.6: The formation of a join tree c) from the Bayesian Network in a), via the chordal graph in b). This allows the use of propagation methods for trees in what was originally a multiply connected network.]
3.3.3.2 Conditioning

This method makes use of the d-separation criteria of a network, which can be used to block the flow of evidence along paths in such a way as to render the network singly connected. The propagation schemes for trees or polytrees can then be used.

Consider the most simple loop, as shown in Figure 3.7.

[Figure 3.7: A simple Bayesian Network containing a loop: X2 is a parent of both X1 and X3, which are in turn both parents of X4.]

If we instantiate X2, that is, fix X2 to some state $x_2$, then X1 and X3 are d-separated and so information cannot `flow' around the loop. Hence at X4 we can assume the information received from X1 is independent of that received from X3 and so can use the polytree algorithm.
In general we require a set of variables $\mathbf{W}_I$ to be instantiated in order to enable this algorithm to be used. We then update our beliefs at node X,
$$\mathrm{BEL}(X = x) = P(x \mid e) = \sum_{w_I} P(x \mid w_I, e)\, P(w_I \mid e),$$
where the summation is over all possible configurations of the variables in $\mathbf{W}_I$. This can be seen to be a weighted sum of the probability that $X = x$ when the variables in $\mathbf{W}_I$ are instantiated to each possible configuration, the weights being given by $P(w_I \mid e)$.

As instantiating the variables in $\mathbf{W}_I$ renders the network singly connected, the term $P(x \mid w_I, e)$ for each $w_I$ can be computed by the algorithm for polytrees. An application of Bayes' Rule allows us to calculate the weights
$$P(w_I \mid e) = \alpha\, P(e \mid w_I)\, P(w_I),$$
where $P(e \mid w_I)$ can also be computed using the polytree algorithm, by instantiating the variables in $\mathbf{W}_I$ to $w_I$ and updating the distribution at the evidence nodes. We hence require two applications of the polytree algorithm for each configuration of the variables in $\mathbf{W}_I$.

The required storage and computation time for this method grows exponentially with the number of nodes required to be instantiated, as the propagation algorithm must be repeated for every combination of the states of these variables. Hence this method also is not tractable in large or highly connected networks.
3.3.3.3 Stochastic Simulation

Simulation is typically used in networks too large for the efficient use of the above methods. Exact methods applied to large or densely connected networks require a prohibitive amount of either memory or computation, and are not feasible. As mentioned previously, exact inference in Bayesian Networks can become intractable, and is in fact NP-hard [6]. With the use of simulation, estimates of belief are obtained from observing the frequency with which a configuration of the variables occurs in a series of runs. Although approximate inference in Bayesian Networks has also been shown to be NP-hard [9], in some networks simulation is the only method that can be used to get a result at all.

As the predefined conditional probabilities give the probability that a variable will be in a particular state given the states of its parents, a simulation can proceed as follows:
1. Set m to the desired number of iterations.
2. Select a state for each root node by sampling from the probability distribution $P(X_i)$.
3. For each node $X_i$ in the network for which the configuration of $\mathrm{Pa}(X_i)$ is known, select a state for $X_i$ by sampling from the probability distribution $P(X_i \mid \mathrm{pa}(X_i))$. Continue until all variables are instantiated, and record the final states of the variables.
4. Repeat steps 2 and 3 m times.
If $N_{ik}$ denotes the number of runs for which $X_i$ was in state k, let $\widehat{\mathrm{BEL}}(X_i = k) = \frac{N_{ik}}{m}$ for all variables $X_i$ and states k.

As the state of a variable on any run is determined only by the configuration of its parents, this method avoids the problems associated with loops encountered in the previous methods. That is, indirect dependencies between variables which render the network multiply connected have no effect on the calculation scheme.

The beliefs obtained by use of this method are only approximations. However, as the number of runs m tends to infinity, the distribution $\widehat{\mathrm{BEL}}(X)$ converges in probability to $\mathrm{BEL}(X)$ [30]. Although the number of computations per run is modest, in practice the number of runs m may have to be quite large for the approximation to be adequate. In a large network, the number of configurations of the variables may be very large; for example, a network used by Cheng et al. in [2] has 179 nodes and $10^{61}$ configurations. Then if we take $10^6$ samples, for example, we can sample only $10^{-55}$ of the total sample space.
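The forward-sampling scheme above is straightforward to implement. The sketch below (my own minimal representation: nodes listed in topological order, each with a conditional probability table indexed by its parents' states) estimates $\widehat{\mathrm{BEL}}$ for every node by frequency counts.

```python
import numpy as np

def forward_sample(nodes, parents, cpts, m, rng=np.random.default_rng(0)):
    """Estimate BEL(X_i = k) = N_ik / m by prior (forward) sampling.

    nodes   -- node names in topological order (parents before children)
    parents -- dict: node -> tuple of parent names
    cpts    -- dict: node -> {parent_config_tuple: probability vector over states};
               root nodes use the empty tuple () as their only key
    """
    counts = {n: np.zeros(len(next(iter(cpts[n].values())))) for n in nodes}
    for _ in range(m):
        state = {}
        for n in nodes:
            cfg = tuple(state[p] for p in parents[n])
            probs = cpts[n][cfg]
            state[n] = rng.choice(len(probs), p=probs)
            counts[n][state[n]] += 1
    return {n: counts[n] / m for n in nodes}

# the two-node election model D -> V from Section 3.3.1
nodes = ["D", "V"]
parents = {"D": (), "V": ("D",)}
cpts = {"D": {(): [0.3, 0.7]},
        "V": {(0,): [0.95, 0.05], (1,): [0.15, 0.85]}}
print(forward_sample(nodes, parents, cpts, m=10000))   # BEL(V) approx (0.39, 0.61)
```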
If a root node has some state that occurs with a low probability, then many runs would be required to obtain sufficient realisations of that state to obtain accurate estimates for the distribution of its descendents. Consider the case where a variable Y, say, is instantiated to $Y = y$. Given this information we could determine the distribution of an intermediate variable X, $P(X \mid e)$, by considering the proportion of runs for which X was in state x given that Y was in state y, for all states x of X. However, the proportion of runs for which $Y = y$ could be small. As these are the only runs which can be used to compute the posterior distribution at the other nodes, this process is very inefficient and a large number of runs will be required to obtain sufficient accuracy. For example, if $P(Y = y) = 0.05$ we would expect to be able to count only 5 percent of the runs performed. We now present an alternative method, based on Gibbs sampling, which handles instantiated variables more efficiently.
3.3.3.4 Gibbs Sampling

To begin the simulation, all evidence nodes are set to their observed state, and all other variables to an arbitrary initial state. We then assign some ordering to those nodes which are not evidence nodes. At each stage we update the state of a variable by sampling from its probability distribution conditional on the remaining variables being in their current configuration. That is, we circulate through the variables indefinitely, and at each node X we calculate $P(X \mid \mathbf{U}_{(X)} = u_{(X)})$, where $\mathbf{U}_{(X)}$ is the set of all variables excluding X.

The most computationally demanding step in this process is the evaluation of the distribution $P(X \mid u_{(X)})$. However, this can be simplified by considering the d-separation properties of the network. We know that the parents of X d-separate X from all ancestors for which the only path between these ancestors and X is via $\mathrm{Pa}(X)$. Hence if we know the states of the variables in $\mathrm{Pa}(X)$, knowing the states of the ancestors d-separated from X can give us no further information about the state of X. We can also obtain information about X from its children. These d-separate X from any descendents for which the only path from X to the descendent is through the children of X. However, if a child of X has other parents, there is a converging connection at this child. Hence X is not d-separated from the parents of this child, given that the state of the child is fixed.

The set consisting of those variables which are either a parent of X, a child of X or a parent of a child of X is known as the Markov Blanket of X, $\mathrm{Ma}(X)$. If we know the configuration of the variables in the Markov Blanket of X, knowing the state of any other variable will give us no further information about X; that is, $\mathrm{Ma}(X)$ d-separates X from the rest of the network.
In order to quantify the relationship $P(X \mid u_{(X)})$ we begin with the chain rule. If we let L be the set indexing all variables in $\mathbf{U}_{(X)}$ besides the children of X, $Y_1, Y_2, \ldots, Y_k$, then, by the chain rule for Bayesian Networks,
$$P(u) = P(x, u_{(X)}) = P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j)) \prod_{l \in L} P(x_l \mid \mathrm{pa}(X_l)). \qquad (3.17)$$
As the final product is independent of X we can write (3.17) as
$$P(x, u_{(X)}) = \alpha\, P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j))$$
and hence, as the marginal distribution $P(u_{(X)})$ is also independent of X,
$$P(x \mid u_{(X)}) = \frac{P(x, u_{(X)})}{P(u_{(X)})} = \alpha'\, P(x \mid \mathrm{pa}(X)) \prod_{j=1}^k P(y_j \mid \mathrm{pa}(Y_j)),$$
where $\alpha'$ is a normalising constant. Hence we can evaluate the probability distribution of X conditioned on all the variables in the network given that we know only the configuration of those variables in $\mathrm{Ma}(X)$, that is, $\mathrm{pa}(X)$, $y_j$ and $\mathrm{pa}(Y_j)$, $j = 1, \ldots, k$.

As time progresses, the system generated by this method is guaranteed to reach a steady state [30]. That is, at any stage the probability that the variables are in some configuration u is given by P(u), as defined by the conditional probability tables. This will occur whatever the initial states of the variables, though a period of initialisation is required. That is, the process should be run for a time before recording the configurations of the variables, in order to allow the process to stabilise. However, there is no way of knowing exactly how long this will be, and it is dependent on the starting states of the variables.

This method requires that an additional k conditional probabilities be retrieved to update the distribution at each node compared to the method presented in Section 3.3.3.3. However, as in this case instantiated variables make no difference to the precision of the estimate for a given number of runs, Gibbs Sampling is far more efficient when there are many instantiated variables.
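A single Gibbs update at a node needs only its Markov blanket, as the derivation above shows. The sketch below (helper names are my own, reusing the CPT representation from the earlier forward-sampling sketch) resamples one non-evidence node from $P(X \mid \mathrm{Ma}(X))$.

```python
import numpy as np

def gibbs_update(node, state, parents, children, cpts, rng):
    """Resample `node` from its distribution given its Markov blanket.

    state    -- dict: node -> current state index (evidence nodes are never passed in)
    parents  -- dict: node -> tuple of parent names
    children -- dict: node -> tuple of child names
    cpts     -- dict: node -> {parent_config_tuple: probability vector over states}
    """
    n_states = len(next(iter(cpts[node].values())))
    weights = np.empty(n_states)
    for x in range(n_states):
        state[node] = x
        # P(x | pa(X)) * prod_j P(y_j | pa(Y_j)), up to a constant, as in (3.17)
        w = cpts[node][tuple(state[p] for p in parents[node])][x]
        for child in children[node]:
            cfg = tuple(state[p] for p in parents[child])
            w *= cpts[child][cfg][state[child]]
        weights[x] = w
    probs = weights / weights.sum()
    state[node] = rng.choice(n_states, p=probs)
    return state[node]
```

Cycling this update over all non-evidence nodes, after a burn-in period, yields configurations whose frequencies estimate the required beliefs.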
3.3.3.5 Importance Sampling

An alternative solution to the problem of estimates having large variance when there are extremely unlikely instantiations of evidence is importance sampling.

Suppose we want to estimate the probability of some evidence, $P(\mathbf{E} = e)$, where $\mathbf{E} \subseteq \mathbf{U}$ is a set of evidence nodes. For convenience we denote the set of all other variables in the network, that is $\mathbf{U} \setminus \mathbf{E}$, by $\mathbf{B}$. Then we can calculate the required probability by summing over the states of the variables in $\mathbf{B}$, while the evidence nodes remain in configuration e. That is,
$$P(\mathbf{E} = e) = \sum_{b \in \mathbf{B}} P(\{\mathbf{B} = b\} \cap \{\mathbf{E} = e\}) = \sum_{b \in \mathbf{B}} \frac{P(\{\mathbf{B} = b\} \cap \{\mathbf{E} = e\})}{f(b)}\, f(b), \qquad (3.18)$$
where $f(\mathbf{B})$ is a probability distribution function over $\mathbf{B}$ and is referred to as the importance function.
Cheng and Druzdzel in [2] propose an adaptive importance sampling algorithm for large Bayesian Networks (AIS-BN). The importance sampling algorithm is similar to the sampling algorithm given in Section 3.3.3.3 except that each root node is randomly instantiated to one of its possible states according to the importance prior distribution for that node, and the state of each remaining node is sampled from the importance conditional probability distribution of that node conditioned on the states of its parents.

To estimate the sum (3.18) we can independently generate m samples $s_1, s_2, \ldots, s_m$ from $f(\mathbf{B})$ and let the value of our estimate be given by
$$\hat{P}_m(\mathbf{E} = e) = \frac{1}{m} \sum_{i=1}^m \frac{P(\{\mathbf{B} = s_i\} \cap \{\mathbf{E} = e\})}{f(s_i)}.$$
The distribution $f(\mathbf{B})$ can be chosen to emphasise instantiations of the variables in $\mathbf{B}$ which are regarded as being important, by increasing the probability of those instantiations over that in the original distribution $P(\mathbf{B})$. This means that under the sampling distribution $f(\mathbf{B})$, these instantiations will be observed relatively more often. However, simply taking the mean of these observations as the value of our estimate would result in a biased estimate for $P(\mathbf{E} = e)$. Therefore each observation is first multiplied by the weight $1/f(s_i)$, which, from (3.18), can be seen to produce an unbiased estimator. Although sampling from the density $P(\mathbf{B})$ would also result in an unbiased estimate, a good choice for the importance function can reduce the variance of this estimate substantially.
It can be shown that the variance of $\hat{P}_m$ is
$$\sigma^2(\hat{P}_m) = \frac{\sigma_f^2}{m}, \quad \text{where} \quad \sigma_f^2 = \sum_{b \in \mathbf{B}} \left( \frac{P(\{\mathbf{B} = b\} \cap \{\mathbf{E} = e\})}{f(b)} - P(\mathbf{E} = e) \right)^{\!2} f(b).$$
Hence the choice of importance function affects the variance of the estimator. It can be shown that the optimal value for f, for all $b \in \mathbf{B}$, is such that
$$f(b) = \frac{P(\{\mathbf{B} = b\} \cap \{\mathbf{E} = e\})}{P(\mathbf{E} = e)} = P(\mathbf{B} = b \mid \mathbf{E} = e), \qquad (3.19)$$
which results in a value of $\sigma_f^2$ equal to 0. In practice it is necessary to estimate (3.19), but it is assumed that functions which are close to optimal will reduce the variance effectively. The AIS-BN algorithm uses heuristics to continually update the importance function (hence the word `adaptive') as more samples are obtained, and has been shown to achieve over two orders of magnitude improvement in precision over both likelihood weighting and self-importance sampling [2].
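The estimator $\hat{P}_m$ is easy to sketch in code. The example below (my own toy setup, not the AIS-BN algorithm itself) estimates $P(E = e)$ for a tiny two-variable network by sampling $B$ from an importance function $f$ and averaging the weighted joint probabilities, as in (3.18).

```python
import numpy as np

rng = np.random.default_rng(1)

# toy network B -> E with P(B) and P(E | B); the evidence is E = 1
p_B = np.array([0.99, 0.01])
p_E_given_B = np.array([[0.999, 0.001],   # row b=0
                        [0.100, 0.900]])  # row b=1
e = 1
true_p = (p_B * p_E_given_B[:, e]).sum()   # exact P(E = 1), for comparison

# importance function over B, chosen to over-represent the rare but important state b=1
f = np.array([0.5, 0.5])

m = 10000
samples = rng.choice(2, size=m, p=f)
weights = p_B[samples] * p_E_given_B[samples, e] / f[samples]
print(weights.mean(), true_p)   # both approximately 0.00999
```

Sampling from $f$ rather than $P(\mathbf{B})$ means the rare state $b = 1$, which carries most of the evidence probability, is visited far more often, so the weighted average stabilises with many fewer runs.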
3.4 Answering Queries

After we have formed a Bayesian Network to represent the probability distribution, we may then want to use the information stored on the links to answer queries about that distribution, for example the probability that node X is in state x, say. This can be done through simple modifications to the network, namely by adding what we will refer to as query nodes. Given a query, we introduce a query node with states corresponding to the hypothesis `X is in state x' and the alternative `X is not in state x'. We then make the query node Q a child of X, as in Figure 3.8 a), with the link probabilities specified by
$$P(Q = 1 \mid X) = \begin{cases} 1 & \text{if } X = x \\ 0 & \text{if } X \neq x. \end{cases}$$
We then propagate the probabilities through the network and the answer to our query is given by the belief that Q is in state 1.
[Figure 3.8: Networks with the addition of query nodes: in a) a single query node Q is added as a child of X; in b) query node Q1 (an OR connective over X and Y) and query node Q2 (an AND connective over Q1 and Z) are added.]
This method can be extended to answer compound queries, for example to determine the probability that node X is in state x or Y is in state y, and that Z is in state z. This is achieved by adding query nodes Q1 and Q2 as in Figure 3.8 b), where the probability table for the OR connective represented by Q1 has the form
$$P(Q_1 = 1 \mid X, Y) = \begin{cases} 1 & \text{if } X = x \text{ or } Y = y \\ 0 & \text{otherwise.} \end{cases}$$
The required probability is then given at node Q2, which represents the AND connective between nodes Q1 and Z. That is,
$$P(Q_2 = 1 \mid Q_1, Z) = \begin{cases} 1 & \text{if } Z = z \text{ and } Q_1 = 1 \\ 0 & \text{otherwise.} \end{cases}$$
This approach can be extended in similar ways to answer more complex queries. Note that by adding the query nodes we are not altering the propagation of information through the remainder of the network. This is because the connections at query nodes are always converging and, as no evidence is entered at these nodes, they d-separate their parents. Hence this does not introduce an alternate path for the flow of information between the variables in the network.
Chapter 4

Aspects of Structure

The previous chapter described how information travels through a Bayesian Network to update the belief distributions of the variables. In this chapter we aim to determine the effect on beliefs of changes made to the network structure, and the value of the information observed at the evidence nodes.

If we are using our network to make decisions and are unsure of a particular aspect of the structure, for example whether a particular link should be included, it would be unreasonable to then use this network if the presence or absence of the link would change the optimal decision. Often it happens that the formation of our network is highly subjective, and estimates of the true parent sets for each node are made. In Chapter 5 we discuss learning the structure of the network when we have access to data generated from the underlying probability distribution. In this chapter we assume that, given a set of variables to include, allocation of parents is performed subjectively, that is, without using information from data.

Although a Bayesian Network can be used simply as an efficient way of storing information about a joint probability distribution, it is often used as a tool to aid decision making. In this context we class variables as being one of three types:
1. Hypothesis variables - these are variables on whose belief distribution the decision will be based.
2. Information/Evidence variables - these are leaf nodes¹ whose state we may be able to observe. We can then enter the observed state of the variable as evidence to be propagated through the network. In practice, the collection of this information may have some cost attached to it, for example the cost of a test.
3. Intermediate variables - these are the remaining variables which determine the relationship between the evidence variables and hypothesis variables and complete the structure of the network.

¹ If the observable node is not a leaf node, the evidence can be represented as an adjoining leaf node which is instantiated.

Thus when we refer to the effect of a structural change we are concerned with the beliefs of a set of hypothesis nodes. If a change has little effect on the distribution of these beliefs then it will not significantly affect our inference and so is considered to be of little importance.

In order to quantify changes made to the structure, and hence the probability distribution which is represented by the network, we use measures developed in Information Theory, namely entropy and mutual information. These concepts are introduced in Section 4.1 and applied in Section 4.2, where we look at the effect of changing the links and nodes of the network.

For reasons of computation and ease of understanding, it is desirable to keep the network as simple as possible whilst ensuring it is still an adequate representation of the probability distribution at the hypothesis nodes. Hence we need to know which changes will simplify the network whilst having as little impact on the belief distributions as possible. In general this is a difficult task. In Section 4.3 we define the problem and then formulate an approximate method of solution.

Given that the acquisition of evidence has some cost associated with it, we may not be able to afford to acquire all available evidence. Section 4.4 discusses how to choose the optimal set of information variables, given a fixed budget.
4.1 Concepts from Information Theory

In order to assess the effect of changes to structure, some measure of the consequent change in probability distribution is required. A structural change may affect the information reaching the hypothesis variables and hence their distribution. Here we introduce several concepts from Information Theory which enable us to quantify these effects.

When making a decision we wish to know the state of the variables on which we are to base the decision. Given a probability distribution over the states of a hypothesis variable, the more the probability mass is concentrated on only one or two states, the more certain we will be about the state of this variable. Entropy is a measure of the amount of uncertainty associated with a random variable: the greater the entropy, the more information is required, on average, to completely determine the state of that variable.
Consider a variable $X_i$ which has $r_i$ states. Denote by $P(x_i^k)$ the probability that $X_i$ is in state k. Then the entropy of the variable $X_i$, $H(X_i)$, is defined as
$$H(X_i) = -\sum_{k=1}^{r_i} P(x_i^k) \log_2 P(x_i^k) = -E[\log_2 P(X_i)].$$
When the probability mass is all assigned to a single state, then $H(X_i) = 0$, which indicates complete certainty in the state of this variable. $H(X_i)$ is maximised when the probability is uniformly distributed over the states of $X_i$. Then
$$H(X_i) = -\sum_{k=1}^{r_i} \frac{1}{r_i} \log_2 \frac{1}{r_i} = -\log_2 \frac{1}{r_i} = \log_2 r_i.$$
The joint entropy $H(X_1, X_2)$ of a pair of discrete random variables with joint distribution $P(X_1, X_2)$ is defined analogously as
$$H(X_1, X_2) = -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k, x_2^l) = -E[\log_2 P(X_1, X_2)].$$
It can be shown, for example in [8], that
$$H(X_1, X_2) = H(X_2) + H(X_1 \mid X_2), \qquad (4.1)$$
where the conditional entropy
$$H(X_1 \mid X_2) = E_{X_2}[H(X_1 \mid X_2)] = \sum_{l=1}^{r_2} P(x_2^l)\, H(X_1 \mid x_2^l) = -\sum_{l=1}^{r_2} P(x_2^l) \sum_{k=1}^{r_1} P(x_1^k \mid x_2^l) \log_2 P(x_1^k \mid x_2^l) = -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k \mid x_2^l).$$
$H(X_1 \mid X_2)$ represents the average uncertainty which remains about the state of $X_1$ if we can observe the state of $X_2$. Applying (4.1) recursively leads to what is known as the chain rule for conditional entropy,
$$H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \ldots + H(X_n \mid X_{n-1}, \ldots, X_1) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}).$$
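These quantities are a few lines of code. The sketch below (my own helper function and an arbitrary example joint table) computes entropy, joint entropy and conditional entropy, and uses the identity (4.1) to obtain the latter.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_k p_k log2 p_k, ignoring zero-probability states."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

# an example joint distribution P(X1, X2), rows indexed by X1 and columns by X2
P12 = np.array([[0.30, 0.10],
                [0.05, 0.55]])

H12 = entropy(P12.ravel())        # joint entropy H(X1, X2)
H2  = entropy(P12.sum(axis=0))    # marginal entropy H(X2)
H1_given_2 = H12 - H2             # conditional entropy H(X1 | X2), via (4.1)

print(round(H12, 4), round(H2, 4), round(H1_given_2, 4))
```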
Relative entropy is a measure of the difference between two distributions $P_1(X_i)$ and $P_2(X_i)$ and is equivalent to the Kullback-Leibler divergence between the distributions². Relative entropy is defined as
$$\sum_{k=1}^{r_i} P_1(x_i^k) \log_2 \frac{P_1(x_i^k)}{P_2(x_i^k)} = E_{P_1}\!\left[\log_2 \frac{P_1(X_i)}{P_2(X_i)}\right].$$
If $P_1(X_i)$ is the `true' distribution and we base our decision on $P_2(X_i)$, the uncertainty in this decision is $-\sum_k P_1(x_i^k) \log_2 P_2(x_i^k)$. Hence the inefficiency in assuming the distribution $P_2$ when $P_1$ is in fact the true distribution is given by
$$-\sum_{k=1}^{r_i} P_1(x_i^k) \log_2 P_2(x_i^k) - \Big[ -\sum_{k=1}^{r_i} P_1(x_i^k) \log_2 P_1(x_i^k) \Big] = \sum_{k=1}^{r_i} P_1(x_i^k)\big(\log_2 P_1(x_i^k) - \log_2 P_2(x_i^k)\big) = \sum_{k=1}^{r_i} P_1(x_i^k) \log_2 \frac{P_1(x_i^k)}{P_2(x_i^k)} = \mathrm{dist}_K(P_1, P_2). \qquad (4.2)$$
To show that the Kullback-Leibler distance is non-negative, we make use of Jensen's inequality³. This says that for a convex function $f(X)$,
$$E[f(X)] \geq f(E[X]). \qquad (4.3)$$
Hence,
$$\begin{aligned} \mathrm{dist}_K(P_1, P_2) &= E_{P_1}\!\left[\log_2 \frac{P_1(X_i)}{P_2(X_i)}\right] = E_{P_1}\!\left[-\log_2 \frac{P_2(X_i)}{P_1(X_i)}\right] \\ &\geq -\log_2 E_{P_1}\!\left[\frac{P_2(X_i)}{P_1(X_i)}\right], \quad \text{from (4.3)} \\ &= -\log_2 \sum_{k=1}^{r_i} P_1(x_i^k)\, \frac{P_2(x_i^k)}{P_1(x_i^k)} = -\log_2 \sum_{k=1}^{r_i} P_2(x_i^k) = -\log_2(1) = 0. \qquad (4.4) \end{aligned}$$

² Although we denote the Kullback-Leibler divergence by $\mathrm{dist}_K(\cdot)$, it is not strictly a distance (metric), as it does not satisfy the triangle inequality, nor is it symmetric; that is, $\mathrm{dist}_K(P_1, P_2) \neq \mathrm{dist}_K(P_2, P_1)$.
³ A proof of Jensen's inequality can be found in [8].
This result, together with (4.2), implies that it is always inefficient to use less accurate information than that which is available. That is, if evidence has been collected we should always use this extra information.

We can use relative entropy to quantify the effect of making a change to the network which may result in the distribution of belief at a hypothesis node changing from $P_1$ to $P_2$, say. If there exists more than one hypothesis variable in the network, the effect of a structural change will be quantified by
$$\sum_{X_h} \mathrm{dist}_K(P_1(X_h), P_2(X_h)),$$
where $X_h$ ranges over the set of hypothesis variables of interest, a subset of the set $\mathbf{U}$ of all variables in the network. This sum can be weighted to reflect the influence individual hypothesis variables will have on the decision.
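As a quick numerical illustration of $\mathrm{dist}_K$ (the example beliefs below are my own), the snippet computes the divergence between two belief distributions over a three-state hypothesis variable and shows that it is non-negative and asymmetric.

```python
import numpy as np

def kl_divergence(p1, p2):
    """dist_K(P1, P2) = sum_k P1(k) log2 (P1(k) / P2(k)), as in eq. (4.2)."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    nz = p1 > 0
    return np.sum(p1[nz] * np.log2(p1[nz] / p2[nz]))

belief_before = [0.2, 0.3, 0.5]
belief_after  = [0.1, 0.2, 0.7]
print(kl_divergence(belief_before, belief_after))   # positive
print(kl_divergence(belief_after, belief_before))   # a different positive value: not symmetric
```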
We now introduce several other measures from Information Theory used in subsequent analyses. Mutual information, $I(X_1; X_2)$, is a measure of the reduction in uncertainty of one random variable due to knowledge of the other. It is the relative entropy between the actual joint distribution and the joint distribution assuming the variables to be independent. That is,
$$\begin{aligned} I(X_1; X_2) &= \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 \frac{P(x_1^k, x_2^l)}{P(x_1^k) P(x_2^l)} \\ &= \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 \frac{P(x_1^k \mid x_2^l)}{P(x_1^k)} \\ &= -\sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k) + \sum_{k=1}^{r_1} \sum_{l=1}^{r_2} P(x_1^k, x_2^l) \log_2 P(x_1^k \mid x_2^l) \\ &= H(X_1) - H(X_1 \mid X_2). \qquad (4.5) \end{aligned}$$
Notice that $I(X_1; X_2) = \mathrm{dist}_K(P(X_1, X_2), P(X_1)P(X_2))$, and so from (4.4) and (4.5) we find that $H(X_1) - H(X_1 \mid X_2) \geq 0$, which implies that
$$H(X_1) \geq H(X_1 \mid X_2).$$
This shows that, on average, obtaining information about some variable $X_2$ distinct from $X_1$ will reduce our uncertainty about the state of $X_1$. Only if $X_1$ and $X_2$ are independent will knowledge of the state of $X_2$ not reduce the entropy of $X_1$.

Finally, the conditional mutual information of random variables $X_1$ and $X_2$ given $X_3$ is
$$I(X_1; X_2 \mid X_3) = H(X_1 \mid X_3) - H(X_1 \mid X_2, X_3) = E\!\left[\log_2 \frac{P(X_1, X_2 \mid X_3)}{P(X_1 \mid X_3) P(X_2 \mid X_3)}\right]. \qquad (4.6)$$
This is a measure of the reduction in uncertainty at $X_1$ obtained by learning $X_2$, when the state of $X_3$ is already known.
4.2 Changes to the Network

As the structure of a Bayesian Network is completely determined by the set of parents for each node, modifications to structure are equivalent to a change in the set of parents of at least one node. In this section we consider the effect on the marginal distribution of the nodes, most significantly the hypothesis nodes, given some change to the structure.

4.2.1 Changes to the Parents of a Node

Often the hypothesis nodes are root nodes. The network shown in Figure 4.1 is a simple model which will allow us to examine the effect of changing the parents of a node in a network which is a tree. Here $X_1$ is the hypothesis node, and each node $X_i$ has children $X_{i+1}$ and $C_i$, where $C_i$ is considered to be the root node of a subtree. That is, if any evidence has been entered in the subtree rooted at $C_i$, then as the network is a tree this information must pass through $C_i$ before it can be received at $X_i$ or $X_1$.

Suppose now we change the parent set of $C_4$ to $\mathrm{Pa}(C_4) = \{X_2\}$. We may make such a change if we believe that $X_2$ is a direct cause of $C_4$. Given that the subtree rooted at $C_4$ contains at least one evidence node, instead of this information first being received at $X_4$ and then propagated through $X_3$ and $X_2$ to $X_1$, it is received directly at $X_2$. Hence, as $X_2$ and $X_3$ must process this information before it can reach $X_4$, we would expect $H(X_4)$ to increase. Similarly, and perhaps more significantly, as $X_2$ is closer to $X_1$ than $X_4$, we would expect $X_1$ to receive more direct information and so expect $H(X_1)$ to decrease.

However, not only structure affects the propagation of evidence - the conditional probabilities also have a large effect. In the above discussion we assumed that the amount of information reaching $X_4$ from $C_4$ was similar to the amount reaching $X_2$ from $C_4$ after having made the change $\mathrm{Pa}(C_4) = \{X_2\}$. That is, we assumed
$$H(X_4 \mid C_4, \mathrm{Pa}(C_4) = X_4) \simeq H(X_2 \mid C_4, \mathrm{Pa}(C_4) = X_2).$$
From (4.5) we know that
$$I(X_4; C_4 \mid \mathrm{Pa}(C_4) = X_4) = H(X_4) - H(X_4 \mid C_4, \mathrm{Pa}(C_4) = X_4)$$
$$\text{and} \quad I(X_2; C_4 \mid \mathrm{Pa}(C_4) = X_2) = H(X_2) - H(X_2 \mid C_4, \mathrm{Pa}(C_4) = X_2).$$
Hence if
$$H(X_4 \mid C_4, \mathrm{Pa}(C_4) = X_4) \ll H(X_2 \mid C_4, \mathrm{Pa}(C_4) = X_2),$$
then $I(X_4; C_4 \mid \mathrm{Pa}(C_4) = X_4) \gg I(X_2; C_4 \mid \mathrm{Pa}(C_4) = X_2)$, and so it is possible that, by making this change of parents, the information reaching $X_1$ is reduced. That is, the information (entropy reducing capacity) reaching $X_1$ is determined not only by the structure but also by our beliefs in the strength of the causal relations between a node and its child.

[Figure 4.1: A tree structure in which each node $X_i$ has children $X_{i+1}$ and $C_i$.]

In general, the entropy reducing capacity of evidence is decreased the further from the hypothesis node it is received. This can be shown by the data processing inequality [8], which demonstrates that any manipulation at a node $X_2$ of the information received from a node $X_3$ cannot increase the amount of information reaching $X_1$. That is, if $X_1 \rightarrow X_2 \rightarrow X_3$, then $I(X_1; X_2) \geq I(X_1; X_3)$, with equality if and only if $I(X_1; X_2 \mid X_3) = 0$. If we put the variable $X_3$ equal to some function of $X_2$, $g(X_2)$, then it follows that
$$I(X_1; X_2) \geq I(X_1; g(X_2)).$$
At each intermediate node between variables $X_1$ and $X_k$, say, a manipulation of the information occurs when it is propagated to its neighbours. Hence at each intermediate node the amount of information reaching $X_1$ from $X_k$ will not be increased, and will probably be reduced. That is, the value (entropy reducing capacity) of evidence to the hypothesis variable is lessened the more times it is manipulated as it is propagated to the hypothesis node.

In the example in Figure 4.1, if we had assigned $\mathrm{Pa}(C_4) = \{X_2\}$ when $X_4$ is the `true' parent of $C_4$, we would be basing our decision at $X_1$ on stronger information than is justified, and could obtain more certain results than we should. If we assign $\mathrm{Pa}(C_4) = \{X_k\}$, where k is greater than four, we would expect greater uncertainty at $X_1$. This suggests that if we are unsure of the correct parent for a variable $X_i$ then, taking a conservative approach, we should assign as the parent of $X_i$ that node at which the information would have the least entropy reducing effect on the hypothesis node.
In general, for some hypothesis node $X^h$, we say a node $X_i$ is further from $X^h$ than $X_j$ if
$$I(X^h; X_j) > I(X^h; X_i),$$
or equivalently, $H(X^h \mid X_j) < H(X^h \mid X_i)$. We choose as the parent of our node, or subtree, that candidate parent which is furthest from the hypothesis node.

However, note that this analysis is based around the mutual information,
$$I(X^h; X_i) = E\!\left[\log_2 \frac{P(X^h, X_i)}{P(X^h)\, P(X_i)}\right].$$
As this is an expected value, we would expect that, in the long run, the procedure suggested for allocating parents will be conservative. However, this procedure may not be optimal in every situation.

Consider the example illustrated in Figure 4.2. In this model we assume that an infection causes fever, which in turn causes tiredness. We wish to add the variable Alertness to the model, but are unsure as to whether it should be a child of Fever or Tired⁴.
[Figure 4.2: A Bayesian Network formed on the four variables Infection (I), Fever (F), Tired (T) and Alertness (A), with Infection → Fever → Tired. The parent of Alertness is yet to be determined.]
The conditional probability tables at Infection, Fever and Tired are $P(I) = (0.05, 0.95)$,

P(F | I)   v. high   high   normal
  yes      0.31      0.65   0.04
  no       0         0.1    0.9

P(T | F)     yes    no
  v. high    0.97   0.03
  high       0.65   0.35
  normal     0.15   0.85

If we assume that Fever should be the parent of Alertness, then we add Alertness to the model as a child of Fever. Given that we define the corresponding conditional probabilities as

P(A | F)     high   average   low
  very high  0.01   0.2       0.79
  high       0.05   0.5       0.45
  normal     0.3    0.6       0.1

then $I(I; A) = 0.03495$, which corresponds to a 12.2% reduction in entropy at Infection. If instead Tired is allocated as the parent of Alertness, with the conditional probabilities

P(A | T)   high   average   low
  yes      0.02   0.4       0.58
  no       0.3    0.6       0.1

this results in a value for $I(I; A)$ of 0.01288, which corresponds to only a 4.5% reduction in entropy. Hence Tired is the furthest candidate parent from Infection, as we would expect from the data processing inequality. Given the above discussion, these results imply that we should add Alertness as a child of Tired to avoid over-certainty in the state of Infection. However, consider the results displayed in the table below. The entries in the first column give the reduction in entropy at Infection given the three possible findings at Alertness, that is $H(I) - H(I \mid A = a)$, when the parent of Alertness is Fever. The entries in the second column are the corresponding values for the case that Tired is allocated as the parent of Alertness.

I(I; A = a)   Pa(A) = F   Pa(A) = T
  high        0.2123      0.1433
  average     0.0656      0.04
  low         -0.3804     -0.2075

If we receive evidence that the state of Alertness is low, then the entropy of Infection is lower if the parent of Alertness is Tired than if it is Fever. If Alertness is observed to be average or high, then the entropy at Infection is lower if Fever is the parent. Hence, although $I(I; A)$ is greater when Fever is the parent of Alertness, we see that this is not the case for all possible instantiations of Alertness. This highlights the fact that, because mutual information is an expected value, allocating the parent of a node using the above method may not necessarily result in less information reaching the hypothesis node for all instantiations of that node.

However, in some cases we may have extra information about the type of decision we are making. For example, in this scenario it may be reasonable to assume that in most cases when we are required to make a decision about the hypothesis node Infection, the alertness of the patient will be average or low. Then if we wish to allocate Alertness so as to have the least entropy reducing effect, the above results suggest it would be reasonable to allocate Fever as the parent of Alertness, and not Tired as suggested by the previous analysis.

⁴ This is a greatly simplified scenario. It may be more realistic to have both Fever and Tired as parents of Alertness, and to model some other causes of tiredness, for example.
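The mutual information figures quoted above can be checked directly from the tables. The following sketch (function names are my own) marginalises through the chain Infection → Fever → Alertness and evaluates $I(I; A)$ for the case where Fever is the parent of Alertness.

```python
import numpy as np

def mutual_information(p_x, p_y_given_x):
    """I(X; Y) in bits, given P(X) and the rows P(Y | X = x)."""
    p_x = np.asarray(p_x, float)
    p_y_given_x = np.asarray(p_y_given_x, float)
    p_xy = p_x[:, None] * p_y_given_x          # joint P(X, Y)
    p_y = p_xy.sum(axis=0)                     # marginal P(Y)
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2((p_xy / (p_x[:, None] * p_y[None, :]))[nz]))

p_I = np.array([0.05, 0.95])                    # P(Infection)
p_F_given_I = np.array([[0.31, 0.65, 0.04],     # P(Fever | Infection = yes)
                        [0.00, 0.10, 0.90]])    # P(Fever | Infection = no)
p_A_given_F = np.array([[0.01, 0.20, 0.79],     # P(Alertness | Fever = v. high)
                        [0.05, 0.50, 0.45],     # P(Alertness | Fever = high)
                        [0.30, 0.60, 0.10]])    # P(Alertness | Fever = normal)

# P(Alertness | Infection), obtained by marginalising out Fever
p_A_given_I = p_F_given_I @ p_A_given_F

print(round(mutual_information(p_I, p_A_given_I), 5))   # 0.03495
```

Replacing the last table with P(A | T) and marginalising through Tired instead reproduces the 0.01288 figure in the same way.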
4.2.2 Changes in Networks Containing Loops

The effect of modifying the set of parents in a network containing loops, or of specifying different causes of a variable, is hard to determine in general. Even if a link is added between nodes which share a common child, so that the networks reduce to the same join tree, propagations through the join trees will differ. We may also wish to examine changes to the parent sets of a probabilistic network, for example if we try to simplify a network over the given set of nodes by removing certain links. A complication here is that the existence of equivalent structures is harder to determine in networks with loops.

Two networks are said to be equivalent if they represent the same conditional independence assumptions, that is, every joint distribution encoded by one structure can be encoded by the other and vice versa [20]. For example, the two networks shown in Figure 4.3 are equivalent as they both encode the same independence relation, namely $I(X_1; X_2; X_3)$.

[Figure 4.3: Equivalent network structures.]

Recall from Section 2.2 that, given any ordering of the variables, we can form a Probabilistic Bayesian Network of the underlying probability distribution by assigning the variables in the boundary strata of a node $X_i$ to be the parents of $X_i$. For each ordering it is likely that a different structure will result. Hence there may be as many as n! different structures for a Bayesian Network on n variables which can represent the same probability distribution. Additionally, how the conditional probabilities are adjusted to compensate for the change is also an important consideration. However, we do know that if we remove a link from a Probabilistic Bayesian Network then the resulting structure will not be equivalent to the original. This follows from the fact that, by definition, a Probabilistic Bayesian Network is a minimal I-map, and so by deleting a link we are forcing an additional constraint of conditional independence to hold.

Note that the two networks in Figure 4.3 cannot be considered equivalent from a causal point of view. If we were to make an intervention and fix the state of $X_3$, say, then the state of this variable is independent of its parents. Hence we can delete the links from $X_3$ to its parents, and our inference would proceed on the networks given in Figure 4.4. These networks are clearly not equivalent.
[Figure 4.4: The networks which would be used for inference if the state of X3 in the networks in Figure 4.3 were fixed.]
4.2.3
Removal of Nodes
We now examine the issues relating to the removal of a node. We may remove a node
from the network in order to simplify the network, or wish to know the eet of failing to
inlude a variable in the model.
Consider the network in Figure 4.5. If we delete node X3 say, we ould simply delete
X1
X2
e
X3
X4
X5
X6
Figure 4.5: A Bayesian Network. If X4 is instantiated and X3 removed, then the path
from X1 to X6 will be bloked.
all links onneted to this node. Then
Pa(X6 ) = Pa(X6 )nfX3 g = fX4 ; X5 g:
However, if we do this then X2 will no longer be an anestor of X6 and so we may hoose
to add the link X2 ! X6 . X1 will still be an anestor of X6 through X4 . However, if X4
is instantiated then this will d-separate X1 from X6 and so we may need to link X1 to X6
also. That is, let
Pa(X6 ) = Pa(X6 )nfX3 g [ Pa(X3 ) = fX1 ; X2 ; X4 ; X5 g:
If we fail to include a node that may be observable, then we sacrifice information.
Consider now the general case of forming a Probabilistic Bayesian Network on the variables X1, ..., Xn, under that ordering, according to the method given in Section 2.2. Suppose we leave out the variable Xk. Recall that the boundary strata for a node Xi
is defined as the set of nodes Bi ⊆ U(i) such that I(Xi; Bi; U(i)\Bi), where U(i) = {X1, ..., Xi−1}. On forming the new network which does not include Xk we then need to reassess these boundary strata. If we follow the same variable ordering as previously, then for Xi such that i < k, the boundary strata remain unchanged. Also, for Xi such that i > k and Xk is not in Bi, the boundary strata remain unchanged. This is because, as U(i)\Bi = {U(i)\{Xk ∪ Bi}} ∪ Xk,

    I(Xi; Bi; U(i)\Bi)  is equivalent to  I(Xi; Bi; {U(i)\{Xk ∪ Bi}} ∪ Xk),

which implies that I(Xi; Bi; U(i)\{Xk ∪ Bi}). This can be written as I(Xi; Bi; U'(i)\Bi), where U'(i) = U(i)\{Xk}, and so the boundary strata Bi is unchanged. For those Xi such that Xk is in Bi, the boundary strata must be reassessed. It is not possible simply to substitute Bk for Xk in Bi, as Bi may have d-separated Xi from Xl, say, where k < l < i. Hence this particular substitution will only hold if Xk+1, ..., Xi−1 are all in Bi.
For example, consider the network on variables X1, ..., X5, formed in that variable ordering, with

    B1 = ∅,  B2 = {X1},  B3 = {X2},  B4 = {X3},  B5 = {X2, X3}.
The resulting network is shown in Figure 4.6 a).
Figure 4.6: The network formed based on the node ordering X1, ..., X5, with boundary strata as given.
If we delete X3, say, the assertion that we may replace X3 in B5 by B3, and X3 in B4 by B3, to give B'5 = {X2} and B'4 = {X2}, is not valid. This would be equivalent to saying

    I(X5; {X2, X3}; {X1, X4}) and I(X3; X2; X1)  ⟹  I(X5; X2; {X1, X4}),

which is not true. In Figure 4.6 b) we can see the additional independence relation this creates, I(X5; X2; X4), which does not hold in Figure 4.6 a). As there are no nodes between X3 and X4 in the ordering, we can deduce that B'4 = B3 = {X2}. However, as X5 is after X3 in the node ordering and X3 is in B5, we need to separately assess B'5 so
that B'5 is the minimal set satisfying I(X5; B'5; {X1, X2, X4}\B'5). In this case we should define B'5 = {X2, X4}.
This analysis shows that if we delete `sub-networks' or sub-trees, the structure of the higher parts of the network will remain unchanged. Hence for a network initially formed on some ordering of variables, it would be desirable to begin by eliminating nodes towards the end of this ordering, to minimise the number of boundary strata needing to be reassessed.
4.3 Simplifying the Structure
In this section we assume that we are given a network which we then want to simplify in order to minimise storage space and increase efficiency. Our goal is to form a network which is as simple as possible, but which still provides an adequate approximation at the hypothesis nodes to the distribution of the original network.
To do this exactly, each time a structural simplification is made we would have to search through all possible values for the conditional probabilities in order to find those which result in the best approximation. Given that for even a network of moderate size there are many interactions to consider, for example the considerations of deleting a node in Section 4.2.3, we conclude that this is too computationally intensive and instead take a different approach.
In essence, we aim to find the simplest network possible whilst keeping the distribution over the hypothesis nodes within a set distance from the original distribution. That is, we wish to find a network structure S with minimal complexity that, given appropriate conditional probabilities are defined, results in a distribution that satisfies some constraint on distance from the original distribution. Given a node Xi, we have seen that the size of the conditional probability table required will depend on the number of parents of Xi and the number of states of Xi and its parents. Here we denote the number of entries in the conditional probability table at Xi by

    Sp(Xi) = (|Xi| − 1) ∏_{Xk ∈ Pa(Xi)} |Xk|,

where again |Xi| denotes the number of states of Xi. Note that the complexity of the network depends on the number of nodes, the number of links and the number of states of each variable. We use as a measure of the complexity of a network S [22],

    Size(S) = ∑_{Xi ∈ U} Sp(Xi).

This is linear in the number of nodes, while Sp(Xi) is exponential in the number of parents of Xi.
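For illustration, the following sketch (not part of the original text) computes Sp(Xi) and Size(S) for a network represented as a dictionary of parent sets and state counts; the representation and function names are assumptions made for the example.

    from math import prod

    def sp(node, parents, n_states):
        """Number of entries in the conditional probability table at `node`:
        (|X_i| - 1) times the product of the parent state counts."""
        return (n_states[node] - 1) * prod(n_states[p] for p in parents[node])

    def size(parents, n_states):
        """Size(S): the sum of Sp(X_i) over all nodes in the network."""
        return sum(sp(x, parents, n_states) for x in parents)

    # Example: X1 -> X3 <- X2, all nodes binary
    parents = {"X1": [], "X2": [], "X3": ["X1", "X2"]}
    n_states = {"X1": 2, "X2": 2, "X3": 2}
    print(size(parents, n_states))  # 1 + 1 + 4 = 6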
Suppose we denote the distribution resulting from a simplification as P'. Then, if we let Xh be the set of hypothesis nodes and E the set of evidence nodes, we could consider the problem

    min  Size(S)
    s.t. dist(P(Xh|e), P'(Xh|e)) ≤ t,  for all e ∈ E.    (4.1)
As (4.1) is quite strict, given a prior distribution for E we may instead wish to minimise Size(S) subject to a constraint on

    ∑_e π_e dist(P(Xh|e), P'(Xh|e)),    (4.2)

where π_e is the prior probability (or an estimate of the probability) that E is in configuration e. This may allow more scope for simplification, though it also comes with added risk, as it is possible that we may observe an instantiation of E that occurs with low probability but for which the distance between P and P' is large.
An additional constraint on simplification is that we include the hypothesis nodes and (in most cases) the information nodes. Our problem can then be posed as a minimum cost network design problem subject to these constraints, where the cost is associated with the complexity of the network. This is similar to the Steiner Tree problem [21], which has been studied extensively. However, the distinguishing feature of our problem is that fixed weights or costs cannot be assigned to the edges or to the nodes. For example, a suitable value for the node weight would be Sp(Xi), but this is a function of the structure of the network through the parents of Xi.
The presence of the constraint adds another layer of difficulty to finding an efficient method of solution. To avoid a constrained problem, it is suggested, for example in [22], that the term describing adequacy of fit be built into the objective function. This yields an acceptance measure,

    A(P, P') = Size(S') + α dist(P, P'),    (4.3)

where S' is the simplified structure and α some constant. This can then be minimised using standard methods. Here the tuning parameter α is chosen by the user: large values of α correspond to more heavily penalising models with a large distance. However, a careless choice for this parameter could result in models with large size, or an unacceptable discrepancy in distance, being `accepted.' Notice that if we base our model selection on the acceptance measure, then we need not consider as an improvement any network S'
such that Size(S') ≥ Size(S). This follows as

    A(S') = Size(S') + α dist(P', P)
          ≥ Size(S') + α dist(P, P)
          ≥ Size(S) + α dist(P, P)
          = A(S).
It was shown in the example concerning Figure 4.5 above that if we want to simplify a given network, as well as considering deletions of links in the network, we must also consider the insertion of links not originally included. An exhaustive search could be done by considering Size(S') and dist(P, P') for each possible structure on the n nodes and choosing that structure for which this is minimised. However, the number of possible directed acyclic graphs on n nodes is given in [7] by the recursive function

    f(n) = ∑_{i=1}^{n} (−1)^{i+1} C(n, i) 2^{i(n−i)} f(n−i),

where C(n, i) is the binomial coefficient. Thus there are 25 possible structures on three nodes, about 29 000 on five nodes and approximately 4.2 × 10^18 structures on just ten nodes. Hence exhaustive enumeration of all network structures is clearly not feasible, and other methods have to be considered.
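The recursion is easily checked numerically; a minimal sketch, assuming f(0) = 1 as the base case:

    from functools import lru_cache
    from math import comb

    @lru_cache(maxsize=None)
    def num_dags(n):
        """Number of directed acyclic graphs on n labelled nodes (the recursion above)."""
        if n == 0:
            return 1
        return sum((-1) ** (i + 1) * comb(n, i) * 2 ** (i * (n - i)) * num_dags(n - i)
                   for i in range(1, n + 1))

    print(num_dags(3))   # 25
    print(num_dags(5))   # 29281
    print(num_dags(10))  # 4175098976430598143, approximately 4.2e18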
One way of constraining the search space, as discussed in [22] for example, is to consider only those links within some structural hierarchy. For example, if some variables can be considered causes of disease and others symptoms, we can impose the constraints shown graphically in Figure 4.7, where links are allowed only within a cluster or downwards in the hierarchy. Another way of constraining the search is to give the nodes a numeric labelling and specify that a link from Xi to Xj may exist only if i < j.
Figure 4.7: A structural hierarchy. For example, C could represent a cluster of variables representing causes, D contain disease nodes and S contain nodes representing possible symptoms.
However, even with such constraints imposed, heuristics are needed for networks of even moderate size.
For any of the model selection criteria discussed earlier, after each simplification we must be able to calculate the network size, update the conditional probabilities and then evaluate the chosen distance measure. As we may make many changes in searching for the optimal simplification, it is important to be able to perform these calculations as efficiently as possible.
Consider first the task of evaluating network size. At each stage assign to every link l the weight wl, where wl represents the amount by which the size of the network will be reduced by the deletion of this link. Let S' be the network obtained from S by the deletion of a link, and consider the weight on the link from Xi to Xj, say,

    wl = ∑_{Xk ∈ S} Sp(Xk) − ∑_{Xk ∈ S'} Sp'(Xk)
       = Sp(Xj) − Sp'(Xj).
Here Sp'(Xj) represents the value Sp(Xj) in the network S'. In the case where Xj is a leaf node with sole parent Xi in S, on simplifying the network by removal of the link from Xi to Xj we can also remove the node Xj, as it is disconnected from the network. If Xj has sole parent Xi but is a parent of some other variable, we retain Xj. If Xj has parents other than Xi, Sp'(Xj) = Sp(Xj)/|Xi|. Hence

    wl = Sp(Xj)                    if Pa(Xj) = {Xi} and Xj has no children,
         Sp(Xj) − |Xj|             if Pa(Xj) = {Xi} and Xj has at least one child,
         Sp(Xj)(1 − 1/|Xi|)        otherwise.
After having removed the link from Xi to Xj we must then update the weights. The only links on which the weights must be altered are those from the remaining variables in Pa(Xj) to Xj. In this case the weights will change to wl/|Xi|.
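A sketch of the weight calculation (reusing the Sp computation from the earlier example); the parent and child lists and the state counts are data structures assumed for the sketch.

    from math import prod

    def sp(node, parents, n_states):
        return (n_states[node] - 1) * prod(n_states[p] for p in parents[node])

    def deletion_weight(i, j, parents, children, n_states):
        """Reduction w_l in Size(S) from deleting the link X_i -> X_j, following
        the piecewise definition above."""
        sp_j = sp(j, parents, n_states)
        if parents[j] == [i]:
            if not children[j]:                 # X_j becomes disconnected and is removed
                return sp_j
            return sp_j - n_states[j]           # X_j is retained with no parents
        return sp_j * (1 - 1 / n_states[i])     # X_j keeps its other parents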
To carry out model selection we must choose a measure of distance. A reasonable distance measure, additive over the hypothesis nodes, is

    dist(P(Xh), P'(Xh)) = ∑_{Xi ∈ Xh} distK(P(Xi), P'(Xi)),

where we have used the Kullback-Leibler divergence. This allows for an efficient method of calculating the distance at each simplification. Based on (4.2),

    dist(P, P') = ∑_e π_e ∑_{Xi ∈ Xh} distK(P(Xi|e), P'(Xi|e)).    (4.4)
Assuming we have the probabilities P(Xi|e) from the original network, we can calculate (4.4) efficiently by the use of a scoring function. Define the score of a Bayesian Network S' to be

    ρ(S'|e) = −∑_{Xi ∈ Xh} ∑_{xi} P(Xi = xi|e) log P'(Xi = xi|e),

and

    ρ̃' = ∑_{e ∈ E} π_e ρ(S'|e).
Then we have that

    ρ̃' − ρ̃ = ∑_{e ∈ E} π_e ∑_{Xi ∈ Xh} ∑_{xi} P(Xi = xi|e)(log P(Xi = xi|e) − log P'(Xi = xi|e))
            = ∑_{e ∈ E} π_e ∑_{Xi ∈ Xh} distK(P(Xi|e), P'(Xi|e))
            = dist(P, P').
To obtain the required probabilities P'(Xi|e), we must first update the conditional probabilities of our new network, as described below. For each instantiation of the evidence variables we can then propagate this evidence through the network to obtain the required probabilities at the hypothesis nodes.
If in simplifying the network we remove the link from Xi to Xj, say, then we require the conditional probabilities

    P(Xj | Pa'(Xj)) = P(Xj | Pa(Xj)\{Xi}).

Consider the probability distribution for Xj given a particular configuration of Pa'(Xj), that is, one row of the conditional probability table. Then the required probabilities are, for all xj ∈ Xj,

    P(Xj = xj | pa'(Xj)) = P(xj, pa'(Xj)) / P(pa'(Xj))
                         = ∑_{xi} P(xj, pa'(Xj), Xi = xi) / P(pa'(Xj))
                         = ∑_{xi} P(xj | pa'(Xj), Xi = xi) P(pa'(Xj), Xi = xi) / P(pa'(Xj))
                         = α ∑_{xi} P(xj | pa'(Xj), Xi = xi) P(pa'(Xj), Xi = xi),

where α is the normalisation constant for that particular configuration of Pa'(Xj). As Pa(Xj) = Pa'(Xj) ∪ {Xi}, the probabilities can easily be obtained from the network prior to the removal of the link by fixing the states of the variables in Pa'(Xj) and summing over the probabilities for each state of Xi.
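As an illustration of this update, the sketch below averages the deleted parent out of a conditional probability table stored as an array with one axis per parent and a final axis for Xj; for simplicity it weights the states of Xi by a single marginal vector, which is an approximation to the configuration-dependent weights P(Xi = xi | pa'(Xj)) described above.

    import numpy as np

    def remove_parent_from_cpt(cpt, parent_axis, parent_weights):
        """Approximate P(X_j | Pa'(X_j)) after deleting one parent X_i by taking a
        weighted average of the CPT rows over the states of X_i."""
        w = np.asarray(parent_weights, float)
        w = w / w.sum()
        cpt = np.moveaxis(np.asarray(cpt, float), parent_axis, 0)
        return np.tensordot(w, cpt, axes=(0, 0))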
Given that we now have efficient methods for calculating the size and distance required to evaluate the effect of a simplification, we can combine these ideas into a process for network simplification. If we were to use the acceptance measure, we could search for the network which minimises it, using any standard search method. We now present a heuristic for the alternative formulation,

    minimise   Size(S)
    subject to dist(P, P') < t,    (4.5)
which is based on a greedy method. The required input is a Bayesian Network, N, with hypothesis nodes Xh, evidence nodes E, and a threshold t on the distance. We denote the set of edges in the network Nk at step k by Lk. The heuristic returns a simplified network Nbest.
1   Put L1 equal to the set of edges in the initial network N1.
2   Calculate weights wl for all edges l in L1.
3   Put k = 1.
4   While Lk ≠ ∅, do
5       sort wl for all edges l ∈ Lk
6       lmax ← the edge of maximum weight
7       Lk+1 ← Lk \ {lmax}
8       N' ← Nk \ {lmax}
9       Size(N') ← Size(Nk) − w(lmax)
10      update weights and probabilities on remaining links
11      If Size(N') < Size(N),
12          calculate dist(P, P|N')
13          If dist(P, P|N') < t
14              Nk+1 ← N'
15              Nbest ← Nk+1
16          else Nk+1 ← Nk
17      else Nk+1 ← N'
18      k ← k + 1.
Note that the initial network N1 does not have to be the same as the network we wish to simplify, N, and in general will be a complete network. This allows links not in the original network to be included in the simplified network, which may result in a better approximation than by simply deleting links from N. If time permits, the heuristic should be run for as many complete networks as possible, and some thought should be given to which structures might be likely to result in a good approximation.
The heuristic begins with the maximal number of allowable links and at each stage removes the link whose deletion results in the largest decrease in the value of the objective function (4.5), that is, the Size. If the Kullback-Leibler divergence is used, the distance on line 12, required to assess the feasibility of deleting lmax, may be calculated efficiently by the method presented earlier. If a deletion is not feasible the corresponding link is removed from the pool of remaining links, and this is repeated until there are no more links which may be feasibly deleted. At each stage the number of elements in Lk is reduced by one,
hence, assuming the network has n nodes, the algorithm will loop at most n(n−1)/2 times. Note that there is no constraint that specifies the evidence nodes and hypothesis nodes should be included in the final network. However, the constraint on distance should be tight enough so that any relevant evidence nodes are included, and that the hypothesis nodes remain connected to the network.
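A compact sketch of the greedy loop follows; the callables size_of, distance_of and weight_of stand in for the Size, dist(P, P') and wl calculations described above, and the edge-set representation is an assumption of the sketch.

    def simplify(edges, size_of, distance_of, weight_of, t, target_size):
        """Greedy structure simplification following the heuristic above."""
        current = set(edges)              # edge set of the starting network N^1
        best = set(current)
        candidates = set(current)         # L^k: edges still available for deletion
        while candidates:
            l_max = max(candidates, key=lambda e: weight_of(e, current))
            candidates.discard(l_max)     # L^{k+1} <- L^k \ {l_max}
            trial = current - {l_max}     # N' <- N^k \ {l_max}
            if size_of(trial) < target_size:      # compare with Size(N)
                if distance_of(trial) < t:        # feasible deletion
                    current, best = trial, set(trial)
            else:
                current = trial
        return best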
Many of the computational difficulties encountered above are due to the fact that the number of possible structures on even a reasonable number of nodes is huge. In Chapter 5 we discuss methods that have been developed for learning the structure of a Bayesian Network from data, and the techniques, approximations and assumptions which are introduced to overcome such difficulties.
4.4 The Value of Evidence
In this section we assume that the set of evidence nodes E in our Bayesian Network is non-empty. Ideally, when making inference at the hypothesis nodes we would base this inference on as much information as possible, that is, we would observe the state of each information variable. However, often this observation may be associated with a cost. For example, consider the information variable Test Result shown as a child of Glandular Fever in Figure 4.8. Although Test Result is an evidence node, it may cost $40 to have the test performed.
Figure 4.8: A Bayesian Network with hypothesis variable Glandular Fever and information variables Test Result, Thermometer Reading and Tiredness.
Evidence may be observed at Thermometer Reading for a very small cost. Note that we make Thermometer Reading a child of Fever so that any discrepancies between the thermometer reading and actual fever can be accounted for, and that similarly Test Result is a child of Glandular Fever in order to allow possible false positives or false negatives to be modelled.
In Section 4.4.1 we present a method to find the optimal subset of evidence nodes at which to observe evidence, given a fixed budget. In Section 4.4.2 we then show that the optimal set of evidence changes with structure, or more precisely with the distribution over the
hypothesis nodes, and hence should be continually revised.
4.4.1 Selecting a Set of Evidence Nodes
Our objective is to maximise the amount of evidence reaching the hypothesis nodes, subject to a budget constraint. That is,

    max_{Ẽ}  I(Xh; Ẽ)
    s.t.  ∑_{Xi ∈ Ẽ} κ(Xi) ≤ C,

where Ẽ ⊆ E is a subset of the possible evidence nodes, κ(Xi) is the cost associated with observing evidence at node Xi and C is some predetermined upper bound on the total cost (the budget).
As the instantiation of some variables may affect the entropy reducing capacity of evidence at other nodes, we must consider separately all feasible sets of evidence Ẽ and determine which set maximises I(Xh; Ẽ).
Suppose we have a set of evidence nodes Ẽ = {Xn−m+1, ..., Xn}, say, and we can then observe evidence at an additional node Xn−m. Denote the set {Xn−m, Xn−m+1, ..., Xn} by Ẽ+. Then we have that

    I(Xh; Ẽ+) = H(Xh) − H(Xh|Ẽ+),  from (4.5),
              = H(Xh) − H(Xh|Ẽ) + H(Xh|Ẽ) − H(Xh|Ẽ+)
              = I(Xh; Ẽ) + I(Xh; Xn−m|Ẽ),  from (4.6).

Thus, in order to calculate the change in information Xh receives when a node Xn−m is added to the set of evidence, we need only calculate I(Xh; Xn−m|Ẽ). However, this will still involve a summation over all states of the variables Xn−m, ..., Xn.
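For small networks this quantity can be evaluated directly from a joint table over Xh, the candidate node and Ẽ; a sketch, assuming the joint distribution is supplied as a NumPy array with one axis per variable:

    import numpy as np

    def conditional_mutual_information(joint, h_axis, x_axis):
        """I(X_h; X | E~) in bits from a joint table P(X_h, X, E~); h_axis and x_axis
        index the hypothesis variable and the candidate evidence node, and all
        remaining axes are treated as the already chosen evidence E~."""
        p = np.asarray(joint, float)
        p = p / p.sum()
        p_e  = p.sum(axis=(h_axis, x_axis), keepdims=True)   # P(e)
        p_he = p.sum(axis=x_axis, keepdims=True)             # P(x_h, e)
        p_xe = p.sum(axis=h_axis, keepdims=True)             # P(x, e)
        mask = p > 0
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = (p * p_e) / (p_he * p_xe)
        return float(np.sum(p[mask] * np.log2(ratio[mask])))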
In order to find the optimal solution to the problem above, one would have to perform an exhaustive search. That is, calculate the value of the information for each of the possible sets of evidence nodes with total cost at most C. The optimal set of evidence would then be that which maximised I(Xh; Ẽ).
We now propose a greedy search technique for the problem, where we begin with no nodes from which to collect evidence, and at each stage add the node which is affordable and has the greatest entropy reducing capacity at the hypothesis nodes. Although the solution will not always be optimal, in the case of there being many evidence nodes it is computationally efficient.
Let the set of nodes chosen prior to step k at which to observe evidence be denoted Ẽk−1, let Wk ⊆ E be the set of evidence nodes which are candidates for selection at step k, and let Fk ⊆ Wk be the set of all feasible nodes at step k, that is, nodes which can be included in Ẽk without going over budget. Ck = ∑_{Xi ∈ Ẽk} κ(Xi) is the cost of collecting the evidence Ẽk at the end of step k.
1   Ẽ0 ← ∅
2   W0 ← E, the set of all evidence nodes
3   C0 ← 0
4   k ← 1
5   for each Xi ∈ Wk−1
6       if κ(Xi) ≤ C − Ck−1
7           Fk ← Fk ∪ {Xi}
8   if Fk = ∅
9       Ẽ ← Ẽk−1
10      return {Ẽ} and stop.
11  else Xk ← argmax_{Xi ∈ Fk} I(Xh; Xi | Ẽk−1)
12  Ẽk ← Ẽk−1 ∪ {Xk}
13  Ck ← Ck−1 + κ(Xk)
14  Wk ← Fk \ {Xk}
15  k ← k + 1
16  go to line 5.
If C is large compared to the cost for each test, it may require fewer steps to begin a similar search starting with E and successively removing nodes with the lowest score. If there exist evidence nodes which, when instantiated, would d-separate other nodes in Wk from Xh, then the mutual information of these d-separated nodes to Xh would be zero and so they could be removed from Wk. In many cases this could substantially increase the speed of the search, as it would avoid many unnecessary calculations of the mutual information.
4.4.2 Updating the Set of Evidence Nodes
Suppose now the function we are trying to minimise is I(X1 | Xn, ..., Xn−m), where X1 is a root node and a hypothesis node and Xn−m, ..., Xn are information nodes. We know

    I(X1; Xn, ..., Xn−m) = I(Xn, ..., Xn−m; X1)
                         = H(Xn, ..., Xn−m) − H(Xn, ..., Xn−m | X1),    (4.1)
where, recall from Section 4.1,

    H(Xn, ..., Xn−m) = −∑_{xn,...,xn−m} P(xn, ..., xn−m) log2 P(xn, ..., xn−m),

and also

    P(xn, ..., xn−m) = ∑_{x1,...,xn−m−1} P(x1, ..., xn)
                     = ∑_{x1,...,xn−m−1} P(xn|pa(Xn)) ··· P(x2|pa(X2)) P(x1).
As the conditional probabilities are fixed and hence independent of the observed evidence, this last expression is a function of P(x1). Hence H(Xn, ..., Xn−m) is a function of P(X1). Similarly, H(Xn, ..., Xn−m | X1) is a function of

    P(xn, ..., xn−m | x1) = P(xn, ..., xn−m, x1) / P(x1),  for every (x1, ..., xn),
                          = ∑_{x2,...,xn−m−1} P(x1, ..., xn) / P(x1)
                          = ∑_{x2,...,xn−m−1} P(xn|pa(Xn)) ··· P(x2|pa(X2)) P(x1) / P(x1)
                          = ∑_{x2,...,xn−m−1} P(xn|pa(Xn)) ··· P(x2|pa(X2)).
Hence, given the conditional probability table for each node, these probabilities are fixed. Therefore

    H(Xn, ..., Xn−m | X1) = −∑_{x1} P(x1) ∑_{xn,...,xn−m} P(xn, ..., xn−m | x1) log2 P(xn, ..., xn−m | x1)
is a function of P(X1), and so, from (4.1), I(X1; Xn, ..., Xn−m) is a function of P(X1). This implies that the optimal set of evidence may change as the distribution of X1 is updated. For example, if at time t = 0 we have a prior distribution for X1 and observe some evidence, at t = 1 we update the belief distribution of X1 to P(X1|e), and this becomes the prior belief of X1 at the next time step. Hence we should look for an optimal set of evidence nodes each time the distribution of X1 changes `substantially.'
Note that I(X1; Xn, ..., Xn−m) is a concave function of P(X1) and there exists a value for P(X1) which results in a global maximum. Das, in [10], notes that as P(X1) moves away from this optimal value, one must change the specifications of the network in order to maintain optimality. He goes on to say that this can be done in one of two ways, either
1. change the set of evidence nodes so that the information obtained has a greater degree of relevance to the present situation, or
2. change the intermediate nodes, and therefore the links in the network, so that "the channel of evidence propagation is more relevant to the prevailing situation."
By making the changes suggested in point two, it seems that it is possible to reduce the uncertainty in X1 by more than is justified by the situation. This may result in removing genuine sources of uncertainty in order that the information obtained has a greater effect on reducing the uncertainty of X1. Hence, to ensure a decision is not biased, but that the available evidence is maximised, it seems best to continually monitor the optimal set of evidence and update this when necessary.
If possible we should observe all evidence nodes at each stage, though in practice, for reasons of cost or practicality, we are restricted to choosing a subset of nodes. For example, if the decision to be made is some kind of diagnosis at the hypothesis node, then we may be able to reduce the uncertainty in this decision by performing a series of tests, that is, observing evidence. After the information from a particular test has been incorporated and the distribution at the hypothesis node updated, there may be little value in performing the same test again. Hence in this case the evidence obtained by performing other tests may become more valuable and relevant, and we update the set of evidence nodes at the next stage accordingly. This shows that point one above is intuitively reasonable, and agrees with general practice.
Chapter 5
Learning Bayesian Networks from Data

5.1 Introduction
In previous sections we have seen how to form a Bayesian Network from causal or subjective knowledge, how information is propagated through the network, and why Bayesian Networks are an efficient means of storing a probability distribution. Often it happens that the domain one wishes to make inference about is not well understood, or that one may be confident in identifying the independencies of the domain (and hence network structure) but be less sure about the numerical specification of the conditional probabilities. It has also been shown that the opinions of experts, whose knowledge may be used to form a network, are rarely very accurate [18]. In this chapter we will assume we have access to data believed to have been generated from the underlying probability distribution. We can use this data alone to form a Bayesian Network, or we may combine the data with expert or prior knowledge. After having formed a Bayesian Network in this way, we have a model of the domain which can be used to assist decision processes or for other inferential processes. The structure of the network gives a graphical representation of the relationships between the variables in the domain.
In this chapter we discuss the use of data to learn a Bayesian Network. In Section 5.2 we introduce general considerations and the objectives of learning, then in Section 5.3 give an overview of the multitude of approaches to this problem that have evolved from different schools of research. In Sections 5.4 and 5.5 we look at two of the more common and widely accepted methods, namely the Information-Theoretic approach, which was one of the pioneering methods, and then Bayesian methods, which have been the focus of much recent research in this area. Finally, in Section 5.6 we look at an alternate approach based on the bootstrap.
5.2 Considerations of learning
Our goal when learning Bayesian Networks from data may be to learn the structure of the network, the parameters of the network (conditional probabilities), or both the structure and parameters. The structure depends on the independence relations between the variables of the domain, and the parameter specification is dependent on the structure. Generally there is no one model that stands out as the `correct' network based on the data. Given that we have a random sample of cases there will ultimately be variation between samples, referred to as sample variation. A model is considered to be an acceptable fit to the data if it differs from the observed data by an amount which can be explained by sample variation [30]. There should be enough nodes and links in the model to represent the true underlying distribution, but not so many that the noise (which arises from the sample variation) is modelled.
A complete graph would provide the best fit to any data set, though a simpler model with fewer links could provide a better representation of the true underlying dependence structure and would generalise better. The idea of generalisation relates to the ability of a model to predict an independent test observation, that is, an observation not used to fit the model. Although a complete graph will provide the best fit to the data, a model with good generalisation ability will more accurately model future cases. This is important if we form a Bayesian Network based on past cases and wish to enter information and make decisions using this network in the future.
One of the reasons many methodologies for this problem exist is the necessity to treat learning problems differently for what can be broadly classed as small, medium and large sample problems. If a reasonable approximation is to be found from a small sample, the method must rely more heavily on prior or expert knowledge, and the data is used to tune this or test for inconsistencies. When we have large amounts of data, prior knowledge is less significant. Most instances fall into the medium sample category, which lies between these two extremes.
The number of possible structures and the number of parameters to be estimated increase exponentially with the number of nodes in the network. Hence, computationally, model selection may require large amounts of time and storage space. The larger the domain we model, the more possible structures and parameters are to be learnt, so that we require more data for similar levels of confidence. It would seem desirable that as the sample size grows large, learning methods should identify a model that is closer to the true distribution, where closeness may be measured by some dissimilarity measure. However, given large amounts of data and possibly many variables, the computational complexity increases. In general, learning Bayesian Networks is an NP-complete
problem with respect to the number of variables [18], and exhaustive enumeration of all possible models is not possible; hence many approximate methods and heuristics have been developed.
A common measure of complexity was introduced in Section 4.3 and was given to be the Size of the network. In that section our primary aim was to reduce the complexity of an existing model, whereas here we aim to learn good models and then simplify them if needed. Over-fitting is not often a problem in learning Bayesian Networks, as a result of the constraints imposed on complexity a priori in order to constrain the search space. It is more often the case that we risk oversimplification.
In what follows we assume that we have a database D containing N observations x1, ..., xN. In any observation xl = (xl1, ..., xln), if there is no information recorded on a variable this is referred to as a missing value and the database is termed incomplete. As missing values complicate the analysis they are treated in Section 5.5.5 separately from the case of complete data (no missing values), and we will otherwise assume our data to be complete. We also continue to assume the variables are discrete. This assumption obviously restricts the models we are able to learn, but as the required approach can differ substantially for continuous data it will not be discussed.
5.3 Methods of Learning Bayesian Networks
5.3.1 Scoring Functions and Search Methods
Scoring functions are used extensively for learning and model selection. If the model space is small, the practitioner can look at a number of the `best' models and choose that which seems most appropriate, generally favouring the simpler models. However, when the search space is large, as is the case for most network problems, this procedure needs to be automated. This can be done by defining a function which assigns each model some score. We can then employ a search procedure to find a model with a high score.
Scoring functions generally consist of two parts, a measure of fit and a complexity measure, and often take the form

    (measure of fit) − α (measure of complexity).    (5.1)

Here α is a user-defined parameter which determines the relative influence of the two terms in the above expression. If we choose α to be very large this will result in a simple model which is probably too simple to adequately model the process. If α is very small this will result in choosing the model which best fits the data, and is likely to result in over-fitting. For example, the Maximum Likelihood approach introduced in Section 5.3.2 tends to result in over-fitting, and so often a penalised likelihood score is used which penalises
overly complex models. The acceptance measure (4.3) is another example of such a scoring function. Often a model selection procedure can be quite sensitive to the choice of tuning parameter α, and care must be taken in determining its value. As mentioned earlier, when learning Bayesian Networks the complexity is often constrained a priori, and so in this case the scoring function will consist solely of a measure of fit.
Almost all scoring functions used in learning Bayesian Networks are decomposable, that is, they can be written as the product of n factors, each of which is a function of only a node and its parents. A scoring function δ(·) is decomposable if it can be written in the form

    δ(D, S) = ∏_{i=1}^{n} s(xi | pa(Xi)),

where δ is a function of the observed data D and the structure S, and s(·) is some specified function. Often the log score is used, so that a decomposable function can then be written as a summation.
5.3.1.1 Search Methods
Once a scoring function has been defined, the model selection procedure is to search over the model space for the structure with the best score.
Any network structure can be modified by either adding or deleting a link, subject to the constraint that there be no directed cycles. If the scoring function is decomposable, to obtain the score of the new network from the old we need only evaluate s(xi | pa(Xi)), where Xi is the node to which the amended link would point.
For example, if we let Δe denote the change in log δ(D, S) when adding or deleting edge e, a common search procedure is to evaluate Δe for all feasible changes at every stage and make the change corresponding to the largest Δe. If no Δe is greater than zero we stop, as the score cannot be improved at the next step. A problem with this method is becoming stuck at a local maximum, so in practice techniques such as multiple starts, perturbation of structure and simulated annealing are used to overcome this.
For the case where every node can have at most one parent, polynomial-time algorithms can be used to find the highest scoring network, based on methods for finding a maximum weight spanning tree. The case where each node has k parents, k > 1, is NP-hard with respect to k [7]. It has been shown that greedy algorithms and local search procedures can perform well [1], and procedures such as branch and bound have also been applied successfully.
A commonly used heuristic is known as K2, first proposed in [7]. It begins by assuming a node has no parents and at each stage adds that parent whose addition increases the probability of the structure by the greatest amount, as measured by the likelihood of the
observed data. The score used is a function of each node, its parents and the observed data, such that maximising the score is equivalent to maximising the likelihood. In this case the complexity is constrained by assigning to each node an upper bound on the number of parents it may have.
5.3.2 Maximum Likelihood
Maximum Likelihood is used extensively in the field of statistics. Given a fixed model with structure S and conditional probabilities θ, the sample likelihood is defined as

    L(S, θ | x) = ∏_{l=1}^{N} P(xl | S, θ),    (5.2)

where we have assumed the observations x1, ..., xN are independent. If we were to search over all of S, then using the principle of maximum likelihood the optimal value for (S, θ) would be that for which the likelihood (5.2) is maximised. The maximum likelihood estimators Ŝ and θ̂ can be considered the most likely values of the corresponding parameters for the model considered, in the sense that the observed data is most likely under the model defined by Ŝ and θ̂.
As an example, consider the fixed structure in Figure 5.1. In this case each observation xl is a vector (tl, el, nl) and the parameters θ|S = (θT, θE, θN), where, for example, θE is the conditional probability table for node E.
Figure 5.1: A Bayesian Network, with nodes Taxable Income (T), Expenditure on Leisure (E) and Number of Dependents (N), for which we wish to learn the parameters by the method of maximum likelihood.
Then the likelihood for the parameters θ is

    L(θ | x, S) = ∏_{l=1}^{N} P(tl, el, nl | θT, θE, θN)
                = ∏_{l=1}^{N} P(tl | θT) P(el | θE) P(nl | θN)
                = { ∏_{l=1}^{N} P(tl | θT) } { ∏_{l=1}^{N} P(el | θE) } { ∏_{l=1}^{N} P(nl | θN) }.
The likelihood can be maximised by maximising each term in the braces independently, and so is decomposable. However, this decomposition only applies if there are no missing or hidden values [1].
Another way of considering the maximum likelihood approach is to see it as finding the structure S whose likelihood over the parameters is greatest, that is,

    Ŝ = argmax_S max_θ P(x | θ, S)
      = argmax_S L(θ̂ | S).
The maximum likelihood approach is best for medium sized problems, as difficulties may arise when there is no data at certain configurations or when there is too much data. For example, if we need to estimate P(E = e | T = t) for some states e of E and t of T, and there are no observations with T = t, then the maximum likelihood estimate is undefined.
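A sketch of maximum likelihood estimation of a single conditional probability table from complete data; parent configurations that never occur are simply absent from the result, illustrating the undefined estimates mentioned above.

    from collections import Counter, defaultdict

    def ml_cpt(data, node, parents):
        """Maximum likelihood estimate of P(node | parents): each conditional
        probability is the relative frequency N_ijk / N_ij.  `data` is a list of
        dicts mapping variable name -> observed state."""
        counts = defaultdict(Counter)
        for row in data:
            counts[tuple(row[p] for p in parents)][row[node]] += 1
        return {pa: {k: n / sum(c.values()) for k, n in c.items()}
                for pa, c in counts.items()}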
5.3.3 Hypothesis Testing
Many classical statistical approaches involve hypothesis testing in some form or another. In this context a hypothesis test may involve the hypotheses

    H0: a link exists from node X to node Y
    Ha: no link exists from node X to node Y

or

    H0: S1 is no better than S2
    Ha: S1 is better than S2,

for some structures S1 and S2.
The likelihood ratio hypothesis test is based on the ratio of the likelihoods

    L(θ1 | x) / L(θ2 | x)

for two distinct values of the parameters θ1 and θ2. If we condition on the structure of the network, then this determines the parameters to be learnt. The generalised likelihood ratio test is based on the ratio

    max_θ L(θ | S2, x) / max_θ L(θ | S1, x),    (5.3)

or, equivalently, the difference in log likelihoods. Small values of (5.3) provide evidence against the hypothesis that S1 is no better than S2.
If S1 in the denominator of (5.3) is the complete structure¹, which will have the greatest likelihood, then we can say that, for all S2, S2 is nested in S1. That is, all independencies
implied by S2 can be represented by S1. The deviance is defined as twice the logarithm of this ratio.

¹ In this context a complete network is the saturated model.
The edge exclusion deviance for testing whether a single edge can be excluded from the saturated model is given in [36] to be

    −N log(1 − corrN(Xi, Xj | X\{Xi, Xj})²),

where corrN(Xi, Xj | X\{Xi, Xj}) is the sample correlation coefficient of Xi and Xj given the remaining variables. This deviance has an asymptotic chi-square distribution with one degree of freedom. Hence, when n is small, we can compute the p-values for all n(n−1)/2 edge exclusion deviances from the complete network structure and drop any non-significant edges. This procedure can be supplemented by an iterative procedure which further evaluates the edge inclusion deviances of those edges left out of the model, to see if any should now be included. See [36] for details.
However, if this test were implemented at the 95% level of confidence, say, then we would expect that 5 out of 100 times we would delete a link where the true model would contain one. For a network with a reasonable number of nodes there are many links, and so repeatedly applying this test will result in a large number of errors. Bonferroni [12] suggested that, for an overall confidence level of α (referred to as the family-wise error rate), the significance level of each test should be set to α/t, where t is the number of tests to be performed. However, such a stringent rejection criterion can lead to tests with low power, that is, we include links which could be deleted, hence increasing complexity unnecessarily.
Procedures for excluding edges from the complete graph are based on the theory of variable subset regression, where subsets of the maximal model are chosen and compared in a systematic fashion in order to choose one best subset. In this context a subset is a set of links, and the best subset is that with as few members as possible which provides an adequate fit to the data. [36] derives the relationship between the deviance for testing whether the independencies implied by a Bayesian Network structure hold, and the F-statistic for testing that a subset of the regression coefficients is zero.
5.3.4 Resampling
One technique that has been developed over the last twenty years, as computational power has increased, is resampling. This term generally covers any iterative procedure in which a sample is drawn from the data, or from a distribution defined by the data, at every stage. The Gibbs Sampler, first introduced in Section 3.3.3.4, is often used for learning Bayesian Networks in the case of missing data and is discussed in this context in Section 5.5.5.1.
The bootstrap is another resampling technique and involves repeatedly drawing samples from the original sample. It can be used to estimate the sampling distribution of any statistic. An application of the bootstrap to learning Bayesian Networks is given in Section 5.6.
5.3.5 Bayesian Methods
The main factor that differentiates Bayesian from non-Bayesian methods is the specification of a prior distribution, P(θ), for the parameters θ. This is possible because, unlike in the usual classical approach where the parameter to be estimated is considered fixed, the Bayesian philosophy is to consider the parameter as a random variable. The prior distribution expresses all the information we have about θ before having observed the data. We then combine this prior knowledge with data believed to have been generated from the underlying distribution to calculate the posterior distribution, the distribution of the parameters after having considered the information available from the data. The maximum a posteriori (MAP) approach is to select as the best model that with the largest posterior probability.
Given that we observe some data D, the posterior probability is given by Bayes' rule as

    P(θ|D) = P(D|θ) P(θ) / P(D)
           ∝ P(D|θ) P(θ).    (5.4)

Hence we can compare two models with parameters θ1 and θ2 by considering the posterior odds,

    P(D|θ1) P(θ1) / (P(D|θ2) P(θ2)),    (5.5)

the product of the likelihood ratio, P(D|θ1)/P(D|θ2), and the prior odds, P(θ1)/P(θ2). Hence it may be that we have two models defined by θ1 and θ2 such that P(θ1) > P(θ2) but, because the data is more likely under the model defined by θ2, the posterior probability of θ2 is greater than that of θ1.
If θ represents the vector of parameters (θ1, ..., θn), where θi specifies the conditional probability table for node Xi, and the θi are independent given some network structure, then

    P(θ|D) = ∏_{i=1}^{n} P(θi|D)
           ∝ ∏_{i=1}^{n} P(D|θi) P(θi).

Hence the posterior probability for the network is decomposable, and so can be maximised by finding θi to maximise P(θi|D) for each individual node.
In general, a prior distribution is needed for both the structure and the conditional probabilities. A prior distribution for structure assigns some probability to every possible structure over the variables, though often many structures are assigned a prior probability of zero, or the structure is assumed known. Although prior distributions provide a natural way to include domain or expert knowledge, their specification can also be mathematically complex. As model selection may be sensitive to this specification, a poorly chosen prior distribution can make a Bayesian method perform poorly against alternative techniques [1], or result in inference based on spurious assumptions. Prior distributions can be classed as informative or non-informative. Informative prior distributions are used when it is believed some parameter values are more likely than others. Non-informative priors are chosen if there is no prior evidence to suggest that one particular structure or parameter specification would be any more likely than another. The choice of prior distributions for Bayesian Networks is discussed in detail in Section 5.5.4.
The Bayesian approach can be useful when dealing with sparse data. In this case a prior distribution can be defined for the parameters with all values having a positive probability, in order to avoid undefined values.
We have already seen some of these ideas in the context of information propagation in Chapter 3. In that context the prior distribution is the distribution over the states of the variables before we have received evidence. We then observe some evidence e and, through the propagation algorithm, update the distribution at each node to the posterior distribution P(X|e). At those leaf nodes where data was not observed we assigned the uninformative uniform prior distribution.
This is also an example of sequential updating. At each stage of the algorithm a node updates its belief given the information it receives, by computing the posterior distribution. This posterior distribution is then used as the prior distribution at the next stage when new messages are received, until after several iterations all the information has been taken into account.
5.4 The Information Theoretic Approach
One of the constraints often imposed on the structure when learning Bayesian Networks is that each node can have at most k parents, for some integer k > 0. In 1968, in what was one of the first attempts to learn the structure of a probabilistic network, Chow and Liu [3] presented a method to find the optimal network to approximate a probability distribution given that each node can have at most one parent, that is, the network must
be a tree. A similar method for learning a polytree structure from data was explored by Pearl [30], though in this case an optimal approximation can be found only if the generating distribution can be represented as a polytree. Here we discuss Chow and Liu's method, and then look at extensions to this approach.
5.4.1 Chow and Liu Trees
Chow and Liu were motivated by the large storage requirements of the information necessary to calculate P(X). As has been discussed, networks which capture the independencies amongst the variables make for more efficient storage, and calculation of probabilities can be carried out more efficiently by use of the chain rule. A further advantage of enforcing a tree structure is that, for a given set of data, the ratio of data to the number of unknown parameters is greater than for more complicated structures, and the probability of encountering missing data is reduced.
Although a tree structure results in more efficient storage of a distribution, the structural constraints may correspond to assuming independencies which do not hold. Hence, for distributions that cannot be represented by a tree, enforcing a tree structure results in an approximation. It is desirable to make this approximation as good as possible.
Chow and Liu wished to find the spanning tree² whose resulting distribution P^t(X) most closely matched the distribution from which the data was generated, P(X). They chose the Kullback-Leibler divergence as their measure of optimality. The Kullback-Leibler divergence between the two distributions P and P^t can be expanded as follows:

    distK(P, P^t) = ∑_x P(x)(log P(x) − log P^t(x))
                  = ∑_x P(x) log P(x) − ∑_x P(x) log[ ∏_{i=1}^{n} P^t(xi|pa(Xi)) ]
                  = −H(X) − ∑_x P(x) [ ∑_{i=1}^{n} log P^t(xi|pa(Xi)) ]    (5.1)
                  = −H(X) − ∑_{i=1}^{n} ∑_x P(x) log P^t(xi|pa(Xi)),

where H(X) is the entropy of X as defined in Section 4.1.
² A spanning tree of a set of nodes X is a tree which connects all nodes. If |X| = n, a tree is a spanning tree if and only if the number of links is equal to n − 1.
As P^t(xi|pa(Xi)) is constant for all configurations of the variables in X\{Xi, Pa(Xi)},

    distK(P, P^t) = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} log P^t(xi|pa(Xi)) [ ∑_x P(x) I({Xi ∪ Pa(Xi)}[x] = (xi, pa(Xi))) ]
                  = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} log P^t(xi|pa(Xi)) P(xi, pa(Xi))    (5.2)
                  = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} P(xi|pa(Xi)) P(pa(Xi)) log P^t(xi|pa(Xi))
                  = −H(X) − ∑_{i=1}^{n} ∑_{pa(Xi)} P(pa(Xi)) ∑_{xi} P(xi|pa(Xi)) log P^t(xi|pa(Xi)),    (5.3)

where I(·) denotes the indicator function.
The final summation in (5.3) is of the form ∑_x f(x) log g(x). Using Lagrange multipliers to find the value of g which maximises this expression, subject to the constraint ∑_{j=1}^{r} g(xj) = 1, gives

    ( f(x1)/g(x1), ..., f(xr)/g(xr) )^T = λ (1, ..., 1)^T,

that is, f(xj) = λ g(xj) for all j. Then

    ∑_{j=1}^{r} g(xj) = 1  ⟹  λ = ∑_{j=1}^{r} f(xj) = 1,

so g(xj) = f(xj) for all j. That is, ∑_{xi} P(xi|pa(Xi)) log P^t(xi|pa(Xi)) is maximised, and hence (5.3) minimised, when

    P^t(Xi|Pa(Xi)) = P(Xi|Pa(Xi)) for all Xi.    (5.4)
This implies that the best approximation based on the Kullback-Leibler divergence is realised when the conditional probabilities for the tree are the same as those obtained from P.
This is a somewhat surprising result. Recall that, as the tree structure is likely to be a simplification of the original structure, the parent set of a node Xi in the tree is unlikely to be the same as in the original network. Hence we might imagine that, to obtain the best approximation of P, the conditional probability distributions which define P^t may in some way need to compensate for the structural changes. Instead we find that the
required probabilities for P^t can be calculated directly from P, and consequently, if the local structure of a node is the same in P^t as it is in P, that is Pa^t(Xi) = Pa(Xi), then the conditional probability table at Xi does not change.
On substitution of the optimal probabilities defined by (5.4) into (5.2), we obtain

    distK(P, P^t) = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} P(xi, pa(Xi)) log P(xi|pa(Xi)).    (5.5)
Using the relation

    log P(xi|pa(Xi)) = log [ P(xi, pa(Xi)) / P(pa(Xi)) ] − log P(xi) + log P(xi)
                     = log [ P(xi, pa(Xi)) / (P(xi) P(pa(Xi))) ] + log P(xi)    (5.6)
in (5.5) gives

    distK(P, P^t) = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} P(xi, pa(Xi)) log [ P(xi, pa(Xi)) / (P(xi) P(pa(Xi))) ]
                          − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} P(xi, pa(Xi)) log P(xi)
                  = −H(X) − ∑_{i=1}^{n} I(Xi; Pa(Xi)) − ∑_{i=1}^{n} ∑_{xi} P(xi) log P(xi).

As the first and third terms are constant over all structures, minimising distK(P, P^t) is equivalent to maximising ∑_{i=1}^{n} I(Xi; Pa(Xi)). Hence, if we define the weight on any branch between Xi and its parent to be I(Xi; Pa(Xi)), the optimal spanning tree is then the maximum weight spanning tree.
Any algorithm for finding a maximum weight spanning tree can be used to find the optimal tree approximation. Using Kruskal's algorithm [5] this is done as follows:
1. Compute the distributions P(Xi, Xj) for all pairs Xi, Xj.
2. Compute all weights I(Xi; Xj) and order according to magnitude.
3. Take the branch of largest weight not already in the tree and add it to the tree unless it forms a loop, in which case discard it.
4. Repeat step 3 until a spanning tree is formed, that is, there are n − 1 branches in the tree.
This result was first proved when the probability distribution P is known. However, it is shown in [3] that by using the sample frequencies as estimates of P(Xi, Xj)³ to compute weights Î(Xi; Pa(Xi)), the optimal tree distribution is also the maximum likelihood distribution and the corresponding consistency properties hold. That is, if the data is generated from an underlying distribution with a tree representation, as the size of the data set increases the resulting distribution will converge to the true underlying distribution.
Note that here we are minimising distance over all nodes in the model and not just the hypothesis nodes, as was the case in Section 4.3. It may be the case that the distribution defined by the optimal tree is not an adequate approximation to the original distribution for every purpose. In 2000, Williamson [37] looked at extending the ideas of Chow and Liu to show, in general, that maximising the mutual information weight gives the best approximation of a probability distribution.
5.4.2 More general networks
Another constraint, and an obvious extension to Chow and Liu's work, is to specify that the graph is singly connected, that is, a polytree.
In [30], Pearl presents a method for constructing the optimal polytree from data, given that the underlying probability distribution has a polytree representation. This procedure has several short-falls:
1. It is only valid if we know the underlying distribution can be represented as a polytree.
2. When forming the network from data, the procedure is not consistent in the sense of the Chow and Liu algorithm.
3. It returns only a partially directed network.
Although the first two points are characteristics of the method, the third point is due to the occurrence of equivalent structures, as discussed in Section 4.2. Note that the Chow and Liu algorithm returns a tree with undirected links. This is because, as pointed out previously, the relations X → Y → Z, X ← Y ← Z and X ← Y → Z are indistinguishable with regard to independence structure. In a tree there are no converging connections, at which we would be able to determine the direction of the links. As this latter form of connection does exist in polytrees, there are some links able to be directed, but not all. In general, if a user desires a directed network, the remaining arrows must be assigned manually, for example to represent notions of causation.
³ The sample frequency estimate of P(xi, xj) is given by the proportion of observations in which Xi = xi and Xj = xj.
Williamson [37] considers a more general approach, where the constraint on structure is now generalised to specifying that no node has more than k parents. He considers searching for the spanning network of maximum weight, where the weights are defined as previously. The justification that a maximum weight network will be a good approximation to the given probability distribution follows from the derivations of Chow and Liu.
To obtain a measure of the distance between the underlying distribution and the approximate distribution P' defined by the network, we apply the results from (5.4) and (5.6) in (5.1) to obtain

    distK(P, P') = −H(X) − ∑_x P(x) ∑_{i=1}^{n} log P'(xi|pa(Xi))
                 = −H(X) − ∑_x P(x) ∑_{i=1}^{n} log [ P(xi, pa(Xi)) / (P(xi) P(pa(Xi))) ] − ∑_x P(x) ∑_{i=1}^{n} log P(xi).    (5.7)
Applying the result

    ∑_x P(x) ∑_{i=1}^{n} log P(xi) = ∑_{i=1}^{n} [ ∑_x P(x) log P(xi) ]
                                   = ∑_{i=1}^{n} ∑_{xi} log P(xi) [ ∑_x P(x) I({Xi}[x] = xi) ]
                                   = ∑_{i=1}^{n} ∑_{xi} P(xi) log P(xi)

in (5.7), we get
    distK(P, P') = −H(X) − ∑_{i=1}^{n} ∑_{xi, pa(Xi)} P(xi, pa(Xi)) log [ P(xi, pa(Xi)) / (P(xi) P(pa(Xi))) ] − ∑_{i=1}^{n} ∑_{xi} P(xi) log P(xi)
                 = −H(X) − ∑_{i=1}^{n} I(Xi; Pa(Xi)) + ∑_{i=1}^{n} H(Xi).

As the first and third terms are again independent of the choice of network, we see that in general the distance will be minimised when ∑ I(Xi; Pa(Xi)) is maximised, that is, when the allocation of parents is such that each node receives maximum information from its parents. Note that the weight here is attached to nodes rather than to arcs, and the weights change with the allocation of parents. The procedure Williamson adopts is to
attach a weight to each arc from parent Vl to child Xi of the form I(Xi; Vl | V(l)), where V(l) = Pa(Xi)\{Vl}. This is then used as the basis for a greedy search method which at each stage adds the arc of maximum weight which can be included without violating the constraints or adding cycles. This procedure stops when no more arcs can be added. As the weights change at each stage, the final network will not necessarily be the maximum weight network, though the examples given in [37] show the algorithm often returns close to optimal results.
5.5 The Bayesian Approach
When learning a Bayesian Network it is obviously desirable to base our choice of network on as much information as possible. We saw in Section 2.3 how networks can be formed on expert or causal knowledge alone, and in the previous section we showed how an information theoretic approach can be used when the only information utilised is that from the data. In Section 4.3 prior knowledge was used when enforcing independence relations which are suspected to be true, in order to constrain the search space. In general, if we possess prior knowledge and data it is advantageous to use both. Bayesian methods provide a solid framework for combining prior knowledge with data, and a well developed approach to model selection.
Initially, in Sections 5.5.2 and 5.5.3, we assume our data is complete. In Section 5.5.2 we consider the `simplest' case, learning the parameters of a fixed structure, and then in Section 5.5.3 we discuss methods for learning both the structure and the parameters. Throughout these sections it will be assumed that all necessary prior distributions have been specified, and in Section 5.5.4 we look at how this can be achieved. Specifically, we show that under several assumptions all necessary prior distributions can be obtained from the specification of a prior network and one other assessment. Section 5.5.5 discusses methods for dealing with incomplete data. As this case can be computationally demanding, we introduce a large sample Gaussian approximation, and by a further approximation the more efficient Bayesian Information Criterion for model selection.
5.5.1 Notation
Our objective is to learn the structure S and corresponding parameters of a Bayesian Network, given that we have observed a random sample of N cases (observations) which are stored in a database D. We assume there are n variables, X1, ..., Xn, with each variable Xi having ri states. We use Nijk to denote the number of observations in D for which Xi is in state k and its parents are in configuration j.
For a structure S the parameters are denoted θS ∈ ΘS, where θS = (θ1, ..., θn) and θi
contains the probabilities associated with node Xi. In general, we require the probability that Xi is in state k given that its parents are in configuration j, for all i, j and k. This is denoted by

    θijk = P(xi^k | pa(xi)^j),

and we will also refer to the vector

    θij = ( P(xi^k | pa(xi)^j); k = 1, ..., ri ),

and the matrices

    θi = ( P(xi^k | pa(xi)^j); k = 1, ..., ri, j = 1, ..., qi ),

where qi is the number of configurations of the parents of Xi.
5.5.2 Known Structure

Here it is assumed that the structure of the network is fixed. We encode our prior uncertainty about θ_S in the expression P(θ_S | S) for some hypothesised structure S, and must then compute the posterior probability P(θ_S | D, S). The maximum a posteriori approach specifies that we select the θ_S which maximises this expression.

Consider first the observations from a single variable X_i. If we assume that the process generating the data is constant over time, then for any given observation the probability that X_i will be in state k can be given by a scalar f_{i=k} = P(X_i = k). Further, the parameters f_i = (f_{i=1}, f_{i=2}, ..., f_{i=r_i}) render the observations independent. A sequence which satisfies these conditions is known as an (r_i − 1)-dimensional multinomial sample with parameters f_i [20].

We now have, from equation (5.4),

P(f_i | D) = c P(D | f_i) P(f_i) = c P(f_i) ∏_{k=1}^{r_i} f_{i=k}^{N_ik},   (5.1)

where N_ik is the number of observations for which X_i was in state k, and c is a normalising constant.
If we assume the prior probability of the parameters follows a Dirichlet distribution, then by definition

P(f_i) = [Γ(∑_{k=1}^{r_i} N'_ik) / ∏_{k=1}^{r_i} Γ(N'_ik)] ∏_{k=1}^{r_i} f_{i=k}^{N'_ik − 1},   N'_ik > 0,   (5.2)

where Γ(·) is the gamma function. (This assumption is justified because, under the assumptions of parameter independence and likelihood equivalence introduced later in this section, a Dirichlet distribution on the network parameters is inevitable; see [20] or [14] for a derivation.) The N'_ik are user-defined parameters, the evaluation of which is discussed briefly in Section 5.5.4, and in more detail in [20].
Hence, from (5.1), the posterior distribution after having observed a multinomial sample, for a single variable with a Dirichlet prior distribution, is given by

P(f_i | D) = c [Γ(∑_{k=1}^{r_i} N'_ik) / ∏_{k=1}^{r_i} Γ(N'_ik)] ∏_{k=1}^{r_i} f_{i=k}^{N'_ik − 1} ∏_{k=1}^{r_i} f_{i=k}^{N_ik}
           = c' ∏_{k=1}^{r_i} f_{i=k}^{N_ik + N'_ik − 1}.

Note that, given a Dirichlet prior distribution for f_i, the posterior distribution for f_i also has a Dirichlet distribution, with parameters N_ik + N'_ik, k = 1, ..., r_i. We say that the Dirichlet distribution is conjugate under multinomial sampling. As the posterior distribution is of the same functional form as the prior distribution, this greatly simplifies the mathematics of computing the posterior distribution, and in general conjugate distributions are useful for sequential updating.
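A minimal sketch in Python of this conjugate update for a single variable, under the assumption that the prior pseudo-counts N'_ik have already been chosen: the posterior Dirichlet parameters are simply N'_ik + N_ik, and the posterior means follow by normalisation.

def dirichlet_update(prior_counts, observed_counts):
    """Posterior Dirichlet parameters after a multinomial sample: N'_ik + N_ik."""
    return [n0 + n for n0, n in zip(prior_counts, observed_counts)]

def posterior_mean(counts):
    """Posterior expectation of f_{i=k} under the Dirichlet with these parameters."""
    total = sum(counts)
    return [c / total for c in counts]

prior = [1.0, 1.0, 1.0]     # N'_ik for a three-state variable (a flat prior)
observed = [12, 3, 5]       # N_ik observed in the data base D
posterior = dirichlet_update(prior, observed)
print(posterior, posterior_mean(posterior))   # [13.0, 4.0, 6.0] and its normalisation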
We now introduce two more assumptions which we assume to hold for the remainder of this section. The first assumption is that of global parameter independence. This says that the parameters associated with a variable in a Bayesian Network are independent of the parameters associated with any other variable. Hence we have

P(θ_S | S) = ∏_{i=1}^{n} P(θ_i | S).

The second assumption is that of local parameter independence. This says that the parameters associated with a variable are independent for each configuration of the parents. That is,

P(θ_i | S) = ∏_{j=1}^{q_i} P(θ_ij | S),   i = 1, ..., n.

The validity of this assumption is more questionable than that of the first, though it has been shown to be reasonable for many problems [20].

The above two assumptions are together referred to as parameter independence. Under these assumptions

P(θ_S | S) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} P(θ_ij | S),

and Bayes' Rule, along with the assumptions of a multinomial sample and Dirichlet prior distribution, gives
P(θ_S | D, S) = c P(D | θ_S, S) ∏_{i=1}^{n} ∏_{j=1}^{q_i} P(θ_ij | S)
             = c {∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N_ijk}} ∏_{i=1}^{n} ∏_{j=1}^{q_i} P(θ_ij | S)
             = c' ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{N'_ijk + N_ijk − 1},   (5.3)

where the last line follows from the generalisation of (5.2). Given that we have specified a prior distribution on the parameters, and so have values for the N'_ijk, we can use (5.3) in conjunction with a search method to find the value for θ_S with the largest relative posterior probability.
5.5.3 Unknown Structure

We could use the above method for selecting the parameters of a Bayesian Network with fixed structure when the structure is unknown, by applying it to every possible structure and selecting the structure S and parameters θ_S which result in the largest value of P(θ_S | D, S). However, this approach does not utilise prior knowledge and, because of the large number of possible structures, is not computationally feasible. Prior knowledge about structure may be used to constrain the search space. Alternatively, taking the Bayesian approach, we specify a prior distribution on the structure and again look for some measure of posterior probability P(S | D) that we can use as a scoring function.

In taking this approach we begin with an application of the chain rule,

P(D | S) = ∏_{l=1}^{N} P(x_l | D_l, S),   (5.4)

where x_l is the l-th case in the data base and D_l denotes the first l − 1 cases. P(x_l | D_l, S) is hence the probability distribution of the l-th case given those observed so far, assuming the structure S. Conditioning on θ_S we have

P(x_l | D_l, S) = ∫ P(x_l | D_l, S, θ_S) P(θ_S | D_l, S) dθ_S.

Given that we know θ_S, the probability distribution of x_l does not depend on D_l. Hence the multinomial assumption gives us
P(x_l | D_l, S, θ_S) = ∏_{i=1}^{n} P(x_i^k | pa(X_i)^j) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{I},

where I = I(i, j, k, l) = 1 if x_i = k and pa(X_i) = j in x_l, and 0 otherwise. We then have
P(x_l | D_l, S) = ∫ {∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} θ_ijk^{I}} ∏_{i=1}^{n} ∏_{j=1}^{q_i} P(θ_ij | D_l, S) dθ_S
              = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∫ ∏_{k=1}^{r_i} θ_ijk^{I} P(θ_ij | D_l, S) dθ_ij
              = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} E(θ_ijk | D_l, S)^{I},   (5.5)

where E(θ_ijk | D_l, S) is the expected value of θ_ijk with respect to P(θ_ij | D_l, S). Hence, on substitution of (5.5) into (5.4), we have

P(D | S) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} ∏_{k=1}^{r_i} ∏_{l=1}^{N} E(θ_ijk | D_l, S)^{I}.   (5.6)
E(θ_ijk | D_l, S) is equal to the probability that X_i will be in state k, given that its parents are in configuration j, at the next observation x_l. For any j this is, intuitively, given by the ratio of the number of effective observations in which X_i was in state k and its parents in configuration j, to the total number of effective observations in which the parents of X_i were in configuration j. That is,

E(θ_ijk | D, S) = (N'_ijk + N_ijk) / (N'_ij + N_ij),   (5.7)

where N_ij = ∑_{k=1}^{r_i} N_ijk. We use the term `effective observations' to indicate that this ratio depends on the choice of the parameters N'_ijk, otherwise known as the effective sample size. From (5.3) it can be seen that the parameters of the prior distribution N'_ijk indicate that our prior knowledge is equivalent to having observed the sample N' prior to observing our present sample with counts N_ijk.

As I = 1 only if x_i = k and pa(X_i) = j in x_l, each time I = 1 in (5.6) for some k in the product over l, N_ijk will increase by 1. Hence substituting (5.7) into (5.6), using the fact that Γ(α) = (α − 1)! for α = 1, 2, ..., and simplifying, results in an expression for P(D | S). We hence obtain the Bayesian Dirichlet (BD) metric
P(D, S) = P(S) P(D | S)
        = P(S) ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N'_ij) / Γ(N'_ij + N_ij)] ∏_{k=1}^{r_i} [Γ(N'_ijk + N_ijk) / Γ(N'_ijk)].   (5.8)

Given a prior distribution for the structure, we can maximise (5.8) by finding for each variable the parent set that maximises the second product of this expansion. Hence for each variable X_i, i = 1, ..., n, we utilise a search procedure as discussed in Section 5.3.1 to find the set of parents which maximises this score. This results in the network with greatest posterior probability. We can then use (5.3) to assign the network parameters.
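A sketch in Python of the node-wise contribution to (5.8), assuming the counts are held as nested lists indexed by parent configuration j and state k; working with log-gamma values avoids overflow, and the structure prior P(S) is left out and would be added separately.

from math import lgamma

def log_bd_node_score(n_prime, n):
    """Log of the contribution of one node to (5.8):
    sum over parent configurations j of
      lnGamma(N'_ij) - lnGamma(N'_ij + N_ij)
      + sum over states k of lnGamma(N'_ijk + N_ijk) - lnGamma(N'_ijk),
    where n_prime[j][k] = N'_ijk and n[j][k] = N_ijk."""
    score = 0.0
    for np_j, n_j in zip(n_prime, n):
        score += lgamma(sum(np_j)) - lgamma(sum(np_j) + sum(n_j))
        for npk, nk in zip(np_j, n_j):
            score += lgamma(npk + nk) - lgamma(npk)
    return score

# A binary node with one binary parent (two parent configurations):
n_prime = [[0.5, 0.5], [0.5, 0.5]]   # prior pseudo-counts N'_ijk
n = [[8, 2], [1, 9]]                 # observed counts N_ijk
print(log_bd_node_score(n_prime, n))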
5.5.3.1 The BD metric and Network Equivalence

Recall that two Bayesian Network structures S1 and S2 are independence equivalent if they encode the same independence assumptions. Independence equivalence is an equivalence relation and induces a set of equivalence classes over the possible structures for the variables in U [19]. Two structures are distribution equivalent if every joint probability distribution encoded by one structure can be encoded by the other and vice versa. In this case, if the networks are acausal it does not make sense to differentiate between the two structures, and so hypothesising the structure S1 is equivalent to hypothesising the structure S2. This is referred to by Heckerman, Geiger and Chickering [20] as hypothesis equivalence. Given this property we would also expect equivalent structures S1 and S2 to satisfy likelihood equivalence, that is P(D | S1) = P(D | S2), and score equivalence, that is P(D, S1) = P(D, S2). However, for causal networks we cannot validly assume hypothesis equivalence, since a hypothesised structure includes the hypothesis that a node's parents are its direct causes.

The above BD metric does not satisfy the assumption of likelihood equivalence. Heckerman et al. [20] derive a metric they call the BDe metric which does satisfy likelihood equivalence and which simplifies the construction of a prior distribution for the parameters. The form of the BDe metric is identical to (5.8) except that specification of the N' is subject to certain constraints. The details are given in [20] and will be discussed further in the following section.
5.5.4 Prior Distributions

For the discussion in Sections 5.5.2 and 5.5.3 we assumed that we had specified prior distributions both for the parameters of the network, P(θ_S), and the network structures, P(S). In this section we show how we can derive all prior distributions from the formation of a prior network and an assessment of the equivalent sample size N'.
5.5.4.1 The Prior Network
The assumptions of parameter independence and likelihood equivalence constrain the parameters of a complete network structure to have a Dirichlet distribution [20], where the parameters of this distribution must satisfy

N'_ijk = N' P(x_i^k, pa(x_i)^j | S_c),   (5.9)

where S_c denotes a complete network structure and N' is the equivalent sample size. In specifying the N'_ijk for our prior distribution, we should use as much of our prior knowledge as possible. This prior knowledge can be represented compactly in what is called a prior network. A prior network is a Bayesian Network which the user creates for the domain based on their knowledge. This is done essentially as in Section 2.3, where it was discussed how Bayesian Networks could be formed based on expert knowledge alone. The difference is that, in this section, we then combine data with the prior knowledge encoded in the prior network to obtain the posterior network. In our prior network we specify both the structure, which indicates our beliefs in the independence relations between the variables, and the parameters.
However, to specify our prior distribution P(θ_S) we are required, as in (5.9), to condition on a complete network. Given any structure S, we can form a complete network structure S_c from S, which encodes the same assumptions of independence, by adding in dummy links to complete the network. Dummy links change the structure of the network but not the underlying distribution. To see how this is done, consider the prior network (which is not complete) on binary variables X1, X2 and X3 shown in Figure 5.2 a).

Figure 5.2: Diagram illustrating the addition of dummy links to the network in a) to form the complete network b).

The conditional probabilities are specified as
P(X3) = (0.5, 0.5), P(X2) = (0.4, 0.6) and

P(X1 | X2):
  X2 = 0: (0.3, 0.7)
  X2 = 1: (0.2, 0.8)

To make this into a complete network we need to add 2 dummy links. For example, we could add the links X3 → X1 and X2 → X3 as in Figure 5.2 b). The only constraint on this stage of the process is that we ensure the network remains acyclic. The corresponding conditional probabilities are then

P(X1 | X2, X3):
  (X2, X3) = (0, 0): (0.3, 0.7)
  (X2, X3) = (0, 1): (0.3, 0.7)
  (X2, X3) = (1, 0): (0.2, 0.8)
  (X2, X3) = (1, 1): (0.2, 0.8)

P(X3 | X2):
  X2 = 0: (0.5, 0.5)
  X2 = 1: (0.5, 0.5)

P(X2) = (0.4, 0.6).
Hence the variable X3 remains independent of X1 and X2. Note that a complete network formed by the addition of dummy links is not a Probabilistic Bayesian Network, as it is not a minimal I-map.

Given that we have a complete network corresponding to our prior network, by using standard Bayesian Network inference we can compute the probabilities required in (5.9). The choice of a suitable value for N' then allows us to compute the parameters N'_ijk and hence our prior distribution P(θ_{S_c}) for the parameters of our `completed' prior network.
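To make (5.9) concrete for the completed network above, a small sketch in Python that enumerates the joint distribution by the chain rule and scales it by an equivalent sample size; the value N' = 10 is an arbitrary illustrative choice.

from itertools import product

# Conditional probability tables of the completed prior network (Figure 5.2 b).
P_X2 = {0: 0.4, 1: 0.6}
P_X3_given_X2 = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
P_X1_given_X2_X3 = {(0, 0): {0: 0.3, 1: 0.7}, (0, 1): {0: 0.3, 1: 0.7},
                    (1, 0): {0: 0.2, 1: 0.8}, (1, 1): {0: 0.2, 1: 0.8}}

def joint(x1, x2, x3):
    """P(x1, x2, x3) for the completed network, via the chain rule."""
    return P_X2[x2] * P_X3_given_X2[x2][x3] * P_X1_given_X2_X3[(x2, x3)][x1]

N_equiv = 10.0   # equivalent sample size N' (a hypothetical choice)
# N'_ijk for node X1, whose parents in the completed network are (X2, X3):
n_prime = {(j, k): N_equiv * joint(k, *j)
           for j in product([0, 1], repeat=2) for k in [0, 1]}
print(n_prime)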
5.5.4.2 Prior distributions for the network parameters
In addition to assuming parameter independence and likelihood equivalence, we will make two additional assumptions which greatly simplify the specification of a prior distribution. The first of these, likelihood modularity, says that given any structure S,

P(x_i | pa(x_i), θ_i, S) = P(x_i | pa(x_i), θ_i)

for all X_i. That is, the probabilities at X_i depend only on its parent set, and not on the remaining structure of the network.

The second assumption, prior modularity, says that given any two structures S1 and S2 such that X_i has the same parents in S1 and S2,

P(θ_i | S1) = P(θ_i | S2).

These two assumptions formalise the notion that the likelihood of X_i and the parameters at X_i depend only on the structure local to X_i. That is, if X_i has the same parents in two different network structures, these values will be the same.
The issue of structure equivalence now arises. If two structures are distribution equivalent, the two hypotheses S1 and S2 should satisfy prior equivalence, that is, the prior probabilities of equivalent structures should be equal. Instead of pretending such cases do not exist and associating a prior probability with every network structure, we associate each hypothesis with an equivalence class of structures. Therefore we are actually learning an equivalence class of structures, and not every structure is in the hypothesis space.

As a complete structure represents no assertions of conditional independence, all complete structures are independence equivalent. Given likelihood equivalence, we can compute P(D | θ_{S_c}, S_c) and P(θ_{S_c} | S_c) for any complete structure S_c from the likelihood and prior distribution for another complete structure. We do this by performing a change of variables from those in the joint likelihood specified by one network to the variables in the required joint likelihood.
Figure 5.3: All structures on two variables.
For example, consider the network structures over two binary variables X and Y, as shown in Figure 5.3. There are three possibilities: X and Y unconnected, X → Y, or X ← Y, the final two of which are equivalent. The density function for the joint parameters θ_xy, θ_x̄y and θ_xȳ, where θ_xy = P(X = x, Y = y), is given by P_1(θ_xy, θ_x̄y, θ_xȳ). Suppose that we want to obtain the parameters for the structure X → Y, P_2(θ_x, θ_{y|x}, θ_{y|x̄}). The inverse transformation from P_2 to P_1 is given by the relations

θ_xy = θ_x θ_{y|x},   θ_x̄y = (1 − θ_x) θ_{y|x̄},   θ_xȳ = θ_x (1 − θ_{y|x}),

and the Jacobian of the transformation is

        | θ_{y|x}    −θ_{y|x̄}   1 − θ_{y|x} |
    J = | θ_x         0          −θ_x       |  =  θ_x (1 − θ_x)  >  0,
        | 0           1 − θ_x    0          |

where the rows correspond to differentiation with respect to θ_x, θ_{y|x} and θ_{y|x̄}, and the columns to θ_xy, θ_x̄y and θ_xȳ. Hence we can obtain the required values

P_2(θ_x, θ_{y|x}, θ_{y|x̄}) = P_1(θ_x θ_{y|x}, (1 − θ_x) θ_{y|x̄}, θ_x (1 − θ_{y|x})) · θ_x (1 − θ_x).

In general we can use the relation

θ_{x_1,...,x_n} = ∏_{i=1}^{n} θ_{x_i | x_1,...,x_{i−1}},
and if S_c is any complete structure over the variables in the domain, the Jacobian for the transformation from the joint likelihood of this domain U to S_c is given by

J = ∏_{i=1}^{n−1} ∏_{x_1,...,x_i} (θ_{x_i | x_1,...,x_{i−1}})^{∏_{j=i+1}^{n} r_j − 1},

where r_j is the number of states of X_j. A derivation of this result is given in the appendix of [20].
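As a quick numerical check of the two-variable example above, a sketch in Python (assuming NumPy is available) that differentiates the inverse transformation numerically and compares the determinant of the Jacobian with θ_x(1 − θ_x); the parameter values are arbitrary.

import numpy as np

# Illustrative parameter values for the structure X -> Y.
theta_x, theta_y_x, theta_y_notx = 0.3, 0.8, 0.4

def to_joint(tx, tyx, tynx):
    """Inverse transformation to the joint parameters, as in the relations above."""
    return np.array([tx * tyx,            # theta_xy
                     (1 - tx) * tynx,     # theta_x-bar,y
                     tx * (1 - tyx)])     # theta_x,y-bar

# Numerical Jacobian by central differences.
eps = 1e-6
params = np.array([theta_x, theta_y_x, theta_y_notx])
J = np.zeros((3, 3))
for col in range(3):
    up, down = params.copy(), params.copy()
    up[col] += eps
    down[col] -= eps
    J[:, col] = (to_joint(*up) - to_joint(*down)) / (2 * eps)

# Both numbers should agree (approximately 0.21 here).
print(np.linalg.det(J), theta_x * (1 - theta_x))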
Given that we can now compute the prior distribution for any complete structure, under the assumptions of parameter modularity and likelihood independence we can construct the prior distribution P(θ_S | S) for any structure S. To do this, recall that we can express

P(θ_S | S) = ∏_{i=1}^{n} P(θ_i | S),

by the assumption of global independence. To determine the terms P(θ_i | S) in this expression we first find a complete network structure S_ci such that X_i has the same parents in both S and S_ci. We then use the procedure described above to obtain P(θ_{S_ci} | S_ci) from the parameters of our completed prior network P(θ_{S_c} | S_c). We can then use global independence to obtain P(θ_i | S_ci), which by the modularity assumptions is equal to P(θ_i | S).

Hence, under the above assumptions, given that we have specified a prior network we can compute the prior parameters for any other structure.
5.5.4.3 Prior distributions for network structure

The simplest prior distribution is often the uninformative uniform prior distribution, which assigns equal probability to every structure. However, as every structure is then considered equally likely, this makes no use of any prior knowledge we may have about the structure. This method can be refined through use of prior knowledge to enforce some constraints on the structure or node ordering. Those structures to be disallowed are assigned a prior probability of zero, and the remaining probability is then distributed uniformly over those structures which are allowable.

Another approach (attributed to Buntine, [1]) is to assign an ordering to the variables and a probability assessment of the presence or absence of each of the n(n − 1)/2 possible links, considered to be independent. In this way the prior probability of any structure under that ordering can be obtained.

A similar method can be used when a prior network has been defined. Heckerman et al. [20] propose a method which penalises a network according to how much it differs from the prior network, the structure of which is considered to be most likely. Here the difference between some structure S and that of the prior network S^p is measured in terms of the number of arcs by which they differ, denoted by δ, and S is penalised by a constant factor for each such arc. That is,

P(S) = c κ^δ,

where 0 < κ ≤ 1 is the user-defined penalty and c the normalisation constant. Note that in this case a network which is equivalent to S^p will not have the same prior probability, and in general it can be seen that this specification does not satisfy prior equivalence. Hence this method should not be used for acausal networks.
5.5.5 Incomplete Data

When we do not have a complete data set, that is, some of the x_li in D are not recorded, it is important to ascertain why this is so. If the absence of the observation is dependent on the state of the variable, then missing data should be handled differently to the case where the absence is independent of state, for example if the variable is hidden. An example of the former case is non-response in a survey, where subjects may choose not to respond to a question dealing with drug use if they are heavy users, for fear of reprimand or some other sensitive reason. In this section we assume all missing data is due to hidden variables or is otherwise independent of state.

Suppose there exists a single incomplete observation in our data base. If we let Y ⊆ U denote the observed variables and Z = {U \ Y} the unobserved variables, then the posterior distribution can be expressed as

P(θ_ij | y, S) = ∑_z P(θ_ij | z, y, S) P(z | y, S).   (5.10)

It turns out, under the Dirichlet assumption, that the posterior distribution (5.10) is a linear combination of Dirichlet distributions [18]. If we observe further incomplete cases, some or all of these Dirichlet distributions will themselves become linear combinations of Dirichlet distributions, and so the number of terms in the posterior will increase exponentially with the number of incomplete observations. Hence, in general, exact inference is intractable and approximations need to be made.
5.5.5.1 Gibbs Sampling
When we have missing data, the Gibbs sampler is often used to approximate the posterior distribution P(θ_S | D, S) by repeatedly sampling values for the missing data to form a complete data base. To carry out this procedure, each missing observation in D is randomly assigned a value. We then iterate through the unobserved cases and reassign the state of each by sampling from the probability distribution

P(x'_li | D_c \ x_li, S) = P(x'_li, D_c \ x_li | S) / ∑_{x'_li} P(x'_li, D_c \ x_li | S),

where X_i was unobserved in case l and D_c \ x_li is the current completed data base excluding x_li. Each time we have reassigned all missing values to obtain a new D_c, we compute P(θ_S | D_c, S) by the methods presented in Section 5.5.2. This procedure is then iterated G times, and the approximation is taken to be the average

P̂(θ_S | D, S) = (1/G) ∑_{g=1}^{G} P(θ_S | D_c^g, S),

where D_c^g is the completed data base from the g-th iteration. In the limit as G tends to infinity, this estimate will converge to the expected value of P(θ_S | D, S), though in practice convergence can be quite slow.
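A schematic sketch in Python of this Gibbs procedure, in which the predictive distribution for a missing cell and the posterior computation of Section 5.5.2 are assumed to be supplied as functions (cell_predictive and posterior below are hypothetical helpers, not part of any particular library).

import random

def gibbs_impute(data, missing_cells, cell_predictive, posterior, G, states):
    """Schematic Gibbs sampler for missing data.
    data          : list of dicts (one per case), with missing entries to be filled in
    missing_cells : list of (case_index, variable) pairs that were unobserved
    cell_predictive(data, l, var) : assumed helper returning a dict
                    state -> P(x_li = state | rest of the completed data base, S)
    posterior(data) : assumed helper returning the posterior quantity of interest
                    (e.g. the parameter estimates of Section 5.5.2) for completed data
    Returns the list of G posterior evaluations; their average is the approximation."""
    for l, var in missing_cells:                  # random initial assignment
        data[l][var] = random.choice(states[var])
    draws = []
    for _ in range(G):
        for l, var in missing_cells:              # resample each missing value in turn
            probs = cell_predictive(data, l, var)
            choices, weights = zip(*probs.items())
            data[l][var] = random.choices(choices, weights=weights)[0]
        draws.append(posterior(data))             # P(theta_S | D_c, S) for this completion
    return draws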
When the structure is unknown, Gibbs Sampling can be used to approximate P(D | S) by the expression

P(D | S) = P(D, θ_S | S) / P(θ_S | D, S) = P(θ_S | S) P(D | θ_S, S) / P(θ_S | D, S).

Given that a prior distribution has been specified for the parameters, we can compute the numerator using inference on the Bayesian Network defined by S and θ_S, and can compute the denominator using Gibbs sampling as above.

At each of the G iterations, it is necessary to form and then sample from a probability distribution for each missing value, and then compute P(θ_S | D_c, S). From (5.3) we see that this last step requires that a prior distribution P(θ_S) be specified. Additionally, to compute P(D | S) we are required to use inference in a Bayesian Network, which is NP-hard in the number of nodes.

In the next section we derive a large-sample Gaussian approximation for the distribution of the parameters, and then go on to show how this can simplify the calculations when we have large amounts of data.
5.5.5.2 Gaussian Approximation
Here we show how one can approximate the distribution P(θ_S | D, S) by a multivariate Gaussian distribution. A d-dimensional multivariate Gaussian distribution with mean vector μ of dimension d × 1 and variance matrix Σ of dimension d × d is denoted N_d(μ, Σ) and has a probability density function of the form

P(θ_S) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(θ_S − μ)^T Σ^{−1} (θ_S − μ) }.

As we are approximating θ_S, the dimension d is given by ∑_{i=1}^{n} q_i(r_i − 1), the sum over all nodes of the number of entries to be learnt for the conditional probability tables. We first assume that the structure S is fixed. From (5.4), the maximum a posteriori (MAP) configuration for P(θ_S | D, S) is that which maximises P(D | θ_S, S) P(θ_S | S), or equivalently

g(θ_S) = log(P(D | θ_S, S) P(θ_S | S)).
If we let θ̃_S be the MAP estimate, then a Taylor series approximation to g(θ_S) about θ̃_S, truncated after two terms, is

g(θ_S) ≈ g(θ̃_S) + (1/2)(θ_S − θ̃_S) H (θ_S − θ̃_S)^T,

where H is the Hessian of g(θ_S) evaluated at θ̃_S. Thus, with H' = −H,

exp{g(θ_S)} ≈ exp{g(θ̃_S)} exp{ −(1/2)(θ_S − θ̃_S) H' (θ_S − θ̃_S)^T }.

Hence

P(D | θ_S, S) P(θ_S | S) ≈ P(D | θ̃_S, S) P(θ̃_S | S) exp{ −(1/2)(θ_S − θ̃_S) H' (θ_S − θ̃_S)^T }
                         = c exp{ −(1/2)(θ_S − θ̃_S) H' (θ_S − θ̃_S)^T },   (5.11)

so that we can approximate P(θ_S | D, S) by the Gaussian distribution with mean θ̃_S and variance matrix (H')^{−1}.

Had we chosen to take the Taylor series expansion about some value other than θ̃, then this would be the mean of our Normal approximation. The MAP estimate θ̃ was chosen as it is an intuitively reasonable estimator.
To use this approximation we are required to evaluate θ̃_S and also H', which requires considerable computation. The Expectation-Maximisation algorithm, introduced later in this section, can be used to evaluate θ̃_S. Given the computations involved it seems there is little benefit in this approach over the Gibbs sampler. However, computation of the MAP estimate θ̃_S is more efficient than the Gibbs sampler when our data base is very large. It also allows for the development of an efficient approximation to the distribution P(D | S) when the structure is unknown, which we derive in the following section.
5.5.5.3 Laplace's Approximation
In Section 5.5.5.2 we assumed the structure was fixed. In the case of unknown structure we wish to approximate

P(D | S) = ∫ P(D, θ_S | S) dθ_S = ∫ P(D | θ_S, S) P(θ_S | S) dθ_S.   (5.12)

From (5.11) we have that P(D | θ_S, S) P(θ_S | S) ∝ N_d(θ̃_S, (H')^{−1}). Substituting this into (5.12) yields

P(D | S) ≈ ∫ P(D | θ̃_S, S) P(θ̃_S | S) exp{ −(1/2)(θ_S − θ̃_S) H' (θ_S − θ̃_S)^T } dθ_S
        = P(D | θ̃_S, S) P(θ̃_S | S) (2π)^{d/2} |H'|^{−1/2},

where we have used the fact that |(H')^{−1}| = |H'|^{−1}. Hence

log P(D | S) ≈ log P(D | θ̃_S, S) + log P(θ̃_S | S) + (d/2) log(2π) − (1/2) log|H'|.   (5.13)

This approximation, known as Laplace's approximation, is accurate to order 1/N and so can be very accurate for large samples [18]. Again the most computationally intensive stage of this approach is in calculating H' and θ̃_S.
For large samples the prior distribution has a relatively small influence on the posterior distribution, and so in this case θ̃ can be approximated by the maximum likelihood estimate θ̂. To obtain a more efficient (and less accurate) approximation to (5.13) we can drop the second and third terms to leave only those that increase with N, and substitute θ̂ for θ̃ and d log N for log|H'| (as |H'| increases with N^d). We then obtain what is known as the Bayesian Information Criterion (BIC),

log P(D | S) ≈ log P(D | θ̂_S, S) − (d/2) log N.

This is of the form given in (5.1) of a measure of fit plus a complexity penalty, where the penalty increases in proportion to the number of parameters to be estimated. As the BIC does not depend on a prior distribution, we can use this criterion without needing to assess one. However, this is because we have assumed that our sample is large enough to render any prior knowledge insignificant. If this is not the case the BIC should not be used.
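A minimal sketch in Python of the BIC for a fixed structure with complete data, assuming the maximum likelihood estimates are the ratios N_ijk / N_ij so that the log-likelihood can be accumulated directly from the counts; the nested-list layout is illustrative.

from math import log

def bic_score(counts, N):
    """BIC = log P(D | theta_hat, S) - (d/2) log N, with
    counts[i][j][k] = N_ijk for node i, parent configuration j, state k.
    The ML estimate used here is theta_hat_ijk = N_ijk / N_ij (complete data)."""
    log_lik, d = 0.0, 0
    for node in counts:
        for config in node:
            n_ij = sum(config)
            d += len(config) - 1                 # (r_i - 1) free parameters per configuration
            for n_ijk in config:
                if n_ijk > 0:
                    log_lik += n_ijk * log(n_ijk / n_ij)
    return log_lik - 0.5 * d * log(N)

# Two nodes: a root (one empty-parent configuration) and a child with a binary parent.
counts = [[[30, 70]], [[25, 5], [10, 60]]]
print(bic_score(counts, N=100))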
We now show how θ̃_S and θ̂_S can be calculated.
5.5.5.4 The Expectation-Maximisation Algorithm
In this case we wish to estimate maximum likelihood or MAP values when we have missing data. In general, the Expectation-Maximisation (E-M) algorithm can also be used to simplify difficult maximum likelihood problems.

To initialise the algorithm we assign values to the θ_ijk to obtain a configuration for θ_S. For the Expectation step we then compute the expected values of the counts N_ijk for a complete data set (the N_ijk are sufficient for the parameters θ_ijk). This is given by

E(N_ijk) = ∑_{l=1}^{N} P(x_i^k, pa(X_i)^j | y_l, θ_S, S),   (5.14)

where y_l denotes the possibly incomplete l-th case in D. Note that the expectation is with respect to the joint density for X conditioned on the assigned θ_S and the observed data D. Further, (5.14) can be evaluated using inference on the Bayesian Network specified by parameters θ_S and S and with evidence y_l.

In the Maximisation step we treat the E(N_ijk) as if they were actual values and find the configuration of θ_S that maximises P(θ_S | D_c, S), where D_c is the completed data set. As in (5.7) this is given by

θ̃_ijk = (N'_ijk + E(N_ijk)) / (N'_ij + ∑_{k=1}^{r_i} E(N_ijk)).

We then iterate these two steps. Under certain regularity conditions discussed in [26], the E-M algorithm will converge to a local maximum.
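A sketch in Python of one E-M iteration for the counts of a single node, assuming an inference routine infer(case, j, k) that returns P(x_i^k, pa(X_i)^j | y_l, θ_S, S) as in (5.14); in practice this routine would itself be a Bayesian Network propagation, so the helper and data layout here are illustrative.

def em_step(cases, infer, prior_counts, configs, states):
    """One E-M iteration for the parameters of a single node X_i.
    infer(case, j, k)    : assumed inference routine returning
                           P(X_i = k, Pa(X_i) = j | y_l, theta_S, S), as in (5.14)
    prior_counts[(j, k)] : the Dirichlet parameters N'_ijk
    configs, states      : the parent configurations j and the states k of X_i."""
    # Expectation step: expected counts E(N_ijk), equation (5.14).
    expected = {(j, k): sum(infer(case, j, k) for case in cases)
                for j in configs for k in states}
    # Maximisation step: MAP re-estimate, as in (5.7) with E(N_ijk) in place of N_ijk.
    theta = {}
    for j in configs:
        denom = sum(prior_counts[(j, k)] + expected[(j, k)] for k in states)
        for k in states:
            theta[(j, k)] = (prior_counts[(j, k)] + expected[(j, k)]) / denom
    return theta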
5.6 Confidence Measures on Structural Features
Most of the previous discussion has involved inducing networks with high scores. This is a global approach, in that the scoring function compares entire network structures. Edge exclusion deviances were also mentioned briefly in Section 5.3.3. This is a more local approach, where links are considered individually and included or discarded based on whether the data gave enough support to the hypothesis that the link exists. To do this, an asymptotic distribution is used, even though in practice we may not have enough data for this to be an adequate approximation. The approach we now consider falls somewhere between these two methods and is based on application of the bootstrap.

As we have a random sample of observations, any inference made about the underlying probability distribution and structure has some uncertainty associated with it, due to the chance that these observations are not representative. Because of this uncertainty, statisticians tend to associate with every estimate a measure of uncertainty or a confidence measure. For example, a 95% confidence interval for a point estimate, such as the expected value of a random variable, means that we can be 95% certain that the true value lies within this interval but, due to sampling variation (the variation between samples of size N drawn from the distribution), there is a small chance that the true value is outside this interval.

The sampling distribution is the probability distribution of the statistic of interest when calculated from a (random) sample of size N. In theory we could estimate the sampling distribution by repeatedly drawing samples of size N from the population and calculating the value of the statistic for each sample. However, given that in practice we have access to only the N observations in our original sample, we can estimate the sampling distribution by repeatedly drawing samples of size N, with replacement, from this sample. This is known as bootstrapping.

The bootstrap uses the empirical distribution function (that represented by the data) as an approximation to the actual distribution function and then, by resampling from this distribution, hopes to estimate the actual sampling distribution. Hence, given a data set D, we can calculate confidence measures for any characteristic, or feature, of that sample.

In the context of Bayesian Networks, the features to be estimated may be the existence of an edge between two variables or the Markov Blanket of a given node (recall that the Markov Blanket of a node X_i is the set of nodes which contains the parents of X_i, the children of X_i and all parents of the children of X_i).
In their paper `Data Analysis with Bayesian Networks: A Bootstrap Approach' [13], Friedman, Goldszmidt and Wyner propose a method for obtaining confidence measures on such features.

As the features we are interested in are characteristics of the network learnt from our data, each time we resample a learning algorithm must be run so that we can observe the induced network. Whatever the feature f of interest, we can assign f the value 1 if that feature is present in the induced structure and the value 0 otherwise. If we take many samples of size N and on each sample run the same learning algorithm, due to sampling variation we will induce differing structures. The value of interest is the probability that a feature f exists in the network induced from a sample of size N.

Using the bootstrap approach we can obtain an estimate of this value by the following algorithm:
For i = 1, 2, ..., m:
    Draw, with replacement, a sample of size N from the data D. Denote this sample by D_i.
    Induce a network structure Ŝ_i by applying some learning algorithm to D_i.
For every feature of interest f, define

P̂_N(f) = (1/m) ∑_{i=1}^{m} f(Ŝ_i),

where f(Ŝ_i) = 1 if f appears in Ŝ_i, and 0 otherwise.
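A minimal sketch in Python of this resampling loop, with the structure-learning step and the feature indicators passed in as functions; the dictionary representation of the induced network in the example feature is an assumption for illustration.

import random

def bootstrap_confidence(data, learn, features, m):
    """Estimate P_N(f) for each feature by the resampling algorithm above.
    learn(sample) : assumed structure-learning routine returning an induced network
    features      : list of functions mapping an induced network to 0 or 1."""
    N, hits = len(data), [0] * len(features)
    for _ in range(m):
        sample = [random.choice(data) for _ in range(N)]   # draw with replacement
        network = learn(sample)
        for idx, f in enumerate(features):
            hits[idx] += f(network)
    return [h / m for h in hits]

# Example feature, assuming the learner returns a dict with an "arcs" set:
# does the induced network contain the arc X -> Y?
edge_x_to_y = lambda net: 1 if ("X", "Y") in net.get("arcs", set()) else 0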
Friedman et al. tested this approach by generating data from known networks. They were interested primarily in three types of feature: existence of links, members of a node's Markov blanket, and partial ordering of variables, that is, whether one node is an ancestor of another. The induction procedure used the BDe metric with a uniform prior distribution and an equivalent sample size of 0.5. The search procedure was a greedy hill-climbing algorithm with random restarts. The authors were able to draw several conclusions:

1. The bootstrap samples are `cautious.' The number of false positives is generally small compared to the number of true positives and false negatives. Thus if the confidence in a feature is high it is likely to exist in the underlying network.

2. Establishing the Markov blanket of a node and partial orderings are more robust than features such as the existence of an edge.

As well as simply providing confidence measures on induced features, it is suggested that such an approach could be used to constrain the search space over network structures. By re-sampling to obtain confidence measures, those features with high confidence can be retained and other features assumed not to exist, hence narrowing the search. For example, if the confidence that X is an ancestor of Y is greater than some threshold (the authors used 0.8), the induced network must respect this order.

The results show that for small training sets slightly better networks can be found in this way. However, as the main computational aspect of this algorithm is the network induction at each iteration and the improvement in score using this approach is very small, it may be more efficient computationally to simply allocate more time to the search process. This remains to be shown.

Another interesting application of this approach may be in detecting hidden variables, that is, variables which are present in the underlying network but not included in the induced network. If we can find a group of variables which with high confidence are in each other's Markov blanket, but the edge relationships are unclear, this may be indicative of a hidden variable.
Chapter 6
Influence Diagrams for Decision Making

6.1 Decision Scenarios
In previous chapters we have looked at how to enter evidence and update the marginal distributions at the hypothesis nodes, prior to making a decision. In this chapter we look at decision making more formally. Namely, given a decision scenario, we wish to determine the optimal decision.

If we choose to take some action, this amounts to fixing the state of a variable. Note that this is fundamentally different to simply observing the state of a variable as in previous chapters. When we observe that a variable is in some state, it is in that state because of the influence of other variables in the network. When we make the decision to fix the state of a variable, the state of this variable is then independent of the states of the other variables.

However, when we make a decision, that is, set the state of some variable, such a choice normally alters the probability distribution of another set of variables. This is the consequence of that decision. The process of attaching a numerical value to a consequence or outcome is known as utility theory, and the actual set of values attached to the possible outcomes is called a utility measure. We may attach a utility measure to one or more variables to represent the desirability of the outcome at those variables. This is done by directing a link from each node with an outcome of interest to a utility node, U say. We then attach a utility to U for each configuration of the parents of U. The principle of Maximum Expected Utility is to choose as the optimal decision that which maximises the expected utility.

Consider the one-decision case. There is one decision node, D say, represented in Figure 6.1 by a square node. This has states corresponding to the possible decisions. The utility nodes U1, U2 and U3 are represented in the figure by diamond shaped nodes. The round nodes are, as before, intermediate random variables, which in this context we refer to as chance nodes.
Figure 6.1: A Bayesian Network with a decision node D and utility nodes U1, U2 and U3.
The expected utility at U2 is

EU(U2) = ∑_{x3} ∑_{x4} U2(x3, x4) P(x3, x4 | D, e),

where e represents any other information we have available. Similarly,

EU(U3) = ∑_{x4} ∑_{x5} ∑_{x6} U3(x4, x5, x6) P(x4, x5, x6 | D, e)

and

EU(U1) = U1(D).

Utilities can be used to represent all costs and rewards. For example, U1 may represent the cost of implementation of the decision, while U2 and U3 represent the desirability of the outcomes.

In general, given that we have k utility nodes, the expected utility of a decision D can be calculated as

EU(D | e) = ∑_{pa(U1)} U1(pa(U1)) P(pa(U1) | D, e) + ... + ∑_{pa(Uk)} Uk(pa(Uk)) P(pa(Uk) | D, e),

where again pa(Ui) represents a configuration of the variables in the set of parents of Ui, Pa(Ui).

We then choose the decision d* which results in the maximum expected utility, that is

d* = argmax_d EU(D = d | e).
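A small sketch in Python of the Maximum Expected Utility computation, assuming an inference helper that returns the posterior distribution over a utility node's parents given the decision and the evidence e; the data layout is illustrative.

def expected_utility(decision, evidence, utility_nodes, posterior):
    """EU(D = d | e): sum over utility nodes of
    sum_{pa(U)} U(pa(U)) P(pa(U) | D = d, e).
    posterior(parents, decision, evidence) is an assumed inference call returning
    a dict mapping each parent configuration to its probability; each utility node
    is a pair (parents, table) with table[configuration] = utility value."""
    total = 0.0
    for parents, table in utility_nodes:
        probs = posterior(parents, decision, evidence)
        total += sum(table[cfg] * p for cfg, p in probs.items())
    return total

def optimal_decision(decisions, evidence, utility_nodes, posterior):
    """Maximum Expected Utility: pick the decision d with the largest EU(D = d | e)."""
    return max(decisions,
               key=lambda d: expected_utility(d, evidence, utility_nodes, posterior))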
In general there can be many decision nodes, often with a temporal ordering which specifies that some decisions must be taken before others.

For example, consider the position of a young man who is at the ticket office to buy tickets for himself and his new girlfriend to attend a movie on their first date. He has forgotten his student card which would have entitled him to a 30% discount. At the ticket booth his first decision is which of two queues to join, and hence by which teller he will be served. The teller may or may not ask for his student ID. He must then decide to buy tickets for either the action or comedy movie. He places a value on how much his girlfriend enjoys the movie, though not quite as much value as he places on the cost of her enjoyment. This decision scenario is represented in Figure 6.2.
Figure 6.2: A network representing the two decisions which must be made in order to buy a movie ticket.
Here D1 represents the decision of which queue to join. The arc labels represent the decision made, one for each possible decision. The variable ID is a chance node which represents whether he was asked for ID. The utility U1 is defined as follows:

  ID       Yes    No
  Utility  -100   0

D2 represents the decision action or comedy, and E is a chance node representing the girl's enjoyment of the film. We define the utilities U2 as:

  Enjoyment  Low   Average  High
  Utility    -90   0        90
In addition, we require the conditional probabilities P(Enjoyment | action), P(Enjoyment | comedy), P(ID | queue 1), and P(ID | queue 2). A Bayesian Network may be used to determine the probabilities P(E | D2) and P(ID | D1), and may incorporate other variables, such as in Figure 6.3. Then

P(E | D2) = ∑_m ∑_a P(E | D2, m, a) P(m, a).

Figure 6.3: Here A represents the variable Attraction to boyfriend, M represents the variable Mood, and E represents the variable Enjoyment.
Given that all utilities and conditional probabilities have been defined, we can then use probability theory to calculate the expected utility of taking decision D1 and the expected utility of D2 given ID and D1 (for a description of how to calculate the optimal decision in a decision tree such as that in Figure 6.2, see for example [32]). We hence have the optimal decision sequence (d1, d2).

There are several points to note from this example:

1. Utilities are arbitrary, but utilities at each node should be defined on the same scale.

2. Bayesian Networks can be used to determine the conditional probability distributions.

3. Each decision results in the same sequence of decision-observation scenarios, no matter what decision was made.

A decision scenario which satisfies point 3 is called symmetric. A symmetric decision scenario can be represented by a chain of variables, as in Figure 6.4. Note that although ID has 2 children, this is still a chain of variables, as utility nodes do not represent variables.

Figure 6.4: A chain of variables representing the decision scenario in Figure 6.2.

The chain in Figure 6.4 allows us to ascertain the order of the decisions, namely that D1 must be made prior to D2, and also the observations which are made between decisions. In a network, to indicate graphically which observations have been made prior to a decision
we can add a link from the observed node to the next decision node. Such links are called Information Links, as they indicate the information we have available when making that decision. Similarly, to indicate the order of decisions we can add links which are directed from a decision node to the next decision node. Such links are known as Precedence Links.

Figure 6.5 represents the decision scenario of Figure 6.2 with information links and precedence links added. As the addition of precedence links means that the network is no longer a tree, we can include the variables Attraction and Mood from Figure 6.3 in our model explicitly. The resulting diagram is called an Influence Diagram and contains all the information in Figures 6.2, 6.3 and 6.4 in a much more compact form.

Figure 6.5: The decision scenario of Figure 6.2 represented as a network with added information and precedence links.
6.2 Influence Diagrams
An Influence Diagram is a directed acyclic graph over a set of decision nodes, chance nodes and utility nodes, with the following properties [22]:

1. There is a directed path encompassing all decision nodes.

2. The utility nodes are leaf nodes.

3. The decision nodes and chance nodes are discrete, with a finite number of mutually exclusive states.

4. Each chance node X_i is associated with a conditional probability table which specifies P(X_i | Pa(X_i)).

5. Each utility node U is associated with a real-valued function over pa(U).

Influence Diagrams were originally developed as a compact representation of symmetric decision scenarios. However, they are now considered an extension of Bayesian Networks to decision scenarios.

There are several restrictions on what is able to be represented by an Influence Diagram. Firstly, point 1 in the above definition specifies that there must be a directed path between decision nodes, and hence there must be some way of ordering the decisions. This is not always possible, and it may be the case that a better solution can be found if the decisions are considered in an alternate ordering. The second assumption implicit in an Influence Diagram is that of no-forgetting, that is, at each stage the decision maker has complete knowledge of what has gone before.

Influence Diagrams are very closely related to Bayesian Networks, and hence we can use a modified chain rule for Influence Diagrams. If we denote the set of chance nodes by X_C and the set of decision nodes by X_D, then

P(X_C = x_C | X_D = x_D) = ∏_{X ∈ X_C} P(X = {X}[x_C] | Pa(X) = Pa(X)[x_C, x_D]),

where the parents of X may contain both chance nodes and decision nodes.
6.3 Solution of Influence Diagrams
In general, we consider the task of solving an Influence Diagram or decision tree to be equivalent to calculation of the optimal decision sequence. As for Bayesian Networks, there are many methods for solving Influence Diagrams. Once again these methods make use of the d-separation relations in the graph, though d-separation in an Influence Diagram is slightly different to that in Bayesian Networks. When examining the d-separation relations we ignore the utility nodes and also the links into decision nodes, that is, we ignore the information and precedence links.

The most intuitive way to solve an Influence Diagram is to `unfold' the graph to a tree representation. In our movie example this would be the tree shown in Figure 6.2, where we use the Bayesian Network of Figure 6.3 to calculate the probabilities P(E | D2). We can then use techniques for solving decision trees (see for instance [32]) to find the optimal decision sequence. However, for even moderately small networks, or networks in which the variables have a large number of states, this is computationally infeasible, as the number of branches increases exponentially with the number of possibilities at each node.
The most efficient method for solving Influence Diagrams is to use a clustering method similar to that used for Bayesian Networks and discussed in Section 3.3.3.1. This technique builds what is referred to by Jensen [22] as strong junction trees which, because of the temporal ordering present in Influence Diagrams, involve what is known as strong triangulation, where order matters.

The major complexity issue in solving Influence Diagrams is that, because of the no-forgetting assumption, the past is often intractably large. There exist several approximate methods which can be used to get around this by making use of d-separation, for example by the use of information blocking or what are called LIMIDs (Limited Memory Influence Diagrams).
Information blocking uses d-separation to simplify the network. As an example, consider a simplified version of the movie ticket scenario where the only decision required to be made is which genre of movie to buy a ticket for. If the boy has a series of dates, a relevant Influence Diagram would be that depicted in Figure 6.6, where the girl's enjoyment of the past movie influences her enjoyment of the next movie (for example, she may only enjoy comedy) and the decision of her boyfriend as to which ticket to buy for the next date.

Figure 6.6: Influence Diagram representing the decision of which ticket to buy on a sequence of dates.

Note that such a diagram, taken over several time slices, is often represented compactly as in Figure 6.7, where a double link represents the existence of a link between those variables from one time slice to the next. To use the information blocking approximation we ignore the link between enjoyment nodes from one time slice to the next. Then, because we know which decision was made at D2 at the previous time step, this node d-separates the two time slices and they become independent, significantly reducing the computational burden of solution.
Figure 6.7: Compact representation of the Influence Diagram in Figure 6.6.
When we use a LIMID we drop the no-forgetting assumption and represent directly what is remembered at each decision node by the use of information links. Hence if we want a memory which is good for N steps, say, then we would direct links from the relevant variables in the past N steps to the decision at present. A LIMID is hence an extension of the information blocking approach. Although the solution to a LIMID is an approximation to the solution of the corresponding infinite-memory influence diagram, it is a good compromise between accuracy and computational efficiency.

In general, solving Influence Diagrams is more complex than solving Bayesian Networks, but many of the same techniques can be used. Once a model has been formed, there exist software packages which can be used to assist in the solution of the Influence Diagram.
Chapter 7
Conclusions and Remarks
In Chapter 2 we introduced several types of Bayesian Networks, namely Probabilistic Bayesian Networks, Bayesian Belief Networks and Causal Bayesian Networks. These are the most general types of Bayesian Network and make clear the distinction between causal and non-causal models. Bayesian Networks can be used to model a wide range of situations, and so more specific types have been developed. For example, Bayesian Networks are particularly useful for models involving hidden variables, and in this context are often referred to as Dynamic Bayesian Networks [29]. However, all are simply specialisations of the more general types introduced here, and so the algorithms and general properties hold.

In Chapter 3 we discussed methods for calculating the distribution after new information had been received. The method presented in Sections 3.3.1 and 3.3.2 is exact and computationally feasible in trees and polytrees of reasonable size, though in practical situations, unless we choose to simplify the network first using the techniques discussed in Section 4.3 for example, most networks will contain loops. The join tree method discussed in Section 3.3.3 was originally developed by Shafer et al. [35]. This is still one of the most efficient methods available when exact calculation is feasible, though Lauritzen and Spiegelhalter [27], and Jensen et al. [24], developed a slightly more efficient method where the conditional probabilities of the network are changed dynamically (referred to as the Hugin method). Most recent research is focused on finding good approximations by use of Monte Carlo methods. Gilks et al. developed Gibbs sampling for Bayesian Networks, see [16], whilst the AIS-BN algorithm mentioned in Section 3.3.3.5 appears to be the most efficient for unlikely instantiations of the evidence nodes [2].
The value of evidence and the effect of changes to network structure, discussed in Chapter 4, was motivated by the issue of sensitivity analysis. That is, when there is uncertainty regarding the structure (or parameters) of a network, we wish to know how sensitive our conclusions are to changes in those parameters within a reasonable range. More specifically, if θ is a set of parameters for a Bayesian Network, we may be interested in how P(X^h | e) varies with θ. This is an important issue as, obviously, we do not want to pretend conclusions drawn from a model are valid unless they are robust to small changes of that model. It turns out that, under a general assumption, one-way sensitivity analysis, in which we determine the effect on P(X^h | e) of varying only one parameter and holding the remainder constant, requires less than two propagations through the network and simple calculations. More sophisticated analyses have also been developed. Jensen [22] gives a good overview, and more details can be found in [25] and [23].

In Section 4.4.2 we showed that it is best to continually update the set of evidence nodes, based on the effect of evidence observed at the previous time-step. This leads into the theory of finding an optimal trouble-shooting strategy. In that case we have some problem, whose cause is uncertain, that we wish to rectify. We can perform a series of actions (which may fix the problem) and tests (to help decide which actions may be best), each of which has some associated cost. The optimal trouble-shooting strategy is the sequence of actions and tests which results in the minimum expected cost of repair, that is, the minimum expected cost of the actions and tests which need to be performed before the problem is fixed. Influence Diagrams can be used to find the optimal sequence, and details of their application to this problem can be found in [22].
The Bayesian method for learning Bayesian Networks, discussed in Section 5.5, allows a way to incorporate prior knowledge and uncertainty about the parameters of the network into the model. Although other methods for learning networks exist, as alluded to in Section 5.3 for example, the Bayesian method is the most popular at present. A good source of references to the literature on the topic of learning Bayesian Networks is [1].

In Chapter 6 we introduced the idea of extending Bayesian Networks to Influence Diagrams when we wish to model a decision scenario. Although, in order to be computationally feasible, these must satisfy several strict assumptions, many practical problems are still able to be modelled. Prominent applications of Influence Diagrams are finding optimal trouble-shooting strategies and, in management, modelling symmetric decision scenarios. See, for example, [22].
Although Bayesian Networks are primarily a numerical model on which to base inference, the graphical aspect of Bayesian Networks is often utilised. For example, given data, a learning algorithm can be run to determine a suitable structure. Using a method such as the bootstrap discussed in Section 5.6, the `strength' of each of the induced links can be estimated. This graphical representation allows one to easily determine the relationships between the variables. For example, the network available online at [38], produced from a study of gene expression data from a microarray analysis, allows one to see immediately the relationships between the genes: which pairs are strongly related and which genes are independent of the others, for example. Hence a Bayesian Network provides a way of extracting easily interpretable information about the independence relations between the variables from very large data bases. The graphical component of a Bayesian Network is also used by Sebastiani et al. in [34]. In this case a learning algorithm was run on a database of survey results. The network displayed the relationships between the variables, and this could be published without any risk of loss of confidentiality to the survey participants.

Throughout this thesis we considered only the case of discrete variables. Bayesian Networks can also be extended to use with continuous (Gaussian) variables and a mixture of both discrete and Gaussian variables. See for example [28] and [29]. Continuous variables complicate information propagation but can simplify learning: in the Gaussian case there are only two parameters (the mean and variance) to be learnt for each configuration of the parents of each variable.
In general, we have primarily discussed what a Bayesian Network is, how they can be formed, and the mechanisms underlying the propagation of information through a network. Rather than focus on a specific application, we chose instead to investigate and present aspects of the large theoretical basis which underpins the application of Bayesian Networks to modelling real-world problems. By having discussed the mathematics behind the application of Bayesian Networks, it is hoped that the suitability of a Bayesian Network for any specific modelling task can be assessed, and the broad range of applications realised.

By understanding the implications of the structure on the relationship between the variables we are better able to understand the constraints of our particular model. Furthermore, we are able to understand the source of the complexity issues which may arise when modelling real problems, how best to deal with them, and the effect of any subsequent simplifications.

Understanding how information is stored by and propagated through a network allows us to understand the effect of information on the distributions at the hypothesis nodes and hence analyse the value of evidence for our particular problem.

After having gained an understanding of the theory of Bayesian Networks in the first five chapters, it was then natural, in Chapter 6, to extend this to Influence Diagrams. This illustrates the ability of Bayesian Networks to be extended to model many different scenarios. Bayesian Networks form a very broad class of models and generalise many others. They are primarily used for sequential updating, classification, normal regression, hidden Markov Models and, through the learning algorithms, as a `data mining' tool to extract information about the relationships between variables in a domain. It is hoped this thesis has allowed the reader to appreciate the flexibility of modelling using Bayesian Networks, and furthermore that the approach taken has led to a better understanding of the issues involved in probabilistic reasoning under uncertainty.
Bibliography

[1] Buntine, W. (1996) A Guide to the Literature on Learning Probabilistic Networks from Data, IEEE Trans. on Knowledge and Data Engineering, vol. 8, no. 2, 195-210.
[2] Cheng, J. and Druzdzel, M. (2000) AIS-BN: An Adaptive Importance Sampling Algorithm for Evidential Reasoning in Large Bayesian Networks, Journal of Artificial Intelligence Research, vol. 13, 155-188.
[3] Chow, C. K. and Liu, C. N. (1968) Approximating Discrete Probability Distributions with Dependence Trees, IEEE Trans. on Info. Theory, vol. IT-14, 462-467.
[4] Chow, C. K. and Wagner, T. J. (1973) Consistency of an Estimate of Tree-Dependent Probability Distributions, IEEE Trans. on Info. Theory, vol. 19, 369-371.
[5] Clark, J. and Derek, A. H. (1991) A First Look at Graph Theory, World Scientific.
[6] Cooper, G. F. (1990) The Computational Complexity of Probabilistic Inference using Bayesian Belief Networks, Artificial Intelligence, vol. 42, 393-405.
[7] Cooper, G. F. and Herskovits, E. (1992) A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning, vol. 9, 309-347.
[8] Cover, T. M. and Thomas, J. A. (1991) Elements of Information Theory, John Wiley and Sons.
[9] Dagum, P. (1993) Approximating Probabilistic Inference in Bayesian Belief Networks is NP-hard, Artificial Intelligence, vol. 60, 141-153.
[10] Das, B. (1999) Representing Uncertainties Using Bayesian Networks, DSTO-TR-0918.
[11] Diestel, R. (1997) Graph Theory, Springer-Verlag, New York.
[12] Ewens, W. J. (2001) Statistical Methods in Bioinformatics: an introduction, Springer, New York.
[13] Friedman, N., Goldszmidt, M. and Wyner, A. (1999) Data Analysis with Bayesian Networks: A Bootstrap Approach, Proc. of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
[14] Geiger, D. and Heckerman, D. (1994) A Characterisation of the Dirichlet Distribution, Microsoft Research MSR-TR-94-16.
[15] Geiger, D., Verma, T. and Pearl, J. (1990) Identifying Independence in Bayesian Networks, Networks, vol. 20, 507-534.
[16] Gilks, T. and Spiegelhalter, D. (1994) A Language and a Program for Complex Bayesian Modelling, The Statistician, vol. 43, 169-178.
[17] Hastie, T., Tibshirani, R. and Friedman, J. H. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer.
[18] Heckerman, D. (1995) A Tutorial on Learning with Bayesian Networks, Microsoft Research MSR-TR-95-06.
[19] Heckerman, D. and Geiger, D. (1995) Likelihoods and Parameter Priors for Bayesian Networks, Microsoft Research MSR-TR-95-54.
[20] Heckerman, D., Geiger, D. and Chickering, D. (1995) Learning Bayesian Networks: The Combination of Knowledge and Statistical Data, Machine Learning, vol. 20, 197-243.
[21] Hwang, F. (1992) The Steiner Tree Problem, Annals of Discrete Maths, vol. 53.
[22] Jensen, F. (2001) Bayesian Networks and Decision Graphs, Springer-Verlag, New York.
[23] Jensen, F., Aldenryd, S. and Jensen, K. (1995) Sensitivity Analysis in Bayesian Networks, Lecture Notes in Artificial Intelligence, vol. 946, 243-250, Springer.
[24] Jensen, F., Lauritzen, S. and Olesen, K. (1990) Bayesian Updating in Causal Probabilistic Networks by Local Computations, Computational Statistics Quarterly, vol. 4, 269-282.
[25] Kjaerulff, U. and van der Gaag, L. C. (2000) Making Sensitivity Analysis Computationally Efficient, Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence, 317-325, Morgan Kaufmann Publishers.
[26] Lauritzen, S. (1995) The Expectation-Maximisation Algorithm for Graphical Association Models with Missing Data, Computational Statistics and Data Analysis, vol. 19, no. 2, 191-201.
[27] Lauritzen, S. and Spiegelhalter, D. (1988) Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems, Journal of the Royal Statistical Society, Series B, vol. 50, 157-224.
[28] McMichael, D., Liu, L. and Pan, H. (1999) Estimating the Parameters of Mixed Bayesian Networks from Incomplete Data, Proc. of the International Conf. on Information, Decision and Control.
[29] Murphy, K. (1998) A Brief Introduction to Graphical Models and Bayesian Networks, www.ai.mit.edu/~murphyk/Bayes/bayes.html
[30] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann Publishers, California.
[31] Pearl, J. (2000) Causality: Models, Reasoning and Inference, Cambridge University Press.
[32] Raiffa, H. (1968) Decision Analysis: Introductory Lectures on Reasoning Under Uncertainty, Addison-Wesley.
[33] Rice, J. A. (1988) Mathematical Statistics and Data Analysis, Duxbury Press.
[34] Sebastiani, P. and Ramoni, M. (2001) On the Use of Bayesian Networks to Analyze Survey Data, citeseer.nj.nec.com/514636.html
[35] Shafer, G. and Shenoy, P. (1990) Probability Propagation, Annals of Mathematics and Artificial Intelligence, vol. 2, 327-352.
[36] Whittaker, J. (1989) Graphical Models in Applied Mathematical Multivariate Statistics, John Wiley and Sons.
[37] Williamson, J. (2000) Approximating Discrete Probability Distributions with Bayesian Networks, Proc. of the International Conf. on Artificial Intelligence in Science and Technology.
[38] www.cs.huji.ac.il/~nirf/GeneExpression/top800/