Shetty
Engineering
Abstract: A Bayesian network model is a popular technique for data mining due to its intuitive interpretation. This paper presents a semantic genetic algorithm (SGA) to learn the complete qualitative structure of a Bayesian network from a database. SGA builds on recent advances in the field and focuses on the generation of the initial population and on the crossover and mutation operators. In particular, we introduce two operators, a semantic crossover and a semantic mutation operator, that aid in faster convergence of the SGA. The crossover and mutation operators in SGA incorporate the semantics of Bayesian network structures to learn the structure with very few errors. SGA has been shown to perform better than existing classical genetic algorithms for learning Bayesian networks. We present empirical results to demonstrate the fast convergence of SGA and the predictive power of the obtained Bayesian network structures.
Index Terms: Data mining, Bayesian networks, genetic algorithms, structure learning
I. INTRODUCTION
One of the most important steps in data mining is to build a descriptive model of the database being mined. To do so, probability-based approaches have been considered an effective tool because of the uncertain nature of descriptive models. Unfortunately, high computational requirements and the lack of proper representation have hindered the building of probabilistic models. To alleviate these twin problems, probabilistic graphical models have been proposed. In the past decade, variants of probabilistic graphical models have been developed, with the simplest variant being Bayesian networks (BNs) [1]. Bayesian networks are employed to reason under uncertainty, with wide-ranging applications in the fields of medicine, finance, and military planning [1], [3]. A Bayesian network is a popular descriptive modeling technique for available data, as it gives an easily understandable way to see relationships between the attributes of a set of records.
Computationally, Bayesian networks provide an efficient way to represent relationships between attributes and allow reasonably fast inference of probabilities. A lot of research has been initiated towards learning these networks from raw data, as the traditional designer of a Bayesian network may not be able to see all of the relationships between the attributes.

The authors would like to thank the support of the ECE Department of Old Dominion University.
Learning and reasoning are the main tasks in analyzing Bayesian networks. We focus on the structure learning of a Bayesian network from a complete database. The database stores the statistical values of the variables as well as the conditional dependence relationships among the variables. We employ a genetic algorithm (GA) technique to learn the structure of Bayesian networks. Genetic algorithms have been successfully applied to optimization tasks [7], [8].
The Bayesian network learning problem can be viewed as an optimization problem in which a Bayesian network has to be found that best represents the probability distribution that has generated the data in a given database.
The rest of the paper is organized as follows. Section 2 introduces the background of Bayesian networks and GAs, and the related work in Bayesian network structure learning. In Section 3 we discuss the details of our approach for structure learning of a Bayesian network using a modified GA. In Section 4 we present two genetic algorithms. The first one is a genetic algorithm with classical genetic operators. In the second algorithm, we extend the standard mutation and crossover operators to incorporate the semantics of Bayesian network structures. Section 4 also presents results for the two genetic algorithms under the two constraints. Finally, Section 5 concludes the paper and proposes future research.
II. BACKGROUND AND RELATED WORK
A. Structure Learning of Bayesian Networks
Formally, a Bayesian network consists of a set of nodes which represent variables, and a set of directed edges between the nodes. The nodes and directed edges constitute a directed acyclic graph (DAG). Each node is represented by a finite set of mutually exclusive states. The directed edges between the nodes represent the dependence between the linked variables.
The strengths of the relationships between the variables are expressed as conditional probability tables (CPT). A Bayesian network efficiently encodes the joint probability distribution of its variables. The encoding can also be viewed as a conditional decomposition of the actual joint probability of variables. Fig.
1 depicts a hypothetical example of a Bayesian network.
Figure 1. A Bayesian network with three variables: (a) DAG; (b) CPT.
The three variables shown in Figure 1 are $X = \{x, \bar{x}\}$, $Y = \{y, \bar{y}\}$, and $Z = \{z, \bar{z}\}$, where $X$ and $Y$ are parents of $Z$. The DAG for the example is shown in Fig. 1a and the CPT for node $Z$ is shown in Fig. 1b. The joint probabilities can be determined as $p(X, Y, Z) = p(X)\, p(Y)\, p(Z \mid X, Y)$. From the figure, it is clear that once a structure is found, it is necessary to compute the conditional probabilities for node $Z$ with parents $X$ and $Y$ for all possible combinations and their prior probabilities.
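As a quick worked example with hypothetical numbers (not the values from Fig. 1): suppose $p(x) = 0.6$, $p(y) = 0.3$, and $p(z \mid x, y) = 0.8$. Then the joint probability of the assignment $(x, y, z)$ factorizes as

$$p(x, y, z) = p(x)\, p(y)\, p(z \mid x, y) = 0.6 \times 0.3 \times 0.8 = 0.144,$$

so the full joint distribution never has to be stored explicitly; only the CPT entries are needed.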
In general, the DAG and CPT specify the Bayesian network and represent the distribution for the $n$-dimensional random variable $(X_1, \ldots, X_n)$, where $x_i$ represents the value of the random variable $X_i$ and $pa(x_i)$ represents the value of the parents of $X_i$. Thus, the structure learning problem of a Bayesian network is equivalent to the problem of searching for the optimum in the space of all DAGs. During the search process, a trade-off between the structural network complexity and the network accuracy has to be made. The trade-off is necessary as complex networks suffer from overfitting, making the run time of inference very long. A popular measure to balance complexity and accuracy is based on the principle of minimal description length (MDL) from information theory [6]. In this paper the Bayesian network learning problem is solved by searching for a structure that minimizes the MDL score.
B. Related Work
Larranaga et al. [2] proposed a genetic algorithm based upon a score-based greedy algorithm in which a DAG is represented by a connectivity matrix that is stored as a string (the concatenation of its rows). Recombination is implemented as one-point crossover on these strings, while mutation is implemented as random bit flipping. In a related work, Larranaga et al. [10] employed a wrapper approach by implementing a GA that searches for an ordering that is passed on to K2 [4]; the results of the wrapper approach were comparable to those of their previous GA. Different crossover operators have been implemented in a GA to increase the adaptiveness of the learning problem, with good results [11]. Lam et al. [6] proposed a hybrid evolutionary programming (HEP) algorithm that combines the use of independence tests with a quality-based search. In the HEP algorithm, the search space of DAGs is constrained in the sense that each possible DAG only connects two nodes if they show a strong dependence in the available data. The HEP algorithm evolves a population of DAGs to find a solution that minimizes the MDL score. The common drawback of the algorithms proposed by Lam and Larranaga is that the crossover and mutation operators they used were classical in nature, and the operators do not help the evolution process to reach the best solution. Our algorithm differs from the above algorithms in the design of its crossover and mutation operators.
III. MODIFIED GA APPROACH
A. Representative Function and Fitness Function
The first task in a GA is the representation of the initial population. To represent a BN as a GA individual, an edge matrix or adjacency matrix is needed.
The set of network structures for a specific database characterized by $n$ variables can be represented by an $n \times n$ connectivity matrix $C$. Each bit represents the edge between two nodes, where

$$c_{ij} = \begin{cases} 1, & \text{if } j \text{ is a parent of } i \\ 0, & \text{otherwise.} \end{cases}$$

The two-dimensional array of bits can be represented as an individual of the population by the string $c_{11} c_{12} \cdots c_{1n} c_{21} c_{22} \cdots c_{2n} \cdots c_{n1} c_{n2} \cdots c_{nn}$, where the first $n$ bits represent the edges to the first node of the network, and so on. It can be easily seen that the bits $c_{kk}$, which represent an edge from node $k$ to itself, are irrelevant and can be ignored by the search process.
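For concreteness, the following minimal sketch (our illustration, not the paper's implementation) encodes a connectivity matrix as the row-concatenated bit string described above and decodes it back:

```python
def encode(c):
    """Concatenate the rows of an n x n connectivity matrix into one bit string.
    c[i][j] == 1 means node j is a parent of node i."""
    return "".join(str(bit) for row in c for bit in row)

def decode(s, n):
    """Rebuild the n x n connectivity matrix from its row-concatenated bit string."""
    return [[int(s[i * n + j]) for j in range(n)] for i in range(n)]

# Example: three nodes, where nodes 0 (X) and 1 (Y) are parents of node 2 (Z).
c = [[0, 0, 0],
     [0, 0, 0],
     [1, 1, 0]]
assert encode(c) == "000000110"
assert decode(encode(c), 3) == c
```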
With the representative function decided, we need to devise the generation of the initial population. There are several approaches to generating an initial population. We implemented the Box-Muller random number generator to select how many parents would be chosen for each individual node. The parameters for the Box-Muller algorithm are the desired average and standard deviation. Based on these two input parameters, the algorithm generates a number that fits a normal distribution. For our implementation, the average corresponds to the average number of parents per node in the resultant BN. After considerable experimentation, we found that the best average was 1.0 with a standard deviation of 0.5. Though this approach is simple, it creates numerous illegal DAGs due to cyclic subnetworks. An algorithm to remove or fix these cyclic structures has to be designed. The algorithm simply removes a random edge from a cycle until no cycles are found in a DAG individual.
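A minimal sketch of this initialization, under our own assumptions: Python's random.gauss stands in for a hand-rolled Box-Muller routine, and cycles are repaired by deleting one random edge per detected cycle, as described above:

```python
import random

def find_cycle(c):
    """Return one directed cycle as a list of (i, j) edges (c[i][j] == 1), or None."""
    n, color = len(c), [0] * len(c)          # 0 = unvisited, 1 = on path, 2 = done
    def dfs(u, path):
        color[u] = 1
        path.append(u)
        for v in range(n):
            if c[u][v]:
                if color[v] == 1:            # back edge closes a cycle v -> ... -> u -> v
                    nodes = path[path.index(v):]
                    return list(zip(nodes, nodes[1:])) + [(u, v)]
                if color[v] == 0:
                    cyc = dfs(v, path)
                    if cyc:
                        return cyc
        path.pop()
        color[u] = 2
        return None
    for s in range(n):
        if color[s] == 0:
            cyc = dfs(s, [])
            if cyc:
                return cyc
    return None

def random_individual(n, mean=1.0, std=0.5):
    """Draw a parent count per node from N(mean, std), assign random parents,
    then delete random cycle edges until the structure is a legal DAG."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        k = max(0, round(random.gauss(mean, std)))            # Gaussian parent count
        for j in random.sample([v for v in range(n) if v != i], min(k, n - 1)):
            c[i][j] = 1                                        # j becomes a parent of i
    while (cycle := find_cycle(c)) is not None:
        i, j = random.choice(cycle)                            # remove a random cycle edge
        c[i][j] = 0
    return c
```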
Now that the representative function and the population generation have been decided, we have to find a good fitness function. Most of the current state-of-the-art implementations use the fitness function proposed in the algorithm K2 [4]. The
K2 algorithm assumes an ordered list of variables as its input.
The K2 algorithm maximizes the following function by searching, for every node from the ordered list, for a set of parent nodes:
$$g(x_i, pa(x_i)) = \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!$$

where $v_{i1}, \ldots, v_{ir_i}$ represent the possible value assignments for the variable $X_i$, $q_i$ is the number of unique instantiations of $pa(X_i)$, $N_{ijk}$ represents the number of instances in the database in which the variable $X_i$ takes the value $v_{ik}$ while $pa(X_i)$ takes its $j$th instantiation, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$.
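As an illustration of how this metric can be computed, here is a hedged sketch in log form (the factorials overflow quickly, so math.lgamma is used instead); the flat data layout, one tuple per database record, is our assumption:

```python
import math
from collections import Counter

def k2_log_score(data, i, parents, r_i):
    """log g(x_i, pa(x_i)) for the Cooper-Herskovits (K2) metric [4].
    data: iterable of tuples, one complete case per record;
    r_i: number of possible values of variable X_i."""
    n_ijk = Counter()                              # N_ijk keyed by (parent config, value)
    for row in data:
        cfg = tuple(row[p] for p in parents)       # the j-th parent instantiation
        n_ijk[(cfg, row[i])] += 1
    score = 0.0
    # Unobserved parent configurations contribute exactly 0, so it is enough
    # to iterate over the configurations that actually occur in the data.
    for cfg in {c2 for c2, _ in n_ijk}:
        counts = [n for (c2, _v), n in n_ijk.items() if c2 == cfg]
        n_ij = sum(counts)                                     # N_ij = sum_k N_ijk
        score += math.lgamma(r_i) - math.lgamma(n_ij + r_i)    # (r_i-1)!/(N_ij+r_i-1)!
        score += sum(math.lgamma(n + 1) for n in counts)       # product of N_ijk!
    return score

# Hypothetical usage: score node 2 with parents {0, 1} on binary records.
data = [(0, 1, 1), (1, 1, 0), (0, 0, 0), (1, 0, 1)]
print(k2_log_score(data, i=2, parents=[0, 1], r_i=2))
```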
B. Mutation and Crossover Operators

We introduce two new operators, SM (semantic mutation) and SPSC (single point semantic crossover), in addition to the existing standard mutation and crossover operators. The SM operator is a heuristic operator that toggles the bit value of a position in the edge matrix. After the toggle, the fitness function $g(x_i, pa(x_i))$ is evaluated to ensure that the individual with the toggled value leads to a better solution. The new crossover operator is specific to our representation function. As the representation is a two-dimensional edge matrix consisting of columns and rows, our new crossover operator operates on either columns or rows. Thus the crossover operator generates two offspring by manipulating either columns or rows. SPSC crosses two parents by a) manipulating columns (parents) and maximizing the function $g(x_i, pa(x_i))$, and b) manipulating rows (children) and maximizing the same function.
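A minimal sketch of the SM operator described above, reusing find_cycle from the initialization sketch; score_fn stands in for the fitness function $g(x_i, pa(x_i))$ and is an assumption of ours:

```python
import random

def semantic_mutation(c, score_fn):
    """SM: toggle one random off-diagonal bit of the edge matrix, and keep the
    toggle only if it improves the score without introducing a cycle."""
    n = len(c)
    i, j = random.choice([(a, b) for a in range(n) for b in range(n) if a != b])
    before = score_fn(c)
    c[i][j] ^= 1                                   # tentatively toggle the edge bit
    if find_cycle(c) is not None or score_fn(c) <= before:
        c[i][j] ^= 1                               # revert: not a semantic improvement
    return c
```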
By combining SM and SPSC, we implement our new genetic algorithm: the semantic GA (SGA). Following is the pseudocode for the semantic crossover operation. The algorithm expects an individual as input and returns a modified individual after applying semantic crossover operations.
Pseudocode for Semantic Crossover

Step 1. Initialization
Read the input individual and populate the parent table for each node.
Step 2. Generate new individual
2.1 For each node in the individual, do the following n times:
2.2 Execute the Box-Muller algorithm to find how many parents need to be altered.
2.3 Ensure that the nodes selected as parents do not form cycles. If cycles are formed, repeat 2.2.
2.4 Evaluate the network score of the resultant structure.
2.5 If the current score is higher than the previous score, then the chosen parents become the new parents of the selected node.
2.6 Repeat 2.2 through 2.5.
Step 3. Return the final modified individual.
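Under the same assumptions as the earlier sketches (list-of-lists edge matrix, a score_fn for the network score, find_cycle for the acyclicity test), the pseudocode above translates roughly to:

```python
import random

def semantic_crossover(c, score_fn, mean=1.0, std=0.5):
    """One pass of the semantic crossover pseudocode over an individual c."""
    n = len(c)
    for i in range(n):                                        # Step 2.1: each node
        k = max(0, round(random.gauss(mean, std)))            # Step 2.2: Box-Muller draw
        candidate = [row[:] for row in c]
        for j in random.sample([v for v in range(n) if v != i], min(k, n - 1)):
            candidate[i][j] ^= 1                              # alter the chosen parent bits
        if find_cycle(candidate) is not None:                 # Step 2.3: reject cycles
            continue
        if score_fn(candidate) > score_fn(c):                 # Steps 2.4-2.5
            c = candidate                                     # keep the better parent set
    return c                                                  # Step 3
```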
IV. SIMULATIONS

In this section we present the overall simulation methodology used to verify SGA. Fig. 2 shows the overall simulation setup to evaluate our genetic algorithm. The following are the main steps carried out:

Step 1. Determine a Bayesian network (structure + conditional probabilities) and simulate it using a probabilistic logic sampling technique [13] to obtain a database D, which reflects the conditional independence relations between the variables;
Step 2. Apply our SGA approach to obtain the Bayesian network structure $B_S$, which maximizes the probability $p(D \mid B_S)$;
Step 3. Evaluate the fitness of the solutions.

The simulations were implemented by incorporating our SGA algorithm into the Bayesian Network Tools in Java (BNJ) [9]. BNJ is an open-source suite of software tools for research and development using graphical models of probability. A learning algorithm based on SGA was implemented as a separate module using the BNJ API and incorporated into the toolkit. To depict Bayesian networks visually, BNJ provides a visualization tool to create and edit networks. Bayesian networks of different sizes were used in the simulations: 8, 12, 18, 24, 30, and 36 nodes. The 8-node Bayesian network used in the simulations is the ASIA network [12]. The remaining networks were created by adding extra nodes to the basic ASIA network. The ASIA network, introduced by Lauritzen and Spiegelhalter [12] to illustrate their method of propagation of evidence, considers a small piece of fictitious qualitative medical knowledge. Fig. 3 shows the structure of the ASIA network.

Figure 3. Structure of the ASIA network.
There are several techniques for simulating a Bayesian network. For our experiments we have adopted the probabilistic logic sampling technique. In this technique, the data generator generates random samples based on the ASIA network's joint probability distribution. The data generator sorts the nodes topologically and picks a value for each root node using its probability distribution, and then generates values for each child node according to its parents' values in the conditional probability tables. The root mean square error (RMSE) of the generated data compared to the network it was generated from is approximately 0, which indicates that the data was generated correctly.
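A compact sketch of this sampler under an assumed CPT encoding: cpts[i] maps a tuple of parent values (in ascending node order) to P(X_i = 1 | parents); the actual ASIA tables are not reproduced here:

```python
import random

def topological_order(c):
    """Parents-first node ordering; assumes the edge matrix c is acyclic
    (c[i][j] == 1 means j is a parent of i)."""
    n, order, placed = len(c), [], set()
    while len(order) < n:
        for i in range(n):
            if i not in placed and all(j in placed for j in range(n) if c[i][j]):
                order.append(i)
                placed.add(i)
    return order

def sample_record(c, cpts):
    """Probabilistic logic sampling: roots from their priors, then each child
    given its already-sampled parent values."""
    values = {}
    for i in topological_order(c):
        parents = tuple(values[j] for j in range(len(c)) if c[i][j])
        values[i] = 1 if random.random() < cpts[i][parents] else 0
    return values

# Hypothetical 2-node example: node 1 depends on node 0.
c = [[0, 0], [1, 0]]
cpts = {0: {(): 0.3}, 1: {(0,): 0.2, (1,): 0.9}}
db = [sample_record(c, cpts) for _ in range(5)]
```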
We populated the database with 2000, 3000, 5000, and 10,000 records. This was done to measure the effectiveness of the learning algorithm in the presence of both less and more information. The algorithm used the following input:

Population size $\lambda$: experiments have been carried out with $\lambda = 100$ and $\lambda = 150$.
Crossover probability $p_c = 0.9$.
Mutation rate $p_m = 0.1$.
The fitness function used by our algorithm is based on the formula proposed by Cooper and Herskovits [4]. For each of the samples (2000, 3000, 5000, and 10,000 records) we executed 10 runs with each of the above parameter combinations. We considered the following four metrics to evaluate the behavior of our algorithm.
Average fitness value: the average of the fitness function values over 10 runs.
Best fitness value: the best fitness value throughout the evolution of the GA.
Average graph errors: the graph errors between the best Bayesian network structure found in each search and the initial Bayesian network structure. A graph error is defined to be an addition, a deletion, or a reversal of an edge.
Average number of generations: the number of generations taken to find the best fitness value.
To evaluate SGA, we also implemented the classical genetic algorithm (CGA) with classical mutation (CM) and single point cyclic crossover (SPCC) operators. Fig. 4 plots the average fitness values for the following parameter combination. The average and best fitness values are expressed in terms of $\log p(D \mid B_S)$. The number of records is 10,000. The population size is set at 200, the probabilities for the crossover operators SPCC and SPSC are kept at 0.9, and the rates for the mutation operators SM and CM are kept at 0.1. The figure also shows the best fitness value for the whole evolution process. One can see that SGA performs better than CGA in the initial 0-20 generations. After 20 generations, the genetic algorithm with both operator sets stabilizes to a final fitness value. The final fitness value is very close to the best fitness value. An important observation is that the average fitness value does not deviate much even after 100 generations. The best fitness value is carried over to every generation and does not get affected.
Figs. 5 and 6 plot the final learned networks after the evolution process. The final learned Bayesian network was constructed from the final individual generated after 100 generations. The representation of the individual is in the form of a bit string representing an adjacency matrix, and the conversion from this bit string to the equivalent Bayesian network is trivial. Fig. 5 shows the learned Bayesian network graph after 100 generations for 5,000 records, while Fig. 6 shows the learned Bayesian network graph for 10,000 records. It can be observed that for both scenarios, the learned network differs from the actual generating network shown in Fig. 3 by a small number of graph errors. It is also worth noting that the number of graph errors decreases as the total number of records increases. This suggests that to reduce the total number of graph errors, a large number of records needs to be provided.
Figure 4. Plot of generations versus average fitness values (10,000 records).
Figure 5. Learned network after 100 generations for 5,000 records (graph errors = 3).
Figure 6. Learned network after 100 generations for 10,000 records (graph errors = 2).
Tables 1 and 2 provide the average number of generations and the average graph errors for the different numbers of records. It is obvious that for 2000 records the total number of generations taken to achieve the stabilized fitness value is very high, and the average number of graph errors is also too high. For 3,000, 5,000, and 10,000 records the values for these metrics are
very reasonable and acceptable. To compare the performance of SGA with CGA in the presence of larger BN structures, we modified the 8-node ASIA network and generated five additional BNs with node sizes 12, 18, 24, 30, and 36. Tables 3-7 show results of simulations carried out on the additional BNs. The tables compare the average graph errors for both approaches. The accuracy of SGA does not deteriorate under increased network sizes.
Table 1. Average number of generations (columns: Records, SGA, CGA).

Table 2. Average graph errors for 8 and 12 nodes (columns: Records, SGA(8), CGA(8), SGA(12), CGA(12)).

Table 3. Average graph errors for 18 and 24 nodes.

[Tables 4-7, average graph errors for the 30- and 36-node networks; entries not recoverable.]

V. CONCLUSIONS AND FUTURE WORK
We have presented a new semantic genetic algorithm (SGA) for BN structure learning. This algorithm is another effective addition to the list of BN structure learning techniques. Our results show that SGA discovers BN structures with greater accuracy than existing classical genetic algorithms. Moreover, for large network sizes, the accuracy of SGA does not degrade. This accuracy improvement does not come with an increase of the search space. In all our simulations, 100 to 150 individuals are used in each of the 100 generations. Thus 10,000 to 15,000 networks in total are searched to learn the BN structure. Considering that the exhaustive search space contains on the order of $2^{n^2}$ networks, only a small percentage of the entire search space is needed by our algorithm to learn the structure.
One aspect of future work is to change the current random generation of adjacency matrices for the initial population generation. The second piece of future work is to improve scalability by parallelizing the genetic algorithm. There are two computationally expensive operations in our genetic algorithm. The first deals with initializing the first generation of individuals and evaluating fitness values for each of them. A strategy to divide the initial population across a cluster of workstations could be devised to speed up the computational process and reduce the overall running time.

REFERENCES
[1] J. Pearl, "Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference," Morgan Kaufmann, San Mateo, CA, 1988.
[2] P. Larranaga, M. Poza, and C. Kuijpers, "Structure learning of Bayesian networks by genetic algorithms: a performance analysis of control parameters," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 18, no. 9, 1996.
[3] F. V. Jensen, "Introduction to Bayesian Networks," Springer-Verlag New York, Inc., Secaucus, NJ, 1996.
[4] G. Cooper and E. Herskovits, "A Bayesian Method for the Induction of Probabilistic Networks from Data," Machine Learning, vol. 9, no. 4, pp. 309-347, 1992.
[5] D. Heckerman, D. Geiger, and D. M. Chickering, "Learning Bayesian Networks: The combination of knowledge and statistical data," Machine Learning, vol. 20, pp. 197-243, 1995.
[6] W. Lam and F. Bacchus, "Learning Bayesian Belief Networks: An Approach Based on the MDL Principle," Computational Intelligence, vol. 10, no. 4, 1994.
[7] R. F. Hart, "A global convergence proof for a class of genetic algorithms," University of Technology, 1990.
[8] Chakraborty and D. G. Dastidar, "Using reliability analysis to estimate the number of generations to convergence in genetic algorithms," Information Processing Letters, vol. 46, no. 4, pp. 199-209, 1993.
[9] Bayesian Network Tools in Java (BNJ), http://bnj.sourceforge.net
[10] P. Larranaga, C. Kuijpers, R. Murga, and Y. Yurramendi, "Learning Bayesian network structures by searching for the best ordering with genetic algorithms," IEEE Trans. on Systems, Man, and Cybernetics, vol. 26, no. 4, pp. 487-493, 1996.
[11] C. Cotta and J. Muruzabal, "Towards more efficient evolutionary induction of Bayesian networks," in Proceedings of Parallel Problem Solving from Nature VII, pp. 730-739, Springer-Verlag, 2002.
[12] S. L. Lauritzen and D. J. Spiegelhalter, "Local computations with probabilities on graphical structures and their application to expert systems," Journal of the Royal Statistical Society, Series B, vol. 50, no. 2, pp. 157-224, 1988.
[13] M. Henrion, "Propagating uncertainty in Bayesian networks by probabilistic logic sampling," in Uncertainty in Artificial Intelligence 2, pp. 149-163, 1988.