Boolean Networks, Rule Association Mining and the Peano Count Trees (PTrees) for Gene Expression Profiling

Valdivia Granda, W.A. (*1); Perrizo, W. 2; Deckard, E.L. 1

1 Plant Abiotic Stress, Genomics and Bioinformatics Group
2 Department of Plant Sciences
2 Department of Computer Science
3 Information Technology Services
North Dakota State University, Fargo, ND 58105, United States of America
* Corresponding author: Willy.Valdivia@ndsu.nodak.edu

Abstract

Gene expression profiles and signaling pathways are shaped by several positive and negative inputs that alter the expression of other genes and, consequently, of signaling pathways. The genetic constitution of an organism (K) is represented by the total number of genes belonging to two different groups. Group X consists of genes (X1,...,Xn) that can be represented as 1 or 0 depending on whether the gene is expressed or not. The function of many of these genes is conserved among organisms. The second group, the Y genes (Y1,…,Yn), are expressed or repressed at different levels. The genes of group Y are species specific and are modulated by the products and combinations of the genes of group X. This approach allows us to develop evolutionary relationships and to apply Boolean biological models and association rule mining methods. Since many gene functions are conserved and similar pathways exist among diverse species, we devised the “super chip.” The super chip contains data from microarray experiments on different organisms, arranged in a multi-dimensional plane in which sequence and temporal and spatial gene expression are represented. In the super chip we represent each spot of a microarray as a pixel with its corresponding red and green ratios. Association rule mining is proposed as a means to derive meaningful rules of gene interaction. We propose to organize signal transduction pathways taking into consideration signaling functions and evolutionary relationships.
For us, the genetic

Introduction:

The rediscovery of Mendel's laws of heredity at the opening of the 20th century sparked a scientific quest to understand the nature and content of genetic information. Living systems are very complex, and understanding them will remain a challenge for biologists, mathematicians, computer scientists and biological modelers. Identification of genetic regulatory networks and genetic signal transduction pathways is one of the key problems in molecular biology (Silvescu and Honavar, 1997). In the past two decades we have witnessed rapid advances in the development of more accurate, sensitive and powerful devices for gene sequencing, gene expression measurement and protein structural analysis. This common transitivity and the molecular interactions are studied in different organisms using DNA microarrays. Combining the information from different microarray experiments performed in different species and experimental settings makes it possible to identify ancestral and key players in the functioning of biological networks.

Biological systems are governed by complex signaling networks. From this information one can try to estimate the complexity of the whole network and, to a certain degree, the amount of cross talk in biochemical pathways (Wagner et al., 2001). DNA microarrays have produced a paradigm shift that is transforming the experimental methods used to understand changes in gene expression. DNA microarrays are becoming one of the most widely used techniques for the assessment of gene expression. Their small size, high density, and compatibility with fluorescence labeling make microarray technology ideal for comparative analysis between the control and labeled material in a parallel fashion. The advances in genomic data collection have led to the development of novel genetic models that try to analyze the system as a whole, in which pleiotropic and multigenic interactions can be identified (Silvescu and Honavar, 1997).
Various statistical and computational techniques have been proposed; they range from simple criteria that define a fold-change cut-off as a change in gene expression to complex unsupervised and supervised analyses. Nevertheless, no technique has been widely adopted yet. Thanks to gene expression measurement with microarray technology, we can start to ask how the form and function of networks have changed during evolution. However, to develop our model we first need to establish certain definitions. This paper centers on the development of a model in which the components of the network act simultaneously and provide positive and negative feedback. The genome of an organism is constituted by two groups of genes. The first can be represented as a Boolean function whose value is one if the gene is expressed and zero if it is not.

Boolean Networks and Association Rule Mining Techniques for Microarray Data

The functional state of an organism is a dynamic process in which the cell must be able to sense complex external stimuli through different mechanisms and trigger the activation of different signaling pathways. Recent developments in gene expression analysis using DNA microarray technology provide an approach for exploring genomic structure and gene function and for understanding the implications of the temporal and spatial expression levels of thousands of genes. Microarray technology supplies three basic kinds of information: i) gene expression, ii) the level of gene expression or repression, and iii) depending on whether supervised or unsupervised techniques are used, the interaction of a particular gene with other genes. Various creative computational and statistical methods have been proposed and developed by both public and private initiatives. These tools range from simple criteria that define gene expression changes by a fold-change cut-off to complex analyses using machine learning approaches. Nevertheless, none has yet gained widespread acceptance.
It has been argued that clustering methods for gene expression analysis have significant drawbacks that make them unsuitable for detecting complex relationships, especially when the size and intricacy of the data grow dynamically. Strict phylogenetic trees are best suited to true hierarchical descent (such as the evolution of species) and are not designed to reflect the multiple ways in which gene expression patterns can be similar. Clustering is successful if there is already a wealth of knowledge about the pathway in question; it is less successful when this knowledge is sparse (Fiehn et al. 2001; Quackenbush, 2001). Without a biological basis for interpreting the results, there is no way to decide which grouping is right and which is wrong. In addition, clustering methods impose structure whether or not it exists in the data. Consequently, the expression vector that represents a cluster might no longer represent any of the genes in the cluster (Tamayo et al. 1999; Quackenbush, 2001). Finally, clustering techniques group genes with other genes that have a similar level of expression; however, genes with similar expression levels are not necessarily part of the same pathway, and a gene that is part of one pathway may also take part in other biological processes.

To help elucidate the functional relationships among genes, association rule mining presents an interesting alternative. Association rule mining is a widely used data mining technique with applications in market basket research, insurance fraud investigation, climate prediction and remote sensing research (Valdivia-Granda et al., 2002). An association rule is a relationship X → Y, where X is the antecedent item set and Y is the consequent item set. In mining for these rules the user defines a threshold, called the minimum confidence, which the strength of the implication, measured as a conditional probability, must exceed.
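As a minimal sketch of the rule definition above, the confidence of a rule X → Y can be estimated from a set of experiments as the fraction of experiments containing X that also contain Y. The gene names, the four experiments and the helper names `support` and `confidence` are hypothetical and serve only as illustration; each experiment is assumed to have been reduced to the set of genes called "expressed".

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], itemset: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset.issubset(t))
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """Estimate P(consequent | antecedent) from relative frequencies."""
    s_x = support(transactions, antecedent)
    if s_x == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / s_x


# Hypothetical data: each set lists the genes expressed in one experiment.
experiments = [
    frozenset({"X1", "X2", "Y1"}),
    frozenset({"X1", "X2", "Y1"}),
    frozenset({"X1", "Y2"}),
    frozenset({"X1", "X2"}),
]

X = frozenset({"X1", "X2"})
Y = frozenset({"Y1"})
rule_support = support(experiments, X | Y)   # 2 of 4 experiments
rule_conf = confidence(experiments, X, Y)    # 2 of the 3 experiments with X
MIN_CONFIDENCE = 0.6                         # assumed user-defined threshold
rule_accepted = rule_conf >= MIN_CONFIDENCE
```

In this toy data the rule {X1, X2} → {Y1} holds in two of the three experiments that contain the antecedent, so it passes the assumed threshold of 0.6.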
The primary task is to identify frequent item sets, that is, sets of items that occur with at least a minimum frequency, called the minimum support. Once the frequent item sets are identified, the association rules are formulated. Different algorithms have been developed to accomplish this task, including Apriori, CHARM, FP-growth, CLOSET, Magnum Opus, etc. One of the greatest challenges in using association rule mining for gene expression data analysis is the immense number of theoretical rules that need to be considered. This is a consequence of all the possible combinations of gene expression under different conditions (temporal, spatial and experimental). In addition, not all association rules discovered within a transaction set are interesting or useful (Zheng et al. 2001).

However, genome sequencing projects and comparative biology tools have demonstrated that the biological function of many genes is conserved among different species, and a degree of similarity is expected in the intricate connections among many biological networks. The initial comparisons between complete eukaryotic genomes (yeast and worm) revealed a surprising fraction of gene orthology between them. About 12% of the worm (Caenorhabditis elegans) genes (~18,000) encoded proteins whose biological function could be inferred from their similarity to 27% of the yeast genome. Nearly 20% of the fly Drosophila melanogaster genes (~13,600) have putative orthologs in both the worm Caenorhabditis elegans and the yeast Saccharomyces cerevisiae. Therefore, current biological information may provide the mathematical models necessary to simplify and prune the number of rules generated by association rule mining algorithms. The mathematical representation of these biological processes has been demonstrated through different equations, including Boolean switching and graphical models (Wagner 2001). To develop the model, it is first necessary to consider the mechanisms of the gene expression process.
An organism is a system formed by genetic networks. The network is regulated through gene circuits of signaling pathways (Somogyi et al. 2001). The genetic constitution of an organism (K) is then represented by the total number of genes belonging to two different levels (Valdivia-Granda et al., 2002). There are constitutive genes that are expressed or repressed throughout the life span of an organism. Each of these genes is defined as a node. Node genes have similar levels of temporal and spatial expression and in many cases have orthologs in other species. They form the first gene group of our model, consist of genes (X1,...,Xn), and are represented as 1 or 0 according to whether the gene is expressed or not. The combination of node genes produces an input switch that activates the other level in our model (genes Y). These genes are expressed or repressed in a spatial and temporal fashion (specific tissue, organ, developmental stage and environmental condition). The genes belonging to this group are species and genotype specific. These groups of genes are regulated at different levels by the different combinations of input switches. The elements of these processes have direct or indirect, positive or negative effects on other processes and on themselves.

To develop the mathematical assumptions we define K = (X,Y) = n, where n is the total number of genes under consideration. The data from a series of M experiments involving n genes can be represented in the “superchip,” in which both gene expression and sequence are mapped simultaneously (Fig. 1). On one side of the “superchip,” an expression table E is formed with the n genes as columns and the M experiments as rows. We denote by F the maximum number of input switches required to compute a signaling pathway. This function is based on the number of occurrences of the node gene states {0,1} across the cube. We denote the circuit complexity (C) of F by C(f1,…,fm), where the inputs are X1,…,Xn and the outputs are f1(X1,…,Xn),…,fm(X1,…,Xn).
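The expression table E and the notion of an input switch can be illustrated with a short sketch. It assumes that expression has already been binarized, and the gene names, the assignment of genes to groups X and Y, and the table values are all hypothetical. The sketch counts how often each combination of node gene states occurs across the experiments and records which Y states were observed with it.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical expression table E: one row per experiment (M rows) and
# one column per gene (n columns); 1 = expressed, 0 = not expressed.
GENES = ["X1", "X2", "X3", "Y1", "Y2"]
NODE_GENES = ["X1", "X2", "X3"]   # assumed members of group X
TARGET_GENES = ["Y1", "Y2"]       # assumed members of group Y

E = [
    [1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 0, 1, 0, 0],
]

Switch = Tuple[int, ...]


def project(row: List[int], subset: List[str]) -> Switch:
    """Return the states of the genes in `subset` for one experiment."""
    return tuple(row[GENES.index(g)] for g in subset)


def input_switches(table: List[List[int]]) -> Counter:
    """Count the occurrences of each node gene combination (input switch)."""
    return Counter(project(row, NODE_GENES) for row in table)


def outputs_by_switch(table: List[List[int]]) -> Dict[Switch, Set[Switch]]:
    """Collect the Y states observed together with each input switch."""
    seen: Dict[Switch, Set[Switch]] = defaultdict(set)
    for row in table:
        seen[project(row, NODE_GENES)].add(project(row, TARGET_GENES))
    return dict(seen)


switch_counts = input_switches(E)
observed = outputs_by_switch(E)
# If every switch maps to a single Y state, the data are consistent with
# Y being a Boolean function of the node genes.
deterministic = all(len(states) == 1 for states in observed.values())
```

When an input switch is observed with more than one Y state, the table alone cannot be explained by a Boolean function of the node genes, which is one way unobserved variables would show up in such data.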
A rule set R (expression pattern) consists of a table of all possible variations of the n genes, and a rule is then a subset R ⊆ F. The rules have to be constructed from the assumptions underlying the reactions, namely that separate reactions are independent, that reactants and enzymes are all needed (AND), and that inhibitors, if present, must be absent (NOT). Without regard to the gene circuit complexity, let F be a collection of Boolean functions f1,…,fn. Grouping would lead to an OR relation inside each entry of the AND term, but it is disregarded in these first simple models. We define the minimum depth (D) as the maximum decrease of F in a chain C, where D[f1(X1,…,Xn), AND, NOT, OR, fm(X1,…,Xn)].

This process makes interrogation of the biological process faster and easier and supports the development of the “phylogenomic classification” “superchip.” This approach will allow us to draw evolutionary relationships, determine the functional effects of plant population variance, identify gene targets and their structural characteristics, and infer their effect on the different metabolic pathways involved in sensing and tolerating hypoxic and anoxic stress in plants. Phylogenomic comparison will illuminate those genes that play an important role in protein structure or gene regulation and show how alternative pathways are integrated. Phylogenomic comparisons will identify the genes that are crucial for the development of hypoxic stress in plants. P-tree technology has already demonstrated that it is possible to reduce the processing time needed for this task. As the data integration grows, the learning machine will generate a virtual chip containing the DNA sequences that need to be included in a microarray to interrogate a specific pathway.

Integrating gene expression and sequence in a multi-dimensional plane:

Each gene can receive several inputs from other genes or from itself.
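A small Boolean network makes the idea of genes receiving inputs from other genes and from themselves concrete. The sketch below is illustrative only: the three genes and their rules are hypothetical, required inputs are combined with AND and inhibitors enter through NOT as described earlier in this section, and a synchronous update (all genes read the same previous state) is an assumption, not something the model prescribes.

```python
from typing import Callable, Dict, List

State = Dict[str, int]

# Hypothetical rules: required inputs are ANDed, inhibitors enter via NOT.
RULES: Dict[str, Callable[[State], int]] = {
    "X1": lambda s: s["X1"],                   # self-input: X1 keeps its state
    "X2": lambda s: int(not s["Y1"]),          # Y1 inhibits X2 (negative feedback)
    "Y1": lambda s: int(s["X1"] and s["X2"]),  # Y1 needs both X1 AND X2
}


def step(state: State) -> State:
    """Synchronous update: every gene reads the same previous state."""
    return {gene: rule(state) for gene, rule in RULES.items()}


def trajectory(state: State, n_steps: int) -> List[State]:
    """Return the initial state followed by n_steps successive states."""
    states = [state]
    for _ in range(n_steps):
        states.append(step(states[-1]))
    return states


start = {"X1": 1, "X2": 1, "Y1": 0}
path = trajectory(start, 4)
```

With these rules the network returns to its starting state after four updates, a simple example of the oscillation that negative feedback can produce in a Boolean model.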
Gene expression patterns are represented as variables, while signaling functions are determined by gene sequence (Somogyi et al., 2001).

Considerations on the Use of Boolean Networks

Not all the variables that can influence gene expression are observable (Silvescu and Honavar, 1997).

The bSQ Format and the P-tree Data Structure for Microarray Data

The expression level of each gene is recorded indirectly by measuring the fluorescence emitted by each dye (red/green) attached to the cDNA. Each spot on the microarray that emits a signal is a pixel with a byte value ranging from 0 to 255. Different bits make different contributions to the values used for gene expression profiling. Therefore, a microarray image can be organized into eight separate bit planes in bit sequential (bSQ) format. The intensities of the two bands (red and green) are stored separately, each band as its own set of bSQ files, one file per bit position. The primary key attribute of the bSQ format consists of the pixel location (the x-y coordinates of the spot on the microarray) and its corresponding gene identification. The subsequent attributes consist of the bSQ values for each signal.

There are several reasons to use the bSQ format. First, different bits contribute to the intensity value to different degrees. In some applications we do not need all the bits, because the high-order bits give us enough information. Second, the bSQ format facilitates the representation of a precision hierarchy. Third, the bSQ format facilitates better data compression. This point is relevant for the integration of genomic data and for faster data mining applications. Fourth, and most importantly, the bSQ format facilitates the creation of an efficient, rich data structure called the Peano Count Tree (P-tree), which accommodates algorithm pruning based on a one-bit-at-a-time approach.

The Peano Count Tree is a lossless tree representation in which the root of a P-tree contains the 1-bit count of the entire bit-band representing the microarray spots.
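The bSQ decomposition described in this section can be sketched as follows. The function names `to_bsq` and `from_bsq` and the four 8-bit intensities are illustrative assumptions, not part of any published P-tree implementation. Each value is split into eight bit planes, most significant bit first, and values can be rebuilt from all planes (lossless) or from the high-order planes only (reduced precision).

```python
from typing import List, Optional


def to_bsq(values: List[int], n_bits: int = 8) -> List[List[int]]:
    """Split values into n_bits bit planes; plane 0 holds the most
    significant bit of every value, in the same order as `values`."""
    planes = []
    for bit in range(n_bits):
        weight = 2 ** (n_bits - 1 - bit)
        planes.append([(v // weight) % 2 for v in values])
    return planes


def from_bsq(planes: List[List[int]], keep: Optional[int] = None) -> List[int]:
    """Rebuild values from the first `keep` planes (all planes if None).
    Dropping low-order planes gives a coarser, lower-precision value."""
    n_bits = len(planes)
    keep = n_bits if keep is None else keep
    values = [0] * len(planes[0])
    for bit in range(keep):
        weight = 2 ** (n_bits - 1 - bit)
        for i, b in enumerate(planes[bit]):
            values[i] += b * weight
    return values


# Four illustrative 8-bit spot intensities from one band, in raster order.
band = [254, 127, 14, 193]
planes = to_bsq(band)

full = from_bsq(planes)             # all eight planes: original values
coarse = from_bsq(planes, keep=2)   # two high-order planes only
```

Keeping only the two high-order planes maps the intensities onto four coarse levels (0, 64, 128, 192), which is the kind of precision hierarchy the bSQ format is meant to expose.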
At the next level, each quadrant is partitioned into sub-quadrants, and their 1-bit counts in raster order constitute the children of the quadrant node. This construction is continued recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (1-by-1 sub-quadrant). Our approach is to recursively divide the entire image into quadrants and then record the count of 1-bits in each quadrant, thus forming a quadrant count tree (Fig. 1).

                            55                              depth=0  level=3
              ______________/ / \ \______________
             /          ___/   \___              \
            16         8            15            16        depth=1  level=2
                    / / | \       / | \ \
                   3  0 4  1     4  4  3  4                 depth=2  level=1
                 //|\    //|\          //|\
                 1110    0010          1101                 depth=3  level=0

    Fig. 1. Representation levels in a bSQ format of gene expression

We give a very simple illustrative example with only 2 data bands for a microarray image having only 2 rows and 2 columns (both decimal and binary representations are shown) (Fig. 2).

    BAND-1                               BAND-2
    254 (1111 1110)   127 (0111 1111)     37 (0010 0101)   240 (1111 0000)
     14 (0000 1110)   193 (1100 0001)    200 (1100 1000)    19 (0001 0011)

    BSQ format (2 files)
    Band 1: 254 127 14 193
    Band 2: 37 240 200 19

    bSQ format (16 files; each column is one file, pixels in raster order)
    B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
     1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
     0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
     0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
     1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1

    Fig. 2. Two bands of a 2-row-2-column image and their BSQ and bSQ formats

In the example of Fig. 1, the root is at depth 0 (labeled level 3). The numbers at the next level (depth 1, level 2), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants. Since the first and last quadrants are composed entirely of 1-bits (each is called a “pure 1 quadrant”), we do not need sub-trees for these two quadrants, so these branches terminate.
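The recursive quadrant construction can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class `PNode`, the function `build_ptree` and the 8-by-8 bit matrix are assumptions, and the matrix was chosen here only because its counts reproduce those shown in Fig. 1 (the bit matrix behind the original figure is not given in the text). The sketch assumes a square image whose side is a power of two.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PNode:
    count: int                      # number of 1-bits in this quadrant
    size: int                       # side length of the quadrant
    children: List["PNode"] = field(default_factory=list)  # empty if pure


def build_ptree(bits: List[List[int]], row: int = 0, col: int = 0,
                size: int = 0) -> PNode:
    """Build a Peano count tree for the square block of `bits` whose
    top-left corner is (row, col). Assumes a power-of-two side length."""
    if size == 0:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    node = PNode(count=count, size=size)
    if count == 0 or count == size * size:
        return node                 # pure quadrant: the branch terminates
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
        node.children.append(build_ptree(bits, row + dr, col + dc, half))
    return node


# Hypothetical 8x8 bit plane whose counts match the tree in Fig. 1.
IMAGE = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]
bits = [[int(ch) for ch in line] for line in IMAGE]

root = build_ptree(bits)
quadrant_counts = [child.count for child in root.children]
```

The two pure-1 quadrants have no children, so only the two mixed quadrants are expanded further, which is where the compression of the structure comes from.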
Similarly, quadrants composed entirely of 0-bits are called “pure 0 quadrants,” and their branches also terminate. This pattern is continued recursively using the Peano or Z-ordering of the four sub-quadrants at each new level. Every branch terminates eventually (at the “leaf” level, each quadrant is a pure quadrant). If we were to expand all sub-trees, including those for pure quadrants, the leaf sequence would be just the Peano ordering (or Z-ordering) of the original raster image. Thus, we use the name Peano Count Tree (P-tree). This structure provides the compression and the embedded information needed for genomic data mining.

The P-trees defined above can be combined using simple logical operations [AND, NOT, OR, COMPLEMENT] to produce additional P-trees from the original values for each band, b, and value, v, where v can be expressed in 1-bit, 2-bit, ..., or 8-bit precision (Perrizo et al. 2001; Perrizo et al. 2001a). Using this approach we derive expression P-trees (EP-trees) and repression P-trees (RP-trees), defined by the red/green intensities of the spots that are significantly above or below those of the reference genes spotted on a microarray. P-trees are lossless, compressed data structures that can be used to construct a “super chip” derived from multiple experiments (in our case, data generated by researchers of the Virtual Center for Hypoxic and Anoxic Research, www.ndsu.edu/virtual-genomics).

The Biologic Research Application and Information Network (BRAIN) integrates different sources of gene expression data using a distributed data system (DDS). The DDS is a Java system that uses JDBC and XML to connect disparate databases and allow them to be queried as one system. DDS allows the integration of different data sources and the translation of different data formats, such as CSV (comma-separated values) files, XML, or the tables of relational databases.
Through drag and drop, the user can associate input from various sources with existing tables, and the system then generates the SQL needed to insert the data. The system allows each researcher to keep and develop their own database and does not require changes to existing database schemas. Thanks to BRAIN, gene expression data is represented in a multidimensional fashion.

DNA microarray technology is used by different laboratories to measure the temporal and spatial gene expression of different organisms subjected to diverse experimental settings. As the data grow, the need for public repositories and integrative architectures is becoming evident. In this paper we presented the bSQ format and P-tree technology for generating lossless, “data mining ready” representations of microarray images. We take into consideration the complexity of biological systems and assume that genes can be separated into two main groups and that association rule mining can be applied to gene expression data analysis. However, bSQ and P-tree technology can also be used with other classification or clustering techniques.

References:

Hipp, J., Güntzer, U., Nakhaeizadeh, G. 2000. Algorithms for association rule mining – A general survey and comparison. SIGKDD Explorations 2(1):58-64.
Perrizo, W. 2001. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
Perrizo, W., Ding, Q., Ding, D., Roy, A. 2001. On Mining Satellite and Other Remotely Sensed Images. DMKD. 33-44.
Silvescu, A., Honavar, V. 1997. Temporal Boolean Network Models of Genetic Networks and their Inference from Gene Expression Time Series. Complex Systems. 11:1+1.
Valdivia-Granda, W.A., Perrizo, W., Larson, F., Deckard, E.L. 2002. Peano Count Trees (PTrees) and Rule Association Mining for Gene Expression Profiling of DNA Microarray Data. Proc. International Conf. Bioinformatics.
Zheng, Z., Kohavi, R., Mason, L. 2001. Real World Performance of Association Rule Algorithms. KDD-2001.