Boolean Networks, Rule Association Mining and the Peano Count Trees (PTrees) for Gene Expression Profiling
Valdivia Granda, W.A.1,*; Perrizo, W.2; Deckard, E.L.1
1 Plant Abiotic Stress, Genomics and Bioinformatics Group, Department of Plant Sciences
2 Department of Computer Science
3 Information Technology Services
North Dakota State University, Fargo, ND 58105, United States of America
* Corresponding author: Willy.Valdivia@ndsu.nodak.edu
Abstract
Gene expression profiles and signaling pathways are shaped by several positive and
negative inputs that alter the expression of other genes and, consequently, of signaling pathways. The
genetic constitution of an organism (K) is represented by the total number of genes belonging to two
different groups. Group X comprises genes (X1,...,Xn) that are represented as 1 or 0
depending on whether the gene is expressed or not. The function of many of these genes is
conserved among organisms. The second group, the Y genes (Y1,…,Yn), are expressed or repressed at
different levels. However, the genes of group Y are species-specific and are modulated by the
products and combinations of the genes of group X. This approach allows us to infer
evolutionary relationships and to apply Boolean biological models and association rule mining
methods. Since many gene functions are conserved and similar pathways exist among
diverse species, we devised the “super chip.” The super chip contains data from microarray
experiments on different organisms arranged in a multi-dimensional plane in which sequence and
temporal and spatial gene expression are represented. In the super chip we represent each spot of
a microarray as a pixel with its corresponding red and green ratios. Association rule
mining is proposed as a means to derive meaningful rules of gene interaction. We propose to
organize signal transduction pathways taking into consideration signaling functions and
evolutionary relationships. For us, the genetic
Introduction:
The rediscovery of Mendel's laws of heredity in the opening of the 20th century sparked a
scientific quest to understand the nature and content of genetic information. Living systems are
very complex and their understanding will remain a challenge for biologists, mathematicians,
computer scientists and biological modelers. Identification of genetic regulatory networks and
genetic signal transduction pathways is one of the key problems in molecular biology (Silvescu
and Hanovar, 1997). In the past two decades we have witnessed rapid advances in the
development of more accurate, sensitive and powerful devices for gene sequencing, gene
expression measurement and protein structural analysis.
Biological systems are governed by complex signaling networks, and the molecular interactions
within them are studied in different organisms using DNA microarrays. Combining the information
from microarray experiments performed in different species and experimental settings makes it
possible to identify ancestral and key players in the functioning of biological networks. From this
information one can try to estimate the complexity of the whole network and, to a certain degree,
the amount of cross talk in biochemical pathways (Wagner et al., 2001). DNA microarrays have
produced a paradigm shift that is transforming the experimental methods used to understand
changes in gene expression, and they are becoming one of the most widely used techniques for the
assessment of gene expression. Their small size, high density, and compatibility with fluorescence
labeling make microarray technology ideal for comparative analysis between the control and
labeled material in a parallel fashion.
The advances in genomic data collection have led to the development of novel genetic
models that try to analyze the system as a whole, in which pleiotropic and multigenic interactions
can be identified (Silvescu and Hanovar, 1997). Various statistical and computational techniques
have been proposed. These range from simple criteria that define a fold-change cut-off as a
change in gene expression to complex unsupervised and supervised analyses. Nevertheless, no
technique has yet been widely adopted. Thanks to gene expression measurement with microarray
technology we can start to ask how the form and function of networks have changed during
evolution. However, to develop our model we first need to establish certain definitions.
This paper centers on the development of a model in which the components of the
network act simultaneously and provide positive and negative feedback. The genome of an
organism is constituted by two groups of genes. The first can be represented in Boolean
form: the value is one if the gene is expressed and zero if it is not.
Boolean Networks and Association Rule Mining Techniques for Microarray Data
The functional state of an organism is a dynamic process in which the cell must be able
to sense complex external stimuli through different mechanisms and trigger the activation of
different signaling pathways. Recent developments in gene expression analysis using DNA
microarray technology provide an approach for exploring genomic structure and gene
function and for understanding the implications of the temporal and spatial expression levels of
thousands of genes. Microarray technology supplies three basic kinds of information: i) gene
expression, ii) the level of gene expression or repression and iii) depending on whether supervised
or unsupervised techniques are used, the interaction of a particular gene with other genes. Various
creative computational and statistical methods have been proposed and developed by both public and
private initiatives. These tools range from simple criteria that define gene expression changes by a
fold-change cut-off to complex analyses using machine learning approaches. Nevertheless, none
has yet gained widespread acceptance.
It has been argued that clustering methods for gene expression analysis have significant
drawbacks that make them unsuitable for detecting complex relationships, especially when the
size and intricacy of the data grow dynamically. Strict phylogenetic trees are best suited to
true hierarchical descent (such as the evolution of species) and are not designed to reflect
the multiple ways in which gene expression patterns can be similar. Clustering is successful if
there is already a wealth of knowledge about the pathway in question. However, it is less successful
when this knowledge is sparse (Fiehn et al. 2001; Quackenbush, 2001). Without a biological basis to
interpret these results there is no way to decide which grouping is right and which is wrong.
In addition, clustering methods impose structures whether or not they exist in the data.
Consequently, the expression vector that represents the cluster might no longer represent any of
the genes in the cluster (Tamayo et al. 1999; Quackenbush, 2001). Finally, clustering techniques
group each gene with other genes that have a similar level of expression. However, many genes
with similar expression levels belong to one pathway, while a gene that is part of one pathway may
also form part of other biological processes.
To help elucidate the functional relationships among genes, association rule mining
presents an interesting alternative. Association rule mining is a widely used data mining technique
with applications in market basket research, insurance fraud investigation, climate prediction and
remote sensing research (Valdivia-Granda et al., 2002). An association rule is a relationship (X
⇒ Y) where X is the antecedent item set and Y is the consequent item set. In mining for these
rules the user defines a threshold, called the minimum confidence, which the implication, measured
in terms of a conditional probability, must exceed. The primary task is to identify the frequent
item sets, that is, sets of items occurring with at least a minimum frequency, called the minimum
support. Once the frequent item sets are identified, the association rules are formulated. To
accomplish this task different algorithms have been developed, including Apriori, CHARM, FP-growth,
CLOSET and Magnum Opus, among others. One of the greatest challenges in using association rule
mining for gene expression data analysis is the immense number of theoretical rules that need to be
considered. This is a consequence of all the possible combinations of gene expression under
different conditions (temporal, spatial and experimental). In addition, not all association rules
discovered within a transaction set are interesting or useful (Zheng et al. 2001). However, genome
sequencing projects and comparative biology tools have demonstrated that the biological function of
many genes is conserved among different species, so a degree of similarity is expected
in the intricate connections among many biological networks. The initial comparisons
between complete eukaryotic genomes (yeast and worm) revealed a surprising fraction of gene
orthology between them. About 12% of the worm (Caenorhabditis elegans) genes (~18,000)
encode proteins whose biological function could be inferred from their similarity to 27% of the
yeast genome. Nearly 20% of the fly (Drosophila melanogaster) genes (~13,600) have putative
orthologs in both the worm Caenorhabditis elegans and the yeast Saccharomyces cerevisiae. Therefore
current biological information may provide the mathematical models necessary to simplify and
prune the number of rules generated by association rule mining algorithms.
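To make the support and confidence thresholds concrete, the following Python sketch computes both measures for a toy set of binarized experiments. The gene labels, the five transactions and the threshold values are invented for illustration, and the exhaustive search shown here merely stands in for Apriori-style algorithms, which prune candidate item sets instead of enumerating all of them.

```python
from itertools import combinations

# Hypothetical data: each experiment lists the genes scored as expressed (1).
experiments = [
    {"X1", "X2", "Y1"},
    {"X1", "X2", "Y1"},
    {"X1", "Y2"},
    {"X2", "Y1"},
    {"X1", "X2"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the item set."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def mine_rules(transactions, min_support, min_confidence):
    """Exhaustive rule search; only practical for very small item sets."""
    items = sorted(set().union(*transactions))
    rules = []
    for size in range(2, len(items) + 1):
        for itemset in combinations(items, size):
            if support(itemset, transactions) < min_support:
                continue  # not a frequent item set
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    consequent = tuple(i for i in itemset if i not in antecedent)
                    if confidence(antecedent, consequent, transactions) >= min_confidence:
                        rules.append((antecedent, consequent))
    return rules

rules = mine_rules(experiments, min_support=0.4, min_confidence=0.7)
for antecedent, consequent in rules:
    print(antecedent, "=>", consequent)
```

Even with four items the number of candidate rules grows quickly, which is the combinatorial problem that biological prior knowledge is meant to prune.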
The mathematical representation of these biological processes has been demonstrated
through different formalisms, including Boolean switching and graphical models (Wagner 2001).
To develop the model, it is first necessary to consider the mechanisms of the gene expression
process. An organism is a system formed by genetic networks. The network is regulated
through gene circuits of signaling pathways (Somogyi et al. 2001). The genetic constitution
of an organism (K) is then represented by the total number of genes belonging to two different levels
(Valdivia-Granda et al., 2002). There are constitutive genes that are expressed or repressed during
the life span of an organism. Each of these genes is defined as a node. Node genes have a similar
level of temporal and spatial expression, and in many cases they are orthologs of genes in other
species. They are the first gene group of our model, constituted by genes (X1,...,Xn), and are
represented as 1 or 0 according to whether the gene is expressed or not. The combination of node
genes produces an input switch that activates the other level in our model (genes Y). These genes
are expressed or repressed in a spatial and temporal fashion (specific tissue, organ, developmental
stage and environmental condition). The genes belonging to this group are species- and
genotype-specific. These groups of genes are regulated at different levels by the different
combinations of input switches. The elements of these processes have direct or indirect, positive or
negative effects on other processes and on themselves.
To develop the mathematical assumptions we define K = (X,Y) = n, where n is the total number of
genes under consideration. The data from a series of M experiments involving n genes can be
represented in the “superchip”, in which both gene expression and sequence are mapped
simultaneously (Fig. 1). On one side of the “superchip”, an expression table E is formed with the
n genes as columns and the M experiments as rows. We denote by F the maximum number of input
switches required for computing a signaling pathway. This function is based on the number of
occurrences of the node genes {0,1} across the cube. We denote the circuit complexity (C) of F by
C(f1,…,fm), where the inputs are X1,…,Xn and the outputs are f1(X1,…,Xn),…,fm(X1,…,Xn). The rules R
(expression patterns) consist of a table of all possible variations of the n genes, so that the
rule set is a subset R ⊆ F.
The rules have to be constructed from the assumptions underlying the reactions, namely that
separate reactions are independent, that reactants and enzymes are all needed (AND) and that
inhibitors, if any are used, must be absent (NOT).
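The AND/NOT construction of a rule can be illustrated with a short Python sketch. The helper names make_rule and truth_table and the wiring (X1 and X2 as required inputs, X3 as an inhibitor) are assumptions made for this example only and are not taken from the model above.

```python
from itertools import product

def make_rule(activators, inhibitors=()):
    """Build a Boolean rule: every activator is on AND no inhibitor is on."""
    def rule(state):
        needed = all(state[g] for g in activators)
        blocked = any(state[g] for g in inhibitors)
        return int(needed and not blocked)
    return rule

def truth_table(rule, genes):
    """Evaluate the rule for every 0/1 combination of the input genes."""
    rows = []
    for bits in product((0, 1), repeat=len(genes)):
        state = dict(zip(genes, bits))
        rows.append((bits, rule(state)))
    return rows

# Hypothetical output gene f1 driven by node genes X1 and X2, inhibited by X3.
f1 = make_rule(activators=("X1", "X2"), inhibitors=("X3",))
table = truth_table(f1, ("X1", "X2", "X3"))
for bits, out in table:
    print(bits, "->", out)
```

With n input genes the table has 2^n rows, which is why the number of input switches per pathway matters for the size of the rule table.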
Without regard to the gene circuit complexity, let F be a collection of Boolean functions f1,…,fn.
Grouping would lead to an (OR) relation inside each entry, but this will be disregarded for the
first simple models. We define the minimum depth (D) as the maximum decrease of F in a
chain C, where D[f1(X1,…,Xn), AND, NOT, OR, fm(X1,…,Xn)].
This process makes interrogation of the biological process faster and easier and supports the
development of the “phylogenomic classification” “superchip”. This approach will allow us to
draw evolutionary relationships, to determine the functional effects of plant population variance,
to identify gene targets and their structural characteristics, and to infer their effect on the
different metabolic pathways involved in sensing and tolerance of hypoxic and anoxic stress in
plants. Phylogenomic comparison will illuminate those genes that play an important role in
protein structure or gene regulation and show how alternative pathways are integrated. Phylogenomic
comparisons will also identify the genes that are crucial for the development of hypoxic stress in
plants. P-tree technology has already demonstrated that it is possible to reduce the
processing time needed to achieve this task. As the data integration grows, the learning machine
will generate a virtual chip containing the DNA sequences that need to be included in a microarray
to interrogate a specific pathway.
Integrating gene expression and sequence in a multi-dimensional plane:
Each gene can receive several inputs from other genes or from itself. Gene expression patterns are
represented as variables, while signaling functions are determined by gene sequence (Somogyi et
al., 2001).
Considerations on the use of Boolean networks
Not all the variables that can influence gene expression are observable (Silvescu and Hanovar,
1997).
The bSQ Format and the P-tree Data Structure for Microarray Data
The expression level of each gene is recorded indirectly by measuring the
fluorescence level emitted by each dye (red/green) attached to the cDNA. Each spot on the
microarray emitting a signal is a pixel with a byte value ranging from 0 to 255. Different bits
make different contributions to the values that are used for gene expression profiling. Therefore, a
microarray image can be organized into the bit sequential (bSQ) format, in which each of the 8 bits
of a band is stored in a separate file. With two bands (red/green) this gives 16 bSQ files. The
primary key attribute of the bSQ format consists of the pixel location (x-y coordinates of the spot
on the microarray) and its corresponding gene identification. The subsequent attributes consist of
the bSQ values for each signal. There are several reasons to use the bSQ format. First, different
bits have different degrees of contribution to the intensity value. In some applications, we do not
need all the bits because the high-order bits give us enough information. Second, the bSQ format
facilitates the representation of a precision hierarchy. Third, the bSQ format facilitates better
data compression. This point is relevant for the integration of genomic data and for faster data
mining applications. Fourth, and most importantly, the bSQ format facilitates the creation of an
efficient, rich data structure called the Peano Count Tree (P-tree) that accommodates algorithm
pruning based on a one-bit-at-a-time approach. The Peano Count Tree is a lossless tree
representation in which the root of a P-tree contains the 1-bit count of the entire bit-band
representing the microarray spots. The next level contains the 1-bit counts of the four quadrants.
At the level after that, each quadrant is partitioned into sub-quadrants and their 1-bit counts in
raster order constitute the children of the quadrant node. This construction is continued
recursively down each tree path until the sub-quadrant is pure (entirely 1-bits or entirely
0-bits), which may or may not be at the leaf level (1-by-1 sub-quadrant). Our approach is to
recursively divide the entire image into quadrants and then record the count of 1-bits in each
quadrant, thus forming a quadrant count tree (Fig. 1).
depth=0, level=3: 55
depth=1, level=2: 16, 8, 15, 16
depth=2, level=1: 3, 0, 4, 1 (children of 8); 4, 4, 3, 4 (children of 15)
depth=3, level=0: 1110 (under the first 3), 0010 (under 1), 1101 (under the second 3)
Fig. 1. Representation of gene expression levels in a bSQ format
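The recursive quadrant counting can be sketched in a few lines of Python. The 8-by-8 bit matrix below is an assumption: it was chosen so that its counts reproduce the numbers quoted for Fig. 1 (55 at the root, then 16, 8, 15 and 16), and it is not necessarily the image behind the original figure. The function name build_ptree and the tuple layout are likewise illustrative.

```python
# Hypothetical 8x8 bit-band whose quadrant counts match those of Fig. 1.
IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

def build_ptree(bits, row=0, col=0, size=None):
    """Return (count, children) for the square block starting at (row, col).

    children is None for a pure block (all 0s or all 1s); otherwise it is a
    list of four sub-trees in Peano (Z) order: upper-left, upper-right,
    lower-left, lower-right. The side length must be a power of two.
    """
    if size is None:
        size = len(bits)
    count = sum(bits[r][c]
                for r in range(row, row + size)
                for c in range(col, col + size))
    if count == 0 or count == size * size:
        return (count, None)  # pure quadrant: the branch terminates here
    half = size // 2
    children = [
        build_ptree(bits, row, col, half),
        build_ptree(bits, row, col + half, half),
        build_ptree(bits, row + half, col, half),
        build_ptree(bits, row + half, col + half, half),
    ]
    return (count, children)

root = build_ptree(IMAGE)
print(root[0], [child[0] for child in root[1]])
```

Because pure quadrants are stored as a single count, large uniform regions of a bit-band cost one node regardless of their size.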
We give a very simple illustrative example with only 2 data bands for a microarray image having
only 2 rows and 2 columns (both decimal and binary representation are shown) (Fig. 2).
BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)
BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)
BSQ format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19
bSQ format (16 files); each row is one pixel, in raster order
B11 B12 B13 B14 B15 B16 B17 B18 B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0   0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1   1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0   1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1   0   0   0   1   0   0   1   1
Fig. 2. Two bands of a 2-row-2-column image and its bSQ formats
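The decomposition shown in Fig. 2 can be reproduced with a short Python sketch. The band values are those of the figure; the function names to_bsq and from_bsq are hypothetical, and in-memory lists stand in for the separate bit files.

```python
def to_bsq(band, bits=8):
    """Split a band of 0..255 intensities into `bits` bit files.

    Entry j-1 of the result holds bit j of every pixel, where j=1 is the
    most significant bit (B11 for band 1, B21 for band 2).
    """
    return [[(value >> (bits - j)) & 1 for value in band]
            for j in range(1, bits + 1)]

def from_bsq(bit_files):
    """Rebuild the intensities from the bit files (the format is lossless)."""
    bits = len(bit_files)
    pixels = len(bit_files[0])
    return [sum(bit_files[j][i] << (bits - 1 - j) for j in range(bits))
            for i in range(pixels)]

band1 = [254, 127, 14, 193]  # raster order, as in Fig. 2
band2 = [37, 240, 200, 19]

b1 = to_bsq(band1)
b2 = to_bsq(band2)
print("B11:", b1[0])
print("B28:", b2[7])
```

Keeping only the first few bit files of a band gives a coarser view of the intensities, which is the precision hierarchy mentioned above.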
In the example of Fig. 1, the root (depth 0, level 3) holds 55, the 1-bit count of the entire
image. The numbers at the next level (level 2), 16, 8, 15 and 16, are the 1-bit counts for the four
major quadrants. Since the first and last quadrants are composed entirely of 1-bits (called “pure 1
quadrants”), we do not need sub-trees for these two quadrants, so these branches terminate.
Similarly, quadrants composed entirely of 0-bits are called “pure 0 quadrants” and also terminate.
This pattern is continued recursively using the Peano or Z-ordering of the four sub-quadrants at
each new level. Every branch terminates eventually (at the “leaf” level, each quadrant is a pure
quadrant). If we were to expand all sub-trees, including those for pure quadrants, then the leaf
sequence would be just the Peano-ordering (or Z-ordering) of the original raster image. Thus, we use
the name Peano Count Tree (P-tree). This structure provides the compression and embedded information
needed to do genomic data mining. The P-trees defined above can be combined using simple logical
operations [AND, NOT, OR, COMPLEMENT] to produce additional P-trees from the original values for
each band, b, and value, v, where v can be expressed in 1-bit, 2-bit,.., or 8-bit precision (Perrizo
et al. 2001; Perrizo et al. 2001a). Using this approach we derive expression P-trees (EP-trees) and
repression P-trees (RP-trees), defined by the red/green intensities of the spots that are
significantly above or below those of the reference genes spotted on the microarray.
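As a rough illustration of how logical operations combine bit-bands into a value mask, the Python sketch below works on flat bit lists rather than on compressed trees, which is a simplification: actual P-tree operations act on the tree nodes directly. The helper names are hypothetical, and B11 and B12 are the two highest-order bit files of band 1 in Fig. 2.

```python
def p_and(a, b):
    """Bitwise AND of two equally long bit lists."""
    return [x & y for x, y in zip(a, b)]

def p_or(a, b):
    """Bitwise OR of two equally long bit lists."""
    return [x | y for x, y in zip(a, b)]

def p_not(a):
    """Complement of a bit list."""
    return [1 - x for x in a]

def prefix_mask(bit_files, prefix):
    """Mark the pixels whose leading bits equal `prefix`, e.g. "11".

    bit_files must be ordered from the most significant bit downwards.
    """
    mask = [1] * len(bit_files[0])
    for bit_file, wanted in zip(bit_files, prefix):
        mask = p_and(mask, bit_file if wanted == "1" else p_not(bit_file))
    return mask

B11 = [1, 0, 0, 1]  # pixels 254, 127, 14, 193
B12 = [1, 1, 0, 1]

high = prefix_mask([B11, B12], "11")  # band-1 intensity of the form 11xxxxxx
low = prefix_mask([B11, B12], "00")   # band-1 intensity of the form 00xxxxxx
print("high:", high, "count:", sum(high))
print("low: ", low, "count:", sum(low))
```

The sum of a mask corresponds to the root count of the resulting P-tree, that is, the number of spots with the requested 2-bit value.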
P-trees are lossless, compressed data structures that can be used to construct a “super
chip” derived from multiple experiments (in our case, data generated by researchers of the
Virtual Center for Hypoxic and Anoxic Research, www.ndsu.edu/virtual-genomics). The Biologic
Research Application and Information Network (BRAIN) integrates different sources of gene
expression data using a distributed data system (DDS). The DDS is a Java system that makes use of
JDBC and XML to connect disparate databases and allow them to be queried as one system. DDS
allows the integration of different data sources and the translation of different data formats such
as CSV (comma-separated values) files, XML or the tables of relational databases. Using
drag and drop, the user associates input from various sources with existing tables,
and the system then generates the SQL needed to insert the data. The system allows each researcher
to keep and develop their own database and does not require changes to existing database
schemas. Thanks to BRAIN, gene expression data is represented in a multidimensional fashion.
DNA microarray technology is used by different laboratories to measure the
temporal and spatial gene expression of different organisms subjected to diverse experimental
settings. As the data grows, the need for public repositories and integrative architectures is
becoming evident. In this paper we have presented the bSQ format and P-tree technology to generate
lossless and “data mining ready” representations of microarray images. We take into consideration
the complexity of biological systems and assume that genes can be separated into two
main groups and that association rule mining can be applied to gene expression data analysis.
However, the bSQ format and P-tree technology can also be used with other classification or
clustering techniques.
References:
Hipp, J., Güntzer, U., Nakhaeizadeh, G. 2000. Algorithms for association rule mining – A general
survey and comparison. SIGKDD Explorations 2(1):58-64.
Silvescu, A., Hanovar, V. 1997. Temporal Boolean Network Models of Genetic Networks and
their Inference from Gene Expression Time Series. Complex Systems. 11:1+1.
Perrizo, W., Ding, Q., Ding, D., Roy, A. 2001. On mining satellite and other Remotely Sensed
Images. DMKD. 33-44.
Perrizo, W. 2001. Peano Count Tree Technology. Technical Report NDSU-CSOR-TR-01-1.
Valdivia-Granda, W.A., Perrizo, W., Larson, F., Deckard, E.L. 2002. Peano Count Trees (PTrees) and Rule Association Mining for Gene Expression Profiling of DNA Microarray Data.
Proc. International Conf. Bioinformatics.
Zheng, Z., Kohavi, R., Mason, L. 2001. Real World Performance of Association Rule Mining.
KDD-2001.