Data Mining of Microarray Data using Association Rule Mining

advertisement
Gene Expression Profiling of DNA
Microarray Data using Peano Count
Trees (P-Trees)
P. Kotala1, A. Perera1, J. Kai Zhou1, S.
Mudivarthy3, W. Perrizo1, E. Deckard2
1
Department of Computer Science,
2
Department of Plant Sciences,
3
Department of Statistics
North Dakota State University. Fargo, ND 58105, USA
 701-231-6257
Pratap.Kotala@ndsu.nodak.edu
Abstract
The explosion of genomic data made possible by
advances in parallel, high-throughput technologies in the
area of molecular biology, has ushered in a new era in the
area of Bioinformatics. During the last many years,
efforts concentrated on sequencing the genome of
organisms. Current emphasis lies in extracting
meaningful information from this huge DNA sequence
and expression data. The techniques currently employed
to do analysis of microarray expression data is clustering
and classification. These techniques present their own
limitations as to the amount of useful information that
can be derived. In this paper, we propose a new approach
to data mining the microarray data using new data mining
technology called Peano Count Tree (P-tree)1. This
technology employs Association Rule Mining as means
to do data mining of the microarray data.
Association Rule Mining is an advanced data
mining technique that is useful in deriving meaningful
rules from a given data. We propose using Association
Rule Mining to derive meaningful rules from microarray
expression data. Our approach proposes a new
microarray data mining technology, which involves a
"Data Mining Ready" data structure, called Peano count
tree (P-tree), to measure gene expression levels. The
method involves treating the microarray data as spatial
P-tree technology is patented to North Dakota State University
Proc. of the Virt. Conf. in Genom. and Bioinf.©
www.ndsu.edu/virtual-genomics
North Dakota State University, USA
Octuber, 15-16, 2001.
data. Each spot on the microarray is presented as a pixel
with corresponding red and green ratios. The microarray
data is reorganized into an 8-bit bSQ file (where each
attribute or band is stored as a separate file). Each bit is
then converted in a quadrant base tree structure P-tree
from which a data cube is constructed and meaningful
rules readily obtained.
Keywords: Data Mining, Association Rule Mining,
DNA Microarrays, Gene expression, P-Tree.
Introduction
DNA microarray technology is a powerful means
for exploring genomes of organisms. It is an important
tool for monitoring and analyzing gene expression
profiles of thousands of genes simultaneously. Their
small size, high densities and compatibility with
fluorescent labeling make microarray technology a
widely used technique in the area of molecular genetics.
Microarray technology provides an economic, robust,
fully automated approach toward examining and
measuring temporal and spatial global gene expression
changes [4]. Although fundamental differences exist
between the two microarray technologies - cDNA
microarrays and Oligonucleotide arrays, their strength
lies in the massive parallel analysis of thousands of genes
simultaneously. The microarray data yields valuable
information
about
gene
functions,
inter-gene
dependencies and underlying biological processes [1,2,3].
Such information may help discover gene regulation
pathways, metabolic pathways, relation of genes with
their environments etc.,
The exponential growth in genomic data
accumulation possesses the challenge to develop analysis
procedures to be able to interpret useful information from
this data. And these analytical procedures are broadly
called as "Data Mining". Data mining (also known as
Knowledge Discovery in Databases - KDD) has been
defined as "The nontrivial extraction of implicit,
previously unknown, and potentially useful information
from data"[10] The goal of data mining is to automate the
process of finding interesting patterns and trends from a
give data. Several data mining methodologies have been
proposed to analyze large amounts of gene expression
data. Most of these techniques can be broadly classified
as Cluster analysis and Classification techniques. These
techniques have been widely used to identify groups of
genes sharing similar expression profiles and the results
obtained so far have been extremely valuable. [3, 6, 7, 9]
However, the metrics adopted in these clustering
techniques have discovered only a subset of relationships
among the many potential relationship possible between
the transcripts. [8]. Clustering can work well when there
is already a wealth of knowledge about the pathway in
question, but it works less well when this knowledge is
sparse [17]. The inherent nature of clustering and
classification methodologies makes it less suited for
mining previously unknown rules and pathways. We
propose using another technique - Association Rule
Mining for mining microarray data and it is our
understanding that Association-rule mining can mine for
rules that will help in discovering new pathways
unknown before.
Association rules, proposed by Agarwal, Imielinski,
and Swami [11], are a class of simple but powerful
regularities in binary data. An association rule is an
expression of the form X  Y, where X and Y are sets of
items. The intuitive meaning of such a rule is that in the
rows of the database where the attributes in X have value
true, also the attribute Y has value true with high
probability [12]. The problem is to mine for rules that
satisfy user-specified minimum support and minimum
confidence. There are hundreds of association rules in a
given data set depending on its size and complexity. The
process of mining for such rules is called Association
Rule Mining. Much of the genomic research requires the
mining of microarray data for rules, implications, class
structures and cluster-structures. In this paper we discuss
the use of a data mining technology which involves a
"Data Mining Ready" data structure, called Peano
count tree (P-tree), to mine microarray data.
The paper is organized into five sections. Section 2
provides some
background information about
Association Rule Mining and it’s relevance to the
microarray data. Peano Count Tree (P-tree) is discussed
in Section 3. Section 4 details how to derive association
rules for microarray data using P-trees and related
pruning techniques. The conclusions and proposed work
are discussed in Section 5.
Association Rule Mining for Microarray
Data
Association-rule mining is a widely used technique
for large-scale data mining. Originally, proposed for
Market Basket data to study consumer-purchasing
patterns in retail stores, it has potential applications in
many areas. Microarray data is one of the promising
application areas. It is a very complex and highly
interlinked data, as a spot in microarray not only provides
information about it’s intensity of expression but also it’s
interaction with other genes. Extracting interesting
patterns and rules from microarray dataset, can be of
importance in identifying gene regulation pathways
where expression of certain genes depends on expression
of other genes. It would also be possible to extract rules
hitherto unknown, which can have significant biological
implications.
An association rule is a relationship of the form
X  Y where X is the antecedent item set and Y is the
consequent itemset. An example of the rule can be,
"customers who purchase an item X are very likely to
purchase another item Y at the same time. "There are
primarily two measures of quality for each rule, support
and confidence. The rule X  Y has support s% in the
transaction set T if s% of transactions in T contains X 
Y. The rule has confidence c% if c% of transactions is T
that contain X also contain Y. The goal of association
rule mining is to find all the rules with support and
confidence exceeding some user specified thresholds.
Microarray data format is very similar to the Market
Basket data format. The data mining model for Market
Research dataset can be treated as a relation R(Tid,
i1,........in) where Tid is the transaction identifier and i1.....
in denote the feature attributes - all the items available for
purchase from the store. Transactions constitute the rows
in the data-table whereas itemsets form the columns. The
values for the itemsets for different transactions are in
binary representation; 1 if the item is purchased and 0 if
not purchased. The microarray data is currently
represented as a relation R(Gid, T 1,.....Tn) where Gid is
the gene identification for each gene and T 1....Tn are the
various kinds of treatments to which the genes were
exposed. The genes constitute the rows in the data table
where as treatments are the columns. The values are in
the form of normalized Red/Green color ratios
representing the abundance of transcript for each spot on
the microarray. This table can be called as a "Gene
table". Currently the data mining techniques - clustering
and classification is being applied to the Gene table. In
clustering and classification techniques, dataset is divided
into clusters/classes by grouping on the rows (genes).
We propose a new kind of data format called
"Treatment table" formed by flipping the gene table.
The relation R of a Treatment table can be represented as
R(Tid, G1,.....Gn) where Tid represents the treatment ids
and G1...Gn are the gene identifiers. Treatment table
provides a convenient means to treat genes as a spatial
data. The goal here is to mine for rules among the genes
by associating the columns (genes) in the Treatment
table. Treatment table can be viewed as a 2-dimensional
11
11
11
11
11
11
11
01
11
11
11
11
11
11
11
11
11
10
11
11
11
11
11
11
00
00
00
10
11
11
11
11
P-tree
55
__________/ / \ \__________
/
___ / \___
\
/
/
\
\
16
____8__
_15__ 16
/ / |
\
/ | \ \
3 0 4 1
4 4 3 4
//|\
//|\
//|\
1110
0010
1101
PM-tree
m
____________/ / \ \___________
/
___ / \___
\
/
/
\
\
1
____m__
_m__
1
/ / |
\
/ | \ \
m 0 1 m
1 1 m 1
//|\
//|\
//|\
1110
0010
1101
Figure 1: P-tree and PM tree for 8x8 image
array of gene expression values. Treatment table can be
organized into a new spatial format called bit Sequential
(bSQ) proposed by Qin Ding, Qiang Ding and William
Perrizo [14].
Bit Sequential (bSQ) is a new data format for
representing the spatial data. The Red/Green ratios for
each gene, which represents the gene expression level,
can be represented as a byte. The bSQ format breaks each
of this Red/Green values into separate files by vertically
partitioning the eight bits of each byte used to store the
gene expression value. There are several reasons to use
the bSQ format. First, different bits have different
degrees of contribution to the value. In some
applications, the high-order bits alone provide the
necessary information. Second, the bSQ format facilitates
the representation of a precision hierarchy (from one bit
precision up to eight bit precision). Third, bSQ format
facilitates better compression through creation of an
efficient, rich data structure called Peano Count Tree and
accommodates algorithm pruning based on one-bit-at-atime approach.
Peano Count Trees
Microarray Data
(P-tree)
for
Each bSQ bit file Bij can be organized into a tree
structure, called Peano Count Tree (P-tree). P-trees are
basically
quadrant-wise,
Peano-order-run-lengthcompressed, representations of each bSQ file [14]. A Ptree is a quadrant based "Data Mining Ready" data
structure for data mining. P-trees can be generated by
recursively dividing the entire data into quadrants and
recording the count of 1-bits for each quadrant, thus,
forming a quadrant count tree. P-trees are somewhat
similar to Quadtrees and its variants[15].The root of the
P-tree contains the 1 bit count of a entire column of bits
representing the microarray spot data for different
treatments. The next level of the tree contains the 1-bit
counts of the four quadrants in raster order. At the next
level, each quadrant is partitioned into sub-quadrants and
their 1-bit counts in raster order constitute the children of
the quadrant node. This construction is continued
recursively down each branch of the tree until the subquadrant is pure (entirely 1-bits or entirely 0-bits), which
may or may not be at the leaf level (1-by-1 subquadrant). It is a lossless and compressed data structure
representation of a 1-bit file from which the original bit
sequence can be completely reconstructed. This also
contains the 1-bit count for each and every quadrant in
the original microarray data. A variation of P-tree, called
Peano Mask Tree (PM-tree), can be used for efficient
implementation of P-tree operations. In the PM-tree
structure, we use a 3-value logic, in which 11 represents a
pure-1 quadrant, 00 represents a pure-0 quadrant, and 01
represents a mixed quadrant. To simplify, we use 1, 0 and
m instead of 11, 00, and 01, respectively. For example,
given an 8-row-8-column image, the P-tree and PM-tree
are as shown in Figure 1.
The above figure 1 can be considered as a set of 8row-8-column microarray data, representing the
expression levels for 64 different treatments for a single
gene. In this example, 55 is the number of 1's in the entire
microarray data set. This root level is labeled as level 0.
The numbers at the next level (level 1), 16, 8, 15 and 16,
are the 1-bit counts for the four major quadrants. Since
the first and last quadrants are composed entirely of 1bits (called a "pure 1 quadrant"), We do not need subtrees for these two quadrants, so these branches
terminate. Similarly, quadrants composed entirely of 0bits are called "pure 0 quadrants" which also terminate
the tree branches. This pattern is continued recursively
using the Peano or Z-ordering of the four subquadrants at
each new level. Every branch terminates eventually (at
the "leaf" level, each quadrant is a pure quadrant). If we
were to expand all sub-trees, including those for pure
quadrants, then the leaf sequence is just the Peanoordering of the original raster data. Thus, we use the
name Peano Count Tree. This structure provides
compression and embedded information that is useful for
datamining.
Procedure P-ARM
{
Data Discretization;
F1 = {frequent 1-Asets};
For (k=2; F k-1 ) do begin
Ck = p-gen(F k-1);
Forall candidate Asets c  Ck do
c.count = rootcount(c);
Fk = {cCk | c.count >= minsup}
end
Answer = k Fk
}
insert into Ck
select p.item1, p.item2, …, p.itemk-1,
q.itemk-1
from Fk-1 p, Fk-1 q
where p.item1 = q.item1, …, p.itemk-2 =
q.itemk-2, p.itemk-1 < q.itemk-1,
p.itemk-1.group <> q.itemk-1.group
Figure 2: P-ARM algorithm
Figure 3: Join step in p-gen function
This mechanism creates a set of basic P-trees, which
can be combined using simple logical operators (AND,
NOT, OR, COMPLEMENT) to reconstruct the original
data or produce P-trees at any level of precision for any
value or combination of values. For example we can
construct a P-tree (called a Value P-tree) for all
occurrences of the value 11010011 by ANDing basic Ptrees (for each 1-bit) and their complements (for each 0
bit): PCb,11010011 = PCb1 AND PCb2 AND PCb3'
AND PCb4 AND PCb5' AND PCb6' AND PCb7 AND
PCb8 where ' indicates the bit-complement. The major
potential in this representation is that by using simple
AND operations we can construct all combinations and
permutations of the original data and that the resulting
representation has the hierarchical count information
embedded to facilitate data mining. Details are discussed
by Perrizo et al. [13].
Deriving Association-rules from P-trees
Microarray data are excellent examples of datasets
to which the use of the Peano Count tree(P-tree) show
tremendous promise for deriving confident rules for data
mining. The Red/Green reflectance values from each spot
on the microarray are converted into an 8-bit bSQ file
format. Each bit file then is converted in a quadrant base
P-tree from which a data cube is constructed. P-trees are
considered "Data Mining Ready" data structure, as all the
needed counts are pre-computed. P-tree can be
successfully used to derive rules of interest from the
microarray data. These rules can provide valuable
information to the biologist as to the gene regulatory
pathways and identify important relationships between
the different gene expression patterns hitherto unknown.
The biologist may be interested in some specific kinds of
rules. These rules can be called as "rules of interest". In
gene regulatory pathways, a biologist may be interested
in identifying genes that govern the expression of other
sets of genes. These relationships can be represented as
following,
{G1,..............Gn}  Gm where G1 ...….Gn represents the
antecedent and Gm represents the consequent of the rule.
The intuitive meaning of this rule is that for a given
confidence level the expression of G1 .....Gn genes will
result in the expression of Gm gene.
There are two algorithms for mining associations
rules on microarray data using P-trees namely P-ARM
and p-gen algorithm[14]. These two algorithms are
described in Figure 2 & 3.
The P-ARM algorithm assumes a fixed precision,
for example, 3-bit precision in all bands. In Apriori
algorithm, there is a function called "apriori-gen"[11] to
generate candidate k-itemsets from frequent (-1) itemsets.
The p-gen function in P-ARM algorithm differs from the
apriori-gen function in the way pruning is done. The
rootcount function is used to Aset calculate counts
directly by ANDing the appropriate basic P-trees. PARM algorithm also has the capability to group the
adjacent intervals depending on user's requirements. Qin
Ding, Qiang Ding and William Perrizo[14] compared PARM against other data mining algorithms like Apriori
and FP-growth algorithms and found that P-ARM was
more scalable for large datasets and for lower support
thresholds[14].
Conclusions
In this paper, we propose a new method of data
mining the microarray data using association rules
through a new data mining technology called Peano
Count Trees (P-trees). Association rule mining is helpful
in deriving meaningful rules of value and interest from
the microarray data.
Apart from providing the
relationship between the gene expression profiles, they
also provide the direction of the relationship unlike other
techniques like clustering and classification. Association
rule mining can be very useful in deriving gene
regulatory pathways. In this model, the microarray data is
organized into a bit-Sequential (bSQ) format thus
providing it a spatial organization. Peano count trees (Ptree) are used for data mining the microarray data. P-tree
structure is a space efficient, lossless, data mining ready
structure for the association rule mining. They facilitate
new pruning techniques for association rule mining based
on a high-order bit first and a single attribute first
approach. These data structures also provide early
algorithm exit advantages for fast high-confidence, lowsupport association rule identification[13]. It provides a
fast way to calculate support and confidence for
association rule mining. It is possible to derive
meaningful rules from the microarray data that has high
biological significance. Association rule mining through
P-trees can also provides rules hitherto undiscovered that
could provide insight as to the direction in which the
biologists can carry their research.
Acknowledgement
We would like to thanks all the members of the
Datasurg group for all their valuable inputs.
References
[1] Holter, N.S., Mitra, M., Maritan, A. 2000
Fundamental patterns underlying gene expression
profiles: Simplicity from Complexity, Proc.Natl.
Acad. Sci., 97(15): 8409-8414
[2] Alizadeh, A.A., Eisen, M.B., Davis, R.E. 2000
Distinct Types of Diffuse Large B-Cell Lymphoma
Identified by Gene Expression Profiling, Nature,
403:503-511
[3] C.B., Spellman, P.T., Brown, P.O., Botstein, D. 1998
Cluster analysis and display of genome-wide expression
patterns, Proc. Natl. Acad. Sci., 95(25): 14863-14868
[4] Schena, M., Shalon, D., Davis, R.W., Brown, P.O.
1995 Quantitative monitoring of gene expression patterns
with a complementary DNA microarray, Science,
270:467-470
[5] Lashkari, D.A., DeRisi, J.L., McCusker, J.H.,
Namath, A.F., Gentile, C., Hwang, S.Y., Brown, P.O.,
Davis, R.W. 1997 “Yeast microarrays for genome wide
parallel genetic and gene expression analysis”, Proc.
Natl. Acad Sci USA. 94(24):13057-13062
[6] Brazma, A., Vilo, J. 2000 Gene expression data
analysis, FEBS Letters, 480: 17-24
[7] Alon, U., Barakai, N., Notterman, D.A., Gish, K.,
Ybarra, S., Mack, D., Levine, A.J. 1999 Broad patterns of
gene expression revealed by clustering analysis of tumor
and normal colon tissues probed by oligonucleotide
arrays, Proc. Natl Acad Sci USA. 96(12): 6745-6750
[8] Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., and
Platt, D. 2000 The emergence of pattern discovery
techniques in computation biology, Metab Engg. 2(3):
159-77
[9] Luscombe N.M., Greenbaum D. Gerstein M. 2001
What is Bioinformatics? An introduction and overview,
2001 IMIA Yearbook, 83-100
[10] Frawley, W., Piatetsky-Shapiro, G., Matheus, C.
1992 Knowledge Discovery in Databases: An
Overview”, AI Magazine, 13(3): 57-70
[11] Agarwal, R., Imielinshki, T., Swami, A. 1993.
“Mining associtation rules between sets of items in large
databases”, Proc. of ACM-SIGMOD Int'l Conf. on
Management of Data: 207-216
[12] Klemttinen, M., Mannila, H., Ronkainen, P.,
Toivonen, H., Verkamo, A.I. 1994 Finding Interesting
Rules from Large Sets of Discovered Association Rules,
Third Int’l Conf. on Information and Knowledge
Management (CIKM'94): 401-407
[13] Perrizo, W. 2001. Peano Count Tree Technology
Lab Notes, Technical Report NDSU-CSOR-TR-01-1.
[14] Ding, Qin., Perrizo, W., Ding, Qiang., Roy A., On
Mining Satellite and Other Remotely Sensed Images,
Proc. of DMKD-2001:33-40
[15] Samet, H. 1984 The quadtree and related
hierarchical data structure. ACM Computing Survey,
16(2): 187--260
[16] Agarwal, R. Srikant, R., 1994, Fast Algorithms for
Mining Association Rules, Proc. of the 20th VLDB: 487499.
[17] Fiehn, O., Kloska, S. Altmann, T. 2000 Integrated
studies on plant biology using multiparallel techniques,
Current Opinion in Biotechnology, 12: 82-86
Download