Gene Expression Profiling of DNA Microarray Data using Peano Count Trees (P-Trees)

P. Kotala1, A. Perera1, J. Kai Zhou1, S. Mudivarthy3, W. Perrizo1, E. Deckard2
1 Department of Computer Science, 2 Department of Plant Sciences, 3 Department of Statistics
North Dakota State University, Fargo, ND 58105, USA
701-231-6257
Pratap.Kotala@ndsu.nodak.edu

Proc. of the Virt. Conf. in Genom. and Bioinf. © North Dakota State University, USA, October 15-16, 2001. www.ndsu.edu/virtual-genomics

* P-tree technology is patented by North Dakota State University.

Abstract

The explosion of genomic data, made possible by advances in parallel, high-throughput technologies in molecular biology, has ushered in a new era in bioinformatics. For many years, efforts concentrated on sequencing the genomes of organisms. The current emphasis lies in extracting meaningful information from this huge body of DNA sequence and expression data. The techniques currently employed to analyze microarray expression data are clustering and classification. These techniques are limited in the amount of useful information that they can derive. In this paper, we propose a new approach to mining microarray data using a new data mining technology called the Peano Count Tree (P-tree)*. This technology employs Association Rule Mining as the means of mining the microarray data. Association Rule Mining is an advanced data mining technique that is useful in deriving meaningful rules from a given data set. We propose using Association Rule Mining to derive meaningful rules from microarray expression data. Our approach involves a "Data Mining Ready" data structure, called the Peano count tree (P-tree), to measure gene expression levels. The method treats the microarray data as spatial data. Each spot on the microarray is represented as a pixel with corresponding red and green ratios.
The microarray data is reorganized into an 8-bit bSQ format (where each bit of each attribute, or band, is stored as a separate file). Each bit file is then converted into a quadrant-based tree structure (P-tree), from which a data cube is constructed and meaningful rules are readily obtained.

Keywords: Data Mining, Association Rule Mining, DNA Microarrays, Gene expression, P-Tree.

Introduction

DNA microarray technology is a powerful means for exploring the genomes of organisms. It is an important tool for monitoring and analyzing the expression profiles of thousands of genes simultaneously. The small size, high density and compatibility with fluorescent labeling of microarrays make them a widely used technique in molecular genetics. Microarray technology provides an economic, robust, fully automated approach to examining and measuring temporal and spatial global gene expression changes [4]. Although fundamental differences exist between the two microarray technologies, cDNA microarrays and oligonucleotide arrays, the strength of both lies in the massively parallel analysis of thousands of genes simultaneously. Microarray data yields valuable information about gene functions, inter-gene dependencies and underlying biological processes [1,2,3]. Such information may help discover gene regulation pathways, metabolic pathways, the relation of genes to their environments, and so on.

The exponential growth in accumulated genomic data poses the challenge of developing analysis procedures that can extract useful information from this data. These analytical procedures are broadly called "Data Mining". Data mining (also known as Knowledge Discovery in Databases, KDD) has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" [10]. The goal of data mining is to automate the process of finding interesting patterns and trends in a given data set.
Several data mining methodologies have been proposed to analyze large amounts of gene expression data. Most of these techniques can be broadly classified as cluster analysis and classification techniques. These techniques have been widely used to identify groups of genes sharing similar expression profiles, and the results obtained so far have been extremely valuable [3, 6, 7, 9]. However, the metrics adopted in these clustering techniques have discovered only a subset of the many potential relationships between the transcripts [8]. Clustering can work well when there is already a wealth of knowledge about the pathway in question, but it works less well when this knowledge is sparse [17]. The inherent nature of clustering and classification methodologies makes them less suited for mining previously unknown rules and pathways. We propose using another technique, Association Rule Mining, for mining microarray data. It is our understanding that Association Rule Mining can find rules that will help in discovering previously unknown pathways.

Association rules, proposed by Agrawal, Imielinski, and Swami [11], are a class of simple but powerful regularities in binary data. An association rule is an expression of the form X ⇒ Y, where X and Y are sets of items. The intuitive meaning of such a rule is that in the rows of the database where the attributes in X have value true, the attributes in Y also have value true with high probability [12]. The problem is to mine for rules that satisfy user-specified minimum support and minimum confidence. Depending on its size and complexity, a given data set may contain hundreds of association rules. The process of mining for such rules is called Association Rule Mining. Much genomic research requires the mining of microarray data for rules, implications, class structures and cluster structures.
In this paper we discuss the use of a data mining technology which involves a "Data Mining Ready" data structure, called the Peano count tree (P-tree), to mine microarray data. The paper is organized into five sections. Section 2 provides some background information about Association Rule Mining and its relevance to microarray data. The Peano Count Tree (P-tree) is discussed in Section 3. Section 4 details how to derive association rules for microarray data using P-trees and related pruning techniques. The conclusions and proposed work are discussed in Section 5.

Association Rule Mining for Microarray Data

Association rule mining is a widely used technique for large-scale data mining. Originally proposed for Market Basket data to study consumer purchasing patterns in retail stores, it has potential applications in many areas. Microarray data is one of the promising application areas. It is very complex and highly interlinked data, as a spot in a microarray provides information not only about the intensity of a gene's expression but also about its interaction with other genes. Extracting interesting patterns and rules from a microarray dataset can be important in identifying gene regulation pathways, where the expression of certain genes depends on the expression of other genes. It would also be possible to extract hitherto unknown rules, which can have significant biological implications.

An association rule is a relationship of the form X ⇒ Y, where X is the antecedent itemset and Y is the consequent itemset. An example of such a rule is "customers who purchase an item X are very likely to purchase another item Y at the same time." There are primarily two measures of quality for each rule, support and confidence. The rule X ⇒ Y has support s% in the transaction set T if s% of the transactions in T contain X ∪ Y. The rule has confidence c% if c% of the transactions in T that contain X also contain Y.
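The two measures defined above can be illustrated with a minimal Python sketch. The function names and the market-basket data are hypothetical and chosen only for illustration; the sketch also assumes that the antecedent occurs in at least one transaction, since confidence is otherwise undefined.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing X that also contain Y."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data: each set is one transaction.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

s = support(baskets, {"bread", "milk"})       # 2 of the 4 baskets
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 baskets with bread
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

Here the rule {bread} ⇒ {milk} has support 50% and confidence of about 67%, so it would pass a minimum support of 40% but fail a minimum confidence of 80%.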
The goal of association rule mining is to find all the rules with support and confidence exceeding some user-specified thresholds.

The microarray data format is very similar to the Market Basket data format. The data mining model for a Market Research dataset can be treated as a relation R(Tid, i1, ..., in), where Tid is the transaction identifier and i1, ..., in denote the feature attributes, that is, all the items available for purchase from the store. Transactions constitute the rows of the data table, whereas items form the columns. The values of the items for the different transactions are binary: 1 if the item is purchased and 0 if it is not.

Microarray data is currently represented as a relation R(Gid, T1, ..., Tn), where Gid is the gene identifier for each gene and T1, ..., Tn are the various treatments to which the genes were exposed. The genes constitute the rows of the data table, whereas the treatments are the columns. The values are normalized Red/Green color ratios representing the abundance of transcript for each spot on the microarray. This table can be called a "Gene table". Currently, the data mining techniques of clustering and classification are applied to the Gene table. In these techniques, the dataset is divided into clusters or classes by grouping the rows (genes).

We propose a new data format called the "Treatment table", formed by transposing the Gene table. The relation R of a Treatment table can be represented as R(Tid, G1, ..., Gn), where Tid represents the treatment ids and G1, ..., Gn are the gene identifiers. The Treatment table provides a convenient means to treat gene expression values as spatial data. The goal here is to mine for rules among the genes by associating the columns (genes) in the Treatment table.
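As a rough sketch of the transposition described above, the following Python fragment turns a small Gene table into a Treatment table and then into market-basket style transactions. The gene names, the ratio values and the single cut-off `threshold=1.5` are all made up for illustration; the paper itself does not prescribe a particular discretization.

```python
# Hypothetical Gene table: rows are genes, columns are treatments T1..T3,
# values are normalized Red/Green ratios (made-up numbers).
gene_table = {
    "G1": [2.1, 0.4, 1.9],
    "G2": [1.8, 0.5, 2.2],
    "G3": [0.3, 1.7, 0.6],
}


def to_treatment_table(gene_table):
    """Transpose R(Gid, T1..Tn) into R(Tid, G1..Gn)."""
    gene_ids = list(gene_table)
    rows = zip(*(gene_table[g] for g in gene_ids))
    return gene_ids, [list(row) for row in rows]


def expressed_genes(gene_ids, row, threshold=1.5):
    """Genes whose ratio reaches the (assumed) cut-off in one treatment."""
    return {g for g, v in zip(gene_ids, row) if v >= threshold}


gene_ids, treatment_table = to_treatment_table(gene_table)
transactions = [expressed_genes(gene_ids, row) for row in treatment_table]

for tid, items in enumerate(transactions, start=1):
    print(f"T{tid}: {sorted(items)}")
```

Each treatment now plays the role of a transaction and each gene the role of an item, which is the form that association rule mining expects.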
The Treatment table can be viewed as a 2-dimensional array of gene expression values.

8x8 bit array:

  11 11 11 00
  11 11 10 00
  11 11 11 00
  11 11 11 10
  11 11 11 11
  11 11 11 11
  11 11 11 11
  01 11 11 11

P-tree (1-bit counts; children listed in the order upper-left, upper-right, lower-left, lower-right):

  55
    16
    8
      3  (leaf bits 1110)
      0
      4
      1  (leaf bits 0010)
    15
      4
      4
      3  (leaf bits 1101)
      4
    16

PM-tree:

  m
    1
    m
      m  (leaf bits 1110)
      0
      1
      m  (leaf bits 0010)
    m
      1
      1
      m  (leaf bits 1101)
      1
    1

Figure 1: P-tree and PM-tree for an 8x8 image array of gene expression values.

The Treatment table can be organized into a new spatial format called bit Sequential (bSQ), proposed by Qin Ding, Qiang Ding and William Perrizo [14]. bSQ is a new data format for representing spatial data. The Red/Green ratio for each gene, which represents the gene expression level, can be represented as a byte. The bSQ format breaks these Red/Green values into separate files by vertically partitioning the eight bits of each byte used to store the gene expression value. There are several reasons to use the bSQ format. First, different bits contribute to the value to different degrees. In some applications, the high-order bits alone provide the necessary information. Second, the bSQ format facilitates the representation of a precision hierarchy (from one-bit precision up to eight-bit precision). Third, the bSQ format facilitates better compression through the creation of an efficient, rich data structure called the Peano Count Tree, and accommodates algorithm pruning based on a one-bit-at-a-time approach.

Peano Count Trees (P-tree) for Microarray Data

Each bSQ bit file Bij can be organized into a tree structure called a Peano Count Tree (P-tree). P-trees are basically quadrant-wise, Peano-order, run-length-compressed representations of each bSQ file [14]. A P-tree is a quadrant-based "Data Mining Ready" data structure.
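The bSQ decomposition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expression levels are taken to be already quantized to 8-bit integers (the mapping from a Red/Green ratio to a byte is not specified here), the bit planes are kept in memory as lists rather than written to separate files, and the function names are hypothetical.

```python
def to_bsq(values, bits=8):
    """Split `bits`-bit integers into bit planes, high-order plane first."""
    return [[(v >> b) & 1 for v in values] for b in range(bits - 1, -1, -1)]


def from_bsq(planes):
    """Rebuild integers from the given planes (high-order plane first)."""
    values = [0] * len(planes[0])
    for plane in planes:
        values = [(v << 1) | bit for v, bit in zip(values, plane)]
    return values


levels = [211, 12, 255, 128]   # hypothetical 8-bit expression levels
planes = to_bsq(levels)

print(planes[0])               # high-order bit of every value
print(from_bsq(planes))        # all eight planes give back the input
print(from_bsq(planes[:2]))    # two-bit precision: the top two bits only
```

Using only the first few planes corresponds to the precision hierarchy mentioned above: the two high-order planes already separate strongly expressed values from weakly expressed ones.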
P-trees can be generated by recursively dividing the entire data into quadrants and recording the count of 1-bits for each quadrant, thus forming a quadrant count tree. P-trees are somewhat similar to quadtrees and their variants [15]. The root of the P-tree contains the 1-bit count of the entire column of bits representing the microarray spot data for the different treatments. The next level of the tree contains the 1-bit counts of the four quadrants in raster order. At the next level, each quadrant is partitioned into sub-quadrants, and their 1-bit counts in raster order constitute the children of the quadrant node. This construction is continued recursively down each branch of the tree until the sub-quadrant is pure (entirely 1-bits or entirely 0-bits), which may or may not be at the leaf level (1-by-1 sub-quadrant). The P-tree is a lossless, compressed representation of a 1-bit file from which the original bit sequence can be completely reconstructed. It also contains the 1-bit count for each and every quadrant in the original microarray data.

A variation of the P-tree, called the Peano Mask Tree (PM-tree), can be used for efficient implementation of P-tree operations. In the PM-tree structure, we use a 3-value logic, in which 11 represents a pure-1 quadrant, 00 represents a pure-0 quadrant, and 01 represents a mixed quadrant. To simplify, we use 1, 0 and m instead of 11, 00, and 01, respectively. For example, given an 8-row, 8-column image, the P-tree and PM-tree are as shown in Figure 1.

Figure 1 can be considered as a set of 8-row, 8-column microarray data, representing the expression levels for 64 different treatments for a single gene. In this example, 55 is the number of 1s in the entire microarray data set. This root level is labeled level 0. The numbers at the next level (level 1), 16, 8, 15 and 16, are the 1-bit counts for the four major quadrants.
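The construction can be checked against Figure 1 with a short Python sketch. The tuple representation `(count, children)` and the function names are assumptions made for this illustration, not the storage layout used by the P-tree technology; the sketch also assumes a square array whose side is a power of two, and it recounts bits at every level for clarity rather than speed.

```python
# The 8x8 bit array of Figure 1, one string per row.
FIGURE_1 = [
    "11111100",
    "11111000",
    "11111100",
    "11111110",
    "11111111",
    "11111111",
    "11111111",
    "01111111",
]
bits = [[int(ch) for ch in row] for row in FIGURE_1]


def build_ptree(bits, r=0, c=0, size=None):
    """Return (count, children); children is None for a pure quadrant."""
    if size is None:
        size = len(bits)
    count = sum(bits[i][j]
                for i in range(r, r + size)
                for j in range(c, c + size))
    if count == 0 or count == size * size:   # pure-0 or pure-1: stop here
        return (count, None)
    half = size // 2
    # Quadrant order: upper-left, upper-right, lower-left, lower-right.
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    children = [build_ptree(bits, r + dr, c + dc, half) for dr, dc in offsets]
    return (count, children)


def pm_label(node, size):
    """PM-tree label of a node covering a size-by-size quadrant."""
    count, _ = node
    if count == 0:
        return "0"
    return "1" if count == size * size else "m"


tree = build_ptree(bits)
print(tree[0])                                # root count
print([child[0] for child in tree[1]])        # level-1 counts
print([pm_label(child, 4) for child in tree[1]])
```

The root count and the level-1 counts printed by this sketch are the values 55 and 16, 8, 15, 16 quoted in the text, and the level-1 PM-tree labels are 1, m, m, 1.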
Since the first and last quadrants are composed entirely of 1-bits (called "pure-1 quadrants"), we do not need sub-trees for these two quadrants, so these branches terminate. Similarly, quadrants composed entirely of 0-bits are called "pure-0 quadrants", and they also terminate the tree branches. This pattern is continued recursively using the Peano or Z-ordering of the four sub-quadrants at each new level. Every branch terminates eventually (at the "leaf" level, each quadrant is a pure quadrant). If we were to expand all sub-trees, including those for pure quadrants, then the leaf sequence would be just the Peano ordering of the original raster data. Thus, we use the name Peano Count Tree. This structure provides compression and embedded information that is useful for data mining.

  Procedure P-ARM {
    Data Discretization;
    F1 = {frequent 1-Asets};
    For (k = 2; Fk-1 ≠ ∅) do begin
      Ck = p-gen(Fk-1);
      Forall candidate Asets c ∈ Ck do
        c.count = rootcount(c);
      Fk = {c ∈ Ck | c.count >= minsup}
    end
    Answer = ∪k Fk
  }

Figure 2: P-ARM algorithm

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Fk-1 p, Fk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2,
        p.itemk-1 < q.itemk-1,
        p.itemk-1.group <> q.itemk-1.group

Figure 3: Join step in p-gen function

This mechanism creates a set of basic P-trees, which can be combined using simple logical operators (AND, NOT, OR, COMPLEMENT) to reconstruct the original data or to produce P-trees at any level of precision for any value or combination of values. For example, we can construct a P-tree (called a Value P-tree) for all occurrences of the value 11010011 by ANDing the basic P-trees (for each 1-bit) and their complements (for each 0-bit):

  PCb,11010011 = PCb1 AND PCb2 AND PCb3' AND PCb4 AND PCb5' AND PCb6' AND PCb7 AND PCb8

where ' indicates the bit-complement.
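The Value P-tree construction above can be sketched in Python on a mask-tree style representation. Everything in this sketch is an assumption made for illustration: a node is the integer 0 or 1 for a pure quadrant or a list of four child nodes for a mixed quadrant, the input values are taken to be listed in Peano order already, and the 16 sample values are made up.

```python
def build(bits):
    """Build a tree from a Peano-ordered bit list whose length is a power of 4."""
    if all(b == bits[0] for b in bits):
        return bits[0]                      # pure quadrant: 0 or 1
    q = len(bits) // 4
    return [build(bits[i * q:(i + 1) * q]) for i in range(4)]


def complement(t):
    return 1 - t if isinstance(t, int) else [complement(ch) for ch in t]


def p_and(a, b):
    """AND two trees quadrant by quadrant, stopping early at pure quadrants."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    kids = [p_and(x, y) for x, y in zip(a, b)]
    return 0 if all(k == 0 for k in kids) else kids


def root_count(t, size):
    """Number of 1-bits under a node that covers `size` cells."""
    if isinstance(t, int):
        return t * size
    return sum(root_count(ch, size // 4) for ch in t)


def value_ptree(basic, pattern):
    """AND basic trees for 1-bits and complemented trees for 0-bits."""
    result = 1
    for tree, bit in zip(basic, pattern):
        result = p_and(result, tree if bit == "1" else complement(tree))
    return result


# 16 hypothetical 8-bit values of a 4x4 array, listed in Peano order.
values = [211] * 4 + [12, 211, 255, 128] + [0] * 4 + [211, 211, 12, 12]
basic = [build([(v >> b) & 1 for v in values]) for b in range(7, -1, -1)]

vp = value_ptree(basic, "11010011")         # 211 in binary
print(vp, root_count(vp, len(values)))
```

The root count of the result equals the number of occurrences of the value 211 in the input, obtained without scanning the values themselves; pure-0 quadrants cut the AND short, which is where the compression pays off.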
The major potential of this representation is that, by using simple AND operations, we can construct all combinations and permutations of the original data, and that the resulting representation has the hierarchical count information embedded to facilitate data mining. Details are discussed by Perrizo et al. [13].

Deriving Association-rules from P-trees

Microarray data are excellent examples of datasets for which the Peano Count Tree (P-tree) shows tremendous promise for deriving confident rules. The Red/Green ratios from each spot on the microarray are converted into the 8-bit bSQ file format. Each bit file is then converted into a quadrant-based P-tree, from which a data cube is constructed. P-trees are considered a "Data Mining Ready" data structure, as all the needed counts are pre-computed. P-trees can be used to derive rules of interest from the microarray data. These rules can provide valuable information to the biologist about gene regulatory pathways and can identify important, hitherto unknown relationships between the different gene expression patterns.

The biologist may be interested in some specific kinds of rules. These rules can be called "rules of interest". In gene regulatory pathways, a biologist may be interested in identifying genes that govern the expression of other sets of genes. These relationships can be represented as {G1, ..., Gn} ⇒ Gm, where G1, ..., Gn represent the antecedent and Gm represents the consequent of the rule. The intuitive meaning of this rule is that, for a given confidence level, the expression of genes G1, ..., Gn will result in the expression of gene Gm.

There are two algorithms for mining association rules on microarray data using P-trees, namely the P-ARM and p-gen algorithms [14]. These two algorithms are described in Figures 2 and 3. The P-ARM algorithm assumes a fixed precision, for example, 3-bit precision in all bands.
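The level-wise loop of Figure 2 and the join step of Figure 3 can be sketched as follows. This is a simplified illustration, not the published implementation: Python integers used as bitmasks over the treatments stand in for value P-trees, so `rootcount` is a bitwise AND followed by a count of 1-bits; items are hypothetical `(gene, interval)` pairs whose first component is the group; discretization is assumed to have been done already; `minsup` is an absolute count; and no pruning beyond the group check of the join step is attempted.

```python
from itertools import combinations


def p_arm(item_bits, n_rows, minsup):
    """Return {Aset: count} for every frequent Aset.

    item_bits maps (group, interval) -> int bitmask with one bit per row.
    """
    def rootcount(aset):
        mask = (1 << n_rows) - 1
        for item in aset:
            mask &= item_bits[item]          # stand-in for P-tree ANDing
        return bin(mask).count("1")

    frequent = {}
    level = sorted((item,) for item in item_bits
                   if rootcount((item,)) >= minsup)
    while level:
        for aset in level:
            frequent[aset] = rootcount(aset)
        candidates = []
        for p, q in combinations(level, 2):
            # Join step: same prefix, ordered last items, different groups.
            if p[:-1] == q[:-1] and p[-1] < q[-1] and p[-1][0] != q[-1][0]:
                candidates.append(p + (q[-1],))
        level = sorted(c for c in candidates if rootcount(c) >= minsup)
    return frequent


# Hypothetical discretized data for four treatments (bit i = treatment i).
item_bits = {
    ("G1", "high"): 0b1011,
    ("G1", "low"):  0b0100,
    ("G2", "high"): 0b0011,
    ("G3", "high"): 0b1001,
}

frequent = p_arm(item_bits, n_rows=4, minsup=2)
for aset, count in frequent.items():
    print(aset, count)
```

The group check means that two intervals of the same gene, such as ("G1", "high") and ("G1", "low"), are never joined into one candidate, since no treatment can fall into both.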
In the Apriori algorithm, there is a function called "apriori-gen" [16] that generates candidate k-itemsets from frequent (k-1)-itemsets. The p-gen function in the P-ARM algorithm differs from the apriori-gen function in the way pruning is done. The rootcount function is used to calculate Aset counts directly by ANDing the appropriate basic P-trees. The P-ARM algorithm also has the capability to group adjacent intervals depending on the user's requirements. Qin Ding, Qiang Ding and William Perrizo compared P-ARM against other data mining algorithms, such as Apriori and FP-growth, and found that P-ARM was more scalable for large datasets and for lower support thresholds [14].

Conclusions

In this paper, we propose a new method of mining microarray data using association rules through a new data mining technology called Peano Count Trees (P-trees). Association rule mining is helpful in deriving meaningful rules of value and interest from microarray data. Apart from providing the relationship between gene expression profiles, association rules also provide the direction of the relationship, unlike other techniques such as clustering and classification. Association rule mining can be very useful in deriving gene regulatory pathways.

In this model, the microarray data is organized into the bit-Sequential (bSQ) format, thus giving it a spatial organization. Peano count trees (P-trees) are used for mining the microarray data. The P-tree structure is a space-efficient, lossless, data-mining-ready structure for association rule mining. P-trees facilitate new pruning techniques for association rule mining based on a high-order-bit-first and a single-attribute-first approach. These data structures also provide early algorithm exit advantages for fast identification of high-confidence, low-support association rules [13]. They provide a fast way to calculate support and confidence for association rule mining.
It is possible to derive meaningful rules of high biological significance from microarray data. Association rule mining through P-trees can also provide hitherto undiscovered rules that could give biologists insight into the direction in which to carry their research.

Acknowledgement

We would like to thank all the members of the Datasurg group for their valuable input.

References

[1] Holter, N.S., Mitra, M., Maritan, A. 2000. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. USA, 97(15): 8409-8414.
[2] Alizadeh, A.A., Eisen, M.B., Davis, R.E. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403: 503-511.
[3] Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA, 95(25): 14863-14868.
[4] Schena, M., Shalon, D., Davis, R.W., Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270: 467-470.
[5] Lashkari, D.A., DeRisi, J.L., McCusker, J.H., Namath, A.F., Gentile, C., Hwang, S.Y., Brown, P.O., Davis, R.W. 1997. Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proc. Natl. Acad. Sci. USA, 94(24): 13057-13062.
[6] Brazma, A., Vilo, J. 2000. Gene expression data analysis. FEBS Letters, 480: 17-24.
[7] Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96(12): 6745-6750.
[8] Rigoutsos, I., Floratos, A., Parida, L., Gao, Y., Platt, D. 2000. The emergence of pattern discovery techniques in computational biology. Metabolic Engineering, 2(3): 159-177.
[9] Luscombe, N.M., Greenbaum, D., Gerstein, M. 2001. What is Bioinformatics?
An introduction and overview. 2001 IMIA Yearbook: 83-100.
[10] Frawley, W., Piatetsky-Shapiro, G., Matheus, C. 1992. Knowledge Discovery in Databases: An Overview. AI Magazine, 13(3): 57-70.
[11] Agrawal, R., Imielinski, T., Swami, A. 1993. Mining association rules between sets of items in large databases. Proc. of ACM-SIGMOD Int'l Conf. on Management of Data: 207-216.
[12] Klemettinen, M., Mannila, H., Ronkainen, P., Toivonen, H., Verkamo, A.I. 1994. Finding Interesting Rules from Large Sets of Discovered Association Rules. Third Int'l Conf. on Information and Knowledge Management (CIKM'94): 401-407.
[13] Perrizo, W. 2001. Peano Count Tree Technology Lab Notes. Technical Report NDSU-CSOR-TR-01-1.
[14] Ding, Qin, Perrizo, W., Ding, Qiang, Roy, A. On Mining Satellite and Other Remotely Sensed Images. Proc. of DMKD-2001: 33-40.
[15] Samet, H. 1984. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2): 187-260.
[16] Agrawal, R., Srikant, R. 1994. Fast Algorithms for Mining Association Rules. Proc. of the 20th VLDB: 487-499.
[17] Fiehn, O., Kloska, S., Altmann, T. 2000. Integrated studies on plant biology using multiparallel techniques. Current Opinion in Biotechnology, 12: 82-86.