A DATA MINING EXPERIMENT FOR INFERRING TRANSCRIPTIONAL MODULE FROM BREAST CANCER PROFILE DATA

Rakhi Malpani
BS, Pune University, India 2004

PROJECT

Submitted in partial satisfaction of the requirements for the degree of
MASTER OF SCIENCE in COMPUTER SCIENCE
at CALIFORNIA STATE UNIVERSITY, SACRAMENTO

FALL 2010

© 2010 Rakhi Malpani
ALL RIGHTS RESERVED

A Project by Rakhi Malpani

Approved by:
Meiliu Lu, PhD, Committee Chair
Du Zhang, PhD, Second Reader

Student: Rakhi Malpani

I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit to be awarded for the project.

Nikrouz Faroughi, PhD, Graduate Coordinator
Department of Computer Science

Abstract

Breast cancer is a cancer that starts in the tissues of the breast. Over the course of a lifetime, 1 in 8 women will be diagnosed with breast cancer. Existing cancer therapies include chemotherapy, radiation therapy, and targeted therapy. Targeted therapy aims at specific processes of cancer cell growth, division, and lifecycle. Existing targeted therapies for breast cancer include Avastin, Herceptin, Iressa, and Tykerb. Each of these drugs has specific effects on cancer cells. There is a need for new drugs. The motivation of this project is to contribute results that are useful for new drug design. Transcriptional modules (TM) consist of groups of co-regulated genes and the transcription factors (TF) regulating their expression.
Transcription factors are one of the groups of proteins that read and interpret the genetic "blueprint" in the DNA. They bind DNA and help initiate a program of increased or decreased gene transcription. As such, they are vital for many important cellular processes.

Currently, breast cancer profile data is large and may be noisy and incomplete, and it is not in a format suitable as input to a data mining algorithm. We therefore propose a method to transform the original data into a feasible format that can be used as input to a data mining algorithm. This project first focuses on developing a data preprocessing method that makes the data usable by a data mining algorithm. The second step is to derive association rules between sets of transcription factors and sets of genes using an association rule mining method.

The project procedure is as follows:
(1) Analyze the breast cancer profile data, identifying the most important attributes and removing the redundant attributes.
(2) Survey data mining algorithms and identify a suitable algorithm for association rule mining.
(3) Develop a data preprocessing method so that the data can be ranked, sorted, and made feasible as input to the chosen association rule mining algorithm. Data preprocessing techniques such as data cleaning, data integration, data transformation, and data reduction were used.
(4) Perform an experiment that applies the association rule mining algorithm to the transformed data, and analyze the results.

Association rule mining is performed using a two stage approach:
Stage 1: Generating association rules using the association rule mining method in WEKA.
Stage 2: Generating association rules using data filtered on the basis of p-value and support.

ACKNOWLEDGMENTS

I would like to thank Prof. Meiliu Lu, my project advisor, for all the knowledge and guidance she has provided me throughout this project. I would also like to thank Prof. Du Zhang for giving me his time and input as second reader. I would also like to thank my husband, Ashwin, who has given me the time and support to finish this project. I couldn't have done it without him. I would also like to thank my daughter Nishka for giving me the time to finish this project.

TABLE OF CONTENTS

Acknowledgments
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
   2.1 Deoxyribonucleic Acid (DNA)
   2.2 Gene
   2.3 Transcription Factors
   2.4 Enhancer
   2.5 Gene Expression
3. PROBLEM ANALYSIS
   3.1 Analysis of Input Data files
   3.2 Experimental Design Approach
4. DATA PREPROCESSING
   4.1 Data Cleaning
      4.1.1 Cleaning of Data file I – TFvalues_Enhancer.xls
      4.1.2 Cleaning of Data file II – Enhancer_expression_profile.xls
   4.2 Data Reduction
   4.3 Data Transformation
      4.3.1 Classification
      4.3.2 P-value Threshold
      4.3.3 Splitting the File
   4.4 Data Integration
5. ASSOCIATION RULE MINING
   5.1 Weka
   5.2 Association Rule Mining Method
   5.3 Using Weka
      5.3.1 Loading the Data
      5.3.2 Association Rule Mining Using Weka
   5.4 Result Interpretation
   5.5 P-value based Filtering Method
6. EXPERIMENTS
   6.1 Experiment 1
   6.2 Experiment 2
   6.3 Experiment 3
   6.4 Experiment 4
   6.5 Result Analysis
7. CONCLUSION
   7.1 Summary
   7.2 Learning Experience
   7.3 Future Work
Bibliography

LIST OF TABLES

1. Data file I – 'TFvalues_Enhancer.xls' description
2. Data file II – 'Enhancer_expression_profile.xls' description
3. Classification Table
4. P-value Table
5. File Table
6. TF Name and Number Mapping Table
7. Association Rule Table 1
8. Association Rule Table 2

LIST OF FIGURES

1. DNA, Gene, Transcription Factors
2. 'TFvalues_Enhancer.xls' format
3. 'Enhancer_expression_profile.xls' format
4. Candidate Enhancer Information of a Single Gene
5. TF to Gene Relation
6. Experimental Design Procedure
7. Data Preprocessing Flowchart
8. Missing Value Handling
9. Data Reduction
10. Data Classification
11. P-value Classification
12. Data Integration
13. Association Rule Mining
14. Loading Data
15. GREB1.csv File Format
16. Statistics Window
17. Parameter Editor Window
18. Associator Output Window
19. Filtering using P-value Threshold for a Set of Genes which are Up Expressed
20. The Checked Boxes correspond to TFs which are Significant
21. Two-Stage Association Rule Mining Approach
22. Result Analysis

Chapter 1
INTRODUCTION

Over the last few years, computational biology research has contributed significantly to the advancement of molecular biology. High throughput genome sequencing has provided us with the complete genomes of several species, from microbes to human beings. This rapid growth of biological data opens a whole range of exciting possibilities for, and necessitates the development of, data mining methods tailored towards understanding the complex mechanisms of biological systems. Bioinformatics has gone from providing support, in terms of data management, visualization, and such, to generating new insights and directing future experiments. One key topic in molecular biology is understanding the regulatory processes and mechanisms of gene expression. With the availability of breast cancer genetic profile data, we analyze this topic in our project.

Breast cancer is cancer originating from breast tissue. Prognosis and survival rate vary greatly depending on cancer type and stage. The lifetime risk for breast cancer in the United States is usually given as 1 in 8 (12.5%). Can we do anything about it?
Treatment depends on many factors, such as the type and stage of the cancer, sensitivity to hormones, and overexpression of certain genes. Common treatments are radiotherapy, chemotherapy, hormonal therapy, surgery, and targeted therapy. Targeted therapy, the newest of these, uses special anti-cancer drugs that identify certain changes in a cell that can lead to cancer. It aims at specific processes of cancer cell growth, division, and lifecycle. Existing targeted therapies for breast cancer include Avastin, Herceptin, Iressa, and Tykerb [1].

Among the recent advancements in medical science is the study of gene expression profiles using DNA microarrays. Data mining of microarray based gene profile data is promising for predicting treatment outcome or drug response. Fortunately, advances in molecular genetics technologies allow us to obtain a global view of the cell, for example by measuring the simultaneous expression of tens of thousands of genes. Using microarray technology, the expression profile data of genes is extracted [2]. Usually this profile data is huge, noisy, and incomplete, and it is not in a format suitable as input to a data mining algorithm. Hence there is a need to transform the original data into a feasible format.

The objective of the project is twofold. First, we develop a method to transform the original data into a feasible format that can be used as input to a data mining algorithm that can recognize association rules. Second, we look for association rules between a group of transcription factors (a module) and target gene behavior in a single time-series breast cancer profile data set.
The following is an example of our desired results, expressed as an association rule:

(TF1, TF2, TF3) → a target gene set

A detailed description of DNA, genes, transcription factors, enhancers, and gene expression is given in Chapter 2. Chapter 3 describes the data files and formats used in our analysis and presents the experimental design approach. Chapter 4 discusses the data preprocessing steps performed before the actual data mining. Chapter 5 describes the two stage association rule mining approach. Chapter 6 describes all experiments, their aim, methodology, and results. These experiments are carried out to derive association rules between sets of transcription factors and sets of genes. Chapter 7 concludes with the results of the experiments, the learning experience, and the future implementation plan.

Chapter 2
BACKGROUND

The molecular basis for genes is deoxyribonucleic acid (DNA). DNA is composed of a chain of nucleotides, of which there are four types: adenine (A), cytosine (C), guanine (G), and thymine (T). Genetic information exists in the sequence of these nucleotides. In this chapter, we introduce important definitions such as DNA, genes, transcription factors, enhancers, and gene expression, along with their significance.

2.1 Deoxyribonucleic Acid (DNA)

DNA is a nucleic acid that contains the genetic instructions used in the development and functioning of all known living organisms and some viruses. The main role of DNA molecules is the long-term storage of information. DNA is often compared to a set of blueprints, a recipe, or a code, since it contains the instructions needed to construct other components of cells, such as proteins and RNA molecules. All organisms consist of cells. Cells have a nucleus, and inside the nucleus there is DNA, which encodes the "program" for making future organisms. DNA has coding and non-coding segments, and the coding segments, called "genes", specify the structure of proteins.
For instance, hemoglobin is a protein, and a gene specifies its structure. The genes carry this genetic information, but other DNA sequences have structural purposes or are involved in regulating the use of this genetic information [2].

2.2 Gene

A gene is a unit of heredity in a living organism. It normally resides on a stretch of DNA that codes for a type of protein or for an RNA chain that has a function in the organism. All living things depend on genes, as they specify all proteins and functional RNA chains. Genes hold the information to build and maintain an organism's cells and pass genetic traits to offspring [3].

2.3 Transcription Factor

A transcription factor (sometimes called a sequence-specific DNA binding factor) is a protein that binds to specific DNA sequences and thereby controls the transfer (or transcription) of genetic information from DNA to mRNA. Transcription factors perform this function alone or with other proteins in a complex, by promoting (as an activator) or blocking (as a repressor) the recruitment of RNA polymerase (the enzyme that performs the transcription of genetic information from DNA to RNA) to specific genes.

Transcription factors are essential for the regulation of gene expression and are, consequently, found in all living organisms. The number of transcription factors found within an organism increases with genome size, and larger genomes tend to have more transcription factors per gene. There are approximately 2600 proteins in the human genome that contain DNA-binding domains, and most of these are presumed to function as transcription factors. Therefore, approximately 10% of genes in the genome code for transcription factors, which makes this family the single largest family of human proteins.
Furthermore, genes are often flanked by several binding sites for distinct transcription factors, and efficient expression of each of these genes requires the cooperative action of several different transcription factors (see, for example, hepatocyte nuclear factors). Hence, the combinatorial use of a subset of the approximately 2000 human transcription factors easily accounts for the unique regulation of each gene in the human genome during development [4].

2.4 Enhancer

An enhancer is a short region of DNA that can bind proteins called activators. Binding of activators to this enhancer region can initiate the transcription of a gene that may be some distance away from the enhancer, or can even be on a different chromosome. The increase in transcription is due to the activators recruiting transcription factors, which enhances the binding of RNA polymerase [5].

2.5 Gene Expression

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are often proteins, but in non-protein coding genes such as rRNA genes or tRNA genes, the product is a functional RNA. The process of gene expression is used by all known life (eukaryotes, including multicellular organisms; prokaryotes, that is, bacteria and archaea; and viruses) to generate the macromolecular machinery for life. Several steps in the gene expression process may be modulated, including transcription, RNA splicing, translation, and post-translational modification of a protein [6].

Gene regulation gives the cell control over structure and function, and is the basis for cellular differentiation, morphogenesis, and the versatility and adaptability of any organism. Gene regulation may also serve as a substrate for evolutionary change, since control of the timing, location, and amount of gene expression can have a profound effect on the functions (actions) of the gene in a cell or in a multicellular organism.
Measuring gene expression is an important part of many life sciences: the ability to quantify the level at which a particular gene is expressed within a cell, tissue, or organism can give a huge amount of information. For example, measuring gene expression can:
- Identify viral infection of a cell (viral protein expression)
- Determine an individual's susceptibility to cancer (oncogene expression)
- Find whether a bacterium is resistant to penicillin (beta-lactamase expression)

Figure 1: DNA, Gene, Transcription Factors

In Chapter 3, we describe in detail the data files and formats used in our analysis. We also look at the experimental design approach.

Chapter 3
PROBLEM ANALYSIS

Before preprocessing the data, it is desirable to go over the input files and their formats. We then look at the experimental design approach, in which these data files are used as input.

3.1 Analysis of Input Data files

There are two input files. The first is 'TFvalues_Enhancer.xls', which describes candidate enhancers and the transcription factor attribute values associated with these enhancers. In terms of candidate enhancers, this file is a superset of the second file. The second is 'Enhancer_expression_profile.xls', which describes the gene expression profile data of a breast cancer patient.

a) Data file I – 'TFvalues_Enhancer.xls'

This file describes the details of each candidate enhancer for 80 selected transcription factors. It describes the location of the TF binding site relative to the peak location of the candidate enhancer. The file size is 60.5 MB. It has 41112 rows (~41K) and 241 columns, which is consistent with one identifier column plus three attributes for each of the 80 transcription factors. Each row is a candidate enhancer, so the total number of candidate enhancers is 41112. There are roughly 2500 transcription factors, of which the 80 most important are considered in our analysis and are associated with every candidate enhancer. Transcription factors affect the expression of enhancers.
Enhancers affect the expression of genes. Each transcription factor has three attributes associated with it: position, p-value, and PWM score.

Position: The binding site relative to the peak location.

P-value: In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value. For instance, the statement "there is only one chance in a thousand that this result arose by coincidence" implies a 0.001 level of statistical significance, or p-value. The p-value is the most important attribute in our analysis.

Table 1: Data file I – 'TFvalues_Enhancer.xls' description

Figure 2: TFvalues_Enhancer.xls format

b) Data file II – 'Enhancer_expression_profile.xls'

This file describes the enhancer expression time series profile for a particular breast cancer patient. It contains a subset of the rows of data file I. The file size is 6.67 MB. It has 21997 rows (~22K) and 19 columns. The most important attributes are the enhancer candidate id, the gene name (symbol), and the expression levels of genes over a period of 48 hours (MCF7_3hr, MCF7_6hr, MCF7_9hr, MCF7_12hr, MCF7_24hr, MCF7_48hr). Gene expression is affected by enhancers. For instance, hemoglobin is a protein encoded by a gene. The gene that generates hemoglobin has to be expressed first so that hemoglobin is generated.

Table 2: Data file II – 'Enhancer_expression_profile.xls' description

Column          Attribute
A               Name of the enhancer
B               Chromosome
C               Peak location (enhancer location)
D               Intensity
E               Don't use
F               Location of enhancer (Note: C is the actual location. F is the annotation. It indicates whether the position is intron, 100k upstream, exon, desert (no gene), or 100k downstream.)
G, H, I         Gene name
J               Strand (Note: + means the gene is found in the +ve strand, while – means the gene is found in the –ve strand.)
K               Transcription start site (TSS)
L               Transcription end site
M               Distance between TSS and the peak location
N, O, P, Q, S   The microarray expression level

Note: Columns A – F are for the enhancer candidate; columns G – S are for the breast gene.

Figure 3: Enhancer_expression_profile.xls format

Figure 4 below shows the candidate enhancer information associated with a single gene, "GREB1". As seen in the figure, one gene may have multiple candidate enhancers. However, the expression levels associated with a gene are constant.

Figure 4: Candidate Enhancer Information of a Single Gene

As seen in figure 2 and figure 3, the candidate enhancer attribute is present in both data files. Data file I (TFvalues_Enhancer.xls) describes the details of each candidate enhancer for 80 selected transcription factors. Data file II (Enhancer_expression_profile.xls) describes the enhancer expression time series profile data. Since transcription factors affect the expression of enhancers, and enhancers affect the expression of genes, transcription factors affect the expression of genes. Thus, to find associations between a set of transcription factors and a set of genes, we combine the two files with the candidate enhancer as the pivot, as shown in figure 5.

Figure 5: TF to Gene Relation

3.2 Experimental Design Approach

In this section, we give an overview of the steps followed to achieve our objective: to preprocess the data so that it is feasible as input to a data mining algorithm, and to generate the association rules needed.

Figure 6: Experiment Design Procedure

As shown in figure 6, the step by step process is as follows:

(1) The first step in the experiment design procedure is to study and analyze the given data files described in section 3.1.
(2) The next step is to preprocess this data as discussed in chapter 4.
The steps in data preprocessing include:
(2.1) data reduction: removing redundant attributes
(2.2) data integration: integrating the two files to get a unified view
(2.3) data transformation: transforming the data into a suitable format
(3) The next step is association rule mining using a two stage approach, as discussed in chapter 5. We also perform data mining on data filtered by p-value and support, as explained in sections 6.2, 6.3, and 6.4.
(4) Once the association rules are generated, we compare the results with experimental evidence and analyze them, as discussed in section 6.5.

In chapter 4, we go over the data preprocessing steps performed before the actual data mining.

Chapter 4
DATA PREPROCESSING

Today's real world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple heterogeneous sources. Low quality data leads to low quality mining results, and analyzing data that has not been carefully screened can produce misleading results. Thus, the representation and quality of the data must be addressed first, before running an analysis. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and reduce the time required for the actual mining [7].

As discussed earlier, we have two data files: TFvalues_Enhancer.xls and Enhancer_expression_profile.xls. The data files have missing attribute values in some places, and there are also a few redundant attributes. The data from the two files needs to be integrated to get a unified view and to perform further analysis. Hence the data must first be preprocessed to make it feasible as input to a data mining algorithm.

There are a number of data preprocessing techniques. Data cleaning can be applied to remove noise and correct inconsistencies in the data.
Data integration merges data from multiple sources into a coherent data store. Data transformations such as normalization may be applied [7]. In our preprocessing, we performed the major tasks of data cleaning, data integration, data transformation, data reduction, and data discretization.

Figure 7 shows the data preprocessing steps in the form of a flowchart.

Figure 7: Data Preprocessing Flowchart

4.1 Data Cleaning

Data cleaning involves processing missing values and identifying and removing outliers. The details of data cleaning for the two data files are described below.

4.1.1 Cleaning of Data file I – TFvalues_Enhancer.xls

The first column of the TFvalues_Enhancer.xls file has a comprehensive list of candidate enhancers. Each candidate enhancer has the details of 80 transcription factors (TFs) associated with it. Of the three attributes of every TF, the p-value is the most important. The remaining two attributes, position (less important) and PWM score (redundant), are not used in our analysis. Missing p-values are filled with the average of the remaining values in that column.

4.1.2 Cleaning of Data file II – Enhancer_expression_profile.xls

The Enhancer_expression_profile.xls file has the expression time series profile data of a breast cancer patient. The first column is the candidate enhancer, and the subsequent columns show the expression level at 3, 6, 9, 12, 24, and 48 hours.

Figure 8: Missing Value Handling

Missing expression values at these time stamps are filled with the average of the remaining expression level values, as seen in figure 8. The remaining attributes, such as chromosome, peak_location, mcf7e2, mcf7et, location, geneid, symbol, refseq, strand, TSS, TES, and TSSdistance, are not included in this analysis.

4.2 Data Reduction

A few attributes in the data files are irrelevant to the mining task. Hence, in this step, we use the attribute subset selection technique to reduce the data set size by removing the redundant attributes from the files.
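As a minimal sketch of the cleaning and reduction steps described above, the following Python fragment fills missing expression values with the average of the remaining time points and keeps only the attributes needed for mining. It is an illustration under stated assumptions, not the procedure actually used in this project, which worked on the .xls files directly. The dictionary-based row layout, the `enhancer` key, the function names, and the sample values are hypothetical; the time-point column names and `symbol` come from the data file description.

```python
from statistics import mean

# Time-point columns named in the description of data file II.
TIME_COLUMNS = ["MCF7_3hr", "MCF7_6hr", "MCF7_9hr",
                "MCF7_12hr", "MCF7_24hr", "MCF7_48hr"]

# Assumption: "enhancer" is a hypothetical key for the candidate enhancer id.
KEEP_COLUMNS = ["enhancer", "symbol"] + TIME_COLUMNS


def fill_missing_expression(row, columns=TIME_COLUMNS):
    """Replace missing (None) expression values with the mean of the present ones."""
    present = [row[c] for c in columns if row.get(c) is not None]
    filled = dict(row)
    if not present:
        return filled  # nothing to average; leave the row unchanged
    average = mean(present)
    for c in columns:
        if filled.get(c) is None:
            filled[c] = average
    return filled


def select_attributes(row, keep=KEEP_COLUMNS):
    """Attribute subset selection: keep only the columns needed for mining."""
    return {name: row[name] for name in keep if name in row}


# Made-up example row; the values are illustrative only.
example = {"enhancer": "E1", "symbol": "GREB1", "chromosome": "chr2",
           "MCF7_3hr": 1.0, "MCF7_6hr": None, "MCF7_9hr": 3.0,
           "MCF7_12hr": 2.0, "MCF7_24hr": 2.0, "MCF7_48hr": 2.0}
reduced = select_attributes(fill_missing_expression(example))
print(reduced)
```

Filling is done before selection here so that the averaging only ever sees the expression columns; the order could be swapped without changing the result for this layout.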
The detailed reduction is as follows:

Data file I – TFvalues_Enhancer.xls: Each transcription factor has three attributes: p-value, PWM score, and position. Of these, we remove the PWM score and position attributes, as they are not required for the mining task.

Data file II – Enhancer_expression_profile.xls: The attributes chromosome, peak_location, mcf7e2, mcf7et, location, geneid, refseq, strand, TSS, TES, and TSSDistance, shown in figure 9, are not required for the mining task and were removed.

Figure 9: Data Reduction

4.3 Data Transformation

In this section, we discuss the method of classifying the genes, the filtering method based on a p-value threshold, and the method of splitting data file I (TFvalues_Enhancer.xls) on the basis of our classification.

4.3.1 Classification

Genes are classified on the basis of the difference between the expression (intensity) level at 48 hours and that at 3 hours:

Difference = MCF7_48hr – MCF7_3hr
If Difference > 0, the target gene is up expressed, denoted as "↑".
If Difference < 0, the target gene is down expressed, denoted as "↓".

This can be further classified according to the degree of change, as shown in Table 3. For instance, as shown in figure 10, if Difference >= 10, the intensity of this class of genes is increasing and the class receives label F; the target gene is therefore up expressed. The association rule for that target gene is written as:

TF1 & TF2 → Target gene set ↑F

Here, "↑F" denotes that the target gene is up and increasing. Table 3 defines the labels for the different degrees of change.
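The classification rule can be sketched in Python as follows. This is an illustration only: the function name and the list-of-labels return value are assumptions, while the thresholds are those listed in Table 3. As written, the thresholds overlap (a difference of 11 satisfies >= 1, >= 5, and >= 10), so the sketch returns every label whose condition holds; a difference that falls between the listed ranges, such as 0.25, receives no label.

```python
def class_labels(expr_3hr, expr_48hr):
    """Return the Table 3 labels that apply to a gene.

    The difference is the 48 hour expression level minus the 3 hour level.
    """
    difference = expr_48hr - expr_3hr
    labels = []
    if difference <= -1:
        labels.append("A")  # down, decreasing
    if difference < -0.5:
        labels.append("B")  # down, decreasing
    if difference == 0:
        labels.append("C")  # no change
    if difference >= 1:
        labels.append("D")  # up, increasing
    if difference >= 5:
        labels.append("E")  # up, increasing
    if difference >= 10:
        labels.append("F")  # up, increasing
    return labels


# Made-up expression values, for illustration only.
print(class_labels(1.0, 12.0))
print(class_labels(2.0, 2.0))
```

Returning a list rather than a single label mirrors the way one gene can fall into more than one intensity range when the data is later split by class.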
Figure 10: Data Classification

Table 3: Classification Table

Difference           Target gene expression   Class        Label
Difference <= -1     DOWN                     Decreasing   A
Difference < -0.5    DOWN                     Decreasing   B
Difference = 0       NO CHANGE                No change    C
Difference >= 1      UP                       Increasing   D
Difference >= 5      UP                       Increasing   E
Difference >= 10     UP                       Increasing   F

4.3.2 P-value Threshold

There are roughly 2500 transcription factors (TFs), of which only 80 significant TFs are considered in our analysis. Even with 80 TFs there are mathematically 2^80 (about 1.2 × 10^24) possible combinations that could trigger a gene, which is astronomically high, so the search space must be reduced.

In statistics, a result is called statistically significant if it is unlikely to have occurred by chance. The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the significance level or critical p-value. For instance, the statement "there is only one chance in a thousand that this result arose by coincidence" implies a 0.001 level of statistical significance, or p-value. We have chosen a p-value threshold of 0.002 for experiment 2 and 0.005 for experiment 3; that is, a TF is retained only if a result at least this strong would be expected by chance less than 0.2% or 0.5% of the time, respectively. We thus reduce the search space by selecting only those TFs whose p-value is below 0.002 or 0.005 [8], and such a TF is assumed significant in experiment 2 or experiment 3, respectively. Hence the p-value is transformed to the label "T" or "F" as shown in the following table:

Table 4: P-value Table

P-value                    Label
P-value < 0.002/0.005      T (stands for true)
P-value >= 0.002/0.005     F (stands for false)

Figure 11: P-value Classification

4.3.3 Splitting the File

Based on the classification done in the step above, data file I is split into six chunks (sub-files), as shown in Table 5.

Table 5: File Table

No.   File Name                          Intensity range   Size (KB)
1     greater_than_or_equal_to_10.xls    {10,27}           29
2     greater_than_or_equal_to_5.xls     {5,27}            42
3     greater_than_or_equal_to_1.xls     {1,27}            265
4     equal_to_0.xls                     {0}               1839
5     less_than_-0.5.xls                 {-5,-0.5}         434
6     less_than_-1.xls                   {-5,-1}           37

The column labels used in the table are as follows:
Intensity range: the interval in which the value "MCF7(48hr) – MCF7(3hr)" lies. For instance, if the value of "MCF7(48hr) – MCF7(3hr)" is 11, it lies in the interval {10,27} and can therefore be found in the file "greater_than_or_equal_to_10.xls".
Size: the size of the sub-file.

4.4 Data Integration

To identify the association rules, both files, TFvalues_Enhancer.xls and Enhancer_expression_profile.xls, play an important role, and we need the data from both files together. Hence, in this step, the two original files are merged into a third file.

Figure 12: Data Integration

The new file is large, about 65 MB, and has 82 columns in total. The first column is "New tag" (the name of the enhancer), the second column is "symbol" (the gene name) from data file II, and the remaining 80 columns are the p-values of the 80 TFs from data file I, TFvalues_Enhancer.xls. For example, the column label "F$HAC1_Q2" is the name of a transcription factor, and that column contains the p-value corresponding to each candidate enhancer. For ease of understanding, we have labeled the 80 transcription factors TF1 through TF80. The table below shows the one to one mapping between each TF name and its corresponding number.
Table 6: TF Name and Number Mapping Table
TF number | TF name
TF1 | F$HAC1_Q2
TF2 | F$LEU3_B
TF3 | I$GAGAFACTOR_Q6
TF4 | P$ABF1_03
TF5 | V$AHRARNT_02
TF6 | V$ALPHACP1_01
TF7 | V$AML_Q6
TF8 | V$AP1_C
TF9 | V$AP1_Q2
TF10 | V$AP2ALPHA_01
TF11 | V$AP2GAMMA_01
TF12 | V$AP4_Q6_01
TF13 | V$AR_01
TF14 | V$AREB6_01
TF15 | V$ATF3_Q6
TF16 | V$BACH1_01
TF17 | V$BACH2_01
TF18 | V$BEL1_B
TF19 | V$CACCCBINDINGFACTOR_Q6
TF20 | V$CBF_02
TF21 | V$CMYB_01
TF22 | V$CP2_02
TF23 | V$CREB_Q4
TF24 | V$DBP_Q6
TF25 | V$DEAF1_01
TF26 | V$E2_01
TF27 | V$E2F_Q6_01
TF28 | V$EBF_Q6
TF29 | V$ELK1_02
TF30 | V$ER_Q6
TF31 | V$ER_Q6_02
TF32 | V$ETS1_B
TF33 | V$FREAC4_01
TF34 | V$GATA1_01
TF35 | V$GATA1_04
TF36 | V$GCM_Q2
TF37 | V$GR_Q6
TF38 | V$HEN1_02
TF39 | V$HES1_Q2
TF40 | V$HIC1_02
TF41 | V$HNF3ALPHA_Q6
TF42 | V$KROX_Q6
TF43 | V$LMAF_Q2
TF44 | V$LRH1_Q5
TF45 | V$MEF3_B
TF46 | V$MEIS1_01
TF47 | V$MIF1_01
TF48 | V$MINI19_B
TF49 | V$MOVOB_01
TF50 | V$MTF1_Q4
TF51 | V$MYOGNF1_01
TF52 | V$NF1_Q6
TF53 | V$NFKAPPAB_01
TF54 | V$NRF2_Q4
TF55 | V$OLF1_01
TF56 | V$P53_01
TF57 | V$PAX3_01
TF58 | V$PAX5_01
TF59 | V$PAX6_Q2
TF60 | V$PAX9_B
TF61 | V$PEBP_Q6
TF62 | V$PXR_Q2
TF63 | V$ROAZ_01
TF64 | V$SMAD3_Q6
TF65 | V$SMAD4_Q6
TF66 | V$SOX10_Q6
TF67 | V$SP1_01
TF68 | V$SP3_Q3
TF69 | V$SREBP_Q3
TF70 | V$STAT1_01
TF71 | V$TEF1_Q6
TF72 | V$TGIF_01
TF73 | V$USF_Q6_01
TF74 | V$VMAF_01
TF75 | V$WHN_B
TF76 | V$WT1_Q6
TF77 | V$XPF1_Q6
TF78 | V$YY1_02
TF79 | V$ZF5_01
TF80 | V$ZNF219_01

In the next chapter, we discuss the data mining tool Weka. We also look at the two-stage association rule mining approach and the interpretation of the results we received from Weka and from the p-value based filtering method.

Chapter 5

ASSOCIATION RULE MINING

After preprocessing the data, our aim is to look for association rules between the set of transcription factors and the set of genes. The figure below highlights the association rule mining step of this project.

Figure 13: Association Rule Mining

Association rule learning is a method for discovering interesting relations between variables in a large database. We have developed a two-stage association rule mining approach.
In the first stage, we generate association rules using Weka, a machine learning tool, and then interpret the results obtained from Weka. In the second stage, association rules are generated by filtering the data in Excel on the basis of a p-value threshold.

5.1 About Weka

Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning software written in Java, developed at the University of Waikato, New Zealand. WEKA is free software available under the GNU General Public License. It runs on almost any platform and has been tested under Linux, Windows, and Macintosh operating systems, and even on a personal digital assistant. Weka contains a collection of visualization tools and algorithms for data analysis and predictive modeling, together with graphical user interfaces for easy access to its functionality. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning [11].

5.2 Association Rule Mining Method

After preprocessing the data, our aim is to identify a suitable data mining method that can recognize the association rules. Weka has three association rule learners.

(1) Apriori implements the Apriori algorithm. It starts with a minimum support of 100% of the data items and decreases this in steps of 5% until there are at least 10 rules with the required minimum confidence of 0.9, or until the support has reached a lower bound of 10%. Association rules show attribute-value conditions that occur frequently together in a given dataset. Support and confidence are two measures of rule interestingness. They respectively reflect the usefulness and certainty of discovered rules. Let I = {i1, i2, … , in} be a set of n binary attributes called items. Let D = {t1, t2, … , tm} be a set of m transactions called the database.
Each transaction in D has a unique transaction ID and contains a subset of the items in I. A rule is defined as an implication of the form X => Y where X and Y are subsets of I and X ∩ Y = ∅. The sets of items (for short, itemsets) X and Y are called the antecedent (left-hand side or LHS) and the consequent (right-hand side or RHS) of the rule respectively. The support supp(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X) [12].

(2) Predictive Apriori combines confidence and support into a single measure of predictive accuracy and finds the best n association rules in order.

(3) Tertius finds rules according to a confirmation measure, seeking rules with multiple conditions in the consequent, like Apriori, but differing in that these conditions are ORed together, not ANDed [11].

Out of the three, we have used Apriori, which is the simplest and meets our needs.

5.3 Using Weka

The easiest way to use Weka is through a graphical user interface called Explorer. This gives access to all of its functionality using menu selection and form filling.

5.3.1 Loading the Data

In addition to the native ARFF data file format, WEKA can read files in ".csv" format. This is fortunate, since many databases and spreadsheet applications can save or export data into flat files in this format. We load the data set in ".csv" format into WEKA and then perform association rule mining on the input data set. While all of these operations can be performed from the command line, we use the WEKA Explorer GUI, which is easy to use.

Figure 14 shows a snapshot of the first step: in the Preprocess tab, clicking "open" and navigating to the directory containing the data file (.csv or .arff). In this example we will open the GREB1.csv file.
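A file in this layout can also be prepared outside a spreadsheet. The following is only a minimal sketch, not the procedure used in this project: it assumes the column names "NewTag" and "symbol" followed by one column per TF, uses hypothetical enhancer ids and p-values, and applies an assumed threshold of 0.001 to turn each p-value into the nominal label "T" or "F" before writing CSV text.

```python
import csv
import io

P_VALUE_THRESHOLD = 0.001  # assumed cut-off; other experiments use 0.002 or 0.005


def label_p_value(p_value, threshold=P_VALUE_THRESHOLD):
    """Return "T" if the p-value is below the threshold, otherwise "F"."""
    return "T" if p_value < threshold else "F"


def to_weka_csv(rows, tf_names, threshold=P_VALUE_THRESHOLD):
    """Render (NewTag, symbol, p-values) rows as CSV text with nominal T/F columns."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["NewTag", "symbol"] + list(tf_names))
    for new_tag, symbol, p_values in rows:
        labels = [label_p_value(p, threshold) for p in p_values]
        writer.writerow([new_tag, symbol] + labels)
    return buffer.getvalue()


# Hypothetical enhancer ids and p-values, for illustration only.
sample_rows = [
    ("enh_0001", "GREB1", [0.0004, 0.2000, 0.0009]),
    ("enh_0002", "GREB1", [0.0300, 0.0001, 0.0500]),
]
csv_text = to_weka_csv(sample_rows, ["TF1", "TF2", "TF3"])
print(csv_text)
```

Writing the labels as the strings "T" and "F" rather than numbers means each TF column contains only nominal values, which is the form of input used for association rule mining in this project.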
GREB1.csv is obtained by integrating the two original files TFvalues_enhancer.xls and Enhancer_expression_profile.xls as seen in section 4.4.

Figure 14: Loading Data

Figure 15 shows a sample section of the "GREB1.csv" file in the format accepted by Weka. The file has 82 columns. The first column is "NewTag", which contains the candidate enhancer id; the second column is "symbol", which contains the gene name for that candidate enhancer from data file 2, 'Enhancer_expression_profile.xls'; and the remaining 80 columns are the p-values of the 80 TFs from file 1, 'TFvalues_enhancer.xls'.

Figure 15: GREB1.csv File Format

Once the data is loaded, WEKA recognizes the attributes and computes some basic statistics on each attribute. The left panel in figure 16 shows the list of recognized attributes. Clicking on any attribute in the left panel shows the basic statistics for that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean and standard deviation. For instance, attribute "TF1" has some "T" and some "F" values in its column. Hence the frequency of each of them, which is the number of times it occurs, is shown in the count column of the right panel.

Figure 16: Statistics Window

5.3.2 Association Rule Mining Using Weka

Once the data is loaded and the attributes are recognized by Weka, the next step is to generate association rules. Clicking on the "Associate" tab (Figure 16) brings up the interface for the association rule algorithms. The Apriori algorithm, which we will use, is the default algorithm selected. However, in order to change the parameters for this run (e.g., support, confidence) we click on the text box immediately to the right of the "Choose" button (Figure 16). Note that this box, at any given time, shows the specific command-line arguments that are to be used for the algorithm. The dialog box for changing the parameters is shown in figure 17.
Here, you can specify various parameters associated with Apriori.

Figure 17: Parameter Editor Window

After the execution of the Apriori algorithm is completed, the result can be seen in the "Associator output" window. The association rules generated appear in this window, as shown in figure 18.

Figure 18: Associator Output Window

5.4 Result Interpretation

As discussed in section 4.3.2, we are interested in finding sets of TFs that are significant in triggering a gene to go up or go down. From figure 18, we can see that the TFs labeled "T" are significant for the upward expression of the GREB1 gene. Hence we are only interested in TFs having the label "T". We therefore discard the TFs with label "F" and write our association rule. For instance, let us look at rule 6 in figure 18. This rule contains a set of TFs: TF30, TF31, TF66, TF71 and TF79. Out of these TFs, only TF79 has the label "T", and therefore we choose TF79 and discard the others. The association rule is then rewritten as TF79 = "T" => symbol = "GREB1", which can be simplified to TF79 => GREB1. The rule suggests that, out of all the associated TFs, TF79 is significant in the expression of gene "GREB1". Further details and more association rules obtained using WEKA are described in experiment 1 of chapter 6.

5.5 P-value based Filtering Method

Using Weka's association rule mining method, we obtained a set of association rules between a set of transcription factors and a single gene. We would like to get association rules between a set of TFs and a set of genes. Towards that end, in sections 6.2 through 6.4, we describe our method, in which the association rules are obtained by filtering the data in MS-Excel on the basis of a p-value threshold.

Consider the file greater_than_or_equal_to_10.xls of section 4.3.3. This file contains the set of genes which are up-expressed, as discussed in section 4.3.1. We filter the data in this file on the basis of the p-value threshold.
For instance, if p-value < 0.005 then we label that TF as "T", else "F".

As seen in figure 19, the file contains the p-values of 80 TFs associated with candidate enhancers. We filter the p-values by applying a function such as IF (column value < 0.005,"T","F") and append the corresponding result of the filtering at the end of the file, as shown in figure 19. TFs labeled "T" are considered to be significant.

Figure 19: Filtering using P-value Threshold for a Set of Genes which are Up Expressed

If the TF value is always true for a gene, then we select that TF by checking the TF in the Excel sheet as shown in figure 20.

Figure 20: The Checked Boxes correspond to TFs which are Significant

For instance, in (row, column) (2,2) of figure 20, the GREB1 gene has been checked for TF1, which implies that TF1 is always significant for the GREB1 gene. Similarly, for (row, column) (5,2), TF1 is not significant for the MGP gene, so the corresponding box is not checked. Using this approach, we can see that TF3, TF5 and TF6 are significant for all the genes GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1 and SYTL4, whereas TF1 is significant for only 5 genes: GREB1, CDH26, FLJ30058, PKIB and SGK1.

The next step is to determine support. If, for a subset of genes, a TF is significant (i.e., its p-value < 0.005) for α% of the genes, then the association rule is said to hold at the "support = α%" level. In this file we define α = 100 and consider only those TFs in our association rule which satisfy support = 100%. We do this by checking those TFs in the Excel sheet of figure 20 which are significant 100% of the time. For example, TF3, TF5 and TF6 are always significant for the set of genes of figure 20, namely GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1 and SYTL4:

TF3 ∩ TF5 ∩ TF6 => GREB1 ∩ CDH26 ∩ FLJ30058 ∩ MGP ∩ PKIB ∩ SGK1 ∩ SYTL4

If on the other hand support = 70%, then TF1 would also be significant (as seen in figure 19, TF1 is true for 5 out of 7 genes, implying 5/7 = 71% support).
So the association rule would be:

TF1 ∩ TF3 ∩ TF5 ∩ TF6 => GREB1 ∩ CDH26 ∩ FLJ30058 ∩ MGP ∩ PKIB ∩ SGK1 ∩ SYTL4

Figure 21: Two-Stage Association Rule Mining Approach

Figure 21 shows an overview of the two-stage association rule mining approach discussed in sections 5.3, 5.4 and 5.5. In experiment 1 of chapter 6, the stage 1 approach is followed, and in experiments 2, 3 and 4, the stage 2 approach is followed.

In chapter 6, we look into all the experiments in detail: their aim, methodology and results.

Chapter 6

EXPERIMENTS

In this chapter, we describe four experiments that we carried out to derive association rules between a set of transcription factors and a set of genes. In the procedure of experiment 1, which is described in section 6.1, we have used the Weka tool only. The two-stage association rule mining method is used in experiments 2, 3 and 4. In each experiment, we have considered a few behaviors, such as the expression/intensity level of genes increasing (Category: E/F) and decreasing (Category: A), as described in section 4.3.1. In each experiment, association rules are generated with various support and confidence values.

6.1 Experiment 1

Aim: To derive an association rule for a given target gene.

Procedure: TFs whose p-value < 0.001 are assumed to be significant. Based on this p-value criterion, the numerical p-values are converted to the nominal values "T" or "F". This preprocessed data is then given as input to Weka, which generates association rules for every gene. For example, a generated association rule is TF30 = "F", TF31 = "F", TF66 = "F", TF71 = "F", TF79 = "T" => symbol = "GREB1". This association rule is further interpreted to show which TFs are responsible for the expression of a given gene. For example, TF79 = "T" => symbol = "GREB1", which can be simplified to TF79 => GREB1. The rule suggests that, out of all the associated TFs, TF79 is significant in the expression of gene "GREB1". Support is considered to be 100%.
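The interpretation step of this procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the function name interpret_rule and the dictionary representation of a rule are our assumptions, not part of Weka, and the rule shown is the GREB1 example discussed above.

```python
def interpret_rule(antecedent, gene):
    """Keep only the TFs labeled "T" and rewrite the rule as 'TFs => gene'."""
    significant = [tf for tf, label in antecedent.items() if label == "T"]
    if not significant:
        return None  # no significant TF, so no simplified rule is written
    return " ∩ ".join(significant) + " => " + gene


# Antecedent of the example rule: TF number mapped to its nominal label.
weka_rule = {"TF30": "F", "TF31": "F", "TF66": "F", "TF71": "F", "TF79": "T"}
simplified = interpret_rule(weka_rule, "GREB1")
print(simplified)  # TF79 => GREB1
```

TFs labeled "F" are dropped from the antecedent, so a rule whose antecedent contains no "T" label yields no simplified rule at all.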
6.1.1 Behavior: Expression of gene increasing (Category: F)

As discussed in section 4.3.1, the classification of genes into different classes is done on the basis of the difference between the expression/intensity level values at 48hr and at 3hr. If difference >= 10, then for this class of genes the intensity is increasing, with label F. Therefore the target gene is up-expressed. The association rule for that target gene is written as: TF1 & TF2 => Target gene set (F). Here, "(F)" denotes that the target gene is up and increasing.

Let us consider rule no. 1 from table 7. From the rule, it can be seen that the transcription factors F$LEU3_B, V$DEAF1_01, V$E2F_Q6_01, V$MYOGNF1_01, V$SP3_Q3, V$TGIF_01, V$WHN_B and V$ZF5_01 are significant in the up expression of the gene "MGP". The remaining rules of the table can be interpreted similarly.

Table 7: Association Rules Table 1
Rule no. | Gene name | Rules by TF name | Confidence (%)
1 | MGP | F$LEU3_B ∩ V$DEAF1_01 ∩ V$E2F_Q6_01 ∩ V$MYOGNF1_01 ∩ V$SP3_Q3 ∩ V$TGIF_01 ∩ V$WHN_B ∩ V$ZF5_01 => MGP (F) | 100
2 | CDH26 | F$LEU3_B ∩ V$AP2ALPHA_01 ∩ V$DEAF1_01 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$ELK1_02 ∩ V$GATA1_04 ∩ V$GR_Q6 ∩ V$P53_01 ∩ V$PEBP_Q6 ∩ V$SP1_01 ∩ V$YY1_02 ∩ V$ZNF219_01 => CDH26 (F) | 100
3 | GREB1 | V$ZF5_01 => GREB1 (F) | 100
4 | SGK1 | V$SMAD4_Q6 ∩ V$WHN_B => SGK1 (F) | 90
5 | PKIB | V$WT1_Q6 => PKIB (F) | 100
6 | FLJ30058 | F$HAC1_Q2 ∩ F$LEU3_B ∩ V$AHRARNT_02 ∩ V$ALPHACP1_01 ∩ V$AP1_C ∩ V$AP1_Q2 ∩ V$AR_01 ∩ V$BACH1_01 ∩ V$BACH2_01 ∩ V$CP2_02 ∩ V$DEAF1_01 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$EBF_Q6 ∩ V$ELK1_02 ∩ V$FREAC4_01 ∩ V$GR_Q6 ∩ V$HEN1_02 ∩ V$MIF1_01 ∩ V$MINI19_B ∩ V$MTF1_Q4 ∩ V$MYOGNF1_01 ∩ V$PAX5_01 ∩ V$PAX9_B ∩ V$ROAZ_01 ∩ V$SP1_01 ∩ V$SREBP_Q3 ∩ V$VMAF_01 ∩ V$ZF5_01 => FLJ30058 (F) | 100
7 | SYTL4 | F$HAC1_Q2 ∩ F$LEU3_B ∩ V$AHRARNT_02 ∩ V$AML_Q6 ∩ V$AP1_C ∩ V$AREB6_01 ∩ V$ATF3_Q6 ∩ V$BACH1_01 ∩ V$BACH2_01 ∩ V$CACCCBINDINGFACTOR_Q6 ∩ V$CBF_02 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$EBF_Q6 ∩ V$ER_Q6 ∩ V$GATA1_01 ∩ V$HIC1_02 ∩ V$KROX_Q6 ∩ V$MINI19_B ∩ V$MTF1_Q4 ∩ V$MYOGNF1_01 ∩ V$NFKAPPAB_01 ∩ V$OLF1_01 ∩ V$P53_01 ∩ V$PAX3_01 ∩ V$PAX6_Q2 ∩ V$SP1_01 ∩ V$STAT1_01 ∩ V$USF_Q6_01 ∩ V$WT1_Q6 ∩ V$YY1_02 ∩ V$ZF5_01 => SYTL4 (F) | 100

6.1.2 Behavior: Expression of gene decreasing (Category: A)

Table 8 shows the association rules for the decreasing behavior (Category: A) as described in section 4.3.1.

Table 8: Association Rules Table 2
Rule no. | Gene name | Rules by TF name | Confidence (%)
1 | CYP24A1 | V$SP1_01 ∩ V$SP3_Q3 ∩ V$WHN_B ∩ V$WT1_Q6 => CYP24A1 (A) | 100
2 | KCNK5 | F$LEU3_B ∩ V$WHN_B ∩ V$ZF5_01 => KCNK5 (A) | 100
3 | TGM2 | V$WHN_B => TGM2 (A) | 100
4 | ZSCAN2 | V$AP2GAMMA_01 ∩ V$AR_01 ∩ V$ZF5_01 => ZSCAN2 (A) | 100
5 | CISH | V$DEAF1_01 ∩ V$E2F_Q6_01 ∩ V$MEF3_B ∩ V$NFKAPPAB_01 => CISH (A) | 100
6 | LOC6529 | V$BACH2_01 ∩ V$BEL1_B ∩ V$E2F_Q6_01 ∩ V$HES1_Q2 => LOC6529 (A) | 100
7 | MAP6D1 | V$E2F_Q6_01 ∩ V$KROX_Q6 ∩ V$PXR_Q2 ∩ V$ZF5_01 => MAP6D1 (A) | 100
8 | P2RY2 | V$WHN_B ∩ V$EBF_Q6 ∩ V$E2F_Q6_01 ∩ V$ALPHACP1_01 => P2RY2 (A) | 80
9 | PSCA | V$E2F_Q6_01 => PSCA (A) | 100
10 | SCNN1A | V$E2F_Q6_01 ∩ V$GR_Q6 ∩ V$MEIS1_01 ∩ V$ZF5_01 => SCNN1A (A) | 90
11 | SYT12 | V$MEIS1_01 => SYT12 (A) | 90
12 | VASN | V$HNF3ALPHA_Q6 ∩ V$GATA1_04 ∩ V$FREAC4_01 ∩ V$DBP_Q6 => VASN (A) | 90

Let us consider rule 1 from table 8. The rule means that the transcription factors V$SP1_01, V$SP3_Q3, V$WHN_B and V$WT1_Q6 are significant for the downward expression of the "CYP24A1" gene. Only category F and category A from the classification table (table 3) are considered in this experiment.

Need for improvement: As we see in tables 7 and 8 above, the association rules generated only reflect the impact of a subset of TFs on a single target gene, not on a subset of genes. Based on the feedback from our collaborator in the genome centre, to obtain more useful results we need to improve on the method used for experiment 1. Therefore, our goal in the next experiment is to identify a collection of association rules between a subset of TFs and a subset of genes.
Expression levels are available at different points of time: 3 hrs, 6 hrs, 12 hrs, 24 hrs and 48 hrs. We would like to find out which breast cancer genes are associated by their peaking intensity level. There are many possibilities. One is that the gene expression peaks at 3 hrs and the intensity reduces by 48 hrs. Another is that the gene expression is low at 3 hrs and peaks at 48 hrs. In experiments 2, 3 and 4 we consider these possibilities. Yet another possibility is that the expression level is low at 3 hrs and peaks at 24 hrs. It could also be true that the breast cancer genes peak at 3 hrs and 48 hrs and have their minimum expression level at 24 hrs. We have mentioned this in future work.

6.2 Experiment 2

In this experiment, we want to improve on experiment 1. The association rules should target a set of genes instead of just one gene. Therefore our aim was to identify a collection of association rules between a subset of the 80 TFs and a subset of genes with strong support (for example, support >= 70%) and confidence = 99.8%.

The methodology is as follows. We first defined the p-value threshold as 0.002 and the selection criterion as p-value < 0.002. In other words, a transcription factor whose p-value < 0.002 is assumed to be significant. Next we define the support strength. For this experiment, TFs having support >= 70% are considered to be strong. For instance, if in a set of 10 genes a particular TF's p-value is less than 0.002 for 7 genes, then that TF is said to be significant 70% of the time.

Formal description of the "support = α%" association rule: if, for a subset of genes, a TF is significant (i.e., its p-value < 0.002) for α% of the genes, then the rule is said to be a "support = α%" association rule. In our experiment we define α = 70 and consider only those TFs in our association rule which satisfy support >= 70%.
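The support computation described above can be sketched as follows. This is a minimal illustration rather than the Excel procedure itself: the function names tf_support and build_rule are hypothetical, and the labels reuse the seven-gene example of section 5.5 (TF3, TF5 and TF6 significant for every gene, TF1 significant for five of the seven), not the data of this experiment.

```python
def tf_support(labels_by_gene):
    """Return, for each TF, the percentage of genes in which it is labeled "T"."""
    genes = list(labels_by_gene)
    tf_names = list(labels_by_gene[genes[0]])
    return {
        tf: 100.0 * sum(labels_by_gene[g][tf] == "T" for g in genes) / len(genes)
        for tf in tf_names
    }


def build_rule(labels_by_gene, alpha, category):
    """Join the TFs whose support is at least alpha percent into one rule string."""
    support = tf_support(labels_by_gene)
    selected = [tf for tf, value in support.items() if value >= alpha]
    genes = ", ".join(labels_by_gene)
    return " ∩ ".join(selected) + " => (" + genes + ") (" + category + ")"


GENES = ["GREB1", "CDH26", "FLJ30058", "MGP", "PKIB", "SGK1", "SYTL4"]
TF1_NOT_SIGNIFICANT = {"MGP", "SYTL4"}  # TF1 is "T" for the other five genes

labels_by_gene = {
    gene: {
        "TF1": "F" if gene in TF1_NOT_SIGNIFICANT else "T",
        "TF3": "T",
        "TF5": "T",
        "TF6": "T",
    }
    for gene in GENES
}

print(build_rule(labels_by_gene, 100, "F"))
# TF3 ∩ TF5 ∩ TF6 => (GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1, SYTL4) (F)
print(build_rule(labels_by_gene, 70, "F"))
# TF1 ∩ TF3 ∩ TF5 ∩ TF6 => (GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1, SYTL4) (F)
```

Lowering α from 100 to 70 admits TF1, whose support is 5/7, or about 71%, which mirrors how relaxing the support threshold lengthens the rule.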
The results contain two parts:

(1) Results of experiment 2 with α = 70:

a) Behavior: Expression of genes increasing (Category: F)

As discussed in section 4.3.1, the classification of genes into different classes is done on the basis of the difference between the expression/intensity level values at 48hr and at 3hr. If difference >= 10, then for this class of genes the intensity is increasing, with label F. Therefore the target gene is up-expressed. The association rule for that target gene is written as: TF1 & TF2 => Target gene set (F). Here, "(F)" denotes that the target gene is up and increasing.

TF5 ∩ TF15 ∩ TF23 ∩ TF24 ∩ TF26 ∩ TF30 ∩ TF31 ∩ TF35 ∩ TF37 ∩ TF41 ∩ TF43 ∩ TF45 ∩ TF54 ∩ TF59 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF70 ∩ TF71 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26) (F)

This rule means that the set of TFs including TF5, TF15, TF23, TF24, TF26, TF30, TF31, TF35, TF37, TF41, TF43, TF45, TF54, TF59, TF61, TF62, TF64, TF70 and TF71 plays a significant role in the up expression of the set of genes including GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058 and CDH26.

Note: TFs marked in red are the well known activators such as AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04) and HNF3 (TF41 - HNF3ALPHA_Q6).

b) Behavior: Expression of genes decreasing (Category: A)

As discussed in section 4.3.1, the classification of genes into different classes is done on the basis of the difference between the expression/intensity level values at 48hr and at 3hr. If difference <= -1, then for this class of genes the intensity is decreasing, with label A. Therefore the target gene is down-expressed. The association rule for that target gene is written as: TF1 & TF2 => Target gene set (A). Here, "(A)" denotes that the target gene is down and decreasing.
TF3 ∩ TF4 ∩ TF5 ∩ TF6 ∩ TF8 ∩ TF12 ∩ TF13 ∩ TF17 ∩ TF18 ∩ TF20 ∩ TF31 ∩ TF32 ∩ TF36 ∩ TF38 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF56 ∩ TF57 ∩ TF63 ∩ TF66 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF77 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) (A)

This rule means that the set of TFs including TF3, TF4, TF5, TF6, TF8, TF12, TF13, TF17, TF18, TF20, TF31, TF32, TF36, TF38, TF43, TF44, TF45, TF56, TF57, TF63, TF66, TF69, TF70, TF71 and TF77 plays a significant role in the down expression of the set of genes CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN and ZSCAN2 with support = 70% and confidence = 99.8%.

(2) Results of experiment 2 with α = 85:

The association rule considers a TF only if it satisfies support >= 85%.

a) Behavior: Expression of genes increasing (Category: F)

TF5 ∩ TF30 ∩ TF31 ∩ TF37 ∩ TF54 ∩ TF61 ∩ TF62 ∩ TF71 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26) (F)

This rule means that the set of TFs including TF5, TF30, TF31, TF37, TF54, TF61, TF62 and TF71 is significant in the up expression of the set of genes including GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058 and CDH26 with support >= 85% and confidence = 99.8%.

b) Behavior: Expression of genes decreasing (Category: A)

TF31 ∩ TF32 ∩ TF44 ∩ TF71 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) (A)

From the rule, it can be seen that the set of TFs including TF31, TF32, TF44 and TF71 is significant in the down expression of the set of genes including CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN and ZSCAN2 with support >= 85% and confidence = 99.8%.

6.3 Experiment 3

In experiment 2, it can be seen in both parts of the result that very few well known activators were present.
Expecting to obtain more well known activators, in experiment 3 we use a different approach. Our aim is to relax the p-value threshold, increase the support and analyze the results. We have therefore relaxed the p-value threshold to 0.005 (confidence = 99.5%) and increased the support α to 90%. The selection criterion is p-value < 0.005, so a transcription factor whose p-value < 0.005 is assumed to be significant.

Results of experiment 3 with α = 90%:

a) Behavior: Expression of genes decreasing (Category: A)

Considering the possibility that the expression level is at its peak at 3 hours and subsides by 48 hours, we find the difference of intensity at 48 hours and 3 hours. The association rule comes out to be:

TF1 ∩ TF4 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF11 ∩ TF17 ∩ TF18 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15 ∩ TF16 ∩ TF19 ∩ TF20 ∩ TF21 ∩ TF22 ∩ TF23 ∩ TF24 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF36 ∩ TF37 ∩ TF38 ∩ TF42 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF48 ∩ TF50 ∩ TF52 ∩ TF53 ∩ TF54 ∩ TF55 ∩ TF56 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF64 ∩ TF67 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF72 ∩ TF73 ∩ TF75 ∩ TF77 ∩ TF78 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) (A)

Note: TFs marked in red are the well known activators such as AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04), HNF3 (TF41 - HNF3ALPHA_Q6) and SP1 (TF67 - SP1_01).
b) Behavior: Expression of genes increasing (Category: F)

Considering that the expression level is at its minimum at 3 hours and increases by 48 hours, we get the following association rule:

TF3 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15 ∩ TF16 ∩ TF17 ∩ TF18 ∩ TF19 ∩ TF20 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF36 ∩ TF37 ∩ TF38 ∩ TF39 ∩ TF41 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF50 ∩ TF51 ∩ TF52 ∩ TF53 ∩ TF54 ∩ TF55 ∩ TF57 ∩ TF58 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF64 ∩ TF66 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF72 ∩ TF73 ∩ TF74 ∩ TF75 ∩ TF77 ∩ TF78 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26) (F)

Note: TFs marked in red are the well known activators such as AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04) and HNF3 (TF41 - HNF3ALPHA_Q6).

c) Behavior: Expression of genes increasing (Category: E)

As discussed in section 4.3.1, the classification of genes into different classes is done on the basis of the difference between the expression/intensity level values at 48hr and at 3hr. If difference >= 5, then for this class of genes the intensity is increasing, with label E. Therefore the target gene is up-expressed. The association rule for that target gene is written as: TF1 & TF2 => Target gene set (E). Here, "(E)" denotes that the target gene is up and increasing.
Considering that the expression level is at its minimum at 3 hours and increases by 48 hours, we get the following association rule:

TF6 ∩ TF7 ∩ TF9 ∩ TF14 ∩ TF17 ∩ TF18 ∩ TF20 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF26 ∩ TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF37 ∩ TF41 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF52 ∩ TF54 ∩ TF57 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF66 ∩ TF71 ∩ TF72 ∩ TF74 => (AREG, CALCR, CXCL12, DEPDC6, HCK, HEY2, MYB, RAB31, RAMP3, SLC39A8, GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1, SYTL4) (E)

Note: TFs marked in red are well known activators such as AP1 (TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04) and HNF3 (TF41 - HNF3ALPHA_Q6).

6.4 Experiment 4

In experiment 3, the number of transcription factors in the association rules is high. To obtain more useful and precise results, we want to reduce the number of transcription factors appearing in each rule. Expecting to obtain fewer transcription factors in the rules, our aim in experiment 4 is to increase the support further and analyze the results. Therefore, we increased α to 100.

Results of experiment 4 with α = 100:

a) Behavior: Expression of genes decreasing (Category: A)

Considering the possibility that the expression level is at its peak at 3 hours and subsides by 48 hours, we find the difference of intensity at 48 hours and 3 hours.
The association rule comes out to be:

TF4 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15 ∩ TF16 ∩ TF20 ∩ TF21 ∩ TF22 ∩ TF23 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF34 ∩ TF35 ∩ TF36 ∩ TF38 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF48 ∩ TF50 ∩ TF52 ∩ TF55 ∩ TF56 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF73 ∩ TF77 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) (A)

b) Behavior: Expression of genes increasing (Category: F)

Considering the possibility that the expression level is low at 3 hours and peaks by 48 hours, we find the difference of intensity at 48 hours and 3 hours.

TF3 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF9 ∩ TF13 ∩ TF16 ∩ TF18 ∩ TF19 ∩ TF20 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF29 ∩ TF31 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF36 ∩ TF37 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF50 ∩ TF51 ∩ TF52 ∩ TF54 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF64 ∩ TF66 ∩ TF70 ∩ TF71 ∩ TF72 ∩ TF74 ∩ TF78 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26) (F)

c) Behavior: Expression of genes increasing (Category: E)

Considering the possibility that the expression level is low at 3 hours and peaks by 48 hours, we find the difference of intensity at 48 hours and 3 hours.

TF7 ∩ TF9 ∩ TF18 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF31 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF43 ∩ TF45 ∩ TF46 ∩ TF52 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF72 ∩ TF74 => (AREG, CALCR, CXCL12, DEPDC6, HCK, HEY2, MYB, RAB31, RAMP3, SLC39A8, GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1, SYTL4) (E)

6.5 Result Analysis

Expected results: Well known activators like AP1, GATA, SP1, HNF3 and FREAC4 should show up in the rules.

Actual results: In the association rules found in experiments 3 and 4, it can be seen that well known activators from the list AP1, GATA, SP1, HNF3 and FREAC4 appear. In experiment 2, 60% of the well known activators are present in the rules. The association rules obtained in experiment 3 contain all (100%) of the well known activators.
In experiment 4, again 60% of the well known activators appear in the rules.

Experiment | AP1 | GATA | SP1 | HNF3 | FREAC4
Experiment 2 | √ | √ | X | √ | X
Experiment 3 | √ | √ | √ | √ | √
Experiment 4 | √ | √ | X | X | √

Figure 22: Result Analysis

Chapter 7

CONCLUSION

7.1 Summary

The goal of this project was to:

1) Develop a method of data preprocessing to make the data feasible for use by a data mining algorithm.

2) Look for association rules between a group of transcription factors (a module) and target gene behavior from a single time-series breast cancer profile data set.

Towards this end, a number of experiments were performed to achieve results with better accuracy. Experiments 2, 3 and 4 are improved compared to experiment 1, based on the genome centre's feedback, because they identify association rules between a subset of TFs and a subset of genes. Experiments 3 and 4 both use p-value < 0.005 (confidence = 99.5%). Experiment 3 has support = 90% whereas experiment 4 has support = 100%. The well known activators AP1, GATA, FREAC4, HNF3 and SP1 are all found in experiment 3, whereas AP1, GATA and FREAC4 are found in experiment 4.

7.2 Learning Experience

Weka is a useful data mining tool. MS-Excel is also a useful tool for data preprocessing, and its data filtering capability can be very useful for this purpose. It was good to be able to apply the concepts learnt in statistics to solve a real world problem. Report writing is also a very important learning process, and hence one should begin writing from the start of the project.

7.3 Future Work

More work can be done on this project to enhance the results obtained.

1) Currently the p-value is the most important attribute, and the data is classified and analyzed on its basis. However, it would be valuable to incorporate the position value in the analysis and examine the results obtained.
2) Expression levels are available at different points of time: 3 hrs, 6 hrs, 12 hrs, 24 hrs and 48 hrs. We would like to find out which sets of transcription factors are associated with which sets of genes by their peaking intensity level. There are many possibilities. One is that the gene expression peaks at 3 hrs and the intensity reduces by 48 hrs. Another is that the gene expression is low at 3 hrs and peaks at 48 hrs. In experiments 2, 3 and 4 we consider these possibilities. However, there are other possibilities, such as the expression level being low at 3 hrs and peaking at 24 hrs. It could also be true that the breast cancer genes peak at 3 hrs and 48 hrs and have their minimum expression level at 24 hrs. We leave this for future work.

BIBLIOGRAPHY

[1] Meiliu Lu, CSC 209 Class Notes, California State University, Sacramento, Fall 2009.

[2] S. Fu and C. Tobin, An Introduction to DNA and Chromosomes, HOPES: Huntington's Disease Outreach Project for Education at Stanford, 25 April 2006. http://hopes.stanford.edu/n3401/hd-genetics/introduction-dna-and-chromosomes-text-and-audio

[3] Wikipedia contributors, Gene, Wikipedia, The Free Encyclopedia, 10 May 2010. http://en.wikipedia.org/wiki/Gene

[4] Wikipedia contributors, Transcription factor, Wikipedia, The Free Encyclopedia, 15 May 2010. http://en.wikipedia.org/wiki/Transcription_factor

[5] Wikipedia contributors, Enhancer (genetics), Wikipedia, The Free Encyclopedia, 29 April 2010. http://en.wikipedia.org/wiki/Enhancer_(genetics)

[6] Wikipedia contributors, Gene expression, Wikipedia, The Free Encyclopedia, 12 April 2010. http://en.wikipedia.org/wiki/Gene_expression

[7] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
[8] Ziv Bar-Joseph, Georg K. Gerber, Tong Ihn Lee, Nicola J. Rinaldi, Jane Y. Yoo, François Robert, D. Benjamin Gordon, Ernest Fraenkel, Tommi S. Jaakkola, Richard A. Young and David K. Gifford, Computational discovery of gene modules and regulatory networks, Nature Biotechnology, November 2003.

[9] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander and Jill P. Mesirov, Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences of the United States of America, 25 October 2005.

[10] Karen Lemmens, Thomas Dhollander, Tijl De Bie, Pieter Monsieurs, Kristof Engelen, Bart Smets, Joris Winderickx, Bart De Moor and Kathleen Marchal, Inferring transcriptional modules from ChIP-chip, motif and microarray data, Genome Biology, 5 May 2006.

[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann and Ian H. Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11, Issue 1, 2009.

[12] Wikipedia contributors, Association rule learning, Wikipedia, The Free Encyclopedia, 14 October 2010. http://en.wikipedia.org/wiki/Association_rule_learning