A DATA MINING EXPERIMENT FOR INFERRING TRANSCRIPTIONAL MODULE FROM
BREAST CANCER PROFILE DATA
Rakhi Malpani
BS, Pune University, India 2004
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2010
© 2010
Rakhi Malpani
ALL RIGHTS RESERVED
A DATA MINING EXPERIMENT FOR INFERRING TRANSCRIPTIONAL MODULE
FROM BREAST CANCER PROFILE DATA
A Project
by
Rakhi Malpani
Approved by:
__________________________________, Committee Chair
Meiliu Lu, PhD
__________________________________, Second Reader
Du Zhang, PhD
____________________________
Date
Student: Rakhi Malpani
I certify that this student has met the requirements for format contained in the University format
manual, and that this project is suitable for shelving in the Library and credit to be awarded for
the project.
__________________________, Graduate Coordinator ___________________
Nikrouz Faroughi, PhD
Date
Department of Computer Science
Abstract
of
A DATA MINING EXPERIMENT FOR INFERRING TRANSCRIPTIONAL MODULE FROM
BREAST CANCER PROFILE DATA
by
Rakhi Malpani
Breast cancer is a cancer that starts in the tissues of the breast. Over the course of a
lifetime, 1 in 8 women will be diagnosed with breast cancer. Existing therapies for
cancer include chemotherapy, radiation therapy, and targeted therapy. Targeted therapy
aims at specific processes of cancer cell growth, division, and life cycle. Existing targeted
therapies for breast cancer include Avastin, Herceptin, Iressa, and Tykerb. Each of these drugs
has specific effects on cancer cells, and there is a need for new drugs. The motivation of this
project is to produce results that are useful for new drug design.
Transcriptional modules (TM) consist of groups of co-regulated genes and transcription
factors (TF) regulating their expression. Transcription factors are one of the groups of proteins
that read and interpret the genetic "blueprint" in the DNA. They bind DNA and help initiate a
program of increased or decreased gene transcription. As such, they are vital for many important
cellular processes. Currently available breast cancer profile data is large and may be noisy and
incomplete. It is not in a format suitable as input to a data mining algorithm.
We therefore propose a method to transform the original data into a
format that can be used as input to a data mining algorithm. The project first focuses on
developing a data preprocessing method that makes the data usable by a data mining
algorithm. The second step is to derive association rules between a set of
transcription factors and a set of genes using an association rule mining method.
The project procedure is as follows:
(1) Conduct analysis of the breast cancer profile data such as analyzing the most important
attributes of the profile data and removing the redundant attributes from the profile data.
(2) Perform a survey of data mining algorithms and identify a suitable algorithm for
association rule mining.
(3) Develop a method of data preprocessing so that we can rank and sort the data and make it
feasible as an input to the chosen association rule mining algorithm. Data preprocessing
techniques such as data cleaning, data integration, data transformation and reduction were
used.
(4) Perform an experiment and actually apply the association rule mining algorithm on this
transformed data and perform analysis of the results. Association rule mining is
performed using a two stage approach:
Stage 1: Generating association rules using association rule mining method in WEKA
Stage 2: Generating association rules using data filtered on the basis of p-value and
support.
_____________________, Committee Chair
Meiliu Lu, PhD
______________________ Date
ACKNOWLEDGMENTS
I would like to thank Prof. Meiliu Lu, my project advisor, for all the knowledge and
guidance she has provided me throughout this project.
I would also like to thank Prof. Du Zhang for giving me his time and input as second
reader.
I would also like to thank my husband, Ashwin, who has given me the time and support
to finish this project. I couldn’t have done it without him.
I would also like to thank my daughter Nishka for giving me the time to finish this
project.
TABLE OF CONTENTS
Acknowledgments
List of Tables
List of Figures
Chapter
1. INTRODUCTION
2. BACKGROUND
    2.1 Deoxyribonucleic Acid (DNA)
    2.2 Gene
    2.3 Transcription Factors
    2.4 Enhancer
    2.5 Gene Expression
3. PROBLEM ANALYSIS
    3.1 Analysis of Input Data files
    3.2 Experimental Design Approach
4. DATA PREPROCESSING
    4.1 Data Cleaning
        4.1.1 Cleaning for Data file I – TFvalues_Enhancer.xls
        4.1.2 Cleaning for Data file II – Enhancer_expression_profile.xls
    4.2 Data Reduction
    4.3 Data Transformation
        4.3.1 Classification
        4.3.2 P-value Threshold
        4.3.3 Splitting the File
    4.4 Data Integration
5. ASSOCIATION RULE MINING
    5.1 Weka
    5.2 Association Rule Mining Method
    5.3 Using Weka
        5.3.1 Loading the Data
        5.3.2 Association Rule Mining Using Weka
    5.4 Result Interpretation
    5.5 P-value based Filtering Method
6. EXPERIMENTS
    6.1 Experiment 1
    6.2 Experiment 2
    6.3 Experiment 3
    6.4 Experiment 4
    6.5 Result Analysis
7. CONCLUSION
    7.1 Summary
    7.2 Learning Experience
    7.3 Future Work
Bibliography
LIST OF TABLES
1. Data file I – ‘TFvalues_Enhancer.xls’ description
2. Data file II – ‘Enhancer_expression_profile.xls’ description
3. Classification Table
4. P-value Table
5. File Table
6. TF Name and Number Mapping Table
7. Association Rule Table 1
8. Association Rule Table 2
LIST OF FIGURES
1. DNA, Gene, Transcription Factors
2. ‘TFvalues_Enhancer.xls’ format
3. ‘Enhancer_expression_profile.xls’ format
4. Candidate Enhancer Information of a Single Gene
5. TF to Gene Relation
6. Experimental Design Procedure
7. Data Preprocessing Flowchart
8. Missing Value Handling
9. Data Reduction
10. Data Classification
11. P-value Classification
12. Data Integration
13. Association Rule Mining
14. Loading Data
15. GREB1.csv File Format
16. Statistics Window
17. Parameter Editor Window
18. Associator Output Window
19. Filtering using P-value Threshold for a Set of Genes which are Up Expressed
20. The Checked Boxes correspond to TFs which are Significant
21. Two-Stage Association Rule Mining Approach
22. Result Analysis
Chapter 1
INTRODUCTION
Over the last few years, computational biology research has contributed significantly to
the advancement of molecular biology. High-throughput genome sequencing has provided us with
the complete genomes of several species, from microbes to human beings. This
rapid growth of biological data opens a whole range of exciting possibilities for, and necessitates
the development of, data mining methods tailored towards understanding the complex mechanisms of
biological systems. Bioinformatics has gone from providing support, in terms of data
management, visualization, and the like, to generating new insights and directing future
experiments. One key topic in molecular biology is understanding the regulatory processes and
mechanisms of gene expression. With the availability of breast cancer genetic profile data, we
analyze these mechanisms in our project.
Breast cancer is cancer originating from breast tissue. Prognosis and survival rate vary
greatly depending on cancer type and stage. The lifetime risk of breast cancer in the United
States is usually given as 1 in 8 (12.5%). Can we do anything about it? Treatment depends
on many factors, such as the type and stage of the cancer, its sensitivity to hormones, and the
overexpression of certain genes. Common treatments are radiotherapy, chemotherapy, hormonal
therapy, surgery, and targeted therapy. Targeted therapy, the newest of these, uses special
anti-cancer drugs that identify certain changes in a cell that can lead to cancer. It aims at
specific processes of cancer cell growth, division, and life cycle. Existing targeted therapies for
breast cancer include Avastin, Herceptin, Iressa, and Tykerb[1]. Among the recent advancements in
medical science is the study of gene expression profiles using DNA microarrays. Mining
microarray-based gene profile data is promising for predicting treatment outcome or drug response.
Fortunately, advances in molecular genetics technologies make it possible to obtain a global view
of the cell, for example by measuring the simultaneous expression of tens of thousands of genes.
Using microarray technology, the expression profile data of genes is extracted[2]. Usually this
profile data is huge, noisy, and incomplete, and it is not in a format suitable as input to a data
mining algorithm. Hence, in this project, we propose a method to transform the original data into a
format that can be used as input to a data mining algorithm that can recognize association
rules.
The objective of the project is twofold. The first is to develop a method to transform the original
data into a format that can be used as input to the data mining algorithm. The second is to
look for association rules between a group of transcription factors (a module) and target gene
behavior in a single time-series breast cancer profile data set. The following is an example of
our desired results, expressed as an association rule: (TF1, TF2, TF3) → a target
gene set ↑, where the trailing arrow gives the direction of the expression change (↑ for up, ↓ for down).
DNA, genes, transcription factors, enhancers, and gene
expression are described in detail in Chapter 2. Chapter 3 describes the data files and formats
used in our analysis and presents the experimental design approach. Chapter 4
discusses the data preprocessing steps performed before the actual data mining. Chapter 5
describes the two-stage association rule mining approach. Chapter 6 describes all experiments,
their aims, methodology, and results. These experiments are carried out to derive association rules
between a set of transcription factors and a set of genes. Chapter 7 summarizes the results of the
experiments, the learning experience, and the plan for future work.
Chapter 2
BACKGROUND
The molecular basis for genes is deoxyribonucleic acid (DNA). DNA is composed of a
chain of nucleotides, of which there are four types: adenine (A), cytosine (C), guanine (G), and
thymine (T). Genetic information exists in the sequence of these nucleotides. In this chapter, we
will introduce important definitions such as DNA, genes, transcription factors, enhancers, gene
expression and their significance.
2.1 Deoxyribonucleic Acid (DNA)
DNA is a nucleic acid that contains the genetic instructions used in the development and
functioning of all known living organisms and some viruses. The main role of DNA molecules is
the long-term storage of information. DNA is often compared to a set of blueprints or a recipe, or
a code, since it contains the instructions needed to construct other components of cells, such as
proteins and RNA molecules. All organisms consist of cells. Eukaryotic cells have a nucleus, and
inside the nucleus there is DNA, which encodes the “program” for making future organisms. DNA has
coding and non-coding segments; the coding segments, called “genes”, specify the structure of
proteins. For instance, hemoglobin is a protein, and a gene specifies its structure. The
genes carry this genetic information, while other DNA sequences have structural purposes or are
involved in regulating the use of this genetic information[2].
2.2 Gene
A gene is a unit of heredity in a living organism. It normally resides on a stretch of DNA
that codes for a type of protein or for an RNA chain that has a function in the organism. All living
things depend on genes, as they specify all proteins and functional RNA chains. Genes hold the
information to build and maintain an organism's cells and pass genetic traits to offspring[3].
2.3 Transcription Factor
A transcription factor (sometimes called a sequence-specific DNA binding factor) is a
protein that binds to specific DNA sequences and thereby controls the transfer (or transcription)
of genetic information from DNA to mRNA. Transcription factors perform this function alone or
with other proteins in a complex, by promoting (as an activator), or blocking (as a repressor) the
recruitment of RNA polymerase (the enzyme that performs the transcription of genetic
information from DNA to RNA) to specific genes. Transcription factors are essential for the
regulation of gene expression and are consequently found in all living organisms. The number of
transcription factors found within an organism increases with genome size, and larger genomes
tend to have more transcription factors per gene. There are approximately 2600 proteins in the
human genome that contain DNA-binding domains, and most of these are presumed to function
as transcription factors. Therefore, approximately 10% of genes in the genome code for
transcription factors, which makes this family the single largest family of human proteins.
Furthermore, genes are often flanked by several binding sites for distinct transcription factors,
and efficient expression of each of these genes requires the cooperative action of several different
transcription factors (see, for example, hepatocyte nuclear factors). Hence, the combinatorial use
of a subset of the approximately 2000 human transcription factors easily accounts for the unique
regulation of each gene in the human genome during development[4].
2.4 Enhancer
An enhancer is a short region of DNA that can bind proteins called activators. Binding of
activators to this enhancer region can initiate the transcription of a gene that may be some
distance away from the enhancer, or can even be on a different chromosome. The increase in
transcription is due to the activators recruiting transcription factors, which enhances the binding
of RNA polymerase[5].
2.5 Gene Expression
Gene expression is the process by which information from a gene is used in the synthesis
of a functional gene product. These products are often proteins, but in non-protein coding genes
such as rRNA genes or tRNA genes, the product is a functional RNA. The process of gene
expression is used by all known life - eukaryotes (including multicellular organisms), prokaryotes
(bacteria and archaea) and viruses - to generate the macromolecular machinery for life. Several
steps in the gene expression process may be modulated, including the transcription, RNA
splicing, translation, and post-translational modification of a protein[6]. Gene regulation gives the
cell control over structure and function, and is the basis for cellular differentiation,
morphogenesis and the versatility and adaptability of any organism. Gene regulation may also
serve as a substrate for evolutionary change, since control of the timing, location, and amount of
gene expression can have a profound effect on the functions (actions) of the gene in a cell or in a
multicellular organism. Measuring gene expression is an important part of many life sciences:
the ability to quantify the level at which a particular gene is expressed within a cell, tissue, or
organism can give a huge amount of information. For example, measuring gene expression can:
Identify viral infection of a cell (viral protein expression)
Determine an individual's susceptibility to cancer (oncogene expression)
Find if a bacterium is resistant to penicillin (beta-lactamase expression)
Figure 1: DNA, Gene, Transcription Factors
In chapter 3, we will see the detailed description of data files and formats used in our
analysis. We will also look at the experimental design approach.
Chapter 3
PROBLEM ANALYSIS
It is desirable to go over the input files and formats in our problem analysis before
preprocessing the data. Further, we will look at the experimental design approach wherein we use
these data files as input.
3.1 Analysis of Input Data files
There are two input files. The first file is ‘TFvalues_Enhancer.xls’, which describes
candidate enhancers and the transcription factor attribute values associated with these enhancers.
This file can be regarded as a superset of the second file. The second file is
‘Enhancer_expression_profile.xls’, which describes the gene expression profile data of a breast
cancer patient.
a) Data file I – ‘TFvalues_Enhancer.xls’
This file describes the details of each candidate enhancer for 80 selected transcription factors. It
describes the location of the TF binding site relative to the peak location of the candidate enhancer.
The file size is 60.5 MB. The file has 41112 rows (~41K) and 241 columns. Each row is a candidate
enhancer, so the total number of candidate enhancers is 41112. There are roughly 2500
transcription factors, of which the 80 most important are considered in our analysis and are
associated with every candidate enhancer. Transcription factors affect the expression of
enhancers, and enhancers affect the expression of genes. Each transcription factor has three
attributes associated with it: position, p-value, and PWM score.
Position: The location of the binding site relative to the peak location.
P-value: In statistics, a result is called statistically significant if it is unlikely to have occurred by
chance. The amount of evidence required to accept that an event is unlikely to have arisen by
chance is known as the significance level or critical p-value. For instance, the statement “there is
only one chance in a thousand that a TF is significant by coincidence” implies a 0.001 level of
statistical significance, or p-value. The most important attribute in our analysis is the p-value.
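As a quick consistency check on the dimensions quoted for data file I: if one column holds the candidate enhancer identifier (an assumption on our part) and each of the 80 transcription factors contributes its three attributes, the column count works out as

\[
1 + 80 \times 3 = 241,
\]

which agrees with the 241 columns reported for the file.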
Table 1: Data file I – ‘TFvalues_Enhancer.xls’ description
Figure 2: TFvalues_Enhancer.xls format
b) Data file II – ‘Enhancer_expression_profile.xls’
This file describes the enhancer expression time series profile for a particular breast
cancer patient. It contains a subset of rows of data file I. The file size is 6.67 MB. The number of
rows is 21997 (~22K). The number of columns is 19. The most important attributes are the enhancer
candidate id, the gene name (symbol), and the expression levels of genes over a period of 48 hours
(MCF7_3hr, MCF7_6hr, MCF7_9hr, MCF7_12hr, MCF7_24hr, MCF7_48hr). Genes are
expressed under the influence of enhancers. For instance, hemoglobin is a protein encoded by a
gene. The gene that encodes hemoglobin has to be expressed first so that hemoglobin is produced.
Table 2: Data file II – ‘Enhancer_expression_profile.xls’ description
Column          Attribute
A               Name of the enhancer
B               Chromosome
C               Peak location (enhancer location)
D               Intensity
E               Don’t use
F               Location of enhancer (Note: C is the actual location. F is the annotation. It
                indicates if the position is intron, 100k upstream, exon, desert (no gene),
                100k downstream.)
G, H, I         Gene name
J               Strand (Note: + means the gene is found in the +ve strand while – means the
                gene is found in the –ve strand.)
K               Transcription start site (TSS)
L               Transcription end site
M               Distance between TSS and the peak location
N, O, P, Q, S   The microarray expression level
Note: Columns A – F are for the enhancer candidate; columns G – S are for the breast gene.
Figure 3: Enhancer_expression_profile.xls format
Figure 4 below shows the candidate enhancer information associated with a single gene,
“GREB1”. As seen in the figure, one gene may have multiple candidate enhancers. However, the
expression levels associated with a gene are the same for all of its candidate enhancers.
Figure 4: Candidate Enhancer Information of a Single Gene
As seen in figures 2 and 3, the candidate enhancer attribute is present in both of the above data
files. Data file I, TFvalues_Enhancer.xls, describes the details of each candidate enhancer for 80
selected transcription factors. Data file II, Enhancer_expression_profile.xls, describes the
enhancer expression time series profile data. Because transcription factors affect the expression
of enhancers and enhancers affect the expression of genes, transcription factors affect
the expression of genes. Thus, to find associations between a set of transcription factors and a
set of genes, we combine the two files with the candidate enhancer as the pivot, as shown in
figure 5.
Figure 5: TF to Gene Relation
3.2 Experimental Design Approach
In this section, we give an overview of the steps followed in our approach to achieve our
objective: to preprocess the data so that it can be used as input to a data mining
algorithm, and to generate the required association rules.
Figure 6: Experimental Design Procedure
The step-by-step process shown in figure 6 is described below:
(1) The first step in the experimental design procedure is to study and analyze the given data
files described in section 3.1.
(2) The next step is to preprocess this data, as discussed in chapter 4. The steps in data
preprocessing include:
(2.1) data reduction – removing redundant attributes
(2.2) data integration – integrating the two files to get a unified view
(2.3) data transformation – transforming the data into a suitable format.
(3) The next step is association rule mining using a two-stage approach, as discussed in chapter 5.
We also perform data mining on data filtered by p-value and support, as
explained in sections 6.2, 6.3, and 6.4.
(4) Once the association rules are generated, we compare the results with experimental
evidence and analyze them, as discussed in section 6.5.
In chapter 4, we go over the detailed description of data preprocessing steps performed
before actual data mining.
Chapter 4
DATA PREPROCESSING
Today’s real world databases are highly susceptible to noisy, missing, and inconsistent
data due to their typically huge size and their likely origin from multiple heterogeneous sources.
Low-quality data will lead to low-quality mining results. Analyzing data that has not been
carefully screened can produce misleading results. Thus, the representation and quality of the data
come first and foremost before running an analysis. Data preprocessing techniques, when applied
before mining, can substantially improve the overall quality of the patterns mined and reduce the
time required for the actual mining [7].
As discussed earlier, we have two data files: TFvalues_Enhancer.xls and
Enhancer_expression_profile.xls. The data files have missing attribute values in some places,
and there are also a few redundant attributes. The data needs to be integrated from the two data
files to get a unified view and to perform further analysis. Hence there is an initial
need to preprocess this data to make it suitable as input to a data mining algorithm. There are a
number of data preprocessing techniques. Data cleaning can be applied to remove noise and
correct inconsistencies in the data. Data integration merges data from multiple sources into a
coherent data store. Data transformations such as normalization may be applied [7]. We have
performed the major tasks of data cleaning, data integration, data transformation, data reduction,
and data discretization in our preprocessing.
Figure 7 shows the data preprocessing steps in the form of a flowchart.
Figure 7: Data Preprocessing Flowchart
4.1 Data Cleaning
Data cleaning involves handling missing values and identifying and removing outliers. The
details of data cleaning for the two data files are described below.
4.1.1 Cleaning of Data file I – TFvalues_Enhancer.xls
The first column of the TFvalues_Enhancer.xls file has a comprehensive list of candidate
enhancers. Each candidate enhancer has the details of 80 transcription factors (TF) associated
with it. Of the three attributes of every TF, the p-value attribute is the most important. The
remaining two attributes, position (less important) and PWM score (redundant), are not used in
our analysis. Missing p-values are filled with the average of the remaining values in that column.
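The column-wise mean imputation described above can be sketched as follows. This is a minimal illustration in Python rather than the procedure actually used in the project; the function name, the column name TF1_pvalue, and the sample values are hypothetical.

```python
from statistics import mean


def fill_missing_with_column_mean(rows, column):
    """Return copies of `rows` where None in `column` is replaced by the column mean."""
    observed = [row[column] for row in rows if row[column] is not None]
    if not observed:
        # Nothing to average over, so the column is left untouched.
        return [dict(row) for row in rows]
    column_mean = mean(observed)
    return [
        {**row, column: column_mean if row[column] is None else row[column]}
        for row in rows
    ]


# Hypothetical p-values of one TF across four candidate enhancers.
enhancers = [
    {"enhancer": "E1", "TF1_pvalue": 0.001},
    {"enhancer": "E2", "TF1_pvalue": None},  # missing value
    {"enhancer": "E3", "TF1_pvalue": 0.004},
    {"enhancer": "E4", "TF1_pvalue": 0.010},
]
cleaned = fill_missing_with_column_mean(enhancers, "TF1_pvalue")
```

Only the missing entry is changed; observed p-values are carried over as they are.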
4.1.2 Cleaning of Data file II – Enhancer_expression_profile.xls
The Enhancer_expression_profile.xls file has the expression time series profile data of a breast
cancer patient. The first column is the candidate enhancer, and the subsequent columns show the
expression level at 3, 6, 9, 12, 24, and 48 hours.
Figure 8: Missing Value Handling
The missing expression values at these time stamps are filled with the average of the remaining
expression level values, as seen in figure 8. The remaining attributes, such as chromosome,
peak_location, mcf7e2, mcf7et, location, geneid, symbol, refseq, strand, TSS, TES, and TSSdistance,
are not included in this analysis.
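The handling of missing expression values can be sketched as follows. The sketch assumes that a missing time point is filled with the mean of the other time points of the same row, which is one possible reading of the description above. The time-point column names are those of data file II; the function name and the sample values are hypothetical.

```python
TIME_POINTS = [
    "MCF7_3hr", "MCF7_6hr", "MCF7_9hr",
    "MCF7_12hr", "MCF7_24hr", "MCF7_48hr",
]


def fill_missing_expression(row):
    """Return a copy of `row` with missing time points set to the mean of the observed ones."""
    observed = [row[t] for t in TIME_POINTS if row.get(t) is not None]
    filled = dict(row)
    if not observed:
        return filled  # no observed value to average over
    row_mean = sum(observed) / len(observed)
    for t in TIME_POINTS:
        if filled.get(t) is None:
            filled[t] = row_mean
    return filled


# Hypothetical expression profile with the 9-hour measurement missing.
profile = {
    "enhancer": "E1",
    "MCF7_3hr": 1.0, "MCF7_6hr": 2.0, "MCF7_9hr": None,
    "MCF7_12hr": 3.0, "MCF7_24hr": 4.0, "MCF7_48hr": 5.0,
}
filled = fill_missing_expression(profile)
```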
4.2 Data Reduction
A few attributes in the data files are irrelevant to the mining task. Hence, in this step,
we use the attribute subset selection technique to reduce the data set size by removing the
redundant attributes from the files. The reduction is described in detail as follows:
Data file I – TFvalues_Enhancer.xls:
For data file I, TFvalues_Enhancer.xls, each transcription factor has three attributes:
p-value, PWM score, and position value. Of these three attributes, we remove the PWM score and
position value attributes, as they are not required for the mining task.
Data file II – Enhancer_expression_profile.xls:
The attributes from the Enhancer_expression_profile.xls file, such as chromosome,
peak_location, mcf7e2, mcf7et, location, geneid, refseq, strand, TSS, TES, and TSSDistance, as
shown in figure 9, are not required for the mining task and were removed.
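Attribute subset selection for data file II can be sketched as below. The set of removed attribute names is taken from the list above; the function name, the key "enhancer", and the sample record are hypothetical and serve only to illustrate the step.

```python
# Attributes of data file II that are not required for the mining task.
REDUNDANT = {
    "chromosome", "peak_location", "mcf7e2", "mcf7et", "location",
    "geneid", "refseq", "strand", "TSS", "TES", "TSSDistance",
}


def drop_attributes(row, redundant=REDUNDANT):
    """Return a copy of `row` without the redundant attributes."""
    return {name: value for name, value in row.items() if name not in redundant}


# Hypothetical record with only a few of the original columns.
record = {
    "enhancer": "E1",
    "chromosome": "chr2",
    "symbol": "GREB1",
    "strand": "+",
    "MCF7_3hr": 0.2,
    "MCF7_48hr": 1.6,
}
reduced = drop_attributes(record)
```

The enhancer identifier, the gene symbol, and the expression levels survive the reduction, which is what the later integration and classification steps rely on.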
Figure 9: Data Reduction
4.3 Data Transformation
In this section, we discuss the method of classifying the genes, the filtering method based on a
p-value threshold, and the method of splitting data file I, TFvalues_Enhancer.xls, on the
basis of our classification.
4.3.1 Classification
The classification of genes into different classes is based on the difference
between the expression/intensity level values at 48hr and at 3hr:
Difference = MCF7_48hr – MCF7_3hr
If Difference > 0, the target gene is up expressed, denoted as “↑”
If Difference < 0, the target gene is down expressed, denoted as “↓”
This can be further classified according to the degree of change, as shown in Table 3. For
instance, as shown in figure 10, if Difference >= 10, then for this class of genes the
intensity is increasing, with label F. Therefore the target gene is up expressed. The association
rule for that target gene will be defined as: TF1 & TF2 → Target gene set ↑ F
Here, ‘↑ F’ denotes that the target gene is up and increasing. Table 3 defines the list of
labels for the different degrees of change.
Figure 10: Data Classification
Table 3: Classification Table
Difference          Target gene expression   Class        Label
Difference <= -1    DOWN                     Decreasing   A
Difference < -0.5   DOWN                     Decreasing   B
Difference = 0      NO CHANGE                No change    C
Difference >= 1     UP                       Increasing   D
Difference >= 5     UP                       Increasing   E
Difference >= 10    UP                       Increasing   F
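The classification can be sketched in Python as follows. The thresholds are those of Table 3. Because the ranges in the table overlap (for example, a difference of 11 satisfies >= 1, >= 5, and >= 10), this sketch assigns the most extreme matching label, and it returns None for differences that fall between the listed ranges; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
def class_label(difference):
    """Map Difference = MCF7_48hr - MCF7_3hr to a label from Table 3.

    Assumption: when several ranges match, the most extreme one wins.
    Differences not covered by any range in the table yield None.
    """
    if difference >= 10:
        return "F"
    if difference >= 5:
        return "E"
    if difference >= 1:
        return "D"
    if difference == 0:
        return "C"
    if difference <= -1:
        return "A"
    if difference < -0.5:
        return "B"
    return None


# Hypothetical expression levels at 48hr and 3hr for one target gene.
expr_48hr, expr_3hr = 12.5, 1.5
label = class_label(expr_48hr - expr_3hr)
```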
4.3.2 P-value Threshold
There are roughly 2500 transcription factors (TFs), and only 80 significant TFs are
considered in our analysis. With 80 TFs there are still, mathematically, 2^80 combinations that
could trigger any gene in any fashion. We need to cut this down, because 2^80 is ~1.2*10^24
combinations, which is astronomically high. In statistics, a result is called statistically significant
if it is unlikely to have occurred by chance. The amount of evidence required to accept that an
event is unlikely to have arisen by chance is known as the significance level or critical p-value.
For instance, the statement “there's only one chance in a thousand that a TF is significant by
coincidence” implies a 0.001 level of statistical significance, or p-value. We have chosen a p-value
threshold of 0.002 for experiment-2 and 0.005 for experiment-3 (significance levels of 0.2% and
0.5%, respectively). We thereby reduce the search space by selecting only those TFs whose
p-value < 0.002 or 0.005 [8]; a transcription factor whose p-value is below the threshold is assumed
significant for experiment-2 and experiment-3, respectively. Hence the p-value is transformed to
the labels “T” and “F”, as shown in the following table:
Table 4: P-value Table
P-value                   Label
P-value < 0.002/0.005     T (stands for true)
P-value >= 0.002/0.005    F (stands for false)
Figure 11: P-value Classification
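The transformation of p-values into the labels of Table 4 can be sketched as follows. The thresholds 0.002 and 0.005 are those chosen for experiment-2 and experiment-3; the function name and the sample p-values are hypothetical.

```python
def pvalue_labels(pvalues, threshold):
    """Label each TF 'T' if its p-value is below the threshold, else 'F'."""
    return {tf: ("T" if p < threshold else "F") for tf, p in pvalues.items()}


# Hypothetical p-values of three TFs for one candidate enhancer.
row = {"TF1": 0.0004, "TF2": 0.003, "TF3": 0.2}

labels_exp2 = pvalue_labels(row, threshold=0.002)  # threshold of experiment-2
labels_exp3 = pvalue_labels(row, threshold=0.005)  # threshold of experiment-3
```

With these sample values, TF2 is labeled F under the stricter threshold and T under the looser one, which shows how the choice of threshold changes the set of TFs treated as significant.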
4.3.3 Splitting the File
Based on the classification done in the step above, data file I is split into six chunks
(sub-files), as shown in Table 5.
Table 5: File Table
No.  File Name                         Intensity range   Size (KB)
1    greater_than_or_equal_to_10.xls   {10, 27}          29
2    greater_than_or_equal_to_5.xls    {5, 27}           42
3    greater_than_or_equal_to_1.xls    {1, 27}           265
4    equal_to_0.xls                    {0}               1839
5    less_than_-0.5.xls                {-5, -0.5}        434
6    less_than_-1.xls                  {-5, -1}          37
The column labels used in the table are described below:
Intensity range -- the interval in which the value “MCF7(48hr) – MCF7(3hr)” lies. For instance,
if the value of “MCF7(48hr) – MCF7(3hr)” is 11, it lies in the interval {10, 27} and
can therefore be found in the file “greater_than_or_equal_to_10.xls”.
Size -- the size of the sub-file in KB.
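The splitting step can be sketched as follows. The sub-file names are those of Table 5, and the conditions follow Table 3. Since the intensity ranges in Table 5 are nested (for example, {5, 27} contains {10, 27}), the sketch lets a row belong to more than one chunk; this, the use of `<= -1` for the last file, the function name, and the sample rows are assumptions made for illustration. The sketch groups rows in memory and does not write any files.

```python
# One condition per sub-file, applied to Difference = MCF7_48hr - MCF7_3hr.
SUBFILE_RULES = {
    "greater_than_or_equal_to_10.xls": lambda d: d >= 10,
    "greater_than_or_equal_to_5.xls": lambda d: d >= 5,
    "greater_than_or_equal_to_1.xls": lambda d: d >= 1,
    "equal_to_0.xls": lambda d: d == 0,
    "less_than_-0.5.xls": lambda d: d < -0.5,
    "less_than_-1.xls": lambda d: d <= -1,  # Table 3 uses <= -1 for this class
}


def split_rows(rows):
    """Group rows by sub-file; a row may fall into several nested ranges."""
    chunks = {name: [] for name in SUBFILE_RULES}
    for row in rows:
        difference = row["MCF7_48hr"] - row["MCF7_3hr"]
        for name, matches in SUBFILE_RULES.items():
            if matches(difference):
                chunks[name].append(row)
    return chunks


# Hypothetical rows with differences of 11, 0 and -2.
rows = [
    {"symbol": "G1", "MCF7_3hr": 1.0, "MCF7_48hr": 12.0},
    {"symbol": "G2", "MCF7_3hr": 2.0, "MCF7_48hr": 2.0},
    {"symbol": "G3", "MCF7_3hr": 3.0, "MCF7_48hr": 1.0},
]
chunks = split_rows(rows)
```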
4.4 Data Integration
To identify the association rules, both files, TFvalues_Enhancer.xls and
Enhancer_expression_profile.xls, play an important role, and we need the data from both files
together. Hence, in this step, the two original files are merged into a third file.
Figure 12: Data Integration
The new file is large (about 65 MB) and has 82 columns in total. The first column is “New tag – name
of the enhancer”, the second column is “symbol – gene name” from file 2, and the remaining 80
columns are the p-values of the 80 TFs from file I, which is TFvalues_enhancer.xls. For example, the
column labeled “F$HAC1_Q2” is named after a transcription factor and contains the p-value of that
TF for each candidate enhancer.
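The merge can be pictured as a join of the two tables on the enhancer identifier. The sketch below works on rows held as dictionaries; the key column name "NewTag", the inner-join behavior (enhancers missing from either file are dropped) and the sample rows are all assumptions made for illustration, since the text does not say how the merge was carried out.

```python
def merge_on_enhancer(expression_rows, tf_rows, key="NewTag"):
    """Join expression-profile rows with TF p-value rows on the enhancer id."""
    tf_by_key = {row[key]: row for row in tf_rows}
    merged = []
    for row in expression_rows:
        tf_row = tf_by_key.get(row[key])
        if tf_row is None:
            continue  # assumed: enhancers without TF p-values are skipped
        combined = {key: row[key], "symbol": row["symbol"]}
        # Append the TF p-value columns after the two identifying columns.
        combined.update({k: v for k, v in tf_row.items() if k != key})
        merged.append(combined)
    return merged


# Hypothetical rows; real enhancer ids and p-values are not reproduced here.
expression_rows = [
    {"NewTag": "E1", "symbol": "GREB1"},
    {"NewTag": "E2", "symbol": "MGP"},
]
tf_rows = [
    {"NewTag": "E1", "F$HAC1_Q2": 0.0004, "F$LEU3_B": 0.03},
    {"NewTag": "E3", "F$HAC1_Q2": 0.2, "F$LEU3_B": 0.001},
]
merged = merge_on_enhancer(expression_rows, tf_rows)
print(merged)
```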
For easy understanding, we have labeled the 80 transcription factors TF1 through TF80. The table
below shows the one-to-one mapping between each TF name and its corresponding number.
Table 6: TF Name and Number Mapping Table
TF number | TF name
TF1 | F$HAC1_Q2
TF2 | F$LEU3_B
TF3 | I$GAGAFACTOR_Q6
TF4 | P$ABF1_03
TF5 | V$AHRARNT_02
TF6 | V$ALPHACP1_01
TF7 | V$AML_Q6
TF8 | V$AP1_C
TF9 | V$AP1_Q2
TF10 | V$AP2ALPHA_01
TF11 | V$AP2GAMMA_01
TF12 | V$AP4_Q6_01
TF13 | V$AR_01
TF14 | V$AREB6_01
TF15 | V$ATF3_Q6
TF16 | V$BACH1_01
TF17 | V$BACH2_01
TF18 | V$BEL1_B
TF19 | V$CACCCBINDINGFACTOR_Q6
TF20 | V$CBF_02
TF21 | V$CMYB_01
TF22 | V$CP2_02
TF23 | V$CREB_Q4
TF24 | V$DBP_Q6
TF25 | V$DEAF1_01
TF26 | V$E2_01
TF27 | V$E2F_Q6_01
TF28 | V$EBF_Q6
TF29 | V$ELK1_02
TF30 | V$ER_Q6
TF31 | V$ER_Q6_02
TF32 | V$ETS1_B
TF33 | V$FREAC4_01
TF34 | V$GATA1_01
TF35 | V$GATA1_04
TF36 | V$GCM_Q2
TF37 | V$GR_Q6
TF38 | V$HEN1_02
TF39 | V$HES1_Q2
TF40 | V$HIC1_02
TF41 | V$HNF3ALPHA_Q6
TF42 | V$KROX_Q6
TF43 | V$LMAF_Q2
TF44 | V$LRH1_Q5
TF45 | V$MEF3_B
TF46 | V$MEIS1_01
TF47 | V$MIF1_01
TF48 | V$MINI19_B
TF49 | V$MOVOB_01
TF50 | V$MTF1_Q4
TF51 | V$MYOGNF1_01
TF52 | V$NF1_Q6
TF53 | V$NFKAPPAB_01
TF54 | V$NRF2_Q4
TF55 | V$OLF1_01
TF56 | V$P53_01
TF57 | V$PAX3_01
TF58 | V$PAX5_01
TF59 | V$PAX6_Q2
TF60 | V$PAX9_B
TF61 | V$PEBP_Q6
TF62 | V$PXR_Q2
TF63 | V$ROAZ_01
TF64 | V$SMAD3_Q6
TF65 | V$SMAD4_Q6
TF66 | V$SOX10_Q6
TF67 | V$SP1_01
TF68 | V$SP3_Q3
TF69 | V$SREBP_Q3
TF70 | V$STAT1_01
TF71 | V$TEF1_Q6
TF72 | V$TGIF_01
TF73 | V$USF_Q6_01
TF74 | V$VMAF_01
TF75 | V$WHN_B
TF76 | V$WT1_Q6
TF77 | V$XPF1_Q6
TF78 | V$YY1_02
TF79 | V$ZF5_01
TF80 | V$ZNF219_01
In the next chapter, we discuss the data mining tool Weka. We also look at the two-stage
association rule mining approach, the interpretation of the results we received from Weka, and the
p-value based filtering method.
Chapter 5
ASSOCIATION RULE MINING
After preprocessing the data, our aim is to look for association rules between the set of
transcription factors and the set of genes. The figure below highlights the association rule mining
step of this project.
Figure 13: Association Rule Mining
Association rule learning is a method for discovering interesting relations between
variables in a large database. We have developed a two-stage association rule mining approach. In
the first stage, we generate association rules using Weka, a machine learning tool, and then
interpret the results obtained from Weka. In the second stage, association rules are generated
by filtering the data in Excel on the basis of a p-value threshold.
5.1 About Weka
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand. Weka
is free software available under the GNU General Public License. It runs on almost any
platform and has been tested under Linux, Windows, and Macintosh operating systems and even
on a personal digital assistant. Weka contains a collection of visualization tools and algorithms
for data analysis and predictive modeling, together with graphical user interfaces for easy access
to its functionality. It provides extensive support for the whole process of experimental data
mining, including preparing the input data, evaluating learning schemes statistically, and
visualizing the input data and the result of learning [11].
5.2 Association Rule Mining Method
After preprocessing the data, our aim is to identify a suitable data mining method that can
recognize the association rules. Weka has three association rule learners.
(1) Apriori implements the Apriori algorithm. It starts with a minimum support of 100%
of the data items and decreases this in steps of 5% until there are at least 10 rules with the
required minimum confidence of 0.9 or until the support has reached a lower bound of 10%.
Association rules show attribute-value conditions that occur frequently together in a given
dataset. Support and confidence are two measures of rule interestingness. They respectively
reflect the usefulness and certainty of discovered rules. Let I = {i1, i2, … , in} be a set of n
binary attributes called items. Let D = {t1, t2, … , tm} be a set of m transactions called the
database. Each transaction in D has a unique transaction ID and contains a subset of the items in
I. A rule is defined as an implication of the form X => Y where X and Y are subsets of I and
X ∩ Y = ∅. The sets of items (itemsets for short) X and Y are called the antecedent
(left-hand side or LHS) and consequent (right-hand side or RHS) of the rule, respectively. The
support supp(X) of an itemset X is defined as the proportion of transactions in the data set which
contain the itemset. The confidence of a rule is defined as conf(X => Y) = supp(X ∪ Y) / supp(X) [12].
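The two definitions above translate directly into code. The following is a minimal sketch, assuming each transaction is represented as a set of “attribute=value” items; the four transactions are made up for illustration and do not come from the project data.

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(lhs, rhs, transactions):
    """conf(X => Y) = supp(X ∪ Y) / supp(X); undefined when supp(X) is 0."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)


# Hypothetical transactions: one per candidate enhancer.
transactions = [
    {"TF1=T", "TF2=T", "symbol=GREB1"},
    {"TF1=T", "TF2=F", "symbol=GREB1"},
    {"TF1=T", "TF2=T", "symbol=MGP"},
    {"TF1=F", "TF2=T", "symbol=MGP"},
]

supp_tf1 = support({"TF1=T"}, transactions)
conf_rule = confidence({"TF1=T"}, {"symbol=GREB1"}, transactions)
print(supp_tf1, conf_rule)
```

Here “TF1=T” occurs in three of the four transactions, and two of those three also contain “symbol=GREB1”, so the rule TF1=T => symbol=GREB1 has support 0.75 for its antecedent and confidence 2/3.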
(2) Predictive Apriori combines confidence and support into a single measure of
predictive accuracy and finds the best n association rules in order.
(3) Tertius finds rules according to a confirmation measure, seeking rules with multiple
conditions in the consequent, like Apriori, but differing in that these conditions are OR’d
together, not ANDed [11]. Of the three, we have used Apriori, which is the simplest and
meets our needs.
5.3 Using Weka
The easiest way to use Weka is through a graphical user interface called Explorer. This
gives access to all of its functionalities using menu selection and form filling.
5.3.1 Loading the Data
In addition to the native ARFF data file format, WEKA has the capability to read in
".csv" format files. This is fortunate since many databases or spreadsheet applications can save or
export data into flat files in this format. We load the data set in “.csv” format into WEKA and then
perform association rule mining on the input data set. While all of these operations can be
performed from the command line, we use the WEKA Explorer GUI, which is easy to use.
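The project loads “.csv” files directly, so no conversion is required. For readers who prefer the native format, the sketch below shows how nominal data could be written as ARFF text. It is an illustration only: the relation name, attribute list and rows are hypothetical, and attribute names containing special characters may need quoting that this sketch does not handle.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text for nominal attributes.

    attributes: list of (name, allowed_values) pairs.
    rows: list of value lists, one value per attribute.
    """
    lines = ["@relation " + relation, ""]
    for name, values in attributes:
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical two-TF example with T/F labels.
attributes = [
    ("symbol", ["GREB1", "MGP"]),
    ("TF1", ["T", "F"]),
    ("TF2", ["T", "F"]),
]
rows = [["GREB1", "T", "F"], ["MGP", "F", "T"]]
arff_text = to_arff("enhancers", attributes, rows)
print(arff_text)
```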
Figure 14 shows a snapshot of the Preprocess tab after clicking "open" and navigating to
the directory containing the data file (.csv or .arff). In this example we open the GREB1.csv
file. GREB1.csv is obtained by integrating the two original files TFvalues_enhancer.xls and
Enhancer_expression_profile.xls as seen in section 4.4.
Figure 14: Loading Data
Figure 15 shows a sample section of the “GREB1.csv” file in the format accepted by
Weka. The file has 82 columns. The first column is “NewTag”, which contains the
candidate enhancer id; the second column is “symbol”, which contains the gene name for that
candidate enhancer from data file 2, ‘Enhancer_expression_profile.xls’; and the remaining
80 columns are the p-values of the 80 TFs from file 1, ‘TFvalues_enhancer.xls’.
Figure 15: GREB1.csv File Format
Once the data is loaded, WEKA will recognize the attributes and compute some basic
statistics on each attribute.
The left panel in figure 16 shows the list of recognized attributes. Clicking on any
attribute in the left panel shows the basic statistics on that attribute. For categorical attributes,
the frequency of each attribute value is shown, while for continuous attributes we obtain the
min, max, mean and standard deviation. For instance, attribute “TF1” has some “T” and some “F”
values in its column. Hence the frequency of each of them, which is the number of times it occurs,
is shown in the count column of the right panel.
Figure 16: Statistics Window
5.3.2 Association Rule Mining Using Weka
Once the data is loaded and the attributes are recognized by Weka, the next step is to generate
association rules. Clicking on the "Associate" tab (Figure 16) brings up the interface for the
association rule algorithms. The Apriori algorithm, which we use, is selected by default.
However, in order to change the parameters for this run (e.g., support, confidence), we
click on the text box immediately to the right of the "Choose" button (Figure 16). Note that this
box, at any given time, shows the specific command-line arguments that are to be used for the
algorithm.
The dialog box for changing the parameters is shown in figure 17. Here, we can specify
the various parameters associated with Apriori.
Figure 17: Parameter Editor Window
After the execution of the Apriori algorithm is completed, its result can be seen in the
“Associator output” window. The association rules generated are shown in this window,
as in figure 18.
Figure 18: Associator Output Window
5.4 Result Interpretation
As discussed in section 4.3.2, we are interested in finding sets of TFs that are significant
in triggering a gene to go up or go down. From figure 18, we can see that the TFs labeled
“T” are significant for the upward expression of the GREB1 gene. Hence we are only interested in
TFs having the label “T”. We therefore discard the TFs with label “F” when writing our association
rule. For instance, let us look at rule 6 in figure 18. This rule contains the TFs TF30, TF31, TF66,
TF71 and TF79. Of these, only TF79 has the label “T”, so we keep TF79 and discard the others.
The association rule is rewritten as:
TF79 = “T” => symbol = “GREB1”, which can be simplified as TF79 => GREB1. The
rule suggests that out of all the associated TFs, TF79 is significant in the expression of the gene
“GREB1”.
Further details and more association rules using WEKA are described in experiment 1 of
chapter 6.
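The interpretation step of section 5.4 amounts to dropping every antecedent item labeled “F”. A minimal sketch follows, using the labels of rule 6 as described in the text; the function name and the dictionary representation of a rule are assumptions made for illustration and do not reflect Weka's output format.

```python
def simplify_rule(antecedent, gene):
    """Keep only the TFs labeled "T" and format the simplified rule."""
    significant = [tf for tf, label in antecedent.items() if label == "T"]
    return " ∩ ".join(significant) + " => " + gene


# Rule 6 of figure 18 as described in the text: only TF79 carries the label "T".
rule6 = {"TF30": "F", "TF31": "F", "TF66": "F", "TF71": "F", "TF79": "T"}
simplified = simplify_rule(rule6, "GREB1")
print(simplified)
```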
5.5 P-value based Filtering Method
Using Weka’s association rule mining method, we obtained a set of association rules
between a set of transcription factors and a single gene. We would like to get association rules
between a set of TFs and a set of genes. Towards that end, we use a method, applied in sections
6.2 through 6.4, in which the association rules are obtained by filtering the data in MS-Excel on
the basis of a p-value threshold.
Consider the file greater_than_or_equal_to_10.xls of section 4.3.3. This file contains the set of
genes which are up expressed, as discussed in section 4.3.1. We filter the data in this file on the
basis of a p-value threshold. For instance, if a TF’s p-value < 0.005, then that TF is labeled “T”,
else “F”.
As seen in figure 19, the file contains the p-values of 80 TFs associated with candidate
enhancers. We filter the p-values by applying a function such as IF(column value < 0.005, "T", "F")
and append the corresponding result of the filtering at the end of the file, as shown in figure 19.
TFs labeled “T” are considered to be significant.
Figure 19: Filtering using P-value Threshold for a Set of Genes which are Up Expressed
If the TF value is always true for a gene, then we select that TF by checking the TF in the
Excel sheet, as shown in figure 20.
Figure 20: The Checked Boxes correspond to TFs which are Significant
For instance, in row 2, column 2 of figure 20, the GREB1 gene has been checked for TF1,
which implies that TF1 is always significant for the GREB1 gene. Similarly, in row 5, column 2,
TF1 is not significant for the MGP gene, so the corresponding box is not checked. Using this
approach, we can see that TF3, TF5 and TF6 are significant for all seven genes GREB1, CDH26,
FLJ30058, MGP, PKIB, SGK1 and SYTL4, whereas TF1 is significant for only 5 genes: GREB1,
CDH26, FLJ30058, PKIB and SGK1. The next step is to determine support. If, for a subset of genes,
a TF is significant (i.e. its p-value < 0.005) for α% of the genes, then the association rule
is said to hold at the “support = α%” level. In this file we define α = 100 and consider only those
TFs in our association rule which satisfy support = 100%. We do this by checking those TFs in the
Excel sheet of figure 20 which are significant 100% of the time. For example, TF3, TF5
and TF6 are always significant for the set of genes of figure 20 (GREB1, CDH26, FLJ30058,
MGP, PKIB, SGK1, SYTL4), which gives the rule:
TF3 ∩ TF5 ∩ TF6 => GREB1 ∩ CDH26 ∩ FLJ30058 ∩ MGP ∩ PKIB ∩ SGK1 ∩ SYTL4
If, on the other hand, the required support were 70%, then TF1 would also be selected (as seen in
figure 19, TF1 is true for 5 out of 7 genes, implying 5/7 = 71% support). So the association rule
would be:
TF1 ∩ TF3 ∩ TF5 ∩ TF6 => GREB1 ∩ CDH26 ∩ FLJ30058 ∩ MGP ∩ PKIB ∩ SGK1 ∩ SYTL4
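The support-based selection can be sketched as follows. The labels reproduce only what the text states for TF1, TF3, TF5 and TF6 over the seven up-expressed genes; the remaining TFs are omitted, and the function names are hypothetical.

```python
GENES = ["GREB1", "CDH26", "FLJ30058", "MGP", "PKIB", "SGK1", "SYTL4"]
TF1_SIGNIFICANT = {"GREB1", "CDH26", "FLJ30058", "PKIB", "SGK1"}

# labels_by_gene[gene][tf] is "T" or "F", as produced by the p-value filter.
labels_by_gene = {
    gene: {
        "TF1": "T" if gene in TF1_SIGNIFICANT else "F",
        "TF3": "T",
        "TF5": "T",
        "TF6": "T",
    }
    for gene in GENES
}


def tf_support(labels, tf):
    """Percentage of genes for which the TF is labeled "T"."""
    hits = sum(1 for gene in labels if labels[gene][tf] == "T")
    return 100.0 * hits / len(labels)


def select_tfs(labels, tfs, alpha):
    """Keep the TFs whose support is at least alpha percent."""
    return [tf for tf in tfs if tf_support(labels, tf) >= alpha]


tfs = ["TF1", "TF3", "TF5", "TF6"]
rule_100 = select_tfs(labels_by_gene, tfs, 100)
rule_70 = select_tfs(labels_by_gene, tfs, 70)
print(rule_100, rule_70, round(tf_support(labels_by_gene, "TF1"), 1))
```

With α = 100 only TF3, TF5 and TF6 survive; lowering α to 70 admits TF1 as well, because its support is 5/7, or about 71%.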
Figure 21: Two-Stage Association Rule Mining Approach
Figure 21 gives an overview of the two-stage association rule mining approach
discussed in sections 5.3, 5.4 and 5.5. In experiment 1 of chapter 6, the stage 1 approach is
followed, and in experiments 2, 3 and 4, the stage 2 approach is followed.
In chapter 6, we look at all the experiments in detail: their aim, methodology and results.
Chapter 6
EXPERIMENTS
In this chapter, we describe four experiments that we carried out to derive association
rules between a set of transcription factors and a set of genes. In experiment 1, described in
section 6.1, we used only the Weka tool. The second stage of the two-stage association rule
mining method is used in experiments 2, 3 and 4. In each experiment, we have considered a
few behaviors, such as the expression/intensity level of genes increasing (Category: E/F) and
decreasing (Category: A), as described in section 4.3.1. In each experiment, association rules are
generated with various support and confidence values.
6.1 Experiment 1
Aim: To derive an association rule for a given target gene.
Procedure: TFs whose p-value < 0.001 are assumed to be significant. Based on this p-value
criterion, the numerical p-values are converted to nominal values such as “T” or “F”.
This preprocessed data is then given as input to Weka, which generates
association rules for every gene. For example, one generated association rule is TF30 = “F”,
TF31 = “F”, TF66 = “F”, TF71 = “F”, TF79 = “T” => symbol = “GREB1”. This
association rule is further interpreted to show which TFs are responsible for the
expression of a given gene: keeping only the TF labeled “T” gives TF79 = “T” => symbol =
“GREB1”, which can be simplified as TF79 => GREB1. The rule suggests that out of all the
associated TFs, TF79 is significant in the expression of the gene “GREB1”. Support is considered
to be 100%.
6.1.1 Behavior: Expression of gene increasing (Category: F)
As discussed in section 4.3.1, genes are classified into different classes on
the basis of the difference between the expression/intensity level values at 48hr and at 3hr. If the
difference >= 10, the intensity of this class of genes is increasing and the class is labeled F; the
target gene is therefore up-expressed. The association rule for that target gene is written as:
TF1 & TF2 => Target gene set ↑ F. Here, ‘↑ F’ denotes that the target gene is up and increasing.
Let us consider rule no. 1 from table 7 below. From the rule, it can be seen that the
transcription factors F$LEU3_B, V$DEAF1_01, V$E2F_Q6_01, V$MYOGNF1_01,
V$SP3_Q3, V$TGIF_01, V$WHN_B and V$ZF5_01 are significant in the up expression of the gene
“MGP”. The remaining rules of the table can be interpreted similarly.
Table 7: Association Rules Table 1
Rule no. | Gene name | Rules by TF name | Confidence (%)
1 | MGP | F$LEU3_B ∩ V$DEAF1_01 ∩ V$E2F_Q6_01 ∩ V$MYOGNF1_01 ∩ V$SP3_Q3 ∩ V$TGIF_01 ∩ V$WHN_B ∩ V$ZF5_01 => MGP ↑ (F) | 100
2 | CDH26 | F$LEU3_B ∩ V$AP2ALPHA_01 ∩ V$DEAF1_01 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$ELK1_02 ∩ V$GATA1_04 ∩ V$GR_Q6 ∩ V$P53_01 ∩ V$PEBP_Q6 ∩ V$SP1_01 ∩ V$YY1_02 ∩ V$ZNF219_01 => CDH26 ↑ (F) | 100
3 | GREB1 | V$ZF5_01 => GREB1 ↑ (F) | 100
4 | SGK1 | V$SMAD4_Q6 ∩ V$WHN_B => SGK1 ↑ (F) | 90
5 | PKIB | V$WT1_Q6 => PKIB ↑ (F) | 100
6 | FLJ30058 | F$HAC1_Q2 ∩ F$LEU3_B ∩ V$AHRARNT_02 ∩ V$ALPHACP1_01 ∩ V$AP1_C ∩ V$AP1_Q2 ∩ V$AR_01 ∩ V$BACH1_01 ∩ V$BACH2_01 ∩ V$CP2_02 ∩ V$DEAF1_01 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$EBF_Q6 ∩ V$ELK1_02 ∩ V$FREAC4_01 ∩ V$GR_Q6 ∩ V$HEN1_02 ∩ V$MIF1_01 ∩ V$MINI19_B ∩ V$MTF1_Q4 ∩ V$MYOGNF1_01 ∩ V$PAX5_01 ∩ V$PAX9_B ∩ V$ROAZ_01 ∩ V$SP1_01 ∩ V$SREBP_Q3 ∩ V$VMAF_01 ∩ V$ZF5_01 => FLJ30058 ↑ (F) | 100
7 | SYTL4 | F$HAC1_Q2 ∩ F$LEU3_B ∩ V$AHRARNT_02 ∩ V$AML_Q6 ∩ V$AP1_C ∩ V$AREB6_01 ∩ V$ATF3_Q6 ∩ V$BACH1_01 ∩ V$BACH2_01 ∩ V$CACCCBINDINGFACTOR_Q6 ∩ V$CBF_02 ∩ V$E2_01 ∩ V$E2F_Q6_01 ∩ V$EBF_Q6 ∩ V$ER_Q6 ∩ V$GATA1_01 ∩ V$HIC1_02 ∩ V$KROX_Q6 ∩ V$MINI19_B ∩ V$MTF1_Q4 ∩ V$MYOGNF1_01 ∩ V$NFKAPPAB_01 ∩ V$OLF1_01 ∩ V$P53_01 ∩ V$PAX3_01 ∩ V$PAX6_Q2 ∩ V$SP1_01 ∩ V$STAT1_01 ∩ V$USF_Q6_01 ∩ V$WT1_Q6 ∩ V$YY1_02 ∩ V$ZF5_01 => SYTL4 ↑ (F) | 100
6.1.2 Behavior: Expression of gene decreasing (Category: A)
Table 8 below shows the association rules for the decreasing behavior (Category: A) as described
in section 4.3.1.
Table 8: Association Rules Table 2
Rule no. | Gene name | Rules by TF name | Confidence (%)
1 | CYP24A1 | V$SP1_01 ∩ V$SP3_Q3 ∩ V$WHN_B ∩ V$WT1_Q6 => CYP24A1 ↓ (A) | 100
2 | KCNK5 | F$LEU3_B ∩ V$WHN_B ∩ V$ZF5_01 => KCNK5 ↓ (A) | 100
3 | TGM2 | V$WHN_B => TGM2 ↓ (A) | 100
4 | ZSCAN2 | V$AP2GAMMA_01 ∩ V$AR_01 ∩ V$ZF5_01 => ZSCAN2 ↓ (A) | 100
5 | CISH | V$DEAF1_01 ∩ V$E2F_Q6_01 ∩ V$MEF3_B ∩ V$NFKAPPAB_01 => CISH ↓ (A) | 100
6 | LOC6529 | V$BACH2_01 ∩ V$BEL1_B ∩ V$E2F_Q6_01 ∩ V$HES1_Q2 => LOC6529 ↓ (A) | 100
7 | MAP6D1 | V$E2F_Q6_01 ∩ V$KROX_Q6 ∩ V$PXR_Q2 ∩ V$ZF5_01 => MAP6D1 ↓ (A) | 100
8 | P2RY2 | V$WHN_B ∩ V$EBF_Q6 ∩ V$E2F_Q6_01 ∩ V$ALPHACP1_01 => P2RY2 ↓ (A) | 80
9 | PSCA | V$E2F_Q6_01 => PSCA ↓ (A) | 100
10 | SCNN1A | V$E2F_Q6_01 ∩ V$GR_Q6 ∩ V$MEIS1_01 ∩ V$ZF5_01 => SCNN1A ↓ (A) | 90
11 | SYT12 | V$MEIS1_01 => SYT12 ↓ (A) | 90
12 | VASN | V$HNF3ALPHA_Q6 ∩ V$GATA1_04 ∩ V$FREAC4_01 ∩ V$DBP_Q6 => VASN ↓ (A) | 90
Let us consider rule 1 from table 8. The rule means that the transcription factors
V$SP1_01, V$SP3_Q3, V$WHN_B and V$WT1_Q6 are significant for the downward
expression of the “CYP24A1” gene. Only category F and category A from the classification table
(table 3) are considered in this experiment.
Need for improvement:
As we see in tables 7 and 8 above, the association rules generated reflect only the
impact of a subset of TFs on a single target gene, not on a subset of genes. Based on the
feedback from the collaborator at the genome centre, to obtain more useful results we need to
improve on the method used for experiment 1. Therefore, our goal in the next
experiment is to identify a collection of association rules between a subset of TFs and a subset of
genes.
Expression levels are available at different points in time: 3 hrs, 6 hrs,
12 hrs, 24 hrs and 48 hrs. We would like to find out which breast cancer genes are associated
by their peaking intensity level. There are many possibilities. One is that the gene
expression peaks at 3 hrs and the intensity declines by 48 hrs. Another is that the
gene expression is low at 3 hrs and peaks at 48 hrs. In experiments 2, 3 and 4 we consider these
possibilities.
A further possibility is that the expression level is low at 3 hrs and peaks at 24 hrs. It could
also be that the breast cancer genes peak at 3 hrs and 48 hrs and have their minimum
expression level at 24 hrs. We leave these cases for future work.
6.2 Experiment 2
In this experiment, we want to improve on experiment 1. The
association rules should target a set of genes instead of just one gene. Therefore our aim was
to identify a collection of association rules between a subset of the 80 TFs and a subset of genes
with strong support (for example, support >= 70%) and confidence = 99.98%.
Our methodology is as follows. We first define the p-value threshold as 0.002 and the
selection criterion as p-value < 0.002. In other words, a transcription factor whose p-value < 0.002
is assumed to be significant. Next we define the support strength. For this experiment, TFs
having support >= 70% are considered to be strong. For instance, in a set of 10 genes, if a
particular TF’s p-value is less than 0.002 for 7 genes, then that TF is said to be significant 70%
of the time.
Formal description of the “support = α%” association rule: If, for a subset of genes, a TF is
significant (i.e. its p-value < 0.002) for α% of the genes, then the rule is said to be a
“support = α%” association rule. In our experiment we define α = 70 and consider only those TFs
in our association rule which satisfy support >= 70%.
The results contain two parts:
(1) Results of the experiment 2 with α = 70:
a) Behavior- Expression of genes increasing (Category: F):
As discussed in section 4.3.1, genes are classified into different classes on
the basis of the difference between the expression/intensity level values at 48hr and at
3hr. If the difference >= 10, the intensity of this class of genes is increasing and the class is
labeled F; the target gene is therefore up-expressed. The association rule for that target gene is
written as: TF1 & TF2 => Target gene set ↑ F. Here, ‘↑ F’ denotes that the target
gene is up and increasing.
TF5 ∩ TF15 ∩ TF23 ∩ TF24 ∩ TF26 ∩ TF30 ∩ TF31 ∩ TF35 ∩ TF37 ∩ TF41 ∩ TF43
∩ TF45 ∩ TF54 ∩ TF59 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF70 ∩ TF71 => (GREB1, MGP,
PKIB, SGK1, SYTL4, FLJ30058, CDH26) ↑ (F)
This rule means that the set of TFs TF5, TF15, TF23, TF24, TF26, TF30, TF31,
TF35, TF37, TF41, TF43, TF45, TF54, TF59, TF61, TF62, TF64, TF70 and TF71 plays a
significant role in the up expression of the set of genes GREB1, MGP, PKIB, SGK1,
SYTL4, FLJ30058 and CDH26.
Note: TFs marked in red are the well known activators like AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04), HNF3 (TF41 - HNF3ALPHA_Q6).
b) Behavior- Expression of genes decreasing (Category: A):
As discussed in section 4.3.1, genes are classified into different classes on
the basis of the difference between the expression/intensity level values at 48hr and at
3hr. If the difference <= -1, the intensity of this class of genes is decreasing and the class is
labeled A; the target gene is therefore down-expressed. The association rule for that target gene
is written as: TF1 & TF2 => Target gene set ↓ A. Here, ‘↓ A’ denotes that the target
gene is down and decreasing.
TF3 ∩ TF4 ∩ TF5 ∩ TF6 ∩ TF8 ∩ TF12 ∩ TF13 ∩ TF17 ∩ TF18 ∩ TF20 ∩ TF31 ∩
TF32 ∩ TF36 ∩ TF38 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF56 ∩ TF57 ∩ TF63 ∩ TF66 ∩
TF69 ∩ TF70 ∩ TF71 ∩ TF77 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296,
KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1,
VASN, ZSCAN2) ↓ (A)
This rule means that the set of TFs TF3, TF4, TF5, TF6, TF8, TF12, TF13, TF17,
TF18, TF20, TF31, TF32, TF36, TF38, TF43, TF44, TF45, TF56, TF57, TF63, TF66,
TF69, TF70, TF71 and TF77 plays a significant role in the down expression of the set
of genes CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1,
P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN and ZSCAN2
with support >= 70% and confidence = 99.98%.
(2) Results of the experiment 2 with α = 85: The association rule considers a TF only if it satisfies
support >= 85%.
a) Behavior- Expression of genes increasing (Category: F):
TF5 ∩ TF30 ∩ TF31 ∩ TF37 ∩ TF54 ∩ TF61 ∩ TF62 ∩ TF71 => (GREB1, MGP,
PKIB, SGK1, SYTL4, FLJ30058, CDH26) ↑ (F)
This rule means that the set of TFs TF5, TF30, TF31, TF37, TF54, TF61, TF62 and
TF71 is significant in the up expression of the set of genes GREB1, MGP, PKIB,
SGK1, SYTL4, FLJ30058 and CDH26 with support >= 85% and confidence = 99.98%.
b) Behavior- Expression of genes decreasing (Category: A):
TF31 ∩ TF32 ∩ TF44 ∩ TF71 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296,
KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1,
VASN, ZSCAN2) ↓ (A)
From the rule, it can be seen that the set of TFs TF31, TF32, TF44 and TF71 is
significant in the down expression of the set of genes CISH, CTR9, CYP24A1,
KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA, SCNN1A, SMTNL2, SYT12,
TGM2, UBTD1, VASN and ZSCAN2 with support >= 85% and confidence = 99.98%.
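Because a TF with support >= 85% necessarily has support >= 70%, the α = 85 rule for a gene set can only contain TFs that already appear in the α = 70 rule for the same gene set. The short check below, a sketch using the TF numbers copied from the four rules of experiment 2, confirms this and lists the TFs dropped when the support is tightened. The helper name is hypothetical.

```python
# TF numbers copied from the experiment 2 rules.
F_ALPHA_70 = {5, 15, 23, 24, 26, 30, 31, 35, 37, 41, 43, 45, 54, 59, 61, 62,
              64, 70, 71}
F_ALPHA_85 = {5, 30, 31, 37, 54, 61, 62, 71}
A_ALPHA_70 = {3, 4, 5, 6, 8, 12, 13, 17, 18, 20, 31, 32, 36, 38, 43, 44, 45,
              56, 57, 63, 66, 69, 70, 71, 77}
A_ALPHA_85 = {31, 32, 44, 71}


def dropped_when_tightening(loose, strict):
    """Return the TFs lost when raising the support; fail if strict is not a subset."""
    if not strict <= loose:
        raise ValueError("stricter rule contains TFs absent from the looser rule")
    return sorted(loose - strict)


dropped_f = dropped_when_tightening(F_ALPHA_70, F_ALPHA_85)
dropped_a = dropped_when_tightening(A_ALPHA_70, A_ALPHA_85)
print("Category F drops:", dropped_f)
print("Category A drops:", dropped_a)
```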
6.3 Experiment 3
In experiment 2, both parts of the result contain very few of the well-known
activators. In anticipation of obtaining more well-known
activators, in experiment 3 we use a different approach. Our aim is to relax the p-value threshold,
increase the support and analyze the results.
We therefore relax the p-value threshold to 0.005, increase the support α to 90% and use
confidence = 99.95%. The selection criterion is p-value < 0.005, so a transcription factor
whose p-value < 0.005 is assumed to be significant.
Results of experiment 3 with α = 90%:
a) Behavior- Expression of genes decreasing (Category: A):
Considering the possibility that the expression level is at its peak at 3 hours and subsides
by 48 hours, we find the difference of intensity between 48 hours and 3 hours. The association
rule comes out to be:
TF1 ∩ TF4 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF11 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15
∩ TF16 ∩ TF17 ∩ TF18 ∩ TF19 ∩ TF20 ∩ TF21 ∩ TF22 ∩ TF23 ∩ TF24 ∩ TF25 ∩
TF26 ∩ TF28 ∩ TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF36 ∩
TF37 ∩ TF38 ∩ TF42 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF48 ∩ TF50 ∩
TF52 ∩ TF53 ∩ TF54 ∩ TF55 ∩ TF56 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩
TF64 ∩ TF67 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF72 ∩ TF73 ∩ TF75 ∩ TF77 ∩ TF78 =>
(CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2, PSCA,
SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) ↓ (A)
Note: TFs marked in red are the well known activators like AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04), HNF3 (TF41 - HNF3ALPHA_Q6), SP1 (TF67 - SP1_01).
b) Behavior- Expression of genes increasing (Category: F):
Considering that the expression level is at its minimum at 3 hours and increases by 48 hours,
we get the following association rule:
TF3 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15 ∩ TF16 ∩
TF17 ∩ TF18 ∩ TF19 ∩ TF20 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF25 ∩ TF26 ∩ TF28 ∩
TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF36 ∩ TF37 ∩ TF38 ∩
TF39 ∩ TF41 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF50 ∩ TF51 ∩ TF52 ∩
TF53 ∩ TF54 ∩ TF55 ∩ TF57 ∩ TF58 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩
TF64 ∩ TF66 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF72 ∩ TF73 ∩ TF74 ∩ TF75 ∩ TF77 ∩
TF78 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26) ↑ (F)
Note: TFs marked in red are the well known activators like AP1 (TF8 - AP1_C and TF9 - AP1_Q2), FREAC4 (TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04), HNF3 (TF41 - HNF3ALPHA_Q6).
c) Behavior- Expression of genes increasing (Category: E):
As discussed in section 4.3.1, genes are classified into different classes on
the basis of the difference between the expression/intensity level values at 48hr and at
3hr. If the difference >= 5, the intensity of this class of genes is increasing and the class is
labeled E; the target gene is therefore up-expressed. The association rule for that target gene is
written as: TF1 & TF2 => Target gene set ↑ E. Here, ‘↑ E’ denotes that the target
gene is up and increasing. Considering that the expression level is at its minimum at 3 hours
and increases by 48 hours, we get the following association rule:
TF6 ∩ TF7 ∩ TF9 ∩ TF14 ∩ TF17 ∩ TF18 ∩ TF20 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF26 ∩
TF29 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF37 ∩ TF41 ∩ TF43 ∩
TF44 ∩ TF45 ∩ TF46 ∩ TF52 ∩ TF54 ∩ TF57 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF66 ∩
TF71 ∩ TF72 ∩ TF74 => (AREG, CALCR, CXCL12, DEPDC6, HCK, HEY2, MYB,
RAB31, RAMP3, SLC39A8, GREB1, CDH26, FLJ30058, MGP, PKIB, SGK1, SYTL4)
↑ (E)
Note: TFs marked in red are well known activators like AP1 (TF9 - AP1_Q2), FREAC4
(TF33 - FREAC4_01), GATA (TF34 - GATA1_01, TF35 - GATA1_04), HNF3 (TF41 - HNF3ALPHA_Q6).
6.4 Experiment 4
In experiment 3, the number of transcription factors in the association rules is high. To
obtain more useful and precise results, we want to reduce the number of transcription factors
appearing in the rules. In anticipation of obtaining fewer transcription factors in
the rules, in experiment 4 our aim is to further increase the support and analyze the results.
We therefore increase α to 100.
Results of experiment 4 with α = 100:
a) Behavior- Expression of genes decreasing (Category: A):
Considering the possibility that the expression level is at its peak at 3 hours and subsides by
48 hours, we find the difference of intensity between 48 hours and 3 hours. The association rule
comes out to be:
TF4 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF8 ∩ TF9 ∩ TF12 ∩ TF13 ∩ TF14 ∩ TF15 ∩ TF16 ∩
TF20 ∩ TF21 ∩ TF22 ∩ TF23 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF30 ∩ TF31 ∩ TF32 ∩ TF34
∩ TF35 ∩ TF36 ∩ TF38 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF48 ∩ TF50 ∩ TF52 ∩
TF55 ∩ TF56 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF69 ∩ TF70 ∩ TF71 ∩ TF73
∩ TF77 => (CISH, CTR9, CYP24A1, KCNK5, LOC65296, KCNK6, MAP6D1, P2RY2,
PSCA, SCNN1A, SMTNL2, SYT12, TGM2, UBTD1, VASN, ZSCAN2) ↓ (A)
b) Behavior- Expression of genes increasing (Category: F):
Considering the possibility that the expression level is low at 3 hours and peaks by 48
hours, we find the difference of intensity between 48 hours and 3 hours. The association rule
comes out to be:
TF3 ∩ TF5 ∩ TF6 ∩ TF7 ∩ TF9 ∩ TF13 ∩ TF16 ∩ TF18 ∩ TF19 ∩ TF20 ∩ TF21 ∩
TF23 ∩ TF24 ∩ TF25 ∩ TF26 ∩ TF28 ∩ TF29 ∩ TF31 ∩ TF33 ∩ TF34 ∩ TF35 ∩
TF36 ∩ TF37 ∩ TF43 ∩ TF44 ∩ TF45 ∩ TF46 ∩ TF47 ∩ TF50 ∩ TF51 ∩ TF52 ∩
TF54 ∩ TF59 ∩ TF60 ∩ TF61 ∩ TF62 ∩ TF63 ∩ TF64 ∩ TF66 ∩ TF70 ∩ TF71 ∩
TF72 ∩ TF74 ∩ TF78 => (GREB1, MGP, PKIB, SGK1, SYTL4, FLJ30058, CDH26)
↑ (F)
c) Behavior- Expression of genes increasing (Category: E):
Considering the possibility that the expression level is low at 3 hours and peaks by 48
hours, we find the difference of intensity between 48 hours and 3 hours. The association rule
comes out to be:
TF7 ∩ TF9 ∩ TF18 ∩ TF21 ∩ TF23 ∩ TF24 ∩ TF31 ∩ TF33 ∩ TF34 ∩ TF35 ∩ TF43
∩ TF45 ∩ TF46 ∩ TF52 ∩ TF61 ∩ TF62 ∩ TF64 ∩ TF72 ∩ TF74 => (AREG, CALCR,
CXCL12, DEPDC6, HCK, HEY2, MYB, RAB31, RAMP3, SLC39A8, GREB1, CDH26,
FLJ30058, MGP, PKIB, SGK1, SYTL4) ↑ (E)
6.5 Result Analysis
Expected results: Well-known activators such as AP1, GATA, SP1, HNF3 and FREAC4 should show
up in the rules.
Actual results: The association rules obtained in experiment 3 contain all (100%) of the
well-known activators AP1, GATA, SP1, HNF3 and FREAC4. In experiments 2 and 4, 60% (three of
the five) of the well-known activators are present in the rules.
Experiment | AP1 | GATA | SP1 | HNF3 | FREAC4
Experiment 2 | √ | √ | X | √ | X
Experiment 3 | √ | √ | √ | √ | √
Experiment 4 | √ | √ | X | X | √
Figure 22: Result Analysis
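The presence of the well-known activators can be recomputed from the rules as printed in sections 6.2 through 6.4. The sketch below is an illustrative tally, not part of the original analysis: the TF numbers are copied from the rules (for experiment 2 only the α = 70 rules are listed, since the α = 85 rules are subsets of them), and the activator-to-TF mapping follows the notes in chapter 6.

```python
ACTIVATORS = {
    "AP1": {8, 9},
    "GATA": {34, 35},
    "SP1": {67},
    "HNF3": {41},
    "FREAC4": {33},
}

RULES = {
    "Experiment 2": [
        {5, 15, 23, 24, 26, 30, 31, 35, 37, 41, 43, 45, 54, 59, 61, 62, 64,
         70, 71},                                               # Category F
        {3, 4, 5, 6, 8, 12, 13, 17, 18, 20, 31, 32, 36, 38, 43, 44, 45, 56,
         57, 63, 66, 69, 70, 71, 77},                           # Category A
    ],
    "Experiment 3": [
        {1, 4, 5, 6, 7, 8, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
         23, 24, 25, 26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 42, 43,
         44, 45, 46, 47, 48, 50, 52, 53, 54, 55, 56, 59, 60, 61, 62, 63, 64,
         67, 69, 70, 71, 72, 73, 75, 77, 78},                   # Category A
        {3, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 23, 24, 25,
         26, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 43, 44, 45,
         46, 47, 50, 51, 52, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 64, 66,
         69, 70, 71, 72, 73, 74, 75, 77, 78},                   # Category F
        {6, 7, 9, 14, 17, 18, 20, 21, 23, 24, 26, 29, 30, 31, 32, 33, 34, 35,
         37, 41, 43, 44, 45, 46, 52, 54, 57, 61, 62, 64, 66, 71, 72,
         74},                                                   # Category E
    ],
    "Experiment 4": [
        {4, 5, 6, 7, 8, 9, 12, 13, 14, 15, 16, 20, 21, 22, 23, 25, 26, 28, 30,
         31, 32, 34, 35, 36, 38, 43, 44, 45, 46, 48, 50, 52, 55, 56, 59, 60,
         61, 62, 63, 69, 70, 71, 73, 77},                       # Category A
        {3, 5, 6, 7, 9, 13, 16, 18, 19, 20, 21, 23, 24, 25, 26, 28, 29, 31,
         33, 34, 35, 36, 37, 43, 44, 45, 46, 47, 50, 51, 52, 54, 59, 60, 61,
         62, 63, 64, 66, 70, 71, 72, 74, 78},                   # Category F
        {7, 9, 18, 21, 23, 24, 31, 33, 34, 35, 43, 45, 46, 52, 61, 62, 64, 72,
         74},                                                   # Category E
    ],
}


def activators_found(rule_sets):
    """List the activators with at least one TF in any rule of the experiment."""
    tfs = set().union(*rule_sets)
    return [name for name, ids in ACTIVATORS.items() if tfs & ids]


for experiment, rule_sets in RULES.items():
    found = activators_found(rule_sets)
    share = 100 * len(found) // len(ACTIVATORS)
    print(experiment, found, "%d%%" % share)
```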
Chapter 7
CONCLUSION
7.1 Summary
The goals of this project were to 1) develop a method of data preprocessing that makes the
data usable by a data mining algorithm, and 2) look for association rules between a
group of transcription factors (module) and target gene behavior in a single time-series breast
cancer profile data set. Towards this end, a number of experiments were performed to achieve
results with better accuracy. Experiments 2, 3 and 4 improve on experiment 1, based on the
genome centre’s feedback, because they identify association rules between a subset of TFs and
a subset of genes. Experiments 3 and 4 both use p-value < 0.005 (confidence = 99.95%).
Experiment 3 has support = 90% whereas experiment 4 has support = 100%. The well-known
activators AP1, GATA, FREAC4, HNF3 and SP1 are all found in experiment 3,
whereas only AP1, GATA and FREAC4 are found in experiment 4.
7.2 Learning Experience
Weka is a useful data mining tool, and MS-Excel is a useful tool for data preprocessing;
its data filtering capability in particular can be very useful. It was good
to be able to apply the concepts learnt in statistics to solve a real-world problem.
Report writing is also a very important learning process, and one should therefore begin
writing from the start of the project.
7.3 Future Work
More work can be done on this project to enhance the results obtained.
1) Currently the p-value is the most important attribute and the data is classified and
analyzed on its basis. However, it would be an important data point to incorporate
position value in the analysis and examine the results obtained.
2) Expression levels are available at several time points: 3 hrs, 6 hrs, 12 hrs, 24 hrs
and 48 hrs. We would like to find out which sets of transcription factors are
associated with which sets of genes according to when their expression peaks.
There are many possibilities. Gene expression may peak at 3 hrs and decline by
48 hrs, or it may be low at 3 hrs and peak at 48 hrs; experiments 2, 3 and 4
consider these two possibilities. However, there are others: the expression level
may be low at 3 hrs and peak at 24 hrs, or the breast cancer genes may peak at
both 3 hrs and 48 hrs with a minimum expression level at 24 hrs. We leave these
for future work.
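The temporal patterns described in item 2 could be separated with a simple labeling step before mining association rules. The following Python sketch is one possible way to do this, not the method used in this project. The pattern labels, the function names, the gene names and the expression values are all hypothetical; only the five time points come from the text above.

```python
from typing import Dict, List, Sequence

# Time points (in hours) at which expression levels are available.
TIME_POINTS_HRS = (3, 6, 12, 24, 48)


def peak_time(expression: Sequence[float]) -> int:
    """Return the time point (in hours) with the highest expression value."""
    if len(expression) != len(TIME_POINTS_HRS):
        raise ValueError("expected one expression value per time point")
    peak_index = max(range(len(expression)), key=lambda i: expression[i])
    return TIME_POINTS_HRS[peak_index]


def classify_profile(expression: Sequence[float]) -> str:
    """Assign a hypothetical pattern label to one expression profile."""
    peak = peak_time(expression)  # also validates the profile length
    interior = expression[1:-1]
    # High at 3 hrs and 48 hrs, lower at every time point in between.
    if min(expression[0], expression[-1]) > max(interior):
        return "dip"
    if peak == TIME_POINTS_HRS[0]:
        return "early_peak"
    if peak == TIME_POINTS_HRS[-1]:
        return "late_peak"
    return "mid_peak"


def group_by_pattern(profiles: Dict[str, Sequence[float]]) -> Dict[str, List[str]]:
    """Group gene names by the pattern label of their expression profile."""
    groups: Dict[str, List[str]] = {}
    for gene, values in profiles.items():
        groups.setdefault(classify_profile(values), []).append(gene)
    return groups


# Made-up profiles, one value per time point in TIME_POINTS_HRS.
toy_profiles = {
    "gene_a": [9.0, 7.0, 5.0, 3.0, 1.0],  # peaks at 3 hrs, declines by 48 hrs
    "gene_b": [1.0, 2.0, 4.0, 6.0, 9.0],  # low at 3 hrs, peaks at 48 hrs
    "gene_c": [1.0, 3.0, 5.0, 9.0, 4.0],  # low at 3 hrs, peaks at 24 hrs
    "gene_d": [9.0, 4.0, 3.0, 1.0, 8.0],  # high at 3 and 48 hrs, minimum at 24 hrs
}

pattern_groups = group_by_pattern(toy_profiles)
print(pattern_groups)
```

Each group of genes could then be mined separately for the transcription factors associated with it, in the same way the early-peak and late-peak cases were handled in experiments 2, 3 and 4.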
BIBLIOGRAPHY
[1] Meiliu Lu, Csc 209 Class notes, California State University, Sacramento, Fall 2009
[2] S. Fu, and C. Tobin, An Introduction to DNA and Chromosomes, HOPES: Huntington's
Disease Outreach Project for Education at Stanford, 25 April 2006
http://hopes.stanford.edu/n3401/hd-genetics/introduction-dna-and-chromosomes-text-and-audio
[3] Wikipedia contributors, Gene, Wikipedia, The Free Encyclopedia, 10 May 2010
http://en.wikipedia.org/wiki/Gene
[4] Wikipedia contributors, Transcription factor, Wikipedia, The Free Encyclopedia, 15 May 2010
http://en.wikipedia.org/wiki/Transcription_factor
[5] Wikipedia contributors, Enhancer (genetics), Wikipedia, The Free Encyclopedia, 29 April 2010
http://en.wikipedia.org/wiki/Enhancer_(genetics)
[6] Wikipedia contributors, Gene expression, Wikipedia, The Free Encyclopedia, 12 April 2010
http://en.wikipedia.org/wiki/Gene_expression
[7] Jiawei Han and Micheline Kamber, Data mining: Concepts and Techniques, Second Edition,
Morgan Kaufmann Publishers, 2006.
[8] Ziv Bar-Joseph, Georg K Gerber, Tong Ihn Lee, Nicola J Rinaldi, Jane Y Yoo, François
Robert, D Benjamin Gordon, Ernest Fraenkel, Tommi S Jaakkola, Richard A Young, & David K
Gifford, Computational discovery of gene modules and regulatory networks, Nature
Biotechnology, November 2003
[9] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L.
Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S.
Lander, and Jill P. Mesirov, Gene set enrichment analysis: a knowledge-based approach for
interpreting genome-wide expression profiles, Proceedings of the National Academy of Sciences
of the United States of America, 25 October 2005
[10] Karen Lemmens, Thomas Dhollander, Tijl De Bie, Pieter Monsieurs, Kristof Engelen, Bart
Smets, Joris Winderickx, Bart De Moor and Kathleen Marchal, Inferring transcriptional modules
from ChIP-chip, motif and microarray data, Genome Biology, 5 May 2006.
[11] Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H.
Witten, The WEKA Data Mining Software: An Update, SIGKDD Explorations, Volume 11,
Issue 1, 2009.
[12] Wikipedia contributors, Association rule learning, Wikipedia, The Free Encyclopedia, 14
October 2010
http://en.wikipedia.org/wiki/Association_rule_learning