The ENRICH paper

Computational tools
ENRICH - From Expression, Through Annotation, to Function
Goren Tali1 and Manor Ohad1.
Department of Computer Science, The Hebrew University Of Jerusalem, Israel.

Which transcription factor controls stress response in yeast? Which genes regulate amino acid biosynthesis? Do mitochondrial genes have a stronger response to heat shock than peripheral genes? The amount of available biological data is rapidly increasing. Gene annotation, cellular localization, chromatin structure, protein structure, function and interactions, ChIP and gene expression data are overwhelming in their richness of information, but how does one extract meaningful biological conclusions from them?
ENRICH is a computational aid tool aimed at solving these issues. ENRICH offers users an easy interactive interface for uploading their biological data to the program, whether gene annotation tables, experimental gene expression data or other data. These data sets, which we refer to as matrices, can then be manipulated in a variety of ways that we implemented in ENRICH, to allow the user more freedom with the uploaded data.
We also implemented a large variety of statistical tests, both paired and unpaired, that can be performed on the data to find enriched aspects of it, as well as refinement methods such as Bonferroni and FDR (false discovery rate). The result is a tool that can be used to upload various data types, investigate them statistically, and save the biological results.

Computational biology has the potential to revolutionize the understanding of the cell by offering an unprecedented view of the molecular underpinnings of biological phenomena. Computational analysis is essential to transform the masses of generated data into a mechanistic understanding of processes and pathways in the cell.
The project aims at extracting a comprehensive overview from the existing huge amount of information arising from genome-wide analyses. Our goal is to connect information from different systematic global studies and resources to infer biological function, and to check whether the connections found are statistically significant. In addition, we wished to use present biological information and obtainable data to reconstruct novel and previously known biological networks.
For this purpose we developed a computational tool in the form of a computer program, which we call ‘ENRICH’. ENRICH makes it possible to combine heterogeneous data, involving gene annotations, gene expression data, chromatin immunoprecipitation (ChIP), cellular localization analysis, protein-protein interactions and more.
ENRICH handles matrix-based data of two main types: binary and continuous. A binary matrix can be viewed as annotations of any type. A continuous matrix may contain various types of measures over genes, such as expression, ChIP, etc.
ENRICH makes it possible to perform a global statistical analysis of such crossed types of biological information, determine how significant the examined associations are, and find enrichment within different data sets.

1.1 Statistical Hypothesis Testing
It is easier to show that a universal hypothesis is false than to prove that it is true. Therefore, we use statistical hypothesis testing to assess how compatible the observed data are with a given hypothesis. The process of statistical hypothesis testing is usually composed of the following steps:
1. Formulate the null hypothesis, which commonly states that "there is no phenomenon" and that the observations could have arisen through chance. The alternative hypothesis commonly stands in contrast to the null hypothesis, and it is accepted if the observed data values are sufficiently improbable under the null hypothesis. An example of a null hypothesis may be: “There is no connection between the annotation ‘Ribosome Assembly’ and the transcription factor ‘RAP1’ ”. The alternative hypothesis says that the two are indeed related.
2. Identify a test statistic that can be used to assess the validity of the null hypothesis. We will elaborate on the test statistic of each statistical test we implemented in the next section.
3. Compute the P-value, which is the probability that a test statistic at least as extreme as the one observed would be obtained assuming that the null hypothesis were true. The smaller the P-value, the stronger the evidence against the null hypothesis.
4. Compare the P-value to an acceptable significance value (sometimes called an alpha value). If the observed effect is statistically significant, the null hypothesis is rejected in favor of the alternative hypothesis.

2.1 Implementation
Our main desire was to develop an easy-to-use, flexible and consistent application which would reduce otherwise complex tasks to only a few lines of code. We decided to implement ENRICH using the Perl programming language. We had several considerations in choosing so: Perl is a high-level programming language with strong support for arrays and hashes. It has a wide availability of modules and packages related to mathematics and statistics, which were helpful for us. In addition, Perl is freely available to anyone, and many bioinformatics tools are implemented in Perl. A disadvantage of Perl, of which we were aware from the beginning of the work process, is that it is relatively slow (compared to the C programming language) in performing numerical calculations.
ENRICH was developed so that it can be used in three different ways:
• As a command line interactive interpreter, where the user types one command at a time, as seen in figure 1.
• Running the program with an added script of ENRICH commands to be performed.
• Using ENRICH as a module and incorporating ENRICH functions into any Perl program.

2.2 ENRICH input
ENRICH handles tab delimited text files in the structure of a matrix: the first row contains the headers of the columns and the first column contains the headers of the rows. Matrices with two header rows are also handled. Every cell in the matrix contains the data relevant to the specific row and column headers. The data in each cell of a matrix may contain numeric continuous values or binary values (0 or 1).
We will elaborate on some representative types of data sets, which are widely used in the research field of computational molecular biology, and which ENRICH is aimed at processing and analyzing.
• Gene Ontology Annotation (GO) - The GO database provides a useful tool to annotate and analyze the functions of a large number of genes.
• Gene Expression - Results of microarray experiments, which indicate the RNA level of the genome in different conditions.
• ChIP on chip - Data from Chromatin Immunoprecipitation, a technique used for identification of the DNA sequences to which specific proteins bind in vivo.
• Cellular localization - Proteins were marked using different methods, and their sub-cellular localization was checked.
• Protein-protein interactions - Different experiments in which interactions between different proteins in the cell were measured.

2.3 Basic Operations
All input files to ENRICH are in a matrix-based format. We therefore found it useful to implement the following three types of functions:
• Basic necessary matrix manipulations, which include simple mathematical operations over all the values of a matrix. For example: Transpose, LogScale, Normal, Average, Negate, AddCon (add a constant numeric value to all matrix values), etc.
• Convenient and helpful data queries. For example: select rows/columns by name/sum, get row headers, extract P-values, get number of rows/columns, etc.
• Necessary functions for simple use and orientation of ENRICH. For example: Help, Whos (print out all the currently used variables), Load, Save, etc.
Most ENRICH functions are explained in more detail in the Appendix section.
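The input format and two of the basic operations can be illustrated with a minimal sketch. It is written in Python for brevity and is not ENRICH's Perl code; the function names `load_matrix`, `add_con` and `transpose` are hypothetical stand-ins for Load, AddCon and Transpose, and the sketch assumes a single header row and purely numeric cells.

```python
import io

def load_matrix(handle):
    """Parse a tab-delimited matrix: first row holds the column headers,
    first column holds the row headers (assumes one header row only)."""
    lines = [ln.rstrip("\n") for ln in handle if ln.strip()]
    col_names = lines[0].split("\t")[1:]      # skip the corner cell
    rows = {}
    for ln in lines[1:]:
        fields = ln.split("\t")
        rows[fields[0]] = [float(v) for v in fields[1:]]
    return col_names, rows

def add_con(rows, constant):
    """Add a constant to every cell (hypothetical analogue of AddCon)."""
    return {name: [v + constant for v in vals] for name, vals in rows.items()}

def transpose(col_names, rows):
    """Turn rows into columns and vice versa (hypothetical analogue of Transpose)."""
    row_names = list(rows)
    new_rows = {col: [rows[r][j] for r in row_names]
                for j, col in enumerate(col_names)}
    return row_names, new_rows

# Illustrative two-gene, two-condition matrix (made-up values).
example = "gene\theat\tcold\nYAL001C\t1.5\t-0.5\nYAL002W\t0\t2\n"
cols, rows = load_matrix(io.StringIO(example))
t_cols, t_rows = transpose(cols, rows)
```

After the call, `t_rows` is keyed by condition name and `t_cols` lists the gene names, which is the shape a per-condition analysis would need.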
Figure 1. An example of using ENRICH in order to load two datasets, one being ChIP data and the other GO annotation data. Then, a hypergeometric test is performed on these matrices, and the result is refined using the False Discovery Rate correction and saved to a new file in the user's directory.

2.4 Statistical Testing
All statistical tests are performed in ENRICH using the function ‘Enrich()’. The input for performing a statistical test by ‘Enrich()’ is two matrices of the following format:
• Matrix A with n rows and m columns.
• Matrix B with n rows and k columns.
• A and B both contain the same row values (e.g. all the yeast genes).
Identical row sets are not strictly required: if A and B don’t contain exactly the same row values, the ENRICH output refers only to rows which appear in both A and B.
The function ‘Enrich()’ performs the statistical test for every pair of columns (vectors) $x \in A$ and $y \in B$. The output is a matrix C with m rows and k columns in which every cell $C_{x,y}$ contains the P-value that is the result of the statistical test performed on vector x and vector y.
‘Enrich()’ performs a statistical test according to the method specified by the user. ‘Enrich()’ checks the type of values that matrices A and B contain (either decimal or binary) and calls the matching function out of the following three: EnrichBin(), EnrichFlo() or EnrichAnno(). Each function may call a variety of statistical tests, which can operate on the relevant type of values in the matrices. We will briefly explain the main purpose of each statistical test and how it is performed.
EnrichBin –
Performs the test specified by the user on two matrices containing binary values. There are two possible tests for binary matrices: the χ2 test or the hypergeometric test.
a. χ2 Test
A non-parametric test of statistical significance for binary values. The hypothesis tested with χ2 is whether or not two different samples are different enough in some characteristic or aspect of their behavior. If there is a significant difference, we can generalize from our samples that the populations from which our samples are drawn are also different in their behavior or in the tested characteristic. The null hypothesis of this test states that there is no difference among the different sample groups [3].
b. Hypergeometric Test
Describes the number of successes in a sequence of N draws from a finite population without replacement.
Let there be n ways for a ‘good’ selection and m ways for a ‘bad’ selection out of a total of n+m possibilities. Take N samples and let $x_i$ equal 1 if selection i is successful and 0 if it is not. Let X be the total number of successful selections:
$$X = \sum_{i=1}^{N} x_i$$
The probability of i successful selections is then:
$$P(X = i) = \frac{[\#\text{ ways for } i \text{ successes}]\,[\#\text{ ways for } N-i \text{ failures}]}{[\text{total number of ways to select}]} = \frac{\binom{n}{i}\binom{m}{N-i}}{\binom{m+n}{N}}$$
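The hypergeometric probability above can be evaluated directly with binomial coefficients. The following Python sketch uses the same symbols (n good, m bad, N draws, i successes). The upper-tail sum used as an enrichment P-value is an assumption of this sketch; the text only specifies the point probability, and the function names are hypothetical.

```python
from math import comb

def hypergeom_pmf(i, n, m, N):
    """P(X = i): i successes in N draws without replacement
    from a population with n 'good' and m 'bad' items."""
    if i < 0 or i > n or N - i < 0 or N - i > m:
        return 0.0
    return comb(n, i) * comb(m, N - i) / comb(n + m, N)

def hypergeom_upper_tail(i, n, m, N):
    """P(X >= i), assumed here to serve as the enrichment P-value."""
    return sum(hypergeom_pmf(j, n, m, N) for j in range(i, min(n, N) + 1))

# Example: 5 good and 5 bad items, 2 draws, both successful.
p_exact = hypergeom_pmf(2, 5, 5, 2)        # C(5,2) * C(5,0) / C(10,2) = 10/45
p_tail = hypergeom_upper_tail(1, 5, 5, 2)  # 25/45 + 10/45 = 35/45
```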
EnrichFlo – Performs the test specified by the user on two matrices containing decimal values. There are a few possible tests for decimal matrices, which are briefly described below.
Paired Student's t Test
A parametric test to compare two paired groups. Given
two paired sets Xi and Yi of n measured values, the
paired t-test determines whether they differ from each
other in a significant way under the assumptions that the
paired differences are independent and identically normally distributed [2].
Correlations – Correlation is a measure of the relation between two
or more variables [5]. The correlation coefficients values range
from -1 to +1. The value of -1.00 represents a perfect negative
correlation while a value of +1.00 represents a perfect positive
correlation. A value of 0.00 represents a lack of correlation. We
implemented in ENRICH two methods to calculate a correlation
coefficient, Pearson and Spearman, both of which are described below.
Pearson Correlation Test
The Pearson correlation coefficient measures the linear relationship between two variables, and is calculated as follows:
$$r_{xy} = \frac{\mathrm{cov}(y, x)}{\sqrt{\mathrm{var}(y)\,\mathrm{var}(x)}}$$
It determines the extent to which values of the two variables are proportional to each other. That is, the correlation is high if the relationship can be ‘summarized’ by a straight line.
Spearman Correlation Rank Test
The Spearman Correlation Rank Test is a nonparametric
correlation test. The Spearman Rank Correlation can be
thought of as the regular Pearson correlation coefficient,
except that Spearman correlation is computed from the ranks of the values rather than from the values themselves.
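The relation between the two coefficients can be made concrete with a short Python sketch: Spearman's coefficient is obtained by applying the Pearson formula to ranks. The function names are hypothetical, ties are given average ranks (a common convention, assumed here), and the sketch does not compute P-values.

```python
from math import sqrt

def pearson(x, y):
    """Pearson coefficient: cov(x, y) / sqrt(var(x) * var(y)).
    Raises ZeroDivisionError if either vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

# A monotone but non-linear relation: Spearman is 1, Pearson is below 1.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
r_pearson, r_spearman = pearson(x, y), spearman(x, y)
```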
EnrichAnno – Performs the test specified by the user on two matrices: matrix A containing binary values, and matrix B containing decimal values. As in ‘EnrichFlo’ and ‘EnrichBin’, ‘EnrichAnno’ performs the statistical test for every pair of columns (vectors) $x \in A$ and $y \in B$. Every decimal vector y is divided into two independent groups, according to the matching binary values in the binary vector x. There are three possible tests for two independent data sets, which are described below.
a. Unpaired Student's t Test
A parametric test to evaluate the differences in means between two
groups. The P-value reported with a t-test represents the probability of error involved in accepting the research hypothesis about the
existence of a difference. This is the probability of error associated
with rejecting the hypothesis of no difference between the two
categories of observations (corresponding to the groups) in the
population when, in fact, the hypothesis is true [2].
b. Wilcoxon Rank-sum Test (Mann-Whitney-Wilcoxon)
A nonparametric alternative to the unpaired t-test. This test uses only the order of the observations and not their magnitudes. It pools the observations of the two groups and ranks them from smallest to largest. The test then adds all the ranks associated with one of the groups, giving the test statistic [1].
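A rank-sum statistic for two independent groups can be sketched in Python as follows. The function name is hypothetical, ties receive average ranks, and the two-sided P-value uses the normal approximation without a tie correction; these choices are assumptions of the sketch and are not taken from ENRICH.

```python
from math import sqrt, erfc

def rank_sum_test(group1, group2):
    """Return (rank sum of group1, Mann-Whitney U of group1, approximate P-value)."""
    pooled = sorted([(v, 0) for v in group1] + [(v, 1) for v in group2])
    rank_sum1 = 0.0
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values so they share one average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            if pooled[k][1] == 0:
                rank_sum1 += average_rank
        i = j + 1
    n1, n2 = len(group1), len(group2)
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2))   # two-sided normal approximation
    return rank_sum1, u1, p_value

# Fully separated groups: group1 holds the three smallest values.
w, u, p = rank_sum_test([1, 2, 3], [4, 5, 6])
```

For very small groups, as in this example, an exact permutation distribution would be more appropriate than the normal approximation.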
c. Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov test (KS test) tries to determine whether two datasets differ significantly. This test has the advantage of making no assumption about the distribution of the data: it is non-parametric and distribution free. The KS test is only appropriate for testing data against a continuous distribution, such as the normal distribution. It is based on the empirical cumulative distribution function (ECDF) [1].
To compare two empirical cumulative distributions, $S_N(x)$ containing N events and $S_M(x)$ containing M events, the statistic $D_{MN}$ is
$$D_{MN} = \max_{x} \left| S_N(x) - S_M(x) \right|$$
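The statistic $D_{MN}$ can be computed by evaluating both empirical cumulative distributions at every observed value, since the two step functions only change at data points. This is a minimal Python sketch with a hypothetical function name; it returns the statistic only and does not convert it into a P-value.

```python
from bisect import bisect_right

def ks_statistic(sample_n, sample_m):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_n), sorted(sample_m)
    d = 0.0
    for x in a + b:
        s_n = bisect_right(a, x) / len(a)   # fraction of sample_n that is <= x
        s_m = bisect_right(b, x) / len(b)   # fraction of sample_m that is <= x
        d = max(d, abs(s_n - s_m))
    return d

# Completely separated samples give the largest possible statistic.
d_separated = ks_statistic([1, 2, 3], [4, 5, 6])
d_identical = ks_statistic([1, 2, 3], [1, 2, 3])
```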
2.5 Multiple Hypothesis Correction
The statistical analysis that ENRICH performs on a data set typically involves not just a single hypothesis, but rather many. For any
particular test, a pre-set probability α of a type-1 error (i.e., a false
positive, rejecting the null hypothesis when in fact it is true) is
assigned. The problem of multiple comparisons is that we would
like to control the false positive rate not just for any single test but
also for the entire collection of tests that makes up our experiment.
Multiple Hypothesis correction methods attempt to keep the overall chance of getting any false positives at the same level (e.g.
0.05). This is done by the ENRICH function ‘Refine’. It refines a
P-value matrix with a cut off specified by the user. The refinement
is performed according to the specified method out of two possibilities: The Bonferroni correction or FDR.
The Bonferroni correction
The Bonferroni correction is a multiple-comparison correction
used when several dependent or independent statistical tests are
being performed repeatedly. If a particular outcome of an experiment is unlikely to happen, the fact that the experiment is repeated
multiple times will increase the probability that the outcome appears at least once [4].
While a given α value may be appropriate for each individual
comparison, it is not for the set of all comparisons. In order to
avoid a lot of false positives, the α value needs to be lowered to
account for the number of comparisons being performed.
When m hypotheses are tested, the Bonferroni correction tests each null hypothesis at level α/m, regardless of the outcomes of the others.
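As a small Python illustration of the α/m rule (the function name is hypothetical and this is not ENRICH's ‘Refine’ implementation):

```python
def bonferroni(p_values, alpha=0.05):
    """Flag each test as significant only if its P-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# With three tests and alpha = 0.05, the per-test level is about 0.0167.
flags = bonferroni([0.001, 0.02, 0.04], alpha=0.05)
```

Only the first P-value passes here, although all three would be called significant at the uncorrected level of 0.05.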
False Discovery Rate Correction
The FDR is the fraction of false positives among all tests declared
significant. The motivation for using the FDR is that we may be
running a very large number of tests, with those being declared
significant being subjected to further studies. An example is searching for differentially expressed genes in a certain microarray experiment. The set of all genes in this experiment is obviously huge, and
we want to find the significantly differentially expressed genes. The
idea is that the statistical procedure results in a significant enrichment of differentially expressed genes, controlling the fraction of
false positives within the enriched setting by specifying a value for
the FDR. Choosing an FDR of 5% means that (on average) 5% of
the genes we picked as being significant are actually false positives
(and 95% of those genes declared significant do indeed have differential expression). Hence, screening genes with an FDR of 5%
results in a significant enrichment of genes that are truly differentially expressed [6].
Suppose a total of N hypotheses are tested, S of which are judged significant (by the criteria being used for each test). If we had complete knowledge, we would know that n of the hypotheses have the null hypothesis true and m = N - n have the alternative hypothesis true, and we might find that F of the true nulls were called significant, while T of the true alternatives were called significant, as can be seen in table 1.

                     Called significant    Called not significant    Total
Null true            F                     n - F                     n
Alternative true     T                     m - T                     m
Total                S                     N - S                     N
Table 1. The FDR multiple hypothesis correction method

For this experiment, the false discovery rate is the fraction of tests called significant that are actually true nulls, FDR = F/S. This is
$$\mathrm{FDR} = \frac{\#\,\text{false positives}}{\#\,\text{positives}} = \frac{F}{F + T}$$
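One widely used way to control the FDR at a chosen level is the Benjamini-Hochberg step-up procedure. The text does not say which FDR procedure ENRICH's ‘Refine’ uses, so the following Python sketch is an assumption, and the function name and the parameter `q` are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up: find the largest rank r such that the
    r-th smallest P-value is at most r * q / m, and call the r smallest
    P-values significant."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant

# The smallest P-value alone misses its own threshold (0.0125), but the
# third-smallest passes its threshold (0.0375), so all three are accepted.
flags = benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05)
```

The example shows the step-up behavior: the decision for a P-value depends on its rank among all tests, not only on its own value.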
In order to test the use of ENRICH, we decided to try to create a partial regulation network for the yeast S. cerevisiae. We picked a few key biological processes in the cell's life, namely response, metabolism and ubiquitination, and used ENRICH as described in figure 1 in order to gain insight into the transcription factors that are involved in these processes.
The process shown in figure 2 uses two sets of data represented
in two binary matrices. The first matrix is a result of a ChIP analysis done in Rick Young’s lab, which shows for every known yeast
transcription factor all the genes it binds. The other matrix is a
Gene Ontology (GO) matrix, representing for each gene the GO
annotations that it holds.
Using these two matrices, ENRICH creates a new matrix of p-values, where each cell represents the p-value obtained from performing a statistical significance test (such as a hypergeometric test) over two columns from the previous matrices.
The p-values are then refined according to the multiple hypothesis correction, resulting in a decimal matrix, which is then transformed into a binary one using a significance threshold. The resulting matrix has a binary value for every pair of transcription factor and GO annotation, being one if there is a strong enrichment between them or zero if not.
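The flow from two binary matrices to a binary enrichment matrix can be sketched in Python as follows. All names are hypothetical, the toy data are made up, and the sketch assumes an upper-tail hypergeometric P-value followed by a Bonferroni threshold; the text leaves the choice of test and correction to the user.

```python
from math import comb

def upper_tail(i, n, m, N):
    """P(X >= i) for a hypergeometric variable (n good, m bad, N draws)."""
    total = comb(n + m, N)
    return sum(comb(n, j) * comb(m, N - j)
               for j in range(i, min(n, N) + 1)) / total

def enrich_binary(cols_a, cols_b):
    """P-value for every pair of binary columns; rows (genes) are aligned."""
    p_matrix = {}
    for name_a, x in cols_a.items():
        for name_b, y in cols_b.items():
            overlap = sum(1 for a, b in zip(x, y) if a and b)
            n = sum(y)            # genes carrying the annotation
            m = len(y) - n        # genes without it
            N = sum(x)            # genes bound by the transcription factor
            p_matrix[(name_a, name_b)] = upper_tail(overlap, n, m, N)
    return p_matrix

def to_binary(p_matrix, alpha=0.05):
    """1 where the P-value passes the Bonferroni-corrected threshold, else 0."""
    cutoff = alpha / len(p_matrix)
    return {pair: int(p <= cutoff) for pair, p in p_matrix.items()}

# Toy data: 8 genes, one transcription factor, two annotations.
chip = {"TF1": [1, 1, 1, 1, 0, 0, 0, 0]}
go = {"stress": [1, 1, 1, 1, 0, 0, 0, 0], "other": [0, 0, 0, 0, 1, 1, 1, 1]}
p_values = enrich_binary(chip, go)
network_edges = to_binary(p_values)
```

In this toy case the pair (TF1, stress) has a P-value of 1/70 and is kept as an edge, while (TF1, other) has no overlap and is dropped.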
Then, ENRICH is used to select all the GO annotations related to a certain process, e.g. response. The GO annotations are selected such that there is minimum overlap between them (avoiding, for example, selecting both “response to stress” and “response to stress conditions”), and the selected GO annotations and the enriched transcription factors are then visualized by DOT/NEATO as a directed graph (or network).
The resulting networks can be seen in figures 3 and 4. A closer look reveals a set of transcription factors that ENRICH found to be associated with each process; each transcription factor is also associated with a sub-process within that process.
In order to evaluate the quality of ENRICH's predictions of sub-processes and transcription factors, we can look at the predictions for the following processes.
3.1.1. Case study of results for Response process
MSN2 is a transcriptional activator related to Msn4p; it is activated in stress conditions and binds DNA at stress response elements of responsive genes, inducing gene expression [7,8]. ENRICH has predicted it to be associated with two sub-processes of response, namely response to stimulus and response to stress, which is reassuring.
MCM1 is a transcription factor involved in cell-type-specific transcription and pheromone response [9-11], and ENRICH has predicted it to be involved in two response sub-processes that involve the response to pheromone.
STE12, TEC1 and DIG1, which were predicted by ENRICH to be involved in response to pheromone induction, are known to induce mating and growth in response to pheromone induction [12,13].
MBP1 is a known transcription factor involved in regulation of cell cycle progression from G1 to S phase [14], which usually indicates a crucial point of regulation, and ENRICH has predicted it to be involved in response to DNA damage, which seems very plausible.
CAD1 is a transcription factor known to be involved in iron metabolism [15]; considering that iron is an inorganic substance, it is consistent that ENRICH placed CAD1 as connected to response to inorganic substance.
YAP7 is a putative transcription factor of unknown role in yeast, yet ENRICH strongly places it as involved with response to chemical and abiotic stimulus. Considering that all the other TFs we discussed were well categorized by ENRICH, we believe this case to be the same, although further investigation is necessary.
3.1.2. Case study of results for Metabolism process
A representative set of transcription factors from the results of the metabolism process gives us some encouragement about the rest of them. GCR1 and GCR2 are known transcription factors involved in glycolysis [16-18]. THI2 is known to be involved in the biosynthesis of thiamine [19,20]; MET4, MET31 and MET32 in the biogenesis of sulfur amino acids [21-23]; and GAT1 and GAT3 in nitrogen compound metabolism [24-26].
3.1.3. Case study of results for Ubiquitin related processes
In the case of the four proteins ENRICH found to be involved in the ubiquitin cycle, things are less decisive. Apart from RPN4, which is a transcription factor that stimulates expression of proteasome genes [27-28] and is therefore highly connected to the ubiquitin cycle, the rest of the proteins seem less connected according to the literature. REB1 is an RNA polymerase enhancer protein [29], RCS1 is involved in iron utilization [30,31] and ADR1 regulates alcohol related genes and peroxisomal proteins [32]. Yet they were the genes ENRICH found to be highly enriched with the ubiquitin annotation; we therefore suggest that each one of them plays a role of some sort in the ubiquitin cycle.
Response versus localization
During our work we asked ourselves the following biological question: do mitochondrial genes react differently to heat shock conditions than peripheral genes do?
In order to answer that question we used ENRICH in the way described in figure 5, where we use localization information and experimental annotations to obtain a result matrix in which we can see, for various heat shock conditions, the reaction intensity for mitochondrial genes versus peripheral genes.
The results can be seen in figure 6, where a dark cell represents a high p-value, meaning that there was no significance for those genes, and a red cell represents a high enrichment for that gene group during the relevant condition. The results clearly show that genes located in the mitochondria react much more strongly to heat shock than genes located in the cell periphery, an interesting biological insight gained very easily by using ENRICH with just a few commands. When looking at the literature about this subject, we have not found anything decisive, but the fact that there are mitochondrial HSPs (heat shock proteins) and the fact that the mitochondria are involved in apoptotic pathways perhaps strengthen the results. Still, further investigation of the issue remains to be done.
Figure 2. On the left, the two data sets used in the process: ChIP data and GO annotation data. A hypergeometric test is done in order to check enrichment for every pair of columns. The result is a p-value matrix where each cell, such as the yellow one, is the result of the test over the two columns marked in purple. Then, a multiple hypothesis correction is done and, using a significance threshold, a binary matrix is created where each cell indicates whether there is an enrichment between the two columns.
Figure 3. The regulatory network created by ENRICH (using Neato) for the Metabolism process. The transcription factors are filled with light green and the sub-processes are dark green.
Figure 4. The regulatory network created by ENRICH (using Neato) for the Response and Ubiquitin processes. The transcription factors are empty and the sub-processes are dark green.
Figure 5. On the left, the two data sets used in the process: experimental expression data and experimental localization data. An unpaired t-test is performed in order to check enrichment for every pair of columns, resulting in a p-value matrix. That matrix is then transformed to binary using a threshold, resulting in the top matrix, where each cell represents a specific experiment vs. a cell compartment. The yellow columns are the columns of the mitochondria and the cell periphery, which interest us. The bottom matrix is an experiment vs. conditions matrix, in which every row is a different experiment and every column is a different condition performed; for this query, we choose only the various heat shock conditions in this matrix. Then a hypergeometric test is performed in order to find how enriched these cell compartments are with respect to heat shock conditions.
Figure 6. The resulting table of the process described in figure 5. The Experiment column indicates which kind of heat shock condition was used in the related experiments; the Cell Periphery and Mitochondrion columns show the enrichment for genes located in the cell periphery and mitochondria respectively. A red cell indicates high enrichment whereas a black cell indicates low enrichment. From these results it is clear that the mitochondrial genes have a much more powerful response to heat shock conditions than do peripheral genes.

In this project we presented ENRICH, a program we created, aimed at aiding researchers in performing various statistical significance tests. We spent a large portion of the work on writing and cleaning the code itself, and trying to make it as convenient as possible. We then moved on to test our program on real data sets; we used various data types such as ChIP data from Rick Young’s lab, experimental expression data from David Botstein’s lab and others, localization data and more. Our first aim was to try to create a yeast regulation network, not of transcription factors and their targets, but rather of processes and the transcription factors involved in each process and its sub-processes.
The results and the way they were obtained show the power of ENRICH as an easy-to-use tool which we feel can be very useful for gaining biological insights. It still lacks many features, among them visualization and incorporated clustering methods. However, we strongly feel that it can be developed into a strong and useful computational aid tool for researchers.

We would like to thank Prof. Nir Friedman and his Ph.D. student Tommy Kaplan for guiding us through the project.
We will now elaborate on some ENRICH functions in more detail.
Enrich Orientation:
• Help - Print the documentation of all functions or of a specific function.
• Whos - Print all the currently used variables of the user.
• Load - Load a new file into the program and return a data handle of this file.
• Save - Save into a file the data of a given handle.
Mathematical Operations:
• AddCon - Add a constant numeric value to all values in all cells in the matrix.
• Transpose - Turn rows into columns and vice versa.
• LogScale - Convert all the matrix values into logarithmic scale values in any base. Presentation of data on a logarithmic scale can be helpful when the data covers a large range of values; the logarithm reduces this to a more manageable range.
• Normal - Normalize all matrix values according to the standard normal distribution. Standard means the expected value is 0 and the variance is 1.
• AvgR/AvgC - Calculate the average value of the rows/columns respectively.
• Convert2bin - Convert the numeric values of a matrix to binary values using a given threshold (assigning 1 to values above the threshold and 0 to values below the threshold).
• Negate - Reverse the sign of all the values in the matrix.
Data Queries:
• SelectRowsByName/SelectColsByName - Select from the matrix only rows/columns respectively, whose headers either contain or match exactly a specific word.
• SelectRowsBySum - Select from the matrix only rows whose numeric values sum up over a certain threshold.
• GetGenes - Return all the matrix row headers.
• GetRowsNum/GetColsNum - Return the number of rows/columns respectively of the matrix.
• ExtractPvals - Extract and write to a file the column headers that received a P-value below a specified threshold.
Additional information and software, such as the ENRICH Perl script, the EnrichFunctions Perl module, a running example and other results and figures, are all available at the ENRICH web site