Selecting Informative Genes from Microarray Dataset by Incorporating Gene Ontology

Xian Xu
Aidong Zhang
State University of New York at Buffalo
Department of Computer Science and Engineering
Buffalo, NY 14224, USA
{xianxu, azhang}@cse.buffalo.edu
Abstract
Selecting informative genes from microarray experiments is one of the most important data analysis steps for deciphering the biological information embedded in such experiments. However, due to the characteristics of microarray technology and the underlying biology, namely the large number of genes and the limited number of samples, the statistical soundness of gene selection algorithms becomes questionable. One major problem is the high false discovery rate. A microarray experiment is only one facet of our current knowledge of the biological system under study. In this paper, we propose to alleviate this high false discovery rate by integrating domain knowledge into the gene selection process. Gene Ontology (GO) represents a controlled biological vocabulary and a repository of computable biological knowledge. It has been shown in the literature that gene ontology-based similarities between genes carry significant information about their functional relationships [3]. Integrating such domain knowledge into gene selection algorithms enables us to remove noisy genes intelligently. We propose an add-on algorithm, applicable to any single gene-based discriminative score, that integrates domain knowledge from Gene Ontology annotation. Preliminary experiments are performed on the publicly available colon cancer dataset [2] to demonstrate the utility of integrating domain knowledge for the purpose of gene selection. Our experiments show that the GO adjusted scores consistently select fewer random genes than the original scores.
1 Introduction
Cancer classification traditionally relies upon the morphological appearance of tumors. However, there has been increasing interest in classifying tumors molecularly with the advent of DNA microarray technologies for large-scale transcriptional profiling [13]. A typical DNA microarray study utilizes several DNA microarray chips on different tissue samples and generates a numerical array with thousands of rows (genes) and tens of columns (experiments/DNA chips). Sample classification can be performed by treating gene expression levels as features describing the molecular state of the tissue in question. Various off-the-shelf classification algorithms from the machine learning literature have been used for this classification task.
Compared to datasets from other disciplines, DNA microarray datasets bear unique characteristics. The foremost trait of such datasets is the overabundance of features and the lack of samples, which renders the results of any classification algorithm less statistically significant. Using excessive features normally degrades the performance of a machine learner [11] while increasing computational cost at the same time. It is also much more difficult for biologists to understand classification results when the expression levels of thousands of genes are used for sample class prediction. As a result, gene selection is generally performed before a microarray dataset is fed to classification algorithms [2, 4, 5, 12, 15, 16]. However, the statistical soundness of using feature selection algorithms on microarray datasets becomes questionable. On one hand, the threshold on the selected discriminative scores needs to be adjusted for multiple selections, which has already attracted a lot of research interest [9]. On the other hand, the selected genes are not always biologically relevant, due to the nature of microarray experiments and the small sample size. When the sample size is limited, even random noise can result in significant discriminative scores. We will give an example of this in the next section.
In this paper, we propose to use gene ontology (GO) [8] annotation to alleviate the problems of gene selection on small-sized samples. GO annotation represents a large repository of biological knowledge that is accessible to computational algorithms. The intuition behind our work is simple: while it is likely that even random gene expression can achieve relatively high discriminative scores when the number of samples is limited, it is less likely that several random genes annotated with the same GO term all have relatively high discriminative scores. Gene ontology has been incorporated into several microarray data analysis and visualization algorithms/tools for various purposes, such as cluster validation [6] and visualization of the distribution of scores over GO terms [7]. The authors of [3, 14] concluded that GO-driven similarity and expression correlation are significantly interrelated. Another avenue of research focuses on detecting over-represented GO terms in a set of co-expressed genes [1].
Our algorithm first examines, for each GO term, whether the genes annotated with it have statistically higher discriminative scores. This is an indication of correlation between the corresponding GO term and the sample class labels. We call these GO terms "informative". Informative genes are then selected from genes that are annotated with such informative GO terms and that yield high discriminative scores at the same time. Our algorithm is a generic add-on that can be attached to many if not all gene selection algorithms.
This paper is organized as follows. In Section 2, we demonstrate the inherent problem of gene selection using single gene discriminative scores. In Section 3, our new GO based gene selection algorithm is detailed. In Section 4, experimental results on the publicly available colon cancer dataset are reported. We conclude this paper and give directions for future work in Section 5.
2 The Problem of Gene Selection on Small-Sized Samples
In this section, we give an example of the problems facing gene selection from limited samples. Alon et al. [2] published their experimental results on colon cancer. The dataset consists of 62 microarray experiments on normal and colon cancer tissues. The expression levels of 2000 genes are monitored. After log-transformation, the expression levels in this dataset range from 0.76 to 4.32 with a mean of 2.30 and a standard deviation of 0.49. This is a commonly used benchmark dataset for data analysis algorithms on microarray datasets.
Let us examine the statistical significance of gene selection on such datasets. We use four approaches to assess the false discovery rate of gene selection algorithms based on t-scores, by calculating t-scores for randomly generated gene expression arrays. In the first two approaches, we generate random expression arrays of the same size (2000 × 62) from a uniform/normal distribution with the same parameters as the original log-transformed dataset. In the third approach, we generate from the empirical distribution of the original log-transformed data, treating its histogram as an approximate probability density function. Finally, we generate random expression arrays by randomly reassigning the sample class labels.
The cutoff t-scores for choosing the top 50, 100 and 200 genes from the original log-transformed dataset are 3.82, 3.23 and 2.7, respectively.

Table 1. Gene selection on random datasets. Each entry shows the number of genes that score greater than the cutoff t-score in each dataset (original and randomly generated).

# of Genes   Cutoff t-score   Orig   Rand. uni     Rand. nor      Rand. emp      Rand. rel
Top 50       3.82             50     1.06 ± 1.04   0.81 ± 0.89    0.71 ± 0.83    0.86 ± 6.39
Top 100      3.23             100    5.32 ± 2.28   4.69 ± 2.19    4.475 ± 1.97   4.50 ± 22.35
Top 200      2.7              200    21 ± 4.41     19.55 ± 4.43   19.5 ± 4.04    20.98 ± 62.32

Experimental results summarized in Table 1
show the number of random genes having larger-than-cutoff
t-scores in each case. Those random genes would have been
selected by t-score based algorithms. All the random experiments are repeated 1000 times.
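The following is a minimal sketch (not the authors' code) of this randomization check. It assumes the usual 40 tumor / 22 normal split of the 62 samples; the gene count, mean, standard deviation, number of repetitions, and cutoffs follow the text and Table 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_genes, n_samples = 2000, 62
labels = np.array([1] * 40 + [0] * 22)          # assumed 40 tumor / 22 normal samples

def abs_t_scores(expr, labels):
    """Absolute two-sample t-score for every gene (rows of expr)."""
    t, _ = stats.ttest_ind(expr[:, labels == 1], expr[:, labels == 0], axis=1)
    return np.abs(t)

cutoffs = {"top 50": 3.82, "top 100": 3.23, "top 200": 2.7}   # cutoffs from Table 1
counts = {name: [] for name in cutoffs}
for _ in range(1000):                            # 1000 repetitions, as in the text
    random_expr = rng.normal(2.30, 0.49, size=(n_genes, n_samples))
    t = abs_t_scores(random_expr, labels)
    for name, cutoff in cutoffs.items():
        counts[name].append(int(np.sum(t > cutoff)))

for name in cutoffs:
    print(name, round(np.mean(counts[name]), 2), "+/-", round(np.std(counts[name]), 2))
```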
From these experiments we observe that even randomly generated expression levels may result in high discriminative scores. Suppose we merge the original dataset with a random dataset, resulting in a 4000 gene by 62 sample expression matrix. If we were to choose the top 200 genes by t-score from such a merged dataset, more than 10% of the selected genes would come from the randomly generated portion of the data. The situation becomes even worse if we add more random genes or if the number of samples is smaller. When 8000 random genes are added to the merged dataset, more than 30% of the top-300 gene list comes from the random genes.
3 Integrating Biological Knowledge into the Gene Selection Process
One way to overcome this apparent drawback of the feature selection process is the integration of domain knowledge. Biologists have long been doing this: gene selection is only the starting point of many biological studies, and genes may be added to or removed from the selected gene set at a later stage depending on other biological evidence. However, to our surprise, not many feature selection algorithms proposed in the literature actually utilize domain knowledge, although a plethora of computable biological knowledge bases are already available online.
In this section, we propose an add-on algorithm for existing single gene-based gene selection algorithms. Given a single gene-based discriminative score (of the sample class labels), our algorithm processes the score using the biological information contained within GO annotation. This results in a new class of discriminative scores prefixed with the name "GO adjusted".
Definition 1 Let F be a single gene-based discriminative score. Informative genes are those genes g having discriminative scores larger than θ, i.e., F(g) > θ.
Definition 2 The discriminative power of a GO term go is defined as the percentage of informative genes among all genes that are annotated with that GO term. Here g ∈ go denotes that gene g is annotated by GO term go, and |go| denotes the number of genes annotated by GO term go.

DP(go) = |{g | g ∈ go ∧ F(g) > θ}| / |go|
Definition 3 An informative GO term is a GO term go whose discriminative power is larger than γ and for which the number of informative genes annotated with go is larger than β:

DP(go) > γ and |{g | g ∈ go ∧ F(g) > θ}| > β
Definition 4 The GO adjusted discriminative score is defined using a single gene discriminative score F(g) and a GO term go, where g ∈ go:

Fa(g, go) = F(g) if go is informative, and Fa(g, go) = 0 otherwise.
Definition 5 The best GO adjusted discriminative score of a gene is the best GO adjusted discriminative score over all possible GO annotations of that gene:

Fb(g) = max{Fa(g, go) : go such that g ∈ go}
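As an illustration only, the following sketch implements Definitions 1–5 in Python. It assumes the score F is given as a dict mapping genes to scores and the annotation as a dict mapping GO terms to sets of annotated genes; the names (F, annotation, theta, gamma, beta) are ours, not from the paper's implementation.

```python
def informative_genes(F, theta):
    """Def. 1: genes whose score exceeds the threshold theta."""
    return {g for g, score in F.items() if score > theta}

def discriminative_power(genes_of_go, informative):
    """Def. 2: fraction of informative genes among the genes annotated by a GO term."""
    return len(genes_of_go & informative) / len(genes_of_go) if genes_of_go else 0.0

def informative_go_terms(annotation, informative, gamma, beta):
    """Def. 3: GO terms with DP(go) > gamma and more than beta informative genes."""
    return {go for go, genes in annotation.items()
            if discriminative_power(genes, informative) > gamma
            and len(genes & informative) > beta}

def best_go_adjusted_score(g, F, annotation, info_terms):
    """Defs. 4-5: F(g) if any GO term annotating g is informative, otherwise 0."""
    return F[g] if any(g in annotation[go] for go in info_terms) else 0.0
```

Under the paper's assumption that F is positive, taking the maximum of Fa(g, go) over all annotating terms reduces to the any() test above.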
Given a single gene-based discriminative score F and
three parameters θ, β, γ, our algorithm calculates a modified single gene-based discriminative score named “best
GO adjusted score.” The basic idea behind our algorithm
is straightforward. While the expression levels of random
genes may correlate with sample class labels by chance, it
is far less likely that a majority of these random genes will
also have common GO annotation. In other words, it is far
less likely that those random genes will have valid biological connections, either participating in the same biological
processes, or manifesting the same molecular functions, or
being found in the same cellular components.
Informative genes are defined by Def. 1 to be those genes
whose single gene discriminative scores F pass threshold θ.
Discriminative power of a GO term is defined by Def. 2 as
the percentage of informative genes among all genes that
are annotated with the GO term in question. Discriminative power of a GO term with respect to sample class labels
measures the collective discriminative power of genes annotated with that GO term. This in turn measures how different biological processes, cell components and molecular
functions are affected under different experimental conditions. The higher the discriminative power of a GO term, the more strongly that GO term is correlated with the sample class labels. The value of θ has the same range as the corresponding single gene discriminative score; it reflects the user's estimate of what a significant score is for F. We further call a GO term an informative GO term if it satisfies the two conditions in Def. 3. First, more than β informative genes need to be annotated by the GO term in order for it to be called informative. Second, the percentage of informative genes among all genes annotated by the GO term needs to surpass the threshold γ. These two criteria are set
to fend off the effect of random genes. We will discuss how
to choose these parameters later.
The single gene-based discriminative score F is then modified according to the discriminative power of GO terms. To get rid of noisy genes, informative genes are selected only from informative GO terms. We essentially strengthen single gene-based scores if a significant number of other genes that share known biological annotation with the given gene are also discriminative of the sample class labels. We define the "GO adjusted discriminative score" Fa(g, go) according to Def. 4. The score is 0 if the annotating GO term is non-informative; otherwise it is the same as the single gene discriminative score, i.e., Fa(g, go) = F(g). Here we assume the single gene-based discriminative score is positive and that the larger the score, the more discriminative the corresponding gene. Each gene product is potentially annotated with multiple GO terms. We define the "best GO adjusted score" Fb(g) to be the best "GO adjusted score" for a gene among all its annotations. We assume the transitivity of gene annotation in this work. If the direct annotating GO term of a gene is not informative, the gene may still be considered if any parent GO terms of the direct annotating GO term are informative.
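A hedged sketch of how this transitivity assumption could be realized: each gene's direct annotations are propagated to all ancestor GO terms before the scores above are computed. The `parents` mapping (GO term to its direct parent terms in the GO DAG) is an assumed input; this is an illustration, not the authors' implementation.

```python
def propagate_annotations(gene_to_terms, parents):
    """Return gene -> set of GO terms including all ancestors of each direct term."""
    def collect_ancestors(term, seen):
        for p in parents.get(term, ()):        # walk up the DAG, avoiding repeats
            if p not in seen:
                seen.add(p)
                collect_ancestors(p, seen)

    propagated = {}
    for gene, terms in gene_to_terms.items():
        full = set(terms)
        for t in terms:
            collect_ancestors(t, full)
        propagated[gene] = full
    return propagated
```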
Details of our algorithm and a complexity analysis are omitted from this paper due to space concerns. For details
please refer to our technical report.
4 Experiment
Experiment Setup Our primary concern is the false discovery rate of any given single gene-based algorithm. We measure this false discovery rate by repeating experiments on randomized datasets. Our experimental dataset consists of the original dataset, obtained from publicly available data sources, and a random portion generated as noise. For each single gene-based algorithm, we measure the percentage of selected genes that come from the randomized portion of the data and use this number as an estimate of the false discovery rate.
The original dataset is appended with several repetitions of blocks of random data. Each random block has the same number of genes and the same number of experiments as the original dataset. The expression levels in each random block are generated from a normal distribution with the same parameters as the original log-transformed dataset. Each gene in a random block corresponds to a real gene in the original dataset and shares its GO annotations. We call the number of random blocks the random size of a given experiment. For each experimental setup, we generate 200 sets of random data and report the average number of false positives.
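A minimal sketch of this setup, under the assumptions above (normal random blocks with mean 2.30 and standard deviation 0.49, each random gene shadowing a real gene); function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def add_random_blocks(expr, random_size, rng, mean=2.30, std=0.49):
    """expr: genes x samples matrix. Append `random_size` random blocks.

    Row n_genes*k + j of the result is a random gene that shadows real gene j
    and (in the full pipeline) inherits its GO annotations.
    """
    n_genes, n_samples = expr.shape
    blocks = [rng.normal(mean, std, size=(n_genes, n_samples))
              for _ in range(random_size)]
    augmented = np.vstack([expr] + blocks)
    is_random = np.r_[np.zeros(n_genes, bool), np.ones(n_genes * random_size, bool)]
    return augmented, is_random

def false_positives(selected_rows, is_random):
    """Count selected genes that came from a random block."""
    return int(np.sum(is_random[np.asarray(selected_rows)]))
```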
Data Preparation The data used in this work are all publicly available. We chose the widely used benchmark microarray dataset on colon cancer [2]. For gene ontology, we downloaded a copy from the GO web site on 10/15/2004. We collected GO annotation for the genes used in the Alon colon cancer microarray experiment from the SOURCE [10] online database on 11/1/2004. We do notice that the SOURCE annotation may have been updated since; however, the dataset suffices to test our algorithm.
Not all genes in the original dataset have GO annotation in the SOURCE database, but a decent majority of them have been annotated. For the colon cancer dataset, out of the original 2000 genes, we found that 1495 have been annotated, with 9137 GO annotations in total; that is, on average, a little more than 6 GO annotations per gene. Since our algorithm relies on gene ontology to provide the necessary biological insight, we choose only those genes that currently have GO annotation for further analysis. We expect more and more GO annotations to become available in the future.
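As a small illustration (not the authors' code), this filtering step could look like the following, assuming a mapping from genes to their GO term sets, e.g. assembled from SOURCE exports.

```python
def annotated_subset(expr_by_gene, gene_to_terms):
    """Keep only genes that have at least one GO annotation."""
    return {g: expr for g, expr in expr_by_gene.items() if gene_to_terms.get(g)}
```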
Result Tables 2 and 3 summarize our experimental results on the colon cancer dataset. They show the false positive rate, measured as the number of random genes chosen by four single gene-based gene selection scores: S2N (signal-to-noise ratio), t-score, and their GO adjusted counterparts. For each experiment we also show the number of overlapping genes, i.e., genes that are selected by both the original single gene-based discriminative score and its GO adjusted counterpart. From these tables, we observe that the GO adjusted scores perform consistently better, choosing about 10% fewer random genes as informative genes. The trends shown in the tables are clear: the more random genes included, the more false positives; the more genes selected, the more false positives. We generally have no knowledge of, or control over, how many genes included in a microarray dataset are random with respect to the sample class labels. However, we do know now that selecting excessive informative genes from a microarray dataset becomes troublesome. We also report the number of genes selected by both the original discriminative score and
the best GO adjusted score.

Table 2. Performance of GO adjusted scores, measured as false positives, using the S2N score.

Rand. Size   # of Genes   S2N score   GO S2N score   Diff   Gene Overlap
1            50           1.40        1.15           0.25   42.45
2            50           2.40        2.15           0.25   39.05
3            50           4.15        3.75           0.40   39.85
4            50           4.05        3.55           0.50   28.64
1            100          5.30        4.09           1.21   68.63
2            100          8.78        7.57           1.21   68.06
3            100          17.24       14.84          2.40   56.38
4            100          15.02       12.55          2.47   59.63
1            200          19.10       17.45          1.65   167.85
2            200          33.15       30.30          2.85   153.90
3            200          43.10       39.45          3.65   144.10
4            200          49.60       46.90          2.70   132.65

Table 3. Performance of GO adjusted scores, measured as false positives, using the t-score.

Rand. Size   # of Genes   t-score   GO t-score   Diff   Gene Overlap
1            50           1.40      1.20         0.20   41.85
2            50           2.90      2.25         0.65   38.60
3            50           3.95      3.30         0.65   38.10
4            50           4.90      4.25         0.65   38.55
1            100          5.15      4.26         0.89   66.28
2            100          9.06      7.38         1.68   74.06
3            100          18.11     16.49        1.62   60.14
4            100          15.58     13.20        2.38   67.76
1            200          19.40     17.60        1.80   166.35
2            200          31.95     30.15        1.80   155.30
3            200          44.70     42.30        2.40   136.75
4            200          53.35     49.65        3.70   129.45

The percentage of overlapping
genes ranges from 60% to 80%, indicating that the majority of genes selected by the original single gene-based algorithms also cluster in GO annotation.
Now we describe how we chose the three parameters required by the GO adjusted scores, θ, β, and γ, in these experiments. For θ, we set it to the cutoff score of the selected genes. This essentially states that if the user is to choose the top 100 genes, we deem the top 100 ranked genes informative. Under the transitivity assumption of GO annotation, the virtual top GO node in our algorithm annotates every gene in question. Let the ratio between the number of informative genes and the number of genes in the virtual top GO node be γ0; experiments show good results when γ is set to 1.4–1.6 times γ0. When γ is set to γ0, the GO adjusted scores revert to the original single gene-based discriminative score, since the virtual top GO node becomes informative. As γ increases, we take the information in the GO annotation into account more assertively. In our experiments β is set to 1; we tested β values from 0 to 2, and the results are similar.
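A small sketch of this parameter choice, assuming F is available as a dict over the annotated genes; the names and the 1.5 factor (within the paper's 1.4–1.6 range) are illustrative.

```python
def choose_parameters(F, annotated_genes, k, gamma_factor=1.5, beta=1):
    """theta: cutoff so the top-k genes are informative; gamma: multiple of gamma_0."""
    scores = sorted((F[g] for g in annotated_genes), reverse=True)
    theta = scores[k]                  # just below the k-th best score, so F(g) > theta for the top k
    n_informative = sum(1 for g in annotated_genes if F[g] > theta)
    gamma0 = n_informative / len(annotated_genes)   # DP of the virtual top GO node
    return theta, gamma_factor * gamma0, beta
```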
5 Conclusion and Future Work
Gene selection plays an important role in the analysis of microarray datasets. Genes that are expressed differently under different sample conditions are selected for further biological investigation. However, due to the limited sample size and the complex underlying biology, gene selection algorithms are haunted by excessive false positive rates. In this work, we proposed to integrate the biological domain knowledge embedded in gene ontology and its annotation into the process of gene selection. To the best of our knowledge, this is the first attempt at such integration. Our experimental results show that this
is a promising direction. Using gene ontology and its annotation, the probability of selecting random genes in our experiments is reduced by more than 10% on average. Our algorithm is a wrapper around most if not all single gene-based discriminative scores. It behaves differently with different choices of the parameter γ, taking GO annotation into consideration in the feature selection process to varying degrees. This provides an interesting way to integrate "old" knowledge in the Gene Ontology with new information from microarray experiments.
We ignored genes without GO annotation in this work for simplicity. Although the majority of genes (75%) in our study are annotated by at least one GO term, genes that are not currently annotated may be of interest nonetheless. For these genes, traditional discriminative scores can be used instead. The best GO adjusted score defined by Def. 5 is comparable to the original discriminative score, since the two are essentially the same when an annotating GO term is informative. A straightforward way to handle gene selection for genes that are not currently annotated would therefore be to compute their original discriminative scores, which can then be combined with the best GO adjusted scores for the purpose of gene selection.
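A hedged sketch of this combination, with illustrative names: genes lacking GO annotation fall back to their original score, while annotated genes use the best GO adjusted score.

```python
def combined_score(g, F, gene_to_terms, best_go_adjusted):
    """Original score for unannotated genes, best GO adjusted score otherwise."""
    if not gene_to_terms.get(g):
        return F[g]                    # no GO annotation: fall back to the original score
    return best_go_adjusted(g)         # annotated: use the best GO adjusted score
```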
The discriminative power of a GO term is defined to describe the correlation between GO terms and sample class labels. This is different from previous research on the correlation between gene expression similarity and GO annotation similarity [3, 14], and it has some interesting implications of its own. It provides a bridge between gene ontology and disease ontology. The ability to couple GO terms and disease symptoms may prove useful in providing biologists new insight into disease pathology.
References
[1] F. Al-Shahrour, R. Díaz-Uriarte, and J. Dopazo. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20(4):578–580, 2004.
[2] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissue probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. U.S.A., 96(12):6745–6750, 1999.
[3] F. Azuaje and O. Bodenreider. Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study. In Proc. of IEEE BIBE 2004, 2004.
[4] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue classification with gene expression profiles. Volume 7, pages 559–583, 2000.
[5] T. Bø and I. Jonassen. New feature subset selection procedures for classification of expression profiles. Genome Biology, 3(4):research0017.1–0017.11, 2002.
[6] N. Bolshakova, F. Azuaje, and P. Cunningham. A knowledge-driven approach to cluster validity assessment. Bioinformatics, 21(10):2546–2547, 2005.
[7] J. Cheng, S. Sun, A. Tracy, E. Hubbell, J. Morris, V. Valmeekam, A. Kimbrough, M. Cline, G. Liu, R. Shigeta, D. Kulp, and M. Siani-Rose. NetAffx Gene Ontology Mining Tool: a visual approach for microarray data analysis. Bioinformatics, 20(9):1462–1463, 2004.
[8] The Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Research, 32:D258–D261, 2004.
[9] X. Cui and G. A. Churchill. Statistical tests for differential expression in cDNA microarray experiments. Genome Biology, 4(4):210, 2003.
[10] M. Diehn, G. Sherlock, et al. SOURCE: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Research, 31(1):219–223, 2003.
[11] A. K. Jain, R. P. Duin, and J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(1):4–37, 2000.
[12] P. J. Park, M. Pagano, and M. Bonetti. A nonparametric scoring algorithm for identifying informative genes from microarray data. In Proc. PSB, 2001.
[13] D. K. Slonim. From patterns to pathways: gene expression data analysis comes of age. Nature Genetics, 32:502–508, 2002.
[14] H. Wang and F. Azuaje. Gene expression correlation and gene ontology-based similarity: An assessment of quantitative relationships. In Proc. of IEEE CIBCB 2004, 2004.
[15] L. Yu and H. Liu. Redundancy based feature selection for microarray data. In Proc. of SIGKDD, 2004.
[16] J. H. Zar. Biostatistical Analysis. Prentice Hall, 1998.