Adaptive Classifiers, Topic Drifts and GO Annotations
Padmini Srinivasan
School of Library and Information Science
Department of Management Sciences
The University of Iowa
{padmini-srinivasan@uiowa.edu}
Abstract
Gene annotations with Gene Ontology codes offer scientists
important options in their study of genes and their functions.
Automatic GO annotation methods have the potential to
supplement the intensive manual annotation processes.
Annotation approaches using MEDLINE documents are
generally two-phased: the first phase annotates documents
with GO codes and the second annotates gene products via
the documents. In this paper we study document annotation
with GO codes using a temporal perspective. Specifically, we
build adaptive code-specific classifiers. We also study topic
drift, i.e., changes in the contextual characteristics of annotations
over time. We show that topic drift is significant, especially in
the biological process GO hierarchy. This at least partially
explains the particular challenges faced with codes of this
hierarchy.
Keywords: annotation; Gene Ontology codes; adaptive
classifiers; topic drift.
1. INTRODUCTION
Annotating genes (or more strictly their products) is an
important research area. Of special importance is annotation
with GO (Gene Ontology) codes – these have been used in
many different ways to explore gene function and related
characteristics [e.g. 1, 2]. GO code annotations succinctly
indicate molecular functions, biological processes, and cellular
components [3] related to the gene product. Although different
subsets of GO may be used to annotate different species, the
intent is to provide a common annotation infrastructure.
Interest in automatic annotation strategies is evident from
initiatives such as BioCreAtIvE I
(http://biocreative.sourceforge.net/biocreative_1.html) and papers that have been
published (especially in the special issue [4]). The overall
approach in many automatic annotation efforts is to use the
MEDLINE literature as the key source of evidence. Our own
efforts follow this approach [5,6].
We view GO annotation as a 2-phase process. First we
annotate documents with GO codes. Next we annotate gene
products with codes based on annotations associated with
relevant documents. In recent research [5] we studied document
annotation with SVM classifiers, specifically one classifier per
code. We analyzed several angles such as the relationship
between the number of positive examples and performance (a
positive example is a document that provides evidence supporting the
annotation of a gene product with the GO code); the
relationship between the hierarchical level of the code and
performance, etc. In most of these experiments the design
adopted was a classic cross validation design used extensively
for many classification problems even within the biomedical
domain. That is, we randomly partitioned the dataset into N (N =
5 in our previous research) parts, using a stratified strategy to
distribute approximately equal numbers of positive examples to each
part. Classifiers were built iteratively using combinations of N-1 parts, each tested on the remaining part.
A weakness in the above design is that it does not
synchronize with the temporal dimension that underlies the
data. Specifically, each document has a time stamp as
designated by its publication date. Atemporal cross validation
allows the testing of classifiers on documents with time stamps
that are older than those of documents used to build the
classifiers. In contrast, experiments preserving the natural
ordering of the documents are likely to provide a more realistic
gauge of effectiveness. Therefore our first goal here is to
explore classifiers that are true to the temporal order in the data.
As described later this raises the level of complexity in design.
By adopting a design that follows the temporal
ordering of the documents we also get an opportunity to explore
a second goal in this paper: the study of topic drifts in GO
annotations. By topic drift we mean observable topic changes
over the temporal stream of documents to which a code is
applied. In general, the notion of drift has been studied in
various contexts, for example query expansion [7] and
information filtering [8]. It is important to consider the
potential for drift and its effect on GO annotation because literature
on new and known gene products is continuously added. We
are also interested in topic drift because of preliminary
observations from our previous study (described later).
In section 2 we provide details regarding our methods
and data. In section 3 we examine adaptive SVM classifiers,
while in section 4 we study topic drift. In section 5 we
present related research. Finally, in section 6 we draw our
conclusions and outline further research.
2. METHODS
2.1 Gene Ontology
Gene Ontology (GO; downloaded May 16, 2006 from http://www.geneontology.org) provides a structured vocabulary
for annotating gene products in order to succinctly indicate their
molecular functions (MF), biological processes (BP), and
cellular components (CC) [3]. Molecular function describes
activities performed by individual gene products or complexes
of gene products. Examples of molecular functions are arbutin
transporter activity and retinoic acid receptor binding. A
biological process is made of several steps accomplished by
sequences of molecular functions. Examples include lipoprotein
transport and phage assembly. Cellular components are for
example, the nucleus, NADPH oxidase complex, and
chromosome. There are three hierarchies in GO corresponding
to these major dimensions. Each is a directed acyclic graph.
2.2 Annotations
We began with a download of LocusLink (now called
Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene,
downloaded August 2005) and extracted the entries for Homo sapiens,
limited to those with locus type gene with protein product,
function known or inferred. There are 77,759 annotation entries
for 16,630 locus ids. Considering only annotations using
MEDLINE for evidence we have 29,501 entries. Entries limited
to those having TAS (Traceable Author Statement) or IDA
(Inferred from Direct Assay) as evidence types (see
http://www.geneontology.org/GO.evidence.shtml) yield 20,869
entries. These are composed of 9,577 annotations for biological
processes (BP), 5,195 for cellular components (CC),
and 6,097 for molecular function (MF). Together these 20,869
annotations reference 8,744 unique documents.
We use codes with at least 10 positive documents. Our
dataset contained 89, 152 and 50 codes satisfying this criterion
for the molecular function, biological process and cellular
component hierarchies respectively. We set aside as tuning data
approximately 10% of the codes for each hierarchy with a
minimum of 10 codes. Specifically, we tuned for aspects such
as thresholds with 10, 15 and 10 codes for the three hierarchies
respectively.
2.3 Document Representation
We use the title, abstract, RN and MeSH fields of the
MEDLINE records. Word stems (after removing stop words)
were used to generate document representations. These were
produced using the SMART system [9]. The ltc [10]
construction of TF * IDF weighting was used. This has worked
well in our previous research [5,6]. The ltc weight of term $t_i$ in a document is

$$ltc(t_i) = \frac{w_i}{\sqrt{\sum_k w_k^2}}, \qquad w_i = (\ln(tf_i) + 1.0) \cdot \ln(N / n_i)$$

where $tf_i$ is the frequency of term $t_i$ in the document, $N$ is the number of documents, and $n_i$ is the number of documents containing term $t_i$.
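As an illustration, here is a minimal sketch of the ltc computation over an in-memory collection of tokenized documents (plain Python; the actual representations in this work were produced with SMART, so this is only a stand-in and all names are ours):

import math

def ltc_vectors(docs):
    """Compute ltc (TF*IDF with cosine normalization) weights.

    docs: list of token lists (stemmed, stop words removed).
    Returns one {term: weight} dict per document.
    """
    N = len(docs)
    df = {}  # n_i: number of documents containing term t_i
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    vectors = []
    for doc in docs:
        tf = {}
        for term in doc:
            tf[term] = tf.get(term, 0) + 1
        # w_i = (ln(tf_i) + 1.0) * ln(N / n_i)
        w = {t: (math.log(f) + 1.0) * math.log(N / df[t]) for t, f in tf.items()}
        # cosine normalization: divide by sqrt(sum_k w_k^2)
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        vectors.append({t: v / norm for t, v in w.items()})
    return vectors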
2.4 Performance Measures
We use FScore, the harmonic mean of precision (P) and
recall (R): FScore = (2 * P * R) / (P + R). Precision is the number of
true positive decisions made by the classifier divided by the
total number of positive decisions made. Recall is the number of true
positive decisions made by the classifier divided by the number
of positive examples that exist in the dataset.
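A minimal sketch of these measures for a list of binary decisions (1 = positive):

def precision_recall_fscore(gold, predicted):
    """FScore = (2 * P * R) / (P + R) over binary decisions."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fscore = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, fscore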
3. ADAPTIVE CLASSIFIERS
Adaptive classifiers are those that are re-trained when
more data become available. We explore adaptive classifiers
trained in a manner consistent with the temporal dimensions of
the data. Figure 1 illustrates our approach.
We start by generating an initial training set (Gen 0) of
documents from the earliest part of the document stream. The
size of this initial set is determined by the position of the Nth
relevant document in the stream for the code, with N being a
parameter (set at 5 here). Next a chunk of documents of size
batchsize (another parameter) is used as the test set for this Gen
0 classifier. That is, the classifier is built using Gen 0 training
data and tested on Gen 0 test data. For the next iteration the
Gen 0 training and testing subsets are combined to form the
Gen 1 training set and used to build the Gen 1 classifier. Again
the next chunk of documents in temporal order of size batchsize
is used to test the Gen 1 classifier. This process continues until
the entire document stream is processed. The last generation
test set may have fewer than batchsize documents. Notice that the
dataset partition is code-specific, as the initial Gen 0 depends on
the distribution of the first N relevant documents. This strategy
reflects the natural ordering of the document stream.
Figure 1: Adaptive classifiers. @: relevant document; %: non-relevant
document in the stream.
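A minimal sketch of this generation scheme, assuming documents arrive pre-sorted by publication date; the paper does not describe its implementation, so the names and structure below are illustrative:

def generations(num_docs, relevance, n_initial=5, batchsize=400):
    """Yield (train_indices, test_indices) splits in temporal order.

    relevance: list of booleans, True where the document is a positive
    example for the code. Gen 0 training ends at the position of the
    n_initial-th positive; each generation then absorbs the previous
    test batch into its training set.
    """
    positives_seen, train_end = 0, None
    for i, rel in enumerate(relevance):
        if rel:
            positives_seen += 1
            if positives_seen == n_initial:
                train_end = i + 1
                break
    if train_end is None:
        return  # fewer than n_initial positives: no classifier is built
    while train_end < num_docs:
        # the last generation's test set may hold fewer than batchsize documents
        test_end = min(train_end + batchsize, num_docs)
        yield list(range(train_end)), list(range(train_end, test_end))
        train_end = test_end  # previous test chunk joins the next training set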
An additional level of complexity results because we need to set
thresholds on the scores returned by the SVM classifiers. Based
on our previous research [5,6] and the work of others [e.g. 11] it
is clear that score thresholds are important, especially when the
dataset is skewed. As an example, in our previous research
FScore changed from 0.052 to 0.48 for the molecular function
hierarchy when using thresholds.
We set the thresholds in the following way. When
developing the Gen X classifier for a code, we take all of its
training data and use a standard 5-fold cross validation design to
find the optimal threshold. We then build a single classifier
using all of the Gen X training data. This is then applied to the
Gen X test data along with this optimal threshold.
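A sketch of this threshold-setting step. We use scikit-learn's LinearSVC purely as a stand-in (the paper does not name its SVM implementation); shuffling is disabled so folds respect the document order:

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

def tune_threshold(X, y, n_splits=5):
    """Pick the SVM score threshold that maximizes FScore under 5-fold CV.

    Pools validation-fold scores, then sweeps every observed score as a
    candidate threshold; the winner is applied to the Gen X test data.
    """
    scores, truth = [], []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        clf = LinearSVC().fit(X[train_idx], y[train_idx])
        scores.extend(clf.decision_function(X[val_idx]))
        truth.extend(y[val_idx])
    scores, truth = np.array(scores), np.array(truth)
    return max(np.unique(scores), key=lambda t: f1_score(truth, scores >= t))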
Finally we need to determine the optimal value for
batchsize for each hierarchy. We tested batchsize values of
100, 200, 300, 400 and 500. These runs were done with the tuning
set of codes as described in the Methods section. The best
scores were obtained with batchsize = 100. However, runs with
this parameter value required the longest time to
process a single code: approximately 2 hours per code on a
2 GHz machine running Linux. For practical reasons we chose
to use a batchsize of 400. Thus our results are lower bounds on
performance. For example, we get an FScore of 0.3933 on our MF training
data with batchsize 400 and 0.4312 (+9.6%) with batchsize 100.
The results are given in Table 1. We see that the best
scores are attained for MF followed by CC and then BP. The
relative ordering of the three hierarchies is consistent with our
previous results. Just for comparison we also provide the scores
achieved on this dataset using a straightforward 5-fold cross
validation design. Interestingly, we see improvements of 6.3%
for MF and 5.6% for CC. As noted before these results are
lower bounds on our strategy.
                           FScores
                      MF        BP        CC
Gen size: 400         0.4476    0.3428    0.4021
cross validation      0.4209    0.3480    0.3795
Table 1: Performance with Adaptive Classifiers
4. TOPIC DRIFT
Our interest in topic drift is motivated by related
research [7,8] as well as an observation made in our previous
study [5]. In it we explored the effectiveness of classifiers built
out of training data limited to very few (< 5) positive examples.
The experiment was motivated by the fact that every code
starts, at birth, with no positives. It then accumulates positives
over time with codes possibly varying in their accumulation
rates. We arranged our dataset temporally and considered the
position of the fifth relevant document as the cut-off point.
Documents with the same or earlier time stamps formed the
training set. For the test set we used two strategies. First, we
limited the test set to the temporal stream up to and including
the fifth new relevant document (labeled Recent 5). Second, we
took the rest of the document stream as the test set (labeled
Full). Figure 2 illustrates the design while Table 2 presents
these results.
Figure 2: Training and Testing Sets for a Given Code
We see from the table that the score for the Full test set
is consistently and significantly lower than the score for the more
recent test set. These differences are intriguing as they hint at
the presence of topic drifts in GO annotation. Our goal in this
paper is to study this phenomenon through more direct methods.
Hierarchy   Full FScore   Recent 5 FScore   Difference
MF          0.2713        0.3218            -15.7%
BP          0.1931        0.2251            -14.2%
CC          0.2144        0.2488            -13.8%
Table 2: Performance on Full versus Recent 5 Test Sets
4.1 Comparing First and Last Batches of Documents
Our earlier observation is limited as the sizes of the
test data vary. The Recent 5 test set is a small subset of the Full
test set. Thus the first objective in this paper is to conduct a
fairer comparison of the two scenarios.
Using the same design as before we took the 400
documents temporally closest to the training data as our First
batch test set and we took the furthest 400 documents as the
Last batch test set. Table 3 shows these results. Again we see
large drops in FScore as we move from the First batch to the
Last batch for each hierarchy. Batchsizes of 300 and 500
yielded similar results. Thus the presence of a temporal drift
in topic coverage for the codes is strongly indicated.
Batch   MF                BP                CC
First   0.3607            0.2424            0.3104
Last    0.2337 (-35.2%)   0.1756 (-27.6%)   0.127 (-59%)
Table 3: Performance on First versus Last Batch Test Sets
4.2 Topic Drift in Relevant Documents
We now approach topic drift more directly. We collect
the relevant documents for each code and examine the
distribution of pair-wise similarity scores over time. The
question we ask is: are similarities between temporally distant
documents relevant to the same code different from similarities
between temporally close documents? We take the relevant
documents for a given code, again using publication date as the
time stamp. We then create all pairs of documents from this set
and place each pair into one of five bins. We do this for each
code in turn. The bins are defined by the number of days
separating the pair based on their publication dates. Starting
with the bin representing a difference of 1 to 1000 days (b1),
each bin is defined at a thousand day interval. The last bin
represents document pairs that are more than 4000 days apart
(b5).
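A small sketch of this binning step, assuming each document's publication date is available as a datetime.date (the helper name is ours, not the paper's):

from itertools import combinations

def bin_pairs(pub_dates, width=1000, max_bin=5):
    """Assign every pair of relevant documents to a temporal-distance bin.

    pub_dates: list of datetime.date values for one GO code's documents.
    Bins b1..b4 cover 1-1000, 1001-2000, ... days apart; b5 holds pairs
    more than 4000 days apart. Returns {bin_number: [(i, j), ...]}.
    """
    bins = {b: [] for b in range(1, max_bin + 1)}
    for i, j in combinations(range(len(pub_dates)), 2):
        days = abs((pub_dates[i] - pub_dates[j]).days)
        if days == 0:
            continue  # same-day pairs fall below bin b1's 1-day lower bound
        bins[min((days - 1) // width + 1, max_bin)].append((i, j))
    return bins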
We compute cosine similarities between each pair of
documents in a given bin. Average similarity (along with
variance) for each bin is examined and our analysis is reported
in the following figures. Figure 3 plots the average pair-wise
similarity by temporal bin. As a point of contrast we also plot
the average similarity for each bin when documents are paired
at random. Such random pairs satisfy the bin requirements in
terms of their temporal distance. But the documents of a pair
are not necessarily relevant to the same code.
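A sketch of the per-bin similarity statistics, reusing the hypothetical bin_pairs helper above; since the ltc vectors are cosine-normalized, a dot product gives the cosine similarity directly:

import numpy as np

def bin_similarity_stats(vectors, binned_pairs):
    """Average pair-wise cosine similarity (and variance) per bin.

    vectors: dense numpy array, one cosine-normalized ltc vector per row.
    binned_pairs: output of bin_pairs. Returns {bin_number: (mean, var)}.
    """
    stats = {}
    for b, pairs in binned_pairs.items():
        sims = np.array([vectors[i] @ vectors[j] for i, j in pairs])
        if sims.size:
            stats[b] = (sims.mean(), sims.var())
    return stats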
Note first that although the raw similarity scores are
low, for each hierarchy they are all much higher than for the
random pairs. What is more interesting in Figure 3 is the
change over time. Within each hierarchy, bins holding
temporally closer document pairs have significantly higher
scores than bins with more distant document pairs (error bars
are extremely small and so are not shown in the figure to
maintain clarity). However, this is also true for the random
pairings. We also observe that for BP the drops are larger from
bin 3 onwards. Also, for MF the drop from bin 1 to bin 2 is
larger than for the other bins. Finally, the drop for CC parallels
the drop for random pairs in all bins.
In terms of topic drift the rate of change in similarity
scores over time is more important than the absolute score
changes (as in Figure 3). We expect that there will be a
baseline rate of change in similarity scores that may be gauged
by observing the random pairs. This baseline provides an
overall sense of the way in which documents are likely to drift
apart over time. The question is: how does the similarity drift
for each hierarchy compare to this baseline? Figure 4 presents
this analysis. The x-axis identifies bins being compared, with
bin 1 (pairs that are 1 to 1000 days apart) as the focus. That is,
we calculate the change in average similarity between each later
bin and bin 1, which is, relatively speaking, the most recent bin.
In Figure 4 the smallest percentage drops are for the
random pairs, about 2% from bin 1 to bin 2, then 4%, 8% and
14% respectively for bins 3, 4 and 5. We see again that CC
parallels the baseline – however the drops are higher with 3%,
7%, 11% and 18% respectively for the same bin sequence. The
largest changes or drifts appear for BP (4.5%, 9%, 17%, 29%)
followed by MF (7%, 10%, 13%, 17%). BP percentage drops
in similarity over time are in fact quite dramatic. However,
when we focus solely on the difference going from bin 1 to bin
2, the biggest drift is for MF. MF also has the highest average
similarities for all time points (see Figure 3).
Analyzing Figures 3 and 4 jointly we see that the
document sets for codes within each hierarchy are more
cohesive than document sets created at random. This is true
over all values of temporal distance tested. BP has the second
highest set of similarities (next to MF) but it has the steepest
drop over time. In other words, topic drift is the highest for BP.
This implies that BP would be a challenging hierarchy for
which to build classifiers. This point is reflected in the
performance scores obtained with adaptive classifiers as well as
through standard cross validation for BP relative to MF and CC
(see Table 1).
Figure 3: Average Similarity Over Temporal Distance. (Average
pair-wise similarity per temporal distance bin: 1-1000 days (b1),
1000-2000 (b2), 2000-3000 (b3), 3000-4000 (b4), >4000 (b5);
series: MF, BP, CC, Random.)
Figure 4: Rate of Change in Average Similarity. (Percentage drop
in average similarity relative to bin b1 for comparisons b2-b1
through b5-b1; series: MFRate, BPRate, CCRate, RandomRate.)

5. RELATED WORK
Annotating genes and their products with Gene Ontology
codes is an important area of research. One approach is to use
the information available about these genes in the biomedical
literature. This is in contrast to other annotation methods such
as those involving sequence homology and protein domain
analysis [12].
The importance of GO annotation and the value of
computational methods to support it are well recognized. In
the 2004 BioCreAtIvE I challenge a set of tasks was designed
to assess the performance of current systems in supporting GO
annotations for specific proteins. In particular, the second task,
to identify text passages that provide the evidence for
annotation, most closely resembles the manual process of GO
annotation [13]. The participating systems showed a variety of
approaches (from heuristics to Support Vector Machine based
classification) exploring different levels of text analysis (such as
sentences or paragraphs) [14]. In Rice et al. [15] SVM
classification was applied to the relevant documents for each
GO code. Features from the documents were selected and
conflated as sets of synonymous terms. In Ray and Craven [16]
statistical methods were first applied to identify informative
n-gram terms from the relevant documents of each GO term.
These term models provided hypothesized annotation models
that could be applied to the test documents. In Chiang and Yu
[17] a hybrid method combining sentence level classification
and pattern matching achieved higher precision with fewer true
positive documents.
The document annotation problem is interesting as the
codes themselves are structured hierarchically. Similar
hierarchical problems have been addressed [e.g. 18], including
by us [19]. But the three hierarchies of Gene Ontology,
molecular function (MF), biological process (BP) and cellular
component (CC), have distinct characteristics; e.g., they differ
significantly in link semantics. Molecular function is built out
of is_a links, biological process links are one-fifth part_of and
four-fifths is_a, while cellular component is about evenly split
between the two link types. Although both link types are
asymmetric and transitive, their semantics are very different.
6. CONCLUSIONS AND FUTURE WORK
We explored temporal and adaptive SVM classifiers
for annotation with GO codes and obtained FScores of 0.4476,
0.3428 and 0.4021 for the molecular function, biological
process and cellular component hierarchies respectively. We
also studied topic drift. The largest drift is observed for the
biological process hierarchy, which might explain why it is the
most challenging hierarchy. In future work we will explore different
versions of adaptive classifiers where the training instances are
weighted by age. Another strategy to explore is the use of
ensembles where each member classifier is from a distinct
temporal chunk of training data. A key limitation of our
research is that it is based on abstracts and not full-text. This
latter text type will be explored in future research.
ACKNOWLEDGMENTS
Padmini Srinivasan gratefully acknowledges NSF Grant No.
IIS-0312356 which partly funded this research.
REFERENCES
[1] G. Yi, S. H. Sze, M. R. Thon. Identifying clusters of
functionally related genes in genomes. BMC Bioinformatics.
2007 Jan 19.
[2] G. Lu, T. V. Nguyen et al. AffyMiner: mining differentially
expressed genes and biological knowledge in GeneChip
microarray data. BMC Bioinformatics. 2006 Dec 12;7 Suppl
4:S26.
[3] M. Ashburner, C. A. Ball, J. A. Blake et al. Gene ontology:
tool for the unification of biology. Nature Genetics, 25:25–29,
2000.
[4] C. Blaschke, L. Hirschman, A. Valencia, A. Yeh. A critical
assessment of text mining. BMC Bioinformatics, 6(Suppl 1),
2005.
[5] P. Srinivasan and X. Y. Qiu. GO for Gene Documents. To
appear in BMC Bioinformatics, 2007.
[6] X. Y. Qiu and P. Srinivasan. GO for Gene Documents.
Proceedings TMBIO 2006: ACM First International Workshop
on Text Mining in Bioinformatics, CIKM 2006.
[7] A. Singhal, M. Mitra, and C. Buckley. Learning routing
queries in a query zone. Proceedings SIGIR pages 25-32, 1997.
[8] R. Klinkenberg. Learning drifting concepts: Example
selection vs. example weighting. Intelligent Data Analysis,
Special Issue on Incremental Learning Systems Capable of
Dealing with Concept Drift, 8(3):281-300, 2004.
[9] G. Salton. Automatic Text Processing: The Transformation,
Analysis, and Retrieval of Information by
Computer. Addison-Wesley, 1989.
[10] A. Singhal, C. Buckley, and M. Mitra. Pivoted document
length normalization. Proceedings ACM SIGIR Conference,
pages 21–29, 1996.
[11] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic.
Training text classifiers with SVM on very few positive
examples. Microsoft Corporation Technical Report MSR-TR-2003-34, 2003.
[12] H. Xie, A. Wasserman, Z. Levine et al. Large scale protein
annotation through gene ontology. Genome Research, 12:785–
794, 2002.
[13] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia.
Overview of BioCreAtIvE: critical assessment of information
extraction for biology. BMC Bioinformatics, 6(Suppl
1)(S1):795–825, May 2005.
[14] C. Blaschke, E. A. Leon, M. Krallinger, and A. Valencia.
Evaluation of BioCreAtIvE assessment of task 2. BMC
Bioinformatics, 6(Suppl 1)(S16):291–301, May 2005.
[15] S. B. Rice, G. Nenadic, and B. J. Stapley. Mining protein
function from text using term-based support vector machines.
BMC Bioinformatics, 6(Suppl 1)(S22):291–301, May 2005.
[16] S. Ray and M. Craven. Learning statistical models for
annotating proteins with function information using biomedical
text. BMC Bioinformatics, 6(Suppl 1)(S18):291–301, May
2005.
[17] J.-H. Chiang & H.-C. Yu. Extracting functional
annotations of proteins based on hybrid text mining approaches.
Proceedings BioCreAtIvE Challenge Evaluation Workshop
2004.
[18] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan.
Using taxonomy, discriminants, and signatures for navigating in
text databases. Proceedings VLDB Conference, 1997.
[19] M. E. Ruiz and P. Srinivasan. Hierarchical text
categorization using neural networks. Information
Retrieval, 5(1):87–118, 2002.