Adaptive Classifiers, Topic Drifts and GO Annotations

Padmini Srinivasan
School of Library and Information Science, Department of Management Sciences
The University of Iowa
padmini-srinivasan@uiowa.edu

Abstract
Gene annotations with Gene Ontology codes offer scientists important options in their study of genes and their functions. Automatic GO annotation methods have the potential to supplement the intensive manual annotation processes. Annotation approaches using MEDLINE documents are generally two-phased: the first phase annotates documents with GO codes and the second annotates gene products via the documents. In this paper we study document annotation with GO codes from a temporal perspective. Specifically, we build adaptive code-specific classifiers. We also study topic drift, i.e., changes in the contextual characteristics of annotations over time. We show that topic drift is significant, especially in the biological process GO hierarchy. This at least partially explains the particular challenges faced with codes of this hierarchy.

Keywords: annotation; Gene Ontology codes; adaptive classifiers; topic drift.

1. INTRODUCTION
Annotating genes (or, more strictly, their products) is an important research area. Of special importance is annotation with GO (Gene Ontology) codes; these have been used in many different ways to explore gene function and related characteristics [e.g. 1, 2]. GO code annotations succinctly indicate the molecular functions, biological processes, and cellular components [3] related to a gene product. Although different subsets of GO may be used to annotate different species, the intent is to provide a common annotation infrastructure. Interest in automatic annotation strategies is evident from initiatives such as BioCreAtIvE I (http://biocreative.sourceforge.net/biocreative_1.html) and from the papers that have been published, especially in the special issue [4].

The overall approach in many automatic annotation efforts is to use the MEDLINE literature as the key source of evidence. Our own efforts follow this approach [5,6]. We view GO annotation as a two-phase process: first we annotate documents with GO codes; next we annotate gene products with codes based on the annotations associated with relevant documents. In recent research [5] we studied document annotation with SVM classifiers, specifically one classifier per code. We analyzed several angles, such as the relationship between the number of positive examples and performance (a positive example is a document that provides evidence supporting the annotation of a gene product with the GO code), and the relationship between the hierarchical level of the code and performance. In most of these experiments the design adopted was the classic cross validation design used extensively for many classification problems, including within the biomedical domain. That is, we randomly partitioned the dataset into N parts (N = 5 in our previous research) with a stratified strategy that distributes roughly equal numbers of positive examples to each part. Classifiers were built iteratively using combinations of N-1 parts, each tested on the remaining part.

A weakness of this design is that it does not synchronize with the temporal dimension that underlies the data. Specifically, each document has a time stamp given by its publication date. Atemporal cross validation allows classifiers to be tested on documents whose time stamps are older than those of the documents used to build the classifiers.
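To make this weakness concrete, the following minimal sketch (not from the paper; the documents, dates, and split sizes are made up) contrasts a shuffled hold-out split, which can place documents published after the test document into the training set, with a chronological split that cannot:

```python
from datetime import date
import random

# Hypothetical document stream for one GO code: (doc_id, publication_date, is_positive)
docs = [
    ("doc1", date(1998, 3, 1), True),
    ("doc2", date(1999, 7, 15), False),
    ("doc3", date(2001, 1, 9), True),
    ("doc4", date(2003, 5, 20), False),
    ("doc5", date(2004, 11, 2), True),
]

# Atemporal split: shuffle, then hold one part out (a stand-in for one fold of cross validation).
random.seed(0)
shuffled = docs[:]
random.shuffle(shuffled)
test_atemporal, train_atemporal = shuffled[:1], shuffled[1:]
# Some training documents may be newer than the held-out test document.
leak = any(d[1] > test_atemporal[0][1] for d in train_atemporal)
print("atemporal split trains on documents newer than the test doc:", leak)

# Temporal split: order by publication date and test only on later documents.
ordered = sorted(docs, key=lambda d: d[1])
train_temporal, test_temporal = ordered[:3], ordered[3:]
print("temporal split test docs are all newer than the training docs:",
      all(t[1] >= max(d[1] for d in train_temporal) for t in test_temporal))
```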
In contrast, experiments preserving the natural ordering of the documents are likely to provide a more realistic gauge of effectiveness. Therefore our first goal here is to explore classifiers that are true to the temporal order in the data. As described later, this raises the level of complexity in design. By adopting a design that follows the temporal ordering of the documents we also get an opportunity to explore a second goal in this paper: the study of topic drift in GO annotations. By topic drift we mean observable topic changes over the temporal stream of documents to which a code is applied. The notion of drift has been studied in various contexts, for example query expansion [7] and information filtering [8]. It is important to consider the potential for drift and its effect on GO annotation as literature on new and known gene products is continuously added. We are also interested in topic drift because of preliminary observations from our previous study (described later).

In section 2 we provide details regarding our methods and data. In section 3 we examine adaptive SVM classifiers, while in section 4 we study topic drift. In section 5 we present related research; finally, in section 6 we draw our conclusions and outline further research.

2. METHODS

2.1 Gene Ontology
The Gene Ontology (GO; downloaded May 16, 2006 from http://www.geneontology.org) provides a structured vocabulary for annotating gene products in order to succinctly indicate their molecular functions (MF), biological processes (BP), and cellular components (CC) [3]. Molecular function describes activities performed by individual gene products or complexes of gene products; examples are arbutin transporter activity and retinoic acid receptor binding. A biological process is made up of several steps accomplished by sequences of molecular functions; examples include lipoprotein transport and phage assembly. Cellular components are, for example, the nucleus, the NADPH oxidase complex, and the chromosome. There are three hierarchies in GO corresponding to these major dimensions; each is a directed acyclic graph.

2.2 Annotations
We began with a download of LocusLink (now called Entrez Gene, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene; downloaded August 2005) and extracted the entries for Homo sapiens, limited to those with locus type "gene with protein product, function known or inferred". There are 77,759 annotation entries for 16,630 locus ids. Considering only annotations using MEDLINE for evidence leaves 29,501 entries. Limiting the entries to those having TAS (Traceable Author Statement) or IDA (Inferred from Direct Assay) as evidence types (see http://www.geneontology.org/GO.evidence.shtml) yields 20,869 entries: 9,577 annotations for biological processes (BP), 5,195 for cellular components (CC), and 6,097 for molecular functions (MF). Together these 20,869 annotations reference 8,744 unique documents. We use codes with at least 10 positive documents; our dataset contained 89, 152, and 50 such codes for the molecular function, biological process, and cellular component hierarchies respectively. We set aside as tuning data approximately 10% of the codes for each hierarchy, with a minimum of 10 codes. Specifically, we tuned aspects such as thresholds with 10, 15, and 10 codes for the three hierarchies respectively.

2.3 Document Representation
We use the title, abstract, RN, and MeSH fields of the MEDLINE records. Word stems (after removing stop words) were used to generate document representations; these were produced using the SMART system [9]. The ltc [10] construction of TF*IDF weighting was used; this has worked well in our previous research [5,6]. With N the number of documents and n_i the number of documents associated with term t_i, the raw weight of t_i is w_i = (ln(tf_i) + 1.0) * ln(N / n_i), and the ltc weight is this value normalized by the vector length: ltc(t_i) = w_i / sqrt(sum_k w_k^2).

2.4 Performance Measures
We use FScore, the harmonic mean of precision (P) and recall (R): FScore = 2PR / (P + R). Precision is the number of true positive decisions made by the classifier divided by the number of positive decisions made. Recall is the number of true positive decisions made by the classifier divided by the number of positive documents that exist in the dataset.
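The weighting and evaluation formulas above can be summarized in a short sketch. This is our own illustration, not the paper's code: the function names (ltc_vector, f_score), the token list, and the document-frequency counts are made up, while n_docs=8744 matches the corpus size reported in Section 2.2.

```python
import math
from collections import Counter

def ltc_vector(doc_tokens, doc_freq, n_docs):
    """ltc weighting: w_i = (ln(tf_i) + 1) * ln(N / n_i), then length-normalize."""
    tf = Counter(doc_tokens)
    raw = {t: (math.log(f) + 1.0) * math.log(n_docs / doc_freq[t])
           for t, f in tf.items() if doc_freq.get(t, 0) > 0}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else raw

def f_score(tp, fp, fn):
    """FScore as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Illustrative usage with made-up stemmed tokens and collection statistics.
doc_freq = {"retino": 12, "receptor": 250, "bind": 400}
vec = ltc_vector(["retino", "receptor", "receptor", "bind"], doc_freq, n_docs=8744)
print(vec)
print(f_score(tp=30, fp=20, fn=45))
```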
3. ADAPTIVE CLASSIFIERS
Adaptive classifiers are classifiers that are re-trained as more data become available. We explore adaptive classifiers trained in a manner consistent with the temporal dimension of the data. Figure 1 illustrates our approach. We start by generating an initial training set (Gen 0) of documents from the earliest part of the document stream. The size of this initial set is determined by the position of the Nth relevant document in the stream for the code, with N being a parameter (set at 5 here). Next, a chunk of documents of size batchsize (another parameter) is used as the test set for this Gen 0 classifier; that is, the classifier is built using the Gen 0 training data and tested on the Gen 0 test data. For the next iteration, the Gen 0 training and test sets are combined to form the Gen 1 training set, which is used to build the Gen 1 classifier. Again, the next chunk of batchsize documents in temporal order is used to test the Gen 1 classifier. This process continues until the entire document stream is processed; the last generation's test set may have fewer than batchsize documents. Notice that the dataset partition is code specific, since the initial Gen 0 set depends on the positions of the first N relevant documents. This strategy reflects the natural ordering of the document stream.

Figure 1: Adaptive Classifiers (@: relevant document, %: non-relevant document in the stream).

An additional level of complexity arises because we need to set thresholds on the scores returned by the SVM classifiers. Based on our previous research [5,6] and the work of others [e.g. 11], it is clear that score thresholds are important, especially when the dataset is skewed. As an example, in our previous research the FScore for the molecular function hierarchy changed from 0.052 to 0.48 when thresholds were used. We set the thresholds in the following way. When developing the Gen X classifier for a code, we take all of its training data and run a standard 5-fold cross validation to find the optimal threshold. We then build a single classifier using all of the Gen X training data; this classifier is applied to the Gen X test data along with the optimal threshold.

Finally, we need to determine the optimal value of batchsize for each hierarchy. We tested batchsize values of 100, 200, 300, 400, and 500 on the tuning set of codes described in the Methods section. The best values were returned with batchsize = 100. However, runs with this parameter value took the longest, approximately 2 hours per code on a 2 GHz machine running Linux. For practical reasons we chose a batchsize of 400; thus our results are lower bounds on performance. For example, we get an FScore of 0.3933 on our MF training data with batchsize 400 versus 0.4312 (+9.6%) with batchsize 100.
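The generation-based procedure can be summarized with the following sketch, under several assumptions: scikit-learn's LinearSVC stands in for the paper's SVM implementation, the threshold grid and the helper names (tune_threshold, adaptive_run) are our own, and X, y are assumed to be a feature matrix and 0/1 label vector already sorted by publication date for a single GO code with at least n_initial_positives positives.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold

def tune_threshold(X, y, candidate_thresholds, n_folds=5):
    """Pick the score threshold maximizing FScore under 5-fold cross validation."""
    best_t, best_f = 0.0, -1.0
    for t in candidate_thresholds:
        fold_scores = []
        for train_idx, val_idx in StratifiedKFold(n_splits=n_folds).split(X, y):
            clf = LinearSVC().fit(X[train_idx], y[train_idx])
            pred = (clf.decision_function(X[val_idx]) >= t).astype(int)
            tp = int(np.sum((pred == 1) & (y[val_idx] == 1)))
            fp = int(np.sum((pred == 1) & (y[val_idx] == 0)))
            fn = int(np.sum((pred == 0) & (y[val_idx] == 1)))
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            fold_scores.append(2 * p * r / (p + r) if p + r else 0.0)
        if np.mean(fold_scores) > best_f:
            best_t, best_f = t, float(np.mean(fold_scores))
    return best_t

def adaptive_run(X, y, n_initial_positives=5, batchsize=400):
    """X, y must be sorted by publication date for one GO code."""
    # Gen 0 training set ends at the Nth positive document in the stream.
    cut = int(np.where(np.cumsum(y) == n_initial_positives)[0][0]) + 1
    results = []
    while cut < len(y):
        test = slice(cut, min(cut + batchsize, len(y)))
        # Threshold is tuned on all data seen so far, then a single classifier is built on it.
        t = tune_threshold(X[:cut], y[:cut], candidate_thresholds=np.linspace(-1, 1, 21))
        clf = LinearSVC().fit(X[:cut], y[:cut])
        pred = (clf.decision_function(X[test]) >= t).astype(int)
        results.append((cut, pred, y[test]))   # evaluate on the next temporal batch
        cut = test.stop                        # the next generation absorbs this batch
    return results
```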
The results are given in Table 1. We see that the best scores are attained for MF, followed by CC and then BP; the relative ordering of the three hierarchies is consistent with our previous results. For comparison we also provide the scores achieved on this dataset using a straightforward 5-fold cross validation design. Interestingly, we see improvements of 6.3% for MF and 5.6% for CC. As noted before, these results are lower bounds for our strategy.

Table 1: Performance (FScores) with Adaptive Classifiers
                          MF      BP      CC
Gen size: 400           0.4476  0.3428  0.4021
5-fold cross validation 0.4209  0.3480  0.3795

4. TOPIC DRIFT
Our interest in topic drift is motivated by related research [7,8] as well as an observation made in our previous study [5]. In that study we explored the effectiveness of classifiers built from training data limited to very few (< 5) positive examples. The experiment was motivated by the fact that every code starts, at birth, with no positives; it then accumulates positives over time, with codes possibly varying in their accumulation rates. We arranged our dataset temporally and considered the position of the fifth relevant document as the cut-off point: documents with the same or earlier time stamps formed the training set. For the test set we used two strategies. First, we limited the test set to the temporal stream up to and including the fifth new relevant document (labeled Recent 5). Second, we took the rest of the document stream as the test set (labeled Full). Figure 2 illustrates the design and Table 2 presents the results.

Figure 2: Training and Testing Sets for a Given Code.

We see from the table that the score for the Full test set is consistently and significantly lower than the score for the more recent test set. These differences are intriguing as they hint at the presence of topic drift in GO annotation. Our goal in this paper is to study this phenomenon through more direct methods.

Table 2: Performance (FScore) on Full versus Recent 5 Test Sets
Hierarchy   Full     Recent 5   Difference
MF          0.2713   0.3218     -15.7%
BP          0.1931   0.2251     -14.2%
CC          0.2144   0.2488     -13.8%

4.1 Comparing First and Last Batches of Documents
Our earlier observation is limited because the sizes of the test sets vary: the Recent 5 test set is a small subset of the Full test set. Thus our first objective in this paper is to conduct a fairer comparison of the two scenarios. Using the same design as before, we took the 400 documents temporally closest to the training data as our First batch test set and the furthest 400 documents as the Last batch test set. Table 3 shows these results. Again we see large drops in FScore as we move from the First batch to the Last batch for each hierarchy; batchsizes of 300 and 500 gave similar results. Thus the presence of a temporal drift in topic coverage for the codes is strongly indicated.

Table 3: Performance (FScore) on First versus Last Batch Test Sets
Batch   First    Last
MF      0.3607   0.2337 (-35.2%)
BP      0.2424   0.1756 (-27.6%)
CC      0.3104   0.127  (-59%)

4.2 Topic Drift in Relevant Documents
We now approach topic drift more directly. We collect the relevant documents for each code and examine the distribution of pair-wise similarity scores over time. The question we ask is: are similarities between temporally distant documents relevant to the same code different from similarities between temporally close documents? We take the relevant documents for a given code, again using publication date as the time stamp. We then create all pairs of documents from this set and place each pair into one of five bins; we do this for each code in turn. The bins are defined by the number of days separating the pair based on their publication dates. Starting with the bin representing a difference of 1 to 1000 days (b1), each bin covers a thousand-day interval; the last bin (b5) represents document pairs that are more than 4000 days apart. We compute cosine similarities between each pair of documents in a given bin. The average similarity (along with its variance) for each bin is examined, and our analysis is reported in the following figures.
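A minimal sketch of this binning analysis, under the assumption that each document is represented as a unit-length sparse vector (a dict of term weights, e.g. the ltc vectors of Section 2.3) together with a publication date; the helper names and the example documents are illustrative, not from the paper.

```python
from datetime import date
from itertools import combinations
from collections import defaultdict

def cosine(u, v):
    """Dot product of two unit-length sparse vectors (term -> weight dicts)."""
    return sum(w * v[t] for t, w in u.items() if t in v)

def bin_index(days_apart):
    """b1: 1-1000 days, b2: 1001-2000, ..., b5: more than 4000 days apart."""
    return min((days_apart - 1) // 1000, 4)

def similarity_by_bin(docs):
    """docs: list of (publication_date, unit-normalized vector) for one GO code."""
    sims = defaultdict(list)
    for (d1, v1), (d2, v2) in combinations(docs, 2):
        days = abs((d1 - d2).days)
        if days >= 1:
            sims[bin_index(days)].append(cosine(v1, v2))
    return {b: sum(s) / len(s) for b, s in sims.items()}   # average similarity per bin

# Illustrative usage with tiny made-up vectors.
docs = [(date(1995, 1, 1), {"kinase": 0.8, "signal": 0.6}),
        (date(1999, 6, 1), {"kinase": 0.5, "transport": 0.87}),
        (date(2006, 3, 1), {"signal": 1.0})]
print(similarity_by_bin(docs))
```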
Figure 3 plots the average pair-wise similarity by temporal bin. As a point of contrast we also plot the average similarity for each bin when documents are paired at random; such random pairs satisfy the bin requirements in terms of their temporal distance, but the two documents of a pair are not necessarily relevant to the same code. Note first that although the raw similarity scores are low, for each hierarchy they are all much higher than for the random pairs. What is more interesting in Figure 3 is the change over time: within each hierarchy, bins holding temporally closer document pairs have significantly higher scores than bins with more distant document pairs (error bars are extremely small and so are not shown in the figure). However, this is also true for the random pairings.

Figure 3: Average Similarity Over Temporal Distance (x-axis: temporal distance bins, from b1: 1-1000 days to b5: > 4000 days; y-axis: average pair-wise similarity; series: MF, BP, CC, Random).
We also observe that for BP the drops are larger from bin 3 onwards, and that for MF the drop from bin 1 to bin 2 is larger than for the other bins. Finally, the drop for CC parallels the drop for the random pairs in all bins.

In terms of topic drift, the rate of change in similarity scores over time is more important than the absolute score changes (as in Figure 3). We expect a baseline rate of change in similarity scores that can be gauged by observing the random pairs; this baseline provides an overall sense of the way in which documents are likely to drift apart over time. The question is: how does the similarity drift for each hierarchy compare to this baseline? Figure 4 presents this analysis. The x-axis identifies the bins being compared, with bin 1 (pairs that are 1 to 1000 days apart) as the reference; that is, we calculate the change in average similarity between each later bin and bin 1, which is, relatively speaking, the most recent bin. In Figure 4 the smallest percentage drops are for the random pairs: about 2% from bin 1 to bin 2, then 4%, 8%, and 14% for bins 3, 4, and 5 respectively. We see again that CC parallels the baseline, although its drops are larger: 3%, 7%, 11%, and 18% for the same bin sequence. The largest changes, or drifts, appear for BP (4.5%, 9%, 17%, 29%) followed by MF (7%, 10%, 13%, 17%). The BP percentage drops in similarity over time are in fact quite dramatic. However, when we focus solely on the difference going from bin 1 to bin 2, the biggest drift is for MF. MF also has the highest average similarities at all time points (see Figure 3).

Figure 4: Rate of Change in Average Similarity (x-axis: bins compared, b2-b1 through b5-b1; y-axis: percentage change in average similarity; series: MFRate, BPRate, CCRate, RandomRate).

Analyzing Figures 3 and 4 jointly, we see that the document sets for codes within each hierarchy are more cohesive than document sets created at random; this holds for all values of temporal distance tested. BP has the second highest set of similarities (next to MF) but it has the steepest drop over time. In other words, topic drift is highest for BP. This implies that BP would be a challenging hierarchy for which to build classifiers, and this is reflected in the performance scores obtained for BP relative to MF and CC, both with adaptive classifiers and with standard cross validation (see Table 1).

5. RELATED WORK
Annotating genes and their products with Gene Ontology codes is an important area of research. One approach is to use the information available about these genes in the biomedical literature; this is in contrast to other annotation methods such as those involving sequence homology and protein domain analysis [12]. The importance of GO annotation and the value of computational methods for it are well recognized. In the 2004 BioCreAtIvE I challenge, a set of tasks was designed to assess the performance of current systems in supporting GO annotation of specific proteins. In particular, the second task, identifying text passages that provide the evidence for an annotation, most closely resembles the manual process of GO annotation [13]. The participating systems showed a variety of approaches (from heuristics to Support Vector Machine based classification), exploring different levels of text analysis (such as sentences or paragraphs) [14]. In Rice et al. [15], SVM classification was applied to the relevant documents for each GO code; features from the documents were selected and conflated as sets of synonymous terms. In Ray and Craven [16], statistical methods were first applied to identify informative n-gram terms from the relevant documents of each GO term; these term models provided hypothesized annotation models that could be applied to the test documents. In Chiang and Yu [17], a hybrid method combining sentence-level classification and pattern matching achieves higher precision with fewer true positive documents.

The document annotation problem is also interesting because the codes themselves are structured hierarchically. Similar hierarchical problems have been addressed before [e.g. 18], including by us [19]. But the three hierarchies of Gene Ontology, molecular function (MF), biological process (BP), and cellular component (CC), have distinct characteristics; for example, they differ significantly in link semantics. Molecular function is built out of is_a links, biological process links are roughly one-fifth part_of and four-fifths is_a, while cellular component is about evenly split between the two link types. Although both link types are asymmetric and transitive, their semantics are very different.

6. CONCLUSIONS AND FUTURE WORK
We explored temporal and adaptive SVM classifiers for annotation with GO codes and obtained FScores of 0.4476, 0.3428, and 0.4021 for the molecular function, biological process, and cellular component hierarchies respectively. We also studied topic drift. The largest drift is observed for the biological process hierarchy, which might explain why it is the most challenging hierarchy. In future work we will explore versions of adaptive classifiers in which training instances are weighted by age. Another strategy to explore is the use of ensembles in which each member classifier is built from a distinct temporal chunk of training data.
A key limitation of our research is that it is based on abstracts and not full text; the latter will be explored in future research.

ACKNOWLEDGMENTS
Padmini Srinivasan gratefully acknowledges NSF Grant No. IIS-0312356, which partly funded this research.

REFERENCES
[1] G. Yi, S. H. Sze, M. R. Thon. Identifying clusters of functionally related genes in genomes. BMC Bioinformatics, 2007 Jan 19.
[2] G. Lu, T. V. Nguyen et al. AffyMiner: mining differentially expressed genes and biological knowledge in GeneChip microarray data. BMC Bioinformatics, 2006 Dec 12; 7 Suppl 4:S26.
[3] M. Ashburner, C. A. Ball, J. A. Blake et al. Gene Ontology: tool for the unification of biology. Nature Genetics, 25:25-29, 2000.
[4] C. Blaschke, L. Hirschman, A. Valencia, A. Yeh. A critical assessment of text mining. BMC Bioinformatics, 6(Suppl 1), 2005.
[5] P. Srinivasan and X. Y. Qiu. GO for gene documents. To appear, BMC Bioinformatics, 2007.
[6] X. Y. Qiu and P. Srinivasan. GO for gene documents. Proceedings of TMBIO 2006: ACM First International Workshop on Text Mining in Bioinformatics, CIKM 2006.
[7] A. Singhal, M. Mitra, and C. Buckley. Learning routing queries in a query zone. Proceedings of SIGIR, pages 25-32, 1997.
[8] R. Klinkenberg. Learning drifting concepts: example selection vs. example weighting. Intelligent Data Analysis, Special Issue on Incremental Learning Systems Capable of Dealing with Concept Drift, 8(3):281-300, 2004.
[9] G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, 1989.
[10] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. Proceedings of the ACM SIGIR Conference, pages 21-29, 1996.
[11] J. Brank, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. Training text classifiers with SVM on very few positive examples. Microsoft Research Technical Report MSR-TR-2003-34, 2003.
[12] H. Xie, A. Wasserman, Z. Levine et al. Large scale protein annotation through Gene Ontology. Genome Research, 12:785-794, 2002.
[13] L. Hirschman, A. Yeh, C. Blaschke, and A. Valencia. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6(Suppl 1):S1, May 2005.
[14] C. Blaschke, E. A. Leon, M. Krallinger, and A. Valencia. Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6(Suppl 1):S16, May 2005.
[15] S. B. Rice, G. Nenadic, and B. J. Stapley. Mining protein function from text using term-based support vector machines. BMC Bioinformatics, 6(Suppl 1):S22, May 2005.
[16] S. Ray and M. Craven. Learning statistical models for annotating proteins with function information using biomedical text. BMC Bioinformatics, 6(Suppl 1):S18, May 2005.
[17] J.-H. Chiang and H.-C. Yu. Extracting functional annotations of proteins based on hybrid text mining approaches. Proceedings of the BioCreAtIvE Challenge Evaluation Workshop, 2004.
[18] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Using taxonomy, discriminants, and signatures for navigating in text databases. Proceedings of the VLDB Conference, 1997.
[19] M. E. Ruiz and P. Srinivasan. Hierarchical text categorization using neural networks. Information Retrieval, 5(1):87-118, 2002.