Efficient Ranking of Keyword Queries Using P-trees

Fei Pan, Imad Rahal, Yue Cui, William Perrizo
Computer Science Department, North Dakota State University, Fargo, ND 58105, USA
fei.pan@ndsu.nodak.edu

Abstract

In this paper, we describe the architecture, implementation, and evaluation of the P-RANK system, built to address the requirement of extracting evidence of specific gene products from biomedical papers. P-RANK is a system that accepts user interests in the form of keywords, integrates different depth weights into the ranking score, and, in turn, returns a ranked list of references pertaining to the users' interests. The depth weighting reflects the fact that molecular biologists who review these papers look mainly for certain parts of a paper when extracting experimental evidence. Our contributions include a new, efficient keyword query system built on a data structure called the P-tree and a fast weighted ranking method using the EIN-ring, which together aid in the process of looking for evidence of specific gene products in biomedical papers, especially in abstracts, experiments, and figure legends. We experimentally compare P-RANK with other ranked-keyword searching approaches, and the results show that P-RANK is a very fast ranked keyword query system.

Keywords: Keyword Query, Information Retrieval, P-tree.

1 INTRODUCTION

Biomedical information exists in both the research literature and various structured databases. With a huge number of papers and books published yearly, access to biomedical information is in great demand, especially by molecular biologists. Molecular biologists are interested in retrieving a concise, ranked set of references of interest without having to manually go through all the documents in a given collection. Traditional methods for ranking query results, such as Google [5][6], are based on ranking formulas that calculate how frequently a document is accessed. However, molecular biologists are mostly interested in extracting evidence of specific gene products, e.g., mRNA and proteins, which cannot be achieved by such general ranking approaches. An automatic analysis of scientific biomedical papers that extracts only the papers containing experiment-related results is needed.

In this paper, we describe the architecture, implementation, and evaluation of the P-RANK system, built to address the requirement of extracting, from biomedical papers, evidence regarding the expression of specific gene products. P-RANK is a system that accepts user interests in the form of keywords, integrates different dimension weights into the ranking score, and evaluates documents accordingly. It returns a ranked list of references pertaining to the users' interests. Our contributions include presenting a new, efficient keyword query system using a data structure called the P-tree, and a fast weighted ranking method using the EIN-ring.

This paper is organized as follows. In Section 2, we describe the data model and representation used in P-RANK. In Section 3, we present an overview of P-RANK's system architecture. In Section 4, we present the weighted ranking method using the EIN-ring. We experimentally compare our approach to the traditional inverted list approach in Section 5. Finally, we conclude the paper in Section 6.

2 DATA MODEL AND REPRESENTATION

2.1 Data Representation

Most data mining tools and algorithms assume that the data being mined have some sort of structure, e.g., relational tables in databases or data cubes in data warehouses ([10], Chapter 2).
Given that text references are only semi-structured, the task of text mining appears, at first, to be intractable. However, models have been developed to impose some structure on text, notably the term space model [11][12]. In this model, a document is represented as a vector in the term space, where terms are used as features or dimensions. The data structure resulting from representing all the documents in a given collection as term vectors is referred to as the document-by-term matrix. In this representation, the vector coordinates are terms with numeric values expressing their relevance to the corresponding documents, where higher values mean higher relevance. The process of assigning numeric values to term coordinates is referred to as term weighting in the literature; we can view weighting as the process of giving more emphasis to more important terms.

Term weight measures in document vectors can be binary, with the values 1 and 0 denoting the existence or absence of the term in the corresponding document, respectively. However, due to the great loss of information associated with it, this representation lacks the accuracy needed. We instead employ a more sophisticated representation, the TF/IDF approach, which records an exact frequency and uniqueness measure for each term in a document [9]. The depth weight is another weight employed by P-RANK for XML-format documents; it represents the location of a term in the document. The deeper a term's position is in a document, the more specific its content is, and thus it should be weighted higher. P-RANK uses a simple incremental weight value for this purpose: the root is weighted 1, the first level is weighted 2, the second level is weighted 3, and so on down to the lowest level. For flat text documents, the depth weights are all the same.

2.2 P-tree Structure

As noted earlier, P-RANK is built on a novel data structure, the P-tree (Predicate tree). P-trees were developed to facilitate compression and very fast logical operations on bit sequential (bSQ) data [3]. We sort the data on the highest-order bSQ file of the attributes and build special P-trees that efficiently reduce data accesses by filtering out "bit holes", i.e., runs of consecutive 0's. A P-tree can be 1-dimensional, 2-dimensional, 3-dimensional, and so on. In this brief review, we focus on 1-dimensional P-trees.

A 1-bit in a node of a one-dimensional P-tree corresponds to a segment for which the predicate holds. The predicate can be a particular bit position of a particular attribute or, more generally, a set of values for each of a set of attributes. The Pure-1 P-tree (P1-tree) and the Not-Pure-0 P-tree (NP0-tree) are two such predicate P-trees. In Figure 1, one attribute coded in a 3-bit binary format is split vertically into three bSQ files. We sort them according to the leftmost bSQ file and partition them half-by-half recursively. The resulting P-trees (P1-tree and NP0-tree) are shown in the middle and on the right. We identify each partitioned segment using a Pid, the string of successive sub-segment numbers (0 or 1, separated by ".", as in IP addresses). Thus, the Pid of the bolded and underlined segment in Figure 1 is 1.0.
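To make the construction concrete, the sketch below builds a 1-dimensional P1-tree or NP0-tree from one bSQ bit file by the half-by-half recursive partitioning just described. It is a minimal illustration only: the node layout and the names (Node, build) are assumptions made here for exposition, not the optimized P-RANK implementation of [7].

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative node layout for a 1-D predicate tree built over one bSQ bit file.
struct Node {
    bool value;                           // P1-tree: segment is pure 1s; NP0-tree: segment is not pure 0s
    std::unique_ptr<Node> left, right;    // children exist only for mixed segments
};

// Build a P1-tree (p1 == true) or an NP0-tree (p1 == false) over bits[lo, hi)
// by half-by-half recursive partitioning; pure segments become leaves.
std::unique_ptr<Node> build(const std::vector<int>& bits, std::size_t lo, std::size_t hi, bool p1) {
    bool all1 = true, all0 = true;
    for (std::size_t i = lo; i < hi; ++i) {
        all1 = all1 && bits[i] == 1;
        all0 = all0 && bits[i] == 0;
    }
    auto n = std::make_unique<Node>();
    if (all1 || all0 || hi - lo <= 1) {   // pure (or single-bit) segment: stop partitioning
        n->value = p1 ? all1 : !all0;
        return n;
    }
    std::size_t mid = lo + (hi - lo) / 2; // half-by-half split, as in Figure 1
    n->value = p1 ? false : true;         // mixed segment: not pure 1, but not pure 0 either
    n->left  = build(bits, lo, mid, p1);
    n->right = build(bits, mid, hi, p1);
    return n;
}
```

Because a pure segment is stored as a single leaf, long runs of identical bits collapse into one node, which is where the compression of the bSQ data comes from.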
Logical AND and OR are the most important and most frequently used P-tree operations. By using P-tree operations, we filter out the "bit holes" of consecutive 0's and obtain the mixed segments; only the mixed segments are then loaded into main memory, which reduces data accesses. The AND and OR operations using NP0-trees are illustrated in Figure 2.

In Figure 2, a) and b) are two NP0-trees, and c) and d) are their AND result and OR result, respectively. From c), we know that the bits in the position range [0, 4] are pure 0, since the segment with Pid 0 is 0; we therefore only need to load the segment with Pid 1. The logical operation results calculated in this way are exactly the same as ANDing the two bit sequences directly, but with fewer data accesses. There are several ways to perform P-tree logical operations. The basic way is to perform the operations level by level, starting from the root level. Details of the P-tree implementation can be found in [7].

Figure 1. Construction of 1-D Basic P-trees
Figure 2. P-tree Operations: AND and OR

3 P-RANK SYSTEM ARCHITECTURE

Figure 3 depicts an abstract-level view of the architecture of P-RANK. Documents are transformed into a document-by-term matrix using the weighting schemes and the optional pre-processing steps discussed in [9]. This matrix is then, in turn, converted to its P-tree representation using the P-tree Capture Interface. The result of this last step is fed into the P-RANK engine. When users submit keywords to the system, P-RANK first applies to the keywords any optional pre-processing steps that were performed on the documents, such as reducing keywords to their original forms or filtering some of them using stop lists. After that, P-RANK matches the pre-processed keywords using the EIN-ring-based ranking and sorting discussed in Section 4. This is accomplished using the P-tree version of the document-by-term matrix. Finally, a ranked list of documents matching the keywords is returned to the user.

Figure 3. Architecture of P-RANK

4 EFFICIENTLY RANKING KEYWORD QUERIES

We now present the weighted ranking scoring for keyword queries over biomedical documents using EIN-rings, which are defined in [4]. Consider a keyword search query Q = (k1, k2, ..., kn), where k1, k2, ..., kn are the input keywords. Let Pmask be the P-tree representing the result list, which contains all the documents containing all the input keywords. A simplified prototype of the vector model discussed in Section 2 is shown in Table 1. We encode each column in Table 1 in bits and represent each bit position by a P-tree. The calculation of Pmask is as follows. First, we find the corresponding input keywords in the Term Existence section of Table 1; the P-tree representing the result list of the documents containing all the input keywords, Pmask, is then just the ANDing of all the P-trees of the participating keywords, that is, Pmask = Pk1 ∧ Pk2 ∧ ... ∧ Pkn, where Pk1, Pk2, ..., Pkn are the P-trees of the participating keywords.

Table 1. Simplified Prototype of the Data Model

         Term Existence     Term Weight (W1)         Refer. Weight (W2)   Depth Weight (W3)
         T1    ...   Tk     WT1      ...   WTk
  Doc1   1/0   ...   1/0    W1,1,1   ...   W1,k,1    W2,1                 W3,1
  Doc2   1/0   ...   1/0    W1,1,2   ...   W1,k,2    W2,2                 W3,2
  ...    ...   ...   ...    ...      ...   ...       ...                  ...
  Docn   1/0   ...   1/0    W1,1,n   ...   W1,k,n    W2,n                 W3,n
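As an illustration of the Pmask calculation above, the sketch below computes the mask over the Term Existence columns of Table 1 viewed as plain bit vectors over the documents; the names (computePmask, termExistence) are hypothetical. P-RANK performs the same conjunction on the compressed P-trees, level by level as described in Section 2.2, rather than on raw bit vectors.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative sketch (hypothetical names): each keyword's Term Existence
// column is viewed as an uncompressed bit vector over the n documents.
// Pmask = Pk1 AND Pk2 AND ... AND Pkn over the participating keywords.
std::vector<char> computePmask(
        const std::unordered_map<std::string, std::vector<char>>& termExistence,
        const std::vector<std::string>& keywords,
        std::size_t numDocs) {
    std::vector<char> pmask(numDocs, 1);                 // start with every document selected
    for (const std::string& kw : keywords) {
        auto it = termExistence.find(kw);
        if (it == termExistence.end())                   // keyword absent from the collection
            return std::vector<char>(numDocs, 0);        // no document can contain all keywords
        for (std::size_t d = 0; d < numDocs; ++d)
            pmask[d] = pmask[d] && it->second[d];        // AND with this keyword's existence column
    }
    return pmask;
}
```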
Consider each weight in the W1, W2, and W3 sections of Table 1 as a dimension of the search space X. We then have a space of (k+2) dimensions (W1 contributes k term dimensions, plus W2 and W3, so k+2 in total), X = (W11, ..., W1k, W2, W3), and we denote the set of EIN-rings by R(Xstart, r, r+Δ), where Xstart = (xstart-1, xstart-2, ..., xstart-(k+2)) is the center of the rings of radii r, r = r1, r2, ..., r(k+2), each ri being a radius associated with dimension Wi, and Δ being a fixed interval. Each ri is referred to as the internal radius of the ring with center Xstart, and ri+Δ is the corresponding external radius of the ring. The calculation of the points within the EIN-rings R(Xstart, r, r+Δ) proceeds as follows. Xstart is the point in X having the largest weight values, i.e., xstart-i = max(Wi) for every i. Initially, each ri is zero; then ri is increased in steps of Δ until max(Wi) is reached. Each step yields a bigger EIN-ring encompassing all the rings generated before it.

Let Pri be the P-tree representing all points Y in X that lie within the scope of the EIN-ring R(Xstart, ri, ri+Δ). Note that Pri is just the predicate tree corresponding to the predicate xstart-i - ri - Δ < Y ≤ xstart-i - ri, where Y ranges over the points (w1, w2, ..., wk+2) in X. We calculate Pri by ANDing P(Y ≤ Xstart - r) and the complement of P(Y ≤ Xstart - r - Δ). For the sake of clarity, we refer to P(Y ≤ Xstart - r) as Pmin, to P(Y ≤ Xstart - r - Δ) as Pmax, and to the complement of Pmax as P'max. Pmin is shown as the shaded area of a) and P'max as the shaded area of b) in Figure 4.

Figure 4. Calculation of Data Points within the EIN-ring R(Xstart, r, r+Δ)

According to the calculation formula of the EIN-ring, Pmin is calculated as

  Pmin = P(Y ≤ Xstart - r) = P(w1 ≤ xstart-1 - r1) ∧ P(w2 ≤ xstart-2 - r2) ∧ ... ∧ P(wk+2 ≤ xstart-(k+2) - r(k+2)),   (1)

where each P(wi ≤ xstart-i - ri) is calculated as explained in [4]. Similarly, we calculate Pmax as

  Pmax = P(Y ≤ Xstart - r - Δ) = P(w1 ≤ xstart-1 - r1 - Δ) ∧ P(w2 ≤ xstart-2 - r2 - Δ) ∧ ... ∧ P(wk+2 ≤ xstart-(k+2) - r(k+2) - Δ).   (2)

Pri can then be calculated by ANDing Pmin and P'max. Note that the set of points in Pri that contain all the keywords can be derived by ANDing Pmask with Pri.

The set of EIN-rings along every dimension gives us an intrinsic ranking of the points along that dimension. Points lying in the same ring have the same rank value, while points lying in different rings have different rank values. Rings with smaller external radii get higher rank values along that dimension. For example, points in the ring with external radius max(Wi) get a rank value of 1, points in the ring with external radius max(Wi)-Δ get a rank value of 2, points in the ring with external radius max(Wi)-2Δ get a rank value of 3, and so on. After obtaining the rank values of all points along every dimension using the EIN-rings, we calculate the rank order of the points in the space as the weighted sum of their rank values over all dimensions. The weights of the dimensions can be provided by domain experts or derived by machine learning algorithms; a detailed discussion of these weights is beyond the scope of this paper.
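To illustrate the ranking step, the sketch below computes the per-dimension rank value and the weighted sum directly from the weight values; Δ (delta) is the fixed ring interval of a dimension, and the names (ringRankValue, rankScore) are illustrative assumptions. P-RANK itself obtains these rank values through the P-tree EIN-ring masks of equations (1) and (2) rather than by scanning the weights.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch (hypothetical names): rank value of a weight w along one
// dimension. Rings of width delta are centered at the dimension's maximum
// weight; the ring with the largest external radius gets value 1, and every
// ring closer to the maximum gets one more, so higher weights score higher.
int ringRankValue(double w, double maxW, double delta) {
    int nRings = static_cast<int>(std::ceil(maxW / delta));          // number of delta-wide rings
    int ring   = static_cast<int>(std::floor((maxW - w) / delta));   // 0 = ring nearest the center
    if (ring >= nRings) ring = nRings - 1;                           // clamp the farthest points
    return nRings - ring;
}

// Overall rank order of one document: weighted sum of its rank values over all
// k+2 dimensions; the dimension weights may come from domain experts or be learned.
double rankScore(const std::vector<double>& docWeights,
                 const std::vector<double>& maxW,
                 const std::vector<double>& delta,
                 const std::vector<double>& dimWeight) {
    double score = 0.0;
    for (std::size_t i = 0; i < docWeights.size(); ++i)
        score += dimWeight[i] * ringRankValue(docWeights[i], maxW[i], delta[i]);
    return score;
}
```

Documents whose weights lie closer to the per-dimension maxima fall into rings with smaller external radii and therefore accumulate larger scores.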
5 PERFORMANCE ANALYSIS

We implemented P-RANK in C++ and tested it on data from KDD Cup 2002 Task 2, which provides 15,234 MEDLINE abstracts together with a list of the genes mentioned in each abstract [8]. In order to evaluate the efficiency of our approach, we prepared the abstracts into three groups of different sizes, denoted DB1, DB2, and DB3, which contain 1,000, 7,000, and 15,000 abstracts, respectively. We also implemented another version of P-RANK using an inverted list. Details of the P-tree implementation can be found in [7]. The average run time of a single keyword ranking query on the different test data sizes, executed on a 1 GHz Pentium PC running Debian Linux 4.0, is listed in Table 2 and plotted in Figure 5.

Figure 5. Comparison of the Inverted List approach and P-RANK (average run time in seconds over DB1, DB2, and DB3)

Table 2. Average Run Time (s)

                  1,000 abstracts (DB1)   7,000 abstracts (DB2)   15,000 abstracts (DB3)
  Inverted List   60                      306                     1224
  P-RANK          22                      97                      254

Compared to the inverted list implementation, P-RANK is not only faster on all three data sizes but also has a lower growth rate as the data size increases, which indicates better scalability. The improvement in speed is mainly due to the efficiency of the fast logical operations and the optimized P-tree formulation; a drastic improvement is shown when the size of the data is very large (see the case of a 5000x20000 matrix in [1][2]). The scalability is mainly related to the compression capability of P-trees [1][3]. The accuracy could be further improved by weight-optimization methods such as genetic algorithms; the details of optimizing the weights and improving the accuracy will be explored in future work.

6 CONCLUSION

In this paper, we described the architecture, implementation, and evaluation of the P-RANK system, built to address the requirement for efficient ranked keyword queries over biomedical documents. P-RANK contributes a new and efficient keyword-based query system characterized by speed and compression owing to the use of P-trees. P-RANK's weighted ranking reflects the fact that molecular biologists who review these papers look mainly for certain parts of a paper when extracting experimental evidence. Experimental comparisons demonstrated P-RANK's overall superiority for ranked keyword queries.

REFERENCES

[1] Q. Ding, M. Khan, A. Roy, and W. Perrizo, "The P-tree Algebra," ACM Symposium on Applied Computing, 2002.
[2] M. Khan, Q. Ding, and W. Perrizo, "K-Nearest Neighbor Classification on Spatial Data Streams Using P-trees," PAKDD, 2002.
[3] W. Perrizo, "Peano Count Tree Technology Lab Notes," Technical Report NDSU-CS-TR-01-1, 2001. http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/pct.html. January 2003.
[4] F. Pan, B. Wang, D. Ren, X. Hu, and W. Perrizo, "Proximal Support Vector Machine for Spatial Data Using Peano Trees," CAINE, 2003.
[5] S. Brin and L. Page, "The Anatomy of a Large-Scale Hypertextual Web Search Engine," WWW Conference, 1998.
[6] J. Kleinberg, "Authoritative Sources in a Hyperlinked Environment," JACM 46(5), 1999.
[7] P-tree World Wide Web Homepage. http://midas11.cs.ndsu.nodak.edu/~ptree/. July 2003.
[8] KDD Cup 2002 Task 2 Web Page. http://www.biostat.wisc.edu/~craven/kddcup/task2training.zip
[9] I. Rahal and W. Perrizo, "An Optimized P-tree Based KNN Approach for Text Categorization," ACM Symposium on Applied Computing, 2003.
[10] J. Han and M. Kamber, "Data Mining: Concepts and Techniques," Morgan Kaufmann, 2001.
[11] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing," Communications of the ACM 18, pp. 613-620, 1975.
[12] G. Salton and M. J. McGill, "Introduction to Modern Information Retrieval," McGraw-Hill, 1983.