Efficient Proximal Support Vector Machine with Boosting

advertisement
Efficient Ranking of Keyword Queries Using P-trees
Fei Pan, Imad Rahal, Yue Cui, William Perrizo
Computer Science Department
North Dakota State University
Fargo, ND 58105, USA
fei.pan@ndsu.nodak.edu
Abstract
In this paper, we describe the architecture,
implementation, and evaluation of the P-RANK system
built to address the requirement for extracting evidences
of specific products of genes from biomedical papers. PRANK is a system that accepts user interests in the form
of keywords, which integrate different depth weights into
the ranking score and highlights that molecular biologists
who review these papers looking mainly for the certain
part of a paper to extract experimental evidences, in turn,
returns a ranked list pertaining to the users’ interests. Our
contributions include presenting a new efficient keyword
query system using a data structure called the P-tree, and
a fast weighted ranking method using the EIN-ring.
keyword query system using a data structure called the Ptree and a fast weighted ranking method using the EINring, which aid in the process of looking for evidences of
specific products of genes from biomedical papers,
especially from abstracts, experiments, and figure
legends. We experimentally compare with other rankedkeyword searching approaches and results have shown
that P-RANK is a very fast ranked keyword query system.
This paper is organized as follows. In section 2, we
describe the data model and representation used in PRANK. In section 3, we present an overview of PRANK’s system architecture. In section 4, we present the
voting-based ranking method using the EIN-ring. We
experimentally compare our approach to the traditional
inverted list approach in section 5. Finally, we conclude
this paper in section 6.
Keywords: Keyword Query, Information Retrieval, Ptree.
2
1
INTRODUCTION
Biomedical information exists in both the research
literature and various structured databases. With a huge
number of papers and books published yearly, access to
biomedical information has become in great demand
especially by molecular biologists. Molecular biologists
are interested in retrieving a concise ranked set of
references of interest without having to manually go
through all the documents in a given collection.
Traditional methods for ranking queries results, such
as Google [5] [6], are based on the use of ranking
formulas that calculate the frequencies of a document
accessed. However, molecular biologists are mostly
interested in extracting evidences of specific products of
genes, e.g. mRNA and proteins, which cannot be achieved
by general ranking approaches. An automatic analysis of
scientific biomedical papers to extract only the papers that
include experimental-related results is in need.
In this paper, we describe the architecture,
implementation, and evaluation of the P-RANK system
built to address the requirement for extracting evidences
of specific products of genes regarding expression of gene
products from biomedical papers. P-RANK is a system
that accepts user interests in the form of keywords, which
integrate different dimension weights into the ranking
score and evaluate documents accordingly. It returns a
ranked list of reference pertaining to the users’ interests.
Our contributions include presenting a new efficient
2.1
DATA MODEL AND
REPRESENTATION
Data Representation
Most data mining tools and algorithms assume that
the data being mined have some sort of structuring–e.g.
relational tables in databases or data cubes in data
warehouses ([10], chapter 2). Given that text references
have only semi-structure, the task of text mining appears
to be impossible. However, some models have been
developed to impose some sort of structure on text,
namely the term space model [11][12]. In this model, a
document is presented as a vector in the term space –
terms are used as features or dimensions. The data
structure resulting from representing all the documents in
a given collection as term vectors is referred to as the
document-by-term matrix. In this representation, vector
coordinates are terms with numeric values representing
their relevance to the corresponding documents where
higher values mean higher relevance. This process of
giving numeric values to term coordinates is referred to as
term weighting in the literature. We can view weighting
as the process of giving more emphasis to more important
terms.
Term weight measures in document vectors can be
binary with the values 1 and 0 denoting the existence or
absence of the term in the corresponding document,
respectively. However, due to the great loss of
information associated with it, this representation lacks
the accuracy needed. We employ a more sophisticated
representation records, TF/IDF approach, which is the
exact frequency and uniqueness measure of each term in a
document [9]. Depth weight is another weight employed
by P-RANK for XML format documents, which represent
the location of terms. The deeper a term’s position is in a
document, the more specific its contents are and thus they
should be weighted higher. P-RANK uses a simple
incremental weight value to accomplish this purpose. The
root is weighted 1, first level is weighted 2, second level
is weighted 3, and so on until we reach the lowest level.
For flat text documents the depth weights are the same.
Tree Structure
As noted earlier, P-RANK is built on a novel and
genuine data structure, the P-tree (Predicate tree). P-trees
were developed to facilitate compression and very fast
logical operations of bit sequential (bSQ) data [3]. We sort
the highest order bSQ of attributes and build special Ptree to efficiently reduce data accesses by filtering out “bit
holes” consisting of consecutive 0’s. A P-tree can be 1dimensional, 2-dimensional, 3-dimensional, etc. In this
brief review, we will focus on 1-dimensional P-trees.
A 1-bit in nodes of one-dimensional P-tree is
corresponding to segment for which the predicate holds.
The predicate can be a particular bit-position of a
particular attribute or, more generally, a set of values for
each of a set of attributes. Pure 1 P-tree (P1-tree) and Non
Pure0 P-tree (NP0-tree) are two predicate P-Trees. In
Figure 1 , one attribute coded by 3-bit binary format is
stripped vertically into three bSQ file. We sort them
according to the most left bSQ file and partition them
half-by-half recursively. The P-Trees (P1-tree and NP0tree) are shown on the middle and right. We identify
partitioned segment using Pid - the string of successive
sub-segment numbers (0 or 1 separated by “.”, as in IP
addresses). Thus, the Pid of the bolded and underlined
quadrant in Figure 1 is 1.0.
Logic AND and OR are the most important and
frequently used P-tree logical operations. By using P-tree
operations, we filter “big holes” consisting of consecutive
0’s and get the mixed segments. Then we only load the
mixed segments into main memory, thus reducing the data
accesses. The AND and OR operations using NP0-tree are
illustrated in Figure 2 .
In Figure 2 , a) and b) are two NP0-trees, c) and d)
are their AND result and OR result, respectively. From c),
we know that bits at the range of position [0, 4] are pure 0
since the segment with Pid 0 is 0. We only need to load
segment with Pid 1. The logical operation results
calculated in such way are the exactly same as anding two
bit sequence with reduced data accesses. There are several
ways to perform P-tree logical operations. The basic way
is to perform operations level by level starting from the
root level. Details for the P-tree implementation can be
found in [7].
2.2
3
Figure 1
Construction of 1-D Basic P-trees
Figure 2
P-tree Operations: AND and OR
P-RANK SYSTEM ARCHITECTURE
Figure 3 depicts an abstract level view of the
architecture of P-RANK. Documents are transformed to a
document by term matrix using the weighting schemes
and the optional pre-processing steps discussed in [9].
This matrix is then, in turn, converted to its P-tree
representation by using the P-tree Capture Interface. The
result of this last step is fed into the P-RANK engine.
When users submit keywords to the system, P-RANK,
first, applies on the keywords any optional pre-processing
steps that were performed on the documents. This
includes reducing keywords to their original forms or
filtering some of them by using stop lists. After that, PRANK does the required matching of the pre-processed
keywords by EIN-ring-based ranking and sorting as will
be discussed in Section 4. This is accomplished using the
P-tree version of the document by term matrix. Finally, a
ranked list of document matching his/her keywords is
returned to the user.
R(Xstart, r, r+), where Xstart = (xstart-1, xstart-2, ..,
xstart-k+2) is the center for the rings of radii r, r =
r1,r2,…rk+2, where each ri is a radius associated with
dimension Wi and  is a fixed interval. Each ri is referred
to as the internal radius of the ring with center Xstart and
ri+ is the corresponding external radius of the ring. The
calculation of points within EIN-rings R(Xstart, r, r+) is
presented as follows.
Xstart is the point in X having the largest weight
Biological Documents
Representation
and Weighting
Case Folding, Stemming,
and Stop Lists
Document x Term Matrix
P-tree Capture
Interface
P-tree View of the Matrix
User Keywords
Case
Folding
and
Stemming
P-tree
Algebra
Filtering
Stop Lists
P-RANK
ENGINE
Ranking
Ranked List
Figure 3
4
Architecture of P-RANK
EFFICIENTLY RANKING KEYWORD
QUERIES
We now present the weighted ranking scoring for
keyword queries over biomedical documents using EINrings, which is defined in paper [4]. Consider a keyword
search query Q = (k1, k2, …, kn), where k1, k2, …,kn are
the input keywords. Let Pmask be the P-tree representing
the result list that contains all the documents containing
all the input keywords. A simplified prototype vector
model as discussed in Section 2 is shown in Table 1. We
encode each column in Table 1 by bits and represent each
bit position by a P-tree. The calculation of Pmask is as
follows: First, we find the corresponding input keywords
in the Term Existence section of Table 1; the P-tree
representing the result list of the documents containing all
the input keywords, Pmask, is just the “ANDing” of all Ptrees
for
participating
keywords,
that
is
Pmask=Pk1Pk2…Pkn, where Pk1, Pk2, …, and Pkn
are the P-trees for the participating keywords.
Table 1.
Doc1
Doc2
…
Docn
Simplified Prototype of the Data Model
Term
Existence
T1 … Tk
1/0 1/0 1/0
1/0 1/0 1/0
1/0
1/0
1/0
Term Weight W1
WT1
W111
W112
…
W11n
…
…
…
…
…
WTk
W1k1
W1k2
…
W1kn
Refer.
Weight
W2
W11
W12
…
W1n
Depth
Weight
W3
W21
W22
…
W2n
Consider each weight in W1, W2 and W3 sections
in Table 1 as a dimension of the search space X. We have
a space of (k+2)-dimensions (W1 having k term
dimensions, W2 and W3, so, k+2 in total), X = (W11,…,
W1k, W2, W3), and we denote the set of EIN-rings by
values, i.e., xstart-i = max(Wi)  i. Initially, each ri is
zero; then, ri is increased in steps each equal to  until we
reach the max(Wi). Each step yields a bigger EIN-ring
encompassing all the rings generated before it. Let Pri,
be the P-tree representing all points Y in X existing within
the scope of the EIN-ring R(Xstart, ri, ri +). Note that
Pri, is just the predicate tree corresponding to the
predicate Xstart-i - ri -<Y Xstart-i - ri where Y is the
set of points (w1, w2, …, wk+2) in X. We calculate Pri,
by ANDing PY<Xstart-i-ri and P’Y<Xstart-i-ri-. For the
sake of clarity, we will refer to PY<Xstart-i-ri and
P’Y<Xstart-i-ri- as Pmin and P’max respectively. Pmin
is shown in the shadow area of a) and P’max in the
shadow area of b) in Figure 4 . According to the
calculation formula of EIN-ring, Pmin is calculated as
Figure 4
Calculation of Data Points within EIN-ring
R(Xstart, r, r+).
Pmin = PY<Xstart-i-ri = Pw1Xstart-1-r1 Pw2Xstart-2-r2 ….
 Pwk+2Xstart-k+2-rk+2,
(1)
where each PwiXstart-i-ri is calculated as explained in
paper [4]. Similarly we calculate Pmax as
Pmax = PX<Xstart-i-ri-= Pw1Xstart-1-r1- Pw2 Xstart-2-r2 ….  Pwk+2 Xstart-k+2-rk+2-,
(2)
Let P’max be the complement of Pmax. Pri, can
be calculated by ANDing Pmin and P’max. Note that the
set of points in Pri, that contain all the keywords can be
derived by ANDing the Pmask with Pri,.
The set of EIN-rings along every dimension gives
us an intrinsic ranking for the points along that dimension.
Points existing in the same ring have the same rank value
while points existing in other rings have different rank
values. Rings with smaller external radii get higher rank
values along that dimension. For example, points in the
ring with a radius of max(Wi) get a rank value of 1, points
in the ring with a radius of max(Wi)- get a rank value of
2, points in the ring with a radius of max(Wi)-2 get a
rank value of 3 and so on and so forth. After obtaining all
the rank values using the EIN-rings for all points along
every dimension, we calculate the rank order of the points
in space as the weighted summation of all dimensions
rank values. The weights for the dimensions can be
derived by domain experts or machine learning
algorithms. The detailed discussion of weights is beyond
the scope of this paper.
5
PERFORMANCE ANALYSIS
We implemented P-RANK in C++ and test it on
task one data from KDDCup 2002 task2, which provided
15,234 MEDLINE abstracts and a list of genes mentioned
in that abstract [[8]]. In order to compare the efficiency of
our approach, we prepared the abstracts into three size
groups, denoted as DB1, DB2, DB3, which contain 1,000,
7,000, and 15,000 abstracts, respectively. We also
implemented another version of P-RANK using inverted
list. Details for the P-tree implementation can be found in
[7]. The average run time of a single keywork ranking
query on different size of test data executed on a 1 GHz
Pentium PC machine with Debian Linux 4.0 is listed in
Table 2 and plotted in Figure 5 .
Average Run Time (s)
1400
1200
1000
800
Inverted-List
600
P-rank
400
200
0
DB1
DB2
DB3
Figure 5
Comparison of SC-ID and P-RANK.
Table 2.
Results of Score Comparison
Inverted List
P-rank
1000
7000
15000
60
306
1224
22
97
254
Compared to the inverted list implementation, PRank is not only faster on all three size of data, but also
has a lower increasing rate as data size increases, which
indicate a better scalability. The reason for the
improvement in speed is mainly related to the efficiency
resulting from fast logical operation and optimized P-tree
formulation. Drastic improvement is shown when the size
of the data is very large – see the case of 5000x20000
matrix size in [[1]][[2]]. As for scalability, it is mainly
related to the capability of P-tree’s compression [[1][3].
The improvement of accuracy could be achieved by using
weights optimization method, such as genetic algorithms.
The details of optimizing the weights and improving the
accuracy will be explored in the future.
6
CONCLUSION
In this paper, we described the architecture,
implementation, and evaluation of the P-RANK system
built to address the requirement for efficient ranked
keyword query over biomedical documents. P-RANK’s
contributes by presenting a new and efficient keywordbased query system characterized by speed and
compression due to the use of P-trees. P-RANK’s
weighted ranking highlights the fact that molecular
biologists who review these papers looking mainly for the
certain part of a paper to extract experimental evidence.
Experimental comparisons demonstrated P-RANK’s
overall superiority of keyword ranking query.
REFERENCES
[1] Qin Ding, Maleq Khan, Amalendu Roy, and William
Perrizo, The P-tree Algebra, ACM Symposium on
Applied Computing, 2002.
[2] Maleq Khan, Qin Ding and William Perrizo,
“KNearest Neighbor Classification on Spatial Data
Streams Using P-trees”, PAKDD 2002.
[3] Perrizo, W., Peano count tree technology lab notes.
Technical Report NDSU-CS-TR-01-1, 2001.
http://www.cs.ndsu.nodak.edu/~perrizo/classes/785/p
ct.html. January 2003.
[4] Pan, F., Wang, B., Ren, D., Hu, X. and Perrizo, W.,
Proximal Support Vector Machine for Spatial Data
Using Peano Trees, CAINE 2003.
[5] S. Brin, L. Page, The Anatomy of a Large-Scale
Hypertextual Web Search Engine, WWW Conf.,
1998.
[6] J. Kleinberg, “Authoritative Sources in a Hyperlinked
Environment”, JACM 46(5), 1999.
[7] P-tree World Wide Web Homepage. http://midas11.cs.ndsu.nodak.edu/~ptree/. July 2003.
[8] KDDCup2002
Web
Page
http://www.biostat.wisc.edu /~craven/kddcup/task2training.zip
[9] I. Rahal and W. Perrizo, An optimized P-tree based
KNN approach for Text Categorization. ACM
Symposium on Applied Computing, 2003.
[10] J. Han and M. Kamber, “Data Mining: Concepts and
Techniques”, Morgan Kaufmann, 2001.
[11] Salton, Wong, and Yang, "A vector space model for
automatic indexing," Communications of the ACM
18, pp. 613--620, 1975.
[12] Salton and McGill, “Introduction to Modern
Information Retrieval”. McGraw-Hill, 1983.
Download