
International Journal of Engineering Trends and Technology (IJETT) – Volume 15 Number 8 – Sep 2014
XINIX: Extended Inverted Index for Keyword
Search
Jomon Joseph¹, Pankaj Kumar², Shyni S T³
¹M.Tech Student, ²,³Asst. Professor
¹,³School of Computer Sciences, M G University, Kerala, India
²Asst. Professor, Computer Science Dept., FISAT, Kerala, India
Abstract— Keyword search is the easiest way to access the data
in the face of information explosion. Inverted lists are used to
index documents to access those documents according to a set of
keywords efficiently. These inverted lists are large in size. So,
many compression techniques have been proposed to reduce the
storage space and disk I/O time. This paper presents a
convenient index structure, which is an extension of the
Generalized Inverted Index (GINIX). Ginix merges the
consecutive IDs in inverted lists into interval lists and it reduces
the size of the inverted index. The new index structure is called
Extended Inverted Index (XINIX) which extends the structure of
Ginix. The primary objective of Xinix is to minimize the storage
cost. Xinix not only reduces the storage cost, but also increases the
search performance compared with traditional inverted indexes.
Keywords— Inverted Index, Keyword search, Index compression,
Search algorithms
I. INTRODUCTION
Keyword search is a method used to access text datasets.
The datasets include web pages, XML documents, and
relational tables. End users use keyword search to retrieve
documents by typing in keywords as search queries. The
keyword search systems use an inverted index that maps each
word in the dataset to a list of IDs of documents. These lists
are called inverted lists, also known as posting lists. Each
posting list contains a word and the IDs of all documents in
which this word appears, in ascending order.
The concept of the inverted file type of index is as follows.
Assume a set of documents. Each document is assigned a list
of keywords or attributes, with optional relevance weights
associated with each keyword. An inverted file is then the
sorted list of keywords, with each keyword having links to the
documents containing that keyword. This is the kind of index
found in most commercial library systems. The use of an
inverted file improves search efficiency by several orders of
magnitude, a necessity for very large text files.
Real-world datasets, however, are large. Keyword search
systems therefore use compression techniques to minimize
the storage cost. Variable Byte Coding (VBE) and PForDelta [2]
are among the compression techniques used in inverted indexes.
These techniques require decompression during query
processing, which leads to extra computational cost.
This paper presents a new index structure called Xinix
(Extended Inverted Index), which is an extension of the
traditional inverted index and of Ginix (Generalized Inverted
Index) [1][4]. Ginix merges every run of consecutive IDs in a
traditional inverted list into an interval. However, many of the
resulting intervals have a difference of zero or one between
their upper bound and lower bound elements. Xinix merges
consecutive IDs into an interval only when the difference
between the upper bound and the lower bound is greater than
one; all other IDs are kept as plain IDs. This extension reduces
the storage cost more effectively than Ginix, and existing
search algorithms can still be used with Xinix.
II. BASIC CONCEPTS OF XINIX
An inverted index is an index data structure that stores a
mapping from content, such as words or digits, to its place in a
database file or in a collection of documents. The main
objective of an inverted index is to provide fast text searches
at the cost of increased processing when documents are added
to the index. Table I(a) shows a collection of titles of several
papers and Table I(b) shows its inverted index.
TABLE I
(a) A Collection of Several Paper Titles
ID   Content
1    Introduction
2    Procedure-Oriented Paradigm
3    Procedure-Oriented Development Tools
4    Object-Oriented Paradigm
5    Object-Oriented Notations and Graphs
6    Steps in Object Oriented Analysis
7    Introduction to Prototyping Paradigm
(b) Inverted Index
Word          ID
Introduction  1, 7
Procedure     2, 3
Oriented      2, 3, 4, 5, 6
Development   3
Object        4, 5, 6
Paradigm      2, 4, 6
Table I(a) shows a dataset of 7 paper titles. The
corresponding inverted index is shown in Table I(b). The inverted
index contains the important words in the dataset and their
corresponding document IDs. If the dataset is large, the
inverted index is also large, and its lists contain many
consecutive IDs. Table I(c) shows the Generalized Inverted
Index for the dataset.
(c) Generalized Inverted Index
Word          ID
Introduction  (1,1), (7,7)
Procedure     (2,3)
Oriented      (2,6)
Development   (3,3)
Object        (4,6)
Paradigm      (2,2), (4,4), (6,6)
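The interval lists of Table I(c) can be derived from the ID lists of Table I(b) by merging each run of consecutive IDs. The following is a minimal Python sketch of that merging, not the Ginix authors' implementation; the function name `to_intervals` and the dictionary layout are assumptions made for illustration.

```python
def to_intervals(ids):
    """Merge a sorted list of document IDs into (lb, ub) intervals."""
    intervals = []
    for doc_id in ids:
        if intervals and doc_id == intervals[-1][1] + 1:
            # The ID extends the current run: raise its upper bound.
            intervals[-1] = (intervals[-1][0], doc_id)
        else:
            # The ID starts a new run, initially a single-element interval.
            intervals.append((doc_id, doc_id))
    return intervals


# Posting lists copied from Table I(b).
inverted_index = {
    "Introduction": [1, 7],
    "Procedure": [2, 3],
    "Oriented": [2, 3, 4, 5, 6],
    "Development": [3],
    "Object": [4, 5, 6],
    "Paradigm": [2, 4, 6],
}

ginix = {word: to_intervals(ids) for word, ids in inverted_index.items()}
```

Note that every isolated ID, such as the entry for "Development", still becomes a two-integer interval in this representation.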
There are many consecutive IDs in traditional inverted
lists. In Ginix, these consecutive IDs are merged
into intervals. Each interval r can be represented
by two numbers, its lower bound lb(r) and its upper bound ub(r).
Ginix is not convenient when the difference between ub(r) and
lb(r) is 0 or 1. If an interval (l, u) is a single-element
interval, two integers are still needed to represent it, and an
interval that covers only two IDs saves nothing over storing the
two IDs directly. If there are many single-element intervals, the
space cost will be large. Table I(d) shows the proposed inverted
index structure, which is an extension of the Generalized
Inverted Index.
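(d) Extended Inverted Index
Word          ID
Introduction  1, 7
Procedure     2, 3
Oriented      (2, 6)
Development   3
Object        (4, 6)
Paradigm      2, 4, 6
In the case of Xinix, if an interval is a single-element
interval (l = u), it does not need two integers to represent the
interval. The boundaries are dropped and the single ID is stored.
Xinix merges the consecutive IDs into intervals only when the
difference is greater than 1. During the decoding process, only
the intervals need processing, so the structure reduces the
computational cost.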
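The Xinix encoding rule can be sketched in Python as follows. This is an illustrative sketch only: the name `to_xinix`, the use of tuples for intervals and plain integers for single IDs, and the helper `stored_integers` are assumptions, and the count ignores any marker a real storage format would need to tell an interval from a plain ID.

```python
def to_xinix(ids):
    """Encode sorted IDs: a run with ub - lb > 1 becomes an (lb, ub) tuple,
    every other ID is kept as a plain integer."""
    encoded = []
    i = 0
    while i < len(ids):
        # Find the end j of the run of consecutive IDs starting at i.
        j = i
        while j + 1 < len(ids) and ids[j + 1] == ids[j] + 1:
            j += 1
        lb, ub = ids[i], ids[j]
        if ub - lb > 1:
            encoded.append((lb, ub))          # three or more consecutive IDs
        else:
            encoded.extend(ids[i:j + 1])      # one or two IDs: keep as they are
        i = j + 1
    return encoded


def stored_integers(encoded):
    """Count integers needed: two per interval, one per plain ID."""
    return sum(2 if isinstance(entry, tuple) else 1 for entry in encoded)


# Posting lists copied from Table I(b).
inverted_index = {
    "Introduction": [1, 7],
    "Procedure": [2, 3],
    "Oriented": [2, 3, 4, 5, 6],
    "Development": [3],
    "Object": [4, 5, 6],
    "Paradigm": [2, 4, 6],
}

xinix = {word: to_xinix(ids) for word, ids in inverted_index.items()}
traditional_count = sum(len(ids) for ids in inverted_index.values())
xinix_count = sum(stored_integers(enc) for enc in xinix.values())
```

On this toy dataset the traditional lists hold 16 integers and the Xinix lists hold 12, while the nine intervals of Table I(c) need 18 integers.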
III. SEARCH OPERATION
There are two types of search operations, union and
intersection. The union operation is the OR query operation,
which returns every document that contains at least one of the
query keywords. The intersection operation is used to support
AND query semantics, in which only those documents that
contain all the query keywords are returned.
A traditional keyword search system first retrieves the
compressed inverted list for each keyword, then decompresses
these lists into ID lists, and then calculates the union or
intersection of these lists.
The search operation is quite simple in the case of Xinix.
Since Xinix contains both interval lists and ID lists, the interval
lists should first be converted to ID lists. To convert an interval
list into an ID list, a scan-line pointer can be used. For each
interval, the lower bound and upper bound elements are stored in
two variables, and a pointer scans the interval from the
lower bound to the upper bound, extracting all the
intermediate IDs. The result is an ID list, to which the union or
intersection algorithms can be applied.
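The conversion from a mixed Xinix list back to a plain ID list can be sketched as below. The representation is an assumption carried over for illustration: an interval is a `(lb, ub)` tuple and a single ID is a plain integer; `expand` is a hypothetical name.

```python
def expand(encoded):
    """Turn a mixed list of IDs and (lb, ub) intervals into a plain ID list."""
    ids = []
    for entry in encoded:
        if isinstance(entry, tuple):
            lb, ub = entry
            # Scan from the lower bound to the upper bound, inclusive.
            ids.extend(range(lb, ub + 1))
        else:
            # Plain IDs need no decoding at all.
            ids.append(entry)
    return ids
```

For example, `expand([(2, 6)])` yields the ID list of "Oriented" in Table I(b), and a list without intervals is returned unchanged.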
Algorithm 1 shows the algorithm for the union operation.
Algorithm 1: Union (ID_List1[], ID_List2[]):
Step 1: Use two index variables i and j, with initial values i = 0, j = 0.
Step 2: If ID_List1[i] is smaller than ID_List2[j], then store
ID_List1[i] and increment i.
Step 3: If ID_List1[i] is greater than ID_List2[j], then store
ID_List2[j] and increment j.
Step 4: If both are the same, then store either of them and increment
both i and j.
Step 5: Repeat Steps 2 to 4 until one list is exhausted, then store
the remaining elements of the other list.
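The steps of Algorithm 1 can be sketched in Python as follows. The sketch assumes both inputs are sorted in ascending order and contain no duplicates; the function name `union` is chosen here for illustration.

```python
def union(id_list1, id_list2):
    """Merge-style union of two sorted ID lists."""
    result = []
    i = j = 0
    while i < len(id_list1) and j < len(id_list2):
        if id_list1[i] < id_list2[j]:
            result.append(id_list1[i])
            i += 1
        elif id_list1[i] > id_list2[j]:
            result.append(id_list2[j])
            j += 1
        else:
            # Same ID in both lists: store it once and advance both.
            result.append(id_list1[i])
            i += 1
            j += 1
    # At most one of the two lists still has elements; append its tail.
    result.extend(id_list1[i:])
    result.extend(id_list2[j:])
    return result
```

Each ID is visited once, so the running time is linear in the combined length of the two lists.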
This algorithm follows the merge procedure and can be
used to find the union of arrays, lists etc. The same method
can be used to calculate the intersections in the ID lists.
Algorithm 2 describes the calculation of intersections for two
ID lists.
Algorithm 2: Intersection (ID_LIST1[], ID_LIST2[]):
Step 1: Use two index variables i and j, with initial values i = 0, j = 0.
Step 2: If ID_LIST1[i] is smaller than ID_LIST2[j], then
increment i.
Step 3: If ID_LIST1[i] is greater than ID_LIST2[j], then
increment j.
Step 4: If both are the same, then store either of them and increment
both i and j.
Steps 2 to 4 are repeated until either list is exhausted.
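A minimal Python sketch of Algorithm 2 is given below, again assuming sorted, duplicate-free ID lists; the name `intersection` is chosen here for illustration.

```python
def intersection(id_list1, id_list2):
    """Merge-style intersection of two sorted ID lists."""
    result = []
    i = j = 0
    # Stop as soon as either list is exhausted: no further matches are possible.
    while i < len(id_list1) and j < len(id_list2):
        if id_list1[i] < id_list2[j]:
            i += 1
        elif id_list1[i] > id_list2[j]:
            j += 1
        else:
            # The ID occurs in both lists: keep it and advance both.
            result.append(id_list1[i])
            i += 1
            j += 1
    return result
```

With the lists of Table I(b), an AND query for "Oriented" and "Paradigm" would call `intersection([2, 3, 4, 5, 6], [2, 4, 6])`.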
IV. DOCUMENT REORDERING CONCEPT
Document reordering can increase the performance of the
Extended Inverted Index for finding the exact keyword.
Silvestri [6] proposed a simple method called Sigsort, which
sorts web pages in lexicographical order based on their URLs.
This method can be adapted for Xinix. Another method is to sort
all the words in descending order of their frequencies [5]. This
can improve the search performance of Xinix.
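The URL-based reordering can be illustrated with a small Python sketch. Everything in it is hypothetical: the URLs, the function name `reassign_ids_by_url`, and the assumption that pages of the same site tend to share words, which is what would make their new IDs consecutive.

```python
def reassign_ids_by_url(urls):
    """Return a mapping old ID -> new ID, where new IDs follow URL order.

    Old IDs are the 1-based positions of the URLs in the input list.
    """
    order = sorted(range(1, len(urls) + 1), key=lambda old: urls[old - 1])
    return {old: new for new, old in enumerate(order, start=1)}


# Hypothetical pages; old IDs 1..4 follow this (arbitrary) crawl order.
urls = [
    "http://b.example/page2",
    "http://a.example/page1",
    "http://b.example/page1",
    "http://a.example/page2",
]
new_id = reassign_ids_by_url(urls)

# A word that occurs only on the two a.example pages (old IDs 2 and 4).
old_postings = [2, 4]
new_postings = sorted(new_id[doc] for doc in old_postings)
```

After reordering, the two pages receive adjacent IDs, so longer runs of consecutive IDs become more likely and more of them can be stored as intervals.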
V. EXPERIMENTS
The performance and scalability of Xinix were evaluated by
experiments on a Linux server. The Pubmed dataset was used in the
experiments. Pubmed is a large dataset, which contains more
than 35000 medical journals. Xinix was implemented using
Python with SQL server.
Figure 1 shows the index sizes obtained with the different
structures: the traditional inverted index, the Generalized
Inverted Index and the Extended Inverted Index. The size of the
traditional inverted index for the Pubmed dataset is 70 MB. The
Generalized Inverted Index, constructed by merging the
consecutive IDs into interval lists, reduces the size to 52.5
MB. The Extended Inverted Index reduces the size further, to
38.3 MB. So, the experiments show that the
Extended Inverted Index reduces the storage cost compared to both
the traditional inverted index and the Generalized Inverted Index.
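The relative savings implied by these sizes can be worked out with a few lines of Python. Only the three sizes come from the experiment; the helper `saving_percent` is introduced here for illustration.

```python
# Index sizes reported for the Pubmed dataset, in MB.
sizes_mb = {"traditional": 70.0, "ginix": 52.5, "xinix": 38.3}


def saving_percent(old, new):
    """Percentage by which `new` is smaller than `old`."""
    return 100.0 * (old - new) / old


ginix_vs_traditional = saving_percent(sizes_mb["traditional"], sizes_mb["ginix"])
xinix_vs_traditional = saving_percent(sizes_mb["traditional"], sizes_mb["xinix"])
xinix_vs_ginix = saving_percent(sizes_mb["ginix"], sizes_mb["xinix"])
```

Rounded to one decimal place, Ginix is 25.0% smaller than the traditional index, and Xinix is 45.3% smaller than the traditional index and 27.0% smaller than Ginix.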
Fig. 1 Comparison of Different Indexes
VI. CONCLUSIONS
This paper presents a new index structure called Extended
Inverted Index (Xinix) for keyword search in datasets. Xinix has an
effective index structure and algorithms to support keyword
search. Document reordering can be used to improve the
search performance of Xinix. Xinix not only requires less
storage than the traditional inverted index, but also has a
better search speed.
ACKNOWLEDGMENT
This work was conducted in the Centre for High Performance
Computing Lab, FISAT, Angamaly, Kerala, using facilities
provided by M G University, Kerala. We express our special
thanks to Dr. R. Vijayakumar, Professor, School of Computer
Sciences, M G University.
REFERENCES
[1] Hao Wu, Guoliang Li, and Lizhu Zhou, Generalized Inverted Index for
Keyword Search, in IEEE Transactions on Knowledge and Data
Mining, vol. 8, no. 1, 2013.
[2] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of
inverted indexes for fast query evaluation, in Proc. of the 25th Annual
International ACM SIGIR Conference on Research and Development
in Information Retrieval, Tampere, Finland, 2002, pp. 222-229.
[3] M. Zukowski, S. Héman, N. Nes, and P. A. Boncz, Super-scalar RAM-CPU
cache compression, in Proc. of the 22nd International Conference
on Data Engineering, Atlanta, Georgia, USA, 2006, pp. 59.
[4] W. Shieh, T. Chen, J. J. Shann, and C. Chung, Inverted file
compression through document identifier reassignment, Information
Processing and Management, vol. 39, no. 1, pp. 117-131, 2003.
[5] R. Blanco and A. Barreiro, TSP and cluster-based solutions to the
reassignment of document identifiers, Information Retrieval, vol. 9, no.
4, pp. 499-517, 2006.
[6] F. Silvestri, Sorting out the document identifier assignment problem, in
Proc. of the 29th European Conference on IR Research, Rome, Italy,
2007, pp. 101-112.
ISSN: 2231-5381
http://www.ijettjournal.org