International Journal of Engineering Trends and Technology (IJETT) – Volume 15 Number 8 – Sep 2014

XINIX: Extended Inverted Index for Keyword Search

Jomon Joseph¹, Pankaj Kumar², Shyni S T³
¹ M.Tech Student, ²,³ Asst. Professor
¹,³ School of Computer Sciences, M G University, Kerala, India
² Asst. Professor, Computer Science Dept., FISAT, Kerala, India

Abstract: Keyword search is the easiest way to access data in the face of the information explosion. Inverted lists index documents so that the documents matching a set of keywords can be retrieved efficiently. Because these inverted lists are large, many compression techniques have been proposed to reduce storage space and disk I/O time. This paper presents a convenient index structure that extends the Generalized Inverted Index (Ginix). Ginix merges consecutive IDs in inverted lists into interval lists, which reduces the size of the inverted index. The new structure, called the Extended Inverted Index (Xinix), extends the structure of Ginix. The primary objective of Xinix is to minimize storage cost. Compared with traditional inverted indexes, Xinix not only reduces storage cost but also increases search performance.

Keywords: Inverted Index, Keyword search, Index compression, Search algorithms

I. INTRODUCTION

Keyword search is a method used to access text datasets, which include web pages, XML documents, and relational tables. End users retrieve documents by typing keywords as search queries. Keyword search systems use an inverted index that maps each word in the dataset to a list of IDs of the documents in which it appears. These lists are called inverted lists, also known as posting lists. Each posting list contains a word and, in ascending order, the IDs of all documents where this word appears.

The concept of the inverted file type of index is as follows. Assume a set of documents.
Each document is assigned a list of keywords or attributes, with optional relevance weights associated with each keyword. An inverted file is then the sorted list of keywords, with each keyword having links to the documents containing that keyword. This is the kind of index found in most commercial library systems. The use of an inverted file improves search efficiency by several orders of magnitude, a necessity for very large text files.

Real-world datasets, however, are large, so keyword search systems use compression techniques to minimize storage cost. Variable Byte Encoding (VBE) [2] and PForDelta [3] are two compression techniques used in inverted indexes. Compressed lists must be decompressed during query processing, which adds computational cost.

This paper presents a new index structure called Xinix (Extended Inverted Index), which extends the traditional inverted index and Ginix (Generalized Inverted Index) [1][4]. Ginix merges every run of consecutive IDs in a traditional inverted list into an interval, so many of its intervals have a difference between upper bound and lower bound of zero or one. Xinix merges a run of consecutive IDs into an interval only when the difference between the upper bound and the lower bound is greater than one; shorter runs remain plain IDs. This extension reduces storage cost more effectively than Ginix, and existing search algorithms can still be used with Xinix.

II. BASIC CONCEPTS OF XINIX

An inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file or in a collection of documents. The main objective of an inverted index is to provide fast text search at the cost of increased processing when documents are indexed.
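The merging rule described in the Introduction can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, and representing an interval as a tuple next to plain integer IDs is an in-memory convenience chosen here, not the storage format of the actual implementation, which the paper does not specify.

```python
from typing import List, Tuple, Union

Interval = Tuple[int, int]      # (lower bound, upper bound), both inclusive
Entry = Union[int, Interval]    # assumed in-memory form: plain ID or interval


def runs(ids: List[int]) -> List[Interval]:
    """Split a sorted, duplicate-free ID list into maximal runs of consecutive IDs."""
    out: List[Interval] = []
    for doc_id in ids:
        if out and doc_id == out[-1][1] + 1:
            out[-1] = (out[-1][0], doc_id)   # extend the current run
        else:
            out.append((doc_id, doc_id))     # start a new run
    return out


def encode_ginix(ids: List[int]) -> List[Interval]:
    """Ginix-style list: every run becomes an interval, even a single ID."""
    return runs(ids)


def encode_xinix(ids: List[int]) -> List[Entry]:
    """Xinix-style list: keep an interval only when ub - lb > 1."""
    out: List[Entry] = []
    for lb, ub in runs(ids):
        if ub - lb > 1:
            out.append((lb, ub))
        else:
            out.extend(range(lb, ub + 1))    # one or two plain IDs
    return out


def decode_xinix(entries: List[Entry]) -> List[int]:
    """Expand a Xinix-style list back into a plain ascending ID list."""
    ids: List[int] = []
    for entry in entries:
        if isinstance(entry, tuple):
            ids.extend(range(entry[0], entry[1] + 1))
        else:
            ids.append(entry)
    return ids


print(encode_ginix([2, 3]))            # [(2, 3)]
print(encode_xinix([2, 3]))            # [2, 3]
print(encode_xinix([2, 3, 4, 5, 6]))   # [(2, 6)]
print(encode_xinix([1, 7]))            # [1, 7]
```

Once a list has been expanded with `decode_xinix`, any algorithm written for plain ascending ID lists can be applied to it unchanged, which is what allows existing search algorithms to be reused.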
Table I(a) shows a collection of titles of several papers, and Table I(b) shows its inverted index.

TABLE I

(a) A Collection of Several Paper Titles

ID  Content
1   Introduction
2   Procedure-Oriented Paradigm
3   Procedure-Oriented Development Tools
4   Object-Oriented Paradigm
5   Object-Oriented Notations and Graphs
6   Steps in Object Oriented Analysis
7   Introduction to Prototyping Paradigm

(b) Inverted Index

Word          ID
Introduction  1, 7
Procedure     2, 3
Oriented      2, 3, 4, 5, 6
Development   3
Object        4, 5, 6
Paradigm      2, 4, 7

Table I(a) shows a dataset of 7 paper titles, and the corresponding inverted index is shown in Table I(b). The inverted index contains the important words in the dataset and their corresponding document IDs. If the dataset is large, the inverted index is also large, and it contains many consecutive IDs. Table I(c) shows the Generalized Inverted Index for the dataset.

(c) Generalized Inverted Index

Word          ID
Introduction  (1,1), (7,7)
Procedure     (2,3)
Oriented      (2,6)
Development   (3,3)
Object        (4,6)
Paradigm      (2,2), (4,4), (7,7)

Traditional inverted lists contain many consecutive IDs, and Ginix merges these consecutive IDs into intervals. Each interval r is represented by two numbers, its lower bound lb(r) and its upper bound ub(r). Ginix is inefficient when the difference between ub(r) and lb(r) is 0 or 1. A single-element interval (l, u) with l = u still needs two integers, and an interval that covers two IDs also takes two integers, so merging saves nothing in either case. If there are many single-element intervals, the space cost is large. Table I(d) shows the proposed index structure, which extends the Generalized Inverted Index.

(d) Extended Inverted Index

Word          ID
Introduction  1, 7
Procedure     2, 3
Oriented      (2, 6)
Development   3
Object        (4, 6)
Paradigm      2, 4, 7

In Xinix, a single-element interval (l = u) does not need two integers: the boundaries are dropped and the single ID is stored. Xinix merges consecutive IDs into an interval only when the difference between the bounds is greater than 1. During decoding, only the intervals need to be processed, so the structure reduces computational cost.

III. SEARCH OPERATION

There are two types of search operations, union and intersection. The union operation implements the OR query, which returns every document that contains at least one of the query keywords. The intersection operation supports AND query semantics, in which only those documents that contain all the query keywords are returned. A traditional keyword search system first retrieves the compressed inverted list for each keyword, then decompresses these lists into ID lists, and then calculates the union or intersection of these lists.

The search operation is quite simple in the case of Xinix. Since a Xinix list contains both intervals and plain IDs, the intervals must first be converted to IDs. A scan-line pointer can be used for this conversion. For each interval, the lower bound and upper bound are stored in two variables, and a pointer scans from the lower bound to the upper bound, extracting all the intermediate IDs. The result is an ID list, to which the union or intersection algorithm can be applied. Algorithm 1 shows the union operation.

Algorithm 1: Union (ID_List1[], ID_List2[]):
Step 1: Use two index variables i and j, with initial values i = 0 and j = 0.
Step 2: If ID_List1[i] is smaller than ID_List2[j], store ID_List1[i] and increment i.
Step 3: If ID_List1[i] is greater than ID_List2[j], store ID_List2[j] and increment j.
Step 4: If both are equal, store either one and increment both i and j.
Step 5: Repeat Steps 2 to 4 until one list is exhausted, then store the remaining elements of the other list.

This algorithm follows the merge procedure and can be used to find the union of arrays, lists, and similar sorted sequences. The same method can be used to calculate the intersection of ID lists. Algorithm 2 describes the calculation of the intersection of two ID lists.

Algorithm 2: Intersection (ID_LIST1[], ID_LIST2[]):
Step 1: Use two index variables i and j, with initial values i = 0 and j = 0.
Step 2: If ID_LIST1[i] is smaller than ID_LIST2[j], increment i.
Step 3: If ID_LIST1[i] is greater than ID_LIST2[j], increment j.
Step 4: If both are equal, store either one and increment both i and j.
Step 5: Repeat Steps 2 to 4 until either list is exhausted.

IV. DOCUMENT REORDERING CONCEPT

Document reordering can increase the performance of the Extended Inverted Index for exact keyword lookup. Reordering helps because documents with similar content receive nearby IDs, which produces longer runs of consecutive IDs to merge. Silvestri [6] proposed a simple method, called Sigsort, that sorts web pages in lexicographical order of their URLs. This method can be adapted for Xinix. Another method is to sort all the words in descending order of their frequencies [5]. This improves the search performance of Xinix.

V. EXPERIMENTS

The performance and scalability of Xinix were evaluated by experiments on a Linux server using the Pubmed dataset, a large dataset that contains more than 35000 medical journals. Xinix was implemented in Python with an SQL server.

Fig. 1 compares the index sizes of the traditional inverted index, the Generalized Inverted Index, and the Extended Inverted Index. The traditional inverted index for the Pubmed dataset occupies 70 MB. The Generalized Inverted Index, constructed by merging consecutive IDs into interval lists, reduces the size to 52.5 MB. The Extended Inverted Index reduces it further to 38.3 MB.
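As a quick check on these figures, the relative savings can be worked out from the three reported sizes. The short Python sketch below only restates the reported numbers; the helper name `reduction` is illustrative and is not taken from the implementation.

```python
# Index sizes reported for the Pubmed dataset, in MB.
sizes_mb = {"traditional": 70.0, "ginix": 52.5, "xinix": 38.3}


def reduction(before: float, after: float) -> float:
    """Relative size reduction from `before` to `after`, as a percentage."""
    return (before - after) / before * 100.0


ginix_vs_traditional = reduction(sizes_mb["traditional"], sizes_mb["ginix"])
xinix_vs_traditional = reduction(sizes_mb["traditional"], sizes_mb["xinix"])
xinix_vs_ginix = reduction(sizes_mb["ginix"], sizes_mb["xinix"])

print(f"Ginix vs traditional: {ginix_vs_traditional:.1f}%")   # 25.0%
print(f"Xinix vs traditional: {xinix_vs_traditional:.1f}%")   # 45.3%
print(f"Xinix vs Ginix:       {xinix_vs_ginix:.1f}%")         # 27.0%
```

On the reported sizes, Xinix is therefore roughly 45% smaller than the traditional index and roughly 27% smaller than Ginix.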
These experiments show that the Extended Inverted Index reduces storage cost compared with both the traditional inverted index and the Generalized Inverted Index.

Fig. 1 Comparison of Different Indexes

VI. CONCLUSIONS

This paper presents a new index structure, called the Extended Inverted Index, for keyword search in datasets. Xinix has an effective index structure and algorithms to support keyword search. Document reordering can be used to improve the search performance of Xinix. Xinix not only requires less storage than a traditional inverted index but also has a better search speed.

ACKNOWLEDGMENT

This work was conducted at the Centre for High Performance Computing Lab, FISAT, Angamaly, Kerala, with facilities provided by M G University, Kerala. We express our special thanks to Dr. R. Vijayakumar, Professor, School of Computer Sciences, M G University.

REFERENCES

[1] Hao Wu, Guoliang Li, and Lizhu Zhou, Generalized Inverted Index for Keyword Search, in IEEE Transactions on Knowledge and Data Mining, vol. 8, no. 1, 2013.
[2] F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of inverted indexes for fast query evaluation, in Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tampere, Finland, 2002, pp. 222-229.
[3] M. Zukowski, S. Héman, N. Nes, and P. A. Boncz, Super-scalar RAM-CPU cache compression, in Proc. of the 22nd International Conference on Data Engineering, Atlanta, Georgia, USA, 2006, p. 59.
[4] W. Shieh, T. Chen, J. J. Shann, and C. Chung, Inverted file compression through document identifier reassignment, Information Processing and Management, vol. 39, no. 1, pp. 117-131, 2003.
[5] R. Blanco and A. Barreiro, TSP and cluster-based solutions to the reassignment of document identifiers, Information Retrieval, vol. 9, no. 4, pp. 499-517, 2006.
[6] F. Silvestri, Sorting out the document identifier assignment problem, in Proc. of the 29th European Conference on IR Research, Rome, Italy, 2007, pp. 101-112.

ISSN: 2231-5381 http://www.ijettjournal.org