International Journal of Latest Research in Science and Technology Vol.1,Issue 2 :Page No.141-144 ,July-August(2012) http://www.mnkjournals.com/ijlrst.htm ISSN (Online):2278-5299 IMPLEMENTATION OF MULTILEVEL INDEXING IN SEARCH ENGINES Sneh kalra Assistant Professor @Modern Vidya Niketan University,Palwal. snehchhabra@gmail.com Abstract—This paper proposes creation of clusters at multilevel.i.e. creating the indexes at the multiple levels so as to reduce the search time by directing the search to a specific path from higher level indexes to the lower level indexes. The multilevel indexes are created after the hierarchical clusters of the documents have been formed on the basis of document similarity. At each level of documents clusters hierarchy, a separate index is created and the search for a document proceeds from the higher level indexes to the lowest level index which is the document level index and is stored as inverted file. Thus the paper implements the proposed hierarchical algorithm at different levels which optimizes the search process by directing the search to a specific path from higher levels of clustering to the lower levels i.e. from super clusters to mega clusters, then to clusters and finally to the individual documents so that the user gets the best possible matching results in minimum possible time. The paper further presents the graphs showing clusters created at different levels. Keywords— Multilevel indexes, hierarchical clustering, document similarity, inverted files, posting list. I. INTRODUCTION Search Engines are used to find the data in response to the user queries. The indexing phase of these search engines can be viewed as a web content mining process. Starting from a collection of unstructured documents, the indexer extracts a large amount of information like the list of documents, which contain a given term. It also keeps account of number of all the occurrences of each term within every document. This information is maintained in a global index, which is usually represented using an inverted file (IF). The index consists of an array of the posting lists where each posting list is associated with a term and contains the term as well as the identifiers of the documents containing the term. For example, consider the posting list ((job; 5) 1, 4, 14, 20, 27) indicating that the term job appears in five documents having the document identifiers 1,4,14,20,27 respectively. The above posting list can be written as ((job; 5) 1, 3, 10, 6, 7) where the items of the list represent the difference between the successive document identifiers. The following figure shows the structure of index Term No of documents in which term appears Doc id of documents in which terms appears Doc ids stored with different coding scheme Fig. 1 Structure of an Index in search engines ISSN 2278-5299 Clustering is a widely adopted technique aimed at dividing a collection of data into disjoint groups of homogenous elements. A cluster is therefore a collection of objects, which are similar between them and are dissimilar to the objects belonging to other clusters. Clustering algorithms attempt to group together the documents based on their similarities. Thus documents relating to a certain topic will hopefully be placed in a single cluster. So if the documents are clustered, comparisons of the documents against the user’s query are only needed with certain clusters and not with the whole collection of documents. II. RELATED WORK In this paper, a review of previous work on index organization is given. In this field of index organization and maintenance, many algorithms and techniques have already been proposed but they seem to be less efficient in efficiently accessing the index. In [2] the authors introduce a novel pipelining technique for structuring the index–building system that substantially reduces the index constructing time. They demonstrate that for large collections, the pipelining technique can speed up index construction by several hours. They propose and compare different schemes for storing and managing inverted files using an embedded database system. They show that an intelligent scheme for packing inverted lists in the storage structures of the database can provide performance and storage efficiency comparable to tailored inverted file implementations Another proposed work was the threshold based clustering algorithm [8] in which the number of clusters is unknown. However, two documents are classified to the same cluster if 141 Kalra,,International Journal of Latest Research in Science and Technology the similarity between them is below a specified threshold. This threshold is defined by the user before the algorithm starts. It is easy to see that if the threshold is small; all the elements will get assigned to different clusters. If the threshold is large, the elements may get assigned to just one cluster. Thus the algorithm is sensitive to specification of threshold. Another proposed work was the K means clustering algorithm, which initially chooses k documents as cluster representatives and then assigns the remaining n-k documents to one of these clusters on the basis of similarity between the documents. New centroids for the k clusters are recomputed and documents are reassigned according to their similarity with the k new centroids. This process repeats until the position of the centroids become stable. Computing new centroids is expensive for large values of n and the number of iterations required to converge may be large. The given paper implements the multilevel index structure in which the clustering of data is done at multiple levels so as to direct the search to a specific path from high level index to low level index. III. However the indexes at the higher level consist of only the document terms with no extra information stored in them. B. How to Create Clusters Let D={D1, D2,……Dn) be the collection of N textual documents being crawled to which consecutive integers document identifiers 1…n are assigned. N1 is the number of clusters which would be created. This algorithm will make less or equal to the number of cluster i.e. less or equal to N1.first it will find out the similarity between the documents and create the similarity matrices. After that it will take 2 clusters and make one cluster on the basis of similarity. Similarly all clusters are merged to make cluster. Now it will cheek for the termination condition. If it s not a termination condition then it will calculate again similarity matrices. Again take 2 clusters and make one cluster until it reach termination condition. B. Example Illustrating the Formation of Hierarchy of The WebDocuments Super cluster level PROPOSED WORK The indexing phase of the search engine collects information from the web documents gathered in the web repository. This global large sized index is stored in the form of inverted files. The proposed work explains the concept of multilevel indexes which are created at multiple levels of hierarchy of documents clustering. At first level of document hierarchy, the similar documents are clustered into clusters on the basis of documents similarity. Then at the next level, the similar clusters are clustered into mega clusters and at the last level of hierarchy, the similar mega clusters are clubbed together to form super clusters. Mega cluster level No. of docs in which Doc id of docs in term appears which term appears Indexing 50 12,34,45,49… Search engine 59 15,20,34,55… Pipeline 15 3,6,9,12… Fig. 3 Structure of which Higher Level Indexes This is a descriptive index is stored in the form of inverted files. The index consists of an array of posting lists as in case of global large size index. The index at the mega cluster level consists of the terms that come after the intersection of the terms in the clusters that appear in that particular mega cluster. This index is of small size and contains only the similar terms of the clusters within the same mega cluster. The index does not contain any additional information. The highest level index is formed last at the super cluster level which consists of the terms that are obtained after the intersection of the terms in the similar mega cluster within that super cluster and the search starts with this index and proceeds downwards to the lowest level index. IV. IMPLEMENTATION WORK Crawling in search engines Indexing in search engines Cluster level Parallel crawling in search engines Distributed crawling in search engines Global indexing in search engines Pipelined indexing in search engines Documents A. Structure of Multilevel Indexes The structure of the index at the lowest level is in the form of posting lists as shown in figure: Term Search engine Fig. 4. Example Illustrating Hierarchy of Clusters The Fig 4 shows the hierarchy of the documents created. The index formation star means algorithm is applied for creating the clusters. Clusters will be created at first level. For creating clusters at second level same procedure is applied again and then finally hierarchical clustering is done for indexing. Finding Document similarity Parsing Documents Parsed data Matrix created After applying algorithm for hierarchical clustering Clusters created Fig. 5 Work Flow of Implementation For indexing the documents, firstly we have to parse the documents. After that similarity matrix is created and then k ISSN 2278-5299 142 Kalra,,International Journal of Latest Research in Science and Technology V. A. Snapshots of Implemented Work 1) Given input & parsed data The following snapshot represents the parsed data. Which is the initial step for indexing the data? RESULTS A. Graph Representing the Created Clusters at First Level At first level, less number of clusters are created. As at first level the similarity between two documents is less. 10 h, 9 c, 9 a 9 b n u m b e r o f c l u s te r s 8 g, 7 7 g, 7 c, 6 6 a,c,7 7 c b, 6 d f, 5 a, 5 e 5 f 4 g 3 h 2 i 1 j 0 a b c d e f g h i j 10 documents Fig. 8 Clusters Created at First Level B. Graph Representing the Created Clusters at Second Level At second level, more number of clusters are created in comparison to first level. As the similarity between two documents is more 120 g, 98 Fig. 6 Given & Parsed Data number of clusters 100 2) Clustered data The data is now clustered according to the similarity of words. The following fig shows the clusters created that are created after matching the similarity of documents with each other. 80 60 f, 89 d, 76 a, 76 g, 76 f, 69 e,f, 69 67 c, 65 e, 63 h,i, 62 61 d, 54 a,c,5255 a, 54 e, 54 h, 43 40 c, 87 h, 87 i, 43 d, 34 g, 34 e, 32 i, 98 a e, 87 b i, 76 a, 76 c, 76 f, 65 g, 65 c d e 47 h,i, 45 c, 33 f g h, 31 h g, 21 i 20 j 0 a b c d e f 10 documents g h i j Fig. 9 Clusters Created at Second Level VI. CONCLUSION The given paper implements a multilevel indexing structure for a search engine index so as to speed up the search process by directing the search to a specific path from the higher level index to the lower level indexes. The implemented architecture and the structure of the indexes are the answer to one of the biggest problems for search optimization in search engines. The implementation work aims at reducing the search space and search time by maintaining the index at various levels. REFRENCES 1 Fig 7 Clustered Data ISSN 2278-5299 Sergey Melnik, Sriram Raghavan, Beverly Yang, and Hector Garci Molina. Building a Distributed Full–Text Index for the Web. In World Wide Web, pages 396–406,2001. 143 Kalra,,International Journal of Latest Research in Science and Technology 2 Fabrizio Silvestri, Raffaele Perego and Salvatore Orlando. Assigning Document Identifiers to Enhance Compressibility of Web Search Engines Indexes. In the proceedings of SAC, 2004. 3 S.Brin and L.Page.The Anatomy of a Large-Scale Hypertextual Web Search Engine. In the 7th International WWW Conference, (WWW7), pp. 107-117, Brisbane, Australia. Van Rijsbergen C.J. Information Retrieval. Butterworth 1979. [5] Oren Zamir and Oren Etzioni. Web Document Clustering: A feasibility demonstration. In the proceedings of SIGIR, 1998. Soumen Chakrabarti. Mining the Web – Discovering Knowledge from Hypertext Data. Morgan Kaufmann Publications, 2003. 4 5 6 A. Jain and R. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. 7 P. Elias. Universal codeword sets and representation of the integers. IEEE Trans. on Information Theory, 21(2):194–203, Feb 1975. 8 Sanjiv K. Bhatia. Adaptive K-Means Association for Artificial Intelligence, 2004. 9 Krishna Kummarmuru, Rohit Lotlikar, Shourya Roy. Hierarchical Monothetic Document Clustering Algorithm for Summarization and Browsing Search Results. WWW 2004, New York. 10 Bhatia, S.K. and Deougan , J.S. 1998. Conceptual Clustering in Information Retrieval. IEEE Transactions on Systems, Man and Cybernetics. 11 Dan Bladford and Guy Blelloch. Index compression through document reordering. In IEEE, editor, Proc. Of DCC’02. IEEE, 2002. 12 Parul Gupta, Dr. A.K. Sharma, Hierarchical Clustering based Indexing in Search Engines, communicated to International Journal of Information and Communication Technology. 13 A Framework for Hierarchical Clustering Based Indexing in Search Engines Parul Gupta , A.K. Sharma 14 Kilgour, Frederick G.”The Evolution of the Book search engine “. New York: Oxford University Press, 1998; Katz, Bill. Cuneiform to Computer: A History of Reference Sources. Lanham, MD: Scarecrow Press, 1998. Weinberg, Bella Hass. "Indexes and Religion: Reflections on Research in the History of Indexes". The Indexer, 21 (3): 111-118 (1999). 15 Clustering. “Viewing morphology as an inference process”. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval., pp. 191-202, 1993. 17 Korfhage, R., 1997. Information Storage and Retrieval. Cooper. W.S., 1969. “Is inter indexer consistency a hobgoblin?” American Documentation 20. 3: 268-278 18 C. C. Aggarwal, F. Al-Garawi, and P. S. Yu. “Intelligent crawling on the World Wide Web with arbitrary predicates”. In WWW10, Hong Kong, May 2001. 19 S. Chakrabarti. Mining the Web. Morgan Kaufmann, 2003. 21 Michael R. Anderberg. Cluster analysis for applications. Academic Press, 1973. 22 G. Adami, P. Avesani, D. Sona, Clustering Documents in a Web Directory, in Proceedings of the 5th International Workshop on Web Information and Data Management (ACM WIDM 03), September 2003. Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. Efficient crawling through url ordering. In 7th World Wide Web Conference (WWW7), Apr 1998 ISSN 2278-5299 Advances in Web Mining and Web Usage Analysis 2005 - revised papers from 7 th workshop on Knowledge Discovery on the Web, Olfa Nasraoui, Osmar Zaiane, Myra Spiliopoulou, Bamshad Mobasher, Philip Yu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, LNAI 4198, 2006 25 Web Mining and Web Usage Analysis 2004 - revised papers from 6 th workshop on Knowledge Discovery on the Web, Bamshad Mobasher, Olfa Nasraoui, Bing Liu, Brij Masand, Eds., Springer Lecture Notes in Artificial Intelligence, 2006 American 16 23 24 144