Thus the paper implements the proposed hierarchical algorithm at different levels, which optimizes the search process by directing the search along a specific path from the higher levels of clustering to the lower levels, i.e. from super clusters to mega clusters, then to clusters and finally to the individual documents, so that the user gets the best possible matching results in the minimum possible time. The paper further presents graphs showing the clusters created at different levels.

Keywords: Multilevel indexes, hierarchical clustering, document similarity, inverted files, posting list.

I. INTRODUCTION

Search engines are used to find data in response to user queries. The indexing phase of these search engines can be viewed as a web content mining process. Starting from a collection of unstructured documents, the indexer extracts a large amount of information, such as the list of documents that contain a given term. It also keeps count of the occurrences of each term within every document. This information is maintained in a global index, which is usually represented using an inverted file (IF). The index consists of an array of posting lists, where each posting list is associated with a term and contains the term as well as the identifiers of the documents containing the term. For example, consider the posting list ((job; 5) 1, 4, 14, 20, 27), indicating that the term job appears in five documents having the document identifiers 1, 4, 14, 20 and 27 respectively. The above posting list can be written as ((job; 5) 1, 3, 10, 6, 7), where the first item is the first document identifier and each subsequent item represents the difference between successive document identifiers. The following figure shows the structure of the index.

[Figure: index structure with the fields: Term | No. of documents in which term appears | Doc ids of documents in which term appears | Doc ids stored with different coding scheme]
Fig. 1 Structure of an Index in search engines

ISSN 2278-5299

Clustering is a widely adopted technique aimed at dividing a collection of data into disjoint groups of homogeneous elements. A cluster is therefore a collection of objects that are similar to one another and dissimilar to the objects belonging to other clusters. Clustering algorithms attempt to group documents together based on their similarities. Thus documents relating to a certain topic will hopefully be placed in a single cluster. So if the documents are clustered, the user’s query needs to be compared only against certain clusters and not against the whole collection of documents.

II. RELATED WORK

In this section, a review of previous work on index organization is given. In the field of index organization and maintenance, many algorithms and techniques have already been proposed, but they appear to be less efficient at accessing the index.

In [2] the authors introduce a novel pipelining technique for structuring the index-building system that substantially reduces the index construction time. They demonstrate that for large collections, the pipelining technique can speed up index construction by several hours. They propose and compare different schemes for storing and managing inverted files using an embedded database system. They show that an intelligent scheme for packing inverted lists in the storage structures of the database can provide performance and storage efficiency comparable to tailored inverted file implementations.

141 Kalra, International Journal of Latest Research in Science and Technology

Another proposed work was the threshold-based clustering algorithm [8], in which the number of clusters is not known in advance. Two documents are assigned to the same cluster if the distance (dissimilarity) between them is below a specified threshold. This threshold is defined by the user before the algorithm starts.
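A minimal sketch of one way such a threshold-based pass could work is given here. It is not taken from [8]: it assumes that "below the threshold" refers to a distance, uses Jaccard distance over term sets purely for illustration, and compares each document with the first member of every existing cluster. The names `jaccard_distance`, `threshold_cluster` and the four sample documents are hypothetical.

```python
def jaccard_distance(a: set[str], b: set[str]) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b| (0.0 for identical sets, 1.0 for disjoint ones)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def threshold_cluster(docs: dict[int, set[str]], threshold: float) -> list[list[int]]:
    """Single-pass clustering: the number of clusters is not fixed in advance.

    Assumption (not from the source): a document joins the first cluster whose
    first member lies at a distance below `threshold`; otherwise it starts a
    new cluster.
    """
    clusters: list[list[int]] = []
    for doc_id, terms in docs.items():
        for cluster in clusters:
            representative = docs[cluster[0]]
            if jaccard_distance(terms, representative) < threshold:
                cluster.append(doc_id)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([doc_id])
    return clusters


# Hypothetical documents, reduced to their term sets.
docs = {
    1: {"parallel", "crawling", "search", "engines"},
    2: {"distributed", "crawling", "search", "engines"},
    3: {"global", "indexing", "search", "engines"},
    4: {"pipelined", "indexing", "search", "engines"},
}

print(threshold_cluster(docs, 0.5))
```

With these four documents, a threshold of 0.5 places the two crawling documents in one cluster and the two indexing documents in another.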
It is easy to see that if the threshold is small, all the elements will be assigned to different clusters. If the threshold is large, the elements may all be assigned to just one cluster. Thus the algorithm is sensitive to the choice of threshold.

Another proposed work was the K-means clustering algorithm, which initially chooses k documents as cluster representatives and then assigns the remaining n-k documents to one of these clusters on the basis of the similarity between the documents. New centroids for the k clusters are then computed and the documents are reassigned according to their similarity with the k new centroids. This process repeats until the positions of the centroids become stable. Computing new centroids is expensive for large values of n, and the number of iterations required to converge may be large.

This paper implements a multilevel index structure in which the clustering of data is done at multiple levels so as to direct the search along a specific path from the high-level index to the low-level index.

III. PROPOSED WORK

The indexing phase of the search engine collects information from the web documents gathered in the web repository. This large global index is stored in the form of inverted files. The proposed work explains the concept of multilevel indexes, which are created at multiple levels of the hierarchy of document clustering. At the first level of the document hierarchy, similar documents are grouped into clusters on the basis of document similarity. At the next level, similar clusters are grouped into mega clusters, and at the last level of the hierarchy, similar mega clusters are grouped together to form super clusters.

A. Structure of Multilevel Indexes

The structure of the index at the lowest level is in the form of posting lists, as shown in the figure:

[Figure: lowest-level index; only the labels "Term" and "Search engine" are recoverable]

However, the indexes at the higher levels consist of only the document terms, with no extra information stored in them.

Term             No. of docs in which term appears    Doc ids of docs in which term appears
Indexing         50                                   12, 34, 45, 49, …
Search engine    59                                   15, 20, 34, 55, …
Pipeline         15                                   3, 6, 9, 12, …

Fig. 3 Structure of Higher Level Indexes

This is a descriptive index which is stored in the form of inverted files. The index consists of an array of posting lists, as in the case of the large global index. The index at the mega cluster level consists of the terms obtained from the intersection of the terms in the clusters that belong to that particular mega cluster. This index is small and contains only the terms shared by the clusters within the same mega cluster. The index does not contain any additional information. The highest-level index is formed last, at the super cluster level; it consists of the terms obtained from the intersection of the terms in the similar mega clusters within that super cluster. The search starts with this index and proceeds downwards to the lowest-level index.

B. How to Create Clusters

Let D = {D1, D2, …, Dn} be the collection of n crawled textual documents, to which consecutive integer document identifiers 1…n are assigned. N1 is the number of clusters to be created; the algorithm creates at most N1 clusters. First, it computes the similarity between the documents and builds the similarity matrix. It then takes two clusters and merges them into one on the basis of their similarity. The remaining clusters are merged in the same way. Next, it checks the termination condition. If the termination condition is not met, it recomputes the similarity matrix and again merges two clusters into one, repeating until the termination condition is reached.

C. Example Illustrating the Formation of Hierarchy of the Web Documents

[Figure: hierarchy with the levels Super cluster level, Mega cluster level, Cluster level and Documents; node labels: Crawling in search engines, Indexing in search engines, Parallel crawling in search engines, Distributed crawling in search engines, Global indexing in search engines, Pipelined indexing in search engines]
Fig. 4. Example Illustrating Hierarchy of Clusters

Fig. 4 shows the hierarchy of the documents created.

IV. IMPLEMENTATION WORK

For index formation, the star means algorithm is applied to create the clusters. Clusters are created at the first level. To create clusters at the second level, the same procedure is applied again, and finally hierarchical clustering is done for indexing.

[Figure: work-flow diagram with the stages: Parsing Documents, Parsed data, Finding Document similarity, Matrix created, After applying algorithm for hierarchical clustering, Clusters created]
Fig. 5 Work Flow of Implementation

To index the documents, we first have to parse them. After that, the similarity matrix is created and then k

Fig. 6 Given & Parsed Data

2) Clustered data

The data is now clustered according to the similarity of words. The following figure shows the clusters that are created after matching the documents against each other for similarity.

Graph Representing the Created Clusters at Second Level

At the second level, more clusters are created than at the first level. As the similarity between two documents is more

[Chart: y-axis "number of clusters" (0 to 120), x-axis "10 documents" (a to j); the individual data labels are not recoverable]
Fig. 9 Clusters Created at Second Level

VI. CONCLUSION

This paper implements a multilevel indexing structure for a search engine index so as to speed up the search process by directing the search along a specific path from the higher-level index to the lower-level indexes.
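The top-down lookup that this structure is meant to support can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the class names `Cluster` and `Group`, the function `search`, and the sample terms and document identifiers are hypothetical. Each higher-level index is built as the intersection of the term sets of its children, and the search descends only into nodes whose index contains the query term.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Lowest-level index: term -> document identifiers (posting list)."""
    postings: dict[str, list[int]]

    @property
    def terms(self) -> set[str]:
        return set(self.postings)


@dataclass
class Group:
    """A mega cluster or super cluster (hypothetical name).

    Assumes at least one child. Its index holds terms only, no posting lists.
    """
    children: list  # Cluster or Group instances
    terms: set[str] = field(init=False)

    def __post_init__(self) -> None:
        # Higher-level index: the terms common to every child.
        self.terms = set.intersection(*(child.terms for child in self.children))


def search(node, term: str) -> list[int]:
    """Descend from a higher-level index to the posting lists."""
    if term not in node.terms:
        return []  # prune this whole branch
    if isinstance(node, Cluster):
        return node.postings[term]
    hits: list[int] = []
    for child in node.children:
        hits.extend(search(child, term))
    return sorted(hits)


# Hypothetical data: two clusters, one mega cluster, one super cluster.
crawling = Cluster({"search": [1, 2], "engine": [1, 2], "crawling": [1, 2]})
indexing = Cluster({"search": [3, 4], "engine": [3, 4], "indexing": [3, 4]})
mega = Group([crawling, indexing])
top = Group([mega])

print(search(top, "search"))
print(search(top, "crawling"))
```

In this strict form, a term that is not shared by every cluster of a mega cluster (here "crawling") is absent from the higher-level index and is therefore not reached; the text does not say how such terms are handled, so a real system would need an additional rule for them.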