International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 4 – March 2015

An Efficient Concept Based Mining Model for Web Page Clustering

Mr. P. S. Gamare (1), Mr. Sandip B. Khedkar (2), Mr. Maheshwar A. Panindre (3), Mr. Ketan D. Bhatkar (4)
(1) Assistant Professor, (2)(3)(4) Student, Department of Computer Engineering, Rajendra Mane College of Engineering and Technology, Ambav, Devrukh, Ratnagiri, Maharashtra, India

Abstract— Effective representation of web search results remains an open problem in information retrieval. This problem can be addressed by a concept-based mining model that uses a modified Agglomerative Hierarchical Clustering (AHC) algorithm. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, and a concept-based similarity measure. The model can efficiently find significant matching concepts between documents according to the semantics of their sentences. The aim of the concept-based analysis is to achieve an accurate analysis of concepts at the sentence, document, and corpus levels, rather than a single-term analysis at the document level only. The documents are then summarized into clusters. For clustering, the AHC algorithm is used with some modifications, because the existing AHC algorithm is not suitable for large data sets.

I. INTRODUCTION
Effective representation of web search results remains an open problem in information retrieval. This problem can be addressed by a concept-based mining model [1] that uses a modified Agglomerative Hierarchical Clustering algorithm. The proposed mining model consists of sentence-based concept analysis, document-based concept analysis, and a concept-based similarity measure; it can efficiently find significant matching concepts between documents according to the semantics of their sentences. Clustering, one of the traditional data mining techniques, is an unsupervised learning paradigm in which clustering methods try to identify inherent groupings of text documents, so that a set of clusters is produced in which clusters exhibit high intra-cluster similarity and low inter-cluster similarity. The proposed model captures the semantic structure of each term within a sentence and document, rather than only the frequency of the term within a document. Three measures, analyzing concepts at the sentence, document, and corpus levels, are computed in the proposed model. The aim of the concept-based analysis is to achieve an accurate analysis of concepts at the sentence, document, and corpus levels, rather than a single-term analysis at the document level only. The documents are summarized into clusters using the AHC algorithm with some modifications, because the existing AHC algorithm is not suitable for large data sets.

II. PROPOSED SYSTEM ARCHITECTURE
Fig. 1 shows the architecture of the proposed system.

[Fig. 1: Proposed system architecture]

A. Preprocessing:
• Separate sentences
• Label terms
• Remove stop words
• Stem words (reduce each term to its root form)
A minimal sketch of these steps is given below.
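The paper does not prescribe a particular toolchain for preprocessing. The following self-contained Python sketch is one minimal realization of the four steps; the stop-word list is deliberately tiny, and the stem function is a crude suffix stripper standing in for a real stemmer such as Porter's.

import re

# Illustrative stop-word list; a real system would use a full list.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to"}

def split_sentences(text):
    # Naive sentence separation on terminal punctuation.
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Return a list of sentences, each a list of stemmed content terms."""
    sentences = []
    for sentence in split_sentences(document):
        terms = re.findall(r"[a-z]+", sentence.lower())    # label terms
        terms = [t for t in terms if t not in STOP_WORDS]  # remove stop words
        sentences.append([stem(t) for t in terms])         # stem words
    return sentences

print(preprocess("Clustering groups the web documents. Clusters are produced."))
# [['cluster', 'group', 'web', 'document'], ['cluster', 'produc']]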
III. CONCEPT BASED MINING MODEL

A. Sentence-Based Concept Analysis
To analyze each concept at the sentence level, a new concept-based frequency measure, called the conceptual term frequency (ctf), is proposed. The ctf of a concept c is first calculated within each sentence s of document d, and these sentence-level values are then combined at the document level as follows.

B. Calculating ctf of Concept c in Document d
A concept c can have different ctf values in different sentences of the same document d. Thus, the ctf value of concept c in document d is calculated as the average of its sentence-level values:

ctf(c, d) = (1 / sn) * Σ_{i=1..sn} ctf(c, s_i),

where sn is the total number of sentences that contain concept c in document d. Taking the average of the ctf values of concept c over its sentences measures the overall importance of concept c to the meaning of its sentences in document d. A concept that has ctf values in most of the sentences of a document contributes strongly to the meaning of those sentences, which in turn helps to discover the topic of the document. Thus, averaging the ctf values measures the overall importance of each concept to the semantics of a document through its sentences.

C. Document-Based Concept Analysis
To analyze each concept at the document level, the concept-based term frequency tf, the number of occurrences of a concept (word or phrase) c in the original document, is calculated. The tf is a local measure at the document level.

D. Corpus-Based Concept Analysis
To extract concepts that can discriminate between documents, the concept-based document frequency df, the number of documents containing concept c, is calculated. The df is a global measure at the corpus level. It rewards concepts that appear in only a small number of documents, since such concepts can discriminate their documents from the others. The ctf, tf, and df measures are calculated over a corpus by the proposed Concept-based Analysis Algorithm.

E. Concept-Based Analysis Algorithm
1. ddoci is a new document
2. L is an empty list (L is the matched concept list)
3. sdoci is a new sentence in ddoci
4. Build the concept list Cdoci from sdoci
5. for each concept ci ∈ Cdoci do
6.   compute ctfi of ci in ddoci
7.   compute tfi of ci in ddoci
8.   compute dfi of ci in ddoci
9.   dk is a previously seen document, where k = {0, 1, ..., doci − 1}
10.  sk is a sentence in dk
11.  Build the concept list Ck from sk
12.  for each concept cj ∈ Ck do
13.    if (ci == cj) then
14.      update dfi of ci
15.      compute ctfweight = avg(ctfi, ctfj)
16.      add the new concept match to L
17.    end if
18.  end for
19. end for
20. output the matched concept list L

F. Concept-Based Similarity Measure
Concepts convey local context information, which is essential in determining an accurate similarity between documents. A concept-based similarity measure is devised that matches concepts at the sentence, document, and corpus levels, and in a combined approach, rather than relying on individual terms (words) only. The measure rests on three critical aspects. First, the analyzed labeled terms are the concepts that capture the semantic structure of each sentence. Second, the frequency of a concept is used to measure the contribution of the concept to the meaning of the sentence, as well as to the main topics of the document. Last, the number of documents that contain the analyzed concepts is used to discriminate among documents when calculating the similarity. These aspects are captured by the proposed concept-based similarity measure, which weighs the importance of each concept at the sentence level by the ctf measure, at the document level by the tf measure, and at the corpus level by the df measure. The concept-based measure exploits the information extracted by the concept-based analysis algorithm to better judge the similarity between documents. The concept-based similarity between two documents d1 and d2 is then calculated by combining the ctf, tf, and df weights of their matching concepts.

G. AHC Algorithm
The AHC algorithm comes in three variants, distinguished by the linkage criterion used to compare clusters:
• Single Linkage Method. The similarity between two clusters S and T is calculated based on the minimal distance between the elements belonging to the corresponding clusters. This method is also called the "nearest neighbour" clustering method.
• Complete Linkage Method. The similarity between two clusters S and T is calculated based on the maximal distance between the elements belonging to the corresponding clusters. This method is also called the "furthest neighbour" clustering method.
• Average Linkage Method. The similarity between two clusters S and T is calculated based on the average distance between the elements belonging to the corresponding clusters. This method takes into account all possible pairs of distances between the objects in the clusters, and is considered more reliable and robust to outliers. It is also known as UPGMA (Unweighted Pair-Group Method using Arithmetic averages).

The sketches below illustrate, in turn, the concept-based analysis measures, the concept-based similarity, and the three linkage criteria.
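Sections III-A to III-D can be prototyped directly from the definitions above. In the following Python sketch, the sentence-level ctf is approximated by the concept's raw count in the sentence; the full model ties ctf to the semantic structure of the sentence, so this is a simplification for illustration. Documents are assumed to be preprocessed into lists of sentence term lists, as produced by the earlier preprocessing sketch.

def ctf(concept, sentences):
    """Average conceptual term frequency of `concept` over the sentences
    of one document that contain it (Section III-B):
    ctf(c, d) = (1/sn) * sum of ctf(c, s_i) over those sn sentences."""
    per_sentence = [s.count(concept) for s in sentences if concept in s]
    return sum(per_sentence) / len(per_sentence) if per_sentence else 0.0

def tf(concept, sentences):
    """Number of occurrences of `concept` in the document (Section III-C)."""
    return sum(s.count(concept) for s in sentences)

def df(concept, corpus):
    """Number of documents in `corpus` containing `concept` (Section III-D)."""
    return sum(any(concept in s for s in doc) for doc in corpus)

# Each document is a list of sentences; each sentence is a list of terms.
corpus = [
    [["cluster", "web", "document"], ["cluster", "cluster", "group"]],
    [["web", "search", "result"], ["search", "engine"]],
]
doc = corpus[0]
print(ctf("cluster", doc), tf("cluster", doc), df("cluster", corpus))
# 1.5 3 1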
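As one plausible reading of Section III-F, the next sketch assigns each concept a combined weight from its ctf, tf, and an inverse-df factor, and compares two documents with the cosine measure over these weights. Both the weighting scheme and the cosine combination are illustrative assumptions, not the authors' exact formula. The functions ctf, tf, and df from the previous sketch are reused.

import math

def concept_weights(doc, corpus):
    """Hypothetical combined weight per concept: tf * ctf, scaled by an
    inverse document-frequency factor (an illustrative choice only)."""
    n_docs = len(corpus)
    concepts = {t for sentence in doc for t in sentence}
    return {c: tf(c, doc) * ctf(c, doc) * math.log(1 + n_docs / df(c, corpus))
            for c in concepts}

def concept_sim(d1, d2, corpus):
    """Cosine similarity over the combined concept weights."""
    w1, w2 = concept_weights(d1, corpus), concept_weights(d2, corpus)
    shared = set(w1) & set(w2)
    dot = sum(w1[c] * w2[c] for c in shared)
    n1 = math.sqrt(sum(v * v for v in w1.values()))
    n2 = math.sqrt(sum(v * v for v in w2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

print(round(concept_sim(corpus[0], corpus[1], corpus), 3))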
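For experimenting with the three linkage criteria, SciPy's hierarchical clustering routines implement single, complete, and average linkage directly. The demo below clusters four toy document vectors (standing in for concept weights) and cuts the resulting dendrogram into two clusters under each method.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy document vectors; documents 1-2 and 3-4 are mutually similar.
X = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.1],
              [0.0, 0.1, 1.0],
              [0.1, 0.0, 0.9]])

for method in ("single", "complete", "average"):  # the three AHC variants
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, labels)
# Each method groups documents 1-2 and 3-4 together on this toy data.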
H. Output Clusters
Finally, the relevant output clusters are obtained.

IV. FUTURE SCOPE
• This work can be used for retrieving data from large data centers.
• It is very useful for retrieving relevant data from the web through search engines.
• It also reduces the size of web documents.

V. CONCLUSION
We get:
• Extraction of relevant web pages.
• An accurate measure of the similarity between web pages.
• Clusters with 100% accuracy.

ACKNOWLEDGMENT
The authors would like to thank all the reviewers for their valuable suggestions and guidance. The authors would also like to thank Rajendra Mane College of Engineering and Technology.

REFERENCES
[1] Shady Shehata, Fakhri Karray, and Mohamed S. Kamel, "An efficient concept-based mining model for enhancing text clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, October 2010.
[2] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.
[3] Khaled M. Hammouda, "Web mining: clustering web documents, a preliminary review," February 26, 2001.
[4] Alexander Strehl, Joydeep Ghosh, and Raymond Mooney, "Impact of similarity measures on web-page clustering," AAAI Technical Report WS-00-01.
[5] Antonio Latorre, Jose M. Pena, Victor Robles, and Maria S. Perez, "A survey in web page clustering techniques," Department of Computer Architecture and Technology, Technical University of Madrid, Spain.
[6] Ron Bekkerman, Shlomo Zilberstein, and James Allan, "Web page clustering using heuristic search in the web graph," University of Massachusetts, Amherst, MA, USA.
[7] N. Oikonomakou and M. Vazirgiannis, "A review of web document clustering approaches," Dept. of Informatics, Athens University of Economics and Business, Patision 76, 10434, Greece.