An Efficient Concept Based Mining Model for Web Page Clustering

International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 4 – March 2015
Mr. P. S. Gamare 1, Mr. Sandip B. Khedkar 2, Mr. Maheshwar A. Panindre 3, Mr. Ketan D. Bhatkar 4
1 Assistant Professor, 2,3,4 Student, Department of Computer Engineering,
Rajendra Mane College of Engineering and Technology,
Ambav, Devrukh, Ratnagiri, Maharashtra, India
Abstract: Effective representation of web search results remains
an open problem in information retrieval. This problem can be
addressed by a concept-based mining model that uses a modified
Agglomerative Hierarchical Clustering (AHC) algorithm. The
proposed mining model consists of sentence-based concept
analysis, document-based concept analysis, and a concept-based
similarity measure. The model can efficiently find significant
matching concepts between documents according to the semantics
of their sentences. The aim of the concept-based analysis is to
achieve an accurate analysis of concepts at the sentence,
document, and corpus levels rather than a single-term analysis
of the document only. The data are then summarized into
clusters using the AHC algorithm with some modifications,
because the existing AHC algorithm is not suitable for large
data sets.
I. INTRODUCTION
Effective representation of web search results remains an open
problem in information retrieval. This problem can be addressed
by a concept-based mining model that uses a modified
Agglomerative Hierarchical Clustering (AHC) algorithm. The
proposed mining model consists of sentence-based concept
analysis, document-based concept analysis, and a concept-based
similarity measure. The model can efficiently find significant
matching concepts between documents according to the semantics
of their sentences. Clustering, one of the traditional data
mining techniques, is an unsupervised learning paradigm in which
clustering methods try to identify inherent groupings of text
documents, so that the resulting clusters exhibit high
intra-cluster similarity and low inter-cluster similarity. The
proposed model captures the semantic structure of each term
within a sentence and a document, rather than only the frequency
of the term within a document. Three measures for analyzing
concepts at the sentence, document, and corpus levels are
computed in the proposed model. The aim of the concept-based
analysis is to achieve an accurate analysis of concepts at the
sentence, document, and corpus levels rather than a single-term
analysis of the document only. The data are then summarized into
clusters using the AHC algorithm with some modifications,
because the existing AHC algorithm is not suitable for large
data sets.
II. PROPOSED SYSTEM ARCHITECTURE
Fig. Proposed system architecture
A. Preprocessing:
• Separate sentences
• Label terms
• Remove stop words
• Stem words
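The preprocessing steps listed above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the stop-word list is a tiny sample, crude_stem is a hypothetical stand-in for a real stemmer such as Porter's, and the term-labeling step is omitted because it requires a semantic parser that is not described here.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much larger).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to", "for"}


def split_sentences(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return one list of stemmed, stop-word-free tokens per sentence."""
    result = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        result.append([crude_stem(t) for t in tokens if t not in STOP_WORDS])
    return result


sentences = preprocess("The pages are clustered. Clustering groups similar pages!")
```

Each inner list corresponds to one sentence, so later stages can still work at the sentence level.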
III. CONCEPT BASED MINING MODEL
A. Sentence-Based Concept Analysis
To analyze each concept at the sentence level, we propose a
new concept-based frequency measure, called the conceptual
term frequency (ctf). The ctf of concept c in sentence s and
in document d is calculated as follows:
B. Calculating ctf of Concept c in Document d:
A concept c can have many ctf values in different sentences in
the same document d. Thus, the conceptual term frequency
(ctf) value of concept c in document d is calculated by:
$$ctf = \frac{\sum_{n=1}^{sn} ctf_n}{sn}$$
where ctf_n is the ctf value of concept c in the n-th such
sentence and sn is the total number of sentences that contain concept
c in document d. Taking the average of the conceptual term
frequency (ctf) values of concept c in its sentences of
document d measures the overall importance of concept c to
the meaning of its sentences in document d. A concept, which
has ctf values in most of the sentences in a document,
contributes substantially to the meaning of those sentences,
which helps to reveal the topic of the document. Thus, the
average of the ctf values measures the overall importance of
each concept to the semantics of a document through its
sentences.
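As a small worked illustration of this averaging, the following Python sketch takes the ctf values that a concept has in the sentences containing it and returns the document-level ctf. The function name document_ctf and the sample values are hypothetical; how the sentence-level values are obtained is outside the scope of the sketch.

```python
def document_ctf(sentence_ctfs):
    """Average the per-sentence ctf values of one concept in one document.

    sentence_ctfs holds one value per sentence that contains the concept,
    so its length plays the role of sn.
    """
    if not sentence_ctfs:
        return 0.0  # the concept does not occur in this document
    return sum(sentence_ctfs) / len(sentence_ctfs)


# Hypothetical concept that occurs in three sentences of a document.
ctf_in_document = document_ctf([2, 1, 3])
```

A concept that occurs in many sentences with consistently high values keeps a high average, whereas a single isolated occurrence contributes only its own value.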
C. Document-Based Concept Analysis
To analyze each concept at the document level, the
concept-based term frequency tf, the number of occurrences of a
concept (word or phrase) c in the original document, is
calculated. The tf is a local measure on the document level.
D. Corpus-Based Concept Analysis
To extract concepts that can discriminate between documents,
the concept-based document frequency df, the number of
documents containing concept c, is calculated. The df is a
global measure on the corpus level. This measure is used to
reward concepts that appear in only a small number of
documents, since these concepts can discriminate their
documents from the others. The ctf, tf, and df measures of a
corpus are computed by the proposed algorithm, which is called
the Concept-based Analysis Algorithm.
E. Concept-Based Analysis Algorithm
1. ddoci is a new document
2. L is an empty list (L is the matched concept list)
3. sdoci is a new sentence in ddoci
4. Build concepts list Cdoci from sdoci
5. for each concept ci ∈ Cdoci do
6.   compute ctfi of ci in ddoci
7.   compute tfi of ci in ddoci
8.   compute dfi of ci in ddoci
9.   dk is a previously seen document, where k = {0, 1, ..., doci - 1}
10.  sk is a sentence in dk
11.  Build concepts list Ck from sk
12.  for each concept cj ∈ Ck do
13.    if (ci == cj) then
14.      update dfi of ci
15.      compute ctfweight = avg(ctfi, ctfj)
16.      add new concept matches to L
17.    end if
18.  end for
19. end for
20. output the matched concepts list L
F. Concept-Based Similarity Measures
Concepts convey local context information, which is essential
in determining an accurate similarity between documents. A
concept-based similarity measure is devised, based on matching
concepts at the sentence, document, and corpus levels and on
their combination, rather than on individual terms (words)
only. The concept-based similarity measure relies on three
critical aspects. First, the analyzed labeled terms are the
concepts that capture the semantic structure of each sentence.
Second, the frequency of a concept is used to measure the
contribution of the concept to the meaning of the sentence, as
well as to the main topics of the document. Last, the number
of documents that contain the analyzed concepts is used to
discriminate among documents in calculating the similarity.
These aspects are measured by the proposed concept-based
similarity measure, which measures the importance of each
concept at the sentence level by the ctf measure, at the
document level by the tf measure, and at the corpus level by
the df measure. The concept-based measure exploits the
information extracted by the concept-based analysis algorithm
to better judge the similarity between documents.
The concept-based similarity between two documents, d1 and
d2 is calculated by
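As an illustration of how the three measures could be combined, the following Python sketch computes tf, ctf, and df for documents whose concepts have already been extracted and then scores a pair of documents over their matched concepts. The weighting (tf + ctf) × log(N / df), the use of raw per-sentence counts as the sentence-level ctf, and the names analyze, document_frequency, and similarity are assumptions made for this sketch. They are not taken from the text above, and the authors' own formula is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def analyze(doc):
    """Return {concept: (tf, ctf)} for one document.

    doc is a list of sentences; each sentence is a list of concept strings.
    """
    tf = Counter()
    per_sentence = defaultdict(list)
    for sentence in doc:
        for concept, n in Counter(sentence).items():
            tf[concept] += n                 # document-level occurrences
            per_sentence[concept].append(n)  # assumed sentence-level ctf
    return {c: (tf[c], sum(v) / len(v)) for c, v in per_sentence.items()}


def document_frequency(all_stats):
    """Count, for every concept, the number of documents that contain it."""
    df = Counter()
    for stats in all_stats:
        df.update(stats.keys())
    return df


def similarity(stats1, stats2, df, n_docs):
    """Sum the products of assumed concept weights over matched concepts."""
    score = 0.0
    for concept in stats1.keys() & stats2.keys():
        rarity = math.log(n_docs / df[concept])  # rewards rare concepts
        w1 = sum(stats1[concept]) * rarity       # (tf + ctf) * log(N / df)
        w2 = sum(stats2[concept]) * rarity
        score += w1 * w2
    return score


# Hypothetical corpus of three documents with pre-extracted concepts.
docs = [
    [["web page", "cluster"], ["web page"]],
    [["web page", "search"]],
    [["cluster", "algorithm"]],
]
stats = [analyze(d) for d in docs]
df = document_frequency(stats)
sim_12 = similarity(stats[0], stats[1], df, len(docs))
sim_13 = similarity(stats[0], stats[2], df, len(docs))
sim_23 = similarity(stats[1], stats[2], df, len(docs))
```

Under this assumed weighting, a concept that appears in every document receives a weight of zero, which mirrors the stated intent of rewarding concepts that appear in only a few documents.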
G. AHC Algorithm: Three Linkage Types
• Single Linkage Method.
The similarity between two clusters S and T is calculated
based on the minimal distance between the elements
belonging to the corresponding clusters. This method is also
called the “nearest neighbour” clustering method.
• Complete Linkage Method.
The similarity between two clusters S and T is calculated
based on the maximal distance between the elements
belonging to the corresponding clusters. This method is also
called the “furthest neighbour” clustering method.
• Average Linkage Method.
The similarity between two clusters S and T is calculated
based on the average distance between the elements belonging
to the corresponding clusters. This method takes into account
all possible pairs of distances between the objects in the
clusters, and is considered more reliable and robust to outliers.
This method is also known as UPGMA (Unweighted Pair-Group Method using Arithmetic averages).
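The three linkage criteria can be sketched in Python as follows, together with a naive agglomerative loop that merges the two closest clusters until k clusters remain. This is plain AHC written for illustration. The function names and the one-dimensional sample data are hypothetical, and the modifications mentioned in this paper are not described in the text, so none are shown.

```python
from itertools import product


def single_link(s, t, dist):
    """Minimal distance between any element of s and any element of t."""
    return min(dist(a, b) for a, b in product(s, t))


def complete_link(s, t, dist):
    """Maximal distance between any element of s and any element of t."""
    return max(dist(a, b) for a, b in product(s, t))


def average_link(s, t, dist):
    """Average distance over all pairs of elements from s and t (UPGMA)."""
    pairs = [dist(a, b) for a, b in product(s, t)]
    return sum(pairs) / len(pairs)


def ahc(items, dist, linkage, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[x] for x in items]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j], dist)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def gap(a, b):
    """Distance between two numbers (stand-in for a document distance)."""
    return abs(a - b)


points = [1.0, 2.0, 9.0, 10.0, 25.0]
three = ahc(points, gap, average_link, 3)
two = ahc(points, gap, average_link, 2)
```

This naive loop recomputes every pairwise linkage after each merge, which makes it slow on large inputs and illustrates why unmodified AHC scales poorly. In practice the distance function would be derived from a document similarity measure rather than from numeric gaps.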
H. Output Clusters:
Finally we get the relevant output clusters.
IV. FUTURE SCOPE
• This project can be used for retrieving data from large data
centers.
• This project is very useful for retrieving relevant data from
the web through search engines.
• It also reduces the size of web documents.
V. CONCLUSION
We get:
• Extraction of relevant web pages.
• Accurate measure of the similarity between web pages.
• Clusters with 100% accuracy.
ACKNOWLEDGMENT
The authors would like to thank all the reviewers for
their valuable suggestions and guidance. The authors would
also like to thank Rajendra Mane College of Engineering and
Technology.
ISSN: 2231-5381
http://www.ijettjournal.org