PAGE CONTENT RANK: AN APPROACH TO WEB CONTENT MINING

International Journal of Engineering Trends and Technology (IJETT) – Volume 22 Number 2 - April 2015, ISSN: 2231-5381
Urvashi¹, Rajesh Singh²
¹M.Tech Student, Department of CSE, B.S. Anangpuria Institute of Technology and Management, Alampur, India
²Assistant Professor, Department of CSE, B.S. Anangpuria Institute of Technology and Management, Alampur, India
Abstract: The World Wide Web is a system of interlinked hypertext documents accessed via the Internet. Information on the Web continues to expand in size and complexity, so retrieving the required web page efficiently and effectively is a challenge. Web structure mining plays an effective role in finding and extracting the relevant information.
I. INTRODUCTION
In this paper, we propose a new algorithm based on Page Content Rank (PCR) that exploits both structure and content. The proposed approach ranks relevant pages based on their content and keywords. Methods of web data mining can be divided into several categories according to the kind of information mined and the goals that the particular categories set: Web structure mining (WSM), Web usage mining (WUM), and Web content mining (WCM). The objective of this paper is to propose a new WCM method for page relevance ranking based on exploration of page content.
II. WEB MINING
The general process of web mining is extracting valuable knowledge from the Web, or analyzing data from different perspectives and summarizing it into useful information that can be used to perform important tasks. Two different definitions exist: one is process-based and the other is data-based; the data-based definition is more widely accepted today.

Web mining is of three types: web content mining, web structure mining, and web usage mining.

[Figure: taxonomy of web mining — web content mining operates on text and multimedia documents, web usage mining on web log records, and web structure mining on the hyperlink structure.]

1. Web content mining targets knowledge discovery in which the main objects are the traditional collections of text documents and, more recently, also collections of multimedia documents such as images, videos, and audio, which are embedded in or linked to web pages.

2. Web structure mining focuses on the hyperlink structure of the Web. The different objects are linked in some way, and simply applying traditional processes under the assumption that the events are independent can lead to wrong conclusions. Appropriate handling of the links, however, can expose potential correlations and thereby improve the predictive accuracy of the learned models.

3. Web usage mining focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from web log records to discover user access patterns of web pages.
[Figure: the general web mining process — Resource Discovery → Information Pre-processing → Generalization → Pattern Analysis.]

Some terms regarding mining:
- Resource discovery is the process of retrieving web documents (resources).
- Information pre-processing is the transformation of the results of resource discovery.
- Generalization uncovers general patterns at individual web sites and across multiple sites; machine learning and traditional data mining techniques are typically used in this step.
- Pattern analysis is the validation of the mined patterns.
The subtasks of web usage mining are:

[Figure: web usage mining pipeline — raw logs and site files are pre-processed into user session files, which mining algorithms turn into rules, patterns, and statistics; pattern analysis then yields modified rules, patterns, and statistics.]

An access log file in web usage mining contains information about user visits in the Common Log Format. In this format, each user request to any URL corresponds to a record in the access log file, and each record is a tuple containing 7 attributes. Session information is a 2-tuple containing the IP address of the user and the sequential list of web pages visited in that session:

Si = (IPi, PAGESi)
PAGESi = { (URLi)1, (URLi)2, ..., (URLi)k }

After applying data pre-processing and data reduction, session information is obtained from the web log data.
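As a concrete illustration, here is a minimal Python sketch (with a hypothetical log excerpt) of turning Common Log Format records into the 2-tuple session structure Si = (IPi, PAGESi) described above; the regular expression and the grouping of all of an IP's requests into a single session are simplifying assumptions:

```python
import re

# Common Log Format: host ident authuser [date] "request" status bytes
# (the 7 attributes mentioned in the text; the request part carries the URL).
CLF = re.compile(r'(\S+) (\S+) (\S+) \[(.*?)\] "(\S+) (\S+) [^"]*" (\d+) (\S+)')

def sessions_from_log(lines):
    """Group visited URLs by client IP: Si = (IPi, PAGESi)."""
    sessions = {}
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue  # skip malformed records
        ip, url = m.group(1), m.group(6)
        sessions.setdefault(ip, []).append(url)
    return [(ip, pages) for ip, pages in sessions.items()]

# hypothetical log excerpt
log = [
    '10.0.0.1 - - [10/Apr/2015:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Apr/2015:13:56:01 +0000] "GET /about.html HTTP/1.0" 200 812',
    '10.0.0.2 - - [10/Apr/2015:13:57:12 +0000] "GET /index.html HTTP/1.0" 200 2326',
]
print(sessions_from_log(log))
# [('10.0.0.1', ['/index.html', '/about.html']), ('10.0.0.2', ['/index.html'])]
```

A real system would additionally split one IP's requests into separate sessions by a timeout on the timestamps, which this sketch omits.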
Several well-known algorithms search for frequent patterns:

1. Breadth First Search (BFS): In breadth-first search, the lattice of equivalence classes is generated by recursive application, exploring the whole lattice in a bottom-up manner. All child patterns of length n are generated before moving to parent patterns of length n+1.

2. Depth First Search (DFS): In depth-first search, all patterns on a single path coming from a child are explored before exploring all patterns of smaller length from the corresponding pattern. In breadth-first search, by contrast, we need to keep the id lists of all current patterns of length n in memory.

3. GSP Algorithm: GSP makes multiple passes over the session set. Given the set of frequent patterns of length n-1, the candidate set for the next generation is generated from the input set according to the thresholds.
4. SPADE Algorithm: In the SPADE algorithm, the session id–timestamp lists of atoms are created first. These lists are then sorted with respect to the support of each atom, and any atom whose support is below the input threshold is eliminated. Next, frequent patterns are generated from the single atoms using the union operation ∨, following the prefix-based approach. Finally, all frequent patterns of length n > 2 are discovered independently within their length-1 prefix classes.
In our experiments, GSP gave the worst results because it does not use the pattern lattice structure and has to perform a session scan at each step. DFS is better than BFS because it eliminates infrequent patterns at each level and keeps fewer patterns in memory. SPADE is the best, because it works on prefix-based equivalence classes, which form a much smaller search space.
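The paper gives no implementations of these algorithms; the following Python sketch shows only the subsequence-support counting that BFS, DFS, GSP, and SPADE all ultimately rely on (the session data is hypothetical):

```python
def is_subsequence(pattern, session):
    """True if pattern occurs in session in order (not necessarily contiguously)."""
    it = iter(session)
    # `page in it` consumes the iterator up to the match, preserving order
    return all(page in it for page in pattern)

def support(pattern, sessions):
    """Number of sessions containing the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sessions)

# hypothetical user sessions (ordered page visits)
sessions = [
    ["/a", "/b", "/c"],
    ["/a", "/c"],
    ["/b", "/a", "/c"],
]
print(support(("/a", "/c"), sessions))  # -> 3: every session visits /a then /c
```

A pattern is then "frequent" when its support meets the chosen threshold; the algorithms differ only in the order in which they enumerate candidate patterns and in how they store the id lists.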
Three main page ranking and document clustering
techniques are as follows:
1. PageRank Algorithm: PageRank was developed at Stanford University by Larry Page (cofounder of the Google search engine) and Sergey Brin. Google uses this algorithm to order its search results so that important documents move up in the results of a search while less important pages move down the list. The algorithm states that if a page has some important incoming links, then its outgoing links to other pages also become important; it thus takes backlinks into account and propagates ranking through links. When a query is given, Google combines precomputed PageRank scores with text matching scores to obtain an overall ranking for each web page returned in response to the query.
A simplified formula of PageRank is defined as:

PR(u) = c Σ_{v ∈ B(u)} PR(v) / N(v)                    (1)

or

PR(u) = (1 - d) + d Σ_{v ∈ B(u)} PR(v) / N(v)          (2)

where d is a damping factor, (1 - d) accounts for the page rank distribution from non-directly linked pages, B(u) is the set of pages that link to u, and N(v) is the number of outgoing links of page v.

[Figure: example link graph of the three pages A, B, and C.]

The PageRanks for pages A, B, and C can be calculated using (2) as shown below:

PR(A) = (1-d) + d( PR(B)/2 + PR(C)/2 )
PR(B) = (1-d) + d( PR(A)/1 + PR(C)/2 )
PR(C) = (1-d) + d( PR(B)/2 )

Solving these equations with d = 0.5 (say), the page ranks of pages A, B, and C become:

PR(A) = 1.0, PR(B) = 1.2, PR(C) = 0.8

2. Iterative Method of PageRank:
For a small set of pages it is easy to solve the equation system by inspection to determine the page rank solution. In the iterative calculation, each page is assigned a starting page rank value of 1, as shown in the table below, and the iterations are repeated until the page ranks converge.

Iteration   PR(A)   PR(B)   PR(C)
0           1       1       1
1           1       1.25    0.81
2           1.02    1.21    0.8
3           1       1.2     0.8
4           1       1.2     0.8
...         ...     ...     ...

3. Weighted PageRank Algorithm:
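The iterative calculation can be sketched in Python. The link structure below is the one implied by the three equations for PR(A), PR(B), PR(C) (A links to B; B links to A and C; C links to A and B), and pages are updated in place in the order A, B, C, which reproduces the first-iteration values in the table:

```python
# Iterative PageRank with simplified formula (2); the link structure is the
# one implied by the equations for PR(A), PR(B), PR(C) above.
outlinks = {"A": ["B"], "B": ["A", "C"], "C": ["A", "B"]}
d = 0.5

# invert the outlink map to get B(u), the set of pages linking to u
inlinks = {u: [v for v in outlinks if u in outlinks[v]] for u in outlinks}

ranks = {u: 1.0 for u in outlinks}        # every page starts at rank 1
for _ in range(20):                        # well past convergence for 3 pages
    for u in ["A", "B", "C"]:              # update in place, in order
        ranks[u] = (1 - d) + d * sum(ranks[v] / len(outlinks[v])
                                     for v in inlinks[u])

print({u: round(r, 2) for u, r in ranks.items()})
# {'A': 1.0, 'B': 1.2, 'C': 0.8}
```

The converged values satisfy all three equations exactly, which can be checked by substituting them back in.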
WPR assumes that the more popular the web pages are, the more linkages other web pages tend to have to them, or the more they are linked to by them. This algorithm assigns larger rank values to more important pages instead of dividing the rank value of a page evenly among its outgoing linked pages. Each outlink page gets a value proportional to its popularity or importance, where popularity is measured by the number of incoming and outgoing links. The popularity is assigned in terms of weight values for the incoming and outgoing links, denoted Win(v,u) and Wout(v,u) respectively. Win(v,u) is the weight of link (v, u), calculated based on the number of incoming links of page u and the number of incoming links of all reference (outgoing linked) pages of page v.
Win(v,u) = Iu / Σ_{p ∈ R(v)} Ip

where Iu and Ip represent the number of inlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v.

Wout(v,u) is the weight of link (v, u), calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v:

Wout(v,u) = Ou / Σ_{p ∈ R(v)} Op

where Ou and Op represent the number of outlinks of page u and page p, respectively.
The original PageRank formula is modified as:

WPR(u) = (1 - d) + d Σ_{v ∈ B(u)} WPR(v) · Win(v,u) · Wout(v,u)
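A minimal Python sketch of WPR on a hypothetical three-page graph follows; the graph, the damping value, and the termination after a fixed number of iterations are assumptions for illustration, not part of the paper:

```python
# Weighted PageRank sketch: Win and Wout are computed from the link
# structure, then the modified formula is iterated. Hypothetical graph.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
d = 0.85

inlinks = {u: [v for v in outlinks if u in outlinks[v]] for u in outlinks}

def w_in(v, u):
    # u's share of inlinks among all reference pages R(v) of page v
    return len(inlinks[u]) / sum(len(inlinks[p]) for p in outlinks[v])

def w_out(v, u):
    # u's share of outlinks among all reference pages R(v), guarding zero
    total = sum(len(outlinks[p]) for p in outlinks[v])
    return len(outlinks[u]) / total if total else 0.0

ranks = {u: 1.0 for u in outlinks}
for _ in range(50):
    ranks = {u: (1 - d) + d * sum(ranks[v] * w_in(v, u) * w_out(v, u)
                                  for v in inlinks[u])
             for u in outlinks}
```

Unlike plain PageRank, the rank a page v passes to u is scaled by both weights, so a page with many in- and outlinks receives a larger share of v's rank.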
Clustering: Clustering divides a set of objects into groups such that the objects in the same group are similar to each other. In the context of web document clustering, the objects are documents, grouped together by some measure such as similarity of content or of hyperlink structure. Most search engines return a large and unmanageable list of documents containing the query keywords, and finding the required documents in such a large list is usually difficult, often impossible. As a solution, a search engine can group the returned documents with the aim of presenting semantically meaningful clusters rather than a flat list of ranked documents. Web clustering may be based on content alone, on both content and links, or only on links.
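As an illustration of content-based clustering, here is a minimal Python sketch using bag-of-words cosine similarity and single-pass threshold clustering; the similarity measure, threshold, and corpus are assumptions, not the paper's method:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # list of (seed_vector, [doc indices])
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        for seed, members in clusters:
            if cosine(vec, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]

# hypothetical corpus: two web mining documents and one unrelated one
docs = [
    "web mining extracts knowledge from the web",
    "content mining of web documents",
    "apple pie recipe with cinnamon",
]
print(cluster(docs))  # -> [[0, 1], [2]]
```

Real systems use better term weighting (e.g., tf-idf) and clustering algorithms, but the grouping-by-similarity principle is the same.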
Proposed Architecture for CLUSTERING AND RANKING:

[Figure: proposed architecture — the user poses a query through the query interface; a web crawler feeds the indexer, which maintains the index; the query processor uses the rank calculator, cluster generator, and similarity calculator to answer the query from the WWW.]

Rank improvement: This module takes as input a user query and the matched documents from the query processor, and applies an improvement to the rank scores of the returned pages. The module operates online at query time and applies the improvement to the current documents.

Step 1: Given an input user query q and the matched documents D collected from the query processor, find the cluster Ck to which the query q belongs.
Step 2: Retrieve the sequential pattern of the concerned cluster from the local repository maintained by the sequential pattern generator.
Step 3: Calculate the level weight for every page X present in the sequential pattern.
Step 4: Calculate the rank for every page X present in the sequential pattern. The improved rank is the sum of the previous rank and the assigned weight value.

Algorithm: RankImprove(Q, n)
Given: a set of n queries and the corresponding clicked URLs stored in array Q[qi, URL1, ..., URLm], 1 ≤ i ≤ n
Output: a set C = {C1, C2, ..., Ck} of k query clusters

k = 0;
// Start of algorithm
for (each query P in Q)
    set Clusterid(P) = NULL;
for (each P ∈ Q with Clusterid(P) = NULL)
{
    i = n; page = Q(n);
    Clusterid(P) = Ck;
    Weight(X) = ln(lenpar(X)) / level(X);
    Page_rank(X) = (1 - d) + d Σ_{v ∈ B(X)} PR(v) / Nv;
    New_Page_rank(X) = Page_rank(X) + Weight(X);
    while ((i > 1) and (Q[i/2] < New_Page_rank(X)))
    {
        Q[i] = Q[i/2];
        i = i / 2;
    }
    Q[i] = New_Page_rank(X);
    k = k + 1;
}
return true;
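The improved-rank computation of Steps 3 and 4 can be sketched in Python; lenpar(X) and level(X) are quantities of the paper's sequential patterns that the text does not fully define, so the values used here are hypothetical placeholders:

```python
import math

def improved_rank(page_rank, lenpar, level):
    """New_Page_rank(X) = Page_rank(X) + Weight(X),
    with Weight(X) = ln(lenpar(X)) / level(X)."""
    return page_rank + math.log(lenpar) / level

# hypothetical values for a page X found in a sequential pattern
print(round(improved_rank(page_rank=0.8, lenpar=4, level=2), 3))  # -> 1.493
```

The logarithm dampens the effect of very long patterns, and dividing by the level makes pages deeper in the pattern contribute smaller weights.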
III. CONCLUSION:
The paper describes Page Content Rank, its algorithms, and experiences with its use in web mining. In a number of examples, the method was found to behave better than the popular PageRank algorithm. We would therefore like to state the hypothesis that PCR identifies pages which are more significant with respect to their content, and which better explain a given topic, than the PageRank algorithm does. However, more experiments have to be performed as future work in order to validate this hypothesis.
IV. FUTURE WORK:
There are several possibilities for the future development of PCR. Certainly, the method should be tested on data samples of more representative sizes. A weak point of the PCR implementation is the time complexity of obtaining the starting set of pages Rq,n. Another possible improvement is the continuous adaptation of the system to user reactions, so that WPCR can evolve into a standardized technique for page content ranking.