Royal Education Society’s
COLLEGE OF COMPUTER SCIENCE & INFORMATION
TECHNOLOGY, LATUR
SEMINAR REPORT
On
Web Mining
Submitted by
Shinde Shital Narayan
(Exam Seat No: KI 2148)
in partial fulfillment for the award of the degree
of
B.Sc.(S.E) Third Year
SWAMI RAMANAND TEERTH MARATHWADA
UNIVERSITY, NANDED.
Winter 2014
Royal Education Society’s
COLLEGE OF COMPUTER SCIENCE AND INFORMATION
TECHNOLOGY, LATUR.
CERTIFICATE
This is to certify that the seminar entitled “Web Mining” has been carried
out by Shinde Shital Narayan under my guidance, in partial fulfillment of the
degree of B.Sc. (S.E.) T.Y. of SRTMU, Nanded, during the academic year
2014-2015. To the best of my knowledge and belief, this work has not been
submitted elsewhere for the award of any other degree.
Seminar Guide: Mr. S. S. Ingale
H.O.D.: Mr. I. M. Kazi
Principal: Dr. M. R. Patil
ACKNOWLEDGEMENT
(Shinde Shital Narayan)
INDEX

COVER PAGE
CERTIFICATE
ACKNOWLEDGEMENT
1. Towards Semantic Web Mining
   1.1 The Semantic Web
   1.2 Web Mining
   1.3 Extracting Semantics from the Web
   1.4 Exploiting Semantics for Web Mining
   1.5 Mining the Semantic Web
2. PageRank
   2.1 Motivation
   2.2 Structure of the Web
   2.3 Simplified Version of PageRank
   2.4 Random Surfer Model
   2.5 Implementation
3. Properties and Approaches
   3.1 Convergence
   3.2 Personalized PageRank
   3.3 PageRank and Google
   3.4 Manipulating by Commercial Interests
   3.5 Estimating Web Traffic
   3.6 Other Approaches
4. Conclusion
BIBLIOGRAPHY
1. Towards Semantic Web Mining
1.1 The Semantic Web
The increasing use of the World Wide Web leads to a new challenge:
optimizing the interchange of information, since a huge amount of data on
the Web is interpretable by humans only. The Semantic Web realizes an idea
of Tim Berners-Lee – to enrich the Web with machine-understandable
information that supports users in their tasks. Machine-processable
information can, for instance, lead a search engine to more relevant pages
and improve both precision and recall. The Semantic Web is built up from
techniques such as XML, RDF, ontologies and logic. Its content is
represented by ontologies and metadata, which provide a well-agreed-upon
core structure that can easily be mapped onto existing ontologies. These
definitions are further extended by axioms, lexicons and knowledge bases,
and trust and proof can be established through digital signatures. Figure 1
shows the layer structure of the Semantic Web suggested by Tim Berners-Lee.
Figure 1: The layer structure of the Semantic Web
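To make the notion of machine-understandable metadata concrete, the following minimal sketch attaches two metadata statements to a Web page as RDF triples. It assumes the Python rdflib library; the page URI and the choice of the Dublin Core vocabulary are illustrative, not taken from this report.

```python
# Minimal sketch: machine-understandable metadata as RDF triples.
# Assumes the rdflib library; URI and vocabulary are illustrative.
from rdflib import Graph, Literal, Namespace, URIRef

DC = Namespace("http://purl.org/dc/elements/1.1/")  # Dublin Core vocabulary

g = Graph()
page = URIRef("http://example.org/web-mining.html")  # hypothetical page
g.add((page, DC.title, Literal("Web Mining")))       # statement: the page's title
g.add((page, DC.creator, Literal("Shinde Shital Narayan")))

# Serialize so that any RDF-aware tool (e.g. a search engine) can process it.
print(g.serialize(format="turtle"))
```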
1.2 Web Mining
The characteristic feature of Web Mining is the use of Data Mining
techniques on the content, structure and usage of Web resources. Web
Mining is an invaluable help in the transformation from human-understandable
content to machine-understandable semantics.

Web content mining is a form of text mining. It takes advantage of the
semi-structured nature of Web page text resulting from HTML tags or XML
markup. For instance, Web content mining can detect co-occurrences of
semantically related terms, such as “copper” combined with “gold” in
articles concerning Canada and “copper” combined with “silver” in articles
concerning the US.

Web structure mining usually operates on the hyperlink structure of the
Web. The primary resource for mining the Web structure is a set of pages,
ranging from a single site to the Web as a whole. Hyperlink topology
information is found in authority pages, which are defined in relation to
hubs as their counterpart: hubs are pages that link to a certain number of
authorities. The PageRank algorithm implements this concept by stating that
the relevance of a page increases with the number of hyperlinks to it from
other relevant pages.

In Web usage mining, the primary resource being mined is the record of
requests made by visitors to a Web site, often collected in a log on the
Web server. Web usage mining discovers information about the related
interests of particular groups of Web users. Sequence mining optimizes the
succession of pages visited according to the behavior of a mass of users,
and Web usage mining can be combined with the other techniques to detect
frequently used paths, as the sketch below illustrates.
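As a rough illustration of Web usage mining, the sketch below counts how often visitors move from one page to the next in a server log. The log format, a time-ordered sequence of (visitor, page) pairs, and all names are assumptions for illustration, not details from this report.

```python
# Sketch of sequence mining on Web usage data: count page-to-page transitions.
# The (visitor_id, page) log format is an assumed simplification.
from collections import Counter

def frequent_transitions(log):
    """log: iterable of (visitor_id, page) tuples in time order."""
    last_page = {}            # most recent page seen per visitor
    transitions = Counter()
    for visitor, page in log:
        if visitor in last_page:
            transitions[(last_page[visitor], page)] += 1
        last_page[visitor] = page
    return transitions.most_common()

log = [("u1", "/home"), ("u1", "/products"), ("u2", "/home"),
       ("u2", "/products"), ("u1", "/checkout")]
print(frequent_transitions(log))   # ('/home', '/products') is the most used path
```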
1.3 Extracting Semantics from the Web
The precondition for managing knowledge automatically, instead of
accessing unstructured material, is to add semantic annotation to Web
documents. All approaches discussed here assist the knowledge engineer in
extracting semantics but cannot completely replace him: a computer can
hardly be enabled to fully consider background knowledge, experience or
social conventions.

Ontology learning was created for the semi-automatic extraction of
semantics from the Web in order to build up ontologies. Its techniques
produce intermediate results which finally have to be integrated into an
ontology. The process of ontology merging takes two or more source
ontologies as input and returns a merged ontology based on them. These
approaches rely on syntactic and semantic matching heuristics which are
tuned to the behavior of experienced ontology engineers.

Instance learning in this context means information extraction from texts:
a set of automatic methods for locating important facts in electronic
documents for subsequent use.
1.4 Exploiting Semantics for Web Mining
Semantics can be exploited for different purposes. The first major
application area is the explicit encoding of semantics for mining the Web
content. In [BHS02] the input data is preprocessed, and ontology-based
heuristics for feature selection and feature aggregation are applied.
Based on these representations, multiple clustering results are computed
using the K-Means algorithm, and the results can be explained by the
corresponding selection of concepts in the ontology (see the sketch at the
end of this section).

In Web structure mining, the techniques can be enriched by taking content
into account. For example, the PageRank algorithm co-operates with a
keyword analysis algorithm, although the two are independent of one
another.

The most basic form of mining the usage of the Web is to use hand-crafted
ontologies in combination with automated schemes. Web pages are classified
according to multiple concept hierarchies that reflect content, structure
and service. In this context, a path is a sequence of concepts in a concept
hierarchy, which allows different search strategies to be identified.
Semantics can be exploited best if the gap between the model generating the
pages and the model analyzing requests for those pages is vanishingly small.
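The following sketch illustrates the flavor of ontology-based feature aggregation before clustering, in the spirit of the [BHS02] approach described above. The toy term-to-concept mapping, the documents and the use of scikit-learn's K-Means are all illustrative assumptions, not the authors' actual implementation.

```python
# Hedged sketch: aggregate terms into ontology concepts, then cluster with K-Means.
# The toy ontology, documents and scikit-learn usage are illustrative assumptions.
from sklearn.cluster import KMeans

TERM_TO_CONCEPT = {"gold": "metal", "copper": "metal",   # toy ontology mapping
                   "hockey": "sport", "soccer": "sport"}
CONCEPTS = ["metal", "sport"]

def concept_vector(text):
    """Represent a document by counts of ontology concepts, not raw terms."""
    counts = dict.fromkeys(CONCEPTS, 0)
    for term in text.lower().split():
        concept = TERM_TO_CONCEPT.get(term)
        if concept:
            counts[concept] += 1
    return [counts[c] for c in CONCEPTS]

docs = ["copper and gold mining", "soccer and hockey news", "gold copper prices"]
X = [concept_vector(d) for d in docs]
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)   # the two metal documents share a cluster, the sports one differs
```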
1.5 Mining the Semantic Web
In the Semantic Web, content and structure are strongly intertwined, so the
distinction between structure mining and content mining vanishes. An
important group of techniques here is Relational Data Mining, which
comprises techniques for classification, regression, clustering and
association analysis that look for patterns involving multiple relations in
a relational database. These algorithms can be adapted to deal with RDF or
ontology-based data. Mining the usage can be enhanced further if the
semantics are contained explicitly in the pages through references to
concepts of ontologies.
2. PageRank
2.1 Motivation
This section describes PageRank, a method for rating Web pages objectively
and mechanically while paying attention to human interest. Web search
engines have to cope with inexperienced users and with pages engineered to
manipulate conventional ranking functions; any evaluation strategy that
counts replicable features of Web pages is vulnerable to such manipulation.
The task is therefore to exploit the hyperlink structure of the Web to
produce a global importance ranking of every Web page. This ranking is
called PageRank.
2.2 Structure of the Web
The structure of the Web can be viewed as a graph with about 150 million
nodes (Web pages) and 1.7 billion edges (hyperlinks). If Web pages A and B
link to a page C, A and B are called the backlinks of C; this is
illustrated in Figure 2. In general, highly linked pages are more important
and thus have more backlinks, but important backlinks are often few in
number. For example, a Web page with a single backlink from Yahoo should be
ranked higher than a page with a couple of backlinks from unknown or
private sites. A Web page therefore has a high rank if the sum of the ranks
of its backlinks is high.
Figure 2: A and B are backlinks of C
2.3 Simplified Version of PageRank
Let u and v be Web pages, let B_u be the set of pages that point to u, and
let N_v be the number of links going out of v. Further, let c < 1 be a
factor used for normalization. A simple ranking R, a simplified version of
PageRank, is then defined by

\[ R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v} \]

The rank of a page is divided evenly among its forward links to contribute
to the ranks of the pages it points to. The equation is recursive, but
there is a problem with this simplified function: if two Web pages point to
each other but to no other page, while some third page points to one of
them, a loop is formed during the iteration. The loop accumulates rank but
never distributes any rank outward. Such traps, formed by loops without
outgoing edges, are called rank sinks.
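A minimal sketch of this simplified ranking, iterated on a three-page toy graph, makes the rank-sink problem visible. The graph and all constants are illustrative assumptions, not part of the original formulation.

```python
# Sketch of the simplified ranking R(u) = c * sum over v in B_u of R(v)/N_v.
# The toy graph and constants are illustrative assumptions.

def simplified_pagerank(links, c=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {u for targets in links.values() for u in targets}
    rank = {p: 1.0 / len(pages) for p in pages}        # uniform initial assignment
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:                          # v shares its rank evenly
                new_rank[u] += c * rank[v] / len(targets)
        rank = new_rank
    return rank

# A and B only point at each other; D's rank drains into that loop and never
# flows back out: a rank sink.
print(simplified_pagerank({"A": ["B"], "B": ["A"], "D": ["A"]}))
```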
2.4 Random Surfer Model
To avoid rank sinks, a model of a Web surfer is introduced. This surfer
simply keeps clicking on hyperlinks at random, and the model additionally
captures the behavior that the surfer periodically gets bored and jumps to
a random page. Let E(u) be a vector over the Web pages that corresponds to
a source of rank: the bored surfer chooses the next page according to the
distribution E. PageRank is then defined as an assignment R' to the Web
pages that satisfies

\[ R'(u) = c \sum_{v \in B_u} \frac{R'(v)}{N_v} + c\,E(u) \]

such that c is maximized and \( \lVert R' \rVert_1 = 1 \) (the convergence
criterion).
Dangling links are hyperlinks that point to pages with no outgoing links.
They do not affect the ranking of any other page directly, but they do
influence the performance of the computation. The dangling links are
therefore removed from the system until all PageRanks are calculated, and
are added back at the end for a final recomputation.
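The sketch below implements this random-surfer formulation on a toy graph. It uses the widely known damping form, in which a fixed fraction (1 - c) of the rank comes from the jump distribution E; this is the common rescaling of the c·E(u) term above, not the report's own code. Graph and constants are illustrative.

```python
# Sketch of PageRank with a rank source E (random surfer model), written in the
# common damping form: R'(u) = c * sum(R'(v)/N_v) + (1 - c) * E(u), ||R'||_1 = 1.
# The toy graph, c = 0.85 and the tolerance are illustrative assumptions.

def pagerank(links, c=0.85, e=None, tol=1e-8):
    pages = sorted(set(links) | {u for ts in links.values() for u in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    if e is None:
        e = {p: 1.0 / n for p in pages}            # uniform jump distribution
    while True:
        new_rank = {p: (1.0 - c) * e[p] for p in pages}   # periodic random jump
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += c * rank[v] / len(targets)
        total = sum(new_rank.values())             # keep the L1 norm equal to 1
        new_rank = {p: r / total for p, r in new_rank.items()}
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank

# The A-B loop from section 2.3 no longer traps all the rank,
# because the random jump keeps redistributing it.
print(pagerank({"A": ["B"], "B": ["A"], "D": ["A"]}))
```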
2.5 Implementation
The PageRank algorithm starts by converting each URL in the database into
an integer and storing each hyperlink using these integer IDs to identify
the Web pages. The iteration is initiated after sorting the link structure
by parent ID and removing the dangling links. A good initial assignment
should be chosen to speed up convergence. The weights of the current time
step are kept in memory, while the weights of the previous step are
accessed on disk in linear time. After the weights have converged, the
dangling links are added back and the rankings are recomputed. The
calculation performs well, but it could be made faster by easing the
convergence criterion and by using more efficient optimization strategies.
A sketch of the preprocessing steps follows.
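The following sketch covers only the preprocessing steps named above: assigning integer IDs to URLs, sorting the link structure by parent ID, and iteratively stripping dangling pages (removing one page can leave its parents dangling). All names and the data layout are illustrative assumptions.

```python
# Sketch of the preprocessing described above: integer IDs, sorted edges,
# and iterative removal of dangling pages. Names and layout are assumptions.

def preprocess(url_links):
    """url_links: list of (source_url, target_url) pairs."""
    ids = {}
    def id_of(url):                          # assign integer IDs on first sight
        return ids.setdefault(url, len(ids))
    edges = sorted((id_of(s), id_of(t)) for s, t in url_links)  # sort by parent
    out = {}
    for s, t in edges:
        out.setdefault(s, []).append(t)
    while True:                              # strip dangling pages to a fixpoint
        dangling = {p for p in ids.values() if not out.get(p)}
        trimmed = {s: [t for t in ts if t not in dangling]
                   for s, ts in out.items()}
        if trimmed == out:
            return ids, trimmed
        out = trimmed

ids, links = preprocess([("a.com", "b.com"), ("b.com", "a.com"),
                         ("a.com", "c.com")])
print(links)   # c.com has no out-links, so the edge to it was removed
```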
3. Properties and Approaches
3.1 Convergence
This section deals with special properties of PageRank and with approaches
that apply the technique. Concerning convergence, the scaling factor of the
PageRank algorithm is roughly linear in log n. For example, PageRank
executed on a database of 161 million links converges in about 45
iterations, and on a database of 322 million links it converges in about 52
iterations. PageRank therefore scales well even for extremely large data
sets.
3.2 Personalized PageRank
The E vector corresponds to the distribution of Web pages that the random
surfer periodically jumps to. One extreme is a uniform E: pages with many
incoming links, such as copyright warnings, disclaimers and archives of
highly interlinked mailing lists, then receive an overly high ranking. The
other extreme is to have E consist entirely of a single Web page; that page
and its immediate links then receive the highest PageRank. Trouble of this
kind can be avoided by estimating a large part of the user's interests, for
example by integrating the user's bookmarks and homepage into the E vector.
Such personalized PageRanks may have a number of applications, including
personal search engines.
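Reusing the pagerank() sketch from section 2.4, personalization only requires a different jump distribution E. Below, E is concentrated on a user's bookmarks; the URLs and the helper function are hypothetical.

```python
# Sketch of a personalized E vector: the random jump lands only on bookmarks.
# Reuses pagerank() from the section 2.4 sketch; URLs are hypothetical.

def bookmark_vector(pages, bookmarks):
    hits = [p for p in pages if p in bookmarks]
    return {p: (1.0 / len(hits) if p in hits else 0.0) for p in pages}

links = {"home": ["blog"], "blog": ["home", "news"], "news": ["home"]}
e = bookmark_vector(["blog", "home", "news"], {"home"})
print(pagerank(links, e=e))   # "home" and the pages it links to are boosted
```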
3.3 PageRank and Google
A conventional search engine finds all Web pages whose titles contain all
of the query words; this procedure ensures high precision. Sorting the
results by PageRank additionally ensures high quality. PageRank works
remarkably well and, through its integration into Google, has a huge user
community. Altavista, by contrast, returns root pages of servers first,
using URL length as a quality heuristic. Google combines a full-text search
engine with PageRank and additionally uses standard IR measures, proximity
and anchor text for ranking.
3.4 Manipulating by Commercial Interests
Personalized PageRanks of this kind are virtually immune to manipulation by
commercial interests. At worst, manipulation could take the form of buying
advertisements (links) on important sites, but this seems well under
control, since it is very expensive. A compromise between the two extremes
of a uniform E and a single-page E is to let E consist of all the
root-level pages of all Web servers.
3.5 Estimating Web Traffic
Concerning differences between PageRank and the actual usage of the Web:
there may be pages that people like to look at but do not want to mention
on their own Web pages (e.g. pages of political parties or religious
groups). Such pages would have high usage while their ranking remains low.
In this case, data from Web usage mining may be used as the start vector
for PageRank.
3.6 Other Approaches
PageRank used as a backlink predictor avoids the local maxima that citation
counting gets stuck in, so it is sometimes a better approximation of
citation counts than the citation counts themselves. For user navigation, a
Web proxy application was developed that annotates each hyperlink the user
sees with its PageRank. This gives the user a hint as to which links in a
long listing are more likely to be interesting than others. The original
goal of PageRank was a way to sort backlinks: if there is a large number of
backlinks for a document, the most important ones can be displayed first.
For example, people who run a news site always want to keep track of
significant backlinks.
4. Conclusion
In this paper, the combination of the two fast-developing research areas,
the Semantic Web and Web Mining, was illustrated using the example of
PageRank. The first section described how Semantic Web Mining can improve
the results of Web Mining by exploiting the new semantic structures in the
Web, and how the construction of the Semantic Web can itself make use of
Web Mining techniques. The following sections dealt with the PageRank
algorithm, which computes a global ranking of all Web pages based on their
location in the Web's graph structure; during this procedure, more
important and central Web pages are given preference. PageRank allows
creating a view of the Web from a particular perspective and is quite
secure against manipulation. Integrating PageRank into applications can
improve traffic estimation and user navigation.
BIBLIOGRAPHY
Paper(s):
1) Berendt, B., Hotho, A., Stumme, G. “Towards Semantic Web Mining.” In
Proceedings of the First International Semantic Web Conference (ISWC 2002),
pp. 264-278, 2002.
2) Brin, S., Motwani, R., Page, L., Winograd, T. “The PageRank Citation
Ranking: Bringing Order to the Web.” Technical Report, Stanford University,
1998.
3) Hotho, A., Studer, R., Stumme, G., Volz, R. “Semantic Web – State of the
Art and Future Directions.” KI (3/03), pp. 5-17, 2003.
Website(s):
1) en.wikipedia.org/wiki/PageRank
2) dbpubs.stanford.edu/pub/1999-66
3) citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768