Royal Education Society's
COLLEGE OF COMPUTER SCIENCE & INFORMATION TECHNOLOGY, LATUR

SEMINAR REPORT
On
Web Mining

Submitted by
Shinde Shital Narayan (Exam Seat No: KI 2148)

in partial fulfillment for the award of the degree of
B.Sc.(S.E) Third Year

SWAMI RAMANAND TEERTH MARATHWADA UNIVERSITY, NANDED.
Winter 2014

CERTIFICATE

This is to certify that the seminar entitled "Web Mining" has been carried out by Shinde Shital Narayan under my guidance in partial fulfillment of the degree of B.Sc.(S.E) T.Y. of SRTMU, Nanded during the academic year 2014-2015. To the best of my knowledge and belief this work has not been submitted elsewhere for the award of any other degree.

Seminar Guide: Mr. S. S. Ingale
H.O.D.: Mr. I. M. Kazi
Principal: Dr. M. R. Patil

ACKNOWLEDGEMENT

(Shinde Shital Narayan)

INDEX

1. Towards Semantic Web Mining
   1.1 The Semantic Web
   1.2 Web Mining
   1.3 Extracting Semantics from the Web
   1.4 Exploiting Semantics for Web Mining
   1.5 Mining the Semantic Web
2. PageRank
   2.1 Motivation
   2.2 Structure of the Web
   2.3 Simplified Version of PageRank
   2.4 Random Surfer Model
   2.5 Implementation
3. Properties and Approaches
   3.1 Convergence
   3.2 Personalized PageRank
   3.3 PageRank and Google
   3.4 Manipulation by Commercial Interests
   3.5 Estimating Web Traffic
   3.6 Other Approaches
4. Conclusion
BIBLIOGRAPHY

1. Towards Semantic Web Mining

1.1 The Semantic Web

The increasing usage of the current World Wide Web leads to a new challenge: optimizing the interchange of information, since a huge amount of data is interpretable by humans only. The Semantic Web pursues an idea of Tim Berners-Lee: to enrich the Web with machine-understandable information that supports users in their tasks.
Machine-processable information can, for instance, lead a search engine to more relevant pages and improve precision and recall. The Semantic Web is built up from techniques such as XML, RDF, ontologies and logic. Its content is represented by ontologies and metadata. Thus a well-agreed-upon core structure is provided which can easily be mapped onto existing ontologies. The definitions are further extended by axioms, lexicons and knowledge bases, and trust and proof can be established by the use of digital signatures. Figure 1 shows the layer structure for the Semantic Web suggested by Tim Berners-Lee.

Figure 1: The layer structure of the Semantic Web

1.2 Web Mining

The characteristic feature of Web Mining is the use of Data Mining techniques to elaborate on the content, structure and usage of Web resources. Web Mining is an invaluable help in the transformation from human-understandable content to machine-understandable semantics.

Web content mining is a form of text mining. It takes advantage of the semi-structured nature of Web page text caused by HTML tags or XML markup. For instance, Web content mining can detect co-occurrences of semantically related terms in texts, such as "copper" combined with "gold" in articles concerning Canada and "copper" combined with "silver" in articles concerning the US.

Web structure mining usually operates on the hyperlink structure of the Web. The primary resource for mining the Web structure is a set of pages, ranging from a single site to the Web as a whole. Hyperlink topology information is found in authority pages, which are defined in relation to hubs as their counterpart: hubs are pages that link to a number of authorities.
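The mutual reinforcement between hubs and authorities can be sketched in a few lines of code. The idea goes back to Kleinberg's HITS algorithm (from which the notions of hubs and authorities originate); the small link graph below is made up purely for illustration:

```python
# Hub and authority scores computed by mutual reinforcement
# (the idea behind Kleinberg's HITS algorithm).
# The link graph below is made up for illustration.
links = {                      # page -> pages it links to
    "hub1":  ["auth1", "auth2"],
    "hub2":  ["auth1", "auth2"],
    "auth1": [],
    "auth2": ["auth1"],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(50):
    # a page's authority score is the sum of the hub scores of its backlinks
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # a page's hub score is the sum of the authority scores of its targets
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # normalize so the scores stay bounded
    a_norm = sum(auth.values())
    h_norm = sum(hub.values())
    auth = {p: s / a_norm for p, s in auth.items()}
    hub = {p: s / h_norm for p, s in hub.items()}

print(max(auth, key=auth.get))   # auth1: linked to by both hubs and auth2
```

After a few rounds the scores stabilize, with auth1 collecting the most hub weight because every other page links to it. PageRank, described next, develops the same link-based notion of relevance but assigns a single score per page.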
The PageRank algorithm implements this concept by stating that the relevance of a page increases with the number of hyperlinks to it from other relevant pages.

Regarding Web usage mining, the primary Web resource being mined is the record of requests made by visitors to a Web site. These records are often collected in a log on the Web server. Web usage mining discovers information about the related interests of a particular group of Web users. Sequence mining elaborates on the succession of page visits according to the behavior of a mass of users. Web usage mining can be combined with the other techniques in order to detect frequently used paths.

1.3 Extracting Semantics from the Web

The precondition for managing knowledge automatically, instead of accessing unstructured material, is to add semantic annotation to Web documents. All approaches discussed here assist the knowledge engineer in extracting the semantics, but cannot completely replace him: a computer can hardly be enabled to fully consider background knowledge, experience or social conventions.

Ontology learning was created for the semi-automatic extraction of semantics from the Web in order to build up ontologies. Its techniques produce intermediate results which finally have to be integrated into the ontology. The process of ontology merging takes as input two or more source ontologies and returns a merged ontology based on them. The approaches rely on syntactic and semantic matching heuristics which are adjusted to the behavior of experienced ontology engineers. Instance learning in this context means information extraction from texts: a set of automatic methods for locating important facts in electronic documents for subsequent use.

1.4 Exploiting Semantics for Web Mining

Semantics can be exploited for different purposes.
The first major application area is the explicit encoding of semantics for mining the Web content. In [BHS02] the input data is preprocessed and ontology-based heuristics for feature selection and feature aggregation are applied. Based on these representations, multiple clustering results are computed using the K-Means algorithm. These results can be explained by the corresponding selection of concepts in the ontology.

In Web structure mining the techniques can be enriched by taking content into account. For example, the PageRank algorithm co-operates with a keyword analysis algorithm, but the two are independent of one another.

The most basic form of mining the usage of the Web is to use hand-crafted ontologies in combination with automated schemes. Web pages are classified according to multiple concept hierarchies that reflect content, structure and service. In this context a path is a sequence of concepts in a concept hierarchy, allowing different search strategies to be identified. Semantics can be exploited best if the gap between the model generating the pages and the model analyzing requests for those pages is vanishingly small.

1.5 Mining the Semantic Web

In the Semantic Web, content and structure are strongly intertwined, so the distinction between structure mining and content mining vanishes. An important group of techniques here is Relational Data Mining, which comprises techniques for classification, regression, clustering and association analysis that look for patterns involving multiple relations in a relational database. The algorithms can be adapted to deal with RDF or ontology-based data. Mining the usage can be enhanced further if the semantics are contained explicitly in the pages by referring to concepts of ontologies.

2. PageRank

2.1 Motivation

This section describes PageRank, a method for rating Web pages objectively and mechanically while paying attention to human interest.
Web search engines have to cope with inexperienced users and with pages that manipulate conventional ranking functions. Any evaluation strategy which counts replicable features of Web pages is vulnerable to manipulation. The task is therefore to take advantage of the hyperlink structure of the Web to produce a global importance ranking of every Web page. This ranking is called PageRank.

2.2 Structure of the Web

The structure of the Web is based on a graph with about 150 million nodes (Web pages) and 1.7 billion edges (hyperlinks). If Web pages A and B link to a page C, A and B are called the backlinks of C. This circumstance is illustrated in Figure 2. In general, highly linked pages are more important and thus have more backlinks. But important backlinks are often few in number: for example, a Web page with a single backlink from Yahoo should be ranked higher than a page with a couple of backlinks from unknown or private sites. A Web page has a high rank if the sum of the ranks of its backlinks is high.

Figure 2: A and B are backlinks of C

2.3 Simplified Version of PageRank

Let u, v be Web pages. Let Bu be the set of pages that point to u, let Nv be the number of links from v, and let c < 1 be a factor used for normalization. We define a simple ranking R, a simplified version of PageRank:

    R(u) = c * Σ_{v ∈ Bu} R(v) / Nv

The rank of a page is divided evenly among its forward links to contribute to the ranks of the pages it points to. The equation is recursive. But there is a problem with this simplified function: if two Web pages point to each other but to no other page, while some other Web page points to one of them, a loop is generated during the iteration. This loop accumulates rank but never distributes any. Such traps, formed by loops without outedges, are called rank sinks.

2.4 Random Surfer Model

To avoid rank sinks, a model of a Web surfer is introduced. This surfer simply keeps clicking on hyperlinks at random.
In addition, the model captures the behavior that the surfer periodically gets bored and jumps to a random page. Therefore let E(u) be a vector over the Web pages that corresponds to a source of rank; the random surfer jumps to a page chosen according to the distribution E. PageRank is then the assignment R' to the Web pages which satisfies

    R'(u) = c * Σ_{v ∈ Bu} R'(v) / Nv + c * E(u)

such that c is maximized and the L1 norm of R' equals 1 (the convergence criterion).

Dangling links are links that point to a page with no outgoing links. They do not affect the ranking of any other page directly, but they do influence computation performance. Therefore the dangling links are removed from the system until all PageRanks are calculated, and are added back in for the final calculations.

2.5 Implementation

The PageRank algorithm starts by converting each URL from the database into an integer ID. The next step is to store each hyperlink in a database, using the integer IDs to identify the Web pages. The iteration is initiated after sorting the link structure by parent ID and removing dangling links. A good initial assignment should be chosen to speed up convergence. The weights of the current time step are kept in memory while the previous weights are accessed linearly on disk. After the weights have converged, the dangling links are added back and the rankings are recomputed. The calculation performs well but could be made faster by easing the convergence criterion and using more efficient optimization strategies.

3. Properties and Approaches

3.1 Convergence

This section deals with special properties of PageRank and with approaches implementing the technique. Concerning convergence, the scaling factor of the PageRank algorithm is roughly linear in log n. For example, PageRank executed on a 161 million link database converges in 45 iterations; on a 322 million link database it converges in 52 iterations. PageRank thus scales well even for extremely large data sets.
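The ranking iteration described in Section 2 can be condensed into a short sketch. This minimal version uses the common damping formulation, in which a uniform jump probability plays the role of the E vector, on a made-up four-page graph; it omits the dangling-link handling and disk-based optimizations of Section 2.5:

```python
# A minimal PageRank power iteration (uniform jump vector, toy graph).
links = {            # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
n = len(pages)
c = 0.85                         # weight of the link-following part
rank = {p: 1.0 / n for p in pages}

for _ in range(100):
    new = {}
    for u in pages:
        # backlink contribution: each v pointing to u donates rank(v)/Nv
        backlink_sum = sum(rank[v] / len(links[v])
                           for v in pages if u in links[v])
        # (1 - c)/n models the bored surfer jumping to a random page
        new[u] = c * backlink_sum + (1 - c) / n
    rank = new

print(sorted(pages, key=rank.get, reverse=True))  # C ranks highest
```

On this toy graph C collects the most rank, since three pages link to it; at Web scale the very same iteration runs over the integer-ID link database described in Section 2.5.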
3.2 Personalized PageRank

The E vector corresponds to the distribution of Web pages a random surfer jumps to. One extreme is a uniform E over all pages; then heavily interlinked pages such as copyright warnings, disclaimers and archives of highly interlinked mailing lists receive an overly high ranking. The other extreme is to have E consist entirely of a single Web page; this page and its immediate links then receive the highest PageRank. Trouble of that kind can be avoided by guessing a large part of the user's interests, for example by integrating the user's bookmarks and homepage into the E vector. Such personalized PageRanks may have a number of applications, including personal search engines.

3.3 PageRank and Google

A conventional search engine finds all Web pages whose titles contain all of the query words; that procedure ensures high precision. Sorting the results by PageRank additionally ensures high quality. PageRank works remarkably well and, through its integration in Google, serves a huge community. AltaVista, by comparison, first returns the root pages of servers, using URL length as a quality heuristic. Google combines a full-text search engine with PageRank and also uses standard IR measures, proximity and anchor text for ranking.

3.4 Manipulation by Commercial Interests

Personalized PageRanks of this kind are virtually immune to manipulation by commercial interests. At worst, there could be manipulation in the form of buying advertisements (links) on important sites, but this seems well under control since it is very expensive. A compromise between the two extremes of a uniform E and a single-page E is to let E consist of all the root-level pages of all Web servers.

3.5 Estimating Web Traffic

Concerning differences between PageRank and the actual usage of the Web, there may be things that people like to look at but do not want to mention on their Web pages (e.g. pages of political parties or religious groups).
Such pages would have high usage while their ranking remains low. In this case, data from Web usage mining may be used as the start vector of PageRank.

3.6 Other Approaches

PageRank as a backlink predictor avoids local maxima that citation counting gets stuck in, so it is sometimes a better approximation of citation counts than the citation counts themselves. For user navigation, a Web proxy application was developed that annotates each hyperlink the user sees with its PageRank. This gives the user a hint for deciding which links in a long listing are more likely to be interesting than others. The original goal of PageRank was a way to sort backlinks: if there is a large number of backlinks for a document, the most important backlinks can be displayed first. For example, people who run a news site always want to keep track of any significant backlinks.

4. Conclusion

In this paper the combination of the two fast-developing research areas Semantic Web and Web Mining is illustrated using the example of PageRank. The first section showed how Semantic Web Mining can improve the results of Web Mining by exploiting the new semantic structures in the Web, and how the construction of the Semantic Web can itself make use of Web Mining techniques. The following sections dealt with the PageRank algorithm, which computes a global ranking of all Web pages based on their location in the Web's graph structure, giving preference to more important and central pages. PageRank allows creating a view of the Web from a particular perspective and is quite robust against manipulation. Integrating PageRank into applications can improve traffic estimation and user navigation.

BIBLIOGRAPHY

Book(s):
1) Berendt, B., Hotho, A., Stumme, G. "Towards Semantic Web Mining." In Proceedings of the First International Semantic Web Conference (ISWC 2002), pp. 264-278, 2002.
2) Brin, S., Motwani, R., Page, L., Winograd, T.
"The PageRank Citation Ranking: Bringing Order to the Web." Technical Report, Stanford University, 1998.
3) Hotho, A., Studer, R., Stumme, G., Volz, R. "Semantic Web – State of the Art and Future Directions." KI (3/03), pp. 5-17, 2003.

Website(s):
1) en.wikipedia.org/wiki/PageRank
2) dbpubs.stanford.edu/pub/1999-66
3) citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768