An Introduction to Internet Data Mining, Sumit Ahlawat


International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 6 - Apr 2014

An Introduction to Internet Data Mining

Sumit Ahlawat



M.Tech Student, Department of CSE, Shri Baba Mastnath Engineering College, Rohtak (INDIA)




In this paper we discuss mining with respect to web data, referred to here as web data mining. We categorize web data mining into three areas: web content mining, web structure mining and web usage mining. We highlight and discuss various research issues involved in each of these categories. We believe that web data mining will be a topic of exploratory research in the near future.

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets using methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.


Data mining, the extraction of hidden predictive information from large databases, is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses.

The automated, prospective analyses offered by data mining move beyond the analyses of past events provided by retrospective tools typical of decision support systems. Most companies already collect and refine massive quantities of data. Web mining describes the application of traditional data mining techniques to web resources and has facilitated the further development of these techniques to take the specific structures of web data into account. The analyzed web resources comprise the actual web sites, the hyperlinks connecting these sites, and the paths that online users take on the web to reach a particular site.

Web usage mining then refers to the derivation of useful knowledge from these data inputs. The content of the raw data for web usage mining on the one hand, and the expected knowledge to be derived from it on the other, pose a special challenge. While the input data are mostly web server logs and other primarily technically oriented data, the desired output is an understanding of user behavior in the domains of online information search, online shopping, online learning, etc. This requires, on the one hand, an understanding and formal modeling of the behavior examined in the domain and, on the other, a picture of how the input data figure in these models. We are investigating "semantic web" approaches as a promising avenue for the formal and computational aspects of this goal. The content aspects of this goal require an understanding of behavioral theories in the investigated domains and a highly interdisciplinary research approach. The eventual presentation of the mining results to domain experts should consider general aspects of user interface design as well as domain-specific customs. Further, the development of visualizations as an important design element of user-oriented mining systems is a focus of our research efforts.

Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought on-line. When implemented on high performance client/server or parallel processing computers, data mining tools can analyze massive databases to deliver answers to questions such as, "Which clients are most likely to respond to my next promotional mailing, and why?"



Web structure mining, one of the three categories of web mining, is a tool used to identify the relationships between Web pages that are linked by information or by a direct link connection. This structure data is discoverable through the provision of a web structure schema, obtained by applying database techniques to Web pages. This connection allows a search engine to pull data relating to a search query directly to the linking Web page from the Web site on which the content rests.

This takes place through the use of spiders scanning the Web sites, retrieving the home page, and then linking the information through reference links to bring forth the specific page containing the desired information.

ISSN: 2231-5381

Structure mining helps minimize two main problems of the World Wide Web that arise from its vast amount of information. The first of these problems is irrelevant search results: the relevance of search information becomes misconstrued because search engines often allow only low-precision criteria. The second is the inability to index the vast amount of information provided on the Web, which causes low recall with content mining. This minimization comes in part from discovering the model underlying the Web hyperlink structure, which Web structure mining provides.
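The two problems correspond to the standard retrieval measures of precision and recall. The following is a minimal sketch, assuming the usual set-based definitions: precision is the fraction of retrieved pages that are relevant, and recall is the fraction of relevant pages that are retrieved. The page names and the function name precision_recall are hypothetical and serve only as an illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved versus relevant pages."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # pages that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search result and hypothetical ground truth.
retrieved = {"a.html", "b.html", "c.html", "d.html"}
relevant = {"a.html", "b.html", "e.html"}

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved pages are relevant, and two of the three relevant pages are retrieved, so both measures fall short of 1.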

The main purpose of structure mining is to extract previously unknown relationships between Web pages. This form of data mining allows a business to link the information of its own Web site to enable navigation and to cluster information into site maps. This gives its users the ability to access the desired information through keyword association and content mining. The hyperlink hierarchy is also determined, in order to relate the information within the site to competitor links and to connections through search engines and third-party co-links. This enables clustering of connected Web pages to establish the relationship between these pages.
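As an illustration of how relationships between pages can be derived from hyperlinks alone, the following minimal sketch builds in-link sets from a small, made-up list of links and counts how many pages link to both of two given pages (co-citation). The page names, the link list and the choice of co-citation as a measure are assumptions made for illustration; the discussion above does not prescribe a particular method.

```python
from collections import defaultdict

# Hypothetical hyperlinks as (source page, target page) pairs.
links = [
    ("home", "products"), ("home", "support"),
    ("blog", "products"), ("blog", "support"),
    ("partner", "products"),
]

# Map each page to the set of pages that link to it.
inlinks = defaultdict(set)
for source, target in links:
    inlinks[target].add(source)

def cocitation(page_a, page_b):
    """Number of pages that link to both page_a and page_b."""
    return len(inlinks[page_a] & inlinks[page_b])

print(sorted(inlinks["products"]))
print(cocitation("products", "support"))
```

Pages with a high co-citation count are candidates for the same cluster, since other pages tend to reference them together.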

On the WWW, structure mining makes it possible to determine which Web pages have a similar structure, by clustering them according to their underlying structure. This information can be used to project the similarities of web content. The known similarities then make it possible to maintain or improve the information of a site so that web spiders access it at a higher rate. The larger the number of Web crawlers reaching the site, the greater the benefit to the site, because its content is related to searches. Directing more traffic to the pages of a particular site increases the level of return visitation to the site and recall by search engines relating to the information or product provided by the company. This also enables marketing strategies to provide results that are more productive through navigation of the pages linking to the homepage of the site itself.



Web usage mining is the process of extracting useful information from server logs, that is, of finding out what users are looking for on the Internet. Some users might be looking at only textual data, whereas others might be interested in multimedia data. Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further depending on the kind of usage data considered:

Web Server Data: The user logs are collected by the Web server. Typical data includes IP address, page reference and access time.

Application Server Data: Commercial application servers have significant features to enable e-commerce applications to be built on top of them with little effort. A key feature is the ability to track various kinds of business events and log them in application server logs.

Application Level Data: New kinds of events can be defined in an application, and logging can be turned on for them thus generating histories of these specially defined events. It must be noted, however, that many end applications require a combination of one or more of the techniques applied in the categories above.
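For the Web server data described above, the following minimal sketch extracts the three typical fields (IP address, access time, page reference) from one log line. It assumes the log follows the widely used Common Log Format; the sample line and the function name parse_log_line are hypothetical.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [time] "method path protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def parse_log_line(line):
    """Return a dict with ip, time and page, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("ip"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "page": match.group("page"),
    }

# Hypothetical log line (the address is from a documentation-only range).
sample = '192.0.2.1 - - [10/Apr/2014:13:55:36 +0530] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
print(record["ip"], record["page"], record["time"].isoformat())
```

Records of this form, grouped by IP address and ordered by time, are a typical starting point for reconstructing the paths users take through a site.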

With improved navigation of Web pages on business Web sites, connecting the requested information to a search engine becomes more effective. This stronger connection helps generate traffic to a business site and so provides more productive results. The more links are provided between related web pages, the more clearly navigation reveals the link hierarchy, which makes navigation easier. This improved navigation draws the spiders to the correct locations providing the requested information, which proves more beneficial in clicks to a particular site.

Studies related to the work of [Weichbroth et al.] are concerned with two areas: constraint-based data mining algorithms applied in Web usage mining, and developed software tools.




Therefore, Web mining and the use of structure mining can provide strategic results for the marketing of a Web site in order to generate sales.

Web content mining is the mining, extraction and integration of useful data, information and knowledge from Web page content. The heterogeneity and lack of structure that permeate much of the ever-expanding information sources on the World Wide Web, such as hypertext documents, make automated discovery and organization difficult. Search and indexing tools of the Internet and the World Wide Web such as Lycos, Alta Vista, WebCrawler, ALIWEB, MetaCrawler, and others provide some comfort to users, but they do not generally provide structural information, nor do they categorize, filter, or interpret documents. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent web agents, as well as to extend database and data mining techniques to provide a higher level of organization for semi-structured data available on the web. The agent-based approach to web mining involves the development of sophisticated AI systems that can act autonomously or semi-autonomously on behalf of a particular user, to discover and organize web-based information.

Web mining is an important component of the content pipeline for web portals. It is used in data confirmation and validity verification, data integrity and building taxonomies, content management, content generation and opinion mining.

In particular, we discussed web data mining with respect to web structure, web content and web usage. This is a basic introduction to data mining, and the discussion above indicates its importance. Data mining yields enhanced results and cost profitability, is useful even to governments, and has enabled e-commerce to do personalized marketing.



Web content mining is differentiated from two points of view: the information retrieval view and the database view. R. Kosala et al. summarized the research done for unstructured and semi-structured data from the information retrieval view. Their summary shows that most of the research uses a bag of words, which is based on statistics about single words in isolation, to represent unstructured text, and takes the single words found in the training corpus as features. For semi-structured data, all the works utilize the HTML structures inside the documents, and some utilize the hyperlink structure between the documents, for document representation. As for the database view, in order to achieve better information management and querying on the web, the mining tries to infer the structure of the web site so as to transform the web site into a database.
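The bag-of-words representation can be sketched as follows. This is a minimal illustration under simple assumptions: words are lowercase runs of the letters a to z, and the feature set is the vocabulary of a tiny, made-up training corpus. The corpus and the function names are hypothetical.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase word occurrences, ignoring word order and context."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical training corpus.
training_corpus = [
    "Web mining extracts knowledge from web data",
    "Usage mining analyses server logs",
]

# The features are the distinct single words seen in the training corpus.
vocabulary = sorted(set().union(*(bag_of_words(doc) for doc in training_corpus)))

def to_vector(text):
    """Represent text as word counts over the fixed vocabulary."""
    counts = bag_of_words(text)
    return [counts[word] for word in vocabulary]

print(vocabulary)
print(to_vector("web usage mining of web logs"))
```

Words that do not occur in the training corpus (here "of") are simply dropped, which is one consequence of taking only training-corpus words as features.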






There are several ways to represent documents; the vector space model is typically used. The documents constitute the whole vector space. If a term t occurs n(D, t) times in document D, the t-th coordinate of D is n(D, t). When the length of the words in a document goes to [corrupted text]. This representation does not capture the importance of words in a document. To resolve this, tf-idf (term frequency times inverse document frequency) is introduced.
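A minimal sketch of tf-idf weighting follows, assuming one common variant: the term frequency is the raw count n(D, t), and the inverse document frequency is log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. The text does not state which variant is intended; the toy documents and the function name tf_idf are hypothetical.

```python
import math
from collections import Counter

# Hypothetical documents, already split into words.
documents = {
    "D1": "web mining web data".split(),
    "D2": "web usage logs".split(),
    "D3": "data warehouse".split(),
}

# n(D, t): raw count of term t in document D.
counts = {name: Counter(words) for name, words in documents.items()}

# df(t): number of documents that contain term t.
document_frequency = Counter(term for c in counts.values() for term in c)

def tf_idf(doc_name, term):
    """tf-idf weight of term in the named document (0.0 if the term is absent)."""
    tf = counts[doc_name][term]
    if tf == 0:
        return 0.0
    idf = math.log(len(documents) / document_frequency[term])
    return tf * idf

print(round(tf_idf("D1", "web"), 3))     # occurs twice in D1, but also in D2
print(round(tf_idf("D1", "mining"), 3))  # occurs once, and only in D1
```

Although "web" has the higher raw count in D1, "mining" receives the larger weight because it occurs in fewer documents, which is the effect the raw-count representation cannot express.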




By scanning the document multiple times, we can implement feature selection. A feature subset is extracted under the condition that the categorization result is hardly affected. The general approach is to construct an evaluating function to score the features. Information Gain, Cross Entropy, Mutual Information, and Odds Ratio are commonly used as evaluating functions. The classifiers and pattern analysis methods of text data mining are very similar to traditional data mining techniques. The usual evaluation measures are Classification Accuracy, Precision, Recall and Information Score.
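As one example of an evaluating function, the following minimal sketch scores terms by information gain: the reduction in class entropy obtained by knowing whether a term occurs in a document. It assumes binary term presence and base-2 entropy. The labeled documents and function names are hypothetical, and this is only one of several ways information gain is defined for text features.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction in class entropy from knowing whether term occurs in a document."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term) if part
    )
    return entropy(labels) - remainder

# Hypothetical labeled documents, each reduced to its set of words.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "team"}]
labels = ["sport", "sport", "politics", "politics"]

scores = {t: information_gain(docs, labels, t) for t in ("goal", "team", "vote")}
print(scores)
```

In this toy data "goal" and "vote" separate the two classes perfectly, while "team" occurs equally often in both classes and carries no information, so a feature selector based on this score would discard "team" first.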

