ANALYSIS AND COMPARISON OF CLIENT BASED WEB USAGE MINING TECHNIQUE

advertisement
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 6, June 2013
ISSN 2319 - 4847
ANALYSIS AND COMPARISON OF CLIENT
BASED WEB USAGE MINING
TECHNIQUE
1
1
Devendra Patil, 2Ravindra Gupta
Department of Computer Science & Engineering
Sri Satya Sai Institute of Science & Technology
Sehore (M. P.)
2
Abstract
Web Mining is the use of the data mining techniques to automatically discover and extract information from web
documents/services.Web Mining is the extraction of interesting and potentially useful patterns and implicit information from
artifacts or activity related to the World Wide Web. Web usage mining provides the support for the web site design, providing
personalization server and other business making decision, etc. It has quickly become one of the most important areas in
Computer and Information Sciences because of its direct applications in e-commerce, CRM, Web analytics, information
retrieval and filtering, and Web information systems. To maintain website we need to improve its design. To improve the design
of website we should find out how it is used by analyzing users’ browsing behavior.
1. Introduction
Statistical Analysis and Web Usage Mining are two ways to analyze users’ web browsing behavior. The result of
Statistical Analysis contains Page Views, Page Browsing Time, and so on. Web usage mining applies data mining
methods to discover web usage pattern through web usage data. Item-Set Mining, Sequential Pattern Mining, and
Graph Mining, are examples of data mining methods that can be used to analyze web usage data [1-2].
Web usage data can be collected from three sources: server level, client level, and proxy level [3]. Statistical analysis
and web usage mining usually use Server Log as main data source which is server level data source. Server log is a file
that automatically created by web server and kept on server. It contains some data about requests which are sent to web
server. During reconstruction of user’s session, server log may not be fully reliable because in some cases such as page
cashing and POST method, data are not recorded in server log [3]. Moreover, because of it only contains server side
data, cannot be useful when we are interested in user’s activities on client side such as hitting back button of browser,
switching between tabs or windows and so on. To cope with such problems we also need to consider client side data [4].
Web Mining can be broadly divided into three distinct categories, according to the kinds of data to be mined:
1.1 Web Content Mining:
Web Content Mining is the process of extracting useful information from the contents of Web documents. Content data
corresponds to the collection of facts a Web page was designed to convey to the users. It may consist of text, images,
audio, video, or structured records such as lists and tables. Text mining and its application to Web content has been the
most widely researched. Some of the research issues addressed in text mining are, topic discovery, extracting
association patterns, clustering of web documents and classification of Web Pages. Research activities in this field also
involve using techniques from other disciplines such as Information Retrieval (IR) and Natural Language Processing
(NLP). While there exists a significant body of work in extracting knowledge from images - in the fields of image
processing and computer vision - the application of these techniques to Web content mining has not been very rapid.
1.2 Web Structure Mining
The structure of a typical Web graph consists of Web pages as nodes, and hyperlinks as edges connecting between two
related pages. Web Structure Mining can be regarded as the process of discovering structure information from the Web.
This type of mining can be further divided into two kinds based on the kind of structural data used.
Fig 1.1
Volume 2, Issue 6, June 2013
Page 547
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 6, June 2013
ISSN 2319 - 4847
Hyperlinks: A Hyperlink is a structural unit that connects a Web page to different location, either within the same
Web page or to a different Web page. A hyperlink that connects to a different part of the same page is called an IntraDocument Hyperlink, and a hyperlink that connects two different pages is called an Inter-Document Hyperlink.
Document Structure: In addition, the content within a Web page can also be organized in a
tree-structured format,
based on the various HTML and XML tags within the page.
1.3 Web Usage Mining:
Web Usage Mining is the application of data mining techniques to discover interesting usage patterns from Web data,
in order to understand and better serve the needs of Web-based applications. Usage data captures the identity or origin
of Web users along with their browsing behavior at a Web site. Web usage mining itself can be classified further
depending on the kind of usage data considered:
 Web Server Data: They correspond to the user logs that are collected at Web server. Some of the typical data
collected at a Web server include IP addresses, page references, and access time of the users.
 Application Server Data: Commercial application servers, e.g. Web logic [BEA], Broad Vision [BV], Story Server
[VIGN], etc. have significant features in the framework to enable Ecommerce applications to be built on top of
them with little effort. A key feature is the ability to track various kinds of business events and log them in
application server logs.
 Application Level Data: Finally, new kinds of events can always be defined in an application, and logging can be
turned on for them – generating histories of these specially defined events.
Fig 1.2 Taxonomy of Web Mining
2. Literature Survey
Sebastian A. Rios and Juan D. Velasquez [28] developed a Semantic WUM process, which uses a concept-based
approach to add semantics into the mining process. The solution proposed, was applied to a real web site to produce
offline enhancements of contents and structure. The method was compared with four different WUM methods. But the
results were not so promising due to the use of heavy statistical processing. Then, Yan LI, Boqin FENG and Qinjiao
MAO proposed algorithms avoid the complicated procedure of mining site topology and don’t produce the user privacy
issues. The modification of the reference length of the pages after path completion has also been implemented; it is very
helpful for more accurate investigation on the user access pattern [29]. They have merely achieved the success to get
the real-time but they have not considered the other user activities like minimization or tab-shifts etc. Thereafter, Saud
R.Aghabozorgi and Ying Wah proposed an off-line model based web usage mining that is generated by clustering
algorithm which uses users’ transactions periodically to change the off-line model to a dynamic-model.
3. Proposed Work
In general most of the users have tendency to open several pages simultaneously and in between, use some non
browsing applications such as Ms-word, Excel etc for their own personal work, in such cases data recorded in server
log only shows the requested time of the web pages and cannot help us to find out which web page and for how long
has been really browsed on client machine. The calculated browsing time comparison shows that the time is reduced by
considering the actual scenario of web page usage, which gives realistic browsing time of the user behavior at the web
page.
Web Usage Mining and its algorithms have a bigger scope as far as research is concerned. Web mining and its
application area is still in its infancy and requires more research. Besides Web content and Web Link, the Web Usage
Mining is one of the most important areas of web mining research. We have got different application areas like
Volume 2, Issue 6, June 2013
Page 548
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 6, June 2013
ISSN 2319 - 4847
Business Intelligence, E-Commerce or alike. These application areas have got more research interest. The kind of data
we can recently have from Web Log is not adequate. So, this research area is also highly promising. Web Mining and
specifically Web Usage Mining can give rise to different application areas, which will really be beneficial for Web
Users, society, and obviously for the governments. Log files are the best source to know user behavior. But the raw log
files contains unnecessary details like image access, failed entries etc., which will affect the accuracy of pattern
discovery and analysis. So preprocessing stage is an important work in mining to make efficient pattern analysis. To
get accurate mining results user’s session details are to be known.The research in future could be targeted to create
more efficient session reconstructions through graphs and mining the sessions using graph mining as quality sessions
gives more accurate patterns for analysis of users. Data Security and Privacy of data should be taken into consideration
for Web Miner or Web Usage miners in future works.
4. Results
This proposed approach will solve the problem of the decrease of accuracy in the off-line models over time resulted of
new users joining or changes of behavior for existing users in model-based approaches. But, the probabilistic approach
added the random behavioral response of calculation which sometimes returns false results. In the experiments, we
have set the database without using the referrer information. In this case, we can see that the results of base paper is
giving more time utilization for all the pages, while this may be not true in case when the webpages are used and then
the user has shifted to some offline processing. In the contrast, the results of proposed method are slightly down
because it is able to recognize the offline working and therefore not adding-up that time in the web page utilization
duration. The results of both the algorithms are as follows,
While experimenting with the dataset and introducing the variations, we have found that the proposed algorithm
outperforms in the situation when the page hits are high while not very promising results are recorded in case when the
page hit are less with respect to time. However, in any case the proposed algorithm does well as compared to the base
algorithm.
5. Conclusion
This paper has attempted to provide an up-to-date survey of the rapidly growing area of Web Usage mining. With the
growth of Web-based applications, specially electronic commerce, there is significant interest in analyzing Web usage
data to better understand Web usage, and apply the knowledge to better serve users. This has led to a number of
commercial offerings for doing such analysis. However, Web Usage mining raises some hard scientific questions that
must be answered before robust tools can be developed. In this paper user’s browsing behaviour on a web page is
considered with different modes of actions on client side to calculate web page browsing time with more
precision and accuracy. It helps to maintain the accuracy in finding weighted frequent web usage patterns
and administrators to evaluate its website usage more effectively.
References
[1.] Buchner, A. and Mulvenna, M. D., Discovering internet marketing intelligence through online analytical Web
usage mining. SIGMOD Record, (4) 27, 1999.
[2.] Chen, M. S., Park, J. S., and Yu, P. S., Data mining for path traversal patterns in a Web environment. In
Proceedings of 16th International Conference on Distributed Computing Systems, 1996.
[3.] Charniak, E., Statistical language learning. MIT Press, 1996.
[4.] Cooley, R., Mobasher, B., and Srivastava, J., Data preparation for mining World Wide Web browsing patterns.
Journal of Knowledge and Information Systems, (1) 1, 1999.
[5.] Agarwal, R., Aggarwal, C., and Prasad, V., A tree projection algorithm for generation of frequent itemsets. In
Proceedings of High Performance Data Mining Workshop, Puerto Rico, 1999.
Volume 2, Issue 6, June 2013
Page 549
International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 6, June 2013
ISSN 2319 - 4847
[6.] Agrawal, R. and Srikant, R., Fast algorithms for mining association rules. In Proceedings of the 20th VLDB
conference, pp. 487-499, Santiago, Chile, 1994.
[7.] Han, E-H, Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., and Mobasher, B., More, J.,
Document categorization and query generation on the World Wide Web using WebACE. Journal of Artificial
Intelligence Review, January 1999.
[8.] Herlocker, J., Konstan, J., Borchers, A., Riedl, J.. An algorithmic framework for performing collaborative filtering.
To appear in Proceedings of the 1999 Conference on Research and Development in Information Retrieval, August
1999.
[9.] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., Clustering based on association rule hypergraphs. In
Proccedings of SIGMOD’97 Workshop on Research Issues in Data Mining and Knowledge Discovery
(DMKD’97), May 1997.
[10.] Han, E-H, Karypis, G., Kumar, V., and Mobasher, B., Hypergraph based clustering in highdimensional data sets:
a summary of results. IEEE Bulletin of the Technical Committee on Data Engineering, (21) 1, March 1998.
[11.] Konstan, J., Miller, B., Maltz, D., Herlocker, J., Gordon, L., and Riedl, J., GroupLens: applying collaborative
filtering to usenet news. Communications of the ACM (40) 3, 1997.
[12.] Spiliopoulou, M. and Faulstich, L. C., WUM: A Web Utilization Miner. In Proceedings of EDBT Workshop
WebDB98, Valencia, Spain, LNCS 1590, Springer Verlag, 1999.
[13.] Schechter, S., Krishnan, M., and Smith, M. D., Using path profiles to predict HTTP requests. In Proceedings of
7th International World Wide Web Conference, Brisbane, Australia, 1998.
[14.] Nasraoui, O., Frigui, H., Joshi, A., Krishnapuram, R., Mining Web access logs using relational competitive fuzzy
clustering. To appear in the Proceedings of the Eight International Fuzzy Systems Association World Congress,
August 1999.
[15.] Perkowitz, M. and Etzioni, O., Adaptive Web sites: automatically synthesizing Web pages. In Proceedings of
Fifteenth National Conference on Artificial Intelligence, Madison, WI, 1998.
[16.] Shardanand, U., Maes, P., Social information filtering: algorithms for automating "word of mouth."In Proceedings
of the ACM CHI Conference, 1995.
Volume 2, Issue 6, June 2013
Page 550
Download