Introduction - Academic Science,International Journal of Computer

A Literature Review on Web Usage Mining Techniques
P. Srinivasa Rao 1, Dr. D. Vasumathi 2
1 Assistant Professor, Department of IT, MRCET, Hyderabad, Telangana
2 Professor, Department of CSE, JNTUH CEH, Hyderabad, Telangana
Abstract
The rapid growth of the World Wide Web poses unprecedented scaling challenges for search engines. In the modern era of high-volume information generation, the search engine has proved to be a pivotal technology for data mining and information retrieval. General-purpose search engines have achieved a great deal of success in providing relevant information to the user, and they serve as an effective tool for retrieving information from a huge information repository. For instance, Google, one of the most popular search engines, not only provides fitting search results to users around the world by indexing more than 20 hundred million web pages, but also usually answers a query in no more than 0.5 seconds. The ubiquity of the Internet and the Web has led to the emergence of several Web search engines with varying capabilities. These search engines index Web sites, images, Usenet newsgroups, content-based directories, and news sources with the goal of producing search results that are most relevant to user queries. However, only a small number of Web users actually know how to utilize the true power of Web search engines. To address this problem, search engines have started providing access to their services via various interfaces.
INTRODUCTION
A search engine, as a tool for investigating the Web, must return the desired results for any given query. The success of a search engine depends directly on the satisfaction level of the user. Users want information to be presented to them within a short time interval, and they expect the most relevant and recent information to be presented [3]. Most search engines cannot completely satisfy users' requirements, and the search results are often inaccurate and irrelevant [4]. Many researchers have already reported on various aspects of search engines [5, 6]. A meta search engine is a kind of search engine that provides users with information services without having its own database of web pages. It sends search terms to the databases maintained by other search engines and gives users the results that come from all the search engines queried [4]. The lack of any specific structure and the wide range of data published on the Web make it highly challenging for the user to find data without external assistance. It is generally believed [8, 9] that a single general-purpose search engine for all Web data is improbable, because its processing power cannot scale up to the fast-increasing and unbounded amount of Web data. A tool that is swiftly gaining approval among users is the meta search engine [10]. Meta search engines can run a user query across multiple component search engines concurrently, retrieve the generated results, and merge them. The benefits of meta search engines over individual search engines are notable [11].
The meta search engine enhances the search coverage of the Web, providing higher recall. The overlap among the primary search engines is generally small [12] and can be as small as three percent of the total results retrieved. The meta search engine addresses the scalability issue of searching the Web and facilitates the use of multiple search engines, enabling consistency checking [13]. It also enhances retrieval effectiveness, providing higher precision because of the 'chorus effect' [14]. Web meta searching, in contrast to rank aggregation, poses its own unique challenges. The results that a meta search system gathers from its component engines are not like votes or other single-dimensional entities: apart from the individual rank assigned to it by a component engine, a Web result also carries a title, a small fragment of text that indicates its relevance to the submitted query (the textual snippet) [7, 15], and a uniform resource locator (URL). Evidently, traditional rank aggregation techniques are insufficient for providing a robust ranking mechanism for meta search engines, because they ignore the semantics of each Web result.
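To make the limitation concrete, the following is a minimal sketch of one traditional positional rank aggregation method, the Borda count, applied to the result lists of several component engines. The function name `borda_merge` and the sample URLs are hypothetical and are not taken from any of the cited works. Note that the merge uses only rank positions; titles, snippets and URLs play no part, which is the weakness described above.

```python
from collections import defaultdict


def borda_merge(rankings):
    """Merge ranked URL lists from several engines with a Borda count.

    rankings: list of lists, each ordered best-first.
    A result at position p in a list of length n earns n - p points;
    a result missing from a list earns nothing from that list.
    """
    scores = defaultdict(int)
    for ranked in rankings:
        n = len(ranked)
        for position, url in enumerate(ranked):
            scores[url] += n - position
    # Sort by score (descending), then by URL for a deterministic order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical result lists from three component engines.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u1", "u4"]
engine_c = ["u2", "u3", "u1"]
merged = borda_merge([engine_a, engine_b, engine_c])
print(merged)
```

Here `u2` is placed first because it is ranked highly by all three engines, while `u4`, returned by a single engine in last position, is placed last.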
2. LITERATURE REVIEW
With the development of the Internet, web services generate a large amount of log information, and how to mine users' preferred browsing paths from it is an important research area. Current research focuses mainly on mining user-preferred browsing paths. This section gives a brief review of some of the related work. Leonidas Akritidis et al. [16] presented the QuadRank technique, which considers additional information regarding the query terms, the collected results and the data correlation. They implemented QuadRank in a real-world meta search engine and tested it comprehensively for both effectiveness and efficiency in a real-world search environment, also using the task from the TREC-2009 conference. They demonstrated that in most cases their technique outperformed all component engines. Hideaki Ishii et al. [17] proposed a technique to reduce the computation and communication loads of the PageRank algorithm. They developed a method to systematically aggregate web pages into groups by exploiting the sparsity inherent in the Web. For each group, they computed an aggregated PageRank value that can be distributed among the group members. They provided a distributed update scheme for the aggregated PageRank along with an analysis of its convergence properties, and a numerical example to illustrate the level of reduction in computation while keeping the error in rankings small.
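For orientation, the baseline computation that such aggregation schemes aim to make cheaper can be sketched as a plain PageRank power iteration. This is the standard centralized algorithm, not the aggregated, distributed scheme of [17]; the damping factor 0.85, the iteration count and the toy link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))
```

Each iteration touches every link once, which is why grouping pages and updating one aggregated value per group can reduce the computation and communication load.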
Tagging activity has recently been identified as a potential source of knowledge about personal interests, preferences, goals, and other attributes known from user models. Tags themselves can therefore be used for finding personalized recommendations of items. Frederico Durao and Peter Dolog [18] presented a tag-based recommender system that suggests similar Web pages based on the similarity of their tags from a Web 2.0 tagging application. The proposed approach extends the basic similarity calculus with external factors such as tag popularity, tag representativeness and the affinity between user and tag.
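The idea of extending a basic tag similarity with per-tag factors can be illustrated with a weighted Jaccard index. This is only a sketch under stated assumptions: the single `weight` per tag stands in for factors such as popularity or user affinity, and the function names, pages and weights are hypothetical rather than the formulation used in [18].

```python
def weighted_tag_similarity(tags_a, tags_b, weight):
    """Weighted Jaccard similarity between two tag sets.

    weight maps a tag to a positive importance score; tags without an
    entry default to 1.0, which reduces to the plain Jaccard index.
    """
    union = tags_a | tags_b
    if not union:
        return 0.0
    shared = sum(weight.get(t, 1.0) for t in tags_a & tags_b)
    total = sum(weight.get(t, 1.0) for t in union)
    return shared / total


def recommend(target, pages, weight, top_k=2):
    """Return the top_k pages most similar to `target` by their tags."""
    scored = [
        (weighted_tag_similarity(pages[target], tags, weight), name)
        for name, tags in pages.items()
        if name != target
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


# Hypothetical tagged pages and tag weights.
pages = {
    "p1": {"python", "web", "mining"},
    "p2": {"python", "mining", "logs"},
    "p3": {"cooking", "recipes"},
    "p4": {"web", "design"},
}
tag_weight = {"mining": 2.0, "web": 0.5}
suggestions = recommend("p1", pages, tag_weight)
print(suggestions)
```

Raising the weight of a tag makes pages that share it count as more similar, which is how an external factor can reorder the recommendations without changing the tags themselves.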
Soheila Abrishami et al. [19] aimed to design a hybrid recommendation system that integrates semantic information with Web usage mining and clusters pages based on semantic similarity. Since the Web pages are seen as ontology individuals, frequent navigational patterns take the form of ontology instances instead of Web page addresses, and page clustering is done using semantic similarity. The result is used for generating Web page recommendations to users. The recommender engine, which is based on semantic patterns and page clustering, creates a list of appropriate recommendations. The results of implementing this hybrid recommendation system indicate that integrating semantic information and page access sequence into the patterns yields more accurate recommendations.
Yang and Hanjalic [20] developed a prototype-based re-ranking framework, which constructs meta re-rankers corresponding to visual prototypes representing the textual query and learns the weights of a linear re-ranking model that combines the results of the individual meta re-rankers to produce the re-ranking score of a given image taken from the initial text-based search result. The induced re-ranking model is learned in a query-independent way, requiring only a limited labeling effort and being able to scale up to a broad range of queries. The experimental results on the Web Queries dataset demonstrated that the proposed method outperforms all the existing supervised and unsupervised re-ranking methods.
To provide personalized preferred paths that fulfill user needs, Zhou Zhurong and Dengwu Yang [21] proposed a novel method to compute the similarity between preferred paths and fields given by experts. First, the similarity between each page on the preferred path and the given field is computed. Second, from these per-page similarities, the average similarity of all the pages on the preferred path to the given field is computed and used as the similarity between the preferred path and the given field. Experimental results showed that the method is accurate and scalable. It could be applied to optimize websites or design personalized services. Nazneen Tarannum S.H. Rizvi and Prof. Ranjit R. Keole [22] presented a new framework for semantic-enhanced Web-page recommendation (WPR), and a suite of enabling techniques which include semantic network models of domain knowledge and Web usage knowledge, and querying techniques.
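The two-step averaging described for the preferred-path method of [21] can be sketched as follows. The per-page similarity values are assumed to be given (how [21] computes them is not detailed here), and the function name, page URLs and scores are hypothetical.

```python
def path_field_similarity(path, page_similarity):
    """Average the per-page similarities along a preferred path.

    path: ordered list of page identifiers on the preferred path.
    page_similarity: dict mapping a page to its similarity with the
    given field (assumed to be precomputed, in the range [0, 1]).
    """
    if not path:
        return 0.0
    return sum(page_similarity[page] for page in path) / len(path)


# Hypothetical page-to-field similarities for one expert-given field.
page_similarity = {"/home": 0.2, "/courses": 0.8, "/courses/ml": 0.9}
preferred_path = ["/home", "/courses", "/courses/ml"]
score = path_field_similarity(preferred_path, page_similarity)
print(round(score, 3))
```

Because the path score is a plain mean, one weakly related page (such as a home page that every path passes through) lowers the score of an otherwise relevant path.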
If a user accesses a page by using the back button in the browser, the browser returns a copy of the page stored in its cache. This kind of access does not record any entry in the log file, which causes the problem of missing references; hence path completion techniques are required to fill in these entries in the log file [23]. The learnt object components range from local structures over line segments to global silhouette-like descriptions. This representation can be used for categories in a totally unsupervised fashion. Furthermore, the representation is employed as the basis for building a supervised multi-category detection system, making efficient use of training examples and outperforming pure features-based representations.

| Author | Preprocessing techniques | Focused on | Remarks |
|---|---|---|---|
| Cooley et al. [7] | Data cleaning, user identification, session identification, transaction identification | Transaction identification | Proposed heuristics are not suitable for complex web sites |
| Pabarskaite [15] | Advanced data cleaning, filtering and data visualization | Data cleaning | Did not perform any other preprocessing technique such as user identification or session identification |
| Tanasa et al. [24] | Data fusion, data cleaning, data structuration and data summarization | All except path completion | Ignored the removal of wrong HTTP request status codes |
| Castellano et al. [25] | Data cleaning module, data structuration module and data filtering module | All except path completion | Included almost all steps of data preprocessing |
| Robert et al. [26] | Data cleaning and filtering, user identification, session identification | Session identification | Better session creation simultaneously by using integer programming |
| Yen Li et al. [23] | Data cleaning, user identification, session identification and path completion | Path completion | Combined two approaches, maximal forward reference length and reference length, to find the completed path |
| Xiang-ying Li [27] | Data cleaning, client identification, session identification and path completion | Client identification and session identification | High accuracy and high efficiency but poor operating rate |

Tanasa et al. [24] divide the preprocessing process into four steps: data fusion, data cleaning, data structuration and data summarization. In data fusion, the authors joined multiple log files from different web servers, and also from site maps, into a single log file. After that they anonymized the log file by encrypting host names. Data cleaning is then performed by removing requests for non-analyzed resources such as multimedia files (images, audio, video etc.) and robot-generated requests. In the data structuration step, the authors performed user identification by authentication data or IP address, session identification by host and agent, page view identification by site map, and so on. Finally, the data summarization step covers the pattern analysis part using data generalization and aggregation. They did not consider unsuccessful requests in the data cleaning phase, which also need to be removed to avoid unnecessary calculations in later phases of the web log mining process.
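The data cleaning step discussed above can be sketched as a filter over web server log lines. This is a minimal illustration, assuming the log is in the Apache Combined Log Format; the regular expression, the suffix and robot-marker lists and the sample lines are assumptions made for the example, not the rules used in [24]. The sketch also drops unsuccessful requests (non-2xx status codes), the step that the review notes was missing.

```python
import re

# Combined Log Format:
# host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)
# Assumed lists; a real deployment would tune these to the site.
MEDIA_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".mp3", ".wav", ".mp4", ".avi")
ROBOT_MARKERS = ("bot", "crawler", "spider")


def clean_log(lines):
    """Keep only successful page requests made by (apparent) human users."""
    kept = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed line
        entry = match.groupdict()
        path = entry["url"].split("?", 1)[0].lower()
        agent = entry["agent"].lower()
        if path.endswith(MEDIA_SUFFIXES):
            continue  # multimedia resource
        if path == "/robots.txt" or any(m in agent for m in ROBOT_MARKERS):
            continue  # robot-generated request
        if not entry["status"].startswith("2"):
            continue  # unsuccessful request
        kept.append(entry)
    return kept


sample = [
    '10.0.0.1 - - [10/Oct/2015:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '10.0.0.1 - - [10/Oct/2015:13:55:37 +0530] "GET /logo.png HTTP/1.1" 200 512 "/index.html" "Mozilla/5.0"',
    '10.0.0.2 - - [10/Oct/2015:13:56:01 +0530] "GET /robots.txt HTTP/1.1" 200 68 "-" "Googlebot/2.1"',
    '10.0.0.3 - - [10/Oct/2015:13:57:12 +0530] "GET /missing.html HTTP/1.1" 404 0 "-" "Mozilla/5.0"',
]
cleaned = clean_log(sample)
print([entry["url"] for entry in cleaned])
```

Of the four sample requests, only the successful request for `/index.html` survives; the image, the robot request and the 404 response are all discarded before any later mining phase sees them.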
When evaluated on the TRECVID 2005 video benchmark, the proposed approach improves retrieval on average by up to 32% relative to the baseline text search method in terms of story-level Mean Average Precision. For people-related queries, which usually have recurrent coverage across news sources, the relative improvement can reach 40%. Most of all, the proposed method does not require any additional input from users (e.g., example images) or complex search models for special queries (e.g., named person search).
Castellano et al. [25] developed a tool, LODAP (Log Data Preprocessor), which takes a log file as input and gives statistical analysis and user sessions as output. The tool is divided into three modules: a data cleaning module, a data structuration module and a data filtering module.
Robert et al. [26] applied integer programming to achieve better session identification. This method generates sessions simultaneously and produces sessions that better match an empirical distribution.
Xiang-ying Li [27] proposed an algorithm named CSIA (Client and Session Identification Algorithm) for the identification of users and sessions. The algorithm takes a comprehensive approach, combining IP address, topology, browser version and referrer page to identify unique users with better accuracy and efficiency. The algorithm was implemented in the Java language, which the author considers good for space utilization. However, the algorithm suffers from a decrease in operating rate because many factors are considered when identifying a user.
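As a point of comparison for the identification approaches reviewed above, a common and much simpler heuristic can be sketched: treat each distinct (IP address, user agent) pair as one user, and start a new session whenever that user is inactive for longer than a timeout. This is not CSIA and not the integer programming method of [26]; the 30-minute timeout, the function name and the sample requests are assumptions for illustration.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed threshold, not from the paper


def build_sessions(requests):
    """Group requests into per-user sessions.

    requests: iterable of (ip, agent, timestamp, url) tuples.
    Returns a dict mapping (ip, agent) to a list of sessions, where
    each session is a list of URLs in time order.
    """
    sessions = {}
    last_seen = {}
    for ip, agent, timestamp, url in sorted(requests, key=lambda r: r[2]):
        user = (ip, agent)
        previous = last_seen.get(user)
        # First request of this user, or a gap longer than the timeout.
        if previous is None or timestamp - previous > SESSION_TIMEOUT:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = timestamp
    return sessions


t0 = datetime(2015, 10, 10, 13, 0)
requests = [
    ("10.0.0.1", "Firefox", t0, "/index.html"),
    ("10.0.0.1", "Firefox", t0 + timedelta(minutes=5), "/courses.html"),
    ("10.0.0.1", "Chrome", t0 + timedelta(minutes=6), "/index.html"),
    ("10.0.0.1", "Firefox", t0 + timedelta(minutes=50), "/index.html"),
]
sessions = build_sessions(requests)
print(sessions[("10.0.0.1", "Firefox")])
```

The two browsers behind the same IP address are counted as two users, and the Firefox user's request after a 45-minute gap opens a second session. Adding further signals such as topology or referrer page, as CSIA does, improves accuracy at the cost of more work per request.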
3. CONCLUSIONS
Web-page recommendation, or personalization, plays a significant role in intelligent web systems. Useful knowledge discovery from web usage data and satisfactory knowledge representation for effective Web-page recommendations are crucial and challenging. The common problems of the existing techniques are listed below.
• The major problem of many online web sites is the presentation of many choices to the client at a time; this usually makes finding the right product or information on the site a strenuous and time-consuming task.
• Existing techniques make little use of ontology knowledge and usage history for personalization.
• Owing to a lack of accuracy and long run times, existing recommendation systems suffer from low coverage.
• Pages that were recently added or are rarely visited by end users are not shown by the existing techniques, which is also an important problem.
These problems motivate research on web-page recommendation. In future, a combination of two or more user identification techniques can be used to achieve better user identification. This paper has reviewed various applied data preprocessing techniques with their advantages and disadvantages, and has drawn conclusions and future research directions.
References
[1] Abawajy, J.H., Hu, M.J., "A new Internet meta-search engine and implementation," The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005.
[2] Juan Tang, Ya-Jun Du, Ke-Liang Wang, "Design and Implement of personalize Meta-Search Engine Based on FCA," Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007.
[3] K. Satya Sai Prakash, S. V. Raghavan, "DLAPANGSE: Distributed Intelligent Agent based Parallel Architecture for Next Generation Search Engines", IIT Madras, India, 2001.
[4] Z. Li, Y. Wang, V. Oria, "A New Architecture for Web Meta-Search Engines," Seventh Americas Conference on Information Systems, CIS Department, New Jersey Institute of Technology, 2001.
[5] A. Arasu, et al., "Searching the Web", ACM Transactions on Internet Technology, Vol. 1, August 2001, pp. 2-43.
[6] G. S. Goldszmidt, "Distributed Management by Delegation," Ph.D. Thesis, Columbia University, 1996.
[7] R. Cooley, B. Mobasher, J. Srivastava (1999), Data preparation for mining world wide web browsing patterns in Journal of Knowledge and Data Engineering Workshop, IEEE, Vol. 1, Page(s): 5-32.
[8] Sugiura, A., Etzioni, O., 2000. Query routing for Web search engines: architecture and experiments. Computer Networks 33 (1–6), 417–429.
[9] Manning, C.D., Raghavan, P., Schutze, H., 2008. Introduction to Information Retrieval. Cambridge University Press.
[10] Meng, W., Yu, C., Liu, K.-L., 2002. Building efficient and effective metasearch engines. ACM Computing Surveys 34 (1), 48–89.
[11] Spink, A., Jansen, B.J., Blakely, C., Koshman, S., 2006. Overlap among major Web search engines. In: Proceedings of the IEEE International Conference on Information Technology: New Generations (ITNG), pp. 370–374.
[12] Aslam, J.A., Montague, M.H., 2001a. Metasearch consistency. In: Proceedings of the ACM International Conference on Research and Development in Information Retrieval (SIGIR), pp. 386–387.
[13] Vogt, C.C., 1999. Adaptive combination of evidence for information retrieval. Ph.D. Thesis. University of California at San Diego.
[14] Dwork, C., Kumar, R., Naor, M., Sivakumar, D., 2001. Rank aggregation methods for the Web. In: Proceedings of the ACM International Conference on World Wide Web (WWW), pp. 613–622.
[15] Pabarskaite Z (2002), Implementing advanced cleaning and end-user interpretability technologies in web log mining in 24th International Conference on Information Technology Interfaces (ITI), Vol. 1, Page(s): 109-113.
[16] Leonidas Akritidis, Dimitrios Katsaros and Panayiotis Bozanis, "Effective rank aggregation for metasearching", The Journal of Systems and Software, vol. 84, pp. -143, 2011.
[17] Hideaki Ishii, Roberto Tempo and Er-Wei Bai, "A Web Aggregation Approach for Distributed Randomized PageRank Algorithms", IEEE Transactions on Automatic Control, Vol. 57, No. 11, pp. 2703-2717, 2012.
[18] Frederico Durao, Peter Dolog, "A Personalized Tag-Based Recommendation in Social Web Systems", 2012.
[19] Soheila Abrishami, Mahmoud Naghibzadeh, Mehrdad Jalali, "Web Page Recommendation Based on Semantic Web Usage Mining", Volume 7710 of the series Lecture Notes.
[20] Linjun Yang, Alan Hanjalic, "Prototype-Based Image Search Reranking," IEEE Transactions on Multimedia, Vol. 14, No. 3, June 2012.
[21] Zhou, Zhurong, and Dengwu Yang. "Personalized Recommendation of Preferred Paths Based On Web Log." Journal of Software 9, no. 3, pp. 684-688, 2014.
[22] Nazneen Tarannum S.H. Rizvi and Prof. Ranjit R. Keole, "A Preliminary Review of Web-Page Recommendation Information Retrieval Using Mining", International Journal of Advance Research in Computer Science and Management.
[23] Yan Li (2008), Research on path completion technique in web usage mining in International Symposium on Computer Science and Computational Technology, IEEE, Vol. 1, Page(s): 554-559.
[24] D. Tanasa, B. Trousse (2004), Advanced Data Preprocessing for Intersites Web Usage Mining in IEEE Intelligent Systems, Vol. 19, Issue 2, Page(s): 59-65.
[25] G. Castellano, A. Fanelli, M. Torsello, LODAP: A Log Data Preprocessor for Mining Web Browsing Patterns in Proceedings of the 6th Conference on Artificial Intelligence, Knowledge Engineering and Data Bases, Page(s): 12–17.
[26] R. F. Dell (2008), Web user session reconstruction using integer programming in International Conference on Web Intelligence and Intelligent Agent Technology, IEEE/ACM/WIC, Vol. 1, Page(s): 385-388.
[27] Xiang-ying Li (2013), Data Preprocessing in Web Usage Mining in 19th International Conference on Industrial Engineering and Engineering Management, Page(s): 257-266.