
Web Usage Mining – A Review
Roshni S. Ali#, Rahila H. Sheikh*
Department of Computer Science & Engineering#*
Rajiv Gandhi College of Engineering, Research & Technology, Gondwana University,
Chandrapur, Maharashtra, India-442401
Abstract— Web Usage Mining (WUM) is an area in which the
navigational behavior of users is tracked and analyzed. Analyzing
this behavior lets a website owner recognize the usage patterns of
its users. By collecting data on user activity, the owner can
improve the quality of services to attract new customers and retain
existing ones. Web log files are used to analyze the usage patterns
of users. These web log files contain different parameters that
track each user's request. Depending on these parameters,
similarities in user access are identified, and patterns are
discovered and analyzed. In this paper we review the concepts and
process of Web Usage Mining, i.e. Pre-processing, Pattern Discovery
and Pattern Analysis.
Keywords— Web Usage Mining, Web Log File, User Access
Pattern, Pattern Discovery.
The World Wide Web (WWW) is growing rapidly, with more
information added every day [9]. It serves as a huge, widely
distributed, global information centre for news, consumer
information, financial management, advertisements, education,
government and e-commerce. It contains a rich and dynamic
collection of information about web page contents with
hypertext structures and multimedia, hyperlink information,
and access and usage information. Web mining is the
application of data mining techniques to extract useful
knowledge from the vast Web. This web data may
include Web documents, usage logs of web sites, hyperlinks
between documents, etc. [1].
Data Mining (also called knowledge discovery) is the
computational technique of discovering and identifying patterns
in large data sets, involving methods at the intersection of
artificial intelligence, database systems, machine learning, and
statistics [4]. It is the practice of analysing data from
different sources and parameters and then summarizing it into
useful information that can be used to boost revenue, cut costs,
or both. Data is analysed along many different dimensions and
categorized, and the relationships identified are summarized.
Technically, data mining is the method of finding correlations or
patterns among dozens of fields in large relational databases. On
the whole, the goal of the data mining process is to extract
information from a data set and transform it into an
understandable structure for further use.
In Data Mining terms, Web Mining can be used to perform the
major operations of interest: clustering (finding natural
groupings of users, pages, etc.), associations (which URLs tend
to be requested together), and sequential analysis (the order in
which URLs tend to be accessed). Web mining is the application
of data mining techniques to extract knowledge from Web data,
where at least one of structure (hyperlink) or usage (Web log)
data is used in the mining process (with or without other types
of Web data).
In the real world, companies, organizations and individuals
increasingly gather information through the Web and use that
information in their best interest. Web Mining is a
technique used to crawl through various web resources to
collect the required information, which helps an individual
or a company to promote business and to understand marketing
dynamics, new promotions appearing on the Web, and so on.
The rest of this paper is organized as follows. Section II
gives an overview of Web Mining and its categories.
Section III explains the concepts of Web Usage Mining, the Web
Log File and the process of Web Usage Mining. Section IV
describes related work in the area of Web Usage Mining.
Section V presents the proposed methodology, and Section VI
concludes the paper.
Web mining is an application area of Data Mining that
automatically extracts information from web
documents and mainly performs four major tasks [3]:
A. Resource finding
It involves the task of retrieving the intended web
documents. It is the procedure by which data is extracted
from online or offline text resources available on the Web.
B. Information selection and pre-processing
It involves the automatic selection and pre-processing
of specific information from the retrieved web resources. This
step transforms the originally retrieved data into
information. The transformation could be the removal of stop
words or stemming, or it may be aimed at obtaining the desired
representation, such as finding phrases in the training corpus.
C. Generalization
It automatically discovers general patterns at
individual web sites as well as across multiple sites. Data
mining and machine learning techniques are used in this task.
D. Analysis
It involves the validation and interpretation of the
mined patterns. It plays a vital role in pattern mining. Humans
play an important role in the information or
knowledge discovery process on the Web.
Web Mining is broadly divided into three categories as shown
in Fig 1:
Web Usage Mining applies concepts from chart
technology, data mining, artificial intelligence, and automated
learning techniques to user data sets and web logs [4].
Various models, such as random-walk and Markov-chain
models, are also used for statistical simulation [5].
A. Concept of Web Usage Mining
Web usage mining is a method that uses various web data
sources to uncover hidden knowledge about users and their
access patterns on the Web. Such knowledge of user access can
be used to benefit the business and lead
directly to increased profit. Web Usage Mining involves
identifying how frequently users access each page and
then finding the common traversal paths. Long and
complicated user access paths, together with low use of a web
page, show that the web site is not designed in an intuitive
manner. With the help of this analysis, one can
reorganize the web site.
Fig 1. Classification of Web Mining and Application Areas of Web Usage Mining
A. Web Content Mining
Web content mining, also referred to as text mining, is
used for scanning and mining the text, pictures and graphs of a
Web page to determine the relevance of the content to the
search query. Content mining provides the result lists to search
engines in order of highest relevance to the keywords in the
query.
B. Web Usage Mining
This technique allows Web access
information for Web pages to be collected. The collected usage data
records the paths by which Web pages are accessed. This
information is most often gathered automatically into access
logs through the Web server.
C. Web Structure Mining
It identifies the relationships between Web pages linked by
information or direct link connection. This structured data is
discovered by providing a web structure schema by
means of database techniques for Web pages. The connection
then allows a search engine to pull data relating to a search
query directly to the linking Web page from the Web site
where the content resides.
Web usage mining is the discovery of meaningful patterns from data
generated by client-server transactions on one or more Web servers
[26]. Typical sources of data are:
1) Automatically generated data stored in server access
logs, referrer logs, proxy server logs, browser logs,
agent logs, and client-side cookies.
2) E-commerce and product-oriented user events (e.g.
ad, shopping cart changes or product click-throughs,
etc.) and registration data.
3) User profiles and/or user ratings.
4) Meta-data, page content, site structure, page
5) User queries, bookmark data, mouse clicks and scrolls.
B. Web Log Format
A web server log file contains the requests made to the web
server, recorded in sequential order. Different web servers
maintain different information in the log file. The most popular
log file formats are the Common Log Format (CLF) and the
extended CLF. A Common Log Format file is generated by the
web server to keep track of the requests that occur on a web
site [26].
Some basic parameters that make up the entries of a web log file are listed below.
User Name: This determines who has visited the
web site. The user is mostly identified
through the IP address that is allotted by the Internet
Service Provider (ISP).
Visiting Path: The path followed by the user to reach
the web site. This can be by entering
the URL directly, by following a link, or through a
search engine.
Path Traversed: This specifies the path taken by
the user within the web site using the different links.
Time stamp: It records the time of each request, from which
the time spent by the user on each web page while browsing
can be derived. This time spent is
identified as the session.
Page last visited: It specifies the page that was
visited by the user before he or she leaves the web site.
Success Rate: The success rate of the web site can be
determined by the number of downloads made and
the amount of copying activity undertaken by the user.
Purchases of items or software add to the
success rate.
User Agent: It specifies the browser from which the
user sends the request to the web server. It is a
string describing the type and version of the browser
software being used.
URL: The resource accessed by the user. It may be
an HTML page, a CGI program, or a script.
Request Type: The method used for information
transfer, such as GET or POST, is noted.
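As a minimal sketch of how the parameters listed above appear in practice, the following Python snippet parses one entry written in the extended (combined) log format. The sample line, the regular expression and the function name parse_log_line are illustrative assumptions and are not taken from any of the reviewed systems.

```python
import re
from datetime import datetime

# Extended (combined) log format, as commonly written by web servers:
# host ident user [time] "method url protocol" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry

# Hypothetical sample entry (addresses and URLs are made up for illustration)
sample = ('192.0.2.10 - - [10/Oct/2013:13:55:36 +0530] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://www.example.com/start.html" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["host"], entry["method"], entry["url"], entry["status"])
```

The host field corresponds to the user identification by IP address, the referrer to the visiting path, and the agent to the user agent described above.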
C. Web Usage Mining: A Process
Web usage mining consists of three phases: pre-processing, pattern discovery, and pattern analysis. Fig 2
shows the sequence of the Web Usage Mining process.
The actions performed on the data during the pre-processing phase are described below.
1) Pre-Processing
It is the process of converting unstructured data into
useful information by applying an algorithm. Web usage
data sources must be integrated, filtered, cleaned, and
transformed, so that gaps are filled where possible, irrelevant
information is discarded, and user sessions and
transactions are identified. These sources of data are
mainly Web server log files, agent logs and other interfaces.
The data present in the log file cannot be used as it is for the
mining process [7]. Therefore the contents of the log file
should be cleaned in this pre-processing step. The unwanted
data are removed and a reduced log file is obtained.
Data cleaning: Log entries for requests that are not
relevant to the analysis, such as images, graphics and
multimedia files, are removed. Once these data
are cleaned, the size of the file is reduced to a
large extent.
Session Identification: A session is the time duration
spent on the web pages. Sessions are identified using the time
stamp details of the web pages. This can also be done
by noting the user id of those who have
visited the web page and traversed the
links of the web page.
Data conversion: This is the process of converting the
log file data into the format needed by the mining
algorithms.
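The data cleaning and session identification steps above can be sketched as follows. This is only an illustration under stated assumptions: the list of ignored file suffixes, the 30-minute inactivity threshold and the use of the IP address as the user id are choices made for this example, not values prescribed by the reviewed papers.

```python
from datetime import datetime, timedelta

# Assumed list of embedded resources that are not page views
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
# Assumed inactivity threshold that closes a session
SESSION_TIMEOUT = timedelta(minutes=30)

def clean(requests):
    """Data cleaning: drop requests for images and other embedded files."""
    return [r for r in requests
            if not r["url"].lower().endswith(IGNORED_SUFFIXES)]

def sessionize(requests):
    """Session identification: group by user and split on long time gaps."""
    sessions = []
    last_seen = {}  # user -> (time of last request, index of open session)
    for r in sorted(requests, key=lambda r: r["time"]):
        user = r["host"]
        previous = last_seen.get(user)
        if previous is None or r["time"] - previous[0] > SESSION_TIMEOUT:
            sessions.append({"user": user, "pages": []})
            index = len(sessions) - 1
        else:
            index = previous[1]
        sessions[index]["pages"].append(r["url"])
        last_seen[user] = (r["time"], index)
    return sessions

# Hypothetical, already parsed log entries
t0 = datetime(2013, 10, 10, 13, 0, 0)
log = [
    {"host": "192.0.2.10", "time": t0, "url": "/index.html"},
    {"host": "192.0.2.10", "time": t0 + timedelta(seconds=2), "url": "/logo.png"},
    {"host": "192.0.2.10", "time": t0 + timedelta(minutes=5), "url": "/products.html"},
    {"host": "192.0.2.11", "time": t0 + timedelta(minutes=6), "url": "/index.html"},
    {"host": "192.0.2.10", "time": t0 + timedelta(hours=2), "url": "/index.html"},
]
sessions = sessionize(clean(log))
for s in sessions:
    print(s["user"], s["pages"])
```

The resulting list of page sequences per session is one possible form of the converted data that mining algorithms can consume.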
2) Pattern Discovery
After the data in the log file has been converted into a
structured format, pattern discovery is performed [8]. From the
existing data of the log files, many useful patterns are identified
using user ids, session details, time-outs, etc. Pattern discovery
is the key component for analysing the pre-processed data. In this
phase, various algorithms and knowledge
discovery techniques from pattern recognition, data mining,
machine learning, etc. are applied. These include
association rules, classification, clustering, sequential
patterns and statistical analysis.
Statistical Analysis, such as median, frequency
analysis, mean, etc.
Clustering of users helps to discover groups of users
(personalized Web data).
Classification is the technique of assigning a data item
to one of several predefined classes.
Association Rules find correlations among pages
accessed together by a client.
Sequential Patterns extract frequently occurring
inter-session patterns in which the occurrence of a
set of items is followed by another item in time order.
Dependency Modeling checks whether there are any
significant dependencies among the variables in the
Web.
Fig 2. Web Usage Mining Process
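To make the association rule technique listed above concrete, the following sketch counts which pages are accessed together in the same session and derives rules between pairs of pages with their support and confidence. The function name pairwise_rules, the thresholds and the sample sessions are assumptions made for illustration; the restriction to pairs of pages is a simplification, not a full Apriori implementation.

```python
from collections import Counter
from itertools import combinations

def pairwise_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) between pairs of pages."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))            # count each page once per session
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                    # share of sessions with both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]   # P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical sessions produced by the pre-processing phase
sessions = [
    ["/index.html", "/products.html", "/cart.html"],
    ["/index.html", "/products.html"],
    ["/index.html", "/about.html"],
    ["/products.html", "/cart.html"],
]
rules = pairwise_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f} confidence={confidence:.2f}")
```

In this sample, every session that contains /cart.html also contains /products.html, so that rule has the highest confidence.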
3) Pattern Analysis
This process eliminates the irrelevant rules or patterns
that were generated and extracts the interesting rules or
patterns from the output of pattern discovery. The most
common form of pattern analysis consists of a knowledge
query mechanism such as SQL (Structured Query Language),
or of loading the usage data into a data cube to perform OLAP
(Online Analytical Processing) operations. Visualization
techniques, such as graphing patterns or assigning colors to
different values, highlight overall patterns or trends emerging
in the data. Various mechanisms used for mining these
patterns are mentioned below:
Site Filter: This technique is implemented by the
WEBMINER system. The site filter uses the site
topology to filter out rules and patterns that are not
interesting. Any rule that merely confirms direct hypertext
links among pages is filtered out [10].
mWAP (Modified Web Access Pattern): This
technique eliminates the need for
repeated reconstruction of intermediate WAP-trees
during mining and considerably reduces execution
time [11].
EXT-Prefix span: This method mines the complete
set of patterns but greatly reduces the effort of
candidate subsequence generation. The prefix-projection
process involved in this method substantially
reduces the size of the projected database [12].
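The idea behind the site filter described above can be sketched in a few lines: a discovered rule is kept only if its two pages are not already joined by a direct hyperlink in the site topology. This is a simplified illustration and not the WEBMINER implementation; the function name site_filter, the link map and the rules are hypothetical.

```python
def site_filter(rules, links):
    """Keep rules whose pages are not connected by a direct hyperlink."""
    return [
        (lhs, rhs) for lhs, rhs in rules
        if rhs not in links.get(lhs, set()) and lhs not in links.get(rhs, set())
    ]

# Hypothetical site topology: page -> pages it links to
links = {
    "/index.html": {"/products.html", "/about.html"},
    "/products.html": {"/cart.html"},
}
# Hypothetical rules from the pattern discovery phase
rules = [
    ("/index.html", "/products.html"),   # direct link, not interesting
    ("/about.html", "/cart.html"),       # no direct link, kept
    ("/cart.html", "/products.html"),    # reverse of a direct link, dropped
]
interesting = site_filter(rules, links)
print(interesting)
```

A rule such as /about.html with /cart.html is more informative to the site designer because the site structure alone does not explain why the two pages are visited together.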
In this section we look at some frameworks
that have been studied for implementing Web Usage Mining, and at
various techniques and algorithms for pattern discovery and
analysis.
In [10], the authors proposed a framework for web page
personalization based on web access. This framework follows a
three-step process. It was recommended to process
data not only from the web log but also to use the site topology
and page classification (head, content, navigation, look up,
personal) based on physical and usage characteristics;
these heuristics can then be used to identify users and
sessions. The referenced data are then transformed into transactions
which represent page preference clusters for individual users.
Data cleaned and transformed in this way is passed
to pattern discovery methods.
A hybrid approach to web usage mining, proposed in [25],
combines the compact HPG (Hypertext Probabilistic Grammar)
approach with explicit OLAP. In this model, data is
stored in a database through Quilt and XML Query. The
constraints for the analysis are built on top of this database,
and the data together with the constraints are used to model
Hypertext Probabilistic Grammars, which are then mined
with a Breadth First Search (BFS) based algorithm
for mining association rules.
The algorithm proposed in [13] is based on Maximal
Forward References. These are used for mining path
traversal patterns in an environment where documents or
objects are linked together to facilitate
interactive access on the Web. Two algorithms are devised
for determining large reference sequences. One is based on
hashing and pruning techniques, and the other is an
improvement that reduces the number of database scans.
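The notion of a maximal forward reference can be illustrated with a short sketch. It follows the commonly described idea that a forward path ends as soon as the user moves back to a page already on the current path. It is not the hashing and pruning algorithm of [13]; the function name and the sample traversal are assumptions made for this example.

```python
def maximal_forward_references(traversal):
    """Split one user's traversal into its maximal forward references."""
    path = []            # current forward path from the starting page
    results = []
    moving_forward = True
    for page in traversal:
        if page in path:
            # Backward move: the path so far is maximal if we were going forward
            if moving_forward:
                results.append(list(path))
            # Cut the path back to the revisited page
            del path[path.index(page) + 1:]
            moving_forward = False
        else:
            path.append(page)
            moving_forward = True
    if moving_forward and path:
        results.append(list(path))
    return results

# Hypothetical traversal; each letter stands for one page
traversal = list("ABCDCBEGHGWAOUOV")
references = maximal_forward_references(traversal)
print(["".join(r) for r in references])
```

Frequent traversal patterns can then be searched for among these forward paths instead of the raw click sequence, which removes the backward moves made only for navigation.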
The Markov chain algorithm proposed in [14] is based on the
Association Rule Mining technique of Web Usage Mining
and is used to make link predictions. The structural
knowledge is captured in the form of three different types of
clusters: grid clusters, hierarchical clusters and reference
clusters. The assumed Web pages and resultant Web
structures are then grouped to assist Web users in their
navigation of the Web site.
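As a rough illustration of how a Markov chain can support link prediction, the sketch below estimates first-order transition probabilities between pages from recorded sessions and predicts the most likely next page. It is a generic first-order model under assumed sample data, not the specific algorithm of [14]; the function names are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for page, nexts in counts.items()
    }

def predict_next(probabilities, page):
    """Return the most probable next page, or None if the page is unseen."""
    candidates = probabilities.get(page)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical sessions; each letter stands for one page
sessions = [["A", "B", "C"], ["A", "B", "D"], ["A", "C"], ["B", "C"]]
probabilities = transition_probabilities(sessions)
print(predict_next(probabilities, "A"), predict_next(probabilities, "B"))
```

A predicted page could, for example, be offered to the user as a recommended link on the current page.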
An improved AprioriAll algorithm for Web
log mining was proposed in [15]. It is based on Association Rule
Mining. It improves on the existing Apriori algorithm by adding the
UserID property at each step of generating the
candidate set and each step of scanning the database. This
helps to decide whether an item in the candidate set should be
put into the large set that will be used to produce the next
candidate set. It also restricts the size of the candidate set
whenever it is produced.
In [16], the authors applied the FPgrowth and Prefix Span
algorithms, based on Association Rule Mining, to Web Usage
Mining in a real business case. Maximum
Forward Path (MFP) is also used in the web usage mining
model, along with sequential pattern mining using Prefix
Span, to reduce the interference of “false visits” caused
by the browser cache and to improve the mining of frequent
traversal paths.
Self-Organizing Maps were proposed in [17]. Although they are
based on artificial neural networks, they act as a
clustering technique that is used to identify users’
navigational patterns. The work focused on the transformations
required to convert the data stored in Web server log
files into input for Self-Organizing Maps.
An algorithm based on graph partitioning is used to identify
users’ access patterns in [18]. An undirected graph is built from
the connectivity between each pair of web pages,
and weights are assigned to the edges of the
graph. This improved the quality of clustering
of users’ navigation patterns in web usage mining systems.
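A small sketch can show the general shape of such a graph-based approach. Here the edge weight between two pages is assumed to be the number of sessions in which both occur, weak edges are dropped, and the remaining connected components are taken as clusters. This thresholding step is a simplified stand-in for a real graph partitioning algorithm and is not the method of [18]; all names and data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_graph(sessions):
    """Undirected weighted graph: weight = sessions containing both pages."""
    weights = Counter()
    for pages in sessions:
        for a, b in combinations(sorted(set(pages)), 2):
            weights[(a, b)] += 1
    return weights

def cluster_pages(weights, min_weight=2):
    """Drop weak edges and return the connected components as clusters."""
    neighbours = defaultdict(set)
    for (a, b), w in weights.items():
        if w >= min_weight:
            neighbours[a].add(b)
            neighbours[b].add(a)
    clusters, seen = [], set()
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:                      # depth-first search from start
            page = stack.pop()
            if page in component:
                continue
            component.add(page)
            stack.extend(neighbours[page] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters

# Hypothetical sessions; each letter stands for one page
sessions = [["A", "B"], ["A", "B", "C"], ["C", "D"], ["C", "D"], ["B", "A"]]
weights = build_graph(sessions)
clusters = cluster_pages(weights)
print(clusters)
```

Pages that end up in the same cluster are those that users tend to visit together, which is the kind of navigation pattern the clustering aims to expose.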
Ant-based clustering, proposed in [8], is applied to pre-processed logs to extract frequent access patterns for pattern
discovery, which are then displayed in an interpretable format. It
uses a neighborhood function; after clustering, alignment
processing is applied to the sequences obtained in each
cluster to extract a representative for each cluster.
The modified k-means clustering algorithm proposed
in [19] addresses the issue of empty clusters. The problem
was considered minor and was handled by
executing the algorithm repeatedly. To
deal with large data sets, a number of different parallel
implementations of k-means were developed and
used for clustering.
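The empty cluster problem can be seen in a small k-means sketch. When a centroid attracts no points, this version re-seeds it with the point that lies farthest from its nearest centroid. That re-seeding rule is one common remedy chosen here for illustration and should not be read as the method of [19]; the feature vectors and the deliberately poor initial centroids are made up.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, initial_centroids, max_iterations=50):
    """Plain k-means that re-seeds any centroid whose cluster is empty."""
    centroids = list(initial_centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step, with re-seeding of empty clusters
        updated = []
        for members in clusters:
            if members:
                updated.append(mean(members))
            else:
                updated.append(max(
                    points,
                    key=lambda p: min(squared_distance(p, c) for c in centroids)))
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

# Hypothetical session features, e.g. (pages viewed, minutes spent)
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
# The second initial centroid is far from all data, so its cluster starts empty
centroids, clusters = kmeans(points, [(1, 1), (100, 100)])
print(clusters)
```

Without the re-seeding branch, the second cluster would stay empty for this initialization and all sessions would fall into a single group.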
A custom-built Apriori algorithm was proposed for
effective pattern analysis of web logs to find usage and
access trends [20]. The algorithm was used to identify
the rules or correlations among all the frequent item sets
of an educational log file within a reasonable execution time. The
rules (correlations) obtained from the system helped
website developers make decisions that
improved their site.
K-means combined with a Genetic Algorithm (GA), based on rough
sets, was proposed in [21] to find interval sets of clusters. The
refined initial condition allowed the iterative algorithm to
converge to a "better" local minimum. In the next step,
the authors proposed a GA-based refinement algorithm to improve
the cluster quality. The proposed algorithm was evaluated
with web access logs obtained from the Internet Traffic
Archive (ITA) and showed that refined initial starting points
and post-processing refinement of clusters lead to improved
results.
The Naive Bayesian classification algorithm proposed in [22] was
used to identify interested users. The performance of this
algorithm was measured on web log data using session-based
timing, page depth to the site length, page visits and repeated
user profiling. It showed improvements in time and memory
utilization when applied to web log files.
The learning-based k-means clustering algorithm proposed in
[23] is used to develop the learning capabilities and reduce
the computational intensity of a competitive-learning multi-layered neural network. A multi-layered network architecture
with a back-propagation learning mechanism is used to
identify and analyze useful knowledge from existing Web
log data. It uses the learning capabilities of neural networks to
classify the web traffic data mining set.
The improved k-means clustering in [24] is used to improve the
clustering patterns. Its idea is to group the data objects
through iterative clustering so as to minimize the target
function, so that each generated cluster is as compact and
as independent as possible. The k-means clustering algorithm is
based on the effective index.
Having studied different approaches in the literature survey, we
observe that several algorithms implement
clustering on web data. These clustering techniques
are found to be useful and efficient, and each enhances the Web
usage mining process in one way or another. But as the Web grows
rapidly day by day as an information gateway,
cluster sizes will also increase because more
users access it. This may result in data similarity
during clustering. We therefore propose a technique for
cluster formation and optimization that will lay a basis on
which web pages could be personalized, so that the user can easily
switch to the page where his/her requirements are fulfilled.
To give web users an easier and faster way to access
the data that matches their interests and needs, we propose a plan
that not only supports a better way of clustering but also focuses on
cluster optimization to support improved web usage
mining. This methodology follows the same sequential
phases as Web Usage Mining.
The flow of the proposed methodology is given as follows:
Tracking the user sessions with the help of the Web Log
Discovering the user access patterns using NeuroFuzzy computing
Optimizing grouped clusters using Ant-Nest mate
Generating & tracking user profiles from clusters.
User sessions are tracked and pre-processed by converting the
usage, content and structure information available in the data
sources into the data abstractions necessary for discovering
interesting navigation patterns. The interestingness criteria for
navigation patterns are dynamically specified by a human
expert. Once the navigational patterns are determined,
NEFCLASS, which is based on a neural and fuzzy approach to
clustering, is used to process the clusters [27]. This
NeuroFuzzy approach adapts to changes in users’
navigation patterns over time without losing earlier
information. The resulting groups of clusters are then
optimized using a swarm intelligence technique [28],
the Ant Nest mate approach, which then generates
user profiles that hold the data on users’ interests and needs [29].
Web Usage Mining plays a vital role in improving the
usability of website design. It emphasizes improving
customer relations, system
performance and other relevant factors. Web usage mining
supports web site design,
personalization of web servers and business
decision making. In this paper we focused on the process of web
usage mining, which involves three important tasks:
pre-processing, pattern discovery and pattern analysis. We
have also reviewed various algorithms that have been
implemented for improved web usage mining. However, as the
size of the Web and access to it increase day by day,
cluster sizes also increase. Hence we consider
optimizing these clusters, for which a proposed
plan of work is specified.
Srivastava J, Desikan P and V Kumar, "Web Mining-Concepts, Applications & Research Direction" in 2002
Srivastava J, Desikan P and V Kumar, "Web Mining Accomplishment Future Directions" in 2004 Conference.
R. Kosala, and H. Blockeel, “Web Mining Research: A
Aided Industrial design and Conceptual design, 2008
approach based on graph partitioning algorithm”, Jo
Mohsin, “Data Pre-processing on Web Server Logs for
[21] Mahdi Khosravi, Mohammad J. Tarokh, “Dynamic
R. Cooley, B. Mobasher, and J. Srivastava, “Web Mini
Preparation for Mining World Wide Web Browsing
SYSTEMS, vol. 1,1999.
