Web Usage Mining and Pattern Discovery: A Survey Paper
By
Naresh Barsagade
CSE 8331
December 8, 2003
1. Introduction
Web technology is not evolving in comfortable, incremental steps; it is turbulent, erratic, and often rather uncomfortable. It is estimated that the Internet, arguably the most important part of the new technological environment, has expanded by about 2000% and that it is doubling in size every six to ten months. In recent years, advances in computer and Web technologies and the decrease in their cost have expanded the means available to collect and store data. As an immediate consequence, the amount of information (meaningful data) stored has been increasing at a very fast pace. Traditional information analysis techniques are useful for creating informative reports from data and for confirming predefined hypotheses about the data. However, the huge volumes of data being collected create new challenges for such techniques as organizations look for ways to use the stored information to gain an edge over competitors. It is reasonable to believe that data collected over an extended period contains hidden knowledge about the business, or patterns characterizing customer profiles and behavior. With the rapid growth of the World Wide Web, knowledge discovery on the Web and the modeling and prediction of users' accesses to a Web site have become very important [GO2003].
From the administration, business, and application points of view, knowledge obtained from Web usage patterns can be directly applied to efficiently manage activities related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, Digital Libraries, and so on [AR2003]. The Web is becoming a necessity for businesses and organizations because their clients demand it. Since Web technology largely feeds on ideas and knowledge rather than depending on fixed assets, it has given birth to new companies such as Yahoo, Google, Netscape, e-Bay, e-Trade, Expedia, Amazon, and so on. With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the Web has become an important research area [JTP2002]. With the explosive growth of information sources available on the World Wide Web, it has become necessary for organizations to discover usage patterns and analyze the discovered patterns to gain an edge over competitors.
Jespersen et al [JTB2002] proposed a hybrid approach for analyzing visitor click-stream sequences. A combination of the hypertext probabilistic grammar and click fact table approaches is used to mine Web logs, and it can also be used for general sequence mining tasks. Mobasher et al [MCS1999] proposed a Web personalization system consisting of offline tasks related to the mining of usage data and an online process of automatic Web page customization based on the discovered knowledge. LOGSOM, proposed by Smith and Ng [SN2003], uses Kohonen's self-organizing map (SOM) to organize Web pages into a two-dimensional map based solely on users' navigation behavior rather than on the content of the Web pages. LumberJack, proposed by Chi et al [CRHL2002], builds up user profiles by combining clustering of user sessions with traditional statistical traffic analysis using the k-means algorithm. Joshi et al [JJYK1999] used a relational online analytical processing approach for creating a Web log warehouse using access logs and mined logs. Comprehensive overviews of Web usage mining research can be found in [SCDT2000, CMS1997, CMS1999, RWC2000].
Web mining can be divided into three areas, namely Web content mining, Web structure mining, and Web usage mining [SCDT2000]. Web content mining focuses on the discovery of information stored on the Internet. Web structure mining focuses on improving the structural design of a Web site. Web usage mining, the main topic of this paper, focuses on knowledge discovery from the usage of individual Web sites.
The average Web usage figures below [NN2003] show current usage around the globe and in the United States.
Global Internet Average Web Usage (Month of September 2003, Panel Type: Home)

                                        September       August      % Change
  Number of Sessions per Month                 22           22          1.65
  Number of Unique Domains Visited             55           54          0.89
  Page Views per Month                        901          899          0.30
  Page Views per Surfing Session               41           41          0.00
  Time Spent per Month                   11:59:20     11:50:30          1.24
  Time Spent During Surfing Session       0:32:29      0:32:37         -0.40
  Duration of a Page Viewed               0:00:48      0:00:47          0.94
  Active Internet Universe            252,672,070  253,054,814         -0.15
  Current Internet Universe Estimate  419,054,724  416,339,888          0.65

United States: Average Web Usage (Month of October 2003, Panel Type: Home)

  Sessions/Visits Per Person                        71
  Domains Visited Per Person                       103
  PC Time Per Person                          80:46:37
  Duration of a Web Page Viewed                0:01:00
  Active Digital Media Universe             47,003,165
  Current Digital Media Universe Estimate   51,012,930
The remainder of the paper is organized as follows. Section 2 describes applications of Web usage mining. Section 3 covers the basic components and terminology of Web mining, the taxonomy of Web mining, the architecture of Web usage mining, and the individual components of that architecture. Section 4 summarizes the paper and identifies several future research directions, and Section 5 contains the bibliography.
2. Applications of Web Usage Mining
Each of these applications can benefit from patterns that are ranked by subjective interestingness.
Web usage mining is used in the following areas:

• Web usage mining offers the ability to analyze massive volumes of click-stream or click-flow data, integrate the data seamlessly with transaction and demographic data from offline sources, and apply sophisticated analytics for Web personalization, e-CRM, and other interactive marketing programs.
• Personalization for a user can be achieved by keeping track of previously accessed pages. These pages can be used to identify the typical browsing behavior of a user and subsequently to predict desired pages.
• By determining frequent access behavior for users, needed links can be identified to improve the overall performance of future accesses.
• Information concerning frequently accessed pages can be used for caching.
• In addition to modifications to the linkage structure, identifying common access behaviors can be used to improve the actual design of Web pages and to make other modifications to the site.
• Web usage patterns can be used to gather business intelligence to improve customer attraction, customer retention, sales, marketing and advertising, and cross-sales.
• Mining of Web usage patterns can help in the study of how browsers are used and of users' interaction with a browser interface.
• Usage characterization can also look into navigational strategy when browsing a particular site.
• Web usage mining focuses on techniques that can predict user behavior while the user interacts with the Web.
• Web usage mining helps in improving the attractiveness of a Web site, in terms of content and structure.
• Performance and other service quality attributes are crucial to user satisfaction, and high-quality performance of a Web application is expected.
• Mining of Web usage patterns provides a key to understanding Web traffic behavior, which can be used to set policies on Web caching, network transmission, load balancing, or data distribution.
• Web usage and data mining are also useful for detecting intrusion, fraud, and attempted break-ins to the system.
• Web usage mining can be used in e-Learning, e-Business, e-Commerce, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, and Digital Libraries.
• Web usage mining can be used in Customer Relationship Management, Manufacturing and Planning, Telecommunications, and Financial Planning.
• Web usage mining can be used in the Physical Sciences, Social Sciences, Engineering, Medicine, and Biotechnology.
• Web usage mining can be used in counter-terrorism and fraud detection, and in the detection of unusual accesses to secure data.
• Web usage mining can be used to determine common behaviors or traits of users who perform certain actions, such as purchasing merchandise.
• Web usage mining can be used in usability studies to determine interface quality.
• Web usage mining can be used in network traffic analysis for determining equipment requirements and data distribution in order to efficiently handle site traffic.
3. Web Usage Mining and Pattern Discovery
Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications [CMS1997]. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. A high-level Web usage mining process is presented in Figure 1 [SCDT2000]. Cooley et al. [CMS1997] propose that the Web mining process can be divided into two main parts. The first part includes the domain-dependent processes of transforming the Web data into a suitable transaction form; this includes the preprocessing, transaction identification, and data integration components. The second part includes data mining and pattern matching techniques such as association rules and sequential patterns. In the absence of cookies or dynamically embedded session IDs in the URIs, the combination of IP address and user agent can be used as a first-pass estimate of unique users; this estimate can be refined using the referrer field as described in [CMS1999]. Some authors have proposed global architectures to handle the Web usage mining process. Cooley et al [CTS1999] proposed a site information filter, named WebSIFT, that establishes a framework for Web usage mining as shown in Figure 2. WebSIFT performs the mining in distinct stages, described below.
The WebSIFT system divides the Web usage mining process into three main parts, as shown in Figure 1. For a particular Web site, the three server logs (access, referrer, and agent, often combined into a single log), the HTML files, template files, script files, or databases that make up the site content, and any optional data such as registration data or remote agent logs provide the information used to construct the different information abstractions. The preprocessing phase uses the input data to construct a server session file based on the methods and heuristics discussed in [CMS1999]. In order to preprocess a server log, the log must first be "cleaned", which consists of removing unsuccessful requests, parsing relevant CGI name/value pairs, and rolling up file accesses into page views. Once the log is converted into a list of page views, users must be identified; in the absence of cookies or dynamically embedded session IDs in the URIs, the combination of IP address and user agent can be used as a first-pass estimate of unique users. The preprocessing phase also allows the option of converting the server sessions into episodes prior to performing knowledge discovery.
Figure 2: A General Architecture for Web Usage Mining
In this case, episodes are either all of the page views in a server session that the user spent a significant amount of time viewing, or all of the navigation page views leading up to each content page view. The details of how a cutoff time is determined for classifying a page view as content or navigation are also contained in [CMS1999]. The click-stream or click-flow for each user is divided into sessions based on a simple thirty-minute timeout. The notion of what makes discovered knowledge interesting has been addressed in [PT1998], and a survey of methods that have been used to characterize the interestingness of discovered patterns is given in [HH1999]. The four dimensions used by [HH1999] to classify interestingness measures are pattern-form, representation, scope, and class. Pattern-form defines what type of patterns a measure is applicable to, such as association rules or classification rules. The representation dimension defines the nature of the framework, such as probabilistic or logical. Scope is a binary dimension that indicates whether the measure applies to a single pattern or to the entire discovered set. The final dimension, class, is also binary and can be labeled as subjective or objective.
Preprocessing for the content and structure of a site involves assembling each page view for parsing and/or analysis. Page views are accessed through HTTP requests by a "site crawler" that assembles the components of the page view; this handles both static and dynamic content. In addition to being used to derive a site topology, the site files are used to classify the pages of a site. Both the site topology and the page classification can then be fed into the information filter. The knowledge discovery phase uses existing data mining techniques to generate rules and patterns. Included in this phase is the generation of general usage statistics, such as the number of "hits" per page, the most frequently accessed pages, the most common starting page, and the average time spent on each page.
WebSIFT performs the mining in distinct stages. The first stage is preprocessing, in which user sessions are inferred from log data. The second searches for patterns in the data by making use of standard data mining techniques, such as association rules or mining for sequential patterns. In the third stage, an information filter based on domain knowledge and the Web site structure is applied to the mined patterns in search of interesting patterns. Links between pages and the similarity between the contents of pages provide evidence that the pages are related. This information is used to identify interesting patterns; for example, itemsets that contain pages not directly connected are declared interesting. In Mobasher et al [MCS1999] the authors propose to group the itemsets obtained by the mining stage into clusters of URL references. These clusters are aimed at real-time Web page personalization. A hypergraph is inferred from the mined itemsets, where the nodes correspond to pages and the hyperedges connect pages in an itemset. The weight of a hyperedge is given by the confidence of the rules involved. The graph is subsequently partitioned into clusters, and an active user session is matched against these clusters. For each URL in the matching clusters a recommendation score is computed, and the recommendation set is composed of all the URLs whose recommendation score is above a specified threshold.
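As a rough, simplified illustration of this idea (not the actual ARHP procedure of [MCS1999]), the following Python sketch assumes that URL clusters have already been mined and scores each candidate URL by how strongly the active session overlaps the cluster containing it; the function name, the toy clusters, and the 0.5 threshold are all illustrative assumptions.

# Sketch: cluster-based recommendation (simplified; not the exact ARHP method of [MCS1999]).
# Assumes URL clusters have already been mined; scores candidate URLs by how strongly
# the active session overlaps the cluster they belong to.

def recommend(active_session, url_clusters, threshold=0.5):
    """Return URLs whose recommendation score exceeds `threshold`.

    active_session: set of URLs visited so far in the current session.
    url_clusters:   list of sets, each set being one mined URL cluster.
    """
    scores = {}
    for cluster in url_clusters:
        # Match score: fraction of the cluster already visited in this session.
        overlap = len(active_session & cluster) / len(cluster)
        for url in cluster - active_session:        # only recommend unseen URLs
            scores[url] = max(scores.get(url, 0.0), overlap)
    return {url: s for url, s in scores.items() if s >= threshold}

clusters = [{"/home", "/products", "/products/music"},
            {"/home", "/support", "/faq"}]
print(recommend({"/home", "/products"}, clusters))
# {'/products/music': 0.666...}

Any URL whose score clears the threshold would be added to the recommendation set for the current request.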
In Buchner et al. [BBAMH1999] a new approach, in the form of a process, is proposed for finding marketing intelligence from Internet data. An n-dimensional Web log data cube is created to store the collected data, and domain knowledge is incorporated into the data cube in order to reduce the pattern search space. They proposed an algorithm to extract navigation patterns from the data cube. The patterns conform to pre-specified navigation templates, whose use enables the analyst to express knowledge about the field and to guide the mining process. This model does not store the log data in compact form, which can be a major drawback when handling very large daily log files.
Information on how customers are using a Web site is critical for marketers of electronic commerce businesses. Büchner and Mulvenna [BM1998] presented a knowledge discovery process for discovering marketing intelligence from Web data. They define a Web log data hypercube that consolidates Web usage data along with marketing data for electronic commerce applications. Four distinct steps are identified in the customer relationship life cycle that can be supported by their knowledge discovery techniques: customer attraction, customer retention, cross-sales, and customer departure.
Masseglia et al [MPC1999] proposed an integrated tool for mining access patterns and association rules from log files. The techniques implemented pay particular attention to the handling of time constraints, such as the minimum and maximum time gap between adjacent requests in a pattern. The system provides a real-time generator of dynamic links, which aims at automatically modifying the hypertext organization when a user's navigation matches a previously mined rule.
Fundamental methods of data cleaning and preparation have been well studied by Srivastava et al [SCDT2000]. The main techniques traditionally used for modeling usage patterns in a Web site are collaborative filtering (CF), clustering of pages or user sessions, association rule generation, sequential pattern generation, and Markov models. The prediction step is the real-time processing of the model, which considers the active user session and makes recommendations based on the discovered patterns. The time spent on a page is a good measure of the user's interest in that page, providing an implicit rating for it [GO2003]: if a user is interested in the content of a page, she will likely spend more time there compared to the other pages in her session. Gunduz and Ozsu [GO2003] presented a model that uses both the sequence of visited pages and the time spent on those pages, which reflects the structural information of the user session and handles this two-dimensional information.
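As a small illustration of how such an implicit rating can be derived (a simplification, not the model of [GO2003]), the sketch below estimates the time spent on each page in a session as the gap between consecutive request timestamps; since the last page has no following request, a default dwell time is assumed for it.

# Sketch: estimating time spent per page from request timestamps.
# Not the model of [GO2003]; the default dwell time for the last page is an assumption.
from datetime import datetime

def dwell_times(session, default_last=30.0):
    """session: list of (url, timestamp) pairs ordered by time.
    Returns a list of (url, seconds_spent) pairs."""
    times = []
    for (url, t), (_, t_next) in zip(session, session[1:]):
        times.append((url, (t_next - t).total_seconds()))
    if session:                                   # last page: no next request to diff against
        times.append((session[-1][0], default_last))
    return times

session = [("/home",     datetime(2003, 9, 1, 10, 0, 0)),
           ("/products", datetime(2003, 9, 1, 10, 0, 45)),
           ("/checkout", datetime(2003, 9, 1, 10, 3, 5))]
print(dwell_times(session))
# [('/home', 45.0), ('/products', 140.0), ('/checkout', 30.0)]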
Data preprocessing consists of data filtering, user identification, session/transaction identification, and topology extraction. Data filtering removes noise, i.e., unsuccessful requests, automatically downloaded graphics, and requests from robots, in order to obtain more compact training data. Heuristic rules based on IP addresses, cookies, and similar information are typically used to identify users. Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.
Usage preprocessing: Usage preprocessing works on the usage data recorded by the Web server, such as IP addresses, page references, and the date and time of accesses [SCDT2000]. Typically, the usage data come from an Extended Common Log Format (ECLF) server log [RWC2000].
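To make these preprocessing steps concrete, here is a minimal sketch that parses Common Log Format lines with a regular expression, drops unsuccessful requests and embedded graphics, rolls CGI parameters into the base page view, and groups the remaining page views into per-IP sessions using the thirty-minute timeout mentioned earlier. The regular expression, the file-extension filter, and the use of the IP address alone as the user key are simplifying assumptions rather than a complete ECLF preprocessor.

# Sketch: cleaning a Common Log Format (CLF) log and sessionizing by IP with a
# thirty-minute timeout. The regex and filters are simplifying assumptions.
import re
from datetime import datetime, timedelta

CLF = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+')
IGNORE = (".gif", ".jpg", ".png", ".css", ".js")
TIMEOUT = timedelta(minutes=30)

def clean(lines):
    """Yield (ip, timestamp, url) for successful page-view requests."""
    for line in lines:
        m = CLF.match(line)
        if not m:
            continue
        ip, ts, url, status = m.groups()
        if status != "200" or url.lower().endswith(IGNORE):
            continue                       # drop failed requests and embedded graphics
        when = datetime.strptime(ts.split()[0], "%d/%b/%Y:%H:%M:%S")
        yield ip, when, url.split("?")[0]  # roll CGI parameters into the base page view

def sessionize(records):
    """Group cleaned records into sessions: same IP, gaps under thirty minutes."""
    sessions, last_seen = {}, {}
    for ip, when, url in sorted(records, key=lambda r: r[1]):
        if ip not in sessions or when - last_seen[ip] > TIMEOUT:
            sessions.setdefault(ip, []).append([])   # start a new session for this IP
        sessions[ip][-1].append(url)
        last_seen[ip] = when
    return sessions

log = [
    '1.2.3.4 - - [01/Sep/2003:10:00:00 -0600] "GET /home HTTP/1.0" 200 1043',
    '1.2.3.4 - - [01/Sep/2003:10:00:02 -0600] "GET /logo.gif HTTP/1.0" 200 512',
    '1.2.3.4 - - [01/Sep/2003:11:05:00 -0600] "GET /products?id=7 HTTP/1.0" 200 2048',
]
print(sessionize(clean(log)))
# {'1.2.3.4': [['/home'], ['/products']]}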
Content Preprocessing: Content preprocessing consists of converting the text,
images, scripts, and multimedia data into forms that are useful for the web usage
mining process. Often this consists of performing content mining such as classification
or clustering. In the context of web usage mining, the content of Web sites can be used
to filter the input to the pattern discovery algorithms [SCDT2000].
Structure Preprocessing: Web structure mining analyses the link structure of the Web in order to identify relevant documents [SCDT2000]. The structure of a site is created by the hypertext links between page views. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page, while the principal kind of inter-page structure information is the hyperlinks connecting one page to another. The Google search engine [GOO] makes use of the Web link structure when determining the relevance of a page; it achieves good results because, while keyword similarity analysis ensures high precision, the probability measure derived from the link structure ensures high quality of the pages returned.
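As an illustration of ranking pages by link structure (in the spirit of PageRank, not Google's actual implementation or parameters), the sketch below runs a simple power iteration over a toy link graph; the damping factor and the example site graph are assumptions.

# Sketch: a PageRank-style ranking over a toy link graph (illustrative only;
# not Google's actual algorithm or parameters). Damping factor 0.85 is an assumption.

def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                       # dangling page: spread rank evenly
                for p in pages:
                    new[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:
                    new[target] += damping * rank[page] / len(outlinks)
        rank = new
    return rank

site = {"/home": ["/products", "/about"],
        "/products": ["/home"],
        "/about": ["/home"]}
print(sorted(pagerank(site).items(), key=lambda kv: -kv[1]))
# '/home' receives the highest rank because both other pages link to it.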
The information provided by the data sources listed above can be used to construct a data model consisting of several data abstractions, notably users, page views, click-streams, server sessions, and episodes [RWC2000]. A page view is defined as all of the files that contribute to the client-side presentation seen as the result of a single mouse "click" by a user. A click-stream is then the sequence of page views that are accessed by a user, and a server session is the click-stream for a single visit of a user to a Web site. Finally, an episode is a subset of page views from a server session. Data can be collected at the server level, client level, or proxy level, or obtained from an organization's database. Each type of data collection differs not only in the location of the data source, but also in the kinds of data available, the segment of the population from which the data was collected, and its method of implementation.
The usage data collected at the different sources (server level, client level, and proxy level) represent the navigation patterns of different segments of the overall Web traffic [SCDT2000].
Server-level Collection: A Web server log records the browsing behavior of site
visitors [SCDT2000]. The data recorded in server logs reflect the concurrent and
interleaved access of a Web site by multiple users. These log files can be stored in
various formats such as Common Log Format (CLF) or Extended Common Log Format
(ECLF). ECLF contains client IP address, User ID, time/date, request, status, bytes,
referrer, and agent. Tracking of individual users is not an easy task due to the stateless
connection model of the HTTP protocol. In order to handle this problem, Web servers
can also store other kind of usage information such as cookies in separate logs, or
appended to the CLF or ECLF logs. Cookies are tokens generated by the Web server for
individual client browsers in order to automatically track the site visitors. Packet sniffing
technology (also referred to as “network monitors”) is an alternative method for
collecting usage data through server logs. Packet sniffers monitor network traffic coming
to a Web server and extract usage data directly from TCP/IP packets. Besides usage
data, the server side log also provides access to the “site files”, e.g. content data,
structure information, local databases, and Web page meta-information such as the size
of a file and its last modified time.
Client-level Collection: Client-side collection can be implemented by using a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities [SCDT2000].

Proxy-level Collection: The Internet Service Provider (ISP) machine that users connect to through a modem is a common form of proxy server. A Web proxy acts as an intermediary between client browsers and Web servers. Proxy-level caching can be used to reduce the loading time of a Web page experienced by users, as well as the network traffic load at the server and client sides.
Pattern Discovery: Pattern discovery uses methods and algorithms developed from
several fields such as statistics, data mining, machine learning and pattern recognition
[SCDT2000]. Zaiane et al. [ZXH1998] proposed the use of On-Line Analytical
Processing (OLAP) technology in web usage mining. OLAP and the data cube structure
offer a highly interactive and powerful data retrieval and analysis environment. The
knowledge that can be discovered is represented in the form of rules, tables, charts,
graphs, and other visual presentation forms for characterizing, comparing, predicting, or
classifying data from the Web access log. Visualization can also be used in Web usage mining, presenting the data in a way that users can understand more easily.
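As a very small illustration of the data cube idea (not the system of [ZXH1998]), cleaned log records can be aggregated along dimensions such as page and date; the sketch below uses a pandas pivot table, and the column names and sample data are assumptions.

# Sketch: aggregating page views along (page, date) dimensions, loosely analogous to
# a web log data cube slice. Column names are assumptions; this is not the system of [ZXH1998].
import pandas as pd

log = pd.DataFrame({
    "page": ["/home", "/home", "/products", "/home", "/products"],
    "date": ["2003-09-01", "2003-09-01", "2003-09-01", "2003-09-02", "2003-09-02"],
    "user": ["u1", "u2", "u1", "u3", "u3"],
})

# One "cube" view: hit counts per page per day; another: distinct users per page.
hits = log.pivot_table(index="page", columns="date", values="user",
                       aggfunc="count", fill_value=0)
unique_users = log.groupby("page")["user"].nunique()
print(hits)
print(unique_users)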
Statistical Analysis: Statistical techniques are the most common method of extracting knowledge about the visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time, and the length of a navigational path. For example, e-Trade developed a German-language website for Germany and scrapped it because German users were visiting the English site rather than the German one. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, the average viewing time of a page, or the average length of a path through a site. This type of knowledge can be potentially useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions. There are many commercial tools available for statistical analysis.
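A minimal sketch of such descriptive statistics, assuming sessions have already been reconstructed as lists of page views, might look as follows; the sample sessions are illustrative.

# Sketch: simple descriptive statistics over sessionized page views
# (assumes sessions are already reconstructed as lists of URLs).
from collections import Counter
from statistics import mean, median

sessions = [["/home", "/products", "/checkout"],
            ["/home", "/faq"],
            ["/products", "/products/music", "/checkout", "/home"]]

page_frequency = Counter(url for s in sessions for url in s)
path_lengths = [len(s) for s in sessions]

print("Most frequently accessed pages:", page_frequency.most_common(3))
print("Mean path length:", mean(path_lengths))
print("Median path length:", median(path_lengths))
print("Most common starting page:", Counter(s[0] for s in sessions).most_common(1))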
Association Rules: Association rule generation can be used to relate pages that are most often referenced together in a single server session [SCDT2000]. In the context of Web usage mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. Association rule mining has been well studied in data mining, especially for market basket transaction data analysis, and many association rule algorithms have been used, such as Apriori and Partition [MHD2003]. Aside from being applicable to e-Commerce, business intelligence, and marketing applications, association rules can help Web designers restructure their Web sites; however, results about the usefulness of such rules in supermarket transactions or in Web applications have not been reported. Constraints are often imposed on the mining process, and the extracted rules are pruned. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site. In electronic CRM, an existing customer can be retained by dynamically creating Web offers based on associations with threshold support and/or confidence values [BM1998].
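A hand-rolled sketch of the idea, restricted to pairs of pages, is shown below: it counts how often two pages co-occur in a session and keeps the pairs whose support and confidence exceed assumed thresholds. A real miner would run a full Apriori-style algorithm over itemsets of arbitrary size.

# Sketch: association rules between pairs of pages from server sessions.
# Thresholds and data are illustrative assumptions; a real miner (e.g. full Apriori)
# would handle itemsets of arbitrary size.
from itertools import combinations
from collections import Counter

def pairwise_rules(sessions, min_support=0.4, min_confidence=0.6):
    n = len(sessions)
    item_count, pair_count = Counter(), Counter()
    for session in sessions:
        pages = set(session)
        item_count.update(pages)
        pair_count.update(frozenset(p) for p in combinations(sorted(pages), 2))
    rules = []
    for pair, count in pair_count.items():
        if count / n < min_support:
            continue
        a, b = tuple(pair)
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, count / n, confidence))
    return rules          # (antecedent, consequent, support, confidence)

sessions = [["/home", "/products", "/checkout"],
            ["/home", "/products"],
            ["/home", "/faq"],
            ["/products", "/checkout"]]
for lhs, rhs, sup, conf in pairwise_rules(sessions):
    print(f"{lhs} => {rhs}  support={sup:.2f}  confidence={conf:.2f}")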
Clustering: Clustering is a technique for grouping together a set of items having similar characteristics [SCDT2000]. Clustering can be performed on either the users or the page views. Cluster analysis in Web usage mining aims to find clusters of users, pages, or sessions from the Web log file, where each cluster represents a group of objects with common interests or characteristics. User clustering is designed to find user groups that have common interests based on their behaviors, and it is critical for user community construction. Page clustering is the process of clustering pages according to the users' accesses to them. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in e-Commerce applications or to provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content; this information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs. The intuition is that if the probability of visiting one page, given that another page has also been visited, is high, then the two pages can perhaps be grouped into one cluster. For session clustering, all the sessions are processed to find interesting session clusters; each session cluster may correspond to one interesting topic within the Web site. Mobasher et al [MCS1999] generated recommendations from URL clusters to build an adaptive Web site by using ARHP (Association Rule Hypergraph Partitioning).
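As a concrete sketch of session clustering (a generic k-means over binary page-visit vectors, not ARHP [MCS1999] or the model-based method discussed below), sessions can be encoded as 0/1 vectors over the site's pages and grouped with scikit-learn; the number of clusters and the sample sessions are assumptions.

# Sketch: clustering user sessions with k-means over binary page-visit vectors.
# This is a generic illustration, not ARHP [MCS1999] or model-based clustering [CHMSW2003];
# the number of clusters (2) is an assumption.
import numpy as np
from sklearn.cluster import KMeans

sessions = [["/home", "/products", "/checkout"],
            ["/products", "/checkout"],
            ["/home", "/faq", "/support"],
            ["/faq", "/support"]]

pages = sorted({url for s in sessions for url in s})
X = np.array([[1 if p in s else 0 for p in pages] for s in sessions])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for session, label in zip(sessions, labels):
    print(label, session)
# Sessions sharing pages (e.g. the two "shopping" sessions) tend to land in the same cluster.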
Abraham and Ramos [AR2003] proposed an ant-clustering algorithm to discover Web usage patterns and a linear genetic programming approach to analyze visitor trends. They proposed a hybrid framework, which uses an ant colony optimization algorithm to cluster Web usage patterns. The raw data from the log files are cleaned and preprocessed, and the ACLUSTER algorithm is used to identify the usage patterns. The resulting clusters of data are fed to a linear genetic programming model to analyze the usage trends.
WebCANVAS (Web Clustering Analysis and VisuAlization of Sequences) [CHMSW2003] presented a new methodology for exploring and analyzing navigation patterns on a Web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In this approach, site users are first partitioned into clusters such that users with similar navigation paths through the site are placed into the same cluster. The clustering approach employed is model-based (as opposed to distance-based) and partitions users according to the order in which they request Web pages. Another feature of this use of model-based clustering is that learning time scales linearly with sample size, whereas agglomerative distance-based methods scale quadratically with sample size.
The purpose of knowledge discovery from user profiles is to find clusters of similar interests among the users [SZAS1997]. If a site is well designed, there will be a strong correlation between the similarity of navigation paths and the similarity of the users' interests, so clustering of the former can be used to cluster the latter. The definition of similarity is application dependent. The authors provide an overview of a powerful path clustering method called path mining, which is suitable for knowledge discovery in databases with a partial ordering on their data. In this method, first a general path feature space is characterized. Then a similarity measure among the paths over the feature space is introduced. Finally, this similarity measure is used for clustering. They implemented the path-mining algorithm to cluster the navigation paths detected by the profiler. The algorithm produces a scalar number as the similarity among paths, and these similarity values can be fed to standard data-mining algorithms to cluster the users' interests.
Classification: Classification is the task of mapping a data item into one of several predefined classes [SCDT2000]. In Internet marketing, a customer can be classified as 'no customer', 'one-time visitor', or 'regular visitor' based on browsing patterns, and rules can be discovered for attracting customers by displaying special offers [BM1998]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires the extraction and selection of features that best describe the properties of a given class or category. Classification can be done using supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, and so on. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the west coast. Classification algorithms such as C4.5, CART, BAYES, and RIPPER can be used to predict whether a page is of interest to the user.
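As a toy sketch of supervised classification in this setting (with assumed features, labels, and data, and with scikit-learn's decision tree standing in for C4.5, CART, or RIPPER), a model can be trained to predict whether a visitor is likely to place an order from simple session features.

# Sketch: classifying visitors from simple session features with a decision tree.
# Features, labels, and data are assumptions; scikit-learn's tree stands in for
# algorithms such as C4.5, CART, or RIPPER mentioned above.
from sklearn.tree import DecisionTreeClassifier

# Features per session: [pages viewed, minutes on site, visited /Product/Music (0/1)]
X = [[12, 15, 1], [3, 2, 0], [8, 10, 1], [2, 1, 0], [15, 20, 1], [4, 3, 0]]
y = [1, 0, 1, 0, 1, 0]        # 1 = placed an online order, 0 = did not

model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(model.predict([[10, 12, 1], [3, 2, 0]]))   # predicted order/no-order for new sessions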
Sequential Patterns: The technique of sequential pattern discovery attempts to find
inter-session patterns such that the presence of a set of items is followed by another
item in a time-ordered set of sessions or episodes [SCDT2000]. A new algorithm MiDAS
(Mining Internet data for Associative Sequences) for discovering sequential patterns
from web log files has been proposed that provides behavioral marketing intelligence for
e-commerce scenarios [BBAMH1999]. MiDAS contains three phases: (1) the a priori phase prepares the input data through data reduction and data type substitution; (2) the discovery phase discovers the sequences of hits and generates the pattern tree; and (3) the a posteriori phase filters out all sequences that do not fulfill the criteria laid down in the specified navigation templates and topology network, and also performs pruning. By using this approach, Web marketers can predict future visit patterns, which
will be helpful in placing advertisements aimed at certain user groups. Other types of
temporal analysis that can be performed on sequential patterns include trend analysis,
change point detection, or similarity analysis.
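A minimal sketch of the underlying idea (counting frequent contiguous page subsequences across sessions, rather than the MiDAS algorithm itself) is shown below; the window length of two and the support threshold are assumptions.

# Sketch: counting frequent contiguous page subsequences across sessions.
# This illustrates the idea of sequential patterns only; it is not MiDAS [BBAMH1999],
# and the length-2 window and support threshold are assumptions.
from collections import Counter

def frequent_subsequences(sessions, length=2, min_support=0.5):
    counts = Counter()
    for session in sessions:
        seen = {tuple(session[i:i + length]) for i in range(len(session) - length + 1)}
        counts.update(seen)          # count each subsequence once per session
    n = len(sessions)
    return {seq: c / n for seq, c in counts.items() if c / n >= min_support}

sessions = [["/home", "/products", "/checkout"],
            ["/home", "/products", "/faq"],
            ["/products", "/checkout"],
            ["/home", "/faq"]]
print(frequent_subsequences(sessions))
# {('/home', '/products'): 0.5, ('/products', '/checkout'): 0.5}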
Oyanagi et al [OKN2002] explore issues in methods for mining sequences from WWW access logs. The Apriori algorithm is well known as a typical algorithm for sequential pattern mining; however, it suffers from inherent difficulties in finding long sequential patterns and in finding interesting patterns among a huge number of results. Their paper proposes a new method for finding sequential patterns by matrix clustering. This method decomposes a sequence into a set of sequence elements, each of which corresponds to an ordered pair of items. Matrix clustering is then applied to extract a cluster of similar sequences, and the resulting sequence elements are composed into a graph. The Web Utilization Miner (WUM) [SS1998] uses an efficient data structure called an Aggregated Tree to store user sessions and provides a query language to extract interesting patterns from the aggregated session data. WUM employs an innovative technique for the discovery of navigation patterns over an aggregated materialized view of the Web log. After performing the classical preparation steps (i.e., user and session identification), the user sessions are merged into an Aggregated Tree, which is a tree constructed by merging trails with the same prefix. WUM provides a query language called MINT that lets users specify queries concerning the content, structure, and statistics of navigation patterns; MINT supports the specification of criteria of a statistical, structural, and textual nature. The WEBMINER tool [CMS1997] provides a query language on top of external mining software for association rules and sequential patterns.
Dependency modeling: Dependency modeling is another useful pattern discovery task in Web mining [SCDT2000]. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested in building a model representing the different stages a visitor undergoes while shopping in an online store, based on the actions chosen (i.e., from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users, including Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is also potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience for users.
Borges and Levene [BL1999] proposed a formal data mining model, the Hypertext Probabilistic Grammar (HPG), to capture user Web navigation patterns. User sessions are represented as an HPG whose higher-probability strings correspond to the navigation trails preferred by the users. The HPG is a Markov model, which assumes that the probability of a link being chosen depends on the contents of the page being viewed rather than on all the previous history of the session [LL1999]. Note that this assumption can be weakened by making use of the N-gram concept or dynamic Markov chain techniques. There are situations in which the Markov assumption is realistic, such as an online newspaper where a user chooses which article to read in the sports section independently of the contents of the front page. However, there are also cases in which such an assumption is not very realistic, such as an online tutorial providing a sequence of pages explaining how to perform a given task.
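To make the first-order Markov assumption concrete, the following sketch estimates transition probabilities between pages from session sequences and predicts the most likely next page. It is a plain first-order Markov chain for illustration, not the full HPG model of [BL1999], and the sessions are made up.

# Sketch: a first-order Markov model of page transitions estimated from sessions.
# This is a plain Markov chain for illustration, not the full HPG of [BL1999].
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    counts = defaultdict(Counter)
    for session in sessions:
        for current, nxt in zip(session, session[1:]):
            counts[current][nxt] += 1
    return {page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
            for page, nexts in counts.items()}

def predict_next(probs, page):
    """Most likely next page after `page`, or None if it was never a source."""
    options = probs.get(page)
    return max(options, key=options.get) if options else None

sessions = [["/home", "/products", "/checkout"],
            ["/home", "/products", "/faq"],
            ["/home", "/faq"]]
probs = transition_probabilities(sessions)
print(probs["/home"])                 # {'/products': 0.666..., '/faq': 0.333...}
print(predict_next(probs, "/home"))   # '/products'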
Deviation/Outlier Detection: Deviation or outlier detection comprises techniques aimed at detecting unusual changes in the data relative to the expected values. Such techniques are useful, for example, in fraud detection, where inconsistent use of a credit card can identify situations in which the card has been stolen. Inconsistent use of a credit card could be noticed if transactions were performed in different geographic locations within a given time window.
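As a tiny illustration of the general idea (a z-score check on per-client request counts, not a production fraud or intrusion detection method), unusually active clients can be flagged as follows; the threshold and the sample counts are assumptions.

# Sketch: flagging unusually active clients by z-score on request counts.
# The 2.5-sigma threshold and data are assumptions; real intrusion/fraud detection is richer.
from statistics import mean, stdev

def flag_outliers(request_counts, threshold=2.5):
    values = list(request_counts.values())
    mu, sigma = mean(values), stdev(values)
    return {client: count for client, count in request_counts.items()
            if sigma > 0 and abs(count - mu) / sigma > threshold}

counts = {"10.0.0.1": 42, "10.0.0.2": 38, "10.0.0.3": 51, "10.0.0.4": 47,
          "10.0.0.5": 44, "10.0.0.6": 40, "10.0.0.7": 55, "10.0.0.8": 39,
          "10.0.0.9": 49, "10.0.0.66": 4100}
print(flag_outliers(counts))   # only the client with 4100 requests is flagged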
Pattern analysis: Pattern analysis is the last step in the overall Web Usage mining
process as described in Figure 2. The motivation behind pattern analysis is to filter out
uninteresting rules or patterns from the set found in the pattern discovery phase. The
exact analysis methodology is usually governed by the application for which Web mining
is done. The most common form of pattern analysis consists of a knowledge query
mechanism such as SQL. Another method is to load usage data into a data cube in order
to perform OLAP operations. Visualization techniques, such as graphing patterns or
assigning colors to different values, can often highlight overall patterns or trends in the
data. Content and structure information can be used to filter out patterns containing
pages of a certain usage type, content type, or pages that match a certain hyperlink
structure.
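As a small illustration of such a query mechanism (the table schema, sample rules, and thresholds are assumptions), discovered association rules can be loaded into a relational table and filtered with SQL:

# Sketch: filtering discovered rules with SQL. The schema, data, and thresholds are
# assumptions; the point is only the "knowledge query mechanism" idea.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rules (antecedent TEXT, consequent TEXT, "
             "support REAL, confidence REAL)")
conn.executemany("INSERT INTO rules VALUES (?, ?, ?, ?)", [
    ("/home",     "/products",       0.50, 0.67),
    ("/products", "/checkout",       0.50, 0.67),
    ("/home",     "/faq",            0.25, 0.33),
    ("/products", "/products/music", 0.40, 0.80),
])

# Keep only rules that are strong enough and whose consequent is a product page.
query = ("SELECT antecedent, consequent, confidence FROM rules "
         "WHERE support >= 0.4 AND confidence >= 0.6 AND consequent LIKE '/products%' "
         "ORDER BY confidence DESC")
for row in conn.execute(query):
    print(row)
conn.close()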
4. Summary and Future Research Directions
This paper has attempted to provide an up-to-date survey of the rapidly growing area of Web usage mining, an area for which current technology creates strong demand. A general overview of Web usage mining was presented in the introduction. Web usage mining is used in many areas, such as e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government, Digital Libraries, advertising, marketing, bioinformatics, and so on. The major classes of recommendation services are based on the discovery of navigational patterns of users, and the main techniques for pattern discovery are sequential patterns, association rules, classification, clustering, and path analysis. Web usage mining's basic components, the taxonomy of Web mining, the architecture of Web usage mining, the individual components of that architecture, and detailed research in this area by researchers such as Jaideep Srivastava, Bamshad Mobasher, Robert Cooley, Cyrus Shahabi, Ming-Syan Chen, and A.G. Büchner were described in Section 3.
With the growth of Web-based applications, specifically e-commerce, there is significant interest in analyzing Web usage data. As the Web mining area is growing fast, there is a lot of demand for Web usage mining, and there is a need to develop a common framework, much like J2EE and .NET. The Cross Industry Standard Process for Data Mining (CRISP-DM) project has developed an industry- and tool-neutral process model for data mining [CRISP-DM]; a similar process model or framework needs to be developed for Web usage mining to create interest among new researchers, business strategists, and developers. We need a systematic Web-site design methodology for creating new Web pages, or modifying existing ones, such that different users' navigation patterns can be better mapped to answers to a set of specific questions. There is a need to develop tools which incorporate statistical methods, visualization, and human factors to help better understand the mined knowledge. Since the output of knowledge mining algorithms is often not in a form suitable for direct human consumption, there is a need to develop techniques and tools for helping an analyst better assimilate it. One of the open issues in data mining in general, and Web mining in particular, is the creation of intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these tools need to have specific knowledge about the particular problem domain to do any more than filtering based on statistical attributes of the discovered rules or patterns. More research needs to be done in e-Commerce, bioinformatics, computer security, Web intelligence, intelligent learning, database systems, finance, marketing, healthcare, and telecommunications using Web usage mining.
5. Bibliography
[AR2003]. Ajit Abraham, Vitorino Ramos, Web Usage Mining Using Artificial Ant Colony Clustering and Linear Genetic Programming, in CEC'03 - Congress on Evolutionary Computation, IEEE Press, Canberra, Australia, 8-12 Dec. 2003.

[BBAMH1999]. A.G. Büchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J.G. Hughes, Navigation Pattern Discovery from Internet Data, in WEBKDD, San Diego, CA, 1999.

[BL1999]. José Borges, Mark Levene, Data Mining of User Navigation Patterns, WEBKDD, 1999.

[BM1998]. A.G. Büchner, M.D. Mulvenna, Discovering Internet Marketing Intelligence through Online Analytical Web Usage Mining, ACM SIGMOD Record, Vol. 27, No. 4, pp. 54-61, 1998.

[CHMSW2003]. I.V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White, Model-Based Clustering and Visualization of Navigation Patterns on a Web Site, Journal of Data Mining and Knowledge Discovery, 7(4), 2003 (extended version of an ACM SIGKDD 2000 conference paper).

[CMS1997]. Robert Cooley, Bamshad Mobasher, Jaideep Srivastava, Web Mining: Information and Pattern Discovery on the World Wide Web, in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'97), November 1997.

[CMS1999]. Robert Cooley, Bamshad Mobasher, Jaideep Srivastava, Data Preparation for Mining World Wide Web Browsing Patterns, Knowledge and Information Systems 1(1), 1999.

[CRHL2002]. Chi E.H., Rosien A., Heer J., LumberJack: Intelligent Discovery and Analysis of Web User Traffic Composition, in Proceedings of the ACM SIGKDD Workshop on Web Mining for Usage Patterns and User Profiles, Canada, ACM Press, 2002.

[CRISP-DM]. http://www.crisp-dm.org

[CTS1999]. Robert Cooley, Pang-Ning Tan, Jaideep Srivastava, WebSIFT: The Web Site Information Filter System, in Proceedings of the Web Usage Analysis and User Profiling Workshop, August 1999.

[GO2003]. Sule Gunduz, M. Tamer Ozsu, A Web Page Prediction Model Based on Click-Stream Tree Representation of User Behavior, in The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24-27, 2003.

[GOO]. Google Search Engine, http://www.google.com

[HH1999]. Robert J. Hilderman, Howard J. Hamilton, Knowledge Discovery and Interestingness Measures: A Survey, Technical Report, University of Regina, 1999.

[JJYK1999]. Joshi K.P., Joshi A., Yesha Y., Krishnapuram R., Warehousing and Mining Web Logs, in Proceedings of the 2nd ACM CIKM Workshop on Web Information and Data Management, pp. 63-68, 1999.

[JTB2002]. Jespersen S.E., Thorhauge J., Bach Pedersen T., A Hybrid Approach to Web Usage Mining, Data Warehousing and Knowledge Discovery (DaWaK'02), LNCS 2454, Springer Verlag, Germany, pp. 73-82, 2002.

[JTP2002]. Soren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen, A Hybrid Approach to Web Usage Mining, Technical Report 02-5002, Department of Computer Science, Aalborg University, July 2002.

[MHD2003]. Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003.

[LL1999]. Levene M., Loizou G., Computing the Entropy of User Navigation in the Web, Department of Computer Science, University College London, 1999.

[NN2003]. Nielsen//NetRatings, http://www.nielsen-netratings.com

[MCS1999]. Bamshad Mobasher, Robert Cooley, Jaideep Srivastava, Creating Adaptive Web Sites Through Usage-Based Clustering of URLs, in Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange Workshop (KDEX'99), November 1999.

[MPC1999]. Masseglia F., Poncelet P., Cicchetti R., WebTool: An Integrated Framework for Data Mining, in Proceedings of the Ninth International Conference on Database and Expert System Applications (DEXA'99), Florence, Italy, August 1999.

[OKN2002]. Shigeru Oyanagi, Kazuto Kubota, Akihiko Nakase, Mining WWW Access Sequence by Matrix Clustering, SIGKDD Explorations, Volume 4, Issue 2, page 125, 2002.

[RWC2000]. Robert W. Cooley, Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data, Ph.D. Thesis, May 2000.

[SCDT2000]. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan, Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.

[SN2003]. Smith K.A., Ng A., Web Page Clustering Using a Self-Organizing Map of User Navigation Patterns, Decision Support Systems, Volume 35, Issue 2 (May 2003), Special Issue: Web Data Mining, pp. 245-256.

[SZAS1997]. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, Vishal Shah, Knowledge Discovery from Users Web-Page Navigation, IEEE RIDE, 1997.

[SS1998]. Myra Spiliopoulou, Lukas C. Faulstich, WUM: A Web Utilization Miner, in International Workshop on the Web and Databases (WebDB98), Valencia, Spain, March 1998.

[ZA1997]. A. Zarkesh, J. Adibi, Pathmining: Knowledge Discovery in Partially Ordered Databases, submitted to KDD-1997.

[ZXH1998]. O.R. Zaiane, M. Xin, J. Han, Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs, in Proceedings of the Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, April 1998.