clustering of web usage data using fuzzy tolerance rough set

International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 2, Jun 2013, 143-152
© TJPRC Pvt. Ltd.
CLUSTERING OF WEB USAGE DATA USING FUZZY TOLERANCE ROUGH SET
SIMILARITY AND TABLE FILLING ALGORITHM
T. VIJAYA KUMAR & H. S. GURUPRASAD
Department of IS & E, BMS College of Engineering, Bull Temple Road, Bangalore, Karnataka, India
ABSTRACT
Web Usage Mining is the application of data mining techniques to learn usage patterns from Web server log files
in order to understand and better serve the requirements of web based applications. Web Usage Mining comprises three
main steps, namely Data Preprocessing, Pattern Discovery and Analysis of the discovered patterns. One of the most
important tasks in Web usage mining is to find groups of users exhibiting similar browsing patterns. Grouping web
transactions into clusters is important in order to understand users' navigational behavior. Different types of clustering
algorithms such as partition based, distance based, density based, grid based, hierarchical and fuzzy clustering algorithms
are used to find clusters from Web usage data. In this paper we propose an approach for clustering Web usage data based
on Fuzzy tolerance rough set theory and table filling algorithm. First, we have constructed the sessions using concept
hierarchy and link information. The similarity between two sessions is approximated by using Rough set tolerance relation.
The tolerance relation is reformulated into equivalence relation using fuzzy tolerance. Then the clusters are obtained by
using modified table filling algorithm. We provide experimental results of Fuzzy rough set similarity and table filling
algorithm on MSNBC web navigation data set. In this paper, we have considered the server log files of the Website
www.enggresources.com for overall study and analysis.
KEYWORDS: Web Usage Mining, Concept Hierarchy, Website Ontology, Rough Set Similarity, Fuzzy Tolerance,
Table Filling Algorithm
INTRODUCTION
The growth of the World Wide Web in terms of Web sites and their users over the last two decades has resulted in a
large amount of data related to users' interactions with the web sites. This data is recorded in the Web server log files
of Web servers and is referred to as Web usage data. Web usage mining (WUM) uses data mining techniques to discover
valuable information from Web usage data. WUM deals with the automatic discovery of user access patterns from one or
more Web servers. Web Usage mining contains three main tasks namely Data preprocessing, Cluster discovery and Cluster
analysis. Data preprocessing consists of data cleaning, data transformation and data reduction. Data cleaning routines work
to clean the data by filling in missing values, smoothing noisy data and resolving inconsistencies in the data. In data
transformation, the data are transformed or consolidated into forms appropriate for mining. Data reduction techniques can
be applied to obtain a reduced representation of the data that is much smaller in volume, yet closely maintains the integrity
of the original data. Cluster discovery deals with formation of groups of users exhibiting similar browsing patterns and
obtaining groups of pages that are accessed together. Cluster analysis filters out uninteresting patterns from the user
clusters and page clusters found in the Cluster discovery phase. Clustering is a data mining technique that groups together a
set of items having similar characteristics. In the Web usage domain, two kinds of interesting clusters, user clusters
and page clusters, can be discovered. This paper presents a new approach for finding session similarity using Fuzzy Rough
set theory. Rough set theory deals with uncertainty and vagueness. The building block of rough set theory is an assumption
that with every set of the universe, we can associate some information in the form of data and knowledge. Objects
clustered by the same information are similar with respect to the available information about them. The set similarity
considered for two sessions is a tolerance relation, which is reflexive and symmetric but not necessarily transitive. Fuzzy tolerance
is used to reform the tolerance relation into an equivalence relation. Then the indiscernibility based fuzzy tolerance rough
set similarity is combined with the table filling algorithm to form the clusters. The table filling algorithm is used to minimize
deterministic finite automata. The minimization problem is to find the unique minimal deterministic finite automaton that
accepts the same language as the given deterministic finite automaton. Algorithms solving this problem are used in
applications ranging from compiler construction to hardware circuit minimization. The rest of the paper is organized as
follows. Section 2 gives a brief description about the related work. Section 3 explains the proposed model. Section 4 covers
details of Data Preprocessing using Concept hierarchy and Web site topology. The details of Rough set theory and Fuzzy
tolerance are discussed in Section 5. The proposed approach using fuzzy tolerance rough set theory and table filling
algorithm is explained in section 6. The experimental design and results are discussed in section 7. Finally, we give our
conclusion in section 8.
LITERATURE SURVEY
Several researchers are working on Web Usage mining and have contributed various methodologies, tools for
Web Usage mining. A number of data mining methods have been used to generate models of usage patterns. Models based
on association rules, clustering algorithms, sequential analysis and Markov models have been used for discovering the
knowledge from Web usage data. All these models are predominantly based on usage information from Web usage data
alone. Significant improvement can be achieved by making use of domain knowledge, which is usually available from
domain experts, content providers, and Web designers. Cooley et al. [1, 2] covered the Web usage mining process and
the various steps involved in it. Their work serves as a primary reference for understanding the fundamentals of Web usage mining. Along with
the server log file other sources of knowledge such as site content or structure and semantic domain knowledge can be used
in Web usage mining [3]. In [4], Murat Ali Bayir et al. have proposed a novel framework, called Smart-Miner for Web
usage mining problem which uses link information for producing accurate user sessions and frequent navigation patterns.
Norwati Mustapha et al. [5] have proposed a model for mining users' navigation patterns based on the Expectation Maximization
algorithm, which is used for finding maximum likelihood estimates of parameters in probabilistic models. A complete
framework for mining evolving user profiles in dynamic Websites is proposed in [6]. They also described how to enrich
the discovered user profiles with explicit information need that is inferred from search queries extracted from Web log
data. In [7], T.Vijaya Kumar et al. have proposed a framework for finding useful information from Web Usage Data that
uses Self Organizing Maps (SOM). Sessions are constructed using the concept hierarchy and the link information. Then
they have used SOM to form the cluster. In [8], Hannah et al. have proposed an approach to obtain user profiles based on
intelligent rough clustering techniques.
The proposed method provides efficient algorithms for finding hidden patterns in web log data and is able to learn
the number of clusters automatically from the given data. They have given a two-fold approach for clustering user access
patterns and retrieving effective user profiles from web logs using Gaussian Rough (GR) clustering, Gaussian Rough
Fuzzy (GRF) clustering, rough clustering and rough fuzzy clustering. In [9], Rajhans Mishra et al. have adopted the
similarity upper approximation based clustering of web logs using various similarity metrics. In [10], K.Santhisree et al.
have presented a technique to cluster web transactions based on the set similarity measures from web log data which
identifies the behavior of the user's page visits and order of occurrence of visits. They have formed the Web data Clusters
using the Similarity Upper Approximations. In [11], Philip Hingston presented the method for mining interesting
sequential patterns from large sequential data sets. In the first step the data set is modeled in terms of stochastic grammar
or automaton. Then the queries about frequently occurring patterns in the data set are answered by converting pattern
frequencies into formulae concerning the model. In [12], Sunil Joshi et al. have proposed a new algorithm PMFLT (Pattern
Mining using Formal Language Tools) for sequential pattern mining using formal language tools such as regular
expression constraints. The algorithm finds only user-specific frequent sequences and is reported to be more efficient than
other existing algorithms.
SYSTEM DESIGN
The main goal of the proposed system is to find web user clusters from web server log files. We have adopted
Web Usage Mining System as shown in Figure 1. The WUM system in our approach is partitioned into two modules. In
the first module Data cleaning, User identification and Session constructions are considered for Data preprocessing phase.
Sessions are constructed using web site ontology and concept hierarchy. Then in the second module web usage clusters are
formed using rough set theory, fuzzy tolerance and table filling algorithm.
Figure 1: Web Usage Clustering Process
DATA PREPROCESSING
Data preprocessing [13] comprises merging of log files from different Web servers, Data cleaning,
Identification of users, sessions, and visits, Data formatting and Summarization. Data cleaning consists of removing
superfluous data from the log file. User identification deals with identifying unique clients of the Web server. A combination of IP
address and user agent is used to identify a user uniquely. User identification can also be done using client side cookies. But, due to
privacy reasons, cookies can be disabled by the user, and not every Website employs cookies. Session identification is
considered as the next step. A session is a sequence of requests made by a single user with a unique IP address on a
particular Web domain during a specified period of time.
Time Oriented Approach
The most basic session definition comes with Time Oriented Heuristics which are based on time limitations on total
session time or page-stay time. They are divided into two categories with respect to the thresholds they use:

• In the first one, the duration of a session is limited by a predefined upper bound, which is usually taken as 30
minutes. A new page can be appended to the current session if the time difference from the first page
doesn't violate the total session duration. Otherwise, a new session is assumed to start with the new page request.

• In the second time-oriented heuristic, the time spent on any page is limited by a threshold, which is usually
taken as 10 minutes. If the difference between the timestamps of two consecutively accessed pages is greater than the threshold, the
current session is terminated after the former page and a new session starts with the latter page.
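The two time-oriented heuristics can be combined in a few lines of Python. The following is a minimal sketch, assuming each request is a `(timestamp, page)` tuple already sorted by time for a single user; the function name `split_sessions` and the constant names are illustrative and do not come from the paper.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)    # upper bound on total session duration
PAGE_STAY_TIMEOUT = timedelta(minutes=10)  # upper bound on time spent on one page


def split_sessions(requests):
    """Split one user's time-ordered (timestamp, page) requests into sessions.

    A new session starts when either heuristic is violated: the gap to the
    previous request exceeds PAGE_STAY_TIMEOUT, or the gap to the first
    request of the current session exceeds SESSION_TIMEOUT.
    """
    sessions = []
    current = []
    for ts, page in requests:
        if current:
            first_ts = current[0][0]
            prev_ts = current[-1][0]
            if ts - prev_ts > PAGE_STAY_TIMEOUT or ts - first_ts > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
        current.append((ts, page))
    if current:
        sessions.append(current)
    return sessions
```

With requests at minutes 0, 5, 20 and 25, the 15-minute gap before the third request exceeds the page-stay threshold, so the sketch returns two sessions of two pages each.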
Navigation Oriented Approach
Navigation-Oriented approach [14, 15] uses link information of Website graph which is present in concept based
Website graph constructed by using Website knowledge. In this approach, it is necessary to have a hyperlink between every
two consecutive Web page requests.
Let $S = [P_1, P_2, \ldots, P_n]$ be a session containing Web pages with respect to their timestamp
orders. In this session, for every page $P_i$ except the initial page $P_1$, there must be at least one page $P_j$
in the session which refers to $P_i$ and has a smaller timestamp than $P_i$. The topology constraint forces user
navigation to be considered according to some path in the Website graph.
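The topology constraint can be expressed as a check that every page after the first is linked from some earlier page of the same session. The sketch below assumes the Website graph is available as a dictionary mapping each page to the set of pages it links to; this representation and the name `satisfies_topology` are assumptions made for illustration.

```python
def satisfies_topology(session, links):
    """Return True if every page after the first is linked from an earlier page.

    session: list of page identifiers in timestamp order.
    links:   dict mapping a page to the set of pages it links to
             (a hypothetical adjacency-set view of the Website graph).
    """
    for i in range(1, len(session)):
        target = session[i]
        # Look for at least one earlier page in the session that refers to target.
        if not any(target in links.get(src, set()) for src in session[:i]):
            return False
    return True
```

A session construction routine could call this check before appending a page and start a new session when it fails.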
Concept-matching approach: This approach considers concepts of Web pages from the concept based Website graph. Adding
a page $P_{n+1}$ to a session $S = [P_1, \ldots, P_n]$ is performed as follows: if the concept names of pages $P_n$ and
$P_{n+1}$ are the same, then add $P_{n+1}$ to the current session; else create a new session and add $P_{n+1}$ to it.
That is, concept switching is taken as one more criterion for breaking a session [16].
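Concept switching as a session-breaking criterion can be sketched as follows. The sketch assumes that the new page is compared with the last page of the current session and that concept names are available through a dictionary `concept_of`; both the comparison rule and the names are illustrative assumptions.

```python
def split_on_concept_switch(pages, concept_of):
    """Break a time-ordered page sequence whenever the concept name changes.

    pages:      list of page identifiers in timestamp order.
    concept_of: dict mapping each page to its concept name (hypothetical).
    """
    sessions = []
    for page in pages:
        # Same concept as the last page of the current session: extend it.
        if sessions and concept_of[sessions[-1][-1]] == concept_of[page]:
            sessions[-1].append(page)
        else:
            sessions.append([page])
    return sessions
```

In practice this rule would be applied together with the time-oriented and navigation-oriented heuristics rather than on its own.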
FUZZY TOLERANCE ROUGH SET SIMILARITY
In this section we present a Fuzzy rough set theoretic approach to cluster user access transactions over the web.
The presented approach is based on the table filling algorithm. Rough set theory is based on the assumption that with every
set of the universe, some information in the form of data and knowledge can be associated. Objects clustered by the same
information are similar, and the similarity generated from this information forms the basis of rough set theory. Given
two transactions x and y, the sequence and set similarity measure proposed in [17] is considered for our study. Sequence
similarity calculates the amount of similarity in the order of occurrence of pages within two page sequences.
The sequence similarity measure is given in equation (1). The length of the longest common subsequence (LLCS) with
respect to the length of the longer sequence determines the sequence similarity aspect across two sequences. The
LLCS can be calculated by a dynamic programming approach [18]. Set similarity is
defined as the ratio of the number of common pages to the number of unique pages in two page sequences. The set
similarity measure is given in equation (2). The sequence set similarity metric satisfies non-negativity, symmetry and
normalization, and hence qualifies as a proper similarity metric [19]. The sequence set similarity measure is given in equation
(3).

$$\mathrm{SeqSim}(x, y) = \frac{\mathrm{LLCS}(x, y)}{\max(|x|, |y|)} \qquad (1)$$

$$\mathrm{SetSim}(x, y) = \frac{|x \cap y|}{|x \cup y|} \qquad (2)$$

$$S(x, y) = p \cdot \mathrm{SeqSim}(x, y) + q \cdot \mathrm{SetSim}(x, y) \qquad (3)$$

Here $p + q = 1$ and $p, q \geq 0$. The values of p and q determine the relative weights for sequence similarity and
set similarity. $S(x, y) \in [0, 1]$ for all x and y. $S(x, y) = 1$ when two transactions x and y are exactly the
same, and $S(x, y) = 0$ when two transactions x and y have no items in common. This measure of similarity gives
information about the users' access patterns related to their common areas of interest and their common navigational patterns.
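The sequence, set and combined similarity measures can be written compactly in Python. This is a minimal sketch under the assumption that the sequence part is the LLCS divided by the length of the longer sequence and the set part is the ratio of common to unique pages; the function names and the default weights p = q = 0.5 are illustrative choices.

```python
def llcs(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(y) + 1)
    for a in x:
        curr = [0]
        for j, b in enumerate(y, start=1):
            # Extend the diagonal on a match, else carry the best neighbour.
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def sequence_similarity(x, y):
    """LLCS relative to the length of the longer sequence."""
    longest = max(len(x), len(y))
    return llcs(x, y) / longest if longest else 0.0


def set_similarity(x, y):
    """Number of common pages divided by the number of unique pages."""
    union = set(x) | set(y)
    return len(set(x) & set(y)) / len(union) if union else 0.0


def sequence_set_similarity(x, y, p=0.5, q=0.5):
    """Weighted combination of sequence and set similarity (p + q = 1)."""
    return p * sequence_similarity(x, y) + q * set_similarity(x, y)


print(round(sequence_set_similarity([6, 7, 7, 7, 6, 7], [6, 6, 6, 6, 3, 14]), 2))  # 0.29
```

For the two page-category sequences in the last line, the LLCS is 2 out of 6 and one of four unique pages is shared, which gives 0.5 × 0.33 + 0.5 × 0.25 ≈ 0.29.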
The navigation of any two users over a Web site may not be exactly the same but may have some common interesting
pages. The same user can navigate the same pattern in two different ways. This similarity between the navigational
behaviors of two users is modeled by using a binary relation R defined on T, the set of user transactions. For any threshold
value $\delta \in (0, 1]$ and for
any two user transactions x and y ∈ T, the binary relation R on T, denoted xRy, is defined by xRy iff the similarity
$S(x, y) \geq \delta$. The similarity class of t, denoted by SimClass(t), is the set of transactions which are similar to t.
It is given by SimClass(t) = {s ∈ T : sRt}. For different threshold values we can get different similarity classes. A domain
expert can choose the threshold based on experience to get a proper similarity class. For a fixed threshold $\delta \in (0, 1]$, a
transaction from a given similarity class may be similar to an object of another similarity class.
This relation R is a tolerance relation, as R is both reflexive and symmetric, but transitivity may not always
hold. Let a, b and c be three different transactions. For a specified threshold, if a is similar to b and b is similar to c,
then a may not be similar to c. A tolerance or proximity relation R is a relation that exhibits only the properties of
reflexivity and symmetry. A tolerance relation, R, can be reformed into an equivalence relation by at most (n-1)
compositions with itself, where n is the cardinal number of the set defining R. A fuzzy relation, R, on a single universe X is
also a relation from X to X. It is a fuzzy equivalence relation if all three of the following properties for matrix relations
define it:
Reflexivity: $\mu_R(x_i, x_i) = 1$
Symmetry: $\mu_R(x_i, x_j) = \mu_R(x_j, x_i)$
Transitivity: $\mu_R(x_i, x_j) = \lambda_1$ and $\mu_R(x_j, x_k) = \lambda_2$. Then $\mu_R(x_i, x_k) = \lambda$, where $\lambda \geq \min(\lambda_1, \lambda_2)$
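Reforming a fuzzy tolerance relation into a fuzzy equivalence relation by repeated composition can be sketched as below. The sketch assumes max-min composition, which is the usual choice for fuzzy relations but is not stated explicitly in the text; the function names and the 3 × 3 example matrix are made up for illustration.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [[max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]


def to_equivalence(r):
    """Compose a reflexive, symmetric fuzzy relation with itself until stable."""
    n = len(r)
    current = r
    for _ in range(max(n - 1, 1)):
        nxt = max_min_compose(current, current)
        if nxt == current:  # no entry changed, so the relation is transitive
            break
        current = nxt
    return current


# Tolerance relation that violates transitivity: 0.2 < min(0.8, 0.6).
R = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.6],
     [0.2, 0.6, 1.0]]
print(to_equivalence(R))  # [[1.0, 0.8, 0.6], [0.8, 1.0, 0.6], [0.6, 0.6, 1.0]]
```

In the example, the weakest entry 0.2 is raised to 0.6, the strength of the indirect path through the middle element, after which a further composition changes nothing.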
MODIFIED TABLE FILLING ALGORITHM
The indiscernibility based fuzzy tolerance rough set similarity is combined with the table filling algorithm to form the
clusters. The table filling algorithm is used to minimize deterministic finite automata. The minimization problem is to find
the unique minimal deterministic finite automaton that accepts the same language as the given deterministic finite
automaton.
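For reference, the classic table filling algorithm for DFA minimization can be sketched as follows. This is the textbook procedure, not the modified version proposed in this paper; the function names and the encoding of the transition function as a dictionary keyed by (state, symbol) are illustrative assumptions.

```python
from itertools import combinations


def distinguishable_pairs(states, alphabet, delta, finals):
    """Classic table filling: return the set of distinguishable state pairs."""
    # Base case: a final and a non-final state are always distinguishable.
    marked = {frozenset(pair) for pair in combinations(states, 2)
              if (pair[0] in finals) != (pair[1] in finals)}
    changed = True
    while changed:  # repeat until no previously unmarked pair gets marked
        changed = False
        for p, q in combinations(states, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for a in alphabet:
                succ = frozenset((delta[p, a], delta[q, a]))
                if len(succ) == 2 and succ in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked


def equivalent_pairs(states, alphabet, delta, finals):
    """Pairs left unmarked by table filling; these states can be merged."""
    all_pairs = {frozenset(pair) for pair in combinations(states, 2)}
    return all_pairs - distinguishable_pairs(states, alphabet, delta, finals)


# Small made-up DFA in which states A and B behave identically.
states = ["A", "B", "C"]
alphabet = ["0", "1"]
delta = {("A", "0"): "B", ("A", "1"): "C",
         ("B", "0"): "B", ("B", "1"): "C",
         ("C", "0"): "B", ("C", "1"): "C"}
print(equivalent_pairs(states, alphabet, delta, finals={"C"}))
```

The clustering approach borrows the same idea: pairs that cannot be told apart are treated as equivalent and merged.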
Algorithm: Modified table filling algorithm with Fuzzy tolerance rough set based similarity.
Input: A set of n transactions $T = \{t_1, t_2, \ldots, t_n\}$
Threshold: $\delta \in (0, 1]$
Similarity measure: $S(x, y)$
Output: Web usage clusters
Procedure Mark
Step 1: Remove sessions with session length < minimum session length
Step 2: Consider all pairs of sessions (x, y). Construct the similarity matrix using the similarity measure.
Step 3: Repeat the following until no previously unmarked pairs are marked.
For all pairs (x, y), if $S(x, y) \geq \delta$ then sessions x and y are indistinguishable or equivalent.
The sessions x and y are placed in the same cluster.
Procedure Reduce
Construct Session Clusters
Step 1: Use procedure mark to find all pairs of similar sessions. Use Fuzzy tolerance rough set similarity
measure to find the similarity class of each session using SimClass(t) = {s ∈ T : sRt}.
Step 2: Each group of equivalent sessions must be placed in a single cluster to form session clusters.
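One possible reading of the Mark and Reduce procedures is sketched below. It assumes that "equivalent" means connected through a chain of pairs whose similarity reaches the threshold, so the resulting clusters are disjoint; the overlapping clusters reported in the worked example of the next section suggest that the authors' implementation differs in detail. The union-find bookkeeping, the function names and the `jaccard` stand-in similarity are all assumptions made for illustration.

```python
def jaccard(a, b):
    """Stand-in similarity for the example: common pages over unique pages."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0


def cluster_sessions(sessions, similarity, threshold, min_length=1):
    """Group sessions linked by chains of pairs with similarity >= threshold."""
    # Mark, step 1: drop sessions shorter than the minimum length.
    kept = [s for s in sessions if len(s) >= min_length]
    n = len(kept)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Mark, steps 2 and 3: merge every pair that reaches the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(kept[i], kept[j]) >= threshold:
                parent[find(i)] = find(j)

    # Reduce: each group of equivalent sessions forms one cluster.
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(kept[i])
    return list(clusters.values())


sessions = [[1, 2, 3], [2, 3, 4], [3, 4, 5], [9, 9], [7]]
print(cluster_sessions(sessions, jaccard, threshold=0.4, min_length=2))
```

In the example, the first and third sessions are not directly similar (0.2) but both are similar to the second (0.5), so all three end up in one cluster, while `[9, 9]` stays alone and `[7]` is removed as too short.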
EXPERIMENTAL DESIGN AND RESULTS
Description of the Dataset
The data is taken from the UCI dataset repository and consists of Internet Information Server (IIS) logs for msnbc.com
and news related portions of msn.com. Each sequence in the dataset corresponds to page views of a user. Each event in the
sequence corresponds to a user's request for a page. Requests are recorded at the page category level as determined by the
site administrator.
There are 17 page categories, namely 'front page', 'news', 'tech', 'local', 'opinion', 'on-air', 'misc', 'weather',
'health', 'living', 'business', 'sports', 'summary', 'bulletin board service', 'travel', 'msn-news', and 'msn-sports'. Each
page category is represented by an integer label.
For example, 'front page' is coded as 1, 'news' is coded as 2, 'tech' is coded as 3, etc. Each row describes the hits
of a single user. Figure 2 shows an example of web navigational data. The similarity table is computed using Sequence set
similarity and shown in Figure 3.
T1:    6   7   7   7   6   7
T2:    2  12   3   4  12  12
T3:   14  14  14  14  14  14
T4:    1   1  12   2   2   4
T5:    6   8   8   8   8  12
T6:    6   6   6   6   3  14
T7:    1  14  14   1   1   2
T8:    1   1   1   1   1  14
T9:    2   2  15   5   5  16
T10:   1  11   1   2   2  14
Figure 2: Sample Web Navigation Data
      T2    T3    T4    T5    T6    T7    T8    T9    T10
T1    0     0     0     0.21  0.29  0     0     0     0
T2          0     0.47  0.17  0.17  0.17  0     0.15  0.15
T3                0     0     0.25  0.33  0.33  0     0.21
T4                      0.17  0     0.45  0.27  0.24  0.5
T5                            0.18  0     0     0     0
T6                                  0.18  0.21  0     0.17
T7                                        0.58  0.17  0.62
T8                                              0     0.5
T9                                                    0.24
Figure 3: Similarity Table Using Sequence Set Similarity
Assuming the threshold value as 0.2, the similarity classes are as follows.
SimClass(T1) = {T1, T5, T6}
SimClass(T2) = {T2, T4}
SimClass(T3) = {T3, T6, T7, T8, T10}
SimClass(T4) = {T2, T4, T7, T8, T9, T10}
SimClass(T5) = {T1, T5}
SimClass(T6) = {T1, T3, T6, T8}
SimClass(T7) = {T3, T4, T7, T8, T10}
SimClass(T8) = {T3, T4, T6, T7, T8, T10}
SimClass(T9) = {T4, T9, T10}
SimClass(T10) = {T3, T4, T7, T8, T9, T10}
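The similarity classes above can be reproduced from the entries of Figure 3 with a short script. This is a sketch that assumes xRy holds when the tabulated similarity is at least the threshold; the names `UPPER` and `similarity_classes` are illustrative.

```python
# Upper triangle of Figure 3: row i lists the similarity of Ti to T(i+1)..T10.
UPPER = {
    1: [0, 0, 0, 0.21, 0.29, 0, 0, 0, 0],
    2: [0, 0.47, 0.17, 0.17, 0.17, 0, 0.15, 0.15],
    3: [0, 0, 0.25, 0.33, 0.33, 0, 0.21],
    4: [0.17, 0, 0.45, 0.27, 0.24, 0.5],
    5: [0.18, 0, 0, 0, 0],
    6: [0.18, 0.21, 0, 0.17],
    7: [0.58, 0.17, 0.62],
    8: [0, 0.5],
    9: [0.24],
}


def similarity_classes(upper, n, threshold):
    """Return {t: set of s with similarity(s, t) >= threshold} for t = 1..n."""
    sim = {(i, i): 1.0 for i in range(1, n + 1)}  # reflexivity
    for i, row in upper.items():
        for offset, value in enumerate(row, start=1):
            sim[i, i + offset] = sim[i + offset, i] = value  # symmetry
    return {t: {s for s in range(1, n + 1) if sim[s, t] >= threshold}
            for t in range(1, n + 1)}


classes = similarity_classes(UPPER, 10, 0.2)
for t in sorted(classes):
    print(f"SimClass(T{t}) =", sorted(f"T{s}" for s in classes[t]))
```

The output also illustrates why the relation is only a tolerance relation: T1 and T3 are both similar to T6, yet the similarity between T1 and T3 is 0.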
Initially {T1, T5, T6} and {T2, T4} form as two separate clusters.
Based on the Fuzzy tolerance rough set similarity we get the following clusters.
C1 = {T1, T5, T6}
C2 = {T2, T4, T7, T8, T9, T10}
C3 = {T3, T4, T6, T7, T8, T9, T10}
There are some transactions which belong to multiple clusters. Different clusters can be formed by choosing
different threshold values. We have considered Web Server log file from the Web site www.enggresources.com for our
experimental study and concept based Website graph is constructed as additional input. Error records, requests for images
and multimedia files are removed from Server log file by using a tool called Web log filter. Usually this process removes
requests concerning non-analyzed resources such as images, multimedia files, and page style files (*.css) etc. IP address,
timestamp, user agent, request and referrer are retained for further processing. In user identification, IP address and user
agent are used. That is, a combination of IP address and user agent is used to identify a unique user.
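The cleaning and user identification steps can be sketched as follows. This is not the Web log filter tool mentioned above; it is a minimal illustration that assumes each log record is a dictionary with the hypothetical keys `ip`, `user_agent`, `request` and `status`, treats HTTP status codes of 400 and above as error records, and uses an illustrative list of ignored file extensions.

```python
import os

# Illustrative list of non-analyzed resource types (images, multimedia, styles).
IGNORED_EXTENSIONS = {".css", ".gif", ".jpg", ".jpeg", ".png", ".ico", ".mp3", ".mp4"}


def clean_records(records):
    """Drop error responses and requests for images, multimedia and style files."""
    kept = []
    for rec in records:
        path = rec["request"].split("?")[0]          # ignore any query string
        ext = os.path.splitext(path)[1].lower()
        if rec["status"] >= 400 or ext in IGNORED_EXTENSIONS:
            continue
        kept.append(rec)
    return kept


def group_by_user(records):
    """Identify users by the combination of IP address and user agent."""
    users = {}
    for rec in records:
        users.setdefault((rec["ip"], rec["user_agent"]), []).append(rec)
    return users
```

Two requests from the same IP address with different user agents are therefore attributed to two different users.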
In session construction, we have combined two trivial approaches, Time oriented approach and Navigation
oriented approach along with concept name match approach for identifying user sessions. Page stay time threshold and
session timeout threshold are set as 10 and 30 minutes respectively. Each Web page is assigned with unique index.
And every unique session is also given a unique index. 10217 users and 25814 sessions were discovered from preprocessing. The similarity table is constructed using Sequence set similarity. Experiments are conducted by randomly selecting
100, 200, 300, 400, and 500 sessions from the preprocessed data with threshold values $\delta = 0.2$ and $\delta = 0.3$.
As the number of records increases, the number of clusters formed also increases. The graphs of the number of
sessions versus the number of clusters with threshold values $\delta = 0.2$ and $\delta = 0.3$ are shown in Figure 4(a) and 4(b)
respectively.
Figure 4(a): Graph Depicting Number of Sessions versus Number of Clusters with $\delta = 0.2$
Figure 4(b): Graph Depicting Number of Sessions versus Number of Clusters with $\delta = 0.3$
CONCLUSIONS
Clustering of web user transactions can be used to find interesting user access patterns from web server log files. In
this paper we have proposed an approach for finding web session clusters using Fuzzy tolerance rough set theory and the table
filling algorithm. These clusters represent groups of users exhibiting similar browsing patterns. These patterns can be used
to provide a set of recommendations for the web site, which can be deployed by the web site administrator for website
enhancement. Traditional clustering methods create clusters by describing the members of each cluster whereas the rough
set based clustering techniques create clusters describing the main characteristics of each cluster. In this work, we
introduced Fuzzy tolerance rough set similarity measure along with the table filling algorithm. The proposed approach
allows merging of two or more clusters. We investigated our approach on MSNBC web navigation data set. We
successfully conducted experiments on the server log files of the Website www.enggresources.com to form clusters.
REFERENCES
1. R. Cooley, B. Mobasher, and J. Srivastava, “Web mining: information and pattern discovery on the World Wide
Web”, Ninth IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, USA,
1997, Pages 558-567.
2. J. Srivastava, R. Cooley, M. Deshpande, and P. N. Tan, “Web usage mining: discovery and applications of usage
patterns from Web data”, ACM SIGKDD Explorations Newsletter, Volume 1, Pages 12-23, 2000.
3. Bamshad Mobasher, Chapter: 12, “Web Usage Mining in Data Collection and Pre-Processing”, ACM SIGKKD
2007, Pages 450-483.
4. Murat Ali Bayir, Ismail Hakki Toroslu, Guven Fidan, and Ahmet Cosar, “Smart Miner: A New Framework for
Mining Large Scale Web Usage Data”, ACM 2009.
5. Norwati Mustapha, Manijeh Jalali, and Mehrdad Jalali, “Expectation Maximization Clustering Algorithm for User
Modeling in Web Usage Mining Systems”, European Journal of Scientific Research, ISSN 1450-216X, Volume 32,
Number 4 (2009), Pages 467-476.
6. Olfa Nasraoui, Maha Soliman, Esin Saka, Antonio Badia, and Richard Germain, “Web Usage Mining Framework for
Mining Evolving User Profiles in Dynamic Web Sites”, IEEE Transactions on Knowledge and Data Engineering,
Volume 20, Number 2, February 2008.
7. T. Vijaya Kumar, Dr. H. S. Guruprasad, “Clustering Web Usage Data using Concept hierarchy and Self
Organizing Maps”, International Journal of Computer Applications (0975 – 8887), Volume 56 – No. 18, October
2012, www.ijcaonline.org.
8. H. Hannah Inbarani, K. Thangavel, “Rough set based User profiling for Web Personalization”, International
Journal of Recent Trends in Engineering, Vol. 2, No. 1, November 2009.
9. Rajhans Mishra and Pradeep Kumar, “Clustering Web Logs Using Similarity Upper Approximation with
Different Similarity Measures”, International Journal of Machine Learning and Computing, Vol. 2, No. 3, June
2012.
10. K. Santhisree and Dr. A. Damodaram, “Clustering on Web usage data using Approximations and Set Similarities”,
2010 International Journal of Computer Applications (0975 – 8887), Volume 1 – No. 4.
11. Philip Hingston, “Using Finite State Automata for Sequence Mining”, Proceedings of twenty-fifth Australian
conference on computer science – Volume 4, Pages 105-110. Australian Computer Science Communications,
Vol. 24, Issue 1, Jan-Feb 2002.
12. Sunil Joshi, Dr. R. S. Jadon, and Dr. R. C. Jain, “Sequential Pattern Mining Using Formal Language Tools”,
IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 5, No. 2, September 2012, ISSN
(Online): 1694-0814.
13. G. T. Raju and P. S. Satyanarayana, “Knowledge Discovery from Web Usage Data: Complete Preprocessing
Methodology”, IJCSNS International Journal of Computer Science and Network Security, Volume 8, Number 01,
January 2008.
14. C. Shahabi and F. B. Kashani, “Efficient and anonymous Web-usage mining for Web personalization”, INFORMS
Journal on Computing, 15(2), Pages 123-147, 2003.
15. M. Spiliopoulou, B. Mobasher, B. Berendt, and M. Nakagawa, “A framework for the evaluation of session
reconstruction heuristics in Web usage analysis”, INFORMS Journal on Computing, 15(2), Pages 171-190, 2003.
16. T. Vijaya Kumar, Dr. H. S. Guruprasad, Bharath Kumar K. M, Irfan Baig and Kiran Babu S, “A New Web Usage
Mining approach for Website recommendations using Concept hierarchy and Website Graph”, International
Journal of Computer and Electrical Engineering (IJCEE), ISSN: 1793-8198 (Online Version); 1793-8163 (print
version).
17. P. Kumar, M. V. Rao, P. R. Krishna, R. S. Bapi, and A. Laha, “Intrusion detection system using sequence and set
preserving metric”, Proceedings of IEEE International Conference on Intelligence and Security Informatics, LNCS
Springer Verlag, Atlanta, 2005, pp. 498–504.
18. L. Bergroth, H. Hakonen, and T. Raita, “A survey of longest common subsequence
algorithm”, Seventh International Symposium on String Processing and Information Retrieval, Atlanta, 2000, pp.
39–48.
19. Pradeep Kumar, P. Radha Krishna, Raju S. Bapi and Supriya Kumar De, “Rough clustering of sequential data”,
Data & Knowledge Engineering 63 (2007) 183–199, www.elsevier.com/locate/datak.