TOWARD BETTER WEBSITE USAGE:
LEVERAGING DATA MINING TECHNIQUES AND ROUGH SET LEARNING TO
CONSTRUCT BETTER-TO-USE WEBSITES
A Dissertation
Presented to
The Graduate Faculty of The University of Akron
In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy
Natheer Yousef Khasawneh
August, 2005
TOWARD BETTER WEBSITE USAGE:
LEVERAGING DATA MINING TECHNIQUES AND ROUGH SET LEARNING TO
CONSTRUCT BETTER-TO-USE WEBSITES
Natheer Yousef Khasawneh
Dissertation
Approved:
Accepted:
_______________________________
Advisor
Dr. John Durkin
_______________________________
Department Chair
Dr. Jose De Abreu-Garcia
_______________________________
Committee Member
Dr. John Welch
_______________________________
Dean of the College
Dr. George Haritos
_______________________________
Committee Member
Dr. James Grover
_______________________________
Dean of the Graduate School
Dr. George Newkome
_______________________________
Committee Member
Dr. Yueh-Jaw Lin
_______________________________
Date
_______________________________
Committee Member
Dr. Yingcai Xiao
_______________________________
Committee Member
Dr. Chien-Chung Chan
ABSTRACT
When users browse a website, they usually try to accomplish a certain task, such as
finding information, buying products, registering for classes, and attending classes online. The interaction between the users and the website can give web engineers insight
into the most common user tasks performed on the website. They can learn how most
users navigate the website to finish their tasks and what changes can be made to the
website structure in order to make the completion of the common tasks easier and faster.
Most web servers provide web interaction logs that track the interaction between users and the website. However, such logs are usually designed for debugging purposes, not for website analysis, so a deeper conceptual method is needed to analyze the interaction log and reveal information that can be used to enhance the website structure.
In this work, different data mining techniques, along with a rough set learning approach, are presented to enhance website usage. A new active-user-based user identification algorithm is applied to the interaction log to group together records that belong to the same user. The algorithm's running time is one order of magnitude faster than that of other user identification algorithms. Sessions for the identified users are found using an ontology-based session identification algorithm, which uses the website ontology to determine the sessions within each user's records. Different website sessions are then
compared using a new Multidimensional Session Comparison Method (MSCM). MSCM
takes into consideration multiple session dimensions, such as the pages visited, the time spent on each page, and the session length. MSCM compares sessions more precisely than other well-known session comparison methods, such as the Sequence Alignment Method (SAM), the Multidimensional Sequence Alignment Method (MDSAM), and Path Feature Space.
Using the comparison results from the MSCM, sessions are clustered by hierarchal and
equivalence classes clustering algorithms. The clustering results are used by the rough set
learning method and the centroid method to generate rules that are used for both
predicting and describing sessions' clusters. Rules generated using the rough set learning approach predict and describe clusters better than rules generated using the centroid method. Each session cluster is considered one task, and the cluster centroid is the navigation path for completing that task. Common tasks, along with their navigation paths, are thus evaluated, and suggestions are then made to the website engineer to enhance the website structure to better serve website users. This work shows how data mining techniques along with
rough set learning methods can be used to enhance the website structure for better-to-use
websites.
DEDICATION
To my parents…
ACKNOWLEDGEMENTS
All praises are due to ALLAH (GOD). Every good comes through HIM alone. So
praises be to HIM.
My profound thanks to my advisor Dr. John Durkin for his support, confidence, and
understanding. My deep appreciation to Dr. C.-C. Chan for his constant support and insightful guidance, and to Dr. Tom Xiao for the good time I spent with him on the ODOT project, which was very helpful in my research. I want to thank Dr. John Welch for his proofreading and the time he spent with me teaching in the "Tools Lab." Dr. James Grover
and Dr. Y.-J. Lin also gave me invaluable support throughout my research. My special
thanks to the staff in the computer center at the University of Akron for providing the
data for this research. My thanks also go to the faculty and staff of the Department of Electrical and Computer Engineering for their support.
My heartfelt thanks to my brothers at the Akron Masjid, Abdul Kareem, Abdul Raheem, Yahya, Hussien, Masoud, Musa and Abdel Ghanee, for their prayers and support. To my
dear friends in the USA, including Qasem, Luay, Qais, Ahmad, Mohammad, Hussein, Huthaifa, Faisal, Sami, Samer, and Majed, a special thanks for the happy time we spent
together.
My friends and family in Jordan, including my mother, Mrs. Fairouze Khasawneh,
my father, Mr. Yousef Khasawneh, my sisters Fatemah, Hala, and Dr. Maha, my brothers
Dr. Basheer and Dr. Mohammad and their families, have been my strongest support
system. This project surely would not have been accomplished without their love, care
and DU'A (prayers).
TABLE OF CONTENTS

Page

LIST OF TABLES…………………………………………………………………. xiii

LIST OF FIGURES………………………………………………………………… xv

CHAPTER

I. INTRODUCTION……………………………………………………………... 1
   1.1 Motivation…………………………………………………………………. 2
   1.2 Previous work……………………………………………………………… 2
   1.3 Proposed WUM system architecture……………………………………… 3
   1.4 Main contributions………………………………………………………… 5
   1.5 Research objective………………………………………………………… 6
   1.6 Structure of the dissertation………………………………………………. 6

II. WEB LOG DATA PREPROCESSING FOR WEB USAGE MINING………. 7
   2.1 Introduction………………………………………………………………... 7
   2.2 Previous work……………………………………………………………… 9
   2.3 Data preprocessing architecture…………………………………………… 10
      2.3.1 Data cleaning…………………………………………………………. 10
      2.3.2 User identification……………………………………………………. 11
         2.3.2.1 User identification problem statement…………………………... 12
         2.3.2.2 A trivial user identification algorithm…………………………… 12
         2.3.2.3 The active user-based user identification algorithm…………….. 13
      2.3.3 Ontology-based session identification……………………………….. 16
      2.3.4 Data filtering………………………………………………………….. 18
   2.4 Experimental results………………………………………………………. 19
      2.4.1 Data overview………………………………………………………… 19
      2.4.2 Data selection process………………………………………………... 20
      2.4.3 Data cleaning results…………………………………………………. 20
      2.4.4 User identification results……………………………………………. 22
      2.4.5 Session identification and data filtering results……………………… 24
   2.5 Modeling website parameters……………………………………………... 25
      2.5.1 Distribution functions………………………………………………… 25
      2.5.2 Analytical results……………………………………………………... 26
         2.5.2.1 Modeling number of records per user…………………………… 27
         2.5.2.2 Modeling inactive user time……………………………………... 28
         2.5.2.3 Modeling recorded records per second………………………….. 29
   2.6 Summary…………………………………………………………………... 29

III. MULTIDIMENSIONAL SESSIONS COMPARISON METHOD USING
   DYNAMIC PROGRAMMING………………………………………………. 31
   3.1 Introduction……………………………………………………………….. 31
   3.2 Definitions………………………………………………………………… 32
   3.3 Problem statement………………………………………………………… 33
   3.4 Related work………………………………………………………………. 33
      3.4.1 Exact sequence matching…………………………………………….. 33
      3.4.2 Approximate one dimension sequence matching……………………. 33
         3.4.2.1 Measuring difference distance…………………………………... 34
         3.4.2.2 Measuring similarity distance…………………………………… 36
   3.5 Previous work……………………………………………………………... 37
      3.5.1 Limitations of the previous work…………………………………….. 38
   3.6 Multidimensional session comparison method (MSCM)………………… 38
      3.6.1 Assumptions………………………………………………………….. 39
      3.6.2 Algorithm construction………………………………………………. 40
      3.6.3 Algorithm description………………………………………………... 41
      3.6.4 Time complexity analysis……………………………………………. 43
   3.7 Experimental results and analysis………………………………………… 43
   3.8 Summary and conclusion…………………………………………………. 45

IV. ENHANCING WEBSITE STRUCTURE BY MEANS OF
   HIERARCHAL CLUSTERING ALGORITHMS AND ROUGH SET
   LEARNING APPROACH…………………………………………………… 47
   4.1 Introduction……………………………………………………………….. 47
   4.2 Clustering analysis………………………………………………………… 49
      4.2.1 Clustering algorithms………………………………………………… 49
      4.2.2 Properties of agglomerative hierarchal clustering techniques……….. 51
   4.3 Clustering web sessions…………………………………………………… 52
      4.3.1 Definitions……………………………………………………………. 52
      4.3.2 Problem statement……………………………………………………. 54
      4.3.3 Hierarchal clustering algorithm……………………………………… 54
      4.3.4 Equivalence classes clustering algorithm……………………………. 55
      4.3.5 Determining a common termination condition for different
            sessions lengths……………………………………………………… 57
      4.3.6 Ward's method improves determining a common termination
            condition……………………………………………………………... 58
   4.4 Web sessions' classifiers………………………………………………….. 60
      4.4.1 The centroid approach………………………………………………... 61
      4.4.2 Rough set approach…………………………………………………... 62
   4.5 Classifier accuracy estimator……………………………………………… 66
   4.6 Experimental results………………………………………………………. 68
      4.6.1 Choosing the clustering termination conditions……………………... 68
      4.6.2 Classifier prediction accuracy results by rules generated from
            examples using the hierarchal clustering algorithm…………………. 70
      4.6.3 Classifier prediction accuracy results by rules generated from
            examples using equivalence classes clustering algorithm…………… 72
      4.6.4 Cluster description results……………………………………………. 73
   4.7 Results incorporation……………………………………………………… 74
      4.7.1 Identifying the most common tasks………………………………….. 74
      4.7.2 Finding how many clicks needed to finish each task………………... 75
      4.7.3 Presenting suggestions to enhance the website structure……………. 76
   4.8 Results discussion…………………………………………………………. 76
   4.9 Summary and conclusion…………………………………………………. 77

V. SYSTEM IMPLEMENTATION……………………………………………... 79
   5.1 Introduction……………………………………………………………….. 79
   5.2 Data preparation module………………………………………………….. 80
   5.3 Session identification module…………………………………………….. 81
   5.4 Clustering process module………………………………………………... 84
   5.5 Results presentation and evaluation module……………………………… 88
   5.6 Summary…………………………………………………………………... 90

VI. SUMMARY AND CONCLUSIONS………………………………………... 92

REFERENCES………………………………………………………………....... 96
LIST OF TABLES

Table                                                                                        Page

2.1 Selected dates for experimental results along with their major activity…... 20
2.2 The percentage of different file types in the selected data set…………….. 21
2.3 Requests status for the records in the web record…………………………. 21
2.4 Correlation coefficient for different models for the number of records per
    user probability……………………………………………………………. 28
2.5 Correlation coefficient for different models for inactive user time
    probability…………………………………………………………………. 28
2.6 Correlation coefficient for different models for recorded records per
    second probability…………………………………………………………. 29
3.1 Pairwise scores between different pages…………………………………... 36
3.2 Two sequences si and sj……………………………………………………. 37
3.3 MSCM algorithm major steps……………………………………………... 41
3.4 Matrix used to compute the minimum edit distance, when only the zeroth
    column and row are filled in………………………………………………. 42
3.5 Matrix used to compute the minimum edit distance, when the entire cells
    are filled in………………………………………………………………… 43
3.6 Distance measure between sessions using different methods……………... 44
4.1 Representing clustering results in a form of examples……………………. 47
4.2 Number of clusters for different session lengths at different iterations…… 58
4.3 Percentage of the number of the clusters from the initial number of
    clusters for different session lengths at different iterations……………….. 58
4.4 Example of a rule generated by a web sessions' classifier………………… 61
4.5 Decision produced by the clustering algorithm……………………………. 63
4.6 Certain rules learned from 4.5 using BLEM2……………………………... 65
4.7 Inference engine testing examples………………………………………… 67
4.8 Inference engine results along with results from cluster…………………... 68
6.1 Mathematical models for three website parameters……………………….. 93
LIST OF FIGURES

Figure                                                                                       Page

1.1 Proposed WUM system architecture……………………………………….. 4
2.1 Data preprocessing architecture……………………………………………. 10
2.2 Formal user identification problem statement……………………………... 12
2.3 Trivial user identification algorithm……………………………………….. 13
2.4 Active user-based user identification algorithm…………………………… 15
2.5 Ontology-based session identification algorithm………………………….. 18
2.6 Monthly record counts recorded in the web log…………………………… 20
2.7 Active user-based user identification script……………………………….. 23
2.8 Histogram for sessions' lengths after session identification………………. 24
2.9 Histogram for the sessions' lengths before filtering………………………. 25
2.10 Probability of the number of records per user……………………………. 27
2.11 Probability of inactive user time in seconds……………………………… 28
2.12 Probability of recorded records per second………………………………. 29
4.1 Web usage classification and prediction workflow……………………….. 48
4.2 Two well separated clusters with intermediate chain……………………… 51
4.3 Hierarchal clustering algorithm……………………………………………. 55
4.4 Equivalence classes clustering algorithm………………………………….. 57
4.5 Percentage of the number of the clusters from the initial number of
    clusters for a specific session length at different iterations using the
    average linkage method…………………………………………………… 60
4.6 Percentage of the number of the clusters from the initial number of
    clusters for a specific session length at different iterations using the
    Ward's method……………………………………………………………. 60
4.7 Holdout classifier accuracy estimator……………………………………… 67
4.8 Percentage of the number of the clusters from the initial number of
    clusters for different session length groups at different iterations………... 70
4.9 Average accuracy for different session lengths at the 100% number of
    clusters using examples from the hierarchal clustering algorithm………... 71
4.10 Average accuracy for different session lengths at the 15.69% number of
    clusters using examples from the hierarchal clustering algorithm………... 72
4.11 Average accuracy for different session lengths at the 15.69% number of
    clusters using examples from the equivalence classes clustering
    algorithm………………………………………………………………….. 73
4.12 Cluster description length for different session lengths using different
    classifiers…………………………………………………………………. 74
4.13 Seven most common tasks performed on the website…………………… 75
4.14 Sequence length distribution for "Class Search Detail"…………………. 76
5.1 Data flow diagram for the web usage mining system……………………... 80
5.2 Entity relation model for data preparation………………………………… 81
5.3 Use case diagram for session identification………………………………. 82
5.4 Session identification module user interface……………………………… 84
5.5 Use case diagram for clustering process module………………………….. 85
5.6 UML diagram for the clustering process………………………………….. 85
5.7 Sequence diagram for generating dissimilarity matrix……………………. 86
5.8 Sequence diagram finding clusters………………………………………… 86
5.9 Clustering module user interface………………………………………….. 88
5.10 Dataflow diagram for the results presentation and evaluation module….. 90
CHAPTER I
INTRODUCTION
The World Wide Web has greatly impacted every aspect of our societies and our
lives. This ranges from information dissemination to communication, and from e-commerce to process management. By browsing through a website, users complete
different tasks, such as buying products, registering for classes, and attending classes online. Web Usage Mining (WUM), a new field that analyzes the navigation process, has
emerged in recent years. WUM is defined as the application of data mining techniques to the logged interactions between users and a website [1]. Analysis of an interaction log file can provide useful information that helps a website engineer enhance the website structure in a way that makes future website usage easier and faster.
In this dissertation, we are interested in the clustering of web users’ sessions in the
context of web applications, such as registration web-based systems, distance-education
web-based systems, e-commerce sites, and other web-based applications. Clustering web users' sessions means grouping users with similar navigation behaviors. Our goal
is to use the clustering results to identify dominant browsing behaviors, evaluate a
website structure and predict future users’ browsing behaviors to better assist users in
their future browsing experiences.
1.1 Motivation
In the process of designing a web application, it is hard to predict how users will use the website to complete different tasks. Web designers can choose to make a certain task easier to complete than others by constructing the website structure in a particular way. After publishing the website online and having users interact with it for a while, it becomes time to review certain decisions concerning the website structure.
Such decisions can be made by analyzing the interaction log between users and the
website. A deep conceptual analysis of the interaction log is required to understand what
the most common tasks performed on the website are, how the majority of users navigate the
website to achieve such common tasks, and what changes can be made to the website
structure to make the completion of the common tasks easier and faster. For example, if
we have a registration website—where users can do different tasks, such as check grades,
add classes, drop classes, and pay tuition fees—there is a need for a system to determine
what the most common tasks are, and how easily they can be achieved by users. So, if we
find that, at a certain point in time, the grade checking process is the most common task,
and it takes users a long time to finish it, the website engineer should be advised to enhance the website structure to make this task easier and faster.
1.2 Previous work
Available commercial web usage mining systems, such as Surfaid [2], Net Tracker
[3], and WebTrends [4], give statistical information about the website, such as the
average usage hits, geographical distribution of users, and the most frequent page hit.
These are considered to be statistically significant results rather than conceptual results.
For example, if we conclude that during a certain time a given number of users hit a
website, this gives no insight into the hidden usage patterns. Other published work on
web usage mining used different data mining techniques such as association rules [5],
clustering, and classification. In our research, we focus on work that uses clustering and classification techniques. For example, Fu et al. [6] used the BIRCH
[7] clustering algorithm to cluster users’ sessions. However, they did not discuss how the
closeness between different sessions was defined, and they did not show how they chose
the maximum difference allowed between sessions in the same cluster. Foss et al. [8]
presented a novel clustering algorithm that clusters users’ sessions. Their clustering
algorithm did not require any input parameter from the users, such as the final number of
clusters, or the maximum difference allowed between sessions in the same cluster. However, the way they measured similarity did not consider the order of the pages. For example, they
considered a session consisting of pages, say A, B, and C, identical to a session
consisting of pages, say A, B, C, and D, or any other number of pages that contains the
pages A, B, and C.
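The order- and length-insensitivity criticized above can be illustrated with a toy containment-style measure; this sketch is an assumption made for illustration, not the cited algorithm's actual similarity function:

```python
# Toy containment-style similarity: two sessions are judged "identical"
# whenever one contains all the pages of the other, regardless of order
# or of any extra pages. This is an illustrative sketch only.

def contains_all(session, reference):
    """True if every page of `reference` appears somewhere in `session`."""
    return set(reference) <= set(session)

s1 = ["A", "B", "C"]
s2 = ["A", "B", "C", "D"]   # extra page D
s3 = ["C", "B", "A"]        # same pages, reversed order

# All three sessions are judged identical to the reference {A, B, C}:
same = [contains_all(s, s1) for s in (s1, s2, s3)]
```

A measure like this cannot distinguish the navigation path A→B→C from C→B→A, which is exactly the information a task-oriented analysis needs.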
1.3 Proposed WUM system architecture
As shown in Figure 1.1, the proposed WUM system is divided into four phases:
preprocessing, dissimilarity measure, clustering analysis, and results incorporation and
evaluation. In the preprocessing phase, the raw web logs are filtered of unrelated web requests, and the records that belong to the same user are then grouped together in one set.
the dissimilarity measure phase, the users’ records are divided into one or more sessions
and a dissimilarity matrix, which reflects the dissimilarity between different sessions, is
constructed. In the clustering analysis phase, sessions with similar browsing behaviors
are grouped together. In the last phase—the results incorporation and evaluation phase—
clustering results are incorporated to predict future users’ classes, and present suggestions
to enhance the website structure to adequately better serve website users in their future
visits.
[Figure 1.1 depicts the four phases as a flowchart: web log data → Phase 1, preprocessing (data filtered; records grouped into users) → Phase 2, dissimilarity measure (users divided into one or more sessions; dissimilarity matrix constructed) → Phase 3, clustering analysis (sessions with the same browsing behavior grouped together) → Phase 4, results incorporation and evaluation (results evaluated and incorporated to enhance the website structure).]

Figure 1.1 Proposed WUM system architecture
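The four phases can be sketched end-to-end as a pipeline; the function names and the toy set-difference dissimilarity below are illustrative assumptions, not the algorithms developed in the later chapters:

```python
# Minimal sketch of the four-phase WUM pipeline. Function names and the
# toy dissimilarity metric are illustrative placeholders.

def phase1_preprocess(raw_log, page_types=(".html", ".jsp", ".asp", ".php")):
    """Phase 1: filter unrelated requests and group records by user (IP)."""
    users = {}
    for rec in raw_log:
        if rec["url"].endswith(page_types):
            users.setdefault(rec["ip"], []).append(rec["url"])
    return users

def phase2_dissimilarity(sessions):
    """Phase 2: pairwise dissimilarity matrix (toy symmetric-difference metric)."""
    n = len(sessions)
    return [[len(set(sessions[i]) ^ set(sessions[j])) for j in range(n)]
            for i in range(n)]

def phase3_cluster(sessions, dmatrix, threshold):
    """Phase 3: group sessions whose pairwise dissimilarity stays below a threshold."""
    clusters = []
    for i in range(len(sessions)):
        for c in clusters:
            if all(dmatrix[i][j] <= threshold for j in c):
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Phase 4 (results incorporation and evaluation) would inspect the clusters
# and feed suggestions back to the website engineer.

log = [{"ip": "1.1.1.1", "url": "/grades.jsp"},
       {"ip": "2.2.2.2", "url": "/logo.gif"},     # non-page request: dropped
       {"ip": "2.2.2.2", "url": "/grades.jsp"}]
users = phase1_preprocess(log)
sessions = list(users.values())
clusters = phase3_cluster(sessions, phase2_dissimilarity(sessions), threshold=0)
```

In this toy run the image request is filtered out in Phase 1, the two remaining single-page sessions are identical, and Phase 3 merges them into one cluster.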
1.4 Main contributions
In each WUM phase presented in Section 1.3, we have one or more new
contributions to the WUM field.
In the preprocessing phase, we present a fast active user-based user identification
algorithm with time complexity O(n). For the session identification phase, we present an
ontology-based session identification algorithm that utilizes the website structure and its
functionalities in identifying different sessions. In addition, we present extra cleaning
steps, such as removing housekeeping pages, removing redundant pages, and grouping
sessions with similar session lengths. We also present three mathematical models for the
parameters on which our user-identification algorithm depends.
In the dissimilarity measure phase, we present a new Multidimensional Session Comparison Method (MSCM) using dynamic programming. Our method takes into
consideration different session dimensions, such as the page list, the time spent on each
page, and the length of the session. This is in contrast to other algorithms that treat
sessions as sets of visited pages within a time period and do not consider the sequence of
the click-stream visitation or the session length.
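The flavor of such a multidimensional comparison can be illustrated with a generic dynamic-programming alignment; the cost model below (page mismatch plus normalized time skew) is an assumption for illustration only, not the MSCM cost model presented in Chapter 3:

```python
# Generic DP alignment over sessions of (page, seconds) pairs. The
# substitution cost mixes page identity with normalized time difference,
# so both the click-stream order and the time dimension matter.
# Illustrative sketch; not the dissertation's exact algorithm.

def session_distance(s, t, gap=1.0):
    m, n = len(s), len(t)
    # D[i][j] = distance between the first i items of s and first j of t
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            (p1, t1), (p2, t2) = s[i - 1], t[j - 1]
            # page mismatch costs 1; matching pages still pay for time skew
            sub = (0.0 if p1 == p2 else 1.0) + abs(t1 - t2) / max(t1, t2, 1)
            D[i][j] = min(D[i - 1][j] + gap,      # delete
                          D[i][j - 1] + gap,      # insert
                          D[i - 1][j - 1] + sub)  # substitute
    return D[m][n]
```

Identical sessions score 0, a shorter session pays a gap cost per missing click, and two visits to the same page with very different dwell times are no longer treated as a perfect match.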
In the web sessions clustering analysis phase, we present two clustering algorithms:
a hierarchal clustering algorithm and an equivalence classes clustering algorithm. The
equivalence classes clustering algorithm does not depend on the seed starting point of the
clustering process. We also present a new method to determine the clustering parameter
that in turn determines where the clustering algorithm should stop.
In the results incorporation and presentation phase, we present a rough set approach for predicting future users' classes, and we present the results in a form that can be incorporated into the web server to make such predictions. We also present an
evaluation process that evaluates the accuracy of the predicted classes. Finally, we show
how results can be incorporated to enhance the website structure to better serve future
website users.
1.5 Research objective
Our main objective in this research is to present a WUM system that uses clustering
algorithms along with a rough set learning approach. This improved WUM system
provides a deep conceptual understanding of a website's usage behavior. It can be used by the website engineer to evaluate and enhance the website structure and to predict "what the user was trying to do," to better assist users in their future browsing experiences. This should lead to websites that are easier and more convenient for users to
navigate.
1.6 Structure of the dissertation
The rest of the dissertation is organized as follows. In Chapter 2, we present the data
preprocessing phase. In Chapter 3, we present the MSCM method. In Chapter 4, we
present the four main steps incorporated in the WUM system: the clustering algorithms,
the clustering results presentation, evaluating the results and incorporating the results to
improve the website structure. In Chapter 5, we present an overview of the system
implementation. In Chapter 6, we present conclusions drawn from the research and
recommendations for future work.
CHAPTER II
WEB LOG DATA PREPROCESSING FOR WEB USAGE MINING
Web usage mining is “the application of data mining techniques to large Web data
repositories in order to extract usage patterns” [1]. Web log files contain data that need
some cleaning since their formats were meant for debugging purposes only [9]. In this
chapter, we present new techniques for preprocessing web log data and for identifying
unique users and sessions from the data. We present a fast active user-based user
identification algorithm with time complexity O(n). The algorithm uses both the IP address and a finite user inactive time to identify distinct users in the web log. For the
session identification, we present an ontology-based session identification method that
utilizes the website structure and functionalities to identify different sessions. In addition,
we present extra cleaning steps such as removing housekeeping pages, removing
redundant pages, and grouping sessions with similar session lengths. Finally, we present
three mathematical models for the website parameters on which our active user-based
user identification algorithm depends.
2.1 Introduction
Web usage mining is “the application of data mining techniques to large Web data
repositories in order to extract usage patterns” [1]. Data mining techniques—such as
association rule mining, sequential patterns or clustering analysis—cannot be applied
directly to raw web log data, since the format of this data was designed for debugging purposes [9]. Five preprocessing steps have been identified [10]:
1. Data cleaning: This step removes irrelevant data, such as log records for images,
scripts, help files, and cascade style sheets. Only data that is relevant to the
mining process is kept.
2. User identification: This step groups together the records for the same user. Log records are written sequentially as they arrive from different users (i.e., records for a specific user are not necessarily consecutive, since they may be interleaved with records from other users).
3. Session identification: This step divides the page accesses of each user into individual sessions.
4. Path completion: This step determines whether there are important accesses that are not recorded in the access log due to caching at several levels.
5. Formatting: This step formats the data to be readable by data mining systems.
In this chapter, we present a detailed data processing architecture that includes data
cleaning, user identification, session identification and data filtering. Our main
contributions in this chapter include a user identification algorithm that runs in O(n) time, an ontology-based session identification algorithm, and three
mathematical models for the website parameters on which our user identification
algorithm depends.
The rest of the chapter is organized as follows. In Section 2, we survey the previous
work on web usage mining preprocessing techniques. In Section 3, we present our data
preprocessing architecture along with a detailed description for each step. In Section 4,
we present experimental results. Section 5 presents the mathematical models for different website parameters. Section 6 summarizes the chapter.
2.2 Previous work
Previous work on preprocessing web logs emphasized the caching problem, since caching produces incomplete web logs. One solution is to collect the data on the
client side. For example, Shahabi [11] collected almost all the user interactions with the
browser. Fenstermacher and Ginsburg [12] went beyond the browser interaction and
recorded the interaction between the user and some other applications. Catledge and
Pitkow [13] presented another system that collected data on the client side. The use of
these methods imposes the issue of security. Moreover, most of the proposed methods
require special browsers and setups.
Other works, such as [14] and [15], assumed that the filtered web server log is a good representation of web usage and that there is no need for heuristic methods, such as the path completion process [10], to complete the sequences.
In the preprocessing step of their work [16], Yan et al. converted the information in user access logs into a vector representation. The vector representation combines the pages accessed with the amount of interest a user shows in each page, which was calculated by counting the number of times the page was accessed.
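Such an interest vector can be sketched as follows; the page names are hypothetical, and this is only an illustration of the general idea, not Yan et al.'s exact formulation:

```python
from collections import Counter

# Toy interest-vector representation: one dimension per site page, valued
# by how many times the user accessed that page. Page names are hypothetical.

PAGES = ["/home", "/grades", "/register", "/pay"]

def interest_vector(session):
    counts = Counter(session)
    return [counts.get(p, 0) for p in PAGES]

v = interest_vector(["/home", "/grades", "/grades", "/register"])
```

Note that, like any bag-of-pages representation, this vector discards the visitation order, which motivates the sequence-aware comparison developed in Chapter 3.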
Path completion [10] identifies missing records in users’ sessions using a heuristic
method, which is based on the web structure. Transactions were identified either by
reference length, which is based on the time spent on the page, or by maximal forward
reference, which is based on the first backward action (hitting the back button on the
browser) after a series of forward actions (normal forward navigation). Other heuristic
methods like time-oriented heuristics [17] and navigation-oriented heuristics [18] were
used to identify different sessions.
It can be concluded that previous work did not identify a specific algorithm for user identification; rather, it assumed that users' records are readily available in the website log.
2.3 Data preprocessing architecture
As shown in Figure 2.1, we identified four steps in the data preprocessing phase:
data cleaning, user identification, session identification and data filtering. The following
subsections provide details on each step.
[Figure 2.1 depicts the flow: records from the web log → data cleaning → user identification → session identification → data filtering → data mining technique.]
Figure 2.1 Data preprocessing architecture
2.3.1 Data cleaning
Web logs are designed for debugging purposes in that the web accesses are recorded
in the order they arrive [9]. Due to the connectionless nature of HTTP (i.e., each request is handled in a separate connection), web log records for a single user do not necessarily appear contiguously, since they may be interleaved with records from other users. In addition, a separate record is written to the web log file for each page component, such as an image, a cascading style sheet, an HTML file, or a script. Each record in the web log file usually has the following standard format [19]:
• Remotehost, which is the remote hostname or its IP address;
• Logname, which is the remote logname of the user;
• Date, which is the date and time of the request;
• Request, which is the exact request line as it came from the user;
• Status, which is the HTTP status code returned to the client; and
• Byte, which is the content-length of the document transferred.
Usually, for web mining purposes the only interesting elements are the HTML pages
and the scripting pages—such as JSP, ASP or PHP pages—unless other file types play a navigation role in the web application and are part of the web structure. In the cleaning phase, the file types that are related to the navigation structure are kept and
other files are eliminated. The status field in the web log can be used to keep the
successfully fulfilled requests and to delete the unsuccessful requests. Finally, the mining
process can be limited to a certain time or date range, so that only web traffic during that period is considered.
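A minimal cleaning pass reflecting these criteria might look like the following sketch; the record fields mirror the standard format above, while the kept file types and the date window are illustrative assumptions:

```python
from datetime import datetime

# Keep only navigation-relevant, successfully served requests inside a
# chosen date window. The extensions and the window are illustrative choices.
KEEP_EXT = (".html", ".jsp", ".asp", ".php")

def clean(records, start, end):
    kept = []
    for r in records:
        url = r["request"].split()[1]          # "GET /page.jsp HTTP/1.0"
        if (url.endswith(KEEP_EXT)
                and 200 <= r["status"] < 300   # successful requests only
                and start <= r["date"] <= end):
            kept.append(r)
    return kept

recs = [
    {"request": "GET /grades.jsp HTTP/1.0", "status": 200,
     "date": datetime(2004, 8, 19, 10, 0)},
    {"request": "GET /logo.gif HTTP/1.0", "status": 200,
     "date": datetime(2004, 8, 19, 10, 0)},   # image: dropped
    {"request": "GET /pay.jsp HTTP/1.0", "status": 404,
     "date": datetime(2004, 8, 19, 10, 1)},   # failed request: dropped
]
kept = clean(recs, datetime(2004, 8, 19), datetime(2004, 8, 20))
```

Only the successfully served scripting page inside the window survives the pass.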
2.3.2 User identification
A user is defined as a unique client to the server during a specific period of time.
The relationship between users and web log records is one to many (i.e., each user is
identified by one or more records). Users are identified based on two assumptions:
1. Each user has a unique IP address while browsing the website. The same IP
address can be assigned to other users after the user finishes browsing.
2. The user may stay in an inactive state for a finite time after which it is assumed
that the user left the website.
Next, we formally present the problem statement of the user identification, and then
we present two different algorithms for user identification: a trivial algorithm and our
new active user-based one.
2.3.2.1 User identification problem statement
Figure 2.2 shows a formal description of the user identification problem. As stated
earlier, the user identification algorithm must identify each user's records based on the
assumption that all of a user's records have the same IP address and that the user's
inactive browsing time is bounded by a finite value, β.

Given web log records R = ⟨r1, …, rk⟩, where k > 0 is the total number of records
in the web log database, and ∀r ∈ R, r is defined as
    r = ⟨date_time, c_ip, s_ip, s_port, cs_method, url, url_query, status, s_agent⟩,
find users U = ⟨u1, …, uj⟩ such that ∀u ∈ U, u is defined as
    u = ⟨c_ip, last_date_time, {rs, …, re}⟩
where ∀r ∈ u, r.c_ip = u.c_ip and r.date_time ≤ last_date_time + β at the time
record r is added to the user u, and:
    c_ip is the user's IP address
    last_date_time is the date and time of the user's last access
    β is the maximum user idle time
    rs is the first record the user accessed in a single visit to the website
    re is the last record the user accessed in a single visit to the website
Figure 2.2 Formal user identification problem statement
2.3.2.2 A trivial user identification algorithm
Figure 2.3 shows the trivial user identification algorithm. The algorithm has two
loops: an outer loop and an inner loop. The outer loop runs n times, where n is the total
number of records. The inner loop runs i times, where i is the current number of users. In
the worst case, each user has only one record, which leads to two loops, the outer and the
inner, each of size n. Hence, the overall time complexity of the algorithm is
O(n × n) = O(n²).
Assumption:
    n : number of records in the web log
Define:
    R : website records
    U : users' records
    R(i) : ith web log record
Initialize U = ∅
u_first.c_ip = R(1).c_ip
u_first.last_date_time = R(1).date_time
u_first.r = {R(1)}
U = U ∪ {u_first}
for each record r ∈ R
    found = false
    for each user u ∈ U
        if (r.c_ip = u.c_ip AND r.date_time ≤ u.last_date_time + β)
            u.r = u.r ∪ {r}
            if (r.date_time > u.last_date_time)
                u.last_date_time = r.date_time
            endif
            found = true
        endif
    endfor
    if (not found)
        u_new.c_ip = r.c_ip
        u_new.last_date_time = r.date_time
        u_new.r = {r}
        U = U ∪ {u_new}
    endif
endfor
Figure 2.3 Trivial user identification algorithm
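The trivial algorithm of Figure 2.3 can be sketched as follows; the (date_time, c_ip) tuples are an illustrative simplification of the full web log record. Scanning every previously seen user for every record is what produces the O(n²) worst case.

```python
def identify_users_trivial(records, beta):
    """records: (date_time, c_ip) tuples sorted by date_time.
    Returns one dict per identified user with the indices of its records."""
    users = []
    for i, (t, ip) in enumerate(records):
        for u in users:                      # scan ALL users seen so far
            if u["c_ip"] == ip and t <= u["last"] + beta:
                u["records"].append(i)
                u["last"] = max(u["last"], t)
                break
        else:                                # no active match: new user
            users.append({"c_ip": ip, "last": t, "records": [i]})
    return users
```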
2.3.2.3 The active user-based user identification algorithm
The algorithm shown in Figure 2.4 is a modified version of the algorithm described
in Section 2.3.2.2. We limited the inner loop search to the active users only. Active users
are defined as users who did not exceed the maximum inactive time, and hence they are
considered to be still browsing the website and more records are likely to be added to
their navigation records.
Time complexity analysis again shows two loops, an outer and an inner loop. The
outer loop runs n times, where n is the total number of records. The inner loop runs i
times, where i is the current number of active users. According to the assumptions given
at the beginning of the algorithm, the number of active users cannot exceed (m · k · t) in
the worst case, where m is the number of records per user, k is the rate at which records
are recorded in the web log, and t is the inactive browsing time. So, the algorithm breaks
down to an outer loop of size n and an inner loop of size at most (m · k · t), which is
constant. Therefore, the overall complexity of the algorithm becomes
O((m · k · t) · n) = O(const · n) = O(n).
Assumption:
    n : number of records in the web log
    m : number of records per user
    k : records/second recorded in the web log
    t : inactive time in seconds for a user
    β : maximum inactive time in seconds for a user
Define:
    U_A : active users (users who are still browsing)
    U_I : idle users (users who stopped browsing)
Initialize U_A = {user built from the first record in R}
Initialize U_I = ∅
for each record r ∈ R do    // starting at the 2nd record
    found = false
    for each user u ∈ U_A do
        if (r.date_time > u.last_date_time + β)
            U_A = U_A − {u}    // remove the user from the active users list
            U_I = U_I ∪ {u}    // add the user to the idle users list
        else if (r.c_ip = u.c_ip)
            u.r = u.r ∪ {r}
            if (r.date_time > u.last_date_time)
                u.last_date_time = r.date_time
            endif
            found = true
        endif
    endfor
    if (not found)
        u_new.c_ip = r.c_ip
        u_new.last_date_time = r.date_time
        u_new.r = {r}
        U_A = U_A ∪ {u_new}    // add the new user to the active users list
    endif
endfor
U_I = U_I ∪ U_A    // add the remaining users to the idle users list
Figure 2.4 Active user-based user identification algorithm
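Under the same simplified record layout, the active user-based algorithm can be sketched as follows; it differs from the trivial algorithm of Figure 2.3 only in that users idle for longer than β are retired to an idle list and never rescanned, which keeps the inner loop small.

```python
def identify_users_active(records, beta):
    """records: (date_time, c_ip) tuples sorted by date_time.
    Retires timed-out users so each record scans only the active set."""
    active, idle = [], []
    for i, (t, ip) in enumerate(records):
        still_active, hit = [], None
        for u in active:
            if t > u["last"] + beta:         # user timed out: retire it
                idle.append(u)
            else:
                still_active.append(u)
                if hit is None and u["c_ip"] == ip:
                    hit = u
        active = still_active
        if hit is not None:
            hit["records"].append(i)
            hit["last"] = max(hit["last"], t)
        else:                                # no active match: new user
            active.append({"c_ip": ip, "last": t, "records": [i]})
    return idle + active                     # flush the remaining users
```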
2.3.3 Ontology-based session identification
A session is defined as the stream of mouse clicks whereby a user is trying to
perform a specific task. In our research, we compare task-specific browsing behaviors.
For example, assume two users, A and B, performed the following tasks: user A searched
for classes and checked his grades, whereas user B paid his tuition fees and searched for
classes.
The user identification process identifies users A and B as two separate users with
totally different behaviors. However, if we divide each user's visit into sessions, user A
has two sessions (searching for classes and checking grades) and user B also has two
sessions (searching for classes and paying tuition). Users A and B are thus partially
similar, since both searched for classes, rather than totally different as the user
identification process alone would suggest.
We identify different sessions in a single user visit using the website ontology. We
also assume that the website ontology is already available through methods of retrieving
website ontology like the ones in [20-22].
The website ontology is defined as W = (P, L, F), where:
    P : website pages,
    L : website links,
    F : website functionalities.
The pages are P = ⟨p1, …, pk⟩, where k is the number of pages in the website. The links
L are the set of links in the web application; each link l = ⟨ps, pd⟩ is defined by two
pages: the source page (ps), where the link starts, and the destination page (pd), where
the link ends. The web functionalities are defined as F = ⟨f0, f1, …, fn−1⟩, where
∀f ∈ F, f = ⟨ps, …, pe⟩. Each web functionality f consists of at least two pages, a start
page and an end page, with zero or more pages between them. The session identification
algorithm divides the users identified in Section 2.3.2 into sessions using the website
functionalities. From the website functionalities, we can identify the pages that are
considered breaking points for a session, such as the sign-in or sign-out pages.
Figure 2.5 shows the ontology-based session identification algorithm, where B is the
set of the breaking pages. The algorithm splits each user into one or more sessions and
returns a final list of sessions S. The time complexity analysis of the algorithm shows two
loops: an inner and an outer loop. The inner loop depends on the number of records per
user, m, and the outer loop depends on the total number of users, j. It can be easily
concluded that m • j = n , where n is the total number of records. So, the overall time
complexity of the algorithm is O(n).
Initialize S = ∅
for each user u ∈ U do
    for each page p ∈ u do
        if (p ∈ B)
            split u at the location of p
            s_new = the first part of u
            u = the remaining part of u
            S = S ∪ {s_new}
        endif
    endfor
    S = S ∪ {u}
endfor
Figure 2.5 Ontology-based session identification algorithm
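The splitting logic of Figure 2.5 can be sketched as follows, with each user's visit represented as a plain list of pages and the breaking pages (e.g. sign-in and sign-out) passed in as a set; both representations are illustrative simplifications.

```python
def split_sessions(users, breaking_pages):
    """Split each user's page sequence into sessions at the breaking pages.
    Runs in O(n), where n is the total number of records."""
    sessions = []
    for pages in users:
        current = []
        for p in pages:
            current.append(p)
            if p in breaking_pages:   # a breaking point closes the session
                sessions.append(current)
                current = []
        if current:                   # remainder of the visit
            sessions.append(current)
    return sessions
```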
2.3.4 Data filtering
After we identify different users’ sessions, filtering is done based on removing the
housekeeping pages. The housekeeping pages are the pages that are necessary for the web
application to run properly. They are not called directly by the user; rather, they are
called internally by the requested page. These pages are identified by the website
engineer and can be found using the website ontology. Removing the housekeeping pages
can result in redundant pages, which can mislead the sequence comparison method that
we present in Chapter 3. To illustrate this, consider the following two sequences:
Sequence 1: p2 → p1 → p3 → p4 → p5 → p1 → p5 → p6
Sequence 2: p2 → p7 → p8 → p4 → p5 → p6
It can be seen that these two sequences are totally different. However, assume that
pages p1, p3, p7, and p8 are housekeeping pages and apply the following two-step
filtering process:
Step 1: Removing the housekeeping pages
Sequence 1 becomes p2 → p4 → p5 → p5 → p6
Sequence 2 becomes p2 → p4 → p5 → p6
Step 2: Removing the redundant pages
Sequence 1 becomes p2 → p4 → p5 → p6
Sequence 2 becomes p2 → p4 → p5 → p6
The outcome of the two-step filtering process shows how sequences that at first look
strikingly different are actually similar.
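The two-step filtering process can be sketched directly on the example above; the page names are plain strings purely for illustration.

```python
def filter_session(pages, housekeeping):
    """Step 1: drop the housekeeping pages. Step 2: collapse the
    consecutive duplicates that step 1 may create."""
    kept = [p for p in pages if p not in housekeeping]
    out = []
    for p in kept:
        if not out or out[-1] != p:   # skip a page equal to its predecessor
            out.append(p)
    return out
```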
2.4 Experimental results
For experimental results, we used data obtained from the University of Akron
registration website log files during the period from October 2003 to September 2004. In
this section, we present the results from each preprocessing step mentioned in the
previous section.
2.4.1 Data overview
The total number of records recorded in the web server for the time period
mentioned was 28,294,229 records. Each record in the web log file represents a page
request processed by the web server. Figure 2.6 shows the traffic volume on the web
server over the selected time period. The figure shows high web traffic during months
with major activity, such as the start of registration, the release of final grades, or the
beginning of a semester. For example, it is clear from the figure that there was high
traffic volume in January, just after the release of final grades for the fall semester and
just before the beginning of the spring semester.
[Figure: monthly record counts in thousands (0 to 450) from October 2003 to September 2004.]
Figure 2.6 Monthly record counts recorded in the web log
2.4.2 Data selection process
For experimental purposes, we selected the data records for the days in which there
was major activity on the web server. Table 2.1 shows the selected dates along with the
major activity. The total number of records for the selected dates was 1,582,292, which
represents 5.6% of the total records.
Table 2.1 Selected dates for experimental results along with their major activity

Date                          Major Activity
Monday, December 15, 2003     Teachers upload student grades
Tuesday, December 16, 2003    Final grades due for fall semester 2003
Monday, May 10, 2004          Teachers upload student grades
Tuesday, May 11, 2004         Final grades due for spring semester 2004
Friday, February 20, 2004     Summer semester registration begins
Friday, October 24, 2003      Spring semester registration begins
Friday, April 02, 2004        Fall semester registration begins
2.4.3 Data cleaning results
Table 2.2 shows the percentage of different file types in the selected data set. Since
we are interested in the scripting files that imply a direct request by the user, we kept the
ASP and HTML file types and we removed other file types.
Table 2.2 The percentage of different file types in the selected data set

File Type      Count        Percentage
HTML           33,307       2.10%
DLL            186          0.01%
No extension   2,148        0.14%
PHP            2            0.00%
HTM            6,938        0.44%
TXT            68           0.00%
ICO            645          0.04%
JPG            1,510        0.10%
ASP            1,537,485    97.17%
JS             1            0.00%
XML            2            0.00%
Total          1,582,292    100.00%
The second cleaning step was to remove the incomplete requests, which can be
identified using the status code in the log file described in Section 2.3.1. Table 2.3 shows
the status code, the code description, the total number of records, and the percentage of
records with each status code. We kept the records with an HTTP status code of 200
(OK) and removed the other records, leaving 76% of the total records.
Table 2.3 Request status for the records in the web log

Status Code   Code Description          Page Count   Percentage
200           OK                        1,206,982    76.28%
206           Partial Content           20           0.00%
207           Multi-Status              3            0.00%
302           Found (redirect)          322,728      20.40%
304           Not Modified              17,826       1.13%
400           Bad Request               74           0.00%
403           Forbidden                 17,491       1.11%
404           Not Found                 1,536        0.10%
500           Internal Server Error     15,622       0.99%
501           Not Implemented           10           0.00%
Total                                   1,582,292    100.00%
2.4.4 User identification results
We loaded the selected data into a single table in an SQL database. Out of the six
fields described in Section 2.3.1, we selected three fields (Remotehost, Date, and URL).
Then, we ran the user identification script shown in Figure 2.7, which is based on the
algorithm described in Section 2.3.2. To show the effectiveness of the active user-based
algorithm, we ran both the active user-based and the trivial user identification algorithms
on the same records and repeated the experiment with different web log sizes. The
active user-based algorithm shows much better performance than the trivial algorithm
even for small web log sizes. For example, for 100 web log records, the trivial algorithm
took 527 seconds to identify the users' sequences, while the active user-based algorithm
took 8 seconds. For the full log (1,582,292 records), the trivial algorithm ran for
about two days and was aborted by the operating system, apparently because of
memory build-up, without giving any result, whereas the active user-based algorithm
took only three hours and 33 minutes to yield the results.
DECLARE @RecordId int
DECLARE @Date DateTime
DECLARE @IPAddress varchar(255)
DECLARE @FoundUser_id int
DECLARE @NewUserId int

DECLARE UserCursor CURSOR FOR
    SELECT id, date, c_ip FROM guest.weblog ORDER BY date
OPEN UserCursor
FETCH NEXT FROM UserCursor INTO @RecordId, @Date, @IPAddress
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Delete old (more than 30 minutes old) records from the active users
    DELETE FROM guest.active_users WHERE DATEADD(minute, 30, date) < @Date
    -- See if there is an active user with the same IP address
    -- (the assignment yields NULL when no such user exists)
    SET @FoundUser_id = (SELECT TOP 1 user_id FROM guest.active_users
                         WHERE c_ip = @IPAddress)
    IF @FoundUser_id IS NOT NULL
    BEGIN
        -- If yes, update the last access time ...
        UPDATE guest.active_users SET date = @Date
            WHERE c_ip = @IPAddress AND user_id = @FoundUser_id
        -- ... and insert a new value into the users table
        INSERT INTO guest.users(user_id, id) VALUES(@FoundUser_id, @RecordId)
    END
    ELSE
    BEGIN
        -- If no, insert a new item into the active users table,
        -- get the new user id, and then ...
        INSERT INTO guest.active_users(c_ip, date) VALUES(@IPAddress, @Date)
        SET @NewUserId = @@IDENTITY
        -- ... insert into the users table as well
        INSERT INTO guest.users(user_id, id) VALUES(@NewUserId, @RecordId)
    END
    FETCH NEXT FROM UserCursor INTO @RecordId, @Date, @IPAddress
END
CLOSE UserCursor
DEALLOCATE UserCursor

Figure 2.7 Active user-based user identification script
2.4.5 Session identification and data filtering results
Figure 2.8 shows the histogram of session lengths after session identification. The
session identification was done based on two breaking pages: sign-in and sign-out.
[Figure: histogram of session length (0 to 90) against frequency (0 to 15,000).]
Figure 2.8 Histogram for sessions' lengths after session identification
Grouping the sessions according to their length is important since some learning
algorithms require fixed session lengths; this is illustrated further in Chapter 4.
Figure 2.9 shows the histogram of session lengths before filtering. It is clear that
most sessions are of length 15 or less, and sessions with larger lengths are
considered outliers. So, we grouped the sessions into 15 session groups, where
each group contains sessions of the same length.
[Figure: histogram of session length (0 to 60) against frequency (0 to 10,000).]
Figure 2.9 Histogram for the sessions' lengths before filtering
2.5 Modeling website parameters
In this section, we discuss the statistical analysis methods applied to the
experimental results described earlier in Section 2.4. In the modeling, we emphasize the
website parameters on which our active user-based user identification algorithm,
described in Section 2.3.2.3, depends.
2.5.1 Distribution functions
We selected three distribution functions as candidates to represent the three website
parameters on which our active user-based user identification depends: the number of
records per user, the inactive user time, and the number of records recorded per second
in the web log. The three distribution functions are:

1. The power fit, given by

   F(x) = a·x^b    (2.1)

2. The reciprocal quadratic, given by

   F(x) = 1 / (a + b·x + c·x²)    (2.2)

3. The geometric fit, given by

   F(x) = a·x·b^x    (2.3)
2.5.2 Analytical results
One measure of the "goodness of fit" is the correlation coefficient. To explain the
meaning of this measure, we first consider the total spread of the data around the mean,

   St = Σ_{i=1}^{n} (yi − ȳ)²    (2.4)

where the average of the n data points, ȳ, is given by

   ȳ = (1/n) Σ_{i=1}^{n} yi    (2.5)

The quantity St measures the spread around a constant line (the mean), as opposed to
the spread around the regression model; it is the uncertainty in the dependent variable
prior to regression. We also consider the deviation from the fitting curve,

   Sr = Σ_{i=1}^{n} (yi − f(xi))²    (2.6)

Note the similarity of equation 2.6 to equation 2.4; this quantity measures the spread of
the points around the fitting function. Thus, the improvement (or error reduction) due to
describing the data in terms of a regression model can be quantified by subtracting the
two quantities presented in equations 2.4 and 2.6. Because the magnitude of the
difference depends on the scale of the data, the difference is normalized to yield

   r = sqrt((St − Sr) / St)    (2.7)

where r is the correlation coefficient. The better the regression model describes the data,
the closer the correlation coefficient approaches unity. For a perfect fit, Sr = 0 and the
correlation coefficient r = 1.
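This goodness-of-fit computation can be sketched as follows, using the square-root form of the correlation coefficient; the fitted model f stands for any of the candidate distributions (power, reciprocal quadratic, or geometric fit).

```python
import math

def correlation_coefficient(xs, ys, f):
    """r = sqrt((St - Sr) / St): St is the spread of the data around the
    mean, Sr the spread around the fitted curve f."""
    ybar = sum(ys) / len(ys)
    st = sum((y - ybar) ** 2 for y in ys)
    sr = sum((y - f(x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt((st - sr) / st)
```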
Next, we model the three website parameters: the number of records per user, the
inactive user time, and the number of records recorded per second in the web log. These
three parameters determine the speed of the active user-based algorithm presented in
Section 2.3.2.3.
2.5.2.1 Modeling number of records per user
Figure 2.10 shows the fitted probability distribution of the number of records per
user, and Table 2.4 shows the correlation coefficients of the different models for this
distribution. The power fit model is the most appropriate since its correlation coefficient
is the closest to one.
[Figure: log-log plot of probability against the number of records per user (1 to 10,000), showing the experimental data and the power fit model.]
Figure 2.10 Probability of the number of records per user
Table 2.4 Correlation coefficients of different models for the number of records per user probability

Model                   Correlation coefficient
Power fit               0.41
Reciprocal quadratic    ≈0.0
Geometric fit           ≈0.0
2.5.2.2 Modeling inactive user time
Figure 2.11 shows the probability of inactive user time in seconds. Table 2.5 shows
the correlation coefficient for different models for inactive user time probability. The
reciprocal quadratic model is the most appropriate model to fit the data since its
correlation coefficient is the closest to the value of one.
[Figure: log-log plot of probability against inactive user time in seconds (1 to 1,000,000), showing the experimental data and the reciprocal quadratic model.]
Figure 2.11 Probability of inactive user time in seconds
Table 2.5 Correlation coefficients of different models for inactive user time probability

Model                   Correlation coefficient
Power fit               ≈0.0
Reciprocal quadratic    0.92
Geometric fit           0.11
2.5.2.3 Modeling recorded records per second
Figure 2.12 shows the probability of recorded records per second, and Table 2.6
shows the correlation coefficients of the different models for this distribution. The
geometric fit model is the most appropriate since its correlation coefficient has the
value of one.
[Figure: plot of probability (0 to 0.2) against recorded records per second (0 to 100), showing the experimental data and the geometric fit.]
Figure 2.12 Probability of recorded records per second
Table 2.6 Correlation coefficients of different models for recorded records per second probability

Model                   Correlation coefficient
Power fit               0.89
Reciprocal quadratic    0.42
Geometric fit           1.00
2.6 Summary
In this chapter, we presented new techniques for preprocessing web log data
including identifying unique users and sessions. We presented a fast active user-based
user identification algorithm with time complexity of O(n). For session identification we
presented an ontology-based session identification algorithm that uses the website
structure to identify users’ sessions. We showed that the user identification algorithm
depends on three website parameters: number of records per user, inactive user time and
number of recorded records per second in the web log. In the chapters that follow, the
output of this preprocessing step will be used as input for different data mining
techniques. In our research, we focus on the clustering algorithm along with different
learning algorithms for presenting the clustering results.
CHAPTER III
MULTIDIMENSIONAL SESSIONS COMPARISON METHOD USING DYNAMIC
PROGRAMMING
In this chapter, we present a new Multidimensional Sessions Comparison Method
(MSCM) using dynamic programming. Our method takes into consideration different
session dimensions, such as the page list, the time spent on each page and the length of
each session. This is in contrast to other algorithms that treat sessions as sets of visited
pages within a time period and do not consider the sequence of the click-stream visitation
or the session length.
3.1 Introduction
The problem of sequence comparison, which is defined as the measure of how much
two or more sequences are similar to each other, has attracted researchers in different
fields such as molecular biology [23], speech recognition [24], string matching [25] and
traffic analysis studies [26].
In molecular biology, macromolecules are considered as long sequences of subunits
linked together sequentially. Comparing these sequences helps to answer important
questions in biology.
In speech recognition studies, speech is converted to a vector function of time,
which is considered a continuous sequence. Sequence comparison can be used in
different applications, such as recognizing an isolated word selected from a limited
vocabulary.
String matching represents each string as a sequence of characters. Sequence
comparison can be used in spell checking in word processing applications.
In the context of web usage mining, measuring similarities between web sequences,
or simply sessions, is an important step in the clustering process, since clustering groups
similar web sessions together.
In this chapter, we introduce a new method for measuring dissimilarities between
web sessions that takes into account the sequence of events in a click stream visitation,
the time spent in each event, and the length of the sessions. This method is used in our
clustering method discussed in the next chapter.
In the next section, we provide some necessary definitions. In Section 3, we present
the problem statement. In Section 4, we present related work on sequence comparison. In
Section 5, we present previous work done on session comparison. In Section 6 we present
our new Multidimensional Session Comparison Method (MSCM). In Section 7, we
present experimental results and analysis. In Section 8, we present summary conclusions.
3.2 Definitions
We define the list of sessions S = ⟨s1, …, sk⟩, where each session
si = {⟨p1, …, pm⟩, ⟨t1, …, tm⟩} is defined by two lists of entities: a list of m pages
⟨p1, …, pm⟩ and a list of m time values ⟨t1, …, tm⟩. The page list records the pages
visited by the user, and the time list records the time spent on each page. We also define
the operator |s|, which returns the number of items in the sequence s, and two functions
s.p(x) and s.t(x), which return the page and the time spent at position x,
respectively.
3.3 Problem statement
The objective is to find a distance function D, defined over S × S, where D(si, sj) is
a numeric value that shows the extent to which the sessions si and sj are similar.
3.4 Related work
In this section, we present the well-established algorithms that can be used in the
context of session comparison. Most of these algorithms were presented in the field of
string matching.
3.4.1 Exact sequence matching
In this method, the distance function is defined as a Boolean function where the
function returns true when there is an exact match between si and s j and returns false
otherwise, or expressed as follows:
D(si, sj) = true if si.p(x) = sj.p(x) and si.t(x) = sj.t(x) and |si| = |sj|,
∀x ≤ max(|si|, |sj|); false otherwise.    (3.1)
In most sequence matching problems, the sequences do not match exactly; rather,
they are similar to a certain extent. Consequently, the method of equation 3.1 returns no
matches and does not recognize any similarities between sequences. For this reason, such
a method is considered impractical and is rarely used.
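Equation 3.1 amounts to a strict equality test over both dimensions; representing a session as a dict with page and time lists is an illustrative assumption.

```python
def exact_match(si, sj):
    """Boolean distance of equation 3.1: true only when the sessions have
    equal length and identical page and time lists."""
    return si["p"] == sj["p"] and si["t"] == sj["t"]
```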
3.4.2 Approximate one dimension sequence matching
The idea behind approximate one dimensional sequence matching is based on
limiting the sequence definition to one list of entities. For the web sessions case, the
sequence definition is limited to the list of pages. The matching is defined by a numeric
value that is greater than or equal to zero. A value of zero represents an exact match, and
the value increases as the difference between the sequences increases.
There are two major ways to compare the sequences. One is based on measuring the
differences of the sequences’ items, while the other is based on measuring the similarities
of the sequences’ items.
3.4.2.1 Measuring difference distance
The distance between sessions si and s j is defined by the number of edit operations
needed to transform si to s j . These operations are insertion (I), deletion (D), replacement
(R), or no operation (M). For example, if we have two sequences s1 and s 2 given as
s1 = ⟨p0, p2, p5, p3⟩
s2 = ⟨p0, p1, p2, p5, p4⟩.
To transform s1 to s2, the following operations need to be applied:
M: no operation, since the p0's match in the first position of s1 and s2.
I: insertion of p1 into the second position of s1.
M M: the p2's and p5's match in the third and fourth positions.
R: replacement of p3 at the last position of s1 by p4.
The dynamic programming method [27] can be used to find the minimum number of
operations. For sl and s k , D(i, j ) is defined to be the edit distance (number of edit
operations) to convert the first ith characters of s l to the first jth characters of s k .
The recursion base conditions are
D(i, 0) = i    (3.2)

and

D(0, j) = j    (3.3)

The recurrence relation for D(i, j) is

D(i, j) = min[D(i−1, j) + 1, D(i, j−1) + 1, D(i−1, j−1) + t(i, j)]    (3.4)
where t(i, j) is defined to have the value 1 if and only if the ith character of sl and the
jth character of sk are different; otherwise it has the value 0.
In this approach, D(i, j) is first computed for the smallest possible values of i and j.
Typically, this computation is organized in a dynamic programming table of size
(n + 1) × (m + 1) that holds the values of D(i, j) for all choices of i and j. In the table,
the vertical axis represents sl and the horizontal axis represents sk. Because i and j
begin at zero, the table has a zeroth row and a zeroth column, whose values are filled in
directly from the base conditions for D(i, j). After that, the remaining n × m subtable is
filled in one row at a time, in order of increasing i; within each row, the cells are filled in
order of increasing j. Except for the base cases, each D(i, j) is known once D(i−1, j−1),
D(i, j−1), and D(i−1, j) have been computed, so the entire table can be computed one
row at a time.
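The recurrence of equations 3.2-3.4 can be sketched as a standard dynamic programming table; on the example sequences s1 and s2 above it yields a distance of 2 (one insertion and one replacement).

```python
def edit_distance(sl, sk):
    """D[i][j] = edit distance between the first i items of sl and the
    first j items of sk (equations 3.2-3.4)."""
    n, m = len(sl), len(sk)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                       # base condition D(i, 0) = i
    for j in range(m + 1):
        D[0][j] = j                       # base condition D(0, j) = j
    for i in range(1, n + 1):             # fill row by row, increasing i
        for j in range(1, m + 1):         # within a row, increasing j
            t = 0 if sl[i - 1] == sk[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or replacement
    return D[n][m]
```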
3.4.2.2 Measuring similarity distance
In a pairwise scores matrix, s ( x, y ) denotes the score obtained by aligning pages in
s i and s j . The alignment is done by inserting spaces (or no action) between the pages. A
pairwise scores matrix sets the score s ( x, y ) to be greater than or equal to zero when the
pages are the same, and less than zero if they mismatch.
The alignment value A is defined as

A = Σ_{i=1}^{l} s(S1′(i), S2′(i))    (3.5)

where S1′(i) and S2′(i) denote si and sj after alignment.
For example, if we have P = ⟨p0, p1, p2, p3, p4⟩, we can define the pairwise scores
between different pages as shown in Table 3.1. If we have two sequences si and sj
aligned as described in Table 3.2, the alignment value A is calculated using equation 3.5,
which gives the following result:

A = −2 + 2 − 1 + 0 − 3 − 2 − 4 + 2 = −8
Table 3.1 Pairwise scores between different pages
[Table: 5 × 5 matrix of pairwise scores s(x, y) over the pages p0 to p4, with non-negative scores for matching pages and negative scores for mismatches.]
Table 3.2 Two sequences si and sj
[Table: the sequences si and sj after alignment, with gap symbols '-' inserted where pages were aligned against spaces.]
3.5 Previous work
In this section, we present an overview of the previous work on session comparison
done for web usage mining. Most of the similarity measures used to compare sessions in
web usage mining were simply based on intersections between the sets, such as the
cosine measure or the Jaccard coefficient [8]. For example, Foss et al. [8] applied the
Jaccard coefficient, which basically measures the degree of common visited pages in the
compared sessions. This method does not take into consideration the sequence of events.
So, the algorithm does not differentiate between the situation where p0 is visited
before p1 and the situation where p1 is visited before p0. A path feature space [28] was
used to represent all the navigation paths, and the similarity between each pair of paths
was measured by the path angle. In the path angle method, each navigation path is
represented as a vector, and the similarity between paths is the cosine similarity between
the vectors.
A non-Euclidean distance measure was presented using the sequence alignment
method (SAM) [29-31], which is derived from the Levenshtein [32] approach, and it
takes into account the weight of different operations. The distance formula is defined by
D(si, sj) = min(D · wd + I · wi + R · wr)    (3.6)

where D, I, and R are the numbers of deletion, insertion, and replacement operations,
respectively, needed to convert si to sj, and wd, wi, and wr are the weights of these
operations. Unlike the Levenshtein method, SAM is done in two steps.
First, it reorders the common elements such that the common elements in the two
sequences appear in the same order. In the second step, it inserts the uncommon elements
in both sequences so they appear the same.
The multidimensional sequence alignment method (MDSAM) [29] is a modified
version of the sequence alignment method that finds the set of operations with the
smallest possible sum of multidimensional operational costs. The full algorithm
description can be found in [29].
3.5.1 Limitations of the previous work
The algorithms that use the Euclidean distance between vectors or the cosine
measure have several limitations [33]:
1. The transformed space can be of very high dimension.
2. The original click stream is naturally a click sequence, which cannot be fully
represented by a vector or a set of URLs in which the order of clicks is not considered.
3. Euclidean distance has proven in practice to be unsuitable for measuring
similarity in a categorical vector space.
In addition, the multidimensional algorithms do not solve the inter-attribute relationship
problem, that is, the problem of considering the relationship between the attributes in
different dimensions.
3.6 Multidimensional session comparison method (MSCM)
In this section, we present our new Multidimensional Session Comparison Method (MSCM). We first present the assumptions on which we base our algorithm, then construct the algorithm and give a detailed description of it. Finally, we present a time complexity analysis for the algorithm.
3.6.1 Assumptions
We assume that three edit operations are allowed to convert one session to another:
deletion, insertion and swap. The deletion operation D(x) is defined as deleting an event
in the session at position x. The insertion operation I(x) is defined as inserting a new
event in the session at position x. The swap operation S(x) is defined as swapping
between events in the session at positions x and x+1.
We present the following assumptions about the sessions and our algorithm:
1. For the two dimensions (the page list and the time list), we assume that the page list is the primary dimension and the time list is the secondary dimension.
2. The navigation behavior of the user is determined mainly by the primary dimension, the page list.
3. The first dimension, the page list, is a nominal attribute, and the other dimension, the time value list, is a continuous attribute. So, the difference between pages p_1 and p_2 is considered the same as the difference between pages p_1 and p_100, but the difference between t = 1 and t = 7 is not the same as the difference between t = 1 and t = 100.
4. The distance d_mscm(s_i, s_j) between two sessions s_i and s_j is directly proportional to the minimum number of edit operations needed to convert s_i to s_j.
5. The distance between two sequences is inversely proportional to the maximum length of the compared sequences.
6. The weight of the swap operation is directly proportional to the time spent on the first page.
3.6.2 Algorithm construction
Based on the first four assumptions, we present a one-dimensional distance function defined by

    d_mscm(s_i, s_j) ∝ min(D·w_d + I·w_i + S·w_s)                           (3.7)

where:
    d_mscm is the edit distance based on MSCM,
    D is the number of deletion operations,
    I is the number of insertion operations,
    S is the number of swap operations,
    w_d is the weight of the deletion operation,
    w_i is the weight of the insertion operation, and
    w_s is the weight of the swap operation.
Based on the fifth assumption, the total distance is divided by the maximum length of the two sequences:

    d_mscm(s_i, s_j) ∝ min(D·w_d + I·w_i + S·w_s) / max(|s_i|, |s_j|)       (3.8)

where |s| is the length of the sequence s.
Based on the final assumption, the weight of the swap operation is multiplied by the Heaviside step function Φ(t), defined as

    Φ(t) = { 0  if t ≤ 0
           { 1  if t > 0                                                    (3.9)

where t is the time spent on the page on which the swap operation is performed. The final distance is given as

    d_mscm(s_i, s_j) = min(D·w_d + I·w_i + S·w_s·Φ(t)) / max(|s_i|, |s_j|)  (3.10)
The distance function defined in equation 3.10 takes into consideration the two dimensions in the web session and gives a proper solution to the inter-attribute relationship problem, unlike other algorithms where the inter-attribute relationship is solved by computing a trajectory between the first and the second attributes. Also, the distance calculated in equation 3.10 can be considered an absolute distance, since it is normalized by the maximum length of the sequences.
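The role of the Heaviside factor in equation 3.10 can be sketched as follows (a minimal illustration; the function names are ours, and the per-swap cost shown here is an assumption about how the weight enters the sum):

```python
def heaviside(t):
    """Equation 3.9: 0 for t <= 0, 1 for t > 0."""
    return 1 if t > 0 else 0

def swap_weight(ws, t):
    """Effective cost of one swap at a page where the user spent t seconds.
    A swap across a zero-duration page costs nothing: with one-second log
    resolution the true load order of such pages is unknown."""
    return ws * heaviside(t)
```

This is exactly what lets MSCM report a distance of zero for sessions that differ only in the order of pages with zero recorded dwell time.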
3.6.3 Algorithm description
The algorithm used for finding the minimal edit operations is based on the dynamic programming formulation in [34, 35]. Table 3.3 summarizes the MSCM algorithm's major steps to find the minimum number of edit operations.
Table 3.3 MSCM algorithm major steps

Step  Description
1     Set n to be the length of s_1 and m to be the length of s_2.
      If n = 0, return m and exit. If m = 0, return n and exit.
      Construct a matrix containing 0..m rows and 0..n columns.
2     Initialize the first row to 0..n and the first column to 0..m.
3     Examine each character of s_1 (i from 1 to n).
4     Examine each character of s_2 (j from 1 to m).
5     If s_1[i] equals s_2[j], the cost is 0.
      If s_1[i] does not equal s_2[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately above plus 1: d[i-1,j] + 1.
      b. The cell immediately to the left plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].
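The steps above can be sketched in Python (a minimal illustration, not the dissertation's implementation: it computes the unit-cost edit distance of Table 3.3 and the length-normalized distance of equation 3.8, omitting the time-weighted swap operation of equation 3.10):

```python
def edit_table(s1, s2):
    """Dynamic-programming matrix: d[i][j] is the minimum number of
    unit-cost edits between the first i pages of s1 and the first j of s2."""
    n, m = len(s1), len(s2)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):             # base condition: first column
        d[i][0] = i
    for j in range(m + 1):             # base condition: first row
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # step 6a
                          d[i][j - 1] + 1,           # step 6b
                          d[i - 1][j - 1] + cost)    # step 6c
    return d

def normalized_distance(s1, s2):
    """Edit distance divided by the longer session (equation 3.8)."""
    if not s1 and not s2:
        return 0.0
    return edit_table(s1, s2)[len(s1)][len(s2)] / max(len(s1), len(s2))
```

On the page lists of the example sessions s_0 and s_1 used in Tables 3.4 and 3.5, the final cell of the matrix holds 5, and sessions with no pages in common (such as s_2 and s_3 of Section 3.7) get the maximal normalized distance of 1.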
For example, suppose we have two web sessions:

    s_0 = {⟨p14, p11, p6, p10, p1, p11, p8⟩, ⟨1, 2, 4, 3, 2, 4, 5⟩}, and
    s_1 = {⟨p15, p6, p7, p10, p7, p1, p11⟩, ⟨2, 4, 6, 3, 1, 3, 5⟩}

Table 3.4 shows the matrix to be used to compute the minimum edit distance between sequences s_0.p and s_1.p.
Table 3.4 Matrix used to compute the minimum edit distance, when only the zeroth column and row are filled in

    d_mscm            p14  p11  p6   p10  p1   p11  p8
      i \ j      0    1    2    3    4    5    6    7
          0      0    1    2    3    4    5    6    7
    p15   1      1
    p6    2      2
    p7    3      3
    p10   4      4
    p7    5      5
    p1    6      6
    p11   7      7
The cell in row i and column j of Table 3.4 stands for the minimum number of edit operations needed to convert the first j pages of sequence s_0 into the first i pages of sequence s_1. Taking into consideration the zero-length sequence, the table size is the sequence length plus 1. So, for the example in Table 3.4, we have a table of size (7 + 1) × (7 + 1). In the table, the horizontal axis represents s_0 and the vertical axis represents s_1. The values in row zero and column zero are filled in directly from the base conditions in equations 3.2 and 3.3, respectively. After that, the remaining cells are filled in one row at a time, in order of increasing i; within each row, the cells are filled in order of increasing j, using the recurrence equation 3.4. Table 3.5 shows the matrix when all the cells are filled in. The minimum edit distance is the value in the last cell, which in our example equals 5.
Table 3.5 Matrix used to compute the minimum edit distance, when all the cells are filled in

    d_mscm            p14  p11  p6   p10  p1   p11  p8
      i \ j      0    1    2    3    4    5    6    7
          0      0    1    2    3    4    5    6    7
    p15   1      1    1    2    3    4    5    6    7
    p6    2      2    2    2    2    3    4    5    6
    p7    3      3    3    3    3    3    4    5    6
    p10   4      4    4    4    4    3    4    5    6
    p7    5      5    5    5    5    4    4    5    6
    p1    6      6    6    6    6    5    4    5    6
    p11   7      7    7    6    7    6    5    4    5
3.6.4 Time complexity analysis
The time complexity of filling the dynamic programming table for computing the minimum edit distance is reported to be O(m·n) [34], where m is the length of the first session and n is the length of the second session. This is straightforward, since it takes (n·m) steps to fill the table for sessions of lengths n and m, respectively.
3.7 Experimental results and analysis
In this section we present a few experiments that show how MSCM provides better results than other session comparison methods. We compare the results with three other methods: the Euclidean-based distance method (Path Feature Space [28]), the sequence alignment method (SAM) [30], and the multidimensional sequence alignment method (MDSAM) [29]. More experimental results are presented in the next chapter as the output of the clustering algorithm that adopts MSCM as its distance function.
For our experimental results, assume we have the following sessions:

    s_0 = {⟨p15, p7, p6, p10, p1, p7, p11⟩, ⟨3, 0, 2, 4, 0, 4, 5⟩}
    s_1 = {⟨p15, p6, p7, p10, p7, p1, p11⟩, ⟨3, 4, 1, 4, 1, 1, 6⟩}
    s_2 = {⟨p14, p11⟩, ⟨4, 5⟩}
    s_3 = {⟨p5, p7⟩, ⟨5, 4⟩}
    s_4 = {⟨p15, p7, p6, p10, p1, p7, p11⟩, ⟨3, 2, 1, 4, 3, 4, 5⟩}
    s_5 = {⟨p15, p7, p6, p10, p1, p3, p1⟩, ⟨2, 3, 2, 4, 3, 7, 5⟩}
Table 3.6 summarizes the distance between different sessions using different session comparison methods.

Table 3.6 Distance measure between sessions using different methods

    Sessions    Path Feature Space    SAM    MDSAM    MSCM
    s_0, s_1    √(5² + 4²) = 6.4      4      9        0
    s_2, s_3    √(2² + 2²) = 2        2      4        1
    s_4, s_5    √(2² + 3²) = 3.6      2      5        0.29
Sessions s_0 and s_1 are almost the same except for the swap between pages p_6, p_7 and between pages p_1, p_7. However, the time spent on p_7 and p_1 in s_0 is zero. This situation, where the time spent on a page is zero, arises because the time resolution for recording a web record is one second. Therefore, we cannot tell which page was loaded first, and so the order of these pages should be ignored. The other three algorithms do not recognize this, even the multidimensional ones, and all of them indicate a difference between the sessions where there is actually none, as indicated by MSCM.
As for sessions s_2 and s_3, it is obvious that they are not similar at all. The other three methods measure the difference based on the edit distance, and they give results that do not reflect the complete mismatch between the sequences. On the contrary, MSCM computes the absolute difference, which is the edit distance divided by the maximum session length. Thus, for sessions s_2 and s_3, MSCM returns the value of one, which indeed reflects the complete mismatch.

The usefulness of measuring the absolute value can also be seen when we consider the results of comparing s_2 and s_3 versus comparing s_4 and s_5. The other three methods show almost the same degree of difference for the two comparisons, but this is misleading: sessions s_4 and s_5 are almost the same except for the last two pages, while sessions s_2 and s_3 are completely different. On the other hand, the MSCM algorithm recognizes that the degree of difference between sessions s_2 and s_3 (returning a value of 1) is not the same as the degree of difference between sessions s_4 and s_5 (returning a value of 0.29).
3.8 Summary and conclusion
In this chapter we presented a new Multidimensional Session Comparison Method (MSCM), which is based on dynamic programming. Unlike other methods, MSCM takes into consideration other dimensions in the session, such as the time spent on each page and the total session length. The method showed more accurate results in comparing web sessions than other known methods, such as the Sequence Alignment Method (SAM), the Multidimensional Sequence Alignment Method (MDSAM), and Path Feature Space. The output of MSCM is presented in the form of a dissimilarity matrix, which can be used by different clustering techniques, such as the hierarchal, the k-means, and the equivalence classes clustering algorithms.
CHAPTER IV
ENHANCING WEBSITE STRUCTURE BY MEANS OF HIERARCHAL
CLUSTERING ALGORITHMS AND ROUGH SET LEARNING APPROACH
4.1 Introduction
In this chapter, we present a new way to enhance the website structure by means of hierarchal clustering algorithms and a rough set learning approach. Figure 4.1 shows the system workflow. The workflow starts by clustering the web sessions into different clusters using the dissimilarity matrix. The clustering results are then presented in the form of examples, as shown in Table 4.1, where each web session along with its clustering result represents one example in the examples table.
Table 4.1 Representing clustering results in a form of examples

    Example No.   1st Page   2nd Page   3rd Page   Clustering Result
    1             p_0        p_1        p_4        C_1
    2             p_2        p_3        p_5        C_2
    …             …          …          …          …
    N             p_2        p_3        p_5        C_k
The examples are then divided into two independent sets. The first set is used by different classifiers to learn rules that describe the system; the rules are presented in the if-then format. For example, the following two rules can be learned from the first two examples in Table 4.1:

    if 1st Page = p_0 and 2nd Page = p_1 and 3rd Page = p_4 then Cluster = C_1
    if 1st Page = p_2 and 2nd Page = p_3 and 3rd Page = p_5 then Cluster = C_2
The second set, along with the rules learned from the first set, is used in the inference engine to estimate the accuracy of the classification process. The clustering results, along with the generated rules, are then incorporated to enhance the structure of the website.
[Figure: workflow diagram with the boxes Web Sessions, Dissimilarity Matrix, Clustering Process, Examples, Classifier, Rules, Inference Engine, Classifiers Results, Estimation, Results, and Incorporation]

Figure 4.1 Web usage classification and prediction workflow
The rest of the chapter is organized as follows. In Section 2, we present an overview of clustering analysis. In Section 3, we present two algorithms for clustering web sessions. In Section 4, we present two different classifiers to describe and predict a web session's class. In Section 5, we present a method to estimate the accuracy of different classifiers. In Section 6, we present experimental results. In Section 7, we show how the results are incorporated in enhancing the website structure. In Section 8, we discuss the results. In Section 9, we present a summary and conclusion.
4.2 Clustering analysis
Clustering is a useful technique for grouping objects such that objects within a single group have similar characteristics, while objects in different groups are dissimilar. In the context of web usage mining, the objects are users' sessions. Each session contains the pages visited by the user at a certain time. Clustering can be used to group the users such that users with the same browsing behavior fall in a single cluster. For example, one cluster may consist predominantly of freshman students who register for classes, while another may consist of professors who upload their classes' grades. The clusters can then be used to identify dominant browsing behaviors, evaluate the website structure and predict users' browsing behavior. Clustering web usage sessions is an example of clustering where objects are of a non-numeric data type, such as a nominal or categorical data type.
4.2.1 Clustering algorithms
Clustering algorithms can be classified into partitional clustering and hierarchal
clustering [36, 37]. Partitional clustering algorithms divide n objects into k clusters that satisfy two conditions: (1) each cluster contains at least one object, and (2) each object belongs to exactly one cluster. Equation 4.1 shows one of the commonly used criterion functions:

    E = Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, m_i)                                   (4.1)
In equation 4.1, m_i is the centroid of cluster C_i, while d(x, m_i) is the Euclidean distance between x and m_i, defined in equation 4.2:

    d(x, m_i) = ( Σ_{j=1}^{d} (x_j − m_{i,j})² )^{1/2}                      (4.2)
The criterion function E attempts to minimize the distance of every object from the mean of the cluster to which the object belongs. One of the common approaches to minimizing the criterion function is the iterative k-means method. While the use of the k-means method can yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes [38], as is the case for web sessions.
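The criterion function of equations 4.1 and 4.2 can be sketched as follows (a minimal illustration for numeric points; the function names are ours):

```python
import math

def centroid(points):
    """Mean of a set of d-dimensional points (the m_i of equation 4.1)."""
    d = len(points[0])
    return [sum(p[k] for p in points) / len(points) for k in range(d)]

def euclidean(x, m):
    """Equation 4.2: Euclidean distance between a point and a centroid."""
    return math.sqrt(sum((xk - mk) ** 2 for xk, mk in zip(x, m)))

def criterion_E(clusters):
    """Equation 4.1: total distance of every object to its cluster mean."""
    return sum(euclidean(x, centroid(c)) for c in clusters for x in c)
```

The k-means iteration alternates between reassigning points to the nearest centroid and recomputing the centroids, which is exactly why it needs a meaningful mean and therefore breaks down on categorical page identifiers.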
Hierarchical clustering algorithms work by grouping data objects into a tree of clusters. A hierarchical method can be classified as agglomerative or divisive. Agglomerative hierarchical clustering, the most common strategy, starts by placing each object in its own cluster and then merges similar clusters together until one cluster holding all the objects is formed or some other termination condition is met. Divisive hierarchical clustering starts with all objects in one cluster and divides them up until each object forms a cluster by itself or some other termination condition is met.
At the first step of the agglomerative method, the dissimilarity matrix can be used to
determine how close the objects are to one another. Once the first step is completed and
the first level of clusters is generated, there will be a need to compare the clusters rather
than comparing the objects. Next, we present the five most common techniques to measure the difference between clusters:

• Single linkage: the distance between any two clusters is the shortest distance from any object in one cluster to any object in the other [39].
• Complete linkage: the distance between any two clusters is the farthest distance from any object in one cluster to any object in the other.
• Average linkage: the distance between any two clusters is the average distance from all objects in one cluster to all objects in the other.
• Ward's method: the distance between two clusters is the sum of squares of the distances between all objects in both clusters [40].
• Centroid method: the distance between two clusters is the distance between their centroids.
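The first three linkage criteria can be sketched directly from their definitions (a minimal illustration; `d` stands for any pairwise dissimilarity function, and the function names are ours):

```python
def single_linkage(c1, c2, d):
    """Shortest distance between any object in c1 and any object in c2."""
    return min(d(a, b) for a in c1 for b in c2)

def complete_linkage(c1, c2, d):
    """Farthest distance between any object in c1 and any object in c2."""
    return max(d(a, b) for a in c1 for b in c2)

def average_linkage(c1, c2, d):
    """Average distance over all cross-cluster object pairs."""
    return sum(d(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
```

For example, with one-dimensional points and d(a, b) = |a − b|, clusters {0, 1} and {4, 6} have single linkage 3, complete linkage 6, and average linkage 4.5.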
4.2.2 Properties of agglomerative hierarchal clustering techniques
The single linkage method tends to have the chaining property [41]. As shown in Figure 4.2, chaining occurs when two well-separated clusters are joined through an intermediate chain of data points.
[Figure: scatter plot of two well separated clusters connected by a chain of intermediate points; x axis from 0 to 2, y axis from 0 to 6]

Figure 4.2 Two well separated clusters with intermediate chain
Previous empirical investigations indicate that the average linkage method and the Ward's method have superior performance. For example, Cunningham and Ogilvie [42] compared several hierarchal techniques and found that the average linkage method performs most satisfactorily for the data sets they considered. Kuiper and Fisher [43] investigated six hierarchal techniques and found that the Ward's method classifies the data very well. Finally, Blashfield [44] compared the single linkage method, the complete linkage method, the average linkage method, and the Ward's method using a quantifying statistical method explained in [45], and found that the Ward's method performed very well compared with the other methods. Later, in Section 4.3.6, we explain why we use the Ward's method over the average linkage method.
4.3 Clustering web sessions
Clustering web sessions means grouping similar website usage behaviors together. The clustering results can be used to identify dominant browsing behavior, evaluate the website structure and predict users' browsing behavior. We present two clustering algorithms: hierarchal clustering and equivalence classes clustering. Both algorithms can deal with nominal attributes and with sessions of different lengths, and both can adopt different dissimilarity matrices. Unlike the hierarchal clustering algorithm, the clusters generated by the equivalence classes clustering algorithm do not depend on the seed from which the clustering process starts. In this section, we next present several definitions along with a formal description of the problem statement. At the end of the section, we present the two proposed algorithms along with their running time complexity analysis.
4.3.1 Definitions
Web sessions S are defined as S = ⟨s_1, …, s_k⟩, where k is the number of sessions. Each web session s_i is defined as s_i = ⟨p_i.1, …, p_i.n⟩, where n is the number of pages in session s_i, and p_i.k is the k-th page in session s_i. Web session clusters C are defined as C = ⟨c_1, …, c_m⟩, where m is the number of clusters. Each cluster c_i is defined as c_i = ⟨s_i.1, …, s_i.l⟩, where l is the number of sessions in cluster c_i.

The dissimilarity function δ_i,j is defined over S × S, where 0 ≤ δ_i,j ≤ 1; a value of one implies perfect dissimilarity, and a value of zero implies perfect similarity. A detailed description of this function was presented in Chapter 3. The cluster centroid ce_i is defined as the session in the middle of the cluster c_i. The centroid defined in equation 4.3 uses the Ward's method in defining the minimum distance. In Section 4.3.6, we explain why we use the Ward's method over other methods, such as the average linkage method, in defining the minimum distance for the centroid.

    ce_i = s_i.k  where  s_i.k = min( Σ_{j=1}^{n} (δ_j,k)² ),  ∀s_i.k ∈ c_i  (4.3)

We also define an overloaded version of the dissimilarity function defined earlier in equation 3.10 to be applied to clusters. The overloaded version of δ_i,j shown in equation 4.4 accepts clusters as inputs and uses the centroid method in determining the difference between clusters.

    δ_i,j = Σ_{cluster i} Σ_{cluster j} δ_{ce_i, ce_j}                      (4.4)

Finally, we define the threshold λ as the maximum difference allowed between sessions in the same cluster.
4.3.2 Problem statement
Given web sessions S, a dissimilarity function δ_i,j, and a threshold value λ, the objective is to find web session clusters C_λ such that for every c_k ∈ C_λ it is true that δ_i,j ≤ λ for all s_i, s_j ∈ c_k.
4.3.3 Hierarchal clustering algorithm
Figure 4.3 shows the clustering algorithm used to cluster the web sessions. Initially, each session is placed in a cluster by itself and the threshold value λ is initialized to zero. Then, the dissimilarity value between all pairs of clusters is checked. If the dissimilarity value is less than or equal to the threshold value, the two clusters are merged to form a new cluster. The new number of clusters is then checked. If it is found to be less than or equal to the minimum number of clusters, the algorithm exits the loop and the set of clusters is returned (in Section 4.3.5 we discuss how to choose the minimum number of clusters). Otherwise, the threshold value λ is incremented and the process is repeated.
    Initialize C := each cluster has one session
    Initialize λ := 0
    for i := 1 to |C| do {
        for j := i + 1 to |C| do {
            if δ_i,j ≤ λ
                merge_clusters(c_i, c_j)
        }
        increment λ
        if |C| ≤ minimum number of clusters
            break
    }
    return C

Figure 4.3 Hierarchal clustering algorithm
The time complexity analysis of the algorithm in Figure 4.3 shows that the algorithm has two loops, an inner and an outer loop. The outer loop has a worst case of n cycles when the initial number of clusters is n, while the inner loop has a worst case of (n − 1) cycles. In our work, we assumed that the dissimilarity function is provided in the form of a matrix, where element (i, j) contains the dissimilarity between sessions i and j. Therefore, looking up a dissimilarity value takes a constant time c that is independent of the initial number of clusters n. Hence, the overall complexity of the algorithm is O(n(n − 1) · c) = O(c · n²) = O(n²).
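The algorithm of Figure 4.3 can be sketched as follows. This is a simplified illustration, not the dissertation's implementation: the cluster-to-cluster distance here is taken as the smallest pairwise session dissimilarity (single linkage), whereas the chapter's overloaded δ of equation 4.4 uses cluster centroids; the function name and the step size for incrementing λ are ours.

```python
def hierarchal_clustering(delta, min_clusters, step=0.1):
    """delta[i][j] is the session dissimilarity matrix (values in [0, 1]).
    Clusters whose distance is within the current threshold lam are merged;
    lam grows until at most min_clusters clusters remain."""
    clusters = [[i] for i in range(len(delta))]   # one session per cluster
    lam = 0.0
    while len(clusters) > min_clusters and lam <= 1.0:
        merged = True
        while merged and len(clusters) > min_clusters:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    dij = min(delta[a][b]
                              for a in clusters[i] for b in clusters[j])
                    if dij <= lam:
                        clusters[i] += clusters.pop(j)   # merge j into i
                        merged = True
                        break
                if merged:
                    break
        lam += step   # relax the threshold and try again
    return clusters
```

Because merging needs only one close pair, this sketch also reproduces the chaining property of single linkage discussed in Section 4.2.2.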
4.3.4 Equivalence classes clustering algorithm
We define an equivalence relation called belongs to, ~, on C × C that satisfies the reflexive, symmetric and transitive properties. The reflexive property implies that c_i ~ c_i for all c_i ∈ C. The symmetric property implies that if c_i ~ c_j then c_j ~ c_i for all c_i, c_j ∈ C. The transitive property implies that if c_i ~ c_j and c_j ~ c_k then c_i ~ c_k for all c_i, c_j, c_k ∈ C. The first two properties (reflexive and symmetric) are satisfied by the hierarchal algorithm defined in Section 4.3.3. So, to achieve the equivalence relation defined earlier, we modify the algorithm described in Figure 4.3 to also satisfy the transitive property.
Figure 4.4 shows the equivalence classes clustering algorithm that satisfies the three properties of the equivalence relation mentioned earlier. The algorithm is a modified version of the one shown in Figure 4.3, in which two clusters are not merged unless all pairs of sessions in both clusters have a dissimilarity value less than or equal to the threshold value λ.

The time complexity analysis for this algorithm is the same as the one for the hierarchal clustering algorithm shown in Figure 4.3, except for an extra third inner loop, which increases the complexity by one degree to O(n³).
    Initialize C := each cluster has one session
    Initialize λ := 0
    for i := 1 to |C| do {
        for j := i + 1 to |C| do {
            boolean merge := false
            if δ_i,j ≤ λ {
                merge := true
                for k := 1 to |C| do {
                    if ( ((δ_i,k ≤ λ and δ_j,k > λ) or (δ_j,k ≤ λ and δ_i,k > λ))
                         or (δ_i,k > λ and δ_j,k > λ and δ_j,k ≠ δ_i,k) ) {
                        merge := false
                        break
                    }
                }
            }
            if (merge)
                merge_clusters(c_i, c_j)
        }
        increment λ
        if |C| ≤ minimum number of clusters
            break
    }
    return C

Figure 4.4 Equivalence classes clustering algorithm
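The algorithm of Figure 4.4 can be sketched as follows. This is a simplified illustration, not the dissertation's implementation: the cluster-to-cluster distance here is the worst pairwise session dissimilarity (so every pair inside a merged cluster stays within the threshold), and the k-check is rendered as "the two candidate clusters must relate to every other cluster consistently"; the function name and step size are ours.

```python
def equivalence_clustering(delta, min_clusters, step=0.1):
    """delta[i][j] is the session dissimilarity matrix (values in [0, 1])."""
    clusters = [[i] for i in range(len(delta))]

    def dist(ci, cj):
        # worst pairwise distance, so a merge keeps all pairs within lam
        return max(delta[a][b] for a in ci for b in cj)

    lam = 0.0
    while len(clusters) > min_clusters and lam <= 1.0:
        changed = True
        while changed and len(clusters) > min_clusters:
            changed = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if dist(clusters[i], clusters[j]) > lam:
                        continue
                    # the k-check of Figure 4.4: i and j must be close to,
                    # or far from, every other cluster k together, so the
                    # relation stays transitive
                    ok = all((dist(clusters[i], clusters[k]) <= lam) ==
                             (dist(clusters[j], clusters[k]) <= lam)
                             for k in range(len(clusters)) if k not in (i, j))
                    if ok:
                        clusters[i] += clusters.pop(j)
                        changed = True
                        break
                if changed:
                    break
        lam += step
    return clusters
```

Unlike the single-linkage sketch of Figure 4.3, a chain of sessions (0 close to 1, 1 close to 2, but 0 far from 2) is not merged at a small threshold, because 0 and 1 disagree about their relation to 2.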
4.3.5 Determining a common termination condition for different sessions lengths
Prior to applying the web sessions to the clustering algorithms described earlier, the sessions are grouped such that sessions with the same length are placed together. In the clustering algorithms shown in Figure 4.3 and Figure 4.4, the termination condition depends on the minimum number of clusters. To choose a minimum number of clusters that is common to all session-length groups, the number of clusters is normalized to the number of clusters in the first iteration for each session-length group. To illustrate this idea, consider for example Table 4.2, which shows how a group of web sessions is divided according to session length into two groups. The first group shows the number of clusters for a session length of 3 at different clustering iterations. The second group shows the number of clusters for a session length of 4 at different clustering iterations.
Table 4.2 Number of clusters for different session lengths at different iterations

    Iteration           1    2    3    4    5    6    7    8    9    10   11
    Session length 3    255  39   35   9    6    6    6    5    1
    Session length 4    479  141  115  64   35   26   24   15   5    3    1
It is clear that the initial number of clusters is different in the two session-length groups. So, in order to choose the same termination condition, we calculate the percentage of the number of clusters relative to the initial number of clusters for the different session lengths at different iterations. Table 4.3 shows these percentages. For example, if we choose to stop the clustering process when the number of clusters is 14% of the initial number of clusters, then we stop at the iteration where the percentage is closest to 14%, which is, in this case, iteration 3 for session length 3 and iteration 4 for session length 4.
Table 4.3 Percentage of the number of clusters relative to the initial number of clusters for different session lengths at different iterations

    Iteration           1     2    3    4    5   6   7   8   9   10  11
    Session length 3    100%  15%  14%  3%   2%  2%  2%  1%  0%
    Session length 4    100%  29%  24%  13%  7%  5%  5%  3%  1%  0%  0%
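The normalized termination rule can be sketched as follows, using the cluster counts of Table 4.2 (the function name is ours):

```python
def closest_iteration(cluster_counts, target_pct):
    """Given the number of clusters at each iteration for one session-length
    group (as in Table 4.2), return the 1-based iteration whose cluster
    count, as a percentage of the initial count, is closest to target_pct."""
    initial = cluster_counts[0]
    pcts = [100.0 * c / initial for c in cluster_counts]
    best = min(range(len(pcts)), key=lambda k: abs(pcts[k] - target_pct))
    return best + 1

# Cluster counts per iteration from Table 4.2
length3 = [255, 39, 35, 9, 6, 6, 6, 5, 1]
length4 = [479, 141, 115, 64, 35, 26, 24, 15, 5, 3, 1]
```

Applied with a 14% target, the rule stops at iteration 3 for the length-3 group (35/255 ≈ 13.7%) and at iteration 4 for the length-4 group (64/479 ≈ 13.4%), matching the example in the text.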
4.3.6 Ward’s method improves determining a common termination condition
The reason we use the Ward's method in defining the centroid in equation 4.3 over other methods, like average linkage, is that it shows a slower convergence, which helps in determining a common termination condition for all session-length groups more accurately. To illustrate why we want slow convergence, first consider Figures 4.5 and 4.6, which show the percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations, using the average linkage method and the Ward's method, respectively. Next, assume we want to stop the clustering process when the number of clusters is 20% of the initial number of clusters (in Section 4.6 we explain how to choose these percentage points). Using the average linkage method, Figure 4.5 shows that the 20% point occurs between iterations 6 and 7, where the percentage is 23.50% and 14.53%, respectively. So, the closest iteration is iteration 6, giving a percentage of 23.50% and an error of 3.5%. When using the Ward's method, as shown in Figure 4.6, the 20% point occurs between iterations 53 and 54, where the percentage is 20.5% and 19.7%, respectively. So, the closest iteration is 54, giving a percentage of 19.7% and an error of 0.3%. The error value represents how accurate the clustering process is with a common termination condition for all sessions with different session lengths.
[Figure: percentage of the number of clusters relative to the initial number of clusters versus iteration (0 to 14), decreasing from 100% toward 0%]

Figure 4.5 Percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations using the average linkage method

[Figure: percentage of the number of clusters relative to the initial number of clusters versus iteration (0 to 250), decreasing from 100% toward 0%]

Figure 4.6 Percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations using the Ward's method
4.4 Web sessions’ classifiers
We present two methods for cluster classification. The first method is based on the centroid, and the second is based on the inductive learning program BLEM2. The rules learned from both classifiers are used both in predicting and in describing web session clusters. Table 4.4 shows an example of a rule generated by a web sessions' classifier.
Table 4.4 Example of a rule generated by a web sessions' classifier

    Page 1   Page 2   Page 3   Cluster
    p_45     p_84     p_204    C_1
To predict a web session's cluster, the rule in Table 4.4 can be presented as an if-then statement:

    if 1st Page = p_45 and 2nd Page = p_84 and 3rd Page = p_204 then Cluster = C_1

So, for any new session in the web log, the above rule can be applied in order to predict the session's cluster. We use the classifier accuracy estimator described in Section 4.5 to estimate the accuracy of the prediction. Experimental results based on this method are presented in Section 4.6.
To describe a cluster, the physical page-name lookup table is used to find the physical page names. For example, the physical names for the pages in Table 4.4 are

    ⟨signon, BeforeClassSearch, ClassSearch⟩
The cluster can then be described as the Class Search cluster, meaning it contains a group of users who are searching for classes. The evaluation of the cluster description is based on the length of the description; for this example, the length of the description is 3. In Section 4.6.4, we present experimental results for the description length using different classifiers.
4.4.1 The centroid approach
The centroid approach is based on using the cluster centroid to describe clusters and to predict users' classes based on past behavior. As defined in equation 4.3, the centroid is the session that has the minimum sum of squared distances to all other sessions in the same cluster.
To illustrate this, assume that cluster c_k has session s_i as its centroid, which represents the following page sequence:

    s_i = ⟨p_45, p_84, p_204⟩

From the page-name lookup table, assume we find that the sequence has the following physical page names:

    s_i = ⟨signon, BeforeClassSearch, ClassSearch⟩

The cluster can then be described as the Class Search cluster, meaning it contains a group of users who are searching for classes.

Besides using the centroid for describing the clusters, this description is also used to predict users' classes. From the above example, the following rule can be generated:

    if s_i.1 = p_45 and s_i.2 = p_84 and s_i.3 = p_204 then s_i ∈ c_k

The generated rules, like the one above, may be applied by an inference engine to predict the class of future incoming sessions.
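Rule generation and application can be sketched as follows (a minimal illustration; the rule representation and function names are ours, not BLEM2's):

```python
def rule_from_centroid(centroid_pages, cluster_label):
    """Build an if-then rule: the antecedent fixes each page position to
    the centroid's page, and the consequent is the cluster label."""
    return {"if": list(centroid_pages), "then": cluster_label}

def predict(rules, session_pages):
    """Return the cluster of the first rule whose antecedent matches the
    session's page sequence, or None when no rule fires."""
    for rule in rules:
        if rule["if"] == list(session_pages):
            return rule["then"]
    return None
```

A new session from the web log either matches some learned rule and receives that rule's cluster, or it remains unclassified; the fraction of correctly classified sessions in the held-out set gives the accuracy estimate of Section 4.5.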
4.4.2 Rough set approach
In this subsection, we present the use of the rough set learning program BLEM2 in classifying different users' sessions. BLEM2 is an implementation of one of the LERS [46, 47] family of learning programs, which was introduced by Grzymala-Busse [48]. We use the information system notion presented by Pawlak [49, 50], in which an information system S is defined as a pair S = (U, A), where U is a nonempty finite set of objects and A is a nonempty finite set of attributes; each object is denoted by a vector of attribute values. Each attribute in A is associated with a set of values called the domain of the attribute.
Both hierarchal clustering algorithms described earlier produce a special case of an
information system called the decision table. In a decision table, there is a designated
attribute called the decision or class attribute, and other attributes called condition
attributes.
Table 4.5 shows an example of a decision table produced by the clustering algorithm, where the universe U consists of 16 examples. In Table 4.5, the attribute Cluster No. is the decision attribute, and the attributes Page 1, Page 2 and Page 3 are the condition attributes. The domain of the decision attribute is {0, 1, 2, 3, 4}, and the set of values of the condition attributes is {45, 58, 80, 84, 108, 120, 160, 186, 194, 204, 241, 251, 444, 463}.
Table 4.5 Decision table produced by the clustering algorithm

Session No.  Page 1  Page 2  Page 3  Cluster No.
 1           45      444     108     0
 2           45      444     108     0
 3           45      444     84      0
 4           45      444     463     0
 5           45      160     241     1
 6           45      160     108     1
 7           45      160     241     1
 8           45      80      194     2
 9           45      80      251     2
10           45      80      251     2
11           45      120     58      3
12           45      120     186     3
13           45      120     160     3
14           45      84      204     4
15           45      84      204     4
16           45      84      204     4
The partition of U determined by the decision attribute Cluster No. is

C_0 = {1, 2, 3, 4}, C_1 = {5, 6, 7}, C_2 = {8, 9, 10}, C_3 = {11, 12, 13}, and C_4 = {14, 15, 16},

where C_k is the set of sessions that belong to cluster k. Rough set theory provides the concepts of lower and upper approximations for the case of inconsistency (i.e., having more than one decision for the same condition values).
Let A = (U, R) be an approximation space, where U is a set of objects and R is an equivalence relation defined on U. Let X be a nonempty subset of U. Then the lower approximation of X by R in A is defined as

R̲X = {e ∈ U | [e] ⊆ X}    (4.5)

and the upper approximation of X by R in A is defined as

R̄X = {e ∈ U | [e] ∩ X ≠ ∅}    (4.6)

where [e] denotes the equivalence class containing e. The boundary set of X is defined as

BN_R(X) = R̄X − R̲X    (4.7)

A subset X of U is said to be R-definable in A if and only if R̲X = R̄X. The pair (R̲X, R̄X) defines a rough set in A, which is a family of subsets of U with the same lower and upper approximations R̲X and R̄X.
From Table 4.5, the lower and upper approximations of C_0, C_1, C_2, C_3, and C_4 are:

A̲C_0 = ĀC_0 = {1, 2, 3, 4},
A̲C_1 = ĀC_1 = {5, 6, 7},
A̲C_2 = ĀC_2 = {8, 9, 10},
A̲C_3 = ĀC_3 = {11, 12, 13},
A̲C_4 = ĀC_4 = {14, 15, 16}, and
BN_A(C_0) = BN_A(C_1) = BN_A(C_2) = BN_A(C_3) = BN_A(C_4) = ∅.
As in the previous example, the clustering algorithms presented earlier do not produce inconsistent rules, so the upper approximation is the same as the lower approximation, i.e., R̲X = R̄X, and the boundary set is BN_R(X) = R̄X − R̲X = ∅.
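These set operations translate directly into code. The sketch below (the helper is ours, not part of BLEM2) computes the lower and upper approximations of each decision class from a decision table, using the equivalence classes induced by the condition attributes:

```python
from collections import defaultdict

def approximations(conditions, decision):
    """conditions: object id -> tuple of condition-attribute values;
    decision: object id -> decision value. Returns (lower, upper) maps
    from each decision value to its approximating set of object ids."""
    blocks = defaultdict(set)              # equivalence classes [e]
    for obj, cond in conditions.items():
        blocks[cond].add(obj)
    classes = defaultdict(set)             # partition by decision attribute
    for obj, d in decision.items():
        classes[d].add(obj)
    lower, upper = {}, {}
    for d, X in classes.items():
        lo, up = set(), set()
        for block in blocks.values():
            if block <= X:                 # [e] entirely inside X
                lo |= block
            if block & X:                  # [e] touches X
                up |= block
        lower[d], upper[d] = lo, up
    return lower, upper

# Rows 5-7 and 14-16 of Table 4.5: the table is consistent, so the
# approximations coincide and every boundary set is empty.
conds = {5: (45, 160, 241), 6: (45, 160, 108), 7: (45, 160, 241),
         14: (45, 84, 204), 15: (45, 84, 204), 16: (45, 84, 204)}
decs = {5: 1, 6: 1, 7: 1, 14: 4, 15: 4, 16: 4}
lower, upper = approximations(conds, decs)
print(lower[4] == upper[4] == {14, 15, 16})  # True
```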
We use BLEM2 to learn rules from the lower approximation A̲X_i, since it is the same as the upper approximation ĀX_i and the boundary set BN_A(X_i) is empty. Rules learned from the lower approximation are called certain rules. Table 4.6 shows the certain rules learned from Table 4.5 using BLEM2. In the rules table, the entry -1 denotes a “do not care” condition. The support column is the number of examples covered by the rule. The certainty column is the ratio of the examples matching the rule’s conditions that also match its decision value. The strength column is the support of the rule divided by the size of the entire training set. The coverage column is the ratio of the decision-value class covered by the rule.
Table 4.6 Certain rules learned from Table 4.5 using BLEM2

Page 1  Page 2  Page 3  Cluster  Support  Certainty  Strength  Coverage
-1      2       -1      1        288      1          0.0974    1
-1      3       -1      2        118      1          0.0399    1
-1      4       -1      3        938      1          0.3173    1
-1      5       -1      4        47       1          0.0159    1
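The four rule-quality measures can be computed directly from a training set. A sketch follows (the helper and the rule encoding are ours, and it uses one interpretation of the definitions above: support counted as examples matching both condition and decision):

```python
def rule_metrics(rule_cond, rule_cluster, examples):
    """rule_cond: attribute index -> required value ("do not care"
    conditions simply omitted); examples: list of (attributes, cluster)."""
    cond_hits = [c for a, c in examples
                 if all(a[i] == v for i, v in rule_cond.items())]
    support = sum(1 for c in cond_hits if c == rule_cluster)
    in_class = sum(1 for _, c in examples if c == rule_cluster)
    certainty = support / len(cond_hits) if cond_hits else 0.0
    strength = support / len(examples)
    coverage = support / in_class if in_class else 0.0
    return support, certainty, strength, coverage

# Table 4.5 as (pages, cluster) pairs.
table45 = [((45, 444, 108), 0), ((45, 444, 108), 0), ((45, 444, 84), 0),
           ((45, 444, 463), 0), ((45, 160, 241), 1), ((45, 160, 108), 1),
           ((45, 160, 241), 1), ((45, 80, 194), 2), ((45, 80, 251), 2),
           ((45, 80, 251), 2), ((45, 120, 58), 3), ((45, 120, 186), 3),
           ((45, 120, 160), 3), ((45, 84, 204), 4), ((45, 84, 204), 4),
           ((45, 84, 204), 4)]

# "if Page 2 = 84 then Cluster = 4" covers exactly the three cluster-4 rows.
print(rule_metrics({1: 84}, 4, table45))  # (3, 1.0, 0.1875, 1.0)
```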
The rules in Table 4.6 are applied to the inference engine in two different ways. The
first way is simply by applying all certain rules. The second way is by applying the rules
with the maximum support value. So, for each set of rules that predict the same cluster, only the rule with the maximum support is applied to the inference engine. For example, if there is more than one rule that predicts cluster no. 1, then the rule with the maximum support value will be used and the other rules will be disregarded. The advantage of the maximum-support method is that it describes the system with a smaller number of rules.
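The maximum-support filtering reduces to keeping one rule per cluster. A minimal sketch (the dictionary rule representation is assumed, not BLEM2's actual format):

```python
def max_support_rules(rules):
    """Keep, for each predicted cluster, only the rule with the largest support."""
    best = {}
    for rule in rules:
        cluster = rule["cluster"]
        if cluster not in best or rule["support"] > best[cluster]["support"]:
            best[cluster] = rule
    return sorted(best.values(), key=lambda r: r["cluster"])

rules = [{"cluster": 1, "support": 288}, {"cluster": 1, "support": 40},
         {"cluster": 2, "support": 118}]
print(max_support_rules(rules))
# [{'cluster': 1, 'support': 288}, {'cluster': 2, 'support': 118}]
```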
4.5 Classifier accuracy estimator
We apply the holdout classifier accuracy estimator [51] to estimate the accuracy of
the different classifiers used. As shown in Figure 4.7, the examples generated from the
clustering process were randomly partitioned into two independent sets: α percent of the
data were used as the training set, and the rest, i.e., (1- α) percent of the data, were used
as a testing set. The training set was used to generate rules either by the centroid method
or by the BLEM2 classifier. The testing set, along with the generated rules, was then fed to an inference engine, which predicted a class for each testing example based on the generated rules. The overall average accuracy is the percentage of correctly predicted classes out of the entire testing set.
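The holdout procedure itself is short. A sketch under our own naming (with α expressed as a fraction rather than a percentage):

```python
import random

def holdout_split(examples, alpha, seed=0):
    """Randomly partition examples: an alpha fraction for training,
    the remaining (1 - alpha) fraction for testing."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(alpha * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def holdout_accuracy(predict, test_set):
    """Fraction of test examples whose predicted class equals the given class."""
    hits = sum(1 for example, cls in test_set if predict(example) == cls)
    return hits / len(test_set)

# Toy data: the class is the parity of the single attribute.
data = [((i,), i % 2) for i in range(100)]
train, test = holdout_split(data, 0.83)
print(len(train), len(test))                       # 83 17
print(holdout_accuracy(lambda x: x[0] % 2, test))  # 1.0
```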
Figure 4.7 Holdout classifier accuracy estimator
To illustrate how the overall average accuracy is calculated, consider the following two rules and the testing examples in Table 4.7:

if 2nd Page = p_13 then Cluster = C_1
if 3rd Page = p_85 then Cluster = C_2
Table 4.7 Inference engine testing examples

Example No.  1st Page  2nd Page  3rd Page  Cluster
1            p_45      p_13      p_50      C_1
2            p_45      p_13      p_66      C_1
3            p_45      p_12      p_85      C_2
4            p_45      p_9       p_85      C_2
5            p_45      p_12      p_85      C_2
6            p_45      p_13      p_33      C_3
7            p_45      p_10      p_85      C_4
The inference engine predicts the classes of the examples in Table 4.7 using the two rules presented earlier. The predicted classes are compared to the given classes in Table 4.7. Table 4.8 shows the inference engine’s cluster predictions along with the classes given in Table 4.7. Because we have 5 matches out of 7 examples, the overall average accuracy for the classifier that generated the rules is 5/7 ≈ 0.71.
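The 5/7 figure can be reproduced in a few lines. A sketch hard-coding the two rules and the Table 4.7 examples:

```python
def predict(session):
    """Apply the two example rules in order; None if no rule fires."""
    first, second, third = session
    if second == "p13":
        return "C1"
    if third == "p85":
        return "C2"
    return None

# Testing examples from Table 4.7: (pages, given cluster).
examples = [
    (("p45", "p13", "p50"), "C1"),
    (("p45", "p13", "p66"), "C1"),
    (("p45", "p12", "p85"), "C2"),
    (("p45", "p9",  "p85"), "C2"),
    (("p45", "p12", "p85"), "C2"),
    (("p45", "p13", "p33"), "C3"),
    (("p45", "p10", "p85"), "C4"),
]
matches = sum(1 for s, c in examples if predict(s) == c)
print(matches, len(examples))             # 5 7
print(round(matches / len(examples), 2))  # 0.71
```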
Table 4.8 Inference engine results along with results from clustering

Example No.  Cluster from examples  Prediction from inference engine  Match (1 = match, 0 = non-match)
1            C_1                    C_1                               1
2            C_1                    C_1                               1
3            C_2                    C_2                               1
4            C_2                    C_2                               1
5            C_2                    C_2                               1
6            C_3                    C_1                               0
7            C_4                    C_2                               0
4.6 Experimental results
In this section, we present the experimental results for the Web Usage Mining (WUM) clustering and learning algorithms described in this chapter. First, we present the choice of the termination condition for the two clustering algorithms. Next, we present the accuracy of predicting sessions’ clusters using the rules generated by the different classifiers described in Section 4.4. Finally, we present the experimental results for using the rules to describe the clusters, by presenting the average cluster description length under the different classifiers.
4.6.1 Choosing the clustering termination conditions
The clustering algorithm termination condition depends on the number of clusters
or, more precisely, on the percentage of the number of the clusters from the initial
number of clusters. The first termination condition we chose is when all sessions in the
same cluster have the exact same sequence. This occurs at the first iteration, when the percentage of the number of clusters is 100%. For the second termination condition, we choose
the session with the shortest length to determine the iteration at which we stop the
clustering. Then, we find the percentage of the number of clusters at that point. For the
rest of the session length groups, we stop at the point where the percentage of the number
of the clusters is closest to the percentage of the number of clusters for the shortest
session length. The reason we choose the session with the shortest length is that it is the most sensitive to the threshold. To illustrate this, consider the following pairs of sessions, where the first pair of sessions is of length 3:

⟨p_0, p_1, p_2⟩
⟨p_3, p_4, p_5⟩,

and the second pair of sessions is of length 10:

⟨p_0, p_1, p_2, p_0, p_1, p_2, p_0, p_1, p_2, p_0⟩
⟨p_3, p_4, p_5, p_1, p_5, p_6, p_7, p_8, p_9, p_10⟩.
For these two pairs of sessions, with a threshold of 3 differences, the first pair, of length 3, is a 100% match, whereas the second pair, of length 10, is a 30% match according to the distance equation 3.10 presented in Chapter 3. Even though both pairs are complete mismatches, the threshold value of 3 causes the pair of length 3 to give a 100% match, while the pair of length 10 gives a reasonable difference of 30%.
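The length sensitivity can be seen in code. The sketch below is a simplified stand-in for equation 3.10 (not the actual distance measure): positions beyond the tolerated threshold count as mismatches, the rest as matches.

```python
def match_fraction(a, b, threshold):
    """Fraction of positions considered matching when up to `threshold`
    differing positions are tolerated. Simplified stand-in for the
    session distance of equation 3.10."""
    assert len(a) == len(b)
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return 1 - max(0, diffs - threshold) / len(a)

short_a, short_b = [0, 1, 2], [3, 4, 5]
long_a = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]
long_b = [3, 4, 5, 1, 5, 6, 7, 8, 9, 10]
print(match_fraction(short_a, short_b, 3))          # 1.0 -> 100% "match"
print(round(match_fraction(long_a, long_b, 3), 2))  # 0.3 -> 30% match
```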
In our experiment, a session with length 3 is considered to be the session with the
shortest length. As shown in Figure 4.8, for session length 3, the percentage of the number of clusters dropped to a value close to zero after 8 iterations. So our choice for the
termination condition was between iterations 1 and 8. Iteration 1 was chosen for the first set of experiments. From the remaining iterations, we chose iteration 2, since the later iterations show only a small percentage of the number of clusters. At iteration 2, the
percentage of the number of clusters was 15.69%. For other session lengths, we stop at the iteration where the percentage of the number of clusters has the closest value to 15.69%.

Figure 4.8 Percentage of the number of the clusters from the initial number of clusters for different session length groups at different iterations
4.6.2 Classifier prediction accuracy results by rules generated from examples using the
hierarchal clustering algorithm
In this subsection, we present the classifier prediction accuracy results where the
rules are generated from the examples using the hierarchal clustering algorithm described
in Section 4.3.3. The first test set was performed on the examples generated using the clustering algorithm, where the termination condition is 100% of the number of clusters. Figure 4.9 shows the average accuracy, where BLEM2 (all) refers to all rules generated using BLEM2, BLEM2 (max) refers to BLEM2 rules with the maximum support only,
and centroid refers to rules generated using the centroid method. The accuracy was constant, with a value of 1, for all three classifiers.

Figure 4.9 Average accuracy for different session lengths at the 100% number of clusters using examples from the hierarchal clustering algorithm
The second test set was performed on the clusters generated when the number of clusters was 15.69%. Since the clustering results depend on the seed starting point for clustering, the experiments were repeated five times for each session length group. The overall average accuracy results are shown in Figure 4.10. The results clearly show that the average accuracy using BLEM2 is better than that of the centroid method.
Figure 4.10 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the hierarchal clustering algorithm
4.6.3 Classifier prediction accuracy results by rules generated from examples using
equivalence classes clustering algorithm
In this subsection, we present the classifier prediction accuracy results where the
rules were generated using the examples from the equivalence classes clustering
algorithm described in Section 4.3.4. The first test set was performed on the examples generated from the clustering algorithm, where the termination condition was at 100% of the number of clusters. The results were the same as those for the hierarchal clustering algorithm shown in Figure 4.9: the accuracy was a constant value of 1 for all three classifiers.

The second test set was performed using the same termination condition used for the hierarchical clustering algorithm, 15.69% of the number of clusters. Since the clustering results are independent of the seed starting point, the experiments were performed only once; the accuracy results are shown in Figure 4.11. As in the hierarchal clustering case, the results show that the average accuracy using BLEM2 is better than that of the centroid method.
Figure 4.11 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the equivalence classes clustering algorithm
4.6.4 Cluster description results
As described in Section 4.4, rules learned using the centroid method and BLEM2
methods are used to describe the clustering results. The cluster description length is
defined as the number of conditions in the if part of the if-statement that represents the
rule. For example, if we have the following two rules that describe clusters C1 and C 2 ,
respectively
If 2nd Page = p_13 and 3rd Page = p_14 then Cluster = C_1
If 1st Page = p_45 and 2nd Page = p_33 and 3rd Page = p_85 then Cluster = C_2,
then cluster C_1’s description length is 2 and cluster C_2’s description length is 3. Thus, the average description length is (2 + 3)/2 = 2.5. Figure 4.12 shows the cluster description length using the different classifiers. The BLEM2 classifier shows a short, nearly constant cluster description length across different session lengths, while the centroid method shows a linearly increasing cluster description length.
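The description-length measure is straightforward to compute. A sketch using a tuple-of-conditions rule encoding (ours, with -1 again meaning “do not care”):

```python
def avg_description_length(rules):
    """Average number of non-"do not care" conditions in the IF-part."""
    lengths = [sum(1 for v in conditions if v != -1)
               for conditions, _cluster in rules]
    return sum(lengths) / len(lengths)

# The two example rules above: 2 conditions for C1, 3 conditions for C2.
rules = [((-1, "p13", "p14"), "C1"), (("p45", "p33", "p85"), "C2")]
print(avg_description_length(rules))  # 2.5
```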
Figure 4.12 Cluster description length for different session lengths using different classifiers
4.7 Results incorporation
Incorporating the results to enhance the structure of the website is done in three steps. The first step is to identify the most common tasks. The second step is to find how many clicks it takes to finish each task. The last step is to present suggestions for enhancing the structure of the website so that the common tasks can be completed more easily and quickly.
4.7.1 Identifying the most common tasks
Each web session cluster represents one task and the number of the sessions in the
cluster reflects how common the task is. For example, Figure 4.13 shows the seven most
common tasks performed on the University of Akron registrar website. The task
description was identified by the cluster description described in Section 4.4.
Figure 4.13 Seven most common tasks performed on the website (Enrollment request to add class, Enrollment application, Class search, Account due, Account view, Class search detail, Class roster; shares ranging from 11% to 20%)
4.7.2 Finding how many clicks needed to finish each task
We assume that each page in the session represents one click for the user to move from one page to another, so the total number of clicks needed to finish a task is the same as the sequence length. Figure 4.14 shows the distribution of the number of clicks for the “Class search detail” task. From the figure, it can be concluded that 61% of the time the “Class search detail” task was finished in 5 clicks, while 24% of the time it was finished in 3 clicks. From the cluster centroid, the following page sequence was clicked to finish the task in 5 clicks:

⟨signon, BeforeClassSearch, ClassSearch, BigClassSearchResult, ClassSearchDetail⟩

When the task was completed in 3 clicks, the following page sequence was clicked:

⟨signon, BigClassSearchResult, ClassSearchDetail⟩
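Since clicks-to-finish equals the sequence length, the distribution in Figure 4.14 is simply a length histogram over the cluster's sessions. A sketch with made-up session data reproducing the proportions (the page sequences are placeholders, not real log data):

```python
from collections import Counter

def click_distribution(sessions):
    """Each page in a session is one click, so clicks-to-finish equals
    the sequence length; return the share of sessions per length."""
    counts = Counter(len(s) for s in sessions)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

dist = click_distribution([["a", "b", "c"]] * 24 +
                          [["a", "b", "c", "d", "e"]] * 61 +
                          [["a", "b", "c", "d"]] * 15)
print(dist)  # {3: 0.24, 5: 0.61, 4: 0.15}
```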
Figure 4.14 Sequence length distribution for “Class Search Detail” (length 5: 61%, length 3: 24%, others: 15%)
4.7.3 Presenting suggestions to enhance the website structure
By studying how different tasks are completed, recommendations can be made to change the website structure so that common tasks can be completed in a shorter time and with fewer clicks. For example, for the “Class search detail” task presented earlier, it can be seen from the cluster centroids that the users who finished the task in 5 clicks went through the regular “class search” first before going to the “class search detail”, whereas the users who finished the task in 3 clicks checked the “class search detail” directly. A recommendation can be made to the website engineer to add a shortcut to the “class search detail” on the homepage, so users can access it directly rather than being forced to go through the “class search” first.
4.8 Results discussion
The choice of a common termination condition for all session length groups must be based on the session with the shortest length. As shown in Figure 4.8, the session of length 3 was used to determine the common termination condition by finding the percentage of the number of clusters at the second iteration for session length 3.
Figure 4.9 shows that when sessions in the same cluster have the same exact page
sequence, the prediction accuracy is 1 for all different classification methods. When the
threshold is increased, Figure 4.10 and Figure 4.11 show that the rough set based BLEM2
rules predict the classes of sessions more accurately. This holds for rules generated using examples from both the hierarchal and the equivalence classes clustering algorithms. The results in Figure 4.12 illustrate that BLEM2 rules not only predict clusters more accurately but also give a shorter description of the clusters: the cluster description length using the centroid method increases linearly with session length, while the description based on rules learned using BLEM2 is almost constant, with a length of around 2.
Figures 4.13 and 4.14 show how the clustering and learning results can provide insightful information about the website, such as what the most common tasks are and how these tasks are commonly achieved. Finally, Section 4.7.3 shows how this information can be used to enhance the website structure so that users’ tasks can be achieved faster and more easily.
4.9 Summary and conclusion
In this chapter, we presented two different clustering algorithms to generate
examples that can be used by different classifiers. We used both the centroid and BLEM2
classifiers to learn rules from the examples generated using the clustering algorithms. We
applied the holdout classifier accuracy estimator to measure the accuracy of the
classifiers. Rules generated by BLEM2 show a better cluster prediction and shorter
cluster description.
The rules generated by different classifiers were used to present a deep conceptual
understanding of the usage behavior of the website, which can be used by the website
engineer to evaluate and to enhance the website structure and predict future users’
browsing behavior to better assist users in their future browsing experiences.
The work presented in this chapter—including generating examples, learning rules,
and testing the results—can be applied to sequence clustering methods in other fields,
such as bioinformatics, which is the area of analyzing genomic research data.
CHAPTER V
SYSTEM IMPLEMENTATION
5.1 Introduction
In this chapter, we present the implementation of the web usage mining system
presented in the previous chapters. As shown in Figure 5.1, the implementation is divided
into four modules: data preparation, session identification, clustering process, and result
presentation and evaluation. The data preparation module performs data filtering and user
identification. The session identification module performs session identification and
further data filtering. The clustering process module generates the dissimilarity matrix
and performs different clustering algorithms; including hierarchal and equivalence
classes clustering algorithms. The result presentation and evaluation module performs the
learning process along with accuracy estimation for learning results.
The rest of the chapter is organized as follows. In Section 2, we present the
implementation of the data preparation module. In Section 3, we present the
implementation of the session identification module. In Section 4, we present the
implementation of the clustering process module. In Section 5, we present the
implementation of the result presentation and evaluation module. Finally, in Section 6,
we present a summary.
Figure 5.1 Data flow diagram for the web usage mining system
5.2 Data preparation module
The data preparation module performs both data filtering and user identification.
The implementation is done using MS SQL Server. Figure 5.2 shows the entity relationship (ER) model for the database design. The Web_Records table contains all the raw data
collected from the web server. The filtered records are then stored in the Web_Log table.
The Open_Users and Users tables are used by the active user-based user identification
script shown in Figure 2.7. The final results of the user identification process are stored in
the Users table.
Figure 5.2 Entity relation model for data preparation
5.3 Session identification module
As shown in the use case diagram in Figure 5.3, the session identification module allows the user to perform different tasks: loading user records, performing session identification, performing further data filtering, and exporting the results to different platforms.
Figure 5.3 Use case diagram for session identification
Figure 5.4 shows the session identification module user interface. The first step in
using the program is to load the sequence file, which is the output of the user identification step. Next, the user is asked to load the page lookup file, which matches the page numbers used in the sequence file with their physical names. The page lookup file also indicates whether a page is a housekeeping page. Several filtering options are available:
• Remove housekeeping pages: this option removes the housekeeping pages identified by the user in the page lookup file. The user can set these up by pressing the “Setup House Keeping Pages” button.
• Remove redundant pages: this option removes the redundant pages that result from removing the housekeeping pages.
• Session identification based on break pages: this option splits users’ records into one or more sessions based on break pages. To do this, the user enters the break pages; the program then runs the algorithm described in Section 2.3.3. For our case, we chose the sign-in page as the break page, and the sessions were identified based on it.
• Specific session length range: this option filters sessions that have a specific number of records.

After these operations are completed, the results can be exported into different formats:

• Space delimited: a general-purpose format that can be read by many learning tools, including ours.
• Weka [52]: this format can be read by the popular open-source machine learning program Weka. Weka, like many other machine learning programs, requires a fixed sequence length, so this option cannot be used unless the data is filtered to a fixed length.
• Result statistics: this format gives statistical results about the distribution of session lengths after filtering.
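The break-page option can be read as a simple scan over a user's record stream. A plausible sketch follows (our own simplification, not the exact algorithm of Section 2.3.3):

```python
def split_sessions(records, break_pages):
    """Split one user's page-visit stream into sessions, starting a new
    session whenever a break page (e.g. the sign-in page) is visited."""
    sessions, current = [], []
    for page in records:
        if page in break_pages and current:
            sessions.append(current)
            current = []
        current.append(page)
    if current:
        sessions.append(current)
    return sessions

print(split_sessions(
    ["signon", "search", "detail", "signon", "roster"], {"signon"}))
# [['signon', 'search', 'detail'], ['signon', 'roster']]
```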
Figure 5.4 Session identification module user interface
5.4 Clustering process module
In the clustering module, users first prepare the session data for clustering by
generating the dissimilarity matrix. Hierarchal and equivalence class clustering
algorithms can then be applied to the sessions using the generated dissimilarity matrix.
Finally, users can pick clusters at a certain level of threshold. Figure 5.5 shows the use
case diagram for the clustering process module.
Figure 5.5 Use case diagram for the clustering process module
Figure 5.6 shows the UML diagram for the clustering process module. The
multiplicity shows that the dissimilarity matrix is generated based on the sessions. Each
cluster has one or more sessions, and each session belongs to one cluster only. Finally,
the clusters are generated using the hierarchal class. Each cluster consists of one or
several cluster levels.
Figure 5.6 UML diagram for the clustering process
Figure 5.7 shows the first sequence diagram in the clustering process. This diagram
shows how the dissimilarity matrix is generated by passing different messages between
Session, Dissimilarity Matrix and Distance classes.
Figure 5.7 Sequence diagram for generating the dissimilarity matrix
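The interaction sketched in Figure 5.7 amounts to filling a symmetric matrix of pairwise session distances. A minimal sketch with a stand-in distance (positions that differ, not the MSCM measure of Chapter 3):

```python
def dissimilarity_matrix(sessions, distance):
    """Symmetric matrix of pairwise distances between sessions."""
    n = len(sessions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = distance(sessions[i], sessions[j])
    return matrix

# Stand-in distance: count of differing positions.
def hamming(a, b):
    return sum(1 for x, y in zip(a, b) if x != y)

mat = dissimilarity_matrix([[1, 2, 3], [1, 2, 4], [5, 6, 7]], hamming)
print(mat)  # [[0.0, 1, 3], [1, 0.0, 3], [3, 3, 0.0]]
```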
Figure 5.8 shows the second sequence diagram in the clustering process. This
diagram shows how the clusters are found by passing different messages between
Cluster, Hierarchal and Dissimilarity Matrix classes.
Figure 5.8 Sequence diagram for finding clusters
Figure 5.9 shows the clustering module user interface. The user starts by loading the session file prepared earlier by the session identification process. Then, the dissimilarity matrix can either be generated, by choosing the “Generate Dissimilarity Matrix” button, or loaded directly from a text file, using the “Load Similarity Matrix” button. Once both the session and dissimilarity matrix files are ready, the user can perform the clustering process by choosing the “Run Clustering” button. The user can then export the clustering results at any level by providing the clustering level in the “Clustering Level” field and clicking the “Export Clustering Results at Certain Level” button.
Figure 5.9 Clustering module user interface
5.5 Results presentation and evaluation module
Figure 5.10 shows the dataflow diagram for the results presentation and evaluation
module. The figure shows the programs used at different steps in the learning and
evaluation process. Split.java is used to split the clusters into two independent sets. Raff2Lem.exe is used to export the results to BLEM2 format, and Lers7.exe then performs the learning process, generating the first set of rules using BLEM2. MaxSupport.java is used to select, for each cluster, the rule with maximum support. FindCentroid.java is used to learn rules using the centroid method. OpMVClassifierCF3.tcl is the inference engine that tests the accuracy of the classifiers. Source code for Split.java, FindCentroid.java, MaxSupport.java and OpMVClassifierCF3.tcl is available upon request.
Figure 5.10 Dataflow diagram for the results presentation and evaluation module (the clusters are split 83% / 17% into training and testing examples)
5.6 Summary
In this chapter, we presented the implementation of the web usage mining system presented in this dissertation. The system implementation was accomplished using a mixture of different programming environments, including SQL, Java and TCL. The source code for the implementation is available upon request.
CHAPTER VI
SUMMARY AND CONCLUSIONS
In this work, we presented a complete Web Usage Mining (WUM) system using
data mining techniques and a rough set learning approach. The system architecture
covered the major parts of the WUM system including data preprocessing, data cleaning
and filtering, session comparison, clustering analysis, and results presentation and results
incorporation. The goal of this system is to give a deep conceptual understating of the
usage behavior of a website. This conceptual understanding can be used by the website
engineer to evaluate and to enhance the website structure to better assist users in their
future browsing experiences.
In the data preprocessing phase, we presented new techniques for preprocessing web log data, including identifying unique users and sessions. We developed a fast active user-based user identification algorithm with a time complexity of O(n). For session identification, we presented an ontology-based session identification algorithm that uses
the website structure to identify users’ sessions. We showed that the user identification
algorithm depends on three parameters: number of records per user, web log records
recording rate on the web log, and the maximum inactive time for users. Table 6.1 shows
the mathematical models along with correlation coefficients for the three website
parameters on which our active user-based user identification algorithm depends. These
models can be used in simulating future website usage activity.
Table 6.1 Mathematical models for three website parameters

Parameter                         Mathematical model    Correlation coefficient
Records per user probability      Power fit             0.41
User navigation time probability  Reciprocal quadratic  0.92
Records per second probability    Geometric fit         1.00
In the session comparison phase, we presented a new multidimensional session
comparison method (MSCM), which is based on dynamic programming. Unlike other
algorithms, MSCM takes into consideration other dimensions in the session, such as the
time spent on the page and the total session length. The algorithm provided more accurate
results than other known algorithms in comparing web sessions, such as Sequence
Alignment Method (SAM), Multidimensional Sequence Alignment Method (MDSAM)
and Path Feature Space. The output of the MSCM is presented in the form of a dissimilarity matrix, which can be used by different clustering techniques, such as hierarchal, k-means and equivalence classes clustering algorithms.
In the clustering phase, we presented two clustering algorithms. The first is a
hierarchal clustering algorithm and the other is an equivalence classes clustering
algorithm. Unlike other clustering algorithms, the equivalence classes clustering
algorithm does not depend on the seed starting point of the clustering process, so we did not have to repeat the clustering process several times and take the average; rather, it was sufficient to perform the clustering process once. We also presented a new
method for choosing a common termination condition for clustering algorithms in the
process of clustering different session length groups. The new method showed that the
shortest session length must be used to determine the termination condition for other
session length groups.
In the learning phase, the clustering results, which were presented in the form of
examples, were used by two classifiers to generate rules. These rules were used in
predicting the clusters for prospective users and to describe the cluster itself. We
presented two classification approaches: the centroid approach and the rough set
approach BLEM2. The accuracy of predicting the clusters of prospective sessions was measured using the holdout accuracy estimator method. The results showed that the rough set approach, BLEM2, is more accurate in predicting prospective sessions’ clusters. For the cluster description, we based our measure on the length of the description; the rough set approach BLEM2 produced shorter description lengths for clusters. In summary, the rules generated using the rough set approach BLEM2 better predict and describe web sessions’ clusters.
In the results incorporation phase, we used the clustering results along with the
learned rules to present a deep conceptual description for a website usage. We presented
the most common tasks that were performed on the website. In addition, we presented which navigation paths were most commonly used to complete these tasks. We showed how
the clustering and learning results can be used in presenting suggestions to the website
designer to enhance the website structure to better assist users in their future browsing
experiences.
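Once sessions are clustered, identifying the most common navigation path within a cluster reduces to counting identical page sequences. A minimal sketch, assuming sessions are represented as lists of page names:

```python
from collections import Counter

def most_common_path(cluster):
    # A cluster is a list of sessions (each a list of page names).
    # Count identical paths and return the most frequent one with its count.
    counts = Counter(tuple(s) for s in cluster)
    path, freq = counts.most_common(1)[0]
    return list(path), freq

cluster = [["home", "register"], ["home", "register"], ["home", "catalog"]]
path, freq = most_common_path(cluster)
# The dominant path and its frequency suggest which task the cluster
# represents and which links the designer might make more prominent.
```

The dissertation's analysis works at the level of tasks rather than raw counts, so this is only the counting core of that idea.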
WUM research is an emerging field, and there remains much to be learned from the
interaction between users and different websites. Future work is needed to automate the
WUM process. This can be carried out by incorporating the generated rules into the web
log processing and clustering users into different clusters on the fly. Additional work
can also be done by dynamically adjusting the website structure according to the WUM
results.
Web sessions are a special case of string sequences, so, as future work, the
techniques presented in this dissertation, in particular the multi-dimensional sequence
comparison algorithm, the two clustering algorithms, and the learning approaches, can
be applied to other sequence-comparison research areas, such as bioinformatics, the
analysis of genomic data.
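The multi-dimensional sequence comparison algorithm itself is not reproduced here; the one-dimensional dynamic-programming edit distance that such methods generalize, which applies equally to page sequences and to genomic strings, can be sketched as:

```python
def edit_distance(a, b):
    # Classic dynamic-programming edit distance between two sequences:
    # minimum number of insertions, deletions, and substitutions
    # needed to turn `a` into `b`. Works on strings or lists of pages.
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of a[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n]

# edit_distance("kitten", "sitting") evaluates to 3
```

Because the same recurrence accepts lists of page names, the step from web sessions to biological sequences is a change of alphabet, which is the observation behind the proposed future work.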