TOWARD BETTER WEBSITE USAGE: LEVERAGING DATA MINING TECHNIQUES AND ROUGH SET LEARNING TO CONSTRUCT BETTER-TO-USE WEBSITES

A Dissertation Presented to The Graduate Faculty of The University of Akron

In Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy

Natheer Yousef Khasawneh

August, 2005

Dissertation Approved:
Advisor: Dr. John Durkin
Committee Members: Dr. John Welch, Dr. James Grover, Dr. Yueh-Jaw Lin, Dr. Yingcai Xiao, Dr. Chien-Chung Chan

Accepted:
Department Chair: Dr. Jose De Abreu-Garcia
Dean of the College: Dr. George Haritos
Dean of the Graduate School: Dr. George Newkome

ABSTRACT

When users browse a website, they usually try to accomplish a certain task, such as finding information, buying products, registering for classes, or attending classes online. The interaction between the users and the website can give web engineers insight into the most common user tasks performed on the website. They can learn how most users navigate the website to finish their tasks and what changes can be made to the website structure to make the completion of the common tasks easier and faster. Most web servers provide web interaction logs to track the interaction between the users and the website, but such logs are usually designed for debugging purposes and not for the analysis of the website.
So there is a need for a deeper conceptual method to analyze the interaction log and reveal information that can be used to enhance the website structure. In this work, different data mining techniques, along with a rough set learning approach, are presented to enhance website usage. A new active-user-based user identification algorithm was applied to the interaction log to group together records that belong to the same user. The algorithm's running-time complexity is one order lower than that of other user identification algorithms. Sessions for the identified users are found using an ontology-based session identification algorithm, which uses the website ontology to determine the sessions within each user's records. Different website sessions are then compared using a new Multidimensional Session Comparison Method (MSCM). MSCM takes into consideration several session dimensions, such as the pages visited, the time spent on the pages, and the session length. MSCM compares sessions more precisely than other well-known session comparison methods, such as the Sequence Alignment Method (SAM), the Multidimensional Sequence Alignment Method (MDSAM), and Path Feature Space. Using the comparison results from the MSCM, sessions are clustered by hierarchal and equivalence classes clustering algorithms. The clustering results are used by the rough set learning method and the centroid method to generate rules for both predicting and describing session clusters. Rules generated using the rough set learning approach predict and describe clusters better than rules generated using the centroid method. Each session cluster is considered one task, and the cluster centroid is the navigation path for completing that task. Common tasks, along with their navigation paths, are thus evaluated, and suggestions are then made to the website engineer to enhance the website structure to better serve website users.
This work shows how data mining techniques, along with rough set learning methods, can be used to enhance the website structure for better-to-use websites.

DEDICATION

To my parents…

ACKNOWLEDGEMENTS

All praises are due to ALLAH (GOD). Every good comes through HIM alone. So praises be to HIM. My profound thanks to my advisor Dr. John Durkin for his support, confidence, and understanding. My deep appreciation to Dr. C.-C. Chan for his constant support and insightful guidance, and to Dr. Tom Xiao for the good time I spent with him on the ODOT project, which was very helpful in my research. I want to thank Dr. John Welch for his proofreading and the time he spent with me teaching in the "Tools Lab." Dr. James Grover and Dr. Y.-J. Lin also gave me invaluable support throughout my research. My special thanks to the staff of the computer center at the University of Akron for providing the data for this research. My thanks also go to the faculty and staff of the Department of Electrical and Computer Engineering for their support. My heartfelt thanks to my brothers in Akron Majsid, Abdul Kareem, Abdul Raheem, Yahya, Hussien, Masoud, Musa, and Abdel Ghanee, for their prayers and support. To my dear friends in the USA, including Qasem, Luay, Qais, Ahmad, Mohammad, Hussein, Huthaifa, Faisal, Sami, Samer, and Majed, a special thanks for the happy time we spent together. My friends and family in Jordan, including my mother, Mrs. Fairouze Khasawneh, my father, Mr. Yousef Khasawneh, my sisters Fatemah, Hala, and Dr. Maha, and my brothers Dr. Basheer and Dr. Mohammad and their families, have been my strongest support system. This project surely would not have been accomplished without their love, care and DU'A (prayers).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. INTRODUCTION
   1.1 Motivation
   1.2 Previous work
   1.3 Proposed WUM system architecture
   1.4 Main contributions
   1.5 Research objective
   1.6 Structure of the dissertation

II. WEB LOG DATA PREPROCESSING FOR WEB USAGE MINING
   2.1 Introduction
   2.2 Previous work
   2.3 Data preprocessing architecture
      2.3.1 Data cleaning
      2.3.2 User identification
         2.3.2.1 User identification problem statement
         2.3.2.2 A trivial user identification algorithm
         2.3.2.3 The active user-based user identification algorithm
      2.3.3 Ontology-based session identification
      2.3.4 Data filtering
   2.4 Experimental results
      2.4.1 Data overview
      2.4.2 Data selection process
      2.4.3 Data cleaning results
      2.4.4 User identification results
      2.4.5 Session identification and data filtering results
   2.5 Modeling website parameters
      2.5.1 Distribution functions
      2.5.2 Analytical results
         2.5.2.1 Modeling number of records per user
         2.5.2.2 Modeling inactive user time
         2.5.2.3 Modeling recorded records per second
   2.6 Summary

III. MULTIDIMENSIONAL SESSIONS COMPARISON METHOD USING DYNAMIC PROGRAMMING
   3.1 Introduction
   3.2 Definitions
   3.3 Problem statement
   3.4 Related work
      3.4.1 Exact sequence matching
      3.4.2 Approximate one-dimension sequence matching
         3.4.2.1 Measuring difference distance
         3.4.2.2 Measuring similarity distance
   3.5 Previous work
      3.5.1 Limitations of the previous work
   3.6 Multidimensional session comparison method (MSCM)
      3.6.1 Assumptions
      3.6.2 Algorithm construction
      3.6.3 Algorithm description
      3.6.4 Time complexity analysis
   3.7 Experimental results and analysis
   3.8 Summary and conclusion

IV. ENHANCING WEBSITE STRUCTURE BY MEANS OF HIERARCHAL CLUSTERING ALGORITHMS AND ROUGH SET LEARNING APPROACH
   4.1 Introduction
   4.2 Clustering analysis
      4.2.1 Clustering algorithms
      4.2.2 Properties of agglomerative hierarchal clustering techniques
   4.3 Clustering web sessions
      4.3.1 Definitions
      4.3.2 Problem statement
      4.3.3 Hierarchal clustering algorithm
      4.3.4 Equivalence classes clustering algorithm
      4.3.5 Determining a common termination condition for different session lengths
      4.3.6 Ward's method improves determining a common termination condition
   4.4 Web sessions' classifiers
      4.4.1 The centroid approach
      4.4.2 Rough set approach
   4.5 Classifier accuracy estimator
   4.6 Experimental results
      4.6.1 Choosing the clustering termination conditions
      4.6.2 Classifier prediction accuracy results by rules generated from examples using the hierarchal clustering algorithm
      4.6.3 Classifier prediction accuracy results by rules generated from examples using the equivalence classes clustering algorithm
      4.6.4 Cluster description results
   4.7 Results incorporation
      4.7.1 Identifying the most common tasks
      4.7.2 Finding how many clicks are needed to finish each task
      4.7.3 Presenting suggestions to enhance the website structure
   4.8 Results discussion
   4.9 Summary and conclusion

V. SYSTEM IMPLEMENTATION
   5.1 Introduction
   5.2 Data preparation module
   5.3 Session identification module
   5.4 Clustering process module
   5.5 Results presentation and evaluation module
   5.6 Summary

VI. SUMMARY AND CONCLUSIONS

REFERENCES

LIST OF TABLES

Table
2.1 Selected dates for experimental results along with their major activity
2.2 The percentage of different file types in the selected data set
2.3 Request status for the records in the web record
2.4 Correlation coefficient for different models for the number of records per user probability
2.5 Correlation coefficient for different models for inactive user time probability
2.6 Correlation coefficient for different models for recorded records per second probability
3.1 Pairwise scores between different pages
3.2 Two sequences si and sj
3.3 MSCM algorithm major steps
3.4 Matrix used to compute the minimum edit distance, when only the zeroth column and row are filled in
3.5 Matrix used to compute the minimum edit distance, when all the cells are filled in
3.6 Distance measure between sessions using different methods
4.1 Representing clustering results in the form of examples
4.2 Number of clusters for different session lengths at different iterations
4.3 Percentage of the number of clusters from the initial number of clusters for different session lengths at different iterations
4.4 Example of a rule generated by a web sessions' classifier
4.5 Decision produced by the clustering algorithm
4.6 Certain rules learned from Table 4.5 using BLEM2
4.7 Inference engine testing examples
4.8 Inference engine results along with results from clustering
6.1 Mathematical models for three website parameters

LIST OF FIGURES

Figure
1.1 Proposed WUM system architecture
2.1 Data preprocessing architecture
2.2 Formal user identification problem statement
2.3 Trivial user identification algorithm
2.4 Active user-based user identification algorithm
2.5 Ontology-based session identification algorithm
2.6 Monthly record counts recorded in the web log
2.7 Active user-based user identification script
2.8 Histogram for sessions' lengths after session identification
2.9 Histogram for the sessions' lengths before filtering
2.10 Probability of the number of records per user
2.11 Probability of inactive user time in seconds
2.12 Probability of recorded records per second
4.1 Web usage classification and prediction workflow
4.2 Two well-separated clusters with an intermediate chain
4.3 Hierarchal clustering algorithm
4.4 Equivalence classes clustering algorithm
4.5 Percentage of the number of clusters from the initial number of clusters for a specific session length at different iterations using the average linkage method
4.6 Percentage of the number of clusters from the initial number of clusters for a specific session length at different iterations using Ward's method
4.7 Holdout classifier accuracy estimator
4.8 Percentage of the number of clusters from the initial number of clusters for different session length groups at different iterations
4.9 Average accuracy for different session lengths at the 100% number of clusters using examples from the hierarchal clustering algorithm
4.10 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the hierarchal clustering algorithm
4.11 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the equivalence classes clustering algorithm
4.12 Cluster description length for different session lengths using different classifiers
4.13 Seven most common tasks performed on the website
4.14 Sequence length distribution for "Class Search Detail"
5.1 Data flow diagram for the web usage mining system
5.2 Entity relation model for data preparation
5.3 Use case diagram for session identification
5.4 Session identification module user interface
5.5 Use case diagram for clustering process module
5.6 UML diagram for the clustering process
5.7 Sequence diagram for generating the dissimilarity matrix
5.8 Sequence diagram for finding clusters
5.9 Clustering module user interface
5.10 Dataflow diagram for the results presentation and evaluation module

CHAPTER I

INTRODUCTION

The World Wide Web has greatly impacted every aspect of our societies and our lives, ranging from information dissemination to communication, and from e-commerce to process management. By browsing through a website, users complete different tasks, such as buying products, registering for classes, and attending classes online. Web Usage Mining (WUM), a new field that analyzes this navigation process, has emerged in recent years. WUM is defined as applying data mining techniques to the log of interactions between users and a website [1]. Analysis of an interaction log file can provide useful information that helps a website engineer enhance the website structure in a way that makes the website easier and faster to use in the future. In this dissertation, we are interested in the clustering of web users' sessions in the context of web applications, such as web-based registration systems, web-based distance-education systems, e-commerce sites, and other web-based applications. Clustering web users' sessions means grouping users with similar navigation behaviors together. Our goal is to use the clustering results to identify dominant browsing behaviors, evaluate a website structure, and predict future users' browsing behaviors to better assist users in their future browsing experiences.

1.1 Motivation

When designing a web application, it is hard to predict how users will use the website to complete different tasks. Web designers can make a certain task easier to complete than other tasks by constructing the website structure in a certain way. After the website has been published online and users have interacted with it for a while, it is time to review certain decisions concerning the website structure.
Such decisions can be made by analyzing the interaction log between users and the website. A deep conceptual analysis of the interaction log is required to understand what the most common tasks performed on the website are, how the majority of users navigate the website to achieve these common tasks, and what changes can be made to the website structure to make the completion of the common tasks easier and faster. For example, if we have a registration website—where users can perform different tasks, such as checking grades, adding classes, dropping classes, and paying tuition fees—there is a need for a system to determine what the most common tasks are and how easily they can be achieved by users. So, if we find that, at a certain point in time, grade checking is the most common task, and that it takes users a long time to finish this task, the website engineer should be advised to enhance the website structure to make this task easier and faster.

1.2 Previous work

Available commercial web usage mining systems, such as Surfaid [2], Net Tracker [3], and WebTrends [4], give statistical information about the website, such as the average usage hits, the geographical distribution of users, and the most frequent page hit. These are statistical results rather than conceptual results. For example, concluding that a given number of users hit a website during a certain time gives no insight into the hidden usage patterns. Other published work on web usage mining used different data mining techniques, such as association rules [5], clustering, and classification. In our research, we focus on work that uses the clustering and classification data mining techniques. For example, Fu et al. [6] used the BIRCH [7] clustering algorithm to cluster users' sessions.
However, they did not discuss how the closeness between different sessions was defined, and they did not show how they chose the maximum difference allowed between sessions in the same cluster. Foss et al. [8] presented a novel clustering algorithm that clusters users' sessions. Their clustering algorithm did not require any input parameter from the users, such as the final number of clusters or the maximum difference allowed between sessions in the same cluster. However, the way they measured similarity did not consider the order of the pages. For example, they considered a session consisting of the pages A, B, and C identical to a session consisting of the pages A, B, C, and D, or to any other session that contains the pages A, B, and C.

1.3 Proposed WUM system architecture

As shown in Figure 1.1, the proposed WUM system is divided into four phases: preprocessing, dissimilarity measure, clustering analysis, and results incorporation and evaluation. In the preprocessing phase, the raw web logs are filtered of unrelated web requests, and the records that belong to the same user are grouped together in one set. In the dissimilarity measure phase, each user's records are divided into one or more sessions, and a dissimilarity matrix, which reflects the dissimilarity between different sessions, is constructed. In the clustering analysis phase, sessions with similar browsing behaviors are grouped together. In the last phase—the results incorporation and evaluation phase—clustering results are incorporated to predict future users' classes and to present suggestions to enhance the website structure to better serve website users in their future visits.
[Figure 1.1 Proposed WUM system architecture: web log data → Phase 1, preprocessing (data filtered; records grouped into users) → Phase 2, dissimilarity measure (users divided into one or more sessions; dissimilarity matrix constructed) → Phase 3, clustering analysis (sessions with the same browsing behavior grouped together) → Phase 4, results incorporation and evaluation (results evaluated and incorporated to enhance the website structure).]

1.4 Main contributions

In each WUM phase presented in Section 1.3, we make one or more new contributions to the WUM field. In the preprocessing phase, we present a fast active user-based user identification algorithm with time complexity O(n). For session identification, we present an ontology-based session identification algorithm that utilizes the website structure and its functionalities in identifying different sessions. In addition, we present extra cleaning steps, such as removing housekeeping pages, removing redundant pages, and grouping sessions with similar session lengths. We also present three mathematical models for the parameters on which our user identification algorithm depends. In the dissimilarity measure phase, we present a new Multidimensional Session Comparison Method (MSCM) using dynamic programming. Our method takes into consideration different session dimensions, such as the page list, the time spent on each page, and the length of the session. This is in contrast to other algorithms that treat sessions as sets of visited pages within a time period and do not consider the sequence of the click-stream visitation or the session length. In the web sessions clustering analysis phase, we present two clustering algorithms: a hierarchal clustering algorithm and an equivalence classes clustering algorithm. The equivalence classes clustering algorithm does not depend on the seed starting point of the clustering process.
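The multidimensional comparison idea can be illustrated with a small sketch. This is only an illustration of a dynamic-programming session distance that mixes a page dimension and a time dimension, not the dissertation's MSCM itself; the function name `session_distance` and the weights `w_page`, `w_time`, and `gap` are hypothetical choices made for the example.

```python
# Illustrative multidimensional session distance (an assumption-laden
# sketch, NOT the exact MSCM): a session is a list of (page, seconds)
# pairs, and a dynamic-programming edit distance charges both for page
# mismatches and for differences in the time spent on matched pages.

def session_distance(s1, s2, w_page=1.0, w_time=0.01, gap=1.0):
    """Edit distance between two sessions of (page, time) pairs."""
    n, m = len(s1), len(s2)
    # dp[i][j] = distance between the prefixes s1[:i] and s2[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p1, t1 = s1[i - 1]
            p2, t2 = s2[j - 1]
            # substitution cost mixes the page and time dimensions
            sub = (0.0 if p1 == p2 else w_page) + w_time * abs(t1 - t2)
            dp[i][j] = min(dp[i - 1][j] + gap,      # delete from s1
                           dp[i][j - 1] + gap,      # insert from s2
                           dp[i - 1][j - 1] + sub)  # match/substitute
    return dp[n][m]

a = [("home", 5), ("search", 30), ("detail", 60)]
b = [("home", 6), ("search", 28), ("detail", 55), ("pay", 20)]
print(session_distance(a, b))  # ≈ 1.08: small time differences plus one extra page
```

Because the time dimension enters the substitution cost, two sessions visiting the same pages but with very different dwell times are no longer treated as identical, which is the point of a multidimensional comparison.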
We also present a new method to determine the clustering parameter that, in turn, determines where the clustering algorithm should stop. In the results incorporation and presentation phase, we present a rough set approach to predicting future classes, and we present the results in a way that can be incorporated in the website server for predicting future users' classes. We also present an evaluation process that evaluates the accuracy of the predicted classes. Finally, we show how the results can be incorporated to enhance the website structure to better serve future website users.

1.5 Research objective

Our main objective in this research is to present a WUM system that uses clustering algorithms along with a rough set learning approach. This improved WUM system presents a deep conceptual understanding of the usage behavior for a website; it can be used by the website engineer to evaluate and enhance the website structure, and to predict "what the user was trying to do" to better assist users in their future browsing experiences. This should lead to websites that are easier and more convenient for users to navigate.

1.6 Structure of the dissertation

The rest of the dissertation is organized as follows. In Chapter 2, we present the data preprocessing phase. In Chapter 3, we present the MSCM method. In Chapter 4, we present the four main steps incorporated in the WUM system: the clustering algorithms, the presentation of the clustering results, the evaluation of the results, and the incorporation of the results to improve the website structure. In Chapter 5, we present an overview of the system implementation. In Chapter 6, we present conclusions drawn from the research and recommendations for future work.

CHAPTER II

WEB LOG DATA PREPROCESSING FOR WEB USAGE MINING

Web usage mining is "the application of data mining techniques to large Web data repositories in order to extract usage patterns" [1].
Web log files contain data that need some cleaning, since their format was meant for debugging purposes only [9]. In this chapter, we present new techniques for preprocessing web log data and for identifying unique users and sessions from the data. We present a fast active user-based user identification algorithm with time complexity O(n). The algorithm uses both an IP address and a finite user inactive time to identify different users in the web log. For session identification, we present an ontology-based session identification method that utilizes the website structure and functionalities to identify different sessions. In addition, we present extra cleaning steps such as removing housekeeping pages, removing redundant pages, and grouping sessions with similar session lengths. Finally, we present three mathematical models for the website parameters on which our active user-based user identification algorithm depends.

2.1 Introduction

Web usage mining is "the application of data mining techniques to large Web data repositories in order to extract usage patterns" [1]. Data mining techniques—such as association rule mining, sequential patterns, or clustering analysis—cannot be applied directly to raw web log file data, since the format of this data was designed for debugging purposes [9]. Five preprocessing steps have been identified [10]:
1. Data cleaning: This step removes irrelevant data, such as log records for images, scripts, help files, and cascading style sheets. Only data that is relevant to the mining process is kept.
2. User identification: This consists of grouping together the records for a given user. Log records are recorded sequentially as they arrive from different users (i.e., records for a specific user are not necessarily in consecutive order, since they can be separated by records from other users).
3. Session identification: This step divides the page accesses of each user into individual sessions.
4.
Path completion: This step determines whether there are important accesses that are not recorded in the access log due to caching on several levels.
5. Formatting: This step formats the data to be readable by data mining systems.
In this chapter, we present a detailed data preprocessing architecture that includes data cleaning, user identification, session identification, and data filtering. Our main contributions in this chapter include a user identification algorithm that runs with a time complexity of O(n), an ontology-based session identification algorithm, and three mathematical models for the website parameters on which our user identification algorithm depends. The rest of the chapter is organized as follows. In Section 2, we survey the previous work on web usage mining preprocessing techniques. In Section 3, we present our data preprocessing architecture along with a detailed description of each step. In Section 4, we present experimental results. Section 5 presents the mathematical models for different website parameters. In Section 6, we present the summary.

2.2 Previous work

Previous work on preprocessing web logs emphasized the caching problem, since caching produces incomplete web logs. One solution is to collect the data on the client side. For example, Shahabi [11] collected almost all the user interactions with the browser. Fenstermacher and Ginsburg [12] went beyond the browser interaction and recorded the interaction between the user and some other applications. Catledge and Pitkow [13] presented another system that collected data on the client side. These methods raise security issues, and most of them require special browsers and setups. Other works, such as [14] and [15], assumed that the filtered web server log is a good representation of web usage, meaning that there is no need for heuristic methods, such as the path completion process [10], to complete the sequences.
In the preprocessing step of their work [16], Yan et al. converted the information in user access logs into a vector representation. The vector representation combines the page access with the amount of interest a user shows in a page, which was calculated by counting the number of times the page was accessed. Path completion [10] identifies missing records in users' sessions using a heuristic method based on the web structure. Transactions were identified either by reference length, which is based on the time spent on the page, or by maximal forward reference, which is based on the first backward action (hitting the back button on the browser) after a series of forward actions (normal forward navigation). Other heuristic methods, such as time-oriented heuristics [17] and navigation-oriented heuristics [18], were used to identify different sessions. It can be concluded that previous work has not identified a specific algorithm for user identification; rather, it assumed that users' records are readily available in the website log.

2.3 Data preprocessing architecture

As shown in Figure 2.1, we identified four steps in the data preprocessing phase: data cleaning, user identification, session identification, and data filtering. The following subsections provide details on each step.

[Figure 2.1 Data preprocessing architecture: records from the web log pass through data cleaning, user identification, session identification, and data filtering before reaching the data mining technique.]

2.3.1 Data cleaning

Web logs are designed for debugging purposes, in that web accesses are recorded in the order they arrive [9]. Due to the connectionless nature of HTTP (i.e., each request is handled in a separate connection), web log records for a single user do not necessarily appear contiguously, since they can be interleaved with records from other users.
In addition, for each page component—such as an image, a cascading style sheet file, an HTML file, or a script file—a separate record is written to the web log file. Usually, each record in the web log file has the following standard format [19]:
• Remotehost, which is the remote hostname or its IP address;
• Logname, which is the remote logname of the user;
• Date, which is the date and time of the request;
• Request, which is the exact request line as it came from the user;
• Status, which is the HTTP status code returned to the client; and
• Byte, which is the content length of the document transferred.
Usually, for web mining purposes, the only interesting elements are the HTML pages and the scripting pages—such as JSP, ASP, or PHP pages—unless other file types play a navigation role in the web application and are part of the web structure. In the cleaning phase, the file types that are related to the navigation structure are kept, and other files are eliminated. The status field in the web log can be used to keep the successfully fulfilled requests and to delete the unsuccessful requests. Finally, the mining process can be limited to a certain time or date, so that only web traffic during that time and date is considered.

2.3.2 User identification

A user is defined as a unique client to the server during a specific period of time. The relationship between users and web log records is one-to-many (i.e., each user is identified by one or more records). Users are identified based on two assumptions:
1. Each user has a unique IP address while browsing the website. The same IP address can be assigned to other users after the user finishes browsing.
2. A user may stay in an inactive state for a finite time, after which it is assumed that the user has left the website.
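The cleaning rules just described—keep navigation-relevant file types and successfully fulfilled requests—can be sketched as follows. The record layout and the particular set of "navigation" extensions are illustrative assumptions; a real deployment would parse the server's configured log format and use the site's own page types.

```python
# A minimal sketch of the cleaning step from Section 2.3.1. The set of
# navigation-relevant extensions below is an assumption for illustration.

KEEP_EXTENSIONS = {".html", ".htm", ".jsp", ".asp", ".php"}

def is_navigation_record(url, status):
    """Keep successfully served pages that are part of the site structure."""
    if status != 200:                      # drop unsuccessful requests
        return False
    path = url.split("?")[0].lower()       # ignore the query string
    dot = path.rfind(".")
    ext = path[dot:] if dot != -1 else ""
    # extensionless paths are often scripted pages; keep them too
    return ext in KEEP_EXTENSIONS or ext == ""

records = [
    ("/index.html", 200), ("/logo.gif", 200),
    ("/style.css", 200), ("/register.jsp", 200),
    ("/missing.html", 404), ("/grades", 200),
]
kept = [u for u, s in records if is_navigation_record(u, s)]
print(kept)  # only the page requests survive cleaning
```

Images, style sheets, and failed requests are filtered out, leaving only the records that reflect actual navigation through the site.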
Next, we formally present the user identification problem statement, and then we present two different algorithms for user identification: a trivial algorithm and our new active-user-based one.

2.3.2.1 User identification problem statement

Figure 2.2 shows a formal description of the user identification problem statement. As stated earlier, the user identification algorithm must identify the user's records based on the assumption that all of a user's records have the same IP address and a finite inactive browsing time, β.

Given web log records R = ⟨r1, …, rk⟩, where k > 0 and k is the total number of records in the web log database.
∀r ∈ R, r is defined as
r = ⟨date_time, c_ip, s_ip, s_port, cs_method, url, url_query, status, s_agent⟩
Find users U = ⟨u1, …, uj⟩
∀u ∈ U, u is defined as u = ⟨c_ip, last_date_time, {rs, …, re}⟩
∀r ∈ u, r.c_ip = c_ip and r.date_time ≤ last_date_time + β at the time record r is added to the user u
where:
c_ip is the user's IP address
last_date_time is the date and time when the user accessed the last record
β is the maximum user idle time
rs is the first record the user accessed in a single visit to the website
re is the last record the user accessed in a single visit to the website

Figure 2.2 Formal user identification problem statement

2.3.2.2 A trivial user identification algorithm

Figure 2.3 shows the trivial user identification algorithm. The figure shows that the algorithm has two loops: an outer loop and an inner loop. The outer loop has time complexity of n, where n is the total number of records. The inner loop has time complexity of i, where i is the total number of current users. In the worst case, each user has one record, which leads to two loops, the outer and the inner, each of size n. Hence, the overall time complexity of the algorithm is O(n × n) = O(n²).
Assumption: n is the number of records in the web log
Define
R : website records
U : users' records
R(i) : ith web log record
Initialize U = ∅
for each record r ∈ R
    matched = false
    for each user u ∈ U
        if (r.c_ip = u.c_ip AND r.date_time ≤ u.last_date_time + β)
            u.r = u.r ∪ {r}
            if (r.date_time > u.last_date_time)
                u.last_date_time = r.date_time
            endif
            matched = true
        endif
    endfor
    if (NOT matched)
        u_new.c_ip = r.c_ip
        u_new.last_date_time = r.date_time
        u_new.r = {r}
        U = U ∪ {u_new}
    endif
endfor

Figure 2.3 Trivial user identification algorithm

2.3.2.3 The active-user-based user identification algorithm

The algorithm shown in Figure 2.4 is a modified version of the algorithm described in Section 2.3.2.2. We limited the inner loop search to the active users only. Active users are defined as users who have not exceeded the maximum inactive time; hence, they are considered to be still browsing the website, and more records are likely to be added to their navigation records. Time complexity analysis shows that there are two loops: an outer and an inner loop. The outer loop has time complexity of n, where n is the total number of records. The inner loop has time complexity of i, where i is the total number of active users. According to the assumptions given at the beginning of the algorithm, the maximum number of active users cannot exceed (m · k · t) in the worst case, where m is the number of records per user, k is the rate at which records are written to the web log, and t is the inactive browsing time. So, the algorithm breaks down to two loops: the outer loop of size n, and the inner loop of size (m · k · t), which is constant. Therefore, the overall complexity of the algorithm becomes O((m · k · t) · n) = O(const · n) = O(n).
Assumption
n : records in the web log
m : number of records per user
k : records/second written to the web log
t : inactive time in seconds for a user
β : maximum inactive time in seconds for a user
Define
U_A : active users; users who are still browsing
U_I : idle users; users who stopped browsing
Initialize U_A = {user created from the first record in R}
Initialize U_I = ∅
for each record r ∈ R do [starting at the 2nd record]
    matched = false
    for each user u ∈ U_A do
        if (r.date_time > u.last_date_time + β)
            U_A = U_A − {u}    // remove user from the active users list
            U_I = U_I ∪ {u}    // add user to the idle users list
        else if (r.c_ip = u.c_ip)
            u.r = u.r ∪ {r}
            if (r.date_time > u.last_date_time)
                u.last_date_time = r.date_time
            endif
            matched = true
        endif
    endfor
    if (NOT matched)
        u_new.c_ip = r.c_ip
        u_new.last_date_time = r.date_time
        u_new.r = {r}
        U_A = U_A ∪ {u_new}    // add the new user to the active users list
    endif
endfor
U_I = U_I ∪ U_A    // add the remaining active users to the idle users list

Figure 2.4 Active user-based user identification algorithm

2.3.3 Ontology-based session identification

A session is defined as the stream of mouse clicks whereby a user is trying to perform a specific task. In our research, we compare task-specific browsing behaviors. For example, assume two users, A and B, performed the following tasks. User A searched for classes and checked his grades, whereas user B paid his tuition fees and searched for classes. The user identification process identifies users A and B as two separate users with totally different behaviors. However, if we divide each user's visit into different sessions, user A will have two sessions, searching for classes and checking grades, and user B will also have two sessions, searching for classes and paying tuition. This shows that users A and B are partially similar in searching for classes, rather than totally different as the user identification process alone would suggest. We identify the different sessions in a single user visit using the website ontology.
We also assume that the website ontology is already available through methods of retrieving website ontologies such as those in [20-22]. The website ontology is defined as W = (P, L, F) where:
P : website pages,
L : website links,
F : website functionalities,
P = ⟨p1, …, pk⟩, where k is the number of pages in the website.
L is the set of links of the web application. Each link l = ⟨ps, pd⟩ is defined by two pages: the source page (ps), where the link starts, and the destination page (pd), where the link ends. The web functionalities F are defined as F = ⟨f0, f1, …, fn−1⟩, where ∀f ∈ F, f = ⟨ps, …, pe⟩. Each web functionality, f, consists of at least two pages: a start page and an end page. There can be zero or more pages between the start and end pages. The session identification algorithm divides the users identified in Section 2.3.2 into different sessions using the website functionalities. From the website functionalities, we can identify pages that are considered breaking points for a session, such as the sign-in or sign-out pages. Figure 2.5 shows the ontology-based session identification algorithm, where B is the set of breaking pages. The algorithm splits each user into one or more sessions and returns a final list of sessions S. The time complexity analysis of the algorithm shows two loops: an inner and an outer loop. The inner loop depends on the number of records per user, m, and the outer loop depends on the total number of users, j. It can easily be concluded that m · j = n, where n is the total number of records. So, the overall time complexity of the algorithm is O(n).
for each user u ∈ U do
    for each page p ∈ u do
        if (p ∈ B)
            split u at the location of p
            s_new = the first part of u
            u = the remaining part of u
            S = S ∪ {s_new}
        endif
    endfor
    S = S ∪ {u}
endfor

Figure 2.5 Ontology-based session identification algorithm

2.3.4 Data filtering

After we identify the users' sessions, filtering is done by removing the housekeeping pages. The housekeeping pages are pages that are necessary for the web application to run properly. They are not called directly by the user; rather, they are called internally by the requested page. These pages are identified by the website engineer and can be found using the website ontology. Removing the housekeeping pages can leave behind redundant pages, which can mislead the sequence comparison method we present in Chapter 3. To illustrate this, consider the following two sequences:
Sequence 1: p2 → p1 → p3 → p4 → p5 → p1 → p5 → p6
Sequence 2: p2 → p7 → p8 → p4 → p5 → p6
These two sequences look totally different. However, assuming that pages p1, p3, p7 and p8 are housekeeping pages and applying the following two-step filtering process, sequences 1 and 2 reduce to:
Step 1: Removing the housekeeping pages
Sequence 1 becomes p2 → p4 → p5 → p5 → p6
Sequence 2 becomes p2 → p4 → p5 → p6
Step 2: Removing the redundant pages
Sequence 1 becomes p2 → p4 → p5 → p6
Sequence 2 becomes p2 → p4 → p5 → p6
The outcome of the two-step filtering process shows how sequences that might look strikingly different can actually be similar.

2.4 Experimental results

For experimental results, we used data obtained from the University of Akron registration website log files for the period from October 2003 to September 2004. In this section, we present the results of each preprocessing step mentioned in the previous section.

2.4.1 Data overview

The total number of records recorded by the web server during this period was 28,294,229.
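The two-step filtering process above can be sketched directly: drop the housekeeping pages, then collapse consecutive duplicates. This is an illustrative sketch with assumed names, using the page labels from the example:

```python
def filter_session(pages, housekeeping):
    """Two-step filtering of a session's page sequence:
    (1) drop housekeeping pages, (2) collapse consecutive duplicates."""
    kept = [p for p in pages if p not in housekeeping]  # step 1
    out = []
    for p in kept:                                      # step 2
        if not out or out[-1] != p:
            out.append(p)
    return out
```

Applied to the example, both sequences reduce to the same filtered sequence p2 → p4 → p5 → p6.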
Each record in the web log file represents a page request processed by the web server. Figure 2.6 shows the traffic volume on the web server over the selected time period. The figure shows high web traffic during the months with a major activity, such as the beginning of registration, the release of final grades, or the beginning of a semester. For example, it is clear from the figure that there was a high traffic volume in January, which is just after the release of the final grades for the fall semester and just before the beginning of the spring semester.

Figure 2.6 Monthly record counts recorded in the web log (x-axis: month, from October 2003 to September 2004; y-axis: record count in thousands)

2.4.2 Data selection process

For experimental purposes, we selected the data records for the days on which there was major activity on the web server. Table 2.1 shows the selected dates along with the major activity. The total number of records for the selected dates was 1,582,292, which represents 5.6% of the total records.

Table 2.1 Selected dates for experimental results along with their major activity
Date | Major Activity
Monday, December 15, 2003 | Teachers upload students' grades
Tuesday, December 16, 2003 | Final grades due for fall semester 2003
Monday, May 10, 2004 | Teachers upload students' grades
Tuesday, May 11, 2004 | Final grades due for spring semester 2004
Friday, February 20, 2004 | Summer semester registration begins
Friday, October 24, 2003 | Spring semester registration begins
Friday, April 02, 2004 | Fall semester registration begins

2.4.3 Data cleaning results

Table 2.2 shows the percentage of each file type in the selected data set. Since we are interested in the scripting files that imply a direct request by the user, we kept the ASP and HTML file types and removed the other file types.
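The cleaning rules of Section 2.3.1, keeping the navigation-related file types and the successfully fulfilled (status 200) requests, can be sketched as follows. The record layout and function name are assumptions for illustration:

```python
KEPT_EXTENSIONS = {".asp", ".html"}  # navigation-related file types kept

def clean_log(records):
    """Keep only successful (status 200) requests for navigation-related
    file types; drop images, scripts, and failed requests."""
    cleaned = []
    for rec in records:  # rec: dict with "url" and "status" fields
        url = rec["url"].lower().split("?")[0]  # strip the query string
        if rec["status"] == 200 and any(url.endswith(e) for e in KEPT_EXTENSIONS):
            cleaned.append(rec)
    return cleaned
```

Records for images, stylesheets, and redirected or failed requests are discarded before the user identification step.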
Table 2.2 The percentage of different file types in the selected data set
File Type | Count | Percentage
HTML | 33,307 | 2.10%
DLL | 186 | 0.01%
No extension | 2,148 | 0.14%
PHP | 2 | 0.00%
HTM | 6,938 | 0.44%
TXT | 68 | 0.00%
ICO | 645 | 0.04%
JPG | 1,510 | 0.10%
ASP | 1,537,485 | 97.17%
JS | 1 | 0.00%
XML | 2 | 0.00%
Total | 1,582,292 | 100.00%

The second cleaning step was to remove the unsuccessful requests. These can be tracked using the status code in the log file described in Section 2.3.1. Table 2.3 shows the status code, the code description, the total number of records, and the percentage of records with each status code. We kept the records with an HTTP status code of 200 (OK) and removed the other records, leaving us with 76% of the total records.

Table 2.3 Request status for the records in the web log
Status Code | Code Description | Page Count | Percentage
206 | Partial content | 20 | 0.00%
207 | Multi-status | 3 | 0.00%
304 | Not modified | 17,826 | 1.13%
302 | Found (redirect) | 322,728 | 20.40%
400 | Bad request | 74 | 0.00%
200 | OK | 1,206,982 | 76.28%
403 | Forbidden | 17,491 | 1.11%
404 | Not found | 1,536 | 0.10%
501 | Not implemented | 10 | 0.00%
500 | Internal server error | 15,622 | 0.99%
Total | | 1,582,292 | 100.00%

2.4.4 User identification results

We loaded the selected data into a single table in an SQL database. Out of the six fields described in Section 2.3.1, we selected three (Remotehost, Date, and URL). Then, we ran the user identification script shown in Figure 2.7, which is based on the algorithm described in Section 2.3.2. To show the effectiveness of the active-user-based algorithm, we ran both the active-user-based and the trivial user identification algorithms on the same records, and we repeated the experiment with different web log sizes. The active-user-based algorithm shows much better performance than the trivial algorithm, even for small web log sizes. For example, for 100 web log records, the trivial algorithm took 527 seconds to identify the users' sequences, while the active-user-based algorithm took 8 seconds.
For the full log size (1,582,292 records), the trivial algorithm ran for about two days and was aborted by the operating system, apparently because of memory buildup, without producing any results, whereas the active-user-based algorithm took only three hours and 33 minutes to yield the results.

DECLARE @RecordId int
DECLARE @Date datetime
DECLARE @IPAddress varchar(255)
DECLARE @FoundUser_id int
DECLARE @NewUserId int

DECLARE UserCursor CURSOR FOR
    SELECT id, date, c_ip FROM guest.weblog ORDER BY date
OPEN UserCursor
FETCH NEXT FROM UserCursor INTO @RecordId, @Date, @IPAddress
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Delete old (more than 30 minutes old) records from active users
    DELETE FROM guest.active_users WHERE DATEADD(minute, 30, date) < @Date
    -- See if there is an active user with the same IP address
    SET @FoundUser_id = (SELECT TOP 1 user_id FROM guest.active_users
                         WHERE c_ip = @IPAddress)
    IF @FoundUser_id IS NOT NULL
    BEGIN
        -- If yes, update the last access time ...
        UPDATE guest.active_users SET date = @Date
        WHERE c_ip = @IPAddress AND user_id = @FoundUser_id
        -- ... and insert a new row into the users table
        INSERT INTO guest.users(user_id, id) VALUES(@FoundUser_id, @RecordId)
    END
    ELSE
    BEGIN
        -- If no, insert a new active user, get the new user id, ...
        INSERT INTO guest.active_users(c_ip, date) VALUES(@IPAddress, @Date)
        SET @NewUserId = @@IDENTITY
        -- ... and insert into the users table as well
        INSERT INTO guest.users(user_id, id) VALUES(@NewUserId, @RecordId)
    END
    FETCH NEXT FROM UserCursor INTO @RecordId, @Date, @IPAddress
END
CLOSE UserCursor
DEALLOCATE UserCursor

Figure 2.7 Active user-based user identification script

2.4.5 Session identification and data filtering results

Figure 2.8 shows the histogram of session lengths after session identification. The session identification was done based on two breaking pages: the sign-in and sign-out pages.
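The splitting step of Figure 2.5 can be sketched for a single user as follows. This is an illustrative sketch with assumed names; it adopts one possible split convention, in which the breaking page closes the current session:

```python
def split_into_sessions(user_pages, breaking_pages):
    """Split one user's page sequence into sessions at the breaking
    pages (e.g., the sign-in and sign-out pages), per Figure 2.5."""
    sessions, current = [], []
    for p in user_pages:
        current.append(p)
        if p in breaking_pages:   # a breaking page closes the session
            sessions.append(current)
            current = []
    if current:                   # the remainder forms the last session
        sessions.append(current)
    return sessions
```

Running this over every identified user yields the session list S whose length histogram is shown in Figure 2.8.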
Figure 2.8 Histogram of session lengths after session identification (x-axis: session length, 0 to 90; y-axis: frequency)

Grouping the sessions according to their length is important, since some learning algorithms require fixed session lengths. This will be illustrated further in Chapter 4. Figure 2.9 shows the histogram of session lengths before filtering. It is clear that most of the sessions have length 15 or less, and sessions with larger lengths are considered outliers. So, we grouped the sessions into 15 session groups, where each group contains sessions of the same length.

Figure 2.9 Histogram of session lengths before filtering (x-axis: session length, 0 to 60; y-axis: frequency)

2.5 Modeling website parameters

In this section, we discuss the statistical analysis methods applied to the experimental results described in Section 2.4. In the modeling, we emphasize the website parameters on which our active-user-based user identification algorithm, described in Section 2.3.2.3, depends.

2.5.1 Distribution functions

We selected three distribution functions as candidates to adequately represent the three website parameters on which our active-user-based user identification depends: the number of records per user, the inactive user time, and the records per second written to the web log. These three distribution functions are:
1. The power fit, which is given by
F(x) = a·x^b    (2.1)
2. The reciprocal quadratic, which is given by
F(x) = 1 / (a + b·x + c·x²)    (2.2)
3. The geometric fit, which is given by
F(x) = a·x·b^x    (2.3)

2.5.2 Analytical results

One measure of the "goodness of fit" is the correlation coefficient.
To explain the meaning of this measure, we first consider the standard error, which quantifies the spread of the data around the mean:
S_t = Σ_{i=1}^{n_points} (ȳ − y_i)²    (2.4)
where the average of the data points (ȳ) is simply given by
ȳ = (1 / n_points) Σ_{i=1}^{n_points} y_i    (2.5)
The quantity S_t measures the spread around a constant line (the mean), as opposed to the spread around the regression model. This is the uncertainty of the dependent variable prior to regression. We also consider the deviation from the fitting curve:
S_r = Σ_{i=1}^{n_points} (y_i − f(x_i))²    (2.6)
Note the similarity of equation 2.6 to the standard error of the estimate given in equation 2.4. This quantity measures the spread of the points around the fitting function. Thus, the improvement (or error reduction) due to describing the data in terms of a regression model can be quantified by subtracting the two quantities given in equations 2.4 and 2.6. Because the magnitude of the difference depends on the scale of the data, the difference is normalized to yield
r ≡ (S_t − S_r) / S_t    (2.7)
where r is the correlation coefficient. As the regression model better describes the data, the correlation coefficient approaches unity. For a perfect fit, the standard error of the estimate approaches S_r = 0 and the correlation coefficient approaches r = 1. Next we model the three website parameters: the number of records per user, the inactive user time, and the records per second written to the web log. These three parameters determine the speed of the active-user-based algorithm presented in Section 2.3.2.3.

2.5.2.1 Modeling the number of records per user

Figure 2.10 shows the fit to the probability of the number of records per user. Table 2.4 shows the correlation coefficient of each model for the number-of-records-per-user probability. The power fit model is the most appropriate for the data, since its correlation coefficient is the closest to one.
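The goodness-of-fit measure of equations 2.4 through 2.7 can be computed directly from the data points and the fitted values; a minimal sketch (the function name is illustrative):

```python
def correlation_coefficient(y, f):
    """Goodness of fit per equations 2.4-2.7: r = (S_t - S_r) / S_t,
    where S_t is the spread of the data about its mean and S_r is the
    spread about the fitted values f(x_i)."""
    n = len(y)
    ybar = sum(y) / n                                  # equation 2.5
    st = sum((ybar - yi) ** 2 for yi in y)             # equation 2.4
    sr = sum((yi - fi) ** 2 for yi, fi in zip(y, f))   # equation 2.6
    return (st - sr) / st                              # equation 2.7
```

A perfect fit gives S_r = 0 and hence r = 1, while a fit no better than the constant mean gives S_r = S_t and hence r = 0.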
Figure 2.10 Probability of the number of records per user (log-log axes: x-axis, number of records per user, 1 to 10,000; y-axis, probability, 10⁻⁸ to 10²; experimental data with power fit model)

Table 2.4 Correlation coefficient for different models for the number of records per user probability
Model | Correlation coefficient
Power fit | 0.41
Reciprocal quadratic | ≈ 0.0
Geometric fit | ≈ 0.0

2.5.2.2 Modeling inactive user time

Figure 2.11 shows the probability of the inactive user time in seconds. Table 2.5 shows the correlation coefficient of each model for the inactive user time probability. The reciprocal quadratic model is the most appropriate for the data, since its correlation coefficient is the closest to one.

Figure 2.11 Probability of inactive user time in seconds (log-log axes: x-axis, inactive user time in seconds, 1 to 1,000,000; y-axis, probability, 10⁻¹⁰ to 10⁰; experimental data with reciprocal quadratic model)

Table 2.5 Correlation coefficient for different models for inactive user time probability
Model | Correlation coefficient
Power fit | ≈ 0.0
Reciprocal quadratic | 0.92
Geometric fit | 0.11

2.5.2.3 Modeling recorded records per second

Figure 2.12 shows the probability of recorded records per second. Table 2.6 shows the correlation coefficient of each model for the recorded records per second probability. The geometric fit model is the most appropriate for the data, since its correlation coefficient has the value of one.
Figure 2.12 Probability of recorded records per second (x-axis: recorded records per second, 0 to 100; y-axis: probability, 0 to 0.2; experimental data with geometric fit)

Table 2.6 Correlation coefficient for different models for recorded records per second probability
Model | Correlation coefficient
Power fit | 0.89
Reciprocal quadratic | 0.42
Geometric fit | 1.00

2.6 Summary

In this chapter, we presented new techniques for preprocessing web log data, including identifying unique users and sessions. We presented a fast active-user-based user identification algorithm with a time complexity of O(n). For session identification, we presented an ontology-based session identification algorithm that uses the website structure to identify users' sessions. We showed that the user identification algorithm depends on three website parameters: the number of records per user, the inactive user time, and the number of records per second written to the web log. In the chapters to follow, the output of this preprocessing step will be used as input to different data mining techniques. In our research, we focus on the clustering algorithm along with different learning algorithms for presenting the clustering results.

CHAPTER III
MULTIDIMENSIONAL SESSIONS COMPARISON METHOD USING DYNAMIC PROGRAMMING

In this chapter, we present a new Multidimensional Sessions Comparison Method (MSCM) using dynamic programming. Our method takes into consideration different session dimensions, such as the page list, the time spent on each page and the length of each session. This is in contrast to other algorithms that treat sessions as sets of visited pages within a time period and do not consider the sequence of the click-stream visitation or the session length.
3.1 Introduction

The problem of sequence comparison, which is defined as measuring how similar two or more sequences are to each other, has attracted researchers in different fields such as molecular biology [23], speech recognition [24], string matching [25] and traffic analysis studies [26]. In molecular biology, macromolecules are treated as long sequences of subunits linked together sequentially, and comparing these sequences helps to answer important questions in biology. In speech recognition studies, speech is converted to a vector function of time, which is considered a continuous sequence; sequence comparison can then be used in applications such as recognizing an isolated word selected from a limited vocabulary. String matching represents each string as a sequence of characters, and sequence comparison can be used, for example, in spell checking in word processing applications. In the context of web usage mining, measuring similarities between web sequences, or simply sessions, is an important step in the clustering process, since clustering groups together similar web sessions. In this chapter, we introduce a new method for measuring dissimilarities between web sessions that takes into account the sequence of events in a click-stream visitation, the time spent on each event, and the lengths of the sessions. This method is used in our clustering method discussed in the next chapter. In the next section, we provide some necessary definitions. In Section 3, we present the problem statement. In Section 4, we present related work on sequence comparison. In Section 5, we present previous work done on session comparison. In Section 6, we present our new Multidimensional Session Comparison Method (MSCM). In Section 7, we present experimental results and analysis. In Section 8, we present a summary and conclusions.
3.2 Definitions

We define the list of sessions S = ⟨s1, …, sk⟩, where si = {⟨p1, …, pm⟩, ⟨t1, …, tm⟩} is defined by two lists of entities: the first list consists of m pages ⟨p1, …, pm⟩ and the second list consists of m time values ⟨t1, …, tm⟩. The list of pages represents the pages visited by the user, and the list of time values represents the time spent on each page. We also define the operator |s|, which returns the number of items in the sequence s, and two functions s.p(x) and s.t(x), which return the page and the time spent at position x, respectively.

3.3 Problem statement

The objective is to find a distance function, D, defined over S × S, where D(si, sj) is a numeric value that shows the extent to which sessions si and sj are similar.

3.4 Related work

In this section, we present the well-established algorithms that can be used in the context of session comparison. Most of these algorithms were developed in the field of string matching.

3.4.1 Exact sequence matching

In this method, the distance function is defined as a Boolean function that returns true when there is an exact match between si and sj and false otherwise:
D(si, sj) = { true, if |si| = |sj| and si.p(x) = sj.p(x) and si.t(x) = sj.t(x) ∀x ≤ |si|; false, otherwise }    (3.1)
In most sequence matching problems, the sequences do not match exactly; rather, they are similar to a certain extent. Consequently, the method of equation 3.1 returns no matches and does not recognize any similarities between sequences. For this reason, such a method is considered impractical and is not often used.

3.4.2 Approximate one-dimensional sequence matching

The idea behind approximate one-dimensional sequence matching is based on limiting the sequence definition to one list of entities. For web sessions, the sequence definition is limited to the list of pages.
The match is expressed by a numeric value that is greater than or equal to zero. A value of zero represents an exact match, and the value increases as the difference between the sequences increases. There are two major ways to compare the sequences: one is based on measuring the differences between the sequences' items, while the other is based on measuring the similarities of the sequences' items.

3.4.2.1 Measuring difference distance

The distance between sessions si and sj is defined by the number of edit operations needed to transform si into sj. These operations are insertion (I), deletion (D), replacement (R), or no operation (M). For example, consider two sequences s1 and s2 given as
s1 = ⟨p0, p2, p5, p3⟩
s2 = ⟨p0, p1, p2, p5, p4⟩
To transform s1 into s2, the following operations need to be applied:
M: no operation, since the p0's match at the first position of s1 and s2.
I: insertion of p1 into the second position of s1.
M, M: no operations, since the p2's and p5's match at the third and fourth positions.
R: replacement of p3 at the last position of s1 by p4.
The dynamic programming method [27] can be used to find the minimum number of operations. For sl and sk, D(i, j) is defined to be the edit distance (number of edit operations) to convert the first i characters of sl into the first j characters of sk. The recursion base conditions are
D(i, 0) = i    (3.2)
and
D(0, j) = j    (3.3)
The recurrence relation for D(i, j) is
D(i, j) = min[D(i − 1, j) + 1, D(i, j − 1) + 1, D(i − 1, j − 1) + t(i, j)]    (3.4)
where t(i, j) has the value 1 if and only if the ith character of sl and the jth character of sk are different, and the value 0 otherwise. In this approach, D(i, j) is first computed for the smallest possible values of i and j. Typically, this computation is organized with a dynamic programming table of size (n + 1) × (m + 1). The table holds the values of D(i, j) for all choices of i and j.
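The recurrence of equations 3.2 through 3.4 can be sketched directly; a minimal illustration (the function name is assumed):

```python
def edit_distance(sl, sk):
    """Dynamic-programming edit distance per equations 3.2-3.4:
    D[i][j] is the minimum number of insert/delete/replace operations
    needed to convert the first i items of sl into the first j of sk."""
    n, m = len(sl), len(sk)
    # Table of size (n + 1) x (m + 1); zeroth row/column from base cases.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                               # equation 3.2
    for j in range(m + 1):
        D[0][j] = j                               # equation 3.3
    for i in range(1, n + 1):                     # fill one row at a time
        for j in range(1, m + 1):
            t = 0 if sl[i - 1] == sk[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,        # deletion
                          D[i][j - 1] + 1,        # insertion
                          D[i - 1][j - 1] + t)    # match / replacement
    return D[n][m]
```

For the worked example above, one insertion and one replacement convert s1 into s2, so the edit distance is 2.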
In the table, the vertical axis represents sl and the horizontal axis represents sk. Because i and j begin at zero, the table has a zeroth row and a zeroth column, whose values are filled directly from the base conditions for D(i, j). After that, the remaining n × m subtable is filled in one row at a time, in order of increasing i; within each row, the cells are filled in order of increasing j. Except for the base cases, each D(i, j) is known once D(i − 1, j − 1), D(i, j − 1), and D(i − 1, j) have been computed, so the entire table can be computed one row at a time.

3.4.2.2 Measuring similarity distance

In a pairwise scores matrix, s(x, y) denotes the score obtained by aligning page x with page y. The alignment of si and sj is done by inserting spaces (or taking no action) between the pages. A pairwise scores matrix sets the score s(x, y) to be greater than or equal to zero when the pages are the same, and less than zero when they mismatch. The alignment value A is defined as
A = Σ_{i=1}^{l} s(S1′(i), S2′(i))    (3.5)
where l is the alignment length and S1′ and S2′ denote si and sj after alignment. For example, if we have P = ⟨p0, p1, p2, p3, p4⟩, we can define the pairwise scores between different pages as shown in Table 3.1. For the two aligned sequences si and sj described in Table 3.2, the alignment value, A, calculated using equation 3.5, gives the following result:
A = −2 + 2 − 1 + 0 − 3 − 2 − 4 + 2 = −8

Table 3.1 Pairwise scores between different pages (the matrix is symmetric; − denotes a space)
s | p0 | p1 | p2 | p3 | p4 | −
p0 | 1 | -2 | -4 | -2 | -2 | -1
p1 | | 2 | -3 | -1 | -3 | -2
p2 | | | 2 | -1 | -2 | -3
p3 | | | | 0 | -4 | -4
p4 | | | | | 1 | -1
− | | | | | | 0

Table 3.2 Two aligned sequences si and sj
si | p0 | p2 | − | p3 | p2 | p0 | p3 | p1
sj | p1 | p2 | p0 | p3 | − | p4 | − | p1

3.5 Previous work

In this section, we present an overview of the previous work on session comparison done for web usage mining.
Most of the similarity measures used to compare sessions in web usage mining have been based simply on intersections between sets, such as the cosine measure or the Jaccard coefficient [8]. For example, Foss et al. [8] applied the Jaccard coefficient, which essentially measures the degree of overlap between the pages visited in the compared sessions. This method does not take the sequence of events into consideration, so the algorithm does not differentiate between page p0 being visited before page p1 and p1 being visited before p0. The Path Feature Space [28] was used to represent all the navigation paths, and the similarity between two paths was measured by the path angle: each navigation path is represented as a vector, and the similarity between paths is the cosine similarity between the vectors. A non-Euclidean distance measure was presented using the sequence alignment method (SAM) [29-31], which is derived from the Levenshtein [32] approach and takes into account the weights of the different operations. The distance formula is defined by
D(si, sj) = min(D · wd + I · wi + R · wr)    (3.6)
where D, I and R are the numbers of deletion, insertion and replacement operations, respectively, needed to convert si into sj, and wd, wi and wr are the weights of these operations. Unlike the Levenshtein method, SAM works in two steps. First, it reorders the common elements so that they appear in the same order in both sequences. Second, it inserts the uncommon elements into both sequences so that they appear the same. The multidimensional sequence alignment method (MDSAM) [29] is a modified version of the sequence alignment method that finds the set of operations inducing the smallest possible sum of multidimensional operational costs. The full algorithm description can be found in [29].
3.5.1 Limitations of the previous work

The algorithms that use the Euclidean distance between vectors or the cosine measure have several limitations [33]:
1. The transformed space can be of very high dimension.
2. The original click stream is naturally a click sequence, which cannot be fully represented by a vector or a set of URLs in which the order of clicks is not considered.
3. The Euclidean distance has proven in practice to be unsuitable for measuring similarity in a categorical vector space.
The multidimensional algorithms do not solve the inter-attribute relationship problem, which is defined as the problem of considering the relationship between the attributes in different dimensions.

3.6 Multidimensional session comparison method (MSCM)

In this section, we present our new Multidimensional Session Comparison Method (MSCM). First, we present the assumptions on which we based our algorithm. We then construct the algorithm and present a detailed description of it. Finally, we present a time complexity analysis of the algorithm.

3.6.1 Assumptions

We assume that three edit operations are allowed to convert one session into another: deletion, insertion and swap. The deletion operation D(x) is defined as deleting the event in the session at position x. The insertion operation I(x) is defined as inserting a new event into the session at position x. The swap operation S(x) is defined as swapping the events in the session at positions x and x + 1. We make the following assumptions about the sessions and our algorithm:
1. Of the two dimensions, the page list and the time list, the page list is the primary dimension and the time list is the secondary dimension.
2. The navigation behavior of the user is determined mainly by the primary dimension, the page list.
3. The first dimension, the page list, is a nominal attribute, and the other dimension, the time value list, is a continuous attribute.
So, the difference between pages p1 and p2 is considered the same as the difference between pages p1 and p100, but the difference between t = 1 and t = 7 is not the same as the difference between t = 1 and t = 100.
4. The distance d_mscm(s_i, s_j) between two sessions s_i and s_j is directly proportional to the minimum number of edit operations needed to convert s_i to s_j.
5. The distance between two sequences is inversely proportional to the maximum length of the compared sequences.
6. The weight of the swap operation is directly proportional to the time spent on the first page.

3.6.2 Algorithm construction

Based on the first four assumptions, we present a one-dimensional distance function defined by

d_mscm(s_i, s_j) ∝ min(D · w_d + I · w_i + S · w_s)    (3.7)

where d_mscm is the edit distance based on MSCM; D, I and S are the numbers of deletion, insertion and swap operations, respectively; and w_d, w_i and w_s are the weights of the deletion, insertion and swap operations, respectively. Based on the fifth assumption, the total distance is divided by the maximum length of the two sequences:

d_mscm(s_i, s_j) ∝ min(D · w_d + I · w_i + S · w_s) / max(|s_i|, |s_j|)    (3.8)

where |s| is the length of the sequence. Based on the final assumption, the weight of the swap operation is multiplied by the Heaviside step function Φ(t), defined as

Φ(t) = 0 if t ≤ 0, and 1 if t > 0    (3.9)

where t is the time spent on the page on which the swap operation is performed. The distance is then given as

d_mscm(s_i, s_j) = min(D · w_d + I · w_i + S · w_s · Φ(t)) / max(|s_i|, |s_j|)    (3.10)

The distance function defined in equation 3.10 takes into consideration the two dimensions in the web session and gives a proper solution to the inter-attribute relationship, unlike other algorithms in which the inter-attribute relationship is addressed by computing a trajectory between the first and the second attributes.
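The distance of equation 3.10 can be sketched with a dynamic-programming edit distance over the page lists. This is a minimal sketch, not the dissertation's implementation: it assumes unit weights by default, treats an adjacent swap as free when the recorded time on the first swapped page is zero (the Φ(t) term), and also allows a unit-cost replacement, following the cost scheme of Table 3.3.

```python
def mscm_distance(pages_a, times_a, pages_b, w_d=1.0, w_i=1.0, w_s=1.0):
    """Sketch of equation 3.10: edit distance with deletion, insertion and
    adjacent swap over the page lists, normalized by the longer session.
    The swap cost is suppressed when the time on the first swapped page
    of session a is zero (Heaviside weighting)."""
    n, m = len(pages_a), len(pages_b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i * w_d
    for j in range(m + 1):
        d[0][j] = j * w_i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if pages_a[i - 1] == pages_b[j - 1] else 1.0
            best = min(d[i - 1][j] + w_d,        # delete from session a
                       d[i][j - 1] + w_i,        # insert into session a
                       d[i - 1][j - 1] + cost)   # match / replace (Table 3.3)
            if (i > 1 and j > 1 and pages_a[i - 1] == pages_b[j - 2]
                    and pages_a[i - 2] == pages_b[j - 1]):
                swap = w_s if times_a[i - 2] > 0 else 0.0  # Φ(t) weighting
                best = min(best, d[i - 2][j - 2] + swap)
            d[i][j] = best
    return d[n][m] / max(n, m, 1)
```

With the example sessions of Section 3.7, two sessions that differ only by swaps whose first page has a recorded time of zero come out at distance 0, while completely disjoint two-page sessions come out at the maximum value of 1.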
Also, the distance calculated in equation 3.10 is considered to be the absolute distance, since it is normalized by the maximum length of the sequences.

3.6.3 Algorithm description

The algorithm used for finding the minimal edit operations is based on the dynamic programming in [34, 35]. Table 3.3 summarizes the MSCM algorithm's major steps for finding the minimum number of edit operations.

Table 3.3 MSCM algorithm major steps

Step  Description
1     Set n to be the length of s1. Set m to be the length of s2. If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns.
2     Initialize the first row to 0..n. Initialize the first column to 0..m.
3     Examine each character of s1 (i from 1 to n).
4     Examine each character of s2 (j from 1 to m).
5     If s1[i] equals s2[j], the cost is 0. If s1[i] doesn't equal s2[j], the cost is 1.
6     Set cell d[i,j] of the matrix equal to the minimum of:
      a. The cell immediately above plus 1: d[i-1,j] + 1.
      b. The cell immediately to the left plus 1: d[i,j-1] + 1.
      c. The cell diagonally above and to the left plus the cost: d[i-1,j-1] + cost.
7     After the iteration steps (3, 4, 5, 6) are complete, the distance is found in cell d[n,m].

For example, suppose we have two web sessions:

s0 = {〈p14, p11, p6, p10, p1, p11, p8〉, 〈1,2,4,3,2,4,5〉}, and
s1 = {〈p15, p6, p7, p10, p7, p1, p11〉, 〈2,4,6,3,1,3,5〉}

Table 3.4 shows the matrix used to compute the minimum edit distance between sequences s0.p and s1.p.

Table 3.4 Matrix used to compute the minimum edit distance, when only the zeroth column and row are filled in

  d_mscm
  j:        0    1    2    3    4    5    6    7
  i              p14  p11  p6   p10  p1   p11  p8
  0         0    1    2    3    4    5    6    7
  1  p15    1
  2  p6     2
  3  p7     3
  4  p10    4
  5  p7     5
  6  p1     6
  7  p11    7

The (i, j) cell in Table 3.4 stands for the minimum number of edit operations needed to convert the first j pages of sequence s0 into the first i pages of sequence s1.
Taking into consideration the zero-length sequence, the table size is the sequence length plus 1. So for the example in Table 3.4, we have a table of size (7 + 1) × (7 + 1). In the table, the vertical axis represents s1 and the horizontal axis represents s0. The values in row zero and column zero are filled in directly from the base conditions in equations 3.2 and 3.3, respectively. After that, the remaining cells in the table are filled in one row at a time, in order of increasing i, using the recurrence equation 3.4; within each row, the cells are filled in order of increasing j. Table 3.5 shows the matrix after all cells are filled in. The minimum edit distance is the value in the last cell, which in our example equals 5.

Table 3.5 Matrix used to compute the minimum edit distance, when all cells are filled in

  d_mscm
  j:        0    1    2    3    4    5    6    7
  i              p14  p11  p6   p10  p1   p11  p8
  0         0    1    2    3    4    5    6    7
  1  p15    1    1    2    3    4    5    6    7
  2  p6     2    2    2    2    3    4    5    6
  3  p7     3    3    3    3    3    4    5    6
  4  p10    4    4    4    4    3    4    4    5
  5  p7     5    5    5    5    4    4    5    6
  6  p1     6    6    6    6    5    4    5    6
  7  p11    7    7    6    7    6    5    4    5

3.6.4 Time complexity analysis

The time complexity of the dynamic-programming table for computing the minimum edit distance is reported to be O(m · n) [34], where m is the length of the first session and n is the length of the second session. This is straightforward, since it takes (n · m) steps to fill the table for sessions of lengths n and m, respectively.

3.7 Experimental results and analysis

In this section we present a few experiments that show how MSCM provides better results than other session comparison methods. We compare the results with three other methods: the Euclidean-based distance (Path Feature Space [28]) method, the sequence alignment method (SAM) [30], and the multidimensional sequence alignment method (MDSAM) [29]. More experimental results are presented in the next chapter as the output of the clustering algorithm that adopts MSCM as its distance function.
For our experimental results, assume we have the following sessions:

s0 = {〈p15, p7, p6, p10, p1, p7, p11〉, 〈3,0,2,4,0,4,5〉}
s1 = {〈p15, p6, p7, p10, p7, p1, p11〉, 〈3,4,1,4,1,1,6〉}
s2 = {〈p14, p11〉, 〈4,5〉}
s3 = {〈p5, p7〉, 〈5,4〉}
s4 = {〈p15, p7, p6, p10, p1, p7, p11〉, 〈3,2,1,4,3,4,5〉}
s5 = {〈p15, p7, p6, p10, p1, p3, p1〉, 〈2,3,2,4,3,7,5〉}

Table 3.6 summarizes the distance between different sessions using different session comparison methods.

Table 3.6 Distance measure between sessions using different methods

Sessions   Path Feature Space     SAM   MDSAM   MSCM
s0, s1     √(5² + 4²) = 6.4       4     9       0
s2, s3     √(2² + 2²) = 2.8       2     4       1
s4, s5     √(2² + 3²) = 3.6       2     5       0.29

Sessions s0 and s1 are almost the same except for the swaps between pages p6, p7 and pages p1, p7. However, the time spent on p7 and p1 in s0 is zero. Such a recorded time of zero arises because the time resolution for recording a web record is one second. Therefore, we cannot tell which page was loaded first, so the order of these pages should be ignored. The other three algorithms, even the multidimensional ones, do not recognize this, and all of them indicate a difference between the sessions where there is actually no difference, as MSCM indicates. As for sessions s2 and s3, it is obvious that they are not similar at all. The other three methods measure the difference based on the edit distance, and they give results that do not reflect the complete mismatch between the sequences. In contrast, MSCM computes the absolute difference, which is the edit distance divided by the maximum session length. Thus, for sessions s2 and s3, MSCM returns the value of one, which indeed reflects the complete mismatch. The usefulness of measuring the absolute value can also be seen when we compare the results for s2 and s3 against those for s4 and s5.
The three other methods show almost the same degree of difference when comparing sessions s2 and s3 as when comparing sessions s4 and s5, but this is not true at all: sessions s4 and s5 are almost the same except for the last two pages, while sessions s2 and s3 are completely different. The MSCM algorithm, on the other hand, recognizes that the degree of difference between sessions s2 and s3 (returning a value of 1) is not the same as the degree of difference between sessions s4 and s5 (returning a value of 0.29).

3.8 Summary and conclusion

In this chapter we presented a new Multidimensional Session Comparison Method (MSCM), which is based on dynamic programming. Unlike other methods, MSCM takes into consideration other dimensions in the session, such as the time spent on the page and the total session length. The method showed more accurate results in comparing web sessions than other well-known methods, such as the Sequence Alignment Method (SAM), the Multidimensional Sequence Alignment Method (MDSAM), and Path Feature Space. The output of MSCM is presented in the form of a dissimilarity matrix, which can be used by different clustering techniques, such as the hierarchal, k-means, and equivalence classes clustering algorithms.

CHAPTER IV

ENHANCING WEBSITE STRUCTURE BY MEANS OF HIERARCHAL CLUSTERING ALGORITHMS AND ROUGH SET LEARNING APPROACH

4.1 Introduction

In this chapter, we present a new way to enhance the website structure by means of hierarchal clustering algorithms and a rough set learning approach. Figure 4.1 shows the system workflow. The workflow starts by clustering the web sessions into different clusters using the dissimilarity matrix. The clustering results are then presented in the form of examples as shown in Table 4.1, where each web session along with its clustering result represents one example in the examples table.

Table 4.1 Representing clustering results in a form of examples

Example No.   1st Page   2nd Page   3rd Page   Clustering Result
1             p0         p1         p4         C1
2             p2         p3         p5         C2
…             …          …          …          …
N             p2         p3         p5         Ck

The examples are then divided into two independent sets. The first set is used by different classifiers to learn rules that describe the system; the rules are presented in the if-then format. For example, the following two rules can be learned from the first two examples in Table 4.1:

if 1st Page = p0 and 2nd Page = p1 and 3rd Page = p4 then Cluster = C1
if 1st Page = p2 and 2nd Page = p3 and 3rd Page = p5 then Cluster = C2

The second set, along with the rules learned from the first set, is used in the inference engine to estimate the accuracy of the classification process. The clustering results along with the generated rules are then incorporated to enhance the structure of the website.

Web Sessions → Dissimilarity Matrix → Clustering Process → Examples → Classifiers → Rules → Inference Engine → Estimation Results → Results Incorporation
Figure 4.1 Web usage classification and prediction workflow

The rest of the chapter is organized as follows. In Section 2, we present an overview of clustering analysis. In Section 3, we present two algorithms for clustering web sessions. In Section 4, we present two different classifiers to describe and predict a web session's classes. In Section 5, we present a method to estimate the accuracy of the different classifiers. In Section 6, we present experimental results. In Section 7, we show how the results are incorporated in enhancing the website structure. In Section 8, we discuss the results. In Section 9, we present a summary and conclusions.

4.2 Clustering analysis

Clustering is a useful technique for grouping objects such that objects within a single group have similar characteristics, while objects in different groups are dissimilar. In the context of web usage mining, the objects are users' sessions. Each session contains the pages visited by the user at a certain time.
Clustering can be used to group the users such that users with the same browsing behavior fall in a single cluster. For example, one cluster may consist predominantly of freshman students who register for classes, while another may consist of professors who upload their classes' grades. The clusters can then be used to identify dominant browsing behaviors, evaluate the website structure and predict users' browsing behavior. Clustering web usage sessions is an example of clustering in which the objects are of a non-numeric data type, such as a nominal or categorical data type.

4.2.1 Clustering algorithms

Clustering algorithms can be classified into partitional clustering and hierarchal clustering [36, 37]. Partitional clustering algorithms divide n objects into k clusters that satisfy two conditions: (1) each cluster contains at least one object, and (2) each object belongs to exactly one cluster. Equation 4.1 shows one of the commonly used criterion functions:

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} d(x, m_i)    (4.1)

In equation 4.1, m_i is the centroid of cluster C_i, while d(x, m_i) is the Euclidean distance between x and m_i, defined in equation 4.2:

d(x, m_i) = ( Σ_{j=1}^{d} (x_j − m_{i,j})² )^{1/2}    (4.2)

The criterion function E attempts to minimize the distance of every object from the mean of the cluster to which the object belongs. One of the common approaches to minimizing the criterion function is the iterative k-means method. While the use of the k-means method can yield satisfactory results for numeric attributes, it is not appropriate for data sets with categorical attributes [38], as is the case with web sessions. Hierarchical clustering algorithms work by grouping data objects into a tree of clusters. A hierarchical method can be classified as agglomerative or divisive.
Agglomerative hierarchical clustering, which is the most common strategy, starts by placing each object in its own cluster and then merges similar objects together until one cluster holds all the objects or some other termination condition is met. Divisive hierarchical clustering starts with all objects in one cluster and divides them up until each object forms a cluster by itself or some other termination condition is met. At the first step of the agglomerative method, the dissimilarity matrix can be used to determine how close the objects are to one another. Once the first step is completed and the first level of clusters is generated, there is a need to compare the clusters rather than the objects. Next, we present the five most common techniques to measure the difference between clusters:

• Single linkage: the distance between any two clusters is the shortest distance from any object in one cluster to any object in the other [39].
• Complete linkage: the distance between any two clusters is the farthest distance from any object in one cluster to any object in the other.
• Average linkage: the distance between any two clusters is the average distance from all objects in one cluster to all objects in the other.
• Ward's method: the distance between two clusters is the sum of squares of the distances between all objects in both clusters [40].
• Centroid method: the distance between two clusters is the distance between their centroids.

4.2.2 Properties of agglomerative hierarchal clustering techniques

The single linkage method tends to exhibit the chaining property [41]: as shown in Figure 4.2, chaining produces two well separated clusters joined by an intermediate chain of data.

Figure 4.2 Two well separated clusters with intermediate chain

Previous empirical investigations indicate that the average linkage method and Ward's method have superior performance.
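The first three linkage criteria above can be sketched as simple functions of a precomputed dissimilarity matrix (a minimal sketch; the function names and the matrix-of-lists representation are illustrative):

```python
def single_linkage(delta, cluster_a, cluster_b):
    """Shortest distance from any object in one cluster to any in the other."""
    return min(delta[i][j] for i in cluster_a for j in cluster_b)

def complete_linkage(delta, cluster_a, cluster_b):
    """Farthest distance from any object in one cluster to any in the other."""
    return max(delta[i][j] for i in cluster_a for j in cluster_b)

def average_linkage(delta, cluster_a, cluster_b):
    """Average distance over all cross-cluster object pairs."""
    total = sum(delta[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

# A toy 3-object dissimilarity matrix: objects 0 and 1 are close, 2 is far.
delta = [[0.0, 0.1, 0.9],
         [0.1, 0.0, 0.8],
         [0.9, 0.8, 0.0]]
print(single_linkage(delta, [0, 1], [2]))   # 0.8
print(complete_linkage(delta, [0, 1], [2])) # 0.9
```

Note how the three criteria can disagree on the same pair of clusters, which is why the choice of linkage changes the shape of the resulting dendrogram.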
For example, Cunningham and Ogilvie [42] compared several hierarchal techniques and found that the average linkage method performed most satisfactorily for the data sets they considered. Kuiper and Fisher [43] investigated six hierarchal techniques and found that Ward's method classifies the data very well. Finally, Blashfield [44] compared the single linkage method, the complete linkage method, the average linkage method, and Ward's method using a quantifying statistical method explained in [45], and found that Ward's method performed very well compared to the other methods. Later, in Section 4.3.6, we explain why we use Ward's method over the average linkage method.

4.3 Clustering web sessions

Clustering web sessions means grouping similar website usage behaviors together. The clustering results can be used to identify dominant browsing behaviors, evaluate the website structure and predict users' browsing behavior. We present two clustering algorithms: hierarchal clustering and equivalence classes clustering. Both algorithms can deal with nominal attributes and with sessions of different lengths, and both can adopt different dissimilarity matrices. Unlike those of the hierarchal clustering algorithm, the clusters generated by the equivalence classes clustering algorithm do not depend on the seed where the clustering process starts. In this section, we first present several definitions along with a formal description of the problem statement. At the end of the section, we present the two proposed algorithms along with their running time complexity analyses.

4.3.1 Definitions

The set of web sessions S is defined as S = 〈s1, …, sk〉, where k is the number of sessions. Each web session si is defined as si = 〈pi.1, …, pi.n〉, where n is the number of pages in session si and pi.k is the k-th page in session si. The set of web session clusters C is defined as C = 〈c1, …, cm〉, where m is the number of clusters.
Each cluster ci is defined as ci = 〈si.1, …, si.l〉, where l is the number of sessions in cluster ci. The dissimilarity function δi,j is defined over S × S, where 0 ≤ δi,j ≤ 1; a value of one implies perfect dissimilarity, and a value of zero implies perfect similarity. A detailed description of this function was presented in Chapter 3. The cluster centroid cei is defined as the session in the middle of cluster ci. The centroid defined in equation 4.3 uses Ward's method in defining the minimum distance; in Section 4.3.6, we explain why we use Ward's method over other methods, such as the average linkage method, for this purpose.

ce_i = s_i.k  where  s_i.k = min( Σ_{j=1}^{n} (δ_{j,k})² ),  ∀ s_i.k ∈ c_i    (4.3)

We also define an overloaded version of the dissimilarity function of equation 3.10 that applies to clusters. The overloaded version of δi,j shown in equation 4.4 accepts clusters as inputs and uses the centroid method in determining the difference between clusters:

δ_{c_i, c_j} = δ_{ce_i, ce_j}    (4.4)

Finally, we define the threshold λ as the maximum difference allowed between sessions in the same cluster.

4.3.2 Problem statement

Given the web sessions S, the dissimilarity function δi,j, and the threshold value λ, the objective is to find the web session clusters Cλ such that for every ck ∈ Cλ it is true that δi,j ≤ λ for all si, sj ∈ ck.

4.3.3 Hierarchal clustering algorithm

Figure 4.3 shows the clustering algorithm used to cluster the web sessions. Initially, each session is placed in a cluster by itself and the threshold value λ is initialized to zero. Then the dissimilarity value between all pairs of clusters is checked. If the dissimilarity value is less than or equal to the threshold value, the two clusters are merged to form a new cluster. The new number of clusters is then checked.
If it is found to be less than or equal to the minimum number of clusters, the algorithm exits the loop and the set of clusters is returned (in Section 4.3.5 we discuss how to choose the minimum number of clusters). Otherwise, the threshold value λ is incremented and the process is repeated.

Initialize C := each cluster has one session
Initialize λ := 0
for i := 1 to |C| do {
    for j := i + 1 to |C| do {
        if δ_{i,j} ≤ λ then merge_clusters(c_i, c_j)
    }
    increment λ
    if |C| ≤ minimum number of clusters then break
}
return C

Figure 4.3 Hierarchal clustering algorithm

The time complexity analysis of the algorithm in Figure 4.3 shows that the algorithm has two loops, an inner and an outer loop. The outer loop has a worst case of n cycles when the initial number of clusters is n, while the inner loop has a worst case of (n − 1) cycles. In our work, we assume that the dissimilarity function is provided in the form of a matrix, where element ij contains the dissimilarity between sessions i and j. Therefore, looking up a dissimilarity value takes a constant time c that is independent of the initial number of clusters n. Hence the overall complexity of the algorithm is O(n(n − 1) · c) = O(cn²) = O(n²).

4.3.4 Equivalence classes clustering algorithm

We define an equivalence relation called belongs to, ~, on C × C that satisfies the reflexive, symmetric and transitive properties. The reflexive property implies that ci ~ ci for all ci ∈ C. The symmetric property implies that if ci ~ cj then cj ~ ci for all ci, cj ∈ C. The transitive property implies that if ci ~ cj and cj ~ ck then ci ~ ck for all ci, cj, ck ∈ C. The first two properties (reflexive and symmetric) are satisfied by the hierarchal algorithm defined in Section 4.3.3. So, to achieve the equivalence relation defined earlier, we modify the algorithm described in Figure 4.3 to satisfy the equivalence relation.
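The hierarchal algorithm of Figure 4.3 can be sketched in runnable form. This is an illustrative sketch, not the dissertation's code: the λ increment `step` is an assumed parameter, and the cluster-to-cluster dissimilarity is stood in for by the average cross-cluster dissimilarity rather than the centroid-based overloaded δ of equation 4.4.

```python
def hierarchal_clustering(delta, min_clusters, step=0.05):
    """Sketch of the Figure 4.3 algorithm: start with singleton clusters and
    repeatedly merge pairs whose dissimilarity is within a growing threshold
    λ, until at most `min_clusters` clusters remain.
    `delta[i][j]` is the session dissimilarity (0..1); `step` is the amount
    by which λ is incremented on each pass (an assumption of this sketch)."""
    clusters = [[i] for i in range(len(delta))]
    lam = 0.0
    while len(clusters) > min_clusters:
        merged = []
        for cluster in clusters:
            for target in merged:
                # Average cross-cluster dissimilarity as the cluster distance.
                d = sum(delta[a][b] for a in cluster for b in target)
                d /= len(cluster) * len(target)
                if d <= lam:
                    target.extend(cluster)  # merge into an earlier cluster
                    break
            else:
                merged.append(cluster)      # no merge partner at this λ
        clusters = merged
        lam += step
    return clusters
```

Because every δ value is at most 1, λ eventually exceeds every pairwise distance and the loop is guaranteed to terminate for any `min_clusters` of at least one.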
Figure 4.4 shows the equivalence classes clustering algorithm that satisfies the three properties of the equivalence relation mentioned earlier. The algorithm is a modified version of the one shown in Figure 4.3, in which two clusters are merged only if all pairs of sessions in both clusters have a dissimilarity value less than or equal to the threshold value λ. The time complexity analysis for this algorithm is the same as that for the hierarchal clustering algorithm shown in Figure 4.3, except for an extra third inner loop, which increases the complexity by one degree, to O(n³).

Initialize C := each cluster has one session
Initialize λ := 0
for i := 1 to |C| do {
    for j := i + 1 to |C| do {
        boolean merge := false
        if δ_{i,j} ≤ λ then merge := true
        for k := 1 to |C| do {
            if ( (δ_{i,k} ≤ λ and δ_{j,k} > λ) or (δ_{j,k} ≤ λ and δ_{i,k} > λ)
                 or (δ_{i,k} > λ and δ_{j,k} > λ and δ_{j,k} ≠ δ_{i,k}) ) {
                merge := false
                break
            }
        }
        if (merge) then merge_clusters(c_i, c_j)
    }
    increment λ
    if |C| ≤ minimum number of clusters then break
}
return C

Figure 4.4 Equivalence classes clustering algorithm

4.3.5 Determining a common termination condition for different session lengths

Prior to applying the web sessions to the clustering algorithms described earlier, the sessions are grouped so that sessions with the same session length are together. In the clustering algorithms shown in Figure 4.3 and Figure 4.4, the termination condition depends on the minimum number of clusters. To choose a minimum number of clusters that is common to all session-length groups, the number of clusters is normalized to the number of clusters in the first iteration for each session-length group. To illustrate this idea, consider Table 4.2, which shows a group of web sessions divided according to session length into two groups. The first group shows the number of clusters for session length 3 at different clustering iterations.
The second group shows the number of clusters for session length 4 at different clustering iterations.

Table 4.2 Number of clusters for different session lengths at different iterations

Session   Iteration
length    1     2     3     4    5    6    7    8    9    10   11
3         255   39    35    9    6    6    6    5    1
4         479   141   115   64   35   26   24   15   5    3    1

It is clear that the initial number of clusters is different for the two session-length groups. So, in order to choose the same termination condition, we calculate the percentage of the number of clusters relative to the initial number of clusters for the different session lengths at the different iterations; Table 4.3 shows these percentages. So, for example, if we choose to stop the clustering process when the number of clusters is 14% of the initial number of clusters, then we stop at the iteration whose percentage is closest to 14%, which is, in this case, iteration 3 for session length 3 and iteration 4 for session length 4.

Table 4.3 Percentage of the number of clusters relative to the initial number of clusters for different session lengths at different iterations

Session   Iteration
length    1      2     3     4     5    6    7    8    9    10   11
3         100%   15%   14%   3%    2%   2%   2%   1%   0%
4         100%   29%   24%   13%   7%   5%   5%   3%   1%   0%   0%

4.3.6 Ward's method improves determining a common termination condition

We use Ward's method in defining the centroid in equation 4.3, rather than other methods such as average linkage, because it shows a slower convergence, which helps in determining a common termination condition for all session-length groups more accurately. To illustrate why we want a slow convergence, first consider Figures 4.5 and 4.6, which show the percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations using the average linkage method and Ward's method, respectively.
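The normalization behind Table 4.3, picking the iteration whose cluster-count percentage is closest to the common stopping percentage, can be sketched as follows (a minimal sketch; the function name is illustrative):

```python
def closest_iteration(cluster_counts, target_pct):
    """Sketch of the Table 4.3 normalization: express each iteration's cluster
    count as a fraction of the initial count, then pick the iteration whose
    fraction is closest to the common stopping percentage `target_pct`."""
    initial = cluster_counts[0]
    fractions = [count / initial for count in cluster_counts]
    best = min(range(len(fractions)),
               key=lambda i: abs(fractions[i] - target_pct))
    return best + 1  # iterations are numbered from 1

# Cluster counts per iteration from Table 4.2:
length3 = [255, 39, 35, 9, 6, 6, 6, 5, 1]
length4 = [479, 141, 115, 64, 35, 26, 24, 15, 5, 3, 1]
print(closest_iteration(length3, 0.14))  # 3  (35/255 ≈ 13.7%)
print(closest_iteration(length4, 0.14))  # 4  (64/479 ≈ 13.4%)
```

Run on the Table 4.2 counts with a 14% target, this reproduces the stopping iterations named in the text: iteration 3 for the length-3 group and iteration 4 for the length-4 group.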
Next, assume we want to stop the clustering process when the number of clusters is 20% of the initial number of clusters (in Section 4.6 we explain how to choose these percentage points). Using the average linkage method, Figure 4.5 shows that the 20% point occurs between iterations 6 and 7, where the percentages are 23.50% and 14.53%, respectively. So the closest iteration is iteration 6, giving a percentage of 23.50% and an error of 3.5%. When using Ward's method, as shown in Figure 4.6, the 20% point occurs between iterations 53 and 54, where the percentages are 20.5% and 19.7%, respectively. So the closest iteration is 54, giving a percentage of 19.7% and an error of 0.3%. The error value represents how accurately the clustering process honors a common termination condition across sessions of different lengths.

Figure 4.5 Percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations using the average linkage method

Figure 4.6 Percentage of the number of clusters relative to the initial number of clusters for a specific session length at different iterations using Ward's method

4.4 Web sessions' classifiers

We present two methods for cluster classification. The first method is based on the centroid and the second is based on the inductive learning program BLEM2. The rules learned by both classifiers are used both for predicting and for describing web session clusters. Table 4.4 shows an example of a rule generated by a web sessions' classifier.
Table 4.4 Example of a rule generated by a web sessions' classifier

Page 1   Page 2   Page 3   Cluster
p45      p84      p204     C1

To predict a web session's cluster, the rule in Table 4.4 can be presented as an if-then statement:

if 1st Page = p45 and 2nd Page = p84 and 3rd Page = p204 then Cluster = C1

So, for any new session in the web log, the above rule can be applied in order to predict the session's cluster. We use the classifier accuracy estimator described in Section 4.5 to estimate the accuracy of the prediction. Experimental results based on this method are presented in Section 4.6. To describe a cluster, the physical page-name lookup table is used to find the physical page names. For example, the physical names for the pages in Table 4.4 are

〈signon, BeforeClassSearch, ClassSearch〉

So the cluster can be described as the Class Search cluster, meaning it contains a group of users who are searching for classes. The evaluation of the cluster description is based on the length of the description; for this example, the length of the description is 3. In Section 4.6.4, we present experimental results for the description length using different classifiers.

4.4.1 The centroid approach

The centroid approach is based on using the cluster centroid to describe clusters and to predict users' classes based on past behavior. As defined in equation 4.3, the centroid is the session that has the minimum average squared distance to all other sessions in the same cluster. To illustrate this, assume that cluster ck has session si as its centroid, representing the page sequence si = 〈p45, p84, p204〉. From the page-name lookup table, assume we find that the sequence has the physical page names si = 〈signon, BeforeClassSearch, ClassSearch〉. The cluster can then be described as the Class Search cluster, meaning it contains a group of users who are searching for classes.
Besides describing the clusters, the centroid description is also used to predict users' classes. From the above example, the following rule can be generated:

if si.1 = p45 and si.2 = p84 and si.3 = p204 then si ∈ ck

Generated rules like the one above can be applied in an inference engine to predict the class of future incoming sessions.

4.4.2 Rough set approach

In this subsection, we present the use of the rough set learning program BLEM2 in classifying different users' sessions. BLEM2 is an implementation of one of the LERS family of learning programs [46, 47], which was introduced by Grzymala-Busse [48]. We use the information system notion presented by Pawlak [49, 50], in which an information system S is defined as a pair S = (U, A), where U is a nonempty finite set of objects and A is a nonempty finite set of attributes; each object is denoted by a vector of its attribute values. Each attribute in A is associated with a set of values called the domain of the attribute. Both clustering algorithms described earlier produce a special case of an information system called a decision table. In a decision table, there is a designated attribute called the decision or class attribute, and the other attributes are called condition attributes. Table 4.5 shows an example of a decision table produced by the clustering algorithm, where the universe U consists of 16 examples. In Table 4.5, the attribute Cluster No. is the decision attribute and the attributes Page 1, Page 2 and Page 3 are the condition attributes. The domain of the decision attribute is {0, 1, 2, 3, 4} and the domain of the condition attributes is {45, 58, 80, 84, 108, 120, 160, 186, 194, 204, 241, 251, 444, 463}.
Table 4.5 Decision table produced by the clustering algorithm

Session No.   Page 1   Page 2   Page 3   Cluster No.
1             45       444      108      0
2             45       444      108      0
3             45       444      84       0
4             45       444      463      0
5             45       160      241      1
6             45       160      108      1
7             45       160      241      1
8             45       80       194      2
9             45       80       251      2
10            45       80       251      2
11            45       120      58       3
12            45       120      186      3
13            45       120      160      3
14            45       84       204      4
15            45       84       204      4
16            45       84       204      4

The partition of U determined by the decision attribute Cluster No. is C0 = [1,2,3,4], C1 = [5,6,7], C2 = [8,9,10], C3 = [11,12,13], and C4 = [14,15,16], where Ck is the set of sessions that belong to cluster k. Rough set theory provides the concepts of lower and upper approximations for the case of inconsistency (i.e., having more than one decision for the same condition values). Let A = (U, R) be an approximation space, where U is a set of objects and R is an equivalence relation defined on U, and let X be a nonempty subset of U. Then the lower approximation of X by R in A is defined as

R̲X = {e ∈ U | [e] ⊆ X}    (4.5)

and the upper approximation of X by R in A is defined as

R̄X = {e ∈ U | [e] ∩ X ≠ ∅}    (4.6)

where [e] denotes the equivalence class containing e. The boundary set of X is defined as

BN_R(X) = R̄X − R̲X    (4.7)

A subset X of U is said to be R-definable in A if and only if R̲X = R̄X. The pair (R̲X, R̄X) defines a rough set in A, which is a family of subsets of U with the same lower and upper approximations R̲X and R̄X. From Table 4.5, the lower and upper approximations of C0, C1, C2, C3 and C4 are

A̲C0 = ĀC0 = {1,2,3,4},
A̲C1 = ĀC1 = {5,6,7},
A̲C2 = ĀC2 = {8,9,10},
A̲C3 = ĀC3 = {11,12,13},
A̲C4 = ĀC4 = {14,15,16},

and BN_A(C0) = BN_A(C1) = BN_A(C2) = BN_A(C3) = BN_A(C4) = ∅. As in this example, the clustering algorithms presented earlier do not produce inconsistent examples, so the upper approximation is the same as the lower approximation, i.e., R̲X = R̄X, and the boundary set is BN_R(X) = R̄X − R̲X = ∅.
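The approximations of equations 4.5 through 4.7 can be sketched directly over a decision table. This is a minimal sketch, not BLEM2: `objects` maps an object id to a (condition-tuple, decision) pair, a representation assumed here for illustration.

```python
from collections import defaultdict

def approximations(objects, decision_value):
    """Sketch of equations 4.5-4.7. X is the set of objects whose decision
    equals `decision_value`; the equivalence relation is indiscernibility
    on the condition attributes (identical condition tuples)."""
    # Build the equivalence classes [e]: objects with identical conditions.
    classes = defaultdict(set)
    for oid, (cond, _dec) in objects.items():
        classes[cond].add(oid)
    x = {oid for oid, (_cond, dec) in objects.items() if dec == decision_value}
    lower, upper = set(), set()
    for eq_class in classes.values():
        if eq_class <= x:   # [e] ⊆ X  → contributes to the lower approximation (4.5)
            lower |= eq_class
        if eq_class & x:    # [e] ∩ X ≠ ∅ → contributes to the upper approximation (4.6)
            upper |= eq_class
    return lower, upper, upper - lower  # the last item is the boundary set (4.7)

# A slice of Table 4.5: sessions 1-3 (cluster 0) and 5-7 (cluster 1).
objects = {
    1: ((45, 444, 108), 0), 2: ((45, 444, 108), 0), 3: ((45, 444, 84), 0),
    5: ((45, 160, 241), 1), 6: ((45, 160, 108), 1), 7: ((45, 160, 241), 1),
}
print(approximations(objects, 1))  # ({5, 6, 7}, {5, 6, 7}, set())
```

With consistent data, as in Table 4.5, the lower and upper approximations coincide and the boundary set is empty, which is exactly the condition under which the certain rules of the next passage are learned.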
We use BLEM2 to learn rules from the lower approximation A̲X_i, since it is the same as the upper approximation ĀX_i and the boundary set BN_A(X_i) is an empty set. The rules learned from the lower approximation are called certain rules. Table 4.6 shows the certain rules learned from Table 4.5 using BLEM2. In the rules table, the entry -1 denotes a "do not care" condition. The support column denotes the number of examples covered by the rule. The certainty column denotes the ratio of the examples that match both the rule and its decision value. The strength column is the support of the rule over the entire training set. The coverage column is the ratio of the decision value class covered by the rule.

Table 4.6 Certain rules learned from Table 4.5 using BLEM2

Page 1  Page 2  Page 3  Cluster  Support  Certainty  Strength  Coverage
-1      2       -1      1        288      1          0.0974    1
-1      3       -1      2        118      1          0.0399    1
-1      4       -1      3        938      1          0.3173    1
-1      5       -1      4        47       1          0.0159    1

The rules in Table 4.6 are applied to the inference engine in two different ways. The first is simply to apply all certain rules. The second is to apply only the rules with the maximum support value: for each set of rules that predict the same cluster, only the rule with the maximum support is applied to the inference engine. For example, if more than one rule predicts cluster no. 1, then the rule with the maximum support value is used and the other rules are disregarded. The advantage of the maximum support method is that it describes the system with fewer rules.

4.5 Classifier accuracy estimator

We apply the holdout classifier accuracy estimator [51] to estimate the accuracy of the different classifiers used. As shown in Figure 4.7, the examples generated from the clustering process were randomly partitioned into two independent sets: α percent of the data were used as the training set, and the rest, i.e., (1-α) percent of the data, were used as the testing set.
The training set was used to generate rules either by the centroid method or by the BLEM2 classifier. The testing set, along with the generated rules, was applied to an inference engine to predict the classes for the testing set based on the generated rules. The overall average accuracy result is the percentage of correctly predicted classes out of the overall testing set.

Figure 4.7 Holdout classifier accuracy estimator

To illustrate how the overall average accuracy is calculated, consider the following two rules and the testing examples in Table 4.7:

if 2nd Page = p13 then Cluster = C1
if 3rd Page = p85 then Cluster = C2

Table 4.7 Inference engine testing examples

Example No.  1st Page  2nd Page  3rd Page  Cluster
1            p45       p13       p50       C1
2            p45       p13       p66       C1
3            p45       p12       p85       C2
4            p45       p9        p85       C2
5            p45       p12       p85       C2
6            p45       p13       p33       C3
7            p45       p10       p85       C4

The inference engine predicts the classes of the examples in Table 4.7 using the two rules presented earlier. The predicted classes are compared to the given classes in Table 4.7. Table 4.8 shows the inference engine cluster predictions along with the classes given in Table 4.7. Because we have 5 matches out of 7 examples, the overall average accuracy for the classifier that generated the rules is 5/7 = 0.71.

Table 4.8 Inference engine results along with results from cluster examples

Example No.  Cluster from examples  Cluster prediction from the inference engine  Match (1 = match, 0 = non-match)
1            C1                     C1                                            1
2            C1                     C1                                            1
3            C2                     C2                                            1
4            C2                     C2                                            1
5            C2                     C2                                            1
6            C3                     C1                                            0
7            C4                     C2                                            0

4.6 Experimental results

In this section, we present the experimental results for the Web Usage Mining (WUM) clustering and learning algorithms described in this chapter. First, we present the choice of the termination condition for the two clustering algorithms.
Next, we present the accuracy of predicting sessions' clusters using the rules generated by the different classifiers described in Section 4.4. Finally, we present the experimental results of using the rules to describe the clusters by presenting the average cluster description length under the different classifiers.

4.6.1 Choosing the clustering termination conditions

The clustering algorithm termination condition depends on the number of clusters or, more precisely, on the percentage of the number of clusters relative to the initial number of clusters. The first termination condition we chose is when all sessions in the same cluster have the exact same sequence. This occurs at the first iteration, when the percentage of the number of clusters is 100%. For the second termination condition, we choose the sessions with the shortest length to determine the iteration at which we stop the clustering. Then, we find the percentage of the number of clusters at that point. For the rest of the session length groups, we stop at the point where the percentage of the number of clusters is closest to the percentage of the number of clusters for the shortest session length. The reason we choose the sessions with the shortest length is that they are the most sensitive to the threshold. To illustrate this, consider the following pairs of sessions, where the first pair of sessions is of length 3:

⟨p0, p1, p2⟩
⟨p3, p4, p5⟩

and the second pair of sessions is of length 10:

⟨p0, p1, p2, p0, p1, p2, p0, p1, p2, p0⟩
⟨p3, p4, p5, p1, p5, p6, p7, p8, p9, p10⟩

For the previous two pairs of sessions, if we have a difference threshold of 3, the first pair of sessions, which have the length of 3, will be a 100% match; whereas the second pair of sessions, which have the length of 10, will be a 30% match according to the distance equation 3.10 presented in Chapter 3.
Even though the two pairs of sessions are complete mismatches, the threshold value of 3 caused the sessions of length 3 to give a 100% match, and the sessions of length 10 to give a reasonable difference of 30%. In our experiment, a session of length 3 is considered to be the session with the shortest length. As shown in Figure 4.8, for session length 3, the percentage of the number of clusters dropped to a value close to zero after 8 iterations. So our choice for the termination condition was between iterations 1 and 8. Iteration 1 was chosen for the first set of the experiment. From the rest of the iterations, we chose iteration 2, since the remaining iterations show a small percentage of the number of clusters. At iteration 2, the percentage of the number of clusters was 15.69%. For other session lengths, we stop at the iteration where the percentage of the number of clusters has the closest value to 15.69%.

Figure 4.8 Percentage of the number of the clusters from the initial number of clusters for different session length groups (lengths 3 to 15) at different iterations

4.6.2 Classifier prediction accuracy results by rules generated from examples using the hierarchal clustering algorithm

In this subsection, we present the classifier prediction accuracy results where the rules are generated from the examples using the hierarchal clustering algorithm described in Section 4.3.3. The first test set is performed on the examples generated using the clustering algorithm, where the termination condition is 100% of the number of clusters.
Figure 4.9 shows the average accuracy, where BLEM2 (all) refers to all rules generated using BLEM2, BLEM2 (max) refers to BLEM2 rules with the maximum support only, and centroid refers to rules generated using the centroid method. The accuracy was constant with a value of 1 for all three classifiers.

Figure 4.9 Average accuracy for different session lengths at the 100% number of clusters using examples from the hierarchal clustering algorithm

The second test set was performed on the clusters generated when the number of clusters was 15.69%. Since the clustering results depend on the seed starting point for clustering, the experiments were repeated five times for each session length group. The overall average accuracy results are shown in Figure 4.10. The results clearly show that the average accuracy using BLEM2 is better than that of the centroid method.

Figure 4.10 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the hierarchal clustering algorithm

4.6.3 Classifier prediction accuracy results by rules generated from examples using the equivalence classes clustering algorithm

In this subsection, we present the classifier prediction accuracy results where the rules were generated using the examples from the equivalence classes clustering algorithm described in Section 4.3.4. The first test set was performed on the examples generated from the clustering algorithm where the termination condition was at 100% of the number of clusters. The results were the same as those for the hierarchal clustering algorithm shown in Figure 4.9, where the accuracy was a constant value of 1 for all three classifiers.
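The BLEM2 (max) classifier referenced in these results keeps, for each predicted cluster, only the rule with the largest support. A minimal sketch of that filtering step, with illustrative rule data shaped like Table 4.6 (the third rule is hypothetical, added only to show a rule being dropped):

```python
def max_support_rules(rules):
    """Keep one rule per predicted cluster: the one with maximum support."""
    best = {}
    for rule in rules:
        cluster = rule["cluster"]
        if cluster not in best or rule["support"] > best[cluster]["support"]:
            best[cluster] = rule
    return list(best.values())

# Certain rules in the shape of Table 4.6 (conditions, cluster, support).
rules = [
    {"conditions": {"Page 2": 2}, "cluster": 1, "support": 288},
    {"conditions": {"Page 2": 3}, "cluster": 2, "support": 118},
    {"conditions": {"Page 3": 7}, "cluster": 2, "support": 40},  # hypothetical; dropped
]
filtered = max_support_rules(rules)
print(len(filtered))  # 2: one rule per cluster
```

The same set of clusters is still covered, but the system is described with fewer rules, which is the stated advantage of the maximum support method.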
The second test set was performed using the same termination condition used in the hierarchal clustering algorithm, which is 15.69% of the number of clusters. Since the clustering results are independent of the seed starting point, the experiments were performed only once, and the accuracy results are shown in Figure 4.11. As in the case of the hierarchal clustering, the results show that the average accuracy using BLEM2 is better than that of the centroid method.

Figure 4.11 Average accuracy for different session lengths at the 15.69% number of clusters using examples from the equivalence classes clustering algorithm

4.6.4 Cluster description results

As described in Section 4.4, rules learned using the centroid method and the BLEM2 method are used to describe the clustering results. The cluster description length is defined as the number of conditions in the if part of the if-statement that represents the rule. For example, if we have the following two rules that describe clusters C1 and C2, respectively:

if 2nd Page = p13 and 3rd Page = p14 then Cluster = C1
if 1st Page = p45 and 2nd Page = p33 and 3rd Page = p85 then Cluster = C2

then cluster C1's description length is 2 and cluster C2's description length is 3. Thus, the average description length is (2 + 3)/2 = 2.5. Figure 4.12 shows the cluster description length using the different classifiers. The BLEM2 classifier shows a short, constant cluster description length over different session lengths, while the centroid method shows a linearly increasing cluster description length.

Figure 4.12 Cluster description length for different session lengths using different classifiers

4.7 Results incorporation

Incorporating the results to enhance the structure of the website is done in three steps.
The first step is to identify the most common tasks. The second step is to find how many clicks it takes to finish each task. The last step is to present suggestions for enhancing the structure of the website so that the common tasks can be done more easily and faster.

4.7.1 Identifying the most common tasks

Each web session cluster represents one task, and the number of sessions in the cluster reflects how common the task is. For example, Figure 4.13 shows the seven most common tasks performed on the University of Akron registrar website: enrollment request to add class, enrollment application, class search, account due, account view, class search detail, and class roster. The task descriptions were identified by the cluster description described in Section 4.4.

Figure 4.13 Seven most common tasks performed on the website

4.7.2 Finding how many clicks are needed to finish each task

We assume that each page in the session represents one click for the user to move from one page to another, so the total number of clicks needed to finish each task is the same as the sequence length. Figure 4.14 shows the distribution of the number of clicks for the "Class search detail" task. From the figure, it can be concluded that 61% of the time the "Class search detail" task was finished in 5 clicks, while 24% of the time the task was finished in 3 clicks.
From the cluster centroid, the following page sequence was clicked to finish the task in 5 clicks:

⟨signon, BeforeClassSearch, ClassSearch, BigClassSearchResult, ClassSearchDetail⟩

When the task was completed in 3 clicks, the following page sequence was clicked:

⟨signon, BigClassSearchResult, ClassSearchDetail⟩

Figure 4.14 Sequence length distribution for "Class Search Detail" (sequence length 5: 61%; sequence length 3: 24%; others: 15%)

4.7.3 Presenting suggestions to enhance the website structure

By studying how different tasks are completed, recommendations can be made to change the website structure to permit common tasks to be completed in a shorter time and with fewer clicks. For example, for the "Class search detail" task presented earlier, it can be seen from the cluster centroids that the users who finished the task in 5 clicks went through the regular "class search" first before going to the "class search detail", whereas the users who finished the task in 3 clicks checked the "class search detail" directly. A recommendation can be made to the website engineer to place a shortcut for the "class search detail" on the homepage so users can access it directly rather than being forced to go through the "class search" first.

4.8 Results discussion

The choice of a common termination condition for all session length groups must be based on the session with the shortest length. As shown in Figure 4.8, the sessions of length 3 were used to determine the common termination condition by finding the percentage of the number of clusters at the second iteration for session length 3. Figure 4.9 shows that when sessions in the same cluster have the exact same page sequence, the prediction accuracy is 1 for all classification methods. When the threshold is increased, Figure 4.10 and Figure 4.11 show that the rough set based BLEM2 rules predict the classes for sessions more accurately.
This holds for rules generated using examples from both the hierarchal and the equivalence classes clustering algorithms. The results shown in Figure 4.12 illustrate that BLEM2 rules not only predict the clusters more accurately but also provide a shorter description of the clusters. Figure 4.12 shows that the cluster description length using the centroid method increases linearly as the session length increases, while the cluster description length based on rules learned using BLEM2 is almost constant, with a length of around 2. Figures 4.13 and 4.14 show how the clustering and learning results may be used to give insightful information about the website, such as what the most common tasks are and how these tasks are commonly accomplished. Finally, Section 4.7.3 shows how this information can be used to enhance the website structure so that users' tasks can be accomplished faster and more easily.

4.9 Summary and conclusion

In this chapter, we presented two different clustering algorithms to generate examples that can be used by different classifiers. We used both the centroid and BLEM2 classifiers to learn rules from the examples generated using the clustering algorithms. We applied the holdout classifier accuracy estimator to measure the accuracy of the classifiers. Rules generated by BLEM2 show better cluster prediction and shorter cluster description. The rules generated by the different classifiers were used to present a deep conceptual understanding of the usage behavior of the website, which can be used by the website engineer to evaluate and enhance the website structure and to predict future users' browsing behavior to better assist users in their future browsing experiences. The work presented in this chapter, including generating examples, learning rules, and testing the results, can be applied to sequence clustering methods in other fields, such as bioinformatics, the area of analyzing genomic research data.
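The holdout evaluation described in Section 4.5 can be sketched end to end. The two rules and the seven testing examples below are the ones from Table 4.7, and the overall average accuracy comes out to 5/7; the first-matching-rule policy is an assumption, since the two rules never both fire on these examples.

```python
def predict(example, rules):
    """Return the cluster of the first rule whose conditions all match."""
    for conditions, cluster in rules:
        if all(example[attr] == value for attr, value in conditions.items()):
            return cluster
    return None

# The two rules from Section 4.5.
rules = [({"2nd Page": "p13"}, "C1"),
         ({"3rd Page": "p85"}, "C2")]

# Testing examples from Table 4.7: page attributes plus the actual cluster.
testing = [
    ({"1st Page": "p45", "2nd Page": "p13", "3rd Page": "p50"}, "C1"),
    ({"1st Page": "p45", "2nd Page": "p13", "3rd Page": "p66"}, "C1"),
    ({"1st Page": "p45", "2nd Page": "p12", "3rd Page": "p85"}, "C2"),
    ({"1st Page": "p45", "2nd Page": "p9",  "3rd Page": "p85"}, "C2"),
    ({"1st Page": "p45", "2nd Page": "p12", "3rd Page": "p85"}, "C2"),
    ({"1st Page": "p45", "2nd Page": "p13", "3rd Page": "p33"}, "C3"),
    ({"1st Page": "p45", "2nd Page": "p10", "3rd Page": "p85"}, "C4"),
]
matches = sum(predict(ex, rules) == actual for ex, actual in testing)
print(f"{matches}/{len(testing)} = {matches / len(testing):.2f}")  # 5/7 = 0.71
```

Examples 6 and 7 are the two mismatches: the rules fire but predict C1 and C2 instead of the given C3 and C4.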
CHAPTER V

SYSTEM IMPLEMENTATION

5.1 Introduction

In this chapter, we present the implementation of the web usage mining system presented in the previous chapters. As shown in Figure 5.1, the implementation is divided into four modules: data preparation, session identification, clustering process, and result presentation and evaluation. The data preparation module performs data filtering and user identification. The session identification module performs session identification and further data filtering. The clustering process module generates the dissimilarity matrix and performs the different clustering algorithms, including the hierarchal and equivalence classes clustering algorithms. The result presentation and evaluation module performs the learning process along with accuracy estimation for the learning results.

The rest of the chapter is organized as follows. In Section 2, we present the implementation of the data preparation module. In Section 3, we present the implementation of the session identification module. In Section 4, we present the implementation of the clustering process module. In Section 5, we present the implementation of the result presentation and evaluation module. Finally, in Section 6, we present a summary.

Figure 5.1 Data flow diagram for the web usage mining system

5.2 Data preparation module

The data preparation module performs both data filtering and user identification. The implementation is done using MS SQL Server. Figure 5.2 shows the entity relation (ER) model for the database design.
The Web_Records table contains all the raw data collected from the web server. The filtered records are then stored in the Web_Log table. The Open_Users and Users tables are used by the active-user-based user identification script shown in Figure 2.7. The final results of the user identification process are stored in the Users table.

Figure 5.2 Entity relation model for data preparation

5.3 Session identification module

As shown in the use case diagram in Figure 5.3, the session identification module allows the user to perform different tasks, such as loading user records, performing session identification, performing further data filtering, and exporting the results to different platforms.

Figure 5.3 Use case diagram for session identification

Figure 5.4 shows the session identification module user interface. The first step in using the program is to load the sequence file, which is the output from the user identification step. Next, the user is asked to load the page lookup file, which matches the page numbers used in the sequence file with their physical names. The page lookup file also indicates whether a page is a housekeeping page or not. Several filtering options are available:

• Remove housekeeping pages: this option removes the housekeeping pages as identified by the user in the page lookup file. The user can set up these pages by pressing the "Setup House Keeping Pages" button.

• Remove redundant pages: this option removes the redundant pages that result from removing the housekeeping pages.

• Session identification based on break pages: this option splits users' records into one or more sessions based on break pages. To do this, the user needs to enter the break pages. Then, the program runs the algorithm described in Section 2.3.3. For our case, we chose the sign-in page as the break page and the sessions were identified based on that.

• Specific session length range: this option filters sessions that have a specific number of records.
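The break-page option can be illustrated with a short sketch. The splitting rule used here, starting a new session at each occurrence of a break page, is an assumption based on the description above, and the page names are hypothetical.

```python
def split_sessions(pages, break_pages):
    """Split one user's page sequence into sessions, starting a new
    session whenever a break page (e.g. the sign-in page) is seen."""
    sessions, current = [], []
    for page in pages:
        if page in break_pages and current:
            sessions.append(current)
            current = []
        current.append(page)
    if current:
        sessions.append(current)
    return sessions

# Hypothetical user record; the sign-on page acts as the break page.
user_pages = ["signon", "ClassSearch", "ClassSearchDetail",
              "signon", "AccountView"]
print(split_sessions(user_pages, {"signon"}))
# [['signon', 'ClassSearch', 'ClassSearchDetail'], ['signon', 'AccountView']]
```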
After these operations are completed, the results can be exported into different formats:

• Space delimited: this format is a general-purpose format that can be read by many learning tools, including ours.

• Weka [52]: this format can be read by the popular open source machine learning program Weka. Weka, like many other machine learning programs, requires a fixed sequence length, so this option cannot be used unless the data is filtered to a fixed length.

• Result statistic: this format gives statistical results about the distribution of the lengths of the sessions after filtering.

Figure 5.4 Session identification module user interface

5.4 Clustering process module

In the clustering module, users first prepare the session data for clustering by generating the dissimilarity matrix. The hierarchal and equivalence classes clustering algorithms can then be applied to the sessions using the generated dissimilarity matrix. Finally, users can pick clusters at a certain threshold level. Figure 5.5 shows the use case diagram for the clustering process module.

Figure 5.5 Use case diagram for the clustering process module

Figure 5.6 shows the UML diagram for the clustering process module. The multiplicity shows that the dissimilarity matrix is generated based on the sessions. Each cluster has one or more sessions, and each session belongs to one cluster only. Finally, the clusters are generated using the Hierarchal class. Each cluster consists of one or several cluster levels.

Figure 5.6 UML diagram for the clustering process

Figure 5.7 shows the first sequence diagram in the clustering process. This diagram shows how the dissimilarity matrix is generated by passing messages between the Session, Dissimilarity Matrix and Distance classes.
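The interaction in Figure 5.7 amounts to computing a symmetric pairwise distance table. A minimal sketch follows, with a simple page-mismatch count standing in for the MSCM distance of Chapter 3; the sessions and the placeholder distance are illustrative assumptions.

```python
def build_dissimilarity_matrix(sessions, distance):
    """Build the symmetric pairwise dissimilarity matrix for a list of
    sessions; distance(a, b) stands in for the MSCM of Chapter 3."""
    n = len(sessions)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = distance(sessions[i], sessions[j])
            matrix[i][j] = matrix[j][i] = d  # symmetric, zero diagonal
    return matrix

# Placeholder distance: number of positions where the pages differ,
# plus any difference in length.
def page_mismatches(a, b):
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

sessions = [["p45", "p444", "p108"],
            ["p45", "p444", "p84"],
            ["p45", "p160", "p241"]]
m = build_dissimilarity_matrix(sessions, page_mismatches)
print(m[0][1], m[0][2])  # 1 2
```

Only the upper triangle is computed; the lower triangle is mirrored, which halves the number of distance calls for a large session set.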
Figure 5.7 Sequence diagram for generating the dissimilarity matrix

Figure 5.8 shows the second sequence diagram in the clustering process. This diagram shows how the clusters are found by passing messages between the Cluster, Hierarchal and Dissimilarity Matrix classes.

Figure 5.8 Sequence diagram for finding clusters

Figure 5.9 shows the clustering module user interface. The user starts by loading the session file prepared earlier by the session identification process. Then, the dissimilarity matrix can either be generated, by choosing the "Generate Dissimilarity Matrix" button, or loaded directly from a text file, using the "Load Similarity Matrix" button. Once both the session and dissimilarity matrix files are ready, the user can perform the clustering process by choosing the "Run Clustering" button. The user can then export the clustering result at any level by providing the clustering level in the "Clustering Level" field and clicking the "Export Clustering Results at Certain Level" button.

Figure 5.9 Clustering module user interface

5.5 Results presentation and evaluation module

Figure 5.10 shows the dataflow diagram for the results presentation and evaluation module. The figure shows the programs used at the different steps in the learning and evaluation process. Split.java is used to split the clusters into two independent sets. Raff2Lem.exe is used for exporting the result to the BLEM2 format, and Lers7.exe is then used in the learning process to generate the first set of rules using BLEM2. MaxSupport.java is used to filter out the rules that have the maximum support. FindCentroid.java is used to learn rules using the centroid method. OpMVClassifierCF3.tcl is the inference engine that tests the accuracy of the classifiers.
Source code for Split.java, FindCentroid.java, MaxSupport.java and OpMVClassifierCF3.tcl is available upon request.

Figure 5.10 Dataflow diagram for the results presentation and evaluation module (clusters are split 83% / 17% into training and testing examples)

5.6 Summary

In this chapter, we presented the implementation of the web usage mining system presented in this dissertation. The system implementation was accomplished using a mixture of different programming environments, such as SQL, Java and TCL. The source code for the implementation is available upon request.

CHAPTER VI

SUMMARY AND CONCLUSIONS

In this work, we presented a complete Web Usage Mining (WUM) system using data mining techniques and a rough set learning approach. The system architecture covered the major parts of the WUM system, including data preprocessing, data cleaning and filtering, session comparison, clustering analysis, results presentation and results incorporation. The goal of this system is to give a deep conceptual understanding of the usage behavior of a website. This conceptual understanding can be used by the website engineer to evaluate and enhance the website structure to better assist users in their future browsing experiences.

In the data preprocessing phase, we presented new techniques for preprocessing web log data, including identifying unique users and sessions. We developed a fast active-user-based user identification algorithm which has a time complexity of O(n). For session identification, we presented an ontology-based session identification algorithm that uses the website structure to identify users' sessions.
We showed that the user identification algorithm depends on three parameters: the number of records per user, the web log record recording rate, and the maximum inactive time for users. Table 6.1 shows the mathematical models, along with correlation coefficients, for the three website parameters on which our active-user-based user identification algorithm depends. These models can be used in simulating future website usage activity.

Table 6.1 Mathematical models for three website parameters

Parameter                          Mathematical model     Correlation coefficient
Records per user probability       Power fit              0.41
User navigation time probability   Reciprocal quadratic   0.92
Records per second probability     Geometric fit          1.00

In the session comparison phase, we presented a new Multidimensional Session Comparison Method (MSCM), which is based on dynamic programming. Unlike other algorithms, MSCM takes into consideration other dimensions of the session, such as the time spent on the pages and the total session length. The algorithm provided more accurate results in comparing web sessions than other known algorithms, such as the Sequence Alignment Method (SAM), the Multidimensional Sequence Alignment Method (MDSAM) and Path Feature Space. The output of the MSCM is presented in the form of a dissimilarity matrix, which can be used by different clustering techniques, such as hierarchal, k-means and equivalence classes clustering algorithms.

In the clustering phase, we presented two clustering algorithms. The first is a hierarchal clustering algorithm and the other is an equivalence classes clustering algorithm. Unlike other clustering algorithms, the equivalence classes clustering algorithm does not depend on the seed starting point of the clustering process, so we did not have to repeat the clustering process several times and take the average; it was sufficient to perform the clustering once.
We also presented a new method for choosing a common termination condition for the clustering algorithms when clustering different session length groups. The new method showed that the shortest session length must be used to determine the termination condition for the other session length groups.

In the learning phase, the clustering results, which were presented in the form of examples, were used by two classifiers to generate rules. These rules were used to predict the clusters of prospective users and to describe the clusters themselves. We presented two classification approaches: the centroid approach and the rough set approach BLEM2. The accuracy of predicting the clusters of prospective sessions was measured using the holdout accuracy estimator method. The results showed that the rough set approach, BLEM2, is more accurate in predicting prospective sessions' clusters. For the cluster description, we based our measure on the length of the description. The rough set approach BLEM2 showed a shorter description length for the clusters. In summary, the rules generated using the rough set approach BLEM2 better predict and describe web sessions' clusters.

In the results incorporation phase, we used the clustering results along with the learned rules to present a deep conceptual description of the website usage. We presented the most common tasks that were performed on the website. In addition, we presented which navigation paths were most commonly used to complete these tasks. We showed how the clustering and learning results can be used to present suggestions to the website designer to enhance the website structure to better assist users in their future browsing experiences.

WUM research is an emerging field and there remains much to be learned from the interaction between users and different websites. Future work needs to be done to automate the process of the WUM.
This can be carried out by incorporating the generated rules with the web log and clustering users into different clusters on the fly. Additional work can also be done by dynamically adjusting the website structure according to the WUM results. Web sessions are a special case of string sequences, so as future work the techniques presented in this dissertation, in particular the multidimensional sequence comparison algorithm, the two clustering algorithms, and the learning approaches, can be applied to other sequence comparison research areas, such as bioinformatics, the area of analyzing genomic research data.

REFERENCES

[1] R. Cooley, P.-N. Tan, and J. Srivastava, "Discovery of Interesting Usage Patterns from Web Data," Revised Papers from the International Workshop on Web Usage Analysis and User Profiling, pp. 163-182, August 15, 1999.

[2] SurfAid Analytics, http://surfaid.dfw.ibm.com.

[3] Sane Solutions, Analyzing website traffic, 2000, http://www.sane.com/.

[4] WebTrends Log Analyzer, http://www.webtrends.com.

[5] B. Mobasher, R. Cooley, and J. Srivastava, "Automatic personalization based on Web usage mining," Communications of the ACM, Vol. 43, No. 8, pp. 142-151, August, 2000.

[6] Y. Fu, K. Sandhu, and M.-Y. Shih, "A Generalization-Based Approach to Clustering of Web Usage Sessions," Revised Papers from the International WEBKDD'99 Workshop, San Diego, CA, USA, pp. 21-38, August 15, 1999.

[7] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, Montreal, Quebec, Canada, June 4-6, 1996.

[8] A. Foss, W. Wang, and O. R. Zaïane, "A Non-Parametric Approach to Web Log Analysis," Proceedings of the Workshop on Web Mining at the First International SIAM Conference on Data Mining (SDM2001), pp. 41-50, Chicago, IL, April 5-7, 2001.

[9] R.
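For concreteness, the holdout accuracy estimate used to compare the two classification approaches can be sketched as follows. The toy centroid classifier and all names below are illustrative assumptions standing in for the dissertation's implementations; the BLEM2 rules are not reproduced here.

```python
# Hypothetical sketch of the holdout accuracy estimate: split labeled sessions
# into training and test parts, fit a classifier on the training part, and
# report the fraction of correct predictions on the held-out part.
import random
from collections import defaultdict

def train_centroid(xs, ys):
    """Centroid approach: one mean feature vector per cluster label."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    return {y: [sum(col) / len(vs) for col in zip(*vs)] for y, vs in groups.items()}

def predict_centroid(centroids, x):
    """Assign x to the label of the nearest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def holdout_accuracy(xs, ys, test_fraction=0.3, seed=0):
    """Estimate predictive accuracy on a held-out random subset of the data."""
    data = list(zip(xs, ys))
    random.Random(seed).shuffle(data)
    split = int(len(data) * (1 - test_fraction))
    train, test = data[:split], data[split:]
    model = train_centroid([x for x, _ in train], [y for _, y in train])
    return sum(predict_centroid(model, x) == y for x, y in test) / len(test)

# Two well-separated toy "session feature" clusters
xs = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ys = ["browse", "browse", "browse", "register", "register", "register"]
print(holdout_accuracy(xs, ys))  # 1.0 for these well-separated clusters
```

Comparing classifiers then reduces to running the same holdout split with each approach's train and predict functions and comparing the resulting accuracies.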