ACCTG 6910, Spring 2003
DESB, University of Utah
Project Milestone 5 (April 3 – 17)

Question 1 (75%): Discover access patterns in web logs.

The supervisory council for the University of Utah's web portal has asked the e.bis Research Lab to discover user access patterns from its web logs. As a volunteer in the Lab, you have been asked to perform association rule and sequential pattern mining on a small sample web log. It contains 4736 users, 10000 sessions and 11042 visits, with the following attributes:

1-5       7-11         13-17
user id   session id   URL id

Step 1: Download from the Project section of the class website the data set, weblog.txt, and a text file, urlmapping.txt, that maps the URL codes in weblog.txt to URLs on UU's web site.

Step 2: Use IBM Intelligent Miner to mine the data set for large item sets, association rules and large sequential patterns. Use a support level of 0.3% for association rule and sequential pattern mining and a confidence level of 50% for association rule mining. Mine the data set again using two different support levels for both association rule and sequential pattern mining.

Step 3: Report and analyze the results. Identify 10 interesting association rules and 10 large sequential patterns. Use urlmapping.txt to find the URLs that match the URL ids in the rules/patterns. Write a short analysis (one to two paragraphs) of these rules/patterns and of any actions you recommend the supervisory council consider.

1. Select five (instead of ten) different association rules with multiple items on the left-hand side to interpret. If you don't have enough qualified rules, adjust your support and/or confidence to find sufficient rules. Repeat steps 2 and 3 for each selected rule.
2. Think of an explanation for why the rule might exist (e.g., for A, B -> C, think of why UU website visitors would tend to access page C if they visit A and B).
3. Discuss your assessment of whether your explanation is or is not interesting.
4. Select five (instead of ten) large 3-sequences. If you don't have enough large 3-sequences, adjust the support level to find sufficient qualified sequences. For each selected large sequence, repeat steps 2 and 3.

At 0.3% support and 50% confidence, IM discovered 76 rules and 139 item-sets.
At 0.2% support and 50% confidence, IM discovered 128 rules and 235 item-sets.
At 0.1% support and 50% confidence, IM discovered 1072 rules and 781 item-sets.

Note: The objective of this milestone is for you to better understand what you may or may not expect from data mining, and the effort and domain knowledge required to interpret and leverage data mining results. Think about how much time it took you to work on Milestone 5. (Software such as Link Selector can save a webmaster a lot of time in interpreting user access patterns for website redesign decisions.) Was it hard to interpret the results and make recommendations without some knowledge of website design/administration? When the data mining task is somewhat data-driven initially, you must find experts or acquire the relevant knowledge to analyze and leverage the patterns.

Here is some relevant knowledge, and some ways to interpret and analyze association rules and sequential patterns from a web log:

Web content management decisions include which links to include on a page (especially the portal page).

An association rule, A, B -> C, may suggest the following linkage because visits to A and B tend to go through C:
[Figure: suggested links among pages A, C and B]

A sequential pattern, <{A}, {B}, {C}>, may suggest the following linkage because users tend to reference these URLs across sessions:
[Figure: suggested links among pages A, B and C]

Interpretations of the interestingness of association rules and sequential patterns:
1. Uninteresting if the patterns are induced only by the design of the website.
2. Somewhat interesting if the patterns show common user interests that are well recognized and supported by the website design, because they validate the effectiveness of the design.
3. Interesting if the patterns show common user interests that are not well recognized and supported by the website design, because some redesign actions may follow.

Association Rule Analysis:

1) [06155]+[06165]+[06128] ==> [06153] support = 0.3552% confidence = 100% lift = 266.68
corresponding URLs are [/upap/main.html]+[/upap/top.html]+[/upap/left.html] ==> [/upap]
This rule exists because when users visit the page /upap (Utah Physician Assistant Program) on the Utah website, the site automatically loads /upap/main.html, /upap/top.html and /upap/left.html and combines them into one HTML page as the response.
Uninteresting.

2) [00489]+[02049] ==> [02091] support = 0.3158% confidence = 80% lift = 29.37
corresponding URLs are [/academics/index.html]+[/graduate_school/admissions.html] ==> [/graduate_school/index.html]
The rule exists because users who visited the academic program index page (/academics/index.html) and the graduate school admissions information (/graduate_school/admissions.html) most probably used the graduate school index page (/graduate_school/index.html) to navigate.
Somewhat interesting.

3) [02087]+[02085] ==> [02091] support = 0.2368% confidence = 85.71% lift = 31.47
corresponding URLs are [/graduate_school/graduate_handbook/handbook.html]+[/graduate_school/graduate_handbook/grad.degrees.html] ==> [/graduate_school/index.html]
The rule exists because users clicked the link on the graduate school home page [/graduate_school/index.html] to browse the graduate handbook [/graduate_school/graduate_handbook/handbook.html], and then clicked the hyperlink on the handbook page to view the degrees available at UU.
Somewhat interesting.

4) [00489]+[00680] ==> [00687] support = 0.2171% confidence = 68.75% lift = 31.96
corresponding URLs are [/academics/index.html]+[/calendar/index.html] ==> [/calendar/oct2002.html]
The rule exists because users may reach the October 2002 events page [/calendar/oct2002.html] by clicking the event calendar link [/calendar/index.html] on the academics home page [/academics/index.html].
Somewhat interesting.

5) [04560]+[00566] ==> [03773] support = 0.1184% confidence = 50% lift = 32.07
corresponding URLs are [/students/index.html]+[/alumni_visitors/index.html] ==> [/quicklinks/index.html]
The rule exists because users may switch between the student index page [/students/index.html] and the alumni/visitor index page [/alumni_visitors/index.html] through the quick links index page [/quicklinks/index.html]. However, the quick links index page has since been removed, and users who still visit it are redirected to the UU home page. The UU home page also contains links to the student and alumni/visitor home pages.
Somewhat interesting.

Under a minimum support of 0.1%, IM mined 773 sequential patterns. Five of them are selected and explained as follows.

1) <{[04560]}, {[00489]}, {[00472]}> support = 0.313%
corresponding URLs are <{[/students/index.html]}, {[/academics/index.html]}, {[/a_z/index.html]}>
The pattern exists because users may check student information through the student index page, browse academic information through the academics index page, and then use the A-Z index to quickly locate specific information they are interested in.
Interesting.

2) <{[04560]}, {[04560]}, {[05955]}> support = 0.281%
corresponding URLs are <{[/students/index.html]}, {[/students/index.html]}, {[/unews/releases/02/oct/cauldron.html]}>
The pattern exists because users may notice and follow the hot news link on the student index page after they have visited that page twice.
Interesting.

3) <{[00489]}, {[00489]}, {[00489]}> support = 8.13%
corresponding URLs are <{[/academics/index.html]}, {[/academics/index.html]}, {[/academics/index.html]}>
The pattern exists because some users habitually use the academics index page to locate academic information in different sessions.

3) <{[00472]}, {[00489]}, {[04560]}> support = 0.219%
corresponding URLs are <{[/a_z/index.html]}, {[/academics/index.html]}, {[/students/index.html]}>
The pattern exists because some users may find it hard to find the information they need through the A-Z index page. They may therefore use the academics and student index pages to locate that information in later sessions.
Interesting.

4) <{[00472]}, {[00472]}, {[00680] [00687]}> support = 0.188%
corresponding URLs are <{[/a_z/index.html]}, {[/a_z/index.html]}, {[/calendar/index.html] [/calendar/oct2002.html]}>
The pattern exists because users may like to check the UU event calendar after they have browsed the specific information they wanted through the A-Z index page. We can also infer that the web logs were probably collected in October 2002, since users visit the calendar page for that month.
Interesting.

5) <{[00489]}, {[01681]}, {[00489]}> support = 0.188%
corresponding URLs are <{[/academics/index.html]}, {[/employment/index.html]}, {[/academics/index.html]}>
The pattern exists because students may want to find a job at UU to support their studies.
Interesting.

Question 2 (25%): If the data file includes referrer and visit duration information for each visit, please discuss how you might use clustering to help identify clusters in the data file.

Note: Clustering uses more than one attribute and doesn't specify in advance what the clusters (e.g., clusters of URLs with an average visit duration longer than 1 minute) should be.

The following clusterings could produce interesting groups of pages, users or sessions with similar patterns.
To further analyze how they are similar, additional data mining, such as association rule and sequential pattern mining, more granular clustering, and classification, may be applied.

Cluster by URL, with attributes: average visit duration, top 3 referrers, average # of visitors/day, # of in-links, and # of out-links.

Cluster by visitor, with attributes: location (e.g., zip code, coordinates, or wireless cell), frequent visit time of day (e.g., early am, mid am, late am, early pm, mid pm and late pm), average session duration, average page visit duration, top n large item sets and top n sequential patterns.

Cluster by session, with attributes: location of visitor, average session duration, time of day, average number of links, and top n large item sets.
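
As a minimal sketch of the first option above (clustering by URL), the Python below standardizes a few numeric per-URL attributes and groups the URLs with plain k-means. Several things here are assumptions made only for illustration: k-means is not necessarily the clustering method IBM Intelligent Miner would apply, the feature values in URL_FEATURES are hypothetical (weblog.txt has no duration or link data), the function names are invented for this sketch, and the categorical attribute "top 3 referrers" is left out because it would first need a numeric encoding.

```python
import random
from statistics import mean, pstdev

# HYPOTHETICAL per-URL features; weblog.txt itself has no duration or link data.
# url -> (avg visit duration in s, avg visitors/day, # in-links, # out-links)
URL_FEATURES = {
    "/students/index.html": (15.0, 900.0, 40.0, 60.0),
    "/academics/index.html": (25.0, 700.0, 35.0, 50.0),
    "/a_z/index.html": (20.0, 800.0, 30.0, 80.0),
    "/graduate_school/graduate_handbook/handbook.html": (180.0, 30.0, 3.0, 5.0),
    "/unews/releases/02/oct/cauldron.html": (150.0, 40.0, 2.0, 3.0),
    "/calendar/oct2002.html": (120.0, 35.0, 4.0, 4.0),
}


def standardize(rows):
    """Z-score each column so that no single attribute dominates the distance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stdevs))
        for row in rows
    ]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100, seed=0):
    """Plain k-means; returns one cluster label (0..k-1) per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(point, centroids[c]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels


def cluster_urls(features, k=2):
    """Map each URL to a cluster label."""
    urls = list(features)
    points = standardize([features[url] for url in urls])
    return dict(zip(urls, kmeans(points, k)))


if __name__ == "__main__":
    for url, label in cluster_urls(URL_FEATURES).items():
        print(label, url)
```

With these made-up values, the index-style pages (short visits, many visitors and links) and the content pages (long visits, few links) should fall into separate clusters. The label numbers themselves carry no meaning. The same pattern would apply to clustering by visitor or by session once those attributes have been aggregated into one numeric row per visitor or session.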