Presented by Andrew Wong
9th Annual IUG meeting at HKU Library
8 December 2009

• Definitions
• Motivations
• Architecture of Logs Miner
• Logs Miner User Interface
• Logs Miner reports
• Benefits
• Future development

Web data mining -- "application of data mining methodologies, techniques, and models to the variety of data forms, structures, and usage patterns that comprise the World Wide Web" (Markov, Z. & Larose, D. T. 2007)
Three scopes of Web data mining:
• Web content mining
• Web structure mining
• Web log mining

Web log mining
• Discovers user access patterns from Web usage logs
• Is also called Web usage mining
• Three processing stages:
1. Pre-processing
2. Pattern discovery
3. Pattern analysis

• Identify and classify different groups of patrons
• Understand the search patterns of different groups of patrons
• Adapt web user interfaces to suit users' needs
• Statistical data for collection management

• Web logs provide a large amount of information on user actions
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"
lbnxyz.ust.hk - - [16/Nov/2009:12:03:27 +0800] "GET /catalog/?s=brandy&feed=rss HTTP/1.1" 304 - "" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 1 subscribers; feedid=10486796160015392754)"
lbz222.ust.hk - - [16/Nov/2009:12:03:30 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz333.ust.hk - - [16/Nov/2009:12:03:33 +0800] "GET /catalog/?s=brandy HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"
lbz444.ust.hk - - [16/Nov/2009:12:03:35 +0800] "GET /stream/xml/stream.xml HTTP/1.1" 304 - "-" "Mozilla/5.0 (Windows; U; Windows NT 6.0; zh-TW; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200
20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Field                           Value
Remote host field               lbz000.ust.hk
Date/Time field                 [16/Nov/2009:12:03:26 +0800]
HTTP request                    "GET /catalog/ HTTP/1.1"
Status code field               200
Transfer volume (bytes) field   20283
User agent field                "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Common Log Format (shown here in its combined variant, with referrer and user agent): usually used by Apache web server logs and Apache Tomcat logs
e.g. Library web server, INNOPAC, SmartCAT, Institutional Repository
lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5 (.NET CLR 3.5.30729)"

Microsoft IIS Log Format
e.g. ILLiad, Class Registration Form
2009-07-20 01:22:44 GET /ce/ - 66.249.71.201 HTTP/1.1 Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 401 1891 0

Fields included:
• Remote host field
• Date field
• Time field
• HTTP request field
• Status code field
• Transfer volume (bytes)
• Referrer field
• User agent field

Microsoft Streaming Server e.g.
Streaming video
143.89.160.133 2009-09-02 10:21:20 - /arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv 0 6 5 200 {3300AD50-2C39-46c0-AE0A-41B7139D4722} 11.0.5721.5251 en-US WMFSDK/11.0.5721.5251_WMPlayer/11.0.5721.5268 - wmplayer.exe 11.0.5721.5145 Windows_XP 5.1.0.2600 Pentium 3816 216613290 2830093 rtsp TCP - - - 2244972 2244972 398 398 0 0 0 0 0 0 1 1 100 143.89.105.168 lbms07.ust.hk 1 0 245 file://C:\wmhome\hkust\arc-open\oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv mms://stream.ust.hk/arc-open/oudpa/OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv OUDPA-2008-Sobel-Adventures_in_Science_Writing.wmv - - 0

Fields found only in streaming server logs:
• Video codec
• Audio codec
• Duration
• Client's player

Tools used to analyze web access logs
• AccessWatch v1.33
• Analog 6.0
• Pwebstats
• RefStats 1.2
• INNOPAC Millennium Web Report: Search Statistics
Others:
• AWStats
• Sawmill Analytics
• Webalizer

• Create a portal for storing and analyzing all the different web access logs
• Provide an interface for querying web access logs to generate dynamic statistical reports

• Able to analyze different log formats, including Apache NCSA combined log files, IIS log files (W3C), and streaming server log files
• Able to analyze non-standard log formats
• Works from the command line and from a browser as a CGI
• Build a web interface to query the data (Logs Miner)
• Pre-process the raw log data, running large-scale queries as cron jobs

• Unlimited log file size
• Reports the number of unique visitors and visits
• Provides plug-ins to extend the functionality
• Open source

• Web log files: raw data must contain web log components such as client IP address, status code, HTTP request field ……
• Any OS platform that supports Perl

• PC-level workstations
• CentOS release 5.4
• Apache web server 2.0
• Perl v5.8.8
• AWStats 6.9

[Architecture diagram]
Logs Miner UI; AWStats; Raw logs: Library web server, INNOPAC, SmartCAT, Institutional repository, Digital archives …..
Preprocessing; AWStats reports; Access statistics; Customized report; Pattern discovery, pattern analysis

• A portal for mining web access log data and retrieving information about the usage of multiple web applications.
• Built on top of AWStats, an open-source log analyzer.
• Currently set up to analyze more than 20 library servers and applications, including Library Web Server, INNOPAC, Institutional Repository, Digital Archives, SmartCAT, ILLiad, Streaming Video Server, etc.

URL: https://lbnx16.ust.hk/mining
• Includes 20+ applications
• Provides three types of report
• Filtered by URL or host
• Generates yearly or monthly reports
• Query box supporting regular expressions

URL: https://lbnx16.ust.hk/mining
Tips for constructing a query string

• AWStats reports
• Access statistics - filtered by URL / host
• Customized reports

Reports:
- number of unique visitors
- number of visits
- These numbers exclude visits from robots

Created by plugin: geoip

Work in progress: HKUST's iPhone application for receiving Library information and searching SmartCAT

Query box supporting regular expressions

Database title: Cambridge Journals Online
URL: http://library.ust.hk/cgi/db/cambridge.pl?subscribedTo
Server name: library.ust.hk (Library web server)
Parameters: /cgi/db/cambridge.pl?subscribedTo
Include pattern: cgi\/db\/cambridge\.pl.+

Document: Long, Jiafu 2005, Autoinhibition of X11/Mint scaffold proteins revealed by the closed ……
URL: http://repository.ust.hk/dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Server name: repository.ust.hk (HKUST Institutional Repository)
Parameters: /dspace/bitstream/1783.1/2496/1/nsmb958.pdf
Include pattern: \/1783\.1\/2496\/1\/nsmb958\.pdf

Number of accesses to the Library web page from Library public workstations
Library web page URL: http://library.ust.hk/
Server name: library.ust.hk (Library web server)
Client naming convention:
• OPAC workstation (lbb[nnn].ust.hk)
• IC workstation
(lbc[nnn].ust.hk)
• Computer Lab (lba[nnn].ust.hk)
Include pattern: lb(a|b|c)[\d]+\.ust\.hk

Number of accesses to Digital Archives from the HKUST campus, excluding HKUST Library staff
Digital university archives URL: http://archives.ust.hk/
Server name: archives.ust.hk (Digital Archives)
Client naming convention: Library staff workstation (lbz[nnn].ust.hk)

Include pattern: ^.+\.ust\.hk$
Exclude pattern: lbz.+\.ust\.hk

• A virtual visit is defined as a user's request on the library's website in order to use one of the services provided by the library.
• One Key Performance Indicator: virtual visits per capita
• Includes the main web applications:
- Library web server
- INNOPAC
- SmartCAT (next-generation catalog)
- HKUST Institutional Repository
- Digital Archives
- HKUST ILLiad

Reports the number of
• Visits - a unique IP accesses a page, and then requests three other pages with less than an hour between any of the requests

[Diagram: successive requests, each within an hour of the previous one, count as a single visit]

Application         Unique visitors     Visits      Pages  Visits/visitor  Pages/visit
Library web server          413,324  1,018,811  6,078,913            2.46         5.96
IR                           94,596    133,458    632,256            1.41         4.73
Digital Archives              1,497      3,511     90,489            2.34        25.77
E-Journal                    21,833     42,768    376,473            1.95          8.8
E-theses                     25,848     34,956    116,664            1.35         3.33
HKUST ILLiad                  8,039     18,548    138,109             2.3         7.44
SmartCAT                      4,202      9,398    288,787            2.23        30.72
Streaming Videos                778      1,233      4,073            1.58         3.30
Total                       570,117  1,262,683  7,725,764            2.21         6.11

Virtual visits in 2009: 1,262,683

• Built-in customized reports give a full picture of the page visit figures for similar pages
From the HKUST Library Web Server (http://library.ust.hk):
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides

Subset:
• Sitemap
• Databases List
• Course Guides
• Database Guides
• Subject Guides

HKUST library web sitemap

Add more customized report templates
• E-Journal list
• Library Forms
• ……

• Central place for storing,
processing and analyzing web log data
• Combines usage data from different server logs
• Statistical reports can be generated dynamically
• Flexible querying interface enabling users to construct their own statistical reports in real time

• From web access logs, individual clients' actions can be tracked
• Protected by firewall, file permissions, and user authentication
• The Logs Miner User Interface can only be accessed from the library network
IMPORTANT: As data retrieved in your searches or reports may contain usage patterns of our users, please be careful not to redistribute such information outside of the HKUST Library.

• Include more web applications, such as the HKUST PowerSearch server (federated search of the Library's subscription resources)
• Create more customized report templates, such as an E-journal list

Han, J., & Kamber, M. 2006. Data mining: Concepts and techniques (2nd ed.). Amsterdam: Morgan Kaufmann.
Liu, H., & Keselj, V. 2007. Combined mining of web server logs and web contents for classifying user navigation patterns and predicting users' future requests. Data & Knowledge Engineering, 61(2): 304.
Markov, Z., & Larose, D. T. 2007. Data mining the web: Uncovering patterns in web content, structure, and usage. Hoboken, N.J.: Wiley-Interscience.

Email address: lbandrew@ust.hk
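
The log fields, include/exclude patterns, and visit definition covered in the slides can be illustrated with a short sketch. It is written in Python for illustration only; Logs Miner itself is built on AWStats and Perl, and this is not its implementation. The function names (parse_line, keep, count_visits), the sample hosts lbb001.ust.hk and example.com, and the shortened user agent strings are hypothetical. The one-hour rule is applied here as "a new visit starts when the same host has been silent for an hour or more", which is an assumption about how the definition is meant to be read, and robot filtering is left out.

```python
import re
from datetime import datetime, timedelta

# Combined Log Format: host, identity, user, [time], "request",
# status, bytes, "referrer", "user agent".
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def keep(rec, include=None, exclude=None, field="host"):
    """Apply an include pattern and an exclude pattern to one field."""
    value = rec[field]
    if include and not re.search(include, value):
        return False
    if exclude and re.search(exclude, value):
        return False
    return True


def count_visits(records, timeout=timedelta(hours=1)):
    """Count visits: a host starts a new visit after `timeout` of silence."""
    last_seen = {}
    visits = 0
    for rec in sorted(records, key=lambda r: r["time"]):
        prev = last_seen.get(rec["host"])
        if prev is None or rec["time"] - prev >= timeout:
            visits += 1
        last_seen[rec["host"]] = rec["time"]
    return visits


# Hypothetical sample data in the same format as the slides' examples.
SAMPLE = [
    'lbz000.ust.hk - - [16/Nov/2009:12:03:26 +0800] "GET /catalog/ HTTP/1.1" 200 20283 "-" "Mozilla/5.0"',
    'lbb001.ust.hk - - [16/Nov/2009:12:05:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'lbb001.ust.hk - - [16/Nov/2009:12:40:00 +0800] "GET /catalog/ HTTP/1.1" 304 - "-" "Mozilla/5.0"',
    'lbb001.ust.hk - - [16/Nov/2009:14:00:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    'example.com - - [16/Nov/2009:12:10:00 +0800] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

records = [r for r in map(parse_line, SAMPLE) if r is not None]

# On-campus hosts, excluding Library staff workstations (lbz...).
campus_non_staff = [
    r for r in records
    if keep(r, include=r"^.+\.ust\.hk$", exclude=r"lbz.+\.ust\.hk")
]

print("all visits:", count_visits(records))
print("campus, non-staff visits:", count_visits(campus_non_staff))
```

The same keep function can filter on the request instead of the host by passing field="request", for example with the Cambridge Journals Online include pattern cgi\/db\/cambridge\.pl.+ from the slides.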