Web Analytics
Xuejiao Liu
INF 385F: WIRED
Fall 2004

Outline
  Introduction
    What is Web Analytics
    Why Web Analytics matters
  Secondary readings
    Log file analysis
    Web usage mining
    Data preparation
    KDD process
    Document access in repositories

Log File Lowdown (Michael Calore, 2001)
  Log file
  What is in a log file
    Traffic
    Audience
    Browsers/Platforms
    Errors
    Referers

Log File Lowdown
  Sample Log File
    adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
  Log File Analyzers
    WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze

WebTrends log file analyzer
  Advantages
    Fast and effective
    User-friendly interface
    Feature-rich
    Supports different operating systems
  Disadvantages
    Not free

The KDD Process for Extracting Useful Knowledge from Volumes of Data (Fayyad, U., G. Piatetsky-Shapiro, et al. 1996)
  KDD: Knowledge Discovery in Databases
  The value of data
  Definitions
    KDD
    Data mining

The KDD Process
  1. Creating a target dataset
  2. Preprocessing and data cleaning
  3. Data reduction and projection
  4. Data mining
       Choosing the data mining function
       Choosing the data mining algorithm
  5. Interpretation and evaluation

The KDD Process
  Data Mining
    Data mining involves fitting models to, or determining patterns from, observed data
  Data mining algorithms
    The model
    The preference criterion
    The search algorithm

The KDD Process
  Data Mining
    Model functions
      Classification
      Regression
      Clustering
      Dependency modeling
      Link analysis
    Goals of data mining
      Predictive and descriptive

Data Preparation for Mining World Wide Web Browsing Patterns (Cooley, R. W., B. Mobasher, et al. 1999)
  Web usage mining vs. data mining
  The WEBMINER process
    Preprocessing
    Mining algorithms
    Pattern analysis

Data Preparation
  Preprocessing
    Data cleaning
    User identification
    Session identification
    Path completion
    Formatting

Tracking the Growth of a Site (Nielsen, Jakob, 1998)
  Exponential growth of the web and the internet
  Statistical method
    Logarithmic transformation of the data to allow linear regression
  Statistical analysis
    Hypothesis: the site is growing (number of pageviews and date are correlated)
    R² and significance

Tracking the Growth of a Site
  R² = 0.96, p < 0.001

Tracking the Growth of a Site
  Predict growth rate
  Clean noise
  Confidence interval

Predicting Document Access in Large, Multimedia Repositories (Recker, M. R. and J. E. Pitkow, 1996)
  Patterns of document requests in network-accessible multimedia databases
  Main idea
    Two related domains: human memory and libraries
    Borrow models and research results from them

Predicting Document Access
  The model: human memory (Anderson and Schooler)
    The relationship between recency and performance is a power function
    The relationship between frequency and performance is a power function
  Two parameters for performance
    Need probability p and need odds p/(1-p)
  The linear function: Log(Need odds) = a Log(Frequency) + b

Predicting Document Access
  Applying the human memory analysis to a document request model
    Dataset: log file of the Georgia Tech WWW repository
    A dynamic information ecology
  Frequency analysis
    Regression equation: Log(Need odds) = 0.99 Log(Frequency) - 1.30
  Recency analysis
    Regression equation: Log(Need odds) = -1.15 Log(Days) + 0.41
  Combining recency and frequency

Predicting Document Access
  Conclusion
    Recency and frequency of past document access are strong predictors of future document access
    Recency proved to be a stronger predictor than frequency
  Applications for the design of information systems
    Determine the optimal ordering of retrieved items
    Inform design decisions, such as the design of caching algorithms
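
Example (illustrative sketch, not from the readings)
  The two regression equations reported for the Georgia Tech log data can be turned back into need probabilities. The Python sketch below shows one way to do that. It assumes natural logarithms, since the slides do not state the base, and the function names are hypothetical. Odds are converted to a probability with p = odds / (1 + odds), which is the inverse of odds = p/(1-p).

```python
import math

# Coefficients as reported on the slides for the Georgia Tech log data.
FREQ_SLOPE, FREQ_INTERCEPT = 0.99, -1.30
RECENCY_SLOPE, RECENCY_INTERCEPT = -1.15, 0.41


def odds_to_probability(odds: float) -> float:
    """Invert odds = p / (1 - p), giving p = odds / (1 + odds)."""
    return odds / (1.0 + odds)


def need_odds_from_frequency(frequency: float) -> float:
    """Log(Need odds) = 0.99 Log(Frequency) - 1.30 (natural log ASSUMED)."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    return math.exp(FREQ_SLOPE * math.log(frequency) + FREQ_INTERCEPT)


def need_odds_from_recency(days: float) -> float:
    """Log(Need odds) = -1.15 Log(Days) + 0.41 (natural log ASSUMED)."""
    if days <= 0:
        raise ValueError("days must be positive")
    return math.exp(RECENCY_SLOPE * math.log(days) + RECENCY_INTERCEPT)


if __name__ == "__main__":
    # Print estimated need probabilities for a few illustrative inputs.
    for freq in (1, 5, 20):
        p = odds_to_probability(need_odds_from_frequency(freq))
        print(f"accessed {freq} times in the window: p = {p:.3f}")
    for days in (1, 7, 30):
        p = odds_to_probability(need_odds_from_recency(days))
        print(f"last accessed {days} days ago: p = {p:.3f}")
```

  Under these assumptions, the estimated need probability rises with access frequency and falls as the number of days since the last access grows, which matches the direction of the two slopes.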