Web Analytics Xuejiao Liu INF 385F: WIRED Fall 2004

Outline

Introduction
- What is Web Analytics
- Why Web Analytics matters

Secondary readings
- Log file analysis
- Web usage mining
- Data preparation
- KDD process
- Document access in repositories
Log File Lowdown
(Michael Calore, 2001)
Log file
- What is in a log file
  - Traffic
  - Audience
  - Browsers/Platforms
  - Errors
  - Referers
Log File Lowdown

Sample Log File
adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"
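The sample entry above can be split into its fields with a regular expression. This is a minimal sketch, assuming the entry is a single line in the common "combined" layout (host, identity, user, timestamp, request, status, size, referer, user agent); the names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the slides.

```python
import re
from datetime import datetime

# Assumed layout: host ident user [time] "request" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # Month names are parsed with the current locale (English assumed here).
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# The sample entry from the slide, joined into one line.
sample = (
    'adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] '
    '"GET /about.htm HTTP/1.1" 200 3741 '
    '"http://www.e-angelica.com" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)
entry = parse_log_line(sample)
```

The parsed fields map onto the report categories listed earlier: `host` for audience, `path` and `size` for traffic, `agent` for browsers/platforms, `status` for errors, and `referer` for referers.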

Log File Analyzers
- WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze
WebTrends

Log file analyzer
Advantages
- Fast and effective
- User-friendly interface
- Feature-rich
- Supports different operating systems
Disadvantages
- Not free
The KDD Process for Extracting Useful Knowledge from Volumes of Data
(Fayyad, U., G. Piatetsky-Shapiro, et al. 1996)

KDD: Knowledge Discovery in Databases
- The value of data
- Definitions
  - KDD
  - Data mining
The KDD process
1. Creating a target dataset
2. Preprocessing and data cleaning
3. Data reduction and projection
4. Data mining
   - Choosing the data mining function
   - Choosing the data mining algorithm
5. Interpretation and evaluation
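The five steps above can be sketched on a toy dataset. This is only an illustration under stated assumptions: the records are made up, and a simple frequency count stands in for step 4, where a real project would choose a mining function and algorithm.

```python
from collections import Counter

# Hypothetical toy records: (requested path, HTTP status)
records = [
    ("/index.htm", 200), ("/about.htm", 200), ("/logo.gif", 200),
    ("/index.htm", 200), ("/missing.htm", 404), ("/index.htm", 200),
]

# 1. Create a target dataset: keep only page requests.
target = [r for r in records if r[0].endswith(".htm")]

# 2. Preprocess and clean: drop failed requests.
cleaned = [r for r in target if r[1] == 200]

# 3. Reduce and project: keep only the attribute of interest.
projected = [path for path, _status in cleaned]

# 4. Data mining: a trivial descriptive summary (stand-in for a real algorithm).
pattern = Counter(projected)

# 5. Interpret and evaluate: report the most requested pages.
top_pages = pattern.most_common(2)
```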
The KDD Process

Data Mining
- Data mining involves fitting models to, or determining patterns from, observed data
- Data mining algorithms
  - The model
  - The preference criterion
  - The search algorithm
The KDD Process

Data Mining
- Model functions
  - Classification
  - Regression
  - Clustering
  - Dependency modeling
  - Link analysis
- Goals of data mining
  - Predictive and descriptive
Data Preparation for Mining World Wide Web Browsing Patterns
(Cooley, R. W., B. Mobasher, et al. 1999)


Web usage mining vs. data mining
The WEBMINER process
- Preprocessing
- Mining algorithms
- Pattern analysis
Data Preparation

Preprocessing
- Data cleaning
- User identification
- Session identification
- Path completion
- Formatting
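The first three preprocessing steps above can be sketched as follows. The heuristics are assumptions chosen for illustration: requests for images and similar embedded files are dropped, a user is approximated by the pair (host, user agent), and a gap of more than 30 minutes starts a new session. Path completion and formatting are not shown, and all names and the toy entries are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed timeout
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".css", ".js")  # assumed list

def clean(entries):
    """Data cleaning: drop requests for embedded files such as images."""
    return [e for e in entries
            if not e["path"].lower().endswith(IGNORED_SUFFIXES)]

def identify_users(entries):
    """User identification: group entries by (host, user agent)."""
    users = defaultdict(list)
    for e in entries:
        users[(e["host"], e["agent"])].append(e)
    return users

def identify_sessions(user_entries):
    """Session identification: split one user's entries at long gaps."""
    sessions = []
    for e in sorted(user_entries, key=lambda e: e["time"]):
        if sessions and e["time"] - sessions[-1][-1]["time"] <= SESSION_TIMEOUT:
            sessions[-1].append(e)
        else:
            sessions.append([e])
    return sessions

# Hypothetical toy entries for illustration.
t0 = datetime(2001, 5, 9, 13, 0, 0)
entries = [
    {"host": "a.example", "agent": "MSIE", "path": "/index.htm", "time": t0},
    {"host": "a.example", "agent": "MSIE", "path": "/logo.gif",
     "time": t0 + timedelta(seconds=1)},
    {"host": "a.example", "agent": "MSIE", "path": "/about.htm",
     "time": t0 + timedelta(minutes=5)},
    {"host": "a.example", "agent": "MSIE", "path": "/index.htm",
     "time": t0 + timedelta(hours=2)},
    {"host": "a.example", "agent": "Netscape", "path": "/index.htm",
     "time": t0 + timedelta(minutes=1)},
]

users = identify_users(clean(entries))
sessions_by_user = {user: identify_sessions(es) for user, es in users.items()}
```

Two different user agents on the same host are treated as two users here, which is one simple way to separate visitors who share a host name.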
Tracking the Growth of a Site
(Nielsen, Jakob, 1998)


Exponential growth of the web and the internet
Statistical method
- Logarithmic transformation, so that linear regression can be applied
Statistical analysis
- Hypothesis: the site is growing (number of pageviews and date are correlated)
- R² and significance
Tracking the Growth of a Site
R² = 0.96, p < 0.001
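The method above (take logarithms of the pageview counts, then fit a straight line against the date) can be sketched as follows. The data are synthetic and invented for this example, not the figures behind the R² reported on the slide, and the p-value is not computed here.

```python
import math

def linear_fit(xs, ys):
    """Ordinary least squares for y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared

# Synthetic data (assumption): about 1% growth per day with some noise.
days = list(range(0, 360, 30))
noise = [1.05, 0.97, 1.02, 0.95, 1.04, 0.99, 1.03, 0.96, 1.01, 0.98, 1.06, 0.94]
pageviews = [1000 * math.exp(0.01 * d) * f for d, f in zip(days, noise)]

# Exponential growth becomes a straight line after the log transform.
log_views = [math.log(v) for v in pageviews]
slope, intercept, r_squared = linear_fit(days, log_views)

# The slope is the daily growth rate; it also gives the doubling time in days.
doubling_time = math.log(2) / slope
```

An R² close to 1 on the log scale indicates that a constant growth rate describes the data well.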
Tracking the Growth of a Site

Predict growth rate
- Clean noise
- Confidence interval
Predicting Document Access in Large, Multimedia Repositories
(Recker, M. R. and J. E. Pitkow, 1996)
Patterns of document requests in network-accessible multimedia databases
- Main idea
  - Two related domains: human memory and libraries
  - Borrow models and research results from them
Predicting Document Access

The model: human memory (Anderson and Schooler)
- The relationship between recency and performance is a power function
- The relationship between frequency and performance is a power function
- Two parameters for performance
  - Need probability p and need odds p/(1-p)
- The linear function:
  - Log(Need odds) = a Log(Frequency) + b
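The link between the power function and the linear function above can be written out in two steps. Natural logarithms are assumed here; with base-10 logarithms the constant would be 10^b instead of e^b.

```latex
\begin{align*}
\text{Need odds} = \frac{p}{1-p}
  \quad&\Longrightarrow\quad
  p = \frac{\text{Need odds}}{1 + \text{Need odds}} \\
\log(\text{Need odds}) = a \log(\text{Frequency}) + b
  \quad&\Longrightarrow\quad
  \text{Need odds} = e^{b}\,\text{Frequency}^{a}
\end{align*}
```

A straight line on log-log axes is therefore the same statement as a power function on the original axes, with the slope a as the exponent.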
Predicting Document Access

Apply Human Memory Analysis in Document Requests Model
- Dataset: log file of Georgia Tech WWW repository
- A dynamic information ecology
- Frequency analysis
  - Regression equation: Log(Need Odds) = 0.99 Log(Frequency) - 1.30
- Recency analysis
  - Regression equation: Log(Need Odds) = -1.15 Log(Days) + 0.41
- Combining recency and frequency
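The two regression equations above can be turned into need odds and need probabilities. This is a sketch under assumptions: the slides do not state the logarithm base, so it is a parameter that defaults to the natural logarithm, and the function names are illustrative. How the paper combines recency and frequency is not given on the slide and is not attempted here.

```python
import math

def need_odds(x, slope, intercept, base=math.e):
    """Invert Log(Need Odds) = slope * Log(x) + intercept for the odds."""
    return base ** (slope * math.log(x, base) + intercept)

def need_probability(odds):
    """Convert need odds p/(1-p) back to need probability p."""
    return odds / (1 + odds)

def odds_from_frequency(frequency, base=math.e):
    # Coefficients from the frequency regression on the slide.
    return need_odds(frequency, 0.99, -1.30, base)

def odds_from_recency(days, base=math.e):
    # Coefficients from the recency regression on the slide.
    return need_odds(days, -1.15, 0.41, base)

# Example: a document requested 10 times, and one last requested 10 days ago.
p_frequency = need_probability(odds_from_frequency(10))
p_recency = need_probability(odds_from_recency(10))
```

The signs of the slopes carry the main message: need odds rise with frequency (slope 0.99) and fall as the days since the last request increase (slope -1.15).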
Predicting Document Access

Conclusion
- Recency and frequency of past document access are strong predictors of future document access
- Recency proved to be a stronger predictor than frequency

Applications for the design of information systems
- Determine optimal ordering of retrieved items
- Inform design decisions
- Design of caching algorithms