Web Usage Mining - Computer Science & Engineering

Web Usage Mining: An Overview
Lin Lin
Department of Management
Lehigh University
Jan. 30th
Agenda
• Web Usage Mining: Definition
• Research Issues in Web Usage Mining
• Current Research in Web Usage Mining
• Going Forward
Web Usage Mining: A Definition
• The process of applying data mining techniques to
the discovery of usage patterns from Web data,
targeted towards various applications
• Different from content mining &
structure mining
(Adamic, L. A., and Adar, E. 2003.
Friends and neighbors on the web. Social Networks 25(3):211–230.)
Web Usage Mining: Data Source
• Typical data sources for web usage mining are:
– Web structure data
(site map, links, etc.)
– Web content data
– User profile
(may not be available)
– Web log
(web usage data,
clickstream data)
Web Usage Mining: Procedure
Preprocessing: Challenges
• WHO are the users?
– IP vs. real people
• HOW LONG did the users stay?
– Measuring session time
(L. Catledge and J. Pitkow. Characterizing browsing behaviors on
the world wide web. Computer Networks and ISDN Systems, 27(6), 1995)
(B. Berendt, B. Mobasher, M. Nakagawa, and M. Spiliopoulou. The impact of site structure and user environment on session reconstruction in
web usage analysis. In Proceedings of the 4th WebKDD 2002 Workshop, at the ACM-SIGKDD Conference on Knowledge Discovery in
Databases (KDD’2002), Edmonton, Alberta, Canada, July 2002.)
• WHERE did the users go?
– Server side vs. Client side
• WHAT did the users view?
– Content processing
(Moe, Wendy W. 2003. Buying, searching, or browsing: Differentiating between online shoppers
using in-store navigational clickstream. J. Consumer Psych. 13(1, 2) 29–40.)
For a thorough review of preprocessing methods, refer to: R. Cooley, B. Mobasher, J. Srivastava, Data preparation for mining world wide web
browsing patterns, Knowledge and Information Systems 1 (1) (1999) 5–32
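The WHO and HOW LONG questions above are usually answered with heuristics. A minimal sketch of time-based session splitting follows, assuming hits arrive as (user_key, timestamp in seconds, url) tuples. The function name, the tuple layout, and the sample data are illustrative assumptions, and the 25.5-minute default simply mirrors the cutoff mentioned elsewhere in these slides.

```python
from collections import defaultdict

def sessionize(hits, timeout_s=25.5 * 60):
    """Group (user_key, timestamp_s, url) hits into sessions.

    A new session starts whenever the gap between two consecutive hits
    from the same user_key exceeds timeout_s. user_key is whatever
    identifier is available (IP address, cookie, login); treating it as
    one real person is itself a heuristic.
    """
    by_user = defaultdict(list)
    for user_key, ts, url in hits:
        by_user[user_key].append((ts, url))

    sessions = []
    for user_key, events in by_user.items():
        events.sort()  # order each user's hits by time
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > timeout_s:
                sessions.append((user_key, current))
                current = []
            current.append(cur)
        sessions.append((user_key, current))
    return sessions

# Hypothetical log entries, for illustration only.
hits = [
    ("10.0.0.1", 0, "/home"),
    ("10.0.0.1", 120, "/products"),
    ("10.0.0.1", 4000, "/home"),
    ("10.0.0.2", 60, "/home"),
]
sessions = sessionize(hits)
```

Here the first user's third hit comes more than 25.5 minutes after the second, so it opens a new session. Two people behind one proxy IP would still be merged, which is the "IP vs. real people" problem.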
Usage Pattern Discovery: Methods
• Statistical Methods (including dependency modeling and stochastic
modeling)
• Association Rule Mining
• Clustering (user cluster vs. page cluster)
• Classification
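As an illustration of association rule mining on usage data, the sketch below counts page pairs that co-occur within sessions and reports rules that clear support and confidence thresholds. It is restricted to pairs for brevity (a full Apriori-style miner would extend to larger itemsets). The function name, thresholds, and sample sessions are assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) rules over page pairs."""
    n = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))  # each page counts once per session
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical sessions, for illustration only.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = pair_rules(sessions)
```

With these sessions, every visit to /cart also includes /products, so the rule /cart → /products has confidence 1.0 at support 0.5.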
Usage Pattern Discovery:
Research Streams
• Why am I interested in web usage mining? (a.k.a., what’s the big deal?)
– Blattberg, Robert C. and John Deighton (1991), "Interactive Marketing: Exploring the Age of Addressability," Sloan
Management Review, 33 (1), 5–14
– Ghosh, S. 1998. Making business sense of the Internet. Harvard Business Review 76(2) 126–135
– Bucklin, R. E., Lattin, J. M., Ansari, A., Bell, D., Coupey, E., Gupta, S., Little, J. D. C., Mela, C., Montgomery, A., Steckel, J.
Choice and the Internet: From Clickstream to Research Stream. Marketing Letters, 2002, Vol. 13, No. 3, pp. 245–258
Usage Pattern Discovery:
Research Streams
• Lin’s two cents on current research streams
– Build a better site:
• For everybody – system improvement
(caching & web design)
• For individuals – personalization
• For search engines – SEO
– Know your visitors better:
• Customer behavior
– Be a better business
Build a Better Site:
System Improvement
• Server-side caching of web pages (association rules)
– Y.-H. Wu, A.L.P. Chen, Prediction of web page accesses by proxy server log, World Wide
Web 5 (1) (2002) 67–88
– Preprocessing: No IP discussion, sessions split by time-based heuristics
– Method: Sequential pattern mining
– Data: Usage
– Contribution: Use frequent sequence to predict candidate page,
“personalize” based on user maturity
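The idea of predicting a candidate page from frequent sequences can be sketched as follows. This is a simplified illustration that only uses length-2 sequences (page, next page) and is not the algorithm of the cited paper. The function names, the min_count threshold, and the sample sessions are assumptions.

```python
from collections import Counter, defaultdict

def build_next_page_table(sessions, min_count=2):
    """Count page -> next-page transitions and keep the frequent ones."""
    counts = defaultdict(Counter)
    for pages in sessions:
        for cur, nxt in zip(pages, pages[1:]):
            counts[cur][nxt] += 1

    table = {}
    for cur, nxts in counts.items():
        frequent = {n: c for n, c in nxts.items() if c >= min_count}
        if frequent:
            table[cur] = Counter(frequent)
    return table

def predict_next(table, page):
    """Return the most frequent follower of page, or None if unknown."""
    nxts = table.get(page)
    return nxts.most_common(1)[0][0] if nxts else None

# Hypothetical sessions, for illustration only.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/about"],
    ["/home", "/about"],
]
table = build_next_page_table(sessions)
candidate = predict_next(table, "/home")
```

A caching layer could prefetch the predicted candidate page. Pages with no frequent follower yield None, so nothing is prefetched for them.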
Build a Better Site:
System Improvement
• Improvement of general web design (AR, SP, MM)
– Fang, X. and O. R. L. Sheng (2004). Link Selector: A web mining approach to hyperlink
selection for web portals. ACM Transactions on Internet Technology 4, 209–237
– Preprocessing: No IP distinguished, sessions split by 25.5 minutes
– Method: Association mining
– Data: Usage & Structure
– Contribution: Combine structure info. and usage info. to optimize portal
page design
• Where are we headed: adaptive web design
– Y. Fu, M. Creado, C. Ju, Reorganizing web sites based on user access patterns, in: Proceedings of the Tenth International Conference on
Information and Knowledge Management, ACM Press, 2001, pp. 583–585 (usage & content)
Build a Better Site:
Personalization
• Personalize the web site based on usage patterns (AR,
Clustering)
– A key research domain: recommender systems*
– Content clustering vs. users clustering vs. hybrid approach
– C. Shahabi and F. Banaei-Kashani. Efficient and anonymous web usage mining for web
personalization. INFORMS Journal on Computing, Special Issue on Data Mining, 2002
– Method: Clustering of sessions
– Data: Client side usage data
• Where are we headed: incorporate time and web 2.0
–
*: Refer to Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recommender systems: A survey of
the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6), 734–749
for a good review on recommender systems
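The session clustering mentioned on this slide can be illustrated with a deliberately simple technique: a single-pass "leader" clustering that uses Jaccard similarity between the sets of pages visited. This is a stand-in for illustration, not the method of the cited paper. The threshold, function names, and sample sessions are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of pages."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def leader_cluster(sessions, threshold=0.5):
    """Assign each session to the first cluster whose leader is similar enough.

    Returns a list of clusters, each a list of session indices.
    """
    clusters = []  # list of (leader_pages, member_indices)
    for idx, pages in enumerate(sessions):
        for leader, members in clusters:
            if jaccard(leader, pages) >= threshold:
                members.append(idx)
                break
        else:
            # No existing leader is close enough: start a new cluster.
            clusters.append((set(pages), [idx]))
    return [members for _, members in clusters]

# Hypothetical sessions, for illustration only.
sessions = [
    ["/home", "/shoes", "/cart"],
    ["/home", "/shoes"],
    ["/blog", "/about"],
    ["/blog", "/about", "/contact"],
]
groups = leader_cluster(sessions)
```

Sessions in the same group could then be shown the same recommendations. The result depends on session order, which is one reason real systems tend to use more robust clustering methods.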
Build a Better Site:
SEO
• Adding usage information into PageRank
– Kalyan Beemanapalli, Ramya Rangarajan, Jaideep Srivastava, “Usage-Aware Average
Clicks”, In Proc. Of WebKDD 2006: KDD Workshop on Web Mining and Web Usage
Analysis, in conjunction with the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD 2006), August 20-23 2006
– Method: Association rule in spirit
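One way to picture "adding usage information into PageRank" is to let each link's transition probability be proportional to how often it was actually clicked, instead of treating all outgoing links equally. The sketch below does that. It is an illustrative variant under that assumption and not the Usage-Aware Average Clicks algorithm from the cited paper. The function name, damping value, and click counts are hypothetical.

```python
def usage_weighted_pagerank(link_clicks, damping=0.85, iterations=50):
    """PageRank where link weights are observed click counts.

    link_clicks maps source page -> {target page: click count}.
    """
    pages = set(link_clicks)
    for targets in link_clicks.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = link_clicks.get(src, {})
            total = sum(targets.values())
            if total == 0:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[src] / n
                for p in pages:
                    new_rank[p] += share
            else:
                for dst, clicks in targets.items():
                    new_rank[dst] += damping * rank[src] * clicks / total
        rank = new_rank
    return rank

# Hypothetical click counts, for illustration only.
link_clicks = {
    "/home": {"/a": 9, "/b": 1},
    "/a": {"/home": 1},
    "/b": {"/home": 1},
}
rank = usage_weighted_pagerank(link_clicks)
```

Structurally /a and /b are identical (one inbound link from /home each), so plain PageRank would rank them equally. With click weights, /a ranks higher because visitors follow that link far more often.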
Know your visitors better:
Customer behavior
• A favorite research stream by marketers and MIS researchers
– Statistical models are used most of the time
– “Macro-level” behavior is often the focus
– Interesting questions related to firm performance and profitability
Know your visitors better:
Customer behavior
• Johnson, E. J., Wendy Moe, Peter S. Fader, Steven Bellman, and Jerry Lohse. "On the Depth
and Dynamics of Online Search Behavior," Management Science, Vol. 50, No. 3, March 2004,
pp. 299–308
– model an individual’s tendency to search as a logarithmic process
– hierarchical Bayesian model with depth of search, dynamics of search and
activity of search
– interested in the number of unique sites searched by each household within a
given product category
– Preprocessing: Households identified by client-side programs, session is
month-based
– Method: Statistical Modeling (log model)
– Data: Usage (search)
Know your visitors better:
Customer behavior
• Moe, Wendy W. 2003. Buying, searching, or browsing: Differentiating between online
shoppers using in-store navigational clickstream. J. Consumer Psych. 13(1, 2) 29–40
– WHY do the customers visit?
– Preprocessing: Content Processing
– Method: Clustering of sessions by visiting behavior parameters and
content parameters
– Data: Usage & Content
– Conclusion:
Know your visitors better:
Customer behavior
• Sismeiro, Catarina, Randolph E. Bucklin. 2004. Modeling Purchase Behavior at an E-Commerce
Web Site: A Task-Completion Approach. Journal of Marketing Research. 41 (3), 306–323
– How do the customers visit?
– Predicts online buying by linking the purchase decision to what visitors
do and to what they are exposed while at the site.
– Preprocessing: Content Processing
– Method: Statistical Modeling
– Data: Usage & Content
– Conclusion:
Know your visitors better:
Customer behavior
• Sismeiro, Catarina, Randolph E. Bucklin. 2004. Modeling Purchase Behavior at an E-Commerce
Web Site: A Task-Completion Approach. Journal of Marketing Research. 41 (3), 306–323
– browsing behavior (i.e., time and page views)
– repeat visitation to the site (return and total number of sessions)
– use of interactive decision aids
– data input effort and information gathering and processing
– a series of page-specific characteristics
Know your visitors better:
Customer behavior
• My Research: Online Customer Lifetime
– predict an individual’s tendency to stay with an e-tailer
– Hybrid BG/NBD model + Neural Networks
– interested in the relationship between online customer lifetime and firm
profitability
– Preprocessing: Households identified by client-side programs, session is
month-based
– Method: Statistical Modeling & Classification
– Data: Usage
Know your visitors better:
Customer behavior
• My Research: Online Customer Lifetime
• Given N customers with visiting history $(X_i, t_{x_i}, T)$
– $T$: the observed time period
– $X_i$: number of visits customer $i$ made during $T$ (observed value $x_i$)
– $t_{x_i}$: time of the last visit made by customer $i$
• Maximize the following likelihood to estimate the four parameters $r$, $a$, $b$ and $\alpha$:

$$
L(r, \alpha, a, b) = \prod_{i=1}^{N} \left[ \frac{B(a,\, b + x_i)}{B(a, b)} \cdot \frac{\Gamma(r + x_i)\, \alpha^{r}}{\Gamma(r)\, (\alpha + T)^{r + x_i}} + \delta_{x_i > 0}\, \frac{B(a + 1,\, b + x_i - 1)}{B(a, b)} \cdot \frac{\Gamma(r + x_i)\, \alpha^{r}}{\Gamma(r)\, (\alpha + t_{x_i})^{r + x_i}} \right]
$$
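To make the estimation step concrete, here is a minimal sketch of the log of the likelihood above, written with the standard library only. It assumes each customer is given as a tuple (x, t_x, T) and that the parameters are passed in the order (r, alpha, a, b). The function names and tuple layout are assumptions for illustration.

```python
import math

def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)

def bgnbd_log_likelihood(params, customers):
    """Sum of per-customer BG/NBD log-likelihoods.

    params: (r, alpha, a, b), all positive.
    customers: iterable of (x, t_x, T) with x visits, last visit at t_x,
    observed over a period of length T.
    """
    r, alpha, a, b = params
    total = 0.0
    for x, t_x, T in customers:
        # Factor shared by both terms: Gamma(r+x) alpha^r / (Gamma(r) B(a,b)).
        common = (math.lgamma(r + x) - math.lgamma(r)
                  + r * math.log(alpha) - log_beta(a, b))
        first = log_beta(a, b + x) - (r + x) * math.log(alpha + T)
        if x > 0:
            second = (log_beta(a + 1, b + x - 1)
                      - (r + x) * math.log(alpha + t_x))
            # Log-sum-exp of the two terms for numerical stability.
            m = max(first, second)
            total += common + m + math.log(
                math.exp(first - m) + math.exp(second - m))
        else:
            total += common + first
    return total

# Hypothetical parameters and customers, for illustration only.
ll_no_visits = bgnbd_log_likelihood((1.0, 1.0, 1.0, 1.0), [(0, 0.0, 1.0)])
ll_one_visit = bgnbd_log_likelihood((1.0, 1.0, 1.0, 1.0), [(1, 1.0, 1.0)])
```

In practice the negative of this function would be handed to a numerical optimizer, with the four parameters constrained to be positive.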
Know your visitors better:
Customer behavior
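• Given $r$, $a$, $b$ and $\alpha$, we can predict:
– Total number of visits by the $N$ customers during a time period $t$ (starting from time 0)

$$
N \cdot \frac{a + b - 1}{a - 1} \left[ 1 - \left( \frac{\alpha}{\alpha + t} \right)^{r} {}_2F_1\!\left( r,\, b;\; a + b - 1;\; \frac{t}{\alpha + t} \right) \right]
$$

– Number of visits $Y(t)$ an individual with history $(x, t_x, T)$ will make in the future $t$ time units
(from T+1 to T+t)

$$
E\left[ Y(t) \mid x, t_x, T \right] = \frac{\dfrac{a + b + x - 1}{a - 1} \left[ 1 - \left( \dfrac{\alpha + T}{\alpha + T + t} \right)^{r + x} {}_2F_1\!\left( r + x,\, b + x;\; a + b + x - 1;\; \dfrac{t}{\alpha + T + t} \right) \right]}{1 + \delta_{x > 0}\, \dfrac{a}{b + x - 1} \left( \dfrac{\alpha + T}{\alpha + t_x} \right)^{r + x}}
$$

Here ${}_2F_1$ is the Gaussian hypergeometric function.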
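The individual-level prediction above can be evaluated numerically. The sketch below is a minimal standard-library version that sums the Gaussian hypergeometric series directly, which converges because the argument t/(α+T+t) is below 1. The function names and the sample parameter values are assumptions, and the closed form requires a ≠ 1.

```python
import math

def hyp2f1(a, b, c, z, tol=1e-12, max_terms=10000):
    """Gaussian hypergeometric function 2F1 via its power series (|z| < 1)."""
    term = 1.0
    total = 1.0
    for n in range(max_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def expected_future_visits(r, alpha, a, b, x, t_x, T, t):
    """Expected visits in (T, T + t] for a customer with history (x, t_x, T).

    Uses the BG/NBD conditional expectation; a must differ from 1.
    """
    z = t / (alpha + T + t)
    numerator = (a + b + x - 1) / (a - 1) * (
        1 - ((alpha + T) / (alpha + T + t)) ** (r + x)
        * hyp2f1(r + x, b + x, a + b + x - 1, z)
    )
    denominator = 1.0
    if x > 0:
        denominator += a / (b + x - 1) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return numerator / denominator

# Hypothetical parameter values, for illustration only.
y = expected_future_visits(r=1.0, alpha=1.0, a=2.0, b=1.0,
                           x=0, t_x=0.0, T=1.0, t=1.0)
```

For this toy case the hypergeometric term has a closed form, which gives y = 2(1 - 2 ln 1.5), roughly 0.378 expected visits in the next time unit. For large-scale work a library routine such as `scipy.special.hyp2f1` would be a more robust choice than the hand-rolled series.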
Know your visitors better:
Customer behavior
• My Research: Online Customer Lifetime
Product Type: Search Goods, Experience Goods

Company          Number of visitors  Calibration period  Testing period  Mean Lifetime  Percentage Right censored  B. Acc  Acc.
Amazon           1267                5                   1               125.8          44.91%                     75.27%  77.82%
Amazon           1267                4                   2               136            42.86%                     72.45%  74.90%
BMG              177                 5                   1               123.93         23.73%                     67.23%  77.40%
BMG              177                 4                   2               128.16         38.98%                     79.10%  70.62%
Columbia         304                 5                   1               131.31         43.75%                     88.16%  78.62%
Columbia         304                 4                   2               126.96         47.70%                     80.59%  70.07%
Drugstore        57                  5                   1               74.16          28.08%                     84.21%  78.95%
Drugstore        57                  4                   2               72.28          21.06%                     78.95%  75.43%
Ticketmaster     179                 5                   1               100.96         30.34%                     80.89%  74.72%
Ticketmaster     179                 4                   2               102.97         18.44%                     78.77%  70.95%
landsend         32                  5                   1               45.6           6.25%                      71.88%  81.25%
landsend         32                  4                   2               51.5           9.38%                      87.50%  75.00%
oldNavy          88                  5                   1               101.41         39.77%                     79.31%  80.68%
oldNavy          88                  4                   2               113.85         39.77%                     78.41%  72.73%
victoriassecret  54                  5                   1               52.56          14.82%                     74.07%  77.78%
victoriassecret  54                  4                   2               63.92          12.96%                     72.22%  79.63%
Web Usage Mining: The Future