Automatic Blog Monitoring and Summarization

Efficient Monitoring, Mining and
Analysis of User-generated Content
Ka Cheung “Richard” Sia (UCLA)
kcsia@cs.ucla.edu
Sept 8 2008
1
Explosion of user-generated content

Doubling every 5 months, according to Technorati
2
Characteristics of content

About 97%-98% of daily content is new

Measured with 50-word shingles

62% of weekly content on the web is new (“What’s New on the Web? The Evolution of the Web from a Search Engine Perspective”, by Ntoulas et al., WWW 2004)
3
Characteristics of content

Mostly consists of chatter about current events

Politics

Technology

Entertainment

Sports
4
The Yahoo! Buzz service
5
Agenda

Introduction: growth and characteristics of user-generated content

Three aspects

Monitoring: How to deliver fresh content to users

Aggregation: How to efficiently deliver personalized results to
users

Analysis of tagging data: Making tagging data useful for
advertisers
6
Framework

Pull model: A central server monitors data source changes
and provides digested content to users

Push model: data sources notify the server of updates
7
Overview

New challenges
Content is updated more frequently, with recurring patterns
More time-sensitive requirements

Modeling of post update
Definition of delay
Strategies for allocation and scheduling


8
How do updates arrive?

Homogeneous Poisson model
λ(t) = λ at any t

Periodic inhomogeneous Poisson model
λ(t) = λ(t-nT), n=1,2,…
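
To make the periodic inhomogeneous model concrete, here is a minimal sketch that samples posting times by thinning a homogeneous Poisson process. The function names, the sinusoidal rate, and the bound `rate_max` are illustrative assumptions and are not taken from the slides.

```python
import math
import random

def sample_periodic_poisson(rate, rate_max, horizon, rng):
    """Sample event times on [0, horizon) by thinning.

    rate(t) must satisfy 0 <= rate(t) <= rate_max for all t.
    """
    times = []
    t = 0.0
    while True:
        # Candidate arrival from a homogeneous process with rate rate_max.
        t += rng.expovariate(rate_max)
        if t >= horizon:
            return times
        # Keep the candidate with probability rate(t) / rate_max.
        if rng.random() < rate(t) / rate_max:
            times.append(t)

def periodic_rate(t, period=1.0):
    """Hypothetical posting rate with lambda(t) = lambda(t - nT)."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * (t % period) / period)

rng = random.Random(0)
posts = sample_periodic_poisson(periodic_rate, 4.0, 50.0, rng)
print(len(posts), "posts sampled over 50 periods")
```

With a constant `rate` the same function samples the homogeneous model.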
9
Definition of metrics

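Delay of a data source
Sum of the elapsed time for every post (post i appears at time t_i and is retrieved at the next retrieval time τ_j)
$D(t_i) = \tau_j - t_i$
$D(O) = \sum_{i=1}^{k} D(t_i)$

Delay experienced by the aggregator
$D(A) = \sum_{i=1}^{n} w_i D(O_i)$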
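
The two delay metrics can be sketched in a few lines. The post times, retrieval times, and weights below are made-up illustration values, and the function names are assumptions.

```python
from bisect import bisect_left

def source_delay(post_times, retrieval_times):
    """D(O): sum over posts of (next retrieval time - posting time)."""
    retrievals = sorted(retrieval_times)
    total = 0.0
    for t in post_times:
        j = bisect_left(retrievals, t)  # first retrieval with tau_j >= t
        if j == len(retrievals):
            raise ValueError(f"post at t={t} is never retrieved")
        total += retrievals[j] - t
    return total

def aggregator_delay(weights, source_delays):
    """D(A): weighted sum of the per-source delays."""
    return sum(w * d for w, d in zip(weights, source_delays))

retrievals = [1.0, 2.0]
d1 = source_delay([0.2, 0.9, 1.5], retrievals)  # 0.8 + 0.1 + 0.5
d2 = source_delay([0.5], retrievals)            # 0.5
total = aggregator_delay([1.0, 2.0], [d1, d2])
print(d1, d2, total)
```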
10
Approach

Resource allocation


How often to contact data sources?
O1 is more active than O2, how much more often should we
contact O1 than O2?
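$m_i \propto \sqrt{w_i \lambda_i}$ (m_i: retrievals allocated to O_i, w_i: weight of O_i, λ_i: posting rate of O_i)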

Retrieval scheduling


When to contact a data source?
If 2 retrievals are allocated to O1, when should these 2
retrievals be scheduled?
11
Single retrieval per period example


λ(t) = 1 for t ∈ [0,1], λ(t) = 0 for t ∈ [1,2]
Periodicity T=2



τ = 0.5, expected delay = 0.75
τ = 1, expected delay = 0.5
τ = 2, expected delay = 1.5
12
Multiple retrievals per period

m retrievals per period are allocated. When they are
scheduled at times τ1, …, τm, the expected delay is
given by:
$D(O) = \sum_{i=1}^{m} \int_{\tau_i}^{\tau_{i+1}} \lambda(t)(\tau_{i+1} - t)\,dt$, where $\tau_{m+1} = T + \tau_1$

Criterion for optimality:
$\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$
13
Example

6 retrievals for λ(t) = 2 + 2 sin(2πt)
Criterion for optimality:
$\lambda(\tau_j)(\tau_{j+1} - \tau_j) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)\,dt$
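
The expected delay of a retrieval schedule can be estimated numerically. The sketch below assumes a midpoint-rule integration and hypothetical function names. It reproduces two values of the single-retrieval example (τ = 1 and τ = 2) and evaluates an evenly spaced schedule of 6 retrievals for λ(t) = 2 + 2 sin(2πt), which is only a starting point and not the optimal schedule.

```python
import math

def expected_delay(rate, taus, period, steps=2000):
    """Midpoint-rule estimate of D(O) for retrieval times taus in one period.

    Posts arriving in (tau_i, tau_{i+1}] wait until tau_{i+1}; the last
    interval wraps around to the first retrieval of the next period.
    """
    taus = sorted(taus)
    bounds = taus + [period + taus[0]]  # tau_{m+1} = T + tau_1
    total = 0.0
    for start, end in zip(bounds, bounds[1:]):
        h = (end - start) / steps
        for k in range(steps):
            t = start + (k + 0.5) * h
            total += rate(t % period) * (end - t) * h
    return total

def step_rate(t):
    """lambda(t) = 1 on [0,1), 0 on [1,2); period T = 2."""
    return 1.0 if t < 1.0 else 0.0

def sine_rate(t):
    """lambda(t) = 2 + 2 sin(2 pi t); period T = 1."""
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

delay_tau_1 = expected_delay(step_rate, [1.0], 2.0)
delay_tau_2 = expected_delay(step_rate, [2.0], 2.0)
even_schedule = [k / 6 for k in range(6)]
delay_even = expected_delay(sine_rate, even_schedule, 1.0)
print(delay_tau_1, delay_tau_2, delay_even)
```

Feeding candidate schedules into `expected_delay` is one simple way to compare them before solving the optimality criterion.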
14
Experiment

Data – 10k RSS feeds from syndic8.com collected during Oct
– Dec 2004

Typical power law distribution – good for resource allocation
15
Performance


CGM03 (“Effective page refresh policies for Web crawlers”, by Cho
and Garcia-Molina, ACM TODS 2003)

Homogeneous Poisson model

Optimizes for the “age” metric
Ours – both resource allocation and retrieval scheduling
16
Size of estimation window


Resource constraint: 4 retrievals per day per feed on average
2 weeks seems an appropriate choice
17
Consistency of posting rate

90% of the RSS feeds post consistently
18
Summary

Resource allocation is aggressive

Retrieval scheduling optimizes within individual data source




Significantly improved freshness of content
Also considered user browsing pattern
“Efficient Monitoring Algorithm for Fast News Alerts”, with Junghoo Cho, HyunKyu Cho, in IEEE TKDE 2007
“Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo Cho,
Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007
19
Agenda

Introduction: growth and characteristics of user-generated content

Three aspects

Monitoring: How to deliver fresh content to users

Aggregation: How to efficiently deliver personalized
results to users

Analysis of tagging data: Making tagging data useful for
advertisers
20
Aggregate query over blogs

User-generated content in the Blogosphere and Web 2.0 services contains
rich information about recent events

Aggregation of individual users' opinions shows current popular trends
21
Motivation


Global aggregation (examples from blogpulse.com)

Recent news got picked up quickly
 “Dark Knight” in the week of July 18
 “Olympics” related phrases in the week of August 8

Potential drawbacks
What if a user is not interested in entertainment at all?
Groups of bloggers may collaborate to promote advertisement
videos
Personal aggregation

Users selectively aggregate from different sources

Efficient strategy to handle large number of users and sources
22
From global to personal aggregation
[Figure: example blog posts and the items (phrases) extracted from them]
Bloggers' posts:
“Michael Phelps performance in Olympics is awesome...”
“Finished watching Michael Phelps in Olympics, let me try the WALL-E DVD...”
“Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!”
“Um.. it will be good if there is a free show of Dark Knight and WALL-E”
Items (phrases): Olympics, Dark Knight, Michael Phelps, Las Vegas, WALL-E
23
Matrix formulation

Endorsement matrix (E) - e.g. the number of times a blogger
mentions an object (keywords / links) in his posts.

Trust matrix (T) - e.g. how often a user reads from a blog

Personalized score (TE) – weighted endorsement score by a
user’s trust vector
E       o1    o2    o3
b1      3     2     0
b2      0     3     0
b3      1     0     1
b4      1     2     3
Total   5     7     4

T       b1    b2    b3    b4
u1      0.8   0.8   0     0
u2      0.2   0.2   0.6   0.6
u3      0     0     0.5   0.5

TE      o1    o2    o3
u1      2.4   4.0   0.0
u2      1.8   2.2   2.4
u3      1.0   1.0   2.0
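
The personalized score matrix is the product of the trust and endorsement matrices. A minimal sketch with the values from this slide, using plain Python lists (the helper name `matmul` is an assumption):

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

# Endorsement matrix E: rows are blogs b1..b4, columns are objects o1..o3.
E = [[3, 2, 0],
     [0, 3, 0],
     [1, 0, 1],
     [1, 2, 3]]

# Trust matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]

TE = matmul(T, E)
for user, row in zip(["u1", "u2", "u3"], TE):
    print(user, [round(x, 2) for x in row])
```

Each row of `TE` is one user's personalized score per object, so ranking a row gives that user's top items.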
24
Baseline implementations

Endorsement (blog_id, item, score), Trust (user_id, blog_id, score)

Personal Aggregate Query
SELECT e.item, SUM(t.score * e.score) AS p_score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id AND
t.user_id = <user id>
GROUP BY e.item
ORDER BY p_score DESC LIMIT 20
On-the-fly (OTF)
View
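
The on-the-fly baseline can be tried out with an in-memory SQLite database. This is only a sketch: the deck reports MySQL, the helper name `personal_top_items` is an assumption, and the rows are the nonzero entries of the small example matrices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3),
    ("b3", "o1", 1), ("b3", "o3", 1),
    ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8),
    ("u2", "b1", 0.2), ("u2", "b2", 0.2), ("u2", "b3", 0.6), ("u2", "b4", 0.6),
    ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])

def personal_top_items(connection, user_id, limit=20):
    """Run the personal aggregate query for one user at request time."""
    return connection.execute(
        """
        SELECT e.item, SUM(t.score * e.score) AS p_score
        FROM Endorsement e JOIN Trust t ON e.blog_id = t.blog_id
        WHERE t.user_id = ?
        GROUP BY e.item
        ORDER BY p_score DESC
        LIMIT ?
        """,
        (user_id, limit),
    ).fetchall()

top_u2 = personal_top_items(conn, "u2")
print(top_u2)
```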
25
Optimizing the query

Identify “template” users

Typical users interested in sports / politics / technology / ...

Results of template users are pre-computed

Results of individual users are combined from partially
computed results
26
Using NMF to discover user groups


Factorize trust matrix
Decompose T into two sub-matrices W and H





Non-negative matrix factorization
W: <individual users : template users> relationship
H: <template users : blogs> relationship
User 2’s trust vector is expressed as linear combination of the
trust vectors of template user 1 and 2
NMF as an approximation of original trust matrix
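
The slides do not say which NMF algorithm was used. As one possible illustration, the sketch below applies the standard multiplicative update rules (Lee and Seung) to the small example trust matrix with 2 template users. The rank, iteration count, seed, and function names are all assumptions.

```python
import math
import random

def matmul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(row) for row in zip(*a)]

def nmf(v, rank, iterations=2000, seed=0, eps=1e-9):
    """Approximate v (n x m, non-negative) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h

T = [[0.8, 0.8, 0.0, 0.0],
     [0.2, 0.2, 0.6, 0.6],
     [0.0, 0.0, 0.5, 0.5]]
W, H = nmf(T, rank=2)
approx = matmul(W, H)
error = math.sqrt(sum((T[i][j] - approx[i][j]) ** 2
                      for i in range(3) for j in range(4)))
print("reconstruction error:", error)
```

Each row of `W` gives one user's weights over the template users, and each row of `H` is a template user's trust vector over blogs.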
27
Reconstruction of results

Personalized endorsement scores of template users are pre-computed;
results of individual users are computed on request
(HE) is maintained as sorted lists for all template users

W * (HE) is the personal aggregation result


Computed using the Threshold Algorithm (by Fagin et al., PODS 2001)

Top-K list

(HE) are sorted lists

W * (HE) is a weighted linear combination
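
A minimal sketch of the Threshold Algorithm for a weighted sum over sorted lists follows. It assumes non-negative scores and weights, and treats an item missing from a list as having score 0 there. The two sorted lists are the (HE) rows obtained by treating u1 and u3 from the small example as template users; the weights (0.25, 1.2) are chosen so that the combination equals u2's trust vector. These choices are illustrative and not from the slides.

```python
import heapq

def threshold_top_k(sorted_lists, weights, k):
    """Top-k items by weighted sum, reading the lists in sorted order."""
    lookups = [dict(lst) for lst in sorted_lists]  # random access by item
    seen = {}
    max_depth = max(len(lst) for lst in sorted_lists)
    for depth in range(max_depth):
        threshold = 0.0
        for lst, w in zip(sorted_lists, weights):
            if depth >= len(lst):
                continue  # exhausted list: unseen items score 0 here
            item, score = lst[depth]
            threshold += w * score
            if item not in seen:
                seen[item] = sum(wg * lk.get(item, 0.0)
                                 for wg, lk in zip(weights, lookups))
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # No unseen item can beat the threshold, so stop early if possible.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

template_lists = [
    [("o2", 4.0), ("o1", 2.4), ("o3", 0.0)],  # template user 1, sorted by score
    [("o3", 2.0), ("o1", 1.0), ("o2", 1.0)],  # template user 2, sorted by score
]
user_weights = [0.25, 1.2]
top2 = threshold_top_k(template_lists, user_weights, k=2)
print(top2)
```

In this run the algorithm stops after reading two entries per list, without scanning the lists to the end.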
28
Partition of trust matrix

Decomposition is useful when matrix is dense

Real-life data is often skewed (by Akshay et al., ICWSM 2007)

Hybrid method: uses decomposition only when it is effective
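[Figure: partition of the subscription matrix (2.7M subscription pairs) into three regions: 1. OTF, 2. VIEW, 3. NMF. Axes: blogs with more subscribers; users with more subscriptions. Dense region: users with >30 subscriptions and feeds with >30 subscribers (10k feeds, 24k users, ~1M subscription pairs).]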
29
Experiments

Bloglines.com : online RSS reader

Trust matrix T (1-0 version): subscription profile



91K users

487K RSS feeds
Endorsement matrix E: blog – keywords occurrence

Feed content collected between Nov 2006-Jul 2007

Keywords filtered by nouns with high tf-idf values
Platform

Python implementation of proposed scheme

MySQL server on Linux with data stored on RAID
30
How different is personalization?

Week of 2007 Jan 7 – 2007 Jan 13
Major event: iPhone announced
Top 10 items, 2007-01-07 to 2007-01-13:

Global        User 90439   User 90550   User 91017
sales         cattle       brazil       yorker
iphone        beef         iguazu       iraq
apple         iphone       reuters      bush
manager       chicago      search       president
iraq          iraq         vegas        views
management    bush         argentina    avenue
development   apple        kibbutz      dept
software      companies    video        troops
business      prices       cathartik    saddam
phone         quarter      google       iran

Personal aggregation results differ from global aggregation
31
How different is personalization?


Overlap comparison of global aggregation and personal aggregation

L_G – global top 20 items

L_i – individual top 20 items of user i
Personal aggregation results also differ among users
Overlap degree with global aggregation result: $|L_G \cap L_i|$
Pair-wise overlap among users: $|L_i \cap L_j|$
32
Approximation accuracy


Dense region of subscription matrix

>30 subscribers: 10152 feeds

>30 subscriptions: 24340 users
L2 norm comparison

Rank   SVD     NMF
80     848.5   856.9
90     841.6   850.1
100    835.1   844.6
110    829.0   837.9
120    823.2   833.0

Sparsity of W (23%), H (13%)

NMF approximation is close to SVD, with the advantage of sparseness
33
Approximation accuracy


How many of the top 20 items are recovered by the NMF approximation?

T_i – top 20 items of user i computed by OTF

A_i – top 20 items of user i computed by NMF
About 70% of the items are recovered, with higher accuracy for higher-ranked items
Overlap: $|A_i \cap T_i| / |T_i|$
Correlation with rank
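
The overlap ratio used for the accuracy comparison is a one-liner. The item lists below are hypothetical and the function name is an assumption.

```python
def overlap_ratio(approx_top, exact_top):
    """|A ∩ T| / |T| for an approximate list A and an exact list T."""
    exact = set(exact_top)
    return len(set(approx_top) & exact) / len(exact)

exact_items = ["iphone", "apple", "iraq", "bush"]      # e.g. computed by OTF
approx_items = ["iphone", "iraq", "google", "apple"]   # e.g. computed by NMF
ratio = overlap_ratio(approx_items, exact_items)
print(ratio)
```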
34
Efficiency of proposed method

Update cost (for 1 week of data)


OTF (222K) < NMF (3.2M) < VIEW (23.6M)
Query response time

Averaged over the 1000 users with the highest number of subscriptions

OTF: executes the SQL query on the MySQL server

NMF: Python implementation of the Threshold Algorithm that interfaces with the MySQL server

Method   avg     std     max      min
OTF      2.05s   3.60s   84.42s   0.037s
NMF      0.46s   0.53s   2.84s    0.007s

Average query response time reduced by 75%; outliers with
significant delay eliminated
35
Summary


Deliver tailored results to users by personal aggregation
Proposed a model for personal aggregate queries
Optimization by NMF & Threshold Algorithm
A study on a real-life dataset shows that query response time can be reduced
significantly with acceptable approximation accuracy




“Efficient Computation of Personal Aggregation Queries on Blogs”, with
Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008
“Capturing User Interest by Both Exploitation and Exploration”, with Shenghuo
Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007
36
Agenda

Introduction: growth and characteristics of user-generated content

Three aspects

Monitoring: How to deliver fresh content to users

Aggregation: How to efficiently deliver personalized results to
users

Analysis of tagging data: Making tagging data useful for
advertisers
37
More than just tag-cloud
38
LDA of tagging data

d – bookmark, w – tag (merged across all users), z – topic

Sample of topics (tags are the top 10 by p(w|z))

ID   Topic                      Tags
6    Environmental Protection   environment energy green ibm item sustainability oil power solar alternative
27   Politics                   politics government war iraq usa activism bush military political terrorism
38   Literature                 books book literature ebooks free reading poetry ebook publishing writing
49   Copyrights                 law copyright legal drm sony rights remote vnc ethics creativcommons
55   Health                     health fitness medicine regex medical drugs exercise running diet training
63   Linux                      linux unix ubuntu debian os sysadmin shell kernel bash livecd
69   Photography                photography photo photos flickr camera gallery images pictures digital photoblog
39
Change of entropy

Tags with increasing popularity in a period
Some correspond to well-established topics where users have reached consensus
Others correspond to developing topics where users are willing to explore new pages
40
Change of topic association
“Programmers” at Oct 2005
– programming, development, code, patterns, dev, coding, algorithms, scheme, software, ...
“Programmers” at Jan 2006
– programming, development, code, lisp, dev, coding, algorithms, scheme, software, cs, ...
– work, jobs, career, job, shell, sleep, uml, regex, scripting, bash, ...
41
Specificity of word semantics

Entropy vs idf metrics
42
Summary

Features can be combined to build a classifier for words

Tag entropy change rate

KL-divergence of topic distribution

Entropy of semantics

Help advertisers select better keywords for advertisements

“Exploring Social Annotations for Word Usage Evolution” –
work in progress
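
Two of the features listed above rest on standard quantities. The sketch below shows Shannon entropy and KL-divergence; it assumes that a tag's entropy is taken over how its uses are spread across bookmarks and that topic distributions are compared between two time periods. The slides do not spell out these definitions, and all numbers are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits for two distributions over the same topics."""
    return sum(pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical: uses of one tag across four bookmarks in two periods.
entropy_before = entropy([10, 1, 1, 1])
entropy_after = entropy([4, 3, 3, 3])
entropy_change = entropy_after - entropy_before

# Hypothetical: topic distribution of a tag in two periods.
topics_before = [0.9, 0.1]
topics_after = [0.6, 0.4]
topic_shift = kl_divergence(topics_after, topics_before)
print(entropy_change, topic_shift)
```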
43
Thank you!
44
Definition of metrics

τj – retrieval time
λ(t) – posting rate

Expected delay

Homogeneous Poisson model
$D(O) = \frac{\lambda(\tau_j - \tau_{j-1})^2}{2}$

Inhomogeneous Poisson model
$D(O) = \int_{\tau_{j-1}}^{\tau_j} \lambda(t)(\tau_j - t)\,dt$
45
Resource allocation

Consider n data sources O1, …, On





λi – posting rate of Oi
wi – weight of Oi
N – total number of retrievals per day
mi – number of retrievals per day allocated to Oi
Optimal allocation
$m_i \propto \sqrt{w_i \lambda_i}$
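
A minimal sketch of the square-root allocation rule, assuming the proportionality constant is fixed by the total budget N. The function name and example numbers are illustrative.

```python
import math

def allocate_retrievals(weights, rates, total):
    """Split `total` retrievals so that m_i is proportional to sqrt(w_i * lambda_i)."""
    shares = [math.sqrt(w * r) for w, r in zip(weights, rates)]
    norm = sum(shares)
    return [total * s / norm for s in shares]

# O1 posts four times as often as O2, with equal weights and 6 retrievals per day.
allocation = allocate_retrievals([1.0, 1.0], [4.0, 1.0], 6)
print(allocation)
```

Under this rule a source that is four times as active receives only twice as many retrievals.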
46
Single retrieval per period

For a data source with posting rate λ(t) and period T,
the expected delay when retrieved at time τ is given
by:






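$D(\tau) = \int_0^{\tau} \lambda(t)(\tau - t)\,dt + \int_{\tau}^{T} \lambda(t)(T + \tau - t)\,dt$

Criterion for optimality:
$\lambda(\tau) = \frac{1}{T}\int_0^T \lambda(t)\,dt$ and $\frac{d\lambda(\tau)}{d\tau} < 0$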
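
The optimality condition for a single retrieval says to retrieve where the posting rate falls through its average. A rough grid-based sketch follows, using λ(t) = 2 + 2 sin(2πt) with T = 1 as a test case; the function names and grid resolution are assumptions.

```python
import math

def mean_rate(rate, period, steps=10000):
    """Average of rate(t) over one period (midpoint rule)."""
    h = period / steps
    return sum(rate((k + 0.5) * h) for k in range(steps)) * h / period

def candidate_retrieval_times(rate, period, steps=10000):
    """Grid points where rate(t) crosses its mean from above.

    These satisfy lambda(tau) = mean rate with d(lambda)/d(tau) < 0,
    up to the grid resolution.
    """
    avg = mean_rate(rate, period, steps)
    h = period / steps
    candidates = []
    for k in range(steps):
        left, right = k * h, (k + 1) * h
        if rate(left) > avg >= rate(right):
            candidates.append(right)
    return candidates

def sine_rate(t):
    return 2.0 + 2.0 * math.sin(2.0 * math.pi * t)

candidates = candidate_retrieval_times(sine_rate, 1.0)
print(candidates)
```

For this rate the only candidate lies near τ = 0.5, just after the burst of posts in the first half of the period.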





47
Blogs becoming inactive

Detection of abandoned blogs to save resources
[2] D.R. Cox “Regression models and life-tables (with discussion)”
Journal of the Royal Statistical Society, B(34), 1972
[3] Gina Venolia “A Matter of Life or Death: Modeling Blog Mortality”
Technical report, Microsoft Research
48
More examples
49
Major posting patterns

K-means clustering
50
Threshold algorithm

Proposed by Fagin et al. [2001]
Efficient computation of top-K items from multiple lists with a monotone
aggregate function
blogs
user groups
users
51
Illustration of matrix partition
[Figure: illustration of matrix partition. Axes: users with more subscriptions; feeds with more subscribers. Example labels: 2 subscribers, 9 subscribers, 2 subscriptions, 8 subscriptions.]
52